Nutch+Lucene Search Engine Development Practice

Last Update:2014-10-14 Source: Internet

Author: User

Network topology

Figure 1 Network topology diagram

Installing the Java JDK

First check whether other versions of the JDK are already installed on the system; if so, uninstall them first.

# rpm -qa | grep java

The output contains the following two lines:

java-1.6.0-openjdk-1.6.0.0-1.57.1.11.9.el6_4.i686

java-1.7.0-openjdk-1.7.0.9-2.3.8.0.el6_4.i686

Uninstall them:

# yum -y remove java-1.6.0-openjdk

# yum -y remove java-1.7.0-openjdk

Go to the official website http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html and download jdk-7u60-linux-i586.tar.gz.

Create a java directory under /usr.

# mkdir /usr/java

Extract jdk-7u60-linux-i586.tar.gz to /usr/java.

# tar -zxvf jdk-7u60-linux-i586.tar.gz -C /usr/java

Add the environment variables:

# vi /etc/profile

Enter insert mode and add the following lines at the end of the file:

JAVA_HOME=/usr/java/jdk1.7.0_60

export JAVA_HOME

PATH=$JAVA_HOME/bin:$PATH

export PATH

CLASSPATH=$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar:$CLASSPATH

export CLASSPATH

Save and exit with :wq.

Run the following command so that the environment variables take effect in the current SSH session:

# source /etc/profile

Test:

# echo $JAVA_HOME    (checks that the environment variable is set)

# java -version    (shows the Java version information)

# java

# javac

If both java and javac print their usage information, the installation was successful.
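
The manual checks above can also be scripted. The following is a minimal sketch, assuming JAVA_HOME has been set as described in /etc/profile; the function name check_java_env is hypothetical and not part of the JDK.

```shell
#!/bin/sh
# check_java_env: hypothetical helper, not part of the JDK.
# Returns 0 when JAVA_HOME is set and contains executable java and javac
# binaries, and 1 otherwise, printing a short reason either way.
check_java_env() {
    if [ -z "$JAVA_HOME" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ] || [ ! -x "$JAVA_HOME/bin/javac" ]; then
        echo "java or javac not found under $JAVA_HOME/bin"
        return 1
    fi
    echo "JDK found in $JAVA_HOME"
    return 0
}
```

Calling check_java_env after "source /etc/profile" gives a single pass/fail answer, which is convenient when the same setup has to be repeated on several machines.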

Deploying Tomcat

Go to the official website http://tomcat.apache.org/download-70.cgi and download apache-tomcat-7.0.54.tar.gz.

Extract apache-tomcat-7.0.54.tar.gz to /usr/local.

# tar -zxvf apache-tomcat-7.0.54.tar.gz -C /usr/local

# cd /usr/local/

# mv apache-tomcat-7.0.54 tomcat

At this point Tomcat cannot be reached from outside the machine; port 8080 must first be opened in the default Linux firewall.

# vi /etc/sysconfig/iptables

Enter insert mode and add the following rule (it must appear before any REJECT rule of the INPUT chain to take effect):

-A INPUT -p tcp -m state --state NEW -m tcp --dport 8080 -j ACCEPT

Save and exit with :wq.

Restart the firewall.

# service iptables restart

Start Tomcat.

# /usr/local/tomcat/bin/startup.sh

Enter http://10.1.1.95:8080/ in the user-side browser. If the Tomcat welcome page appears, the deployment was successful.

Deploying Nutch

Since we do not need to deploy Nutch in a distributed environment, we use an earlier version, nutch-0.9. Download nutch-0.9.tar.gz and extract it to /usr/local.

# tar -zxvf nutch-0.9.tar.gz -C /usr/local

To deploy the Nutch search page:

Rename the /usr/local/tomcat/webapps/ROOT folder to ROOT_back.

# cd /usr/local/tomcat/webapps/

# mv ROOT ROOT_back

Copy nutch-0.9.war from the Nutch root directory to /usr/local/tomcat/webapps/.

# cp /usr/local/nutch-0.9/nutch-0.9.war /usr/local/tomcat/webapps/

Start Tomcat.

# /usr/local/tomcat/bin/startup.sh

After Tomcat has started successfully, a nutch-0.9 folder appears in /usr/local/tomcat/webapps/. Rename nutch-0.9 to ROOT.

# mv nutch-0.9 ROOT

Restart Tomcat.

# /usr/local/tomcat/bin/shutdown.sh

# /usr/local/tomcat/bin/startup.sh

Open http://10.1.1.95:8080/ in the user-side browser. The result is shown in Figure 2, which indicates that the search page was deployed successfully.

Figure 2 Nutch default Search Entry page

Data fetching and content retrieval

Enter the Nutch root directory.

# cd /usr/local/nutch-0.9

Create a new file, multiurls.txt.

# vi multiurls.txt

Enter the following URLs:

http://sports.sina.com.cn/

http://sports.sohu.com/

http://sports.qq.com/

http://sports.cntv.cn/

http://sports.ifeng.com/

http://sports.163.com/

http://sports.uusee.com/

http://www.titan24.com/

Save and exit with :wq.

Modify crawl-urlfilter.txt to allow downloads from any site.

# vi /usr/local/nutch-0.9/conf/crawl-urlfilter.txt

Comment out the original host filter and accept any URL instead.

# accept hosts in MY.DOMAIN.NAME

##+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

Save and exit with :wq.
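
To illustrate how such a filter file is evaluated, here is a rough sketch in shell. It assumes that each rule is a "+" (accept) or "-" (reject) sign followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is ignored. The helper url_allowed is hypothetical, the final "+." accept-all rule is an assumption (the text above does not show which line the author added), and grep -E only approximates the Java regular expressions that Nutch itself uses.

```shell
#!/bin/sh
# url_allowed RULES_FILE URL
# Hypothetical helper that mimics first-match-wins URL filtering:
# returns 0 if the first matching rule starts with "+", 1 otherwise.
url_allowed() {
    rules_file=$1
    url=$2
    while IFS= read -r rule; do
        case $rule in
            '+'*) verdict=0 ;;
            '-'*) verdict=1 ;;
            *) continue ;;            # blank lines and "#" comments
        esac
        pattern=${rule#?}             # strip the leading + or -
        if printf '%s\n' "$url" | grep -Eq -e "$pattern"; then
            return "$verdict"
        fi
    done < "$rules_file"
    return 1                          # no rule matched: URL is ignored
}

# Example rule set: the host filter is commented out and a final "+."
# accepts everything else (the "+." line is an assumption).
rules=$(mktemp)
cat > "$rules" <<'EOF'
-^(file|ftp|mailto):
-\.(gif|jpg|png|css|zip)$
# accept hosts in MY.DOMAIN.NAME
##+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+.
EOF

for u in http://sports.qq.com/ ftp://example.com/a http://example.com/logo.gif; do
    if url_allowed "$rules" "$u"; then echo "accept $u"; else echo "reject $u"; fi
done
```

With this rule set the first URL is accepted and the other two are rejected, because the reject rules for non-HTTP schemes and image suffixes come before the accept-all rule.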

Enter the Nutch directory and start the download task.

# cd /usr/local/nutch-0.9

# bin/nutch crawl multiurls.txt -dir sports -depth 10 -topN 100 -threads 16

The parameters have the following meanings:

-dir specifies the directory where the crawl results are stored; the data from this crawl is stored in the sports directory;

-depth specifies the link depth to crawl; this crawl goes 10 levels deep;

-topN specifies that only the top N URLs are fetched at each level; this crawl fetches the top 100 pages of each level;

-threads specifies the number of download threads; this crawl uses 16 threads.
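
The -depth and -topN values together bound the size of the crawl. As a small worked example, assuming each level fetches at most topN pages, the upper bound is simply their product; the function name max_fetched_pages is hypothetical.

```shell
#!/bin/sh
# Values taken from the crawl command above.
DEPTH=10
TOPN=100

# Each of the DEPTH rounds selects at most TOPN URLs to fetch,
# so the crawl cannot fetch more than DEPTH * TOPN pages.
max_fetched_pages() {
    echo $(( $1 * $2 ))
}

max_fetched_pages "$DEPTH" "$TOPN"    # prints 1000
```

The 601 successfully downloaded pages reported later in this article are consistent with this bound of 1000.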

The download task starts executing, as shown in Figure 3. After about 5 minutes the download task completes, as shown in Figure 4.

Figure 3 Starting the download task

Figure 4 End of the download task

As the download output shows, Nutch crawls web pages and builds the index library as follows:

1) The Injector adds the starting root URLs to the web page database;

2) The Generator produces the list of pages to download for each level of the required crawl depth;

3) The Fetcher downloads the pages using the specified number of threads;

4) The page parser (ParseSegment) parses the downloaded content;

5) The web database management tool (CrawlDb) adds the newly discovered second-level links to the database to await download;

6) The link analysis tool (LinkDb) builds the inverted links;

7) The Indexer uses the web database, the link database and the downloaded page content to create an index for the current data;

8) The deduplication tool (DeleteDuplicates) deletes duplicate data;

9) The index merger (IndexMerger) merges the data into the historical index library.
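
The steps above can be sketched as the sequence of individual bin/nutch sub-commands that the single crawl command wraps. The sketch below only prints the commands and does not run them. The sub-command names (inject, generate, fetch, updatedb, invertlinks, index, dedup, merge) are those offered by bin/nutch in this release line, but the exact argument forms are an assumption and should be checked against each command's usage message. The function name print_crawl_plan and the SEGMENT_n placeholders are hypothetical (real segment directories are named by timestamp), and the separate parse step is left out on the assumption that the fetcher parses pages as it downloads them.

```shell
#!/bin/sh
# print_crawl_plan SEEDS CRAWL_DIR DEPTH TOPN THREADS
# Hypothetical dry-run helper: prints one command per pipeline step.
print_crawl_plan() {
    seeds=$1; dir=$2; depth=$3; topn=$4; threads=$5

    # Step 1: inject the seed URLs into the web database.
    echo "bin/nutch inject $dir/crawldb $seeds"

    # Steps 2-5: one generate/fetch/update round per crawl level.
    round=1
    while [ "$round" -le "$depth" ]; do
        echo "bin/nutch generate $dir/crawldb $dir/segments -topN $topn"
        echo "bin/nutch fetch $dir/segments/SEGMENT_$round -threads $threads"
        echo "bin/nutch updatedb $dir/crawldb $dir/segments/SEGMENT_$round"
        round=$((round + 1))
    done

    # Steps 6-9: invert links, index, remove duplicates, merge indexes.
    echo "bin/nutch invertlinks $dir/linkdb -dir $dir/segments"
    echo "bin/nutch index $dir/indexes $dir/crawldb $dir/linkdb $dir/segments/*"
    echo "bin/nutch dedup $dir/indexes"
    echo "bin/nutch merge $dir/index $dir/indexes"
}

# Same values as the crawl command used in this article.
print_crawl_plan multiurls.txt sports 10 100 16
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere and makes the correspondence between the numbered steps and the commands easy to inspect.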

Modify the nutch-site.xml file under the Nutch directory, adding the index directory property that specifies where the searcher reads its data.

# vi /usr/local/nutch-0.9/conf/nutch-site.xml

Add the following between <configuration> and </configuration>:

<property>

<name>http.agent.name</name>

<value>sports.com</value>

<description>sports.com</description>

</property>

<property>

<name>searcher.dir</name>

<value>/usr/local/nutch-0.9/sports</value>

</property>

Save and exit with :wq.

Test retrieval in a terminal window.

# cd /usr/local/nutch-0.9

# bin/nutch org.apache.nutch.searcher.NutchBean Brazil

The search finds 213 results in total, as shown in Figure 5.

Use the readdb tool to get summary statistics.

# bin/nutch readdb sports/crawldb -stats

The summary, shown in Figure 6, reports 15,917 links in total, of which 601 pages were downloaded successfully.

After the steps above, the preparation for search engine retrieval is complete. The results are then deployed to the Tomcat server so that users can search from a browser. The process is as follows:

Modify the nutch-site.xml file under tomcat/webapps/ROOT/WEB-INF/classes to set the search path property.

# vi /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml

Add the following between <configuration> and </configuration>:

<property>

<name>http.agent.name</name>

<value>sports.com</value>

<description>sports.com</description>

</property>

<property>

<name>searcher.dir</name>

<value>/usr/local/nutch-0.9/sports</value>

</property>

Restart Tomcat (or simply start it if it is not running).

# /usr/local/tomcat/bin/shutdown.sh

# /usr/local/tomcat/bin/startup.sh

Figure 5 Running a search command in a terminal window

Figure 6 Getting summary statistics using readdb

Open http://10.1.1.95:8080/ in the user-side browser. Searching for "World Cup" produces an error: Attribute value is quoted with " which must be escaped when used within the value. Tomcat versions later than 6.0 handle double quotes inside JSP attribute values more strictly, so this error occurs when a quoted attribute value itself contains double quotes. The workaround is to modify the conf/catalina.properties file.

# vi /usr/local/tomcat/conf/catalina.properties

Add the following at the end of the file:

org.apache.jasper.compiler.Parser.STRICT_QUOTE_ESCAPING=false

Save and exit with :wq.

Searching again, Chinese text appears garbled. The solution is to modify the conf/server.xml file.

# vi /usr/local/tomcat/conf/server.xml

Locate the Connector tag and add the attribute URIEncoding="UTF-8".

Save and exit with :wq.

After Tomcat is restarted, searches work from the client. A search for "Brazil World Cup" is shown in Figure 7.

Figure 7 Results of the search on the client side
