Nutch+Lucene Search Engine Development Practice

Last Update:2014-10-24 Source: Internet

Author: User

Network topology

Figure 1 Network topology diagram

Installing the Java JDK

First check whether another version of the JDK is already installed on the system; if so, uninstall it first.

# rpm -qa | grep java

The output includes the following two lines:

java-1.6.0-openjdk-1.6.0.0-1.57.1.11.9.el6_4.i686

java-1.7.0-openjdk-1.7.0.9-2.3.8.0.el6_4.i686

Uninstall both packages:

# yum -y remove java-1.6.0-openjdk

# yum -y remove java-1.7.0-openjdk

Download jdk-7u60-linux-i586.tar.gz from the official website, http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.

Create a java folder under /usr.

# mkdir /usr/java

Extract jdk-7u60-linux-i586.tar.gz to /usr/java.

# tar -zxvf jdk-7u60-linux-i586.tar.gz -C /usr/java

Add the environment variables:

# vi /etc/profile

Enter insert mode and append the following lines at the end of the file:

JAVA_HOME=/usr/java/jdk1.7.0_60

export JAVA_HOME

PATH=$JAVA_HOME/bin:$PATH

export PATH

CLASSPATH=$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar:$CLASSPATH

export CLASSPATH

Save and exit with :wq.

Run the following command so that the environment variables take effect in the current SSH session:

# source /etc/profile
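
The profile change can be rehearsed without touching /etc/profile. The sketch below writes the same settings to a scratch file and sources it in the current shell. The scratch file and the variable name profile_snippet are illustrative assumptions; the JDK path is the one used above.

```shell
#!/bin/sh
# Sketch: rehearse the /etc/profile additions in a throwaway file.
# "profile_snippet" is a hypothetical scratch file, not a system file.
profile_snippet="$(mktemp)"

cat > "$profile_snippet" <<'EOF'
JAVA_HOME=/usr/java/jdk1.7.0_60
export JAVA_HOME
PATH=$JAVA_HOME/bin:$PATH
export PATH
CLASSPATH=$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar:$CLASSPATH
export CLASSPATH
EOF

# Source the file in the current shell, as "source /etc/profile" would.
. "$profile_snippet"

echo "JAVA_HOME=$JAVA_HOME"
# The JDK bin directory should now be the first PATH entry.
echo "first PATH entry: ${PATH%%:*}"

rm -f "$profile_snippet"
```

Because the JDK directory is prepended to PATH, it takes precedence over any other java binary that may still be on the system.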

Test:

# echo $JAVA_HOME    (checks that the environment variable is set)

# java -version    (shows the Java version information)

# java

# javac

If the last two commands both print their usage message, the installation was successful.

Deploying Tomcat

Download apache-tomcat-7.0.54.tar.gz from the official website, http://tomcat.apache.org/download-70.cgi.

Extract apache-tomcat-7.0.54.tar.gz to /usr/local.

# tar -zxvf apache-tomcat-7.0.54.tar.gz -C /usr/local

# cd /usr/local/

# mv apache-tomcat-7.0.54 tomcat

At this point Tomcat cannot be reached from outside; port 8080 must first be opened in the default Linux firewall.

# vi /etc/sysconfig/iptables

Enter insert mode and add:

-A INPUT -p tcp -m state --state NEW -m tcp --dport 8080 -j ACCEPT

Save and exit with :wq.

Restart the firewall.

# service iptables restart

Start Tomcat.

# /usr/local/tomcat/bin/startup.sh

Enter http://10.1.1.95:8080/ in the client browser. If the Tomcat welcome page appears, the deployment was successful.

Deploying Nutch

Because we do not need to deploy Nutch in a distributed environment, we use an earlier release, nutch-0.9. Download nutch-0.9.tar.gz and extract it to /usr/local.

# tar -zxvf nutch-0.9.tar.gz -C /usr/local

To deploy the Nutch search page:

Rename the /usr/local/tomcat/webapps/ROOT directory to ROOT_back.

# cd /usr/local/tomcat/webapps/

# mv ROOT ROOT_back

Copy nutch-0.9.war from the Nutch root folder to /usr/local/tomcat/webapps/.

# cp /usr/local/nutch-0.9/nutch-0.9.war /usr/local/tomcat/webapps/

Start Tomcat.

# /usr/local/tomcat/bin/startup.sh

After Tomcat has started successfully, a nutch-0.9 directory appears in /usr/local/tomcat/webapps/. Rename nutch-0.9 to ROOT.

# mv nutch-0.9 ROOT

Restart Tomcat.

# /usr/local/tomcat/bin/shutdown.sh

# /usr/local/tomcat/bin/startup.sh

Visit http://10.1.1.95:8080/ in the client browser. The page shown in Figure 2 appears, which indicates that the search page was deployed successfully.

Figure 2 Nutch default Search Entry page

Data fetching and content retrieval

Go to the Nutch root folder.

# cd /usr/local/nutch-0.9

Create a new file, multiurls.txt.

# vi multiurls.txt

Enter the following start URLs:

http://sports.sina.com.cn/

http://sports.sohu.com/

http://sports.qq.com/

http://sports.cntv.cn/

http://sports.ifeng.com/

http://sports.163.com/

http://sports.uusee.com/

http://www.titan24.com/

Save and exit with :wq.

Edit crawl-urlfilter.txt to allow downloading from arbitrary websites.

# vi /usr/local/nutch-0.9/conf/crawl-urlfilter.txt

Comment out the original host filter so that the crawl is no longer restricted to a single domain and arbitrary URLs are accepted.

# accept hosts in MY.DOMAIN.NAME

#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

Save and exit with :wq.
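
To see what a host-restricted accept rule does, and why it has to be relaxed for a multi-site crawl, the pattern can be previewed with grep. In this filter file a leading + accepts and a leading - rejects, and the first matching rule decides. The sketch below is an illustration only: grep -E stands in for the Java regular expression engine Nutch uses, and the domain filled into the pattern is a hypothetical example. The text above does not show which accept rule replaced the commented-out line; if the file still ends with a catch-all reject rule, some accept rule (for example a catch-all such as +.) presumably has to take its place.

```shell
#!/bin/sh
# Sketch: preview which seed URLs a host-restricted accept pattern would match.
# grep -E stands in for Nutch's Java regex engine (an assumption); the
# pattern is the stock rule with a hypothetical domain filled in.
pattern='^http://([a-z0-9]*\.)*sina\.com\.cn/'

# matches_filter URL -> prints "accept" or "reject"
matches_filter() {
    if printf '%s\n' "$1" | grep -Eq "$pattern"; then
        echo accept
    else
        echo reject
    fi
}

for url in http://sports.sina.com.cn/ http://sports.sohu.com/ http://sports.qq.com/; do
    printf '%s %s\n' "$(matches_filter "$url")" "$url"
done
```

With a single-domain rule only the first seed URL would be accepted, so most of the sites in multiurls.txt would never be crawled.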

Go to the Nutch folder and start the download task.

# cd /usr/local/nutch-0.9

# bin/nutch crawl multiurls.txt -dir sports -depth 10 -topN 100 -threads 16

The parameters have the following meanings:

-dir specifies the folder where the crawl results are stored; here the data is stored in the sports folder;

-depth specifies the link depth to crawl; this crawl goes 10 levels deep;

-topN specifies that only the top N URLs are fetched at each level; this crawl fetches the top 100 pages of each level;

-threads specifies the number of fetcher threads; this crawl downloads with 16 threads.
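
As a small illustration of how these parameters combine, the following sketch assembles the crawl command from named settings and prints it rather than running it. The variable names are illustrative assumptions; only the flags and values come from the text above.

```shell
#!/bin/sh
# Sketch: assemble the crawl command from named settings and print it.
# The variable names are illustrative; only the flags come from the text above.
seed_file=multiurls.txt   # file listing the start URLs
out_dir=sports            # -dir: where crawl results are stored
depth=10                  # -depth: number of link levels to follow
top_n=100                 # -topN: pages fetched per level
threads=16                # -threads: number of fetcher threads

crawl_cmd="bin/nutch crawl $seed_file -dir $out_dir -depth $depth -topN $top_n -threads $threads"

# Print instead of executing, so the sketch runs without Nutch installed.
echo "$crawl_cmd"

# Upper bound on pages requested: at most top_n pages per level.
echo "max pages: $((depth * top_n))"
```

With these settings the crawl requests at most 10 x 100 = 1000 pages, which gives a rough idea of how long the download can take.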

The download task starts running, as shown in Figure 3. After about 5 minutes the download task completes, as shown in Figure 4.

Figure 3 Starting the download task

Figure 4 Download Task end

As the download process shows, Nutch crawls web pages and builds the index library in the following steps:

1) The injector (Injector) adds the starting root URLs to the web database;

2) The generator (Generator) produces the list of pages to download for each level, according to the required crawl depth;

3) The fetcher (Fetcher) downloads the corresponding pages using the specified number of threads;

4) The page parser (ParseSegment) parses the downloaded content;

5) The web database management tool (CrawlDb) adds the newly discovered second-level links to the database, where they wait to be downloaded;

6) The link analysis tool (LinkDb) builds the inverted links;

7) The indexer (Indexer) uses the web database, the link database, and the downloaded page content to create the index of the current data;

8) The deduplication tool (DeleteDuplicates) removes duplicate data;

9) The index merger (IndexMerger) merges the data into the historical index library.
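
The steps above can also be driven one at a time through the individual bin/nutch subcommands instead of the all-in-one crawl command. The sketch below only prints one round of such a sequence as a dry run. The subcommand names exist in bin/nutch, but the directory layout, the SEGMENT placeholder, and the exact argument order are assumptions that should be checked against the usage output of your installation.

```shell
#!/bin/sh
# Sketch: print one round of the crawl pipeline as separate bin/nutch steps.
# Dry run only: nothing is executed, so Nutch does not need to be installed.
# Directory layout and argument order are assumptions to verify against
# the usage output of bin/nutch in your installation.
crawl_dir=sports
segment="$crawl_dir/segments/SEGMENT"   # placeholder for the timestamped segment

pipeline_steps() {
    # step 1: inject the start URLs
    echo "bin/nutch inject $crawl_dir/crawldb multiurls.txt"
    # step 2: generate the fetch list for this round
    echo "bin/nutch generate $crawl_dir/crawldb $crawl_dir/segments -topN 100"
    # step 3 (and step 4 when the fetcher is configured to parse)
    echo "bin/nutch fetch $segment -threads 16"
    # step 5: add newly found links to the web database
    echo "bin/nutch updatedb $crawl_dir/crawldb $segment"
    # step 6: build the inverted links
    echo "bin/nutch invertlinks $crawl_dir/linkdb $segment"
    # step 7: index the segment
    echo "bin/nutch index $crawl_dir/indexes $crawl_dir/crawldb $crawl_dir/linkdb $segment"
    # step 8: remove duplicates
    echo "bin/nutch dedup $crawl_dir/indexes"
    # step 9: merge into a single index
    echo "bin/nutch merge $crawl_dir/index $crawl_dir/indexes"
}

pipeline_steps
```

Steps 2 through 5 would be repeated once per level of depth; the remaining steps run once at the end of the crawl.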

Edit the nutch-site.xml file under the Nutch conf folder and add the index folder property, which specifies the folder from which the searcher reads its data.

# vi /usr/local/nutch-0.9/conf/nutch-site.xml

Add the following between <configuration> and </configuration>:

<property>

<name>http.agent.name</name>

<value>sports.com</value>

<description>sports.com</description>

</property>

<property>

<name>searcher.dir</name>

<value>/usr/local/nutch-0.9/sports</value>

</property>

Save and exit with :wq.

Test the search from the command line.

# cd /usr/local/nutch-0.9

# bin/nutch org.apache.nutch.searcher.NutchBean Brazil

The search results are shown in Figure 5: 213 related results were found.

Use the readdb tool to print summary statistics:

# bin/nutch readdb sports/crawldb -stats

The summary information is shown in Figure 6: there are 15,917 links in total, of which 601 pages were downloaded successfully.
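
The two figures in the summary can be put in proportion with a little arithmetic. The sketch below is only an illustration; the function name fetched_percent is an assumption, and the numbers are the ones reported above.

```shell
#!/bin/sh
# Sketch: express the readdb summary as a fetched-page percentage.
# The two figures are the ones reported above; awk is used only for
# floating-point arithmetic.
total_links=15917
fetched_pages=601

# fetched_percent FETCHED TOTAL -> percentage with one decimal place
fetched_percent() {
    awk -v f="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * f / t }'
}

echo "fetched: $(fetched_percent "$fetched_pages" "$total_links")% of known links"
```

Only a small share of the known links was fetched, which is expected: with -depth 10 and -topN 100 the crawl requests at most 1000 pages, while every downloaded page adds many newly discovered links to the database.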

After the above steps, the preparation for search engine retrieval is complete. The results are then deployed to the Tomcat server so that users can search them in the browser. The steps are as follows:

Edit the nutch-site.xml file in the tomcat/webapps/ROOT/WEB-INF/classes directory to specify the search path property.

# vi /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml

Add the following between <configuration> and </configuration>:

<property>

<name>http.agent.name</name>

<value>sports.com</value>

<description>sports.com</description>

</property>

<property>

<name>searcher.dir</name>

<value>/usr/local/nutch-0.9/sports</value>

</property>

Restart Tomcat (if it is not running yet, simply start it).

# /usr/local/tomcat/bin/shutdown.sh

# /usr/local/tomcat/bin/startup.sh

Figure 5 Running the search command from the command line

Figure 6 Using readdb to get summary statistics

Visit http://10.1.1.95:8080/ in the client browser. Searching for "World Cup" produces the error: Attribute value is quoted with " which must be escaped when used within the value. Because the Tomcat version is newer (later than 6.0), this error occurs when a double-quoted attribute value itself contains double quotes. The workaround is to edit the conf/catalina.properties file.

# vi /usr/local/tomcat/conf/catalina.properties

Append the following line at the end:

org.apache.jasper.compiler.Parser.STRICT_QUOTE_ESCAPING=false

Save and exit with :wq.

When searching again, Chinese text appears garbled. The solution is to edit the conf/server.xml file.

# vi /usr/local/tomcat/conf/server.xml

Find the Connector tag and add the attribute URIEncoding="UTF-8".

Save and exit with :wq.

After restarting Tomcat, you can search from the client. Search for "Brazil World Cup"; the result is shown in Figure 7.

Figure 7 Results of the search on the client side
