Network topology
Figure 1 Network topology diagram
Installing the Java JDK
First check whether a different version of the JDK is already installed; if so, uninstall it first.
Log in to the system as the root user.
# rpm -qa | grep java
The output includes the following two lines:
java-1.6.0-openjdk-1.6.0.0-1.57.1.11.9.el6_4.i686
java-1.7.0-openjdk-1.7.0.9-2.3.8.0.el6_4.i686
Uninstall them:
# yum -y remove java-1.6.0-openjdk
# yum -y remove java-1.7.0-openjdk
Go to the official website http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html and download jdk-7u60-linux-i586.tar.gz.
Create a java folder under /usr.
# mkdir /usr/java
Extract jdk-7u60-linux-i586.tar.gz to /usr/java.
# tar -zxvf jdk-7u60-linux-i586.tar.gz -C /usr/java
Add the environment variables:
# vi /etc/profile
Enter insert mode and append the following lines at the end of the file:
JAVA_HOME=/usr/java/jdk1.7.0_60
export JAVA_HOME
PATH=$JAVA_HOME/bin:$PATH
export PATH
CLASSPATH=$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar:$CLASSPATH
export CLASSPATH
Type :wq to save and exit.
Run the command
# source /etc/profile
so that the environment variables take effect in the current SSH session.
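For repeated installs, the same lines can be appended without opening an editor. The following is a minimal sketch, assuming a POSIX shell; the function name `append_java_env` and its two parameters are illustrative and not part of the original steps.

```shell
# Append the JDK variables to a profile file, but only once.
# Both arguments are illustrative parameters, not part of the original steps.
append_java_env() {
    profile="$1"    # e.g. /etc/profile
    jdk_dir="$2"    # e.g. /usr/java/jdk1.7.0_60

    # Skip the append if an earlier run already wrote JAVA_HOME.
    if grep -q '^JAVA_HOME=' "$profile" 2>/dev/null; then
        return 0
    fi

    printf 'JAVA_HOME=%s\n' "$jdk_dir" >> "$profile"

    # The quoted 'EOF' keeps $JAVA_HOME, $PATH and $CLASSPATH literal here,
    # so they are expanded when the profile is sourced, not now.
    cat >> "$profile" <<'EOF'
export JAVA_HOME
PATH=$JAVA_HOME/bin:$PATH
export PATH
CLASSPATH=$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar:$CLASSPATH
export CLASSPATH
EOF
}
```

It would be called as `append_java_env /etc/profile /usr/java/jdk1.7.0_60`. Running it a second time leaves the file unchanged.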
Test:
# echo $JAVA_HOME    (checks that the environment variable is set)
# java -version    (shows the Java version information)
# java
# javac
If both commands print their usage message, the installation was successful.
Deploying Tomcat
Go to the official website http://tomcat.apache.org/download-70.cgi and download apache-tomcat-7.0.54.tar.gz.
Extract apache-tomcat-7.0.54.tar.gz to /usr/local.
# tar -zxvf apache-tomcat-7.0.54.tar.gz -C /usr/local
# cd /usr/local/
# mv apache-tomcat-7.0.54 tomcat
At this point Tomcat cannot be reached from outside; port 8080 must be opened in the default Linux firewall.
# vi /etc/sysconfig/iptables
Enter insert mode and add the following line above any existing REJECT rule, because iptables evaluates rules in order:
-A INPUT -p tcp -m state --state NEW -m tcp --dport 8080 -j ACCEPT
Type :wq to save and exit.
Restart the firewall.
# service iptables restart
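The position of the rule matters, so a scripted edit has to place it before the first REJECT line. Here is a minimal sketch, assuming the rules file contains an `-A INPUT -j REJECT` line as the stock CentOS 6 file does. The function name `open_port_8080` is illustrative. If no such line exists, the file is left unchanged.

```shell
# Insert the port-8080 ACCEPT rule ahead of the first INPUT REJECT rule.
# Assumption: the rules file contains an "-A INPUT -j REJECT" line.
RULE='-A INPUT -p tcp -m state --state NEW -m tcp --dport 8080 -j ACCEPT'

open_port_8080() {
    rules_file="$1"   # e.g. /etc/sysconfig/iptables

    # Do nothing if the rule is already present (-F fixed string, -x whole line).
    if grep -qxF -e "$RULE" "$rules_file"; then
        return 0
    fi

    # Print the new rule just before the first INPUT REJECT line, once.
    # Writing back with cat keeps the original file's permissions.
    awk -v rule="$RULE" '
        !done && index($0, "-A INPUT -j REJECT") == 1 { print rule; done = 1 }
        { print }
    ' "$rules_file" > "$rules_file.new" \
        && cat "$rules_file.new" > "$rules_file" \
        && rm -f "$rules_file.new"
}
```

After `open_port_8080 /etc/sysconfig/iptables`, the firewall still needs the restart shown above for the rule to take effect.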
Start Tomcat.
# /usr/local/tomcat/bin/startup.sh
Enter http://10.1.1.95:8080/ in the client browser. If the Tomcat welcome page appears, the deployment was successful.
Deploying Nutch
Because we do not need to deploy Nutch in a distributed environment, we use an earlier version, nutch-0.9. Download nutch-0.9.tar.gz and extract it to /usr/local.
# tar -zxvf nutch-0.9.tar.gz -C /usr/local
To deploy the Nutch search page:
Rename the /usr/local/tomcat/webapps/ROOT directory to ROOT_back.
# cd /usr/local/tomcat/webapps/
# mv ROOT ROOT_back
Copy nutch-0.9.war from the Nutch root folder to /usr/local/tomcat/webapps/.
# cp /usr/local/nutch-0.9/nutch-0.9.war /usr/local/tomcat/webapps/
Start Tomcat.
# /usr/local/tomcat/bin/startup.sh
After Tomcat has started successfully, a nutch-0.9 directory appears under /usr/local/tomcat/webapps/. Rename nutch-0.9 to ROOT.
# mv nutch-0.9 ROOT
Restart Tomcat.
# /usr/local/tomcat/bin/shutdown.sh
# /usr/local/tomcat/bin/startup.sh
Visit http://10.1.1.95:8080/ in the client browser. The result is shown in Figure 2, which indicates that the search page was deployed successfully.
Figure 2 Nutch default Search Entry page
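As an alternative to the start, rename and restart cycle, Tomcat serves a WAR named ROOT.war at the context path "/". This differs from the procedure described above. The sketch below assumes Tomcat is stopped while it runs and that WAR unpacking on startup is left at its default; the function name `stage_root_war` is illustrative.

```shell
# Stage a WAR as Tomcat's root application without the rename-after-unpack step.
# Assumption: Tomcat is stopped and unpacks WARs on startup (its default).
stage_root_war() {
    webapps="$1"   # e.g. /usr/local/tomcat/webapps
    war="$2"       # e.g. /usr/local/nutch-0.9/nutch-0.9.war

    # Keep the stock welcome application as a backup, as in the steps above.
    if [ -d "$webapps/ROOT" ]; then
        if [ -e "$webapps/ROOT_back" ]; then
            echo "ROOT_back already exists; not overwriting it" >&2
            return 1
        fi
        mv "$webapps/ROOT" "$webapps/ROOT_back"
    fi

    # A WAR named ROOT.war is deployed at the context path "/".
    cp "$war" "$webapps/ROOT.war"
}
```

On the next startup Tomcat unpacks the archive into webapps/ROOT, which is the directory the later configuration steps edit.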
Data fetching and content retrieval
Go to the Nutch root folder.
# cd /usr/local/nutch-0.9
Create a new file, multiurls.txt.
# vi multiurls.txt
Enter the following URLs:
http://sports.sina.com.cn/
http://sports.sohu.com/
http://sports.qq.com/
http://sports.cntv.cn/
http://sports.ifeng.com/
http://sports.163.com/
http://sports.uusee.com/
http://www.titan24.com/
Type :wq to save and exit.
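The seed list can also be written in one step with a here-document, which avoids typing errors in the editor. This is a minimal sketch; the function name `write_seed_urls` and its path argument are illustrative.

```shell
# Write the seed URL list to a file in one step; the path argument is illustrative.
write_seed_urls() {
    out="$1"   # e.g. /usr/local/nutch-0.9/multiurls.txt
    cat > "$out" <<'EOF'
http://sports.sina.com.cn/
http://sports.sohu.com/
http://sports.qq.com/
http://sports.cntv.cn/
http://sports.ifeng.com/
http://sports.163.com/
http://sports.uusee.com/
http://www.titan24.com/
EOF
}
```

A quick sanity check afterwards is `grep -vc '^http://' multiurls.txt`, which counts the lines that do not start with http://.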
Modify crawl-urlfilter.txt to allow downloading from any website.
# vi /usr/local/nutch-0.9/conf/crawl-urlfilter.txt
Comment out the original filter rule and accept any URL instead:
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^
Type :wq to save and exit.
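The same change to the URL filter file can be scripted. The sketch below assumes the file still contains the stock `+^http://...MY.DOMAIN.NAME/` line; the function name `accept_all_urls` is illustrative. The new rule is written in place of the old one because the filter applies the first pattern that matches, so appending it at the end of the file after a reject-everything rule would have no effect.

```shell
# Comment out the MY.DOMAIN.NAME rule and put an accept-all rule in its place.
# Assumption: the file still contains the stock "+^http://...MY.DOMAIN.NAME/" line.
accept_all_urls() {
    filter="$1"   # e.g. /usr/local/nutch-0.9/conf/crawl-urlfilter.txt

    # index() does plain substring matching, so the regex characters in the
    # rule need no escaping.
    awk '
        index($0, "+^http://") == 1 && index($0, "MY.DOMAIN.NAME") > 0 {
            print "#" $0
            print "+^"
            next
        }
        { print }
    ' "$filter" > "$filter.new" \
        && cat "$filter.new" > "$filter" \
        && rm -f "$filter.new"
}
```

A second run changes nothing, since the original rule then starts with `#` and no longer matches.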
Go to the Nutch folder and start the download task.
# cd /usr/local/nutch-0.9
# bin/nutch crawl multiurls.txt -dir sports -depth 10 -topN 100 -threads 16
The parameters have the following meanings:
-dir specifies the folder where the crawl results are stored; here the data is stored in the sports folder;
-depth specifies how many link levels to crawl; this crawl goes 10 levels deep;
-topN specifies that only the top N URLs are fetched at each level; this crawl fetches at most 100 pages per level;
-threads specifies the number of fetcher threads; this crawl downloads with 16 threads.
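Taken together, -depth and -topN cap the size of the crawl: each level fetches at most topN URLs, so the total cannot exceed depth times topN. A minimal sketch of that arithmetic, with the illustrative function name `max_pages`:

```shell
# Upper bound on fetched pages: one fetch round per level,
# each round limited to topN URLs.
max_pages() {
    depth="$1"
    topn="$2"
    echo $((depth * topn))
}
```

For the values used here, `max_pages 10 100` prints 1000. The actual number of downloaded pages can be lower, for example when a level yields fewer than topN new URLs or some fetches fail.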
The download task starts running, as shown in Figure 3. After about 5 minutes the download task completes, as shown in Figure 4.
Figure 3 Starting the download task
Figure 4 Download Task end
As the download process shows, the steps Nutch takes to crawl web pages and build the index are as follows:
1) The injector (Injector) adds the starting root URLs to the web database;
2) According to the required crawl depth, the generator (Generator) produces the list of pages to download;
3) The fetcher (Fetcher) downloads the corresponding pages using the specified number of threads;
4) The page parser (ParseSegment) parses the downloaded content;
5) The web database management tool (CrawlDb) adds the newly discovered second-level links to the database to await download;
6) The link analysis tool (LinkDb) builds the inverted links;
7) The indexer (Indexer) creates the current index from the web database, the link database and the downloaded page content;
8) The deduplication tool (DeleteDuplicates) removes duplicate data;
9) The index merger (IndexMerger) merges the data into the historical index.
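The order of these stages can be sketched as a dry run. The script below only prints stage names and never calls Nutch. It assumes that steps 2 to 5 repeat once per crawl level while the remaining steps run once, which is how the one-step crawl command is understood to behave. The function name `print_crawl_plan` is illustrative.

```shell
# Dry run: print the stage order for a crawl of the given depth.
# Stage names follow the list above; nothing here calls Nutch itself.
print_crawl_plan() {
    depth="$1"
    echo "inject"
    level=1
    while [ "$level" -le "$depth" ]; do
        echo "level $level: generate"
        echo "level $level: fetch"
        echo "level $level: parse"
        echo "level $level: update crawldb"
        level=$((level + 1))
    done
    echo "invert links"
    echo "index"
    echo "delete duplicates"
    echo "merge indexes"
}
```

Calling `print_crawl_plan 10` lists the rounds for the crawl above: one inject, ten generate/fetch/parse/update rounds, then the four indexing stages.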
Modify the nutch-site.xml file under the Nutch folder, adding the searcher.dir property to specify the folder from which the searcher reads its data.
# vi /usr/local/nutch-0.9/conf/nutch-site.xml
Add the following between <configuration> and </configuration>:
<property>
<name>http.agent.name</name>
<value>sports.com</value>
<description>sports.com</description>
</property>
<property>
<name>searcher.dir</name>
<value>/usr/local/nutch-0.9/sports</value>
<description></description>
</property>
Type :wq to save and exit.
Test the search from the command line.
# cd /usr/local/nutch-0.9
# bin/nutch org.apache.nutch.searcher.NutchBean Brazil
The search results are shown in Figure 5; 213 related results were found.
Use the readdb tool to get summary statistics:
# bin/nutch readdb sports/crawldb -stats
The summary information is shown in Figure 6: there are 15,917 links in total, of which 601 pages were downloaded successfully.
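To put the two figures in proportion, the share of known links that were actually downloaded can be computed as below. This is a minimal sketch; the function name `fetched_share` is illustrative.

```shell
# Share of known links that were fetched, as a percentage with one decimal.
fetched_share() {
    fetched="$1"
    total="$2"
    awk -v f="$fetched" -v t="$total" 'BEGIN { printf "%.1f%%\n", f * 100 / t }'
}
```

`fetched_share 601 15917` prints 3.8%, so most of the discovered links remain unfetched in the database after this crawl.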
After the above steps, the preparation work for search engine retrieval is complete. Next, deploy the results to the Tomcat server so that users can search from the browser. The steps are as follows:
Modify the nutch-site.xml file in the tomcat/webapps/ROOT/WEB-INF/classes directory to specify the search path property.
# vi /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
Add the following between <configuration> and </configuration>:
<property>
<name>http.agent.name</name>
<value>sports.com</value>
<description>sports.com</description>
</property>
<property>
<name>searcher.dir</name>
<value>/usr/local/nutch-0.9/sports</value>
<description></description>
</property>
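Since these are the same two properties already added to the crawler's own nutch-site.xml, copying that file is one way to keep the two in step. The sketch below assumes the web application needs no settings beyond those in the crawler's file; the function name `sync_site_config` is illustrative.

```shell
# Copy the crawler's nutch-site.xml into the web application and confirm the copy.
sync_site_config() {
    src="$1"   # e.g. /usr/local/nutch-0.9/conf/nutch-site.xml
    dst="$2"   # e.g. /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml

    # cmp -s exits with 0 only when the two files are byte-for-byte identical.
    cp "$src" "$dst" && cmp -s "$src" "$dst"
}
```

The function returns a non-zero status if the copy fails or the files differ afterwards.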
Restart Tomcat (if it is not running, simply start it).
# /usr/local/tomcat/bin/shutdown.sh
# /usr/local/tomcat/bin/startup.sh
Figure 5 Running the search command from the command line
Figure 6 Using readdb to get summary statistics
Visit http://10.1.1.95:8080/ in the client browser. Searching for "World Cup" produces the error: Attribute value is quoted with " which must be escaped when used within the value. This error appears in newer Tomcat versions (later than 6.0) when a double-quoted attribute value itself contains double quotes. The workaround is to modify the conf/catalina.properties file.
# vi /usr/local/tomcat/conf/catalina.properties
Append the following line at the end:
org.apache.jasper.compiler.Parser.STRICT_QUOTE_ESCAPING=false
Type :wq to save and exit.
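The property can also be appended from a script without adding it twice. This is a minimal sketch that assumes the file ends with a newline; the function name `ensure_property` is illustrative.

```shell
# Append "key=value" to a properties file unless the key is already set.
ensure_property() {
    file="$1"
    key="$2"
    value="$3"

    # -F treats the key as a fixed string, so its dots are not wildcards.
    # Note that a commented-out line with the same key also counts as present.
    if grep -qF -e "$key=" "$file"; then
        return 0
    fi
    printf '%s=%s\n' "$key" "$value" >> "$file"
}
```

Here it would be called as `ensure_property /usr/local/tomcat/conf/catalina.properties org.apache.jasper.compiler.Parser.STRICT_QUOTE_ESCAPING false`, followed by a Tomcat restart.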
Search again. This time the Chinese text is garbled; the solution is to modify the conf/server.xml file.
# vi /usr/local/tomcat/conf/server.xml
Find the Connector tag and add the attribute URIEncoding="UTF-8".
Type :wq to save and exit.
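Tomcat 7 decodes request URIs as ISO-8859-1 unless told otherwise, which is why UTF-8 query terms arrive garbled. The attribute can be added with sed, as in the sketch below. It assumes the HTTP connector tag starts with `<Connector port="8080"`, as in the stock server.xml; the function name `set_uri_encoding` is illustrative.

```shell
# Add URIEncoding="UTF-8" to the HTTP connector on port 8080.
# Assumption: the tag starts with <Connector port="8080", as in the stock server.xml.
set_uri_encoding() {
    xml="$1"   # e.g. /usr/local/tomcat/conf/server.xml

    # Leave the file alone if the attribute is already there.
    if grep -q 'URIEncoding=' "$xml"; then
        return 0
    fi

    sed 's|<Connector port="8080"|<Connector port="8080" URIEncoding="UTF-8"|' "$xml" \
        > "$xml.new" \
        && cat "$xml.new" > "$xml" \
        && rm -f "$xml.new"
}
```

Connectors on other ports, such as the AJP connector, are not touched by this substitution.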
After restarting Tomcat, you can search from the client side. Search for "Brazil World Cup"; the result is shown in Figure 7.
Figure 7 Results of the search on the client side
References
[1]. Wang Xuesong, Lucene+Nutch Search Engine Development, Posts & Telecom Press, 2008.
[2]. http://www.coreservlets.com/Apache-Tomcat-Tutorial/
[3]. http://wiki.apache.org/nutch/NutchTutorial
Nutch+Lucene Search Engine Development Practice