Environment: OracleLinux-R7-U2-Server-x86_64
Tomcat 8.5, download from the official site: http://apache.opencas.org/tomcat/tomcat-8/v8.5.0/bin/apache-tomcat-8.5.0.tar.gz
Nutch 1.0, download: http://archive.apache.org/dist/nutch/nutch-1.0.tar.gz
JDK 8u77, download from the official site: http://download.oracle.com/otn-pub/java/jdk/8u77-b03/jdk-8u77-linux-x64.rpm
Copy the downloaded files to the /server directory
1. Install the JDK
# cd /server
# rpm -ivh jdk-8u77-linux-x64.rpm
# java -version
java version "1.8.0_77"
Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
Configure the environment variables
# vi /etc/profile
Add the following at the end of the file
export JAVA_HOME=/usr/java/jdk1.8.0_77
export JAVA_BIN=/usr/java/jdk1.8.0_77/bin
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
Save and exit
Use source to make the settings take effect
# source /etc/profile
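To confirm that the variables took effect, a small check like the one below can help. It is only a sketch: check_java_home is a hypothetical helper name, not something shipped with the JDK, and it merely tests that the given directory contains an executable bin/java.

```shell
# Hypothetical helper (not part of the JDK): report whether a directory
# looks like a usable JAVA_HOME.
# Argument: $1 = candidate JAVA_HOME path.
check_java_home() {
    if [ -z "$1" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$1/bin/java" ]; then
        echo "no executable java under $1/bin"
        return 1
    fi
    echo "ok: $1"
}
```

After running source /etc/profile, call it as check_java_home "$JAVA_HOME"; with the paths used in this tutorial it should report /usr/java/jdk1.8.0_77.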
2. Install compat-libstdc++
# yum install compat-libstdc++*
3. Install Nutch
Switch to the nutch user
# su - nutch
Give the nutch user ownership of /server (this needs root if /server is not already owned by nutch)
$ chown -R nutch.nutch /server/
Unpack Nutch
$ cd /server/
$ tar zxvf nutch-1.0.tar.gz
Rename the unpacked directory to nutch
$ mv nutch-1.0 nutch
4. Install Tomcat
Unpack Tomcat
$ tar zxvf apache-tomcat-8.5.0.tar.gz
Rename the unpacked directory to tomcat
$ mv apache-tomcat-8.5.0 tomcat
Start Tomcat (port 8080 must be open in the firewall)
$ tomcat/bin/startup.sh
Open http://<ip>:8080 in a browser to check that Tomcat started successfully
5. Configure Tomcat
Delete all files under tomcat/webapps/ROOT
Copy nutch-1.0.war from the nutch folder to tomcat/webapps/ROOT
$ cp /server/nutch/nutch-1.0.war /server/tomcat/webapps/ROOT/nutch.war
Enter the ROOT directory and unpack nutch.war
$ cd /server/tomcat/webapps/ROOT
$ jar xvf nutch.war
Start Tomcat and open the browser to check whether the Nutch search page is reachable
$ /server/tomcat/bin/startup.sh
Configure the nutch-site.xml file
$ cd /server/tomcat/webapps/ROOT/WEB-INF/classes/
$ vi nutch-site.xml
Add the following between <configuration> and </configuration>
<configuration>
<property>
<name>searcher.dir</name>
<value>/server/crawl.demo</value> <!-- the directory where Nutch saves the crawled pages -->
</property>
<property>
<name>http.agent.name</name>
<value>nutch-1.0</value>
<description>HTTP 'User-Agent'</description>
</property>
</configuration>
Configure the server.xml file
$ cd /server/tomcat/conf/
$ vi server.xml
Find the Connector port="8080" element and add the last two attributes
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443"
URIEncoding="UTF-8"
useBodyEncodingForURI="true" />
Save and exit
6. Configure Nutch
$ cd /server/nutch/conf/
Configure the crawl-urlfilter.txt file
$ vi crawl-urlfilter.txt
Find this content
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
Change it to the following (adjust to the sites you want to search)
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*com/
+^http://([a-z0-9]*\.)*cn/
+^http://([a-z0-9]*\.)*net/
Configure the regex-urlfilter.txt file
$ vi regex-urlfilter.txt
Comment out the last line and add the following at the end
# accept anything else
#+.
+^http://([a-z0-9]*\.)*com/
+^http://([a-z0-9]*\.)*cn/
+^http://([a-z0-9]*\.)*net/
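To get a feel for which URLs these three accept rules let through, the patterns can be tried outside Nutch with grep -E. This is only an approximation: url_allowed is a hypothetical helper, Nutch evaluates the rules with Java regular expressions, and it may normalize URLs before filtering them, so treat the result as a rough guide.

```shell
# Illustrative only: mimic the three "+" accept rules with one grep -E
# pattern. url_allowed is a hypothetical helper, not part of Nutch.
# Argument: $1 = URL to test. Exit status 0 means the URL matches.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*(com|cn|net)/'
}
```

For example, url_allowed "http://www.qq.com/" && echo accepted prints accepted, while an https:// URL or a .org host does not match, because the rules only cover http:// and the com, cn and net suffixes.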
Configure the nutch-site.xml file
Add the following between <configuration> and </configuration>
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch Nutch agent</value>
</property>
<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>
</configuration>
Configure the urls directory
Create a new urls directory under /server
$ cd /server/
$ mkdir urls
Create a new file named url in it and enter the address of the website you want to search (I entered http://www.qq.com)
$ cd urls
$ vi url
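The same seed file can also be created without an editor. The sketch below assumes a hypothetical helper named write_seed; it creates the directory if needed and writes a single URL into a file called url, matching the layout used in this tutorial.

```shell
# Hypothetical helper: create the seed directory and a one-line seed file.
# Arguments: $1 = seed directory, $2 = URL to crawl.
write_seed() {
    mkdir -p "$1" && printf '%s\n' "$2" > "$1/url"
}
```

With the paths from this tutorial the call would be write_seed /server/urls "http://www.qq.com"; note that it overwrites an existing url file.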
Create the directory where Nutch saves the crawled pages
$ cd /server/
$ mkdir crawl.demo
Run the crawl command
$ cd /server/nutch
$ bin/nutch crawl /server/urls -dir /server/crawl.demo -depth 2 -threads 4 -topN <N> >& /server/crawl.demo/crawl.log
If the error "JAVA_HOME is not set" appears, run export JAVA_HOME=/usr/java/jdk1.8.0_77 once in the nutch user's shell
# /server/urls is the directory that holds the URL seed file
# -dir /server/crawl.demo is where the crawled pages are stored; it matches the searcher.dir value set in nutch-site.xml under Tomcat
# -depth is the crawl depth; 2 is used here for testing, and a full crawl can use about 10
# -threads is the number of concurrent fetch threads, set to 4 here
# -topN <N> is the maximum number of pages fetched at each depth level; for a full crawl it can be set to between 10,000 and 1 million, depending on the size of the site
# the crawl output is written to /server/crawl.demo/crawl.log
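Because the crawl command takes several options, it can be convenient to assemble it from named values and review it before running. The following is a minimal sketch, assuming a hypothetical helper called build_crawl_cmd; it only prints the command line and does not start a crawl.

```shell
# Hypothetical wrapper: print the Nutch 1.0 crawl command for the given
# settings so it can be reviewed first. It does not run the crawl.
# Arguments: $1 = seed directory, $2 = crawl output directory,
#            $3 = depth, $4 = threads, $5 = topN.
build_crawl_cmd() {
    printf 'bin/nutch crawl %s -dir %s -depth %s -threads %s -topN %s\n' \
        "$1" "$2" "$3" "$4" "$5"
}
```

For a test run, build_crawl_cmd /server/urls /server/crawl.demo 2 4 50 prints the command with a topN of 50; that number is an arbitrary small value chosen for illustration, not one taken from this tutorial.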
Once the crawl is complete, the crawled pages can be searched from the web interface
At this point the Nutch installation is complete.
This article is from the "Linux" blog; reprinting is declined.
Nutch + Tomcat: Detailed Installation Tutorial