Nutch + Tomcat Installation: A Detailed Tutorial

Last Update:2016-04-03 Source: Internet

Author: User

Environment: OracleLinux-R7-U2-Server-x86_64

Tomcat 8.5 download (official site): http://apache.opencas.org/tomcat/tomcat-8/v8.5.0/bin/apache-tomcat-8.5.0.tar.gz

Nutch 1.0 download: http://archive.apache.org/dist/nutch/nutch-1.0.tar.gz

JDK 8u77 download (official site): http://download.oracle.com/otn-pub/java/jdk/8u77-b03/jdk-8u77-linux-x64.rpm

Copy the downloaded files to the /server directory.

1. Installing the JDK

[root@nutch ~]# cd /server

[root@nutch server]# rpm -ivh jdk-8u77-linux-x64.rpm

[root@nutch server]# java -version

java version "1.8.0_77"

Java(TM) SE Runtime Environment (build 1.8.0_77-b03)

Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)

Configure the environment variables

[root@nutch server]# vi /etc/profile

Add the following at the end of the file

export JAVA_HOME=/usr/java/jdk1.8.0_77

export JAVA_BIN=/usr/java/jdk1.8.0_77/bin

export PATH=$PATH:$JAVA_HOME/bin

export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

Save and exit

Use source to make the new settings take effect in the current shell

[root@nutch server]# source /etc/profile
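
A quick way to confirm that the settings point at a real JDK is to look for an executable bin/java under JAVA_HOME. The sketch below is only illustrative: check_java_home is a made-up helper name, and the fallback path is simply the one used in this tutorial.

```shell
#!/bin/sh
# check_java_home DIR: report whether DIR looks like a Java install,
# i.e. whether DIR/bin/java exists and is executable.
# The function name is hypothetical, chosen for this example.
check_java_home() {
    if [ -x "$1/bin/java" ]; then
        echo "ok: $1"
    else
        echo "missing bin/java under: $1"
        return 1
    fi
}

# Use JAVA_HOME if it is set, otherwise fall back to the tutorial's path.
# "|| true" keeps the script going when the check fails.
check_java_home "${JAVA_HOME:-/usr/java/jdk1.8.0_77}" || true
```

If the check reports a missing bin/java, the exported path most likely does not match the directory the RPM actually installed into.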

2. Installing compat-libstdc++

[root@nutch server]# yum install compat-libstdc++*

3. Installing Nutch

Give the nutch user ownership of /server (this has to be done as root), then switch to the nutch user

[root@nutch server]# chown -R nutch.nutch /server/

[root@nutch server]# su - nutch

Extract Nutch

[nutch@nutch ~]$ cd /server/

[nutch@nutch server]$ tar zxvf nutch-1.0.tar.gz

Rename the extracted directory to nutch

[nutch@nutch server]$ mv nutch-1.0 nutch

4. Installing Tomcat

Extract Tomcat

[nutch@nutch server]$ tar zxvf apache-tomcat-8.5.0.tar.gz

Rename the extracted directory to tomcat

[nutch@nutch server]$ mv apache-tomcat-8.5.0 tomcat

Start Tomcat (port 8080 must be open in the firewall)

[nutch@nutch server]$ tomcat/bin/startup.sh

Open http://<ip>:8080 in a browser to check that Tomcat started successfully

5. Configuring Tomcat

Delete all files under tomcat/webapps/ROOT

Copy nutch-1.0.war from the nutch directory to tomcat/webapps/ROOT

[nutch@nutch ~]$ cp /server/nutch/nutch-1.0.war /server/tomcat/webapps/ROOT/nutch.war

Enter the ROOT directory and extract nutch.war

[nutch@nutch ~]$ cd /server/tomcat/webapps/ROOT

[nutch@nutch ROOT]$ jar xvf nutch.war

Start Tomcat and open it in a browser to check that the Nutch search interface is accessible

[nutch@nutch ROOT]$ /server/tomcat/bin/startup.sh

Configure the nutch-site.xml file

[nutch@nutch ROOT]$ cd /server/tomcat/webapps/ROOT/WEB-INF/classes/

[nutch@nutch classes]$ vi nutch-site.xml

Add the following between <configuration> and </configuration>

<property>

<name>searcher.dir</name>

<!-- value points to the directory where Nutch saves the pages it crawls -->

<value>/server/crawl.demo</value>

</property>

<property>

<name>http.agent.name</name>

<value>nutch-1.0</value>

<description>HTTP 'User-agent'</description>

</property>

</configuration>
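
For readers who prefer not to edit the file by hand, the same two properties can be written out as a complete, well-formed file with a here-document. This is a minimal sketch under one assumption: it writes to a scratch directory created with mktemp so it can be tried safely, not to the real WEB-INF/classes directory.

```shell
#!/bin/sh
# Write the sample to a scratch directory (an assumption for safe testing),
# not to /server/tomcat/webapps/ROOT/WEB-INF/classes.
out_dir=$(mktemp -d)

cat > "$out_dir/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/server/crawl.demo</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>nutch-1.0</value>
    <description>HTTP 'User-agent'</description>
  </property>
</configuration>
EOF

# Sanity check: every <property> line should have a matching </property> line.
opens=$(grep -c '<property>' "$out_dir/nutch-site.xml")
closes=$(grep -c '</property>' "$out_dir/nutch-site.xml")
echo "opening tags: $opens, closing tags: $closes"
```

Once the counts match, the file can be copied over the one in WEB-INF/classes.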

Configure the server.xml file

[nutch@nutch classes]$ cd /server/tomcat/conf/

[nutch@nutch conf]$ vi server.xml

Find the Connector element with port="8080" and add the last two attributes

<Connector port="8080" protocol="HTTP/1.1"

connectionTimeout="20000"

redirectPort="8443"

URIEncoding="UTF-8"

useBodyEncodingForURI="true"/>

Save and exit

6. Configuring Nutch

[nutch@nutch conf]$ cd /server/nutch/conf/

Configure the crawl-urlfilter.txt file

[nutch@nutch conf]$ vi crawl-urlfilter.txt

Find the following lines

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

Change them to the following (adjust to the sites you want to crawl)

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*com/

+^http://([a-z0-9]*\.)*cn/

+^http://([a-z0-9]*\.)*net/

Configure the regex-urlfilter.txt file

[nutch@nutch conf]$ vi regex-urlfilter.txt

Comment out the last line and add the following at the end

# accept anything else

#+.

+^http://([a-z0-9]*\.)*com/

+^http://([a-z0-9]*\.)*cn/

+^http://([a-z0-9]*\.)*net/
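
To get a feel for which URLs the three accept rules let through, they can be tried outside Nutch with grep. This is only an approximation: it uses grep -E (POSIX extended regular expressions) instead of the Java regex engine that Nutch uses, and url_filter is a made-up helper name. It assumes that a URL matching none of the rules is rejected.

```shell
#!/bin/sh
# url_filter URL: print "accept" if URL matches one of the three rules
# (.com, .cn, .net over plain http), otherwise print "reject".
# Approximation with grep -E; the function name is hypothetical.
url_filter() {
    for tld in com cn net; do
        if printf '%s\n' "$1" | grep -Eq "^http://([a-z0-9]*\.)*${tld}/"; then
            echo accept
            return 0
        fi
    done
    echo reject
}

url_filter 'http://www.qq.com/'        # accept
url_filter 'http://news.sina.com.cn/'  # accept (matched by the cn rule)
url_filter 'http://example.org/'       # reject (no rule for .org)
url_filter 'https://www.qq.com/'       # reject (the rules only cover http://)
```

Note that the rules only cover plain http:// URLs, so an https:// site would need its own rule.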

Configure the nutch-site.xml file

Add the following between <configuration> and </configuration>

<property>

<name>http.agent.name</name>

<value>nutch Nutch agent</value>

</property>

<property>

<name>http.agent.version</name>

</property>

</configuration>

Configure the urls directory

Create a new urls directory under /server

[nutch@nutch conf]$ cd /server/

[nutch@nutch server]$ mkdir urls

In the urls directory, create a new file named url and fill in the address of the website you want to crawl (I used http://www.qq.com)

[nutch@nutch urls]$ vi url
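
The seed file can also be created without opening an editor. A minimal sketch, assuming one URL per line: SERVER_DIR is a hypothetical variable introduced here, and it defaults to a scratch directory so the snippet can be tried safely. Set SERVER_DIR=/server to match the tutorial's layout.

```shell
#!/bin/sh
# SERVER_DIR is a made-up variable for this example. It defaults to a scratch
# directory (an assumption for safe testing); the tutorial itself uses /server.
SERVER_DIR=${SERVER_DIR:-$(mktemp -d)}

mkdir -p "$SERVER_DIR/urls"

# One seed URL per line; http://www.qq.com is the site used in this tutorial.
printf '%s\n' 'http://www.qq.com' > "$SERVER_DIR/urls/url"

# Show what was written.
cat "$SERVER_DIR/urls/url"
```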

Configure the directory where Nutch saves crawled pages

[nutch@nutch conf]$ cd /server/

[nutch@nutch server]$ mkdir crawl.demo

Run the crawl command

[nutch@nutch server]$ cd /server/nutch

[nutch@nutch nutch]$ bin/nutch crawl /server/urls -dir /server/crawl.demo -depth 2 -threads 4 -topN >& /server/crawl.demo/crawl.log

If the error "JAVA_HOME is not set" appears, run export JAVA_HOME=/usr/java/jdk1.8.0_77 once in the nutch user's shell

# /server/urls is the directory that holds the seed URL file

# -dir /server/crawl.demo is where the crawled pages are stored; it matches the searcher.dir value set in nutch-site.xml in step 5

# -depth is the crawl depth; 2 is used here for testing, and a full crawl can use about 10

# -threads is the number of concurrent fetch threads, set to 4 here

# -topN is the maximum number of pages to crawl at each depth level; for a full crawl it can be set anywhere from 10,000 to 1 million, depending on the size of the site

# The crawl log is written to /server/crawl.demo/crawl.log
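
The options described above can be collected into variables so that they are easy to adjust between a test crawl and a full crawl. This is a dry-run sketch that only prints the command. The paths, depth, and thread count come from this tutorial; the value of TOPN is a made-up figure for illustration, because the command shown in the tutorial has no value after -topN.

```shell
#!/bin/sh
# Values from the tutorial, except TOPN: the tutorial's command does not show
# a value after -topN, so 50 is a hypothetical figure chosen for illustration.
URL_DIR=/server/urls
CRAWL_DIR=/server/crawl.demo
DEPTH=2
THREADS=4
TOPN=50

crawl_cmd="bin/nutch crawl $URL_DIR -dir $CRAWL_DIR -depth $DEPTH -threads $THREADS -topN $TOPN"

# Dry run: print the command instead of executing it.
# "> file 2>&1" is the portable spelling of the ">& file" redirection.
echo "$crawl_cmd > $CRAWL_DIR/crawl.log 2>&1"
```

To run the crawl for real, paste the printed line into the shell from /server/nutch.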

Once the crawl is complete, you can search the crawled pages from the Nutch search interface

At this point, the Nutch installation is complete.

This article comes from the "Linux" blog; reprinting is not permitted.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
