Nutch + Tomcat Detailed Installation Tutorial

Source: Internet
Author: User

Environment: OracleLinux-R7-U2-Server-x86_64

Tomcat 8.5 download (official site): http://apache.opencas.org/tomcat/tomcat-8/v8.5.0/bin/apache-tomcat-8.5.0.tar.gz

Nutch 1.0 download: http://archive.apache.org/dist/nutch/nutch-1.0.tar.gz

JDK 8u77 download (official site): http://download.oracle.com/otn-pub/java/jdk/8u77-b03/jdk-8u77-linux-x64.rpm


Copy the downloaded files to the /server directory.


1. Install the JDK


# cd /server

# rpm -ivh jdk-8u77-linux-x64.rpm

# java -version

java version "1.8.0_77"

Java(TM) SE Runtime Environment (build 1.8.0_77-b03)

Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)

Configure the environment variables

# vi /etc/profile

Add the following at the end of the file

export JAVA_HOME=/usr/java/jdk1.8.0_77

export JAVA_BIN=/usr/java/jdk1.8.0_77/bin

export PATH=$PATH:$JAVA_HOME/bin

export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

Save and exit.

Use source to make the new settings take effect in the current shell

# source /etc/profile
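Before moving on, it can help to confirm that the JAVA_HOME value added to /etc/profile really points at a Java installation. The following is a minimal sketch of such a check. The function name check_java_home is an illustrative helper (not part of the JDK or Nutch), and it only tests that an executable bin/java exists under the given directory.

```shell
# Hypothetical helper (not part of the JDK or Nutch): succeed only if
# the given directory contains an executable bin/java.
check_java_home() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$dir/bin/java" ]; then
        echo "no executable java found under $dir/bin" >&2
        return 1
    fi
    echo "JAVA_HOME looks valid: $dir"
}

# Example: check_java_home "$JAVA_HOME"
```

Running the check again as the nutch user, before the crawl step, shows early whether that account also sees the variable.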


2. Install compat-libstdc++


# yum install compat-libstdc++*


3. Install Nutch


Switch to the nutch user

# su - nutch

Give the nutch user ownership of /server (changing ownership normally requires root, so run this command as root if it fails)

$ chown -R nutch.nutch /server/

Extract Nutch

$ cd /server/

$ tar zxvf nutch-1.0.tar.gz

Rename the extracted directory to nutch

$ mv nutch-1.0 nutch


4. Install Tomcat


Extract Tomcat

$ tar zxvf apache-tomcat-8.5.0.tar.gz

Rename the extracted directory to tomcat

$ mv apache-tomcat-8.5.0 tomcat

Start Tomcat (port 8080 must be open in the firewall)

$ tomcat/bin/startup.sh

Open http://<ip>:8080 in a browser to check that Tomcat started successfully
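The browser check can also be done from the command line. The sketch below polls a URL with curl until it answers or the attempts run out. The function name wait_for_http and its retry parameters are illustrative assumptions, and it presumes curl is installed on the machine running the check.

```shell
# Hypothetical helper: poll a URL with curl until it responds or the
# attempts run out. Arguments: URL, number of attempts, seconds between tries.
wait_for_http() {
    url="$1"
    attempts="${2:-10}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -s: quiet, -f: treat HTTP errors (4xx/5xx) as failure,
        # -o /dev/null: discard the response body
        if curl -s -f -o /dev/null --max-time 5 "$url"; then
            echo "reachable: $url"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "not reachable after $attempts attempts: $url" >&2
    return 1
}

# Example (replace <ip> with the server address):
#   wait_for_http "http://<ip>:8080/" 15 2
```

Polling is used because startup.sh returns before Tomcat has finished starting, so a single immediate request may fail even on a healthy installation.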


5. Configure Tomcat


Delete all files under tomcat/webapps/ROOT

Copy nutch-1.0.war from the nutch directory to tomcat/webapps/ROOT

$ cp /server/nutch/nutch-1.0.war /server/tomcat/webapps/ROOT/nutch.war

Enter the ROOT directory and extract nutch.war

$ cd /server/tomcat/webapps/ROOT

$ jar xvf nutch.war

Start Tomcat (if it is still running from step 4, stop it first with /server/tomcat/bin/shutdown.sh) and check in a browser that the Nutch search page is accessible

$ /server/tomcat/bin/startup.sh

Configure the nutch-site.xml file

$ cd /server/tomcat/webapps/ROOT/WEB-INF/classes/

$ vi nutch-site.xml
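The delete, copy and extract steps for the WAR file can be collected into one guarded function, sketched below. The name deploy_war is an illustrative helper, and the sketch assumes the JDK's jar tool is on PATH. The guards are there because clearing a directory with a mistyped path is easy to get wrong.

```shell
# Hypothetical helper: replace the contents of a webapp directory with
# the unpacked contents of a WAR file. Assumes the JDK's jar tool is on PATH.
deploy_war() {
    war="$1"
    webapp_dir="$2"
    [ -f "$war" ] || { echo "WAR not found: $war" >&2; return 1; }
    [ -d "$webapp_dir" ] || { echo "webapp directory not found: $webapp_dir" >&2; return 1; }
    # ${webapp_dir:?} makes the shell abort rather than expand to "/*"
    # if the variable is unexpectedly empty
    rm -rf "${webapp_dir:?}"/*
    cp "$war" "$webapp_dir/nutch.war" || return 1
    # Extract inside a subshell so the caller's working directory is unchanged
    (cd "$webapp_dir" && jar xf nutch.war)
}

# Example:
#   deploy_war /server/nutch/nutch-1.0.war /server/tomcat/webapps/ROOT
```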

Add the following between <configuration></configuration>

<configuration>


<property>

<name>searcher.dir</name>

<value>/server/crawl.demo</value> <!-- points to the directory where Nutch saves crawled pages -->

</property>


<property>

<name>http.agent.name</name>

<value>nutch-1.0</value>

<description>HTTP 'User-Agent'</description>

</property>


</configuration>
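Since the search page depends on searcher.dir pointing at the crawl directory, it can be useful to read the value back from the file after editing. The sketch below does this with awk. The function name get_property is an illustrative helper, and it relies on simple line matching rather than real XML parsing, so it assumes each <name> and <value> element sits on a single line as in the snippet above.

```shell
# Hypothetical helper: print the <value> that follows <name>KEY</name>
# in a Nutch-style configuration file. Line-based, not a real XML parser.
get_property() {
    awk -v key="$1" '
        index($0, "<name>" key "</name>") { found = 1 }
        found && match($0, /<value>[^<]*<\/value>/) {
            # Skip the 7 characters of "<value>" and drop the 8 of "</value>"
            print substr($0, RSTART + 7, RLENGTH - 15)
            exit
        }
    ' "$2"
}

# Example:
#   get_property searcher.dir /server/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
```

If the printed path does not match the directory later passed to the crawl command with -dir, the search page will have nothing to search.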


Configure the server.xml file

$ cd /server/tomcat/conf/

$ vi server.xml

Find the Connector element with port="8080" and add the last two attributes shown here

<Connector port="8080" protocol="HTTP/1.1"

connectionTimeout="20000"

redirectPort="8443"

URIEncoding="UTF-8"

useBodyEncodingForURI="true"/>

Save and exit.
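Tomcat attribute names are case-sensitive (URIEncoding and useBodyEncodingForURI), so a quick text check after editing can catch a mistyped name. The sketch below is one way to do that. The function name check_connector_encoding is an illustrative helper, and it only searches the file for the two attribute strings; it does not verify that they belong to the port 8080 connector.

```shell
# Hypothetical helper: succeed if both encoding attributes appear in the
# given server.xml. Plain text search, not an XML-aware check.
check_connector_encoding() {
    file="$1"
    if ! grep -q 'URIEncoding="UTF-8"' "$file"; then
        echo "URIEncoding=\"UTF-8\" not found in $file" >&2
        return 1
    fi
    if ! grep -q 'useBodyEncodingForURI="true"' "$file"; then
        echo "useBodyEncodingForURI=\"true\" not found in $file" >&2
        return 1
    fi
    echo "encoding attributes present in $file"
}

# Example: check_connector_encoding /server/tomcat/conf/server.xml
```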


6. Configure Nutch


$ cd /server/nutch/conf/


Configure the crawl-urlfilter.txt file

$ vi crawl-urlfilter.txt

Look at the existing content

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

Change it to the following (adjust the rules to the sites you want to crawl)

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*com/

+^http://([a-z0-9]*\.)*cn/

+^http://([a-z0-9]*\.)*net/


Configure the regex-urlfilter.txt file

$ vi regex-urlfilter.txt

Comment out the last line and add the following at the end

# accept anything else

#+.

+^http://([a-z0-9]*\.)*com/

+^http://([a-z0-9]*\.)*cn/

+^http://([a-z0-9]*\.)*net/
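To see which URLs the three accept rules for .com, .cn and .net let through, the patterns can be tried out with grep before starting a crawl. The sketch below is only an approximation: Nutch evaluates these rules with Java regular expressions, while this uses grep -E, and the function name url_allowed is an illustrative helper. For patterns this simple the two should agree.

```shell
# Hypothetical helper: succeed if the URL matches at least one of the
# three accept patterns (written here without the leading "+").
url_allowed() {
    printf '%s\n' "$1" | grep -E -q \
        -e '^http://([a-z0-9]*\.)*com/' \
        -e '^http://([a-z0-9]*\.)*cn/' \
        -e '^http://([a-z0-9]*\.)*net/'
}

# Try a few URLs and report how each one is treated.
for u in "http://www.qq.com/" "https://www.qq.com/" "http://my-site.com/"; do
    if url_allowed "$u"; then
        echo "accepted: $u"
    else
        echo "rejected: $u"
    fi
done
```

Of the three sample URLs only the first is accepted. The patterns require the http scheme, so https URLs are rejected, and the character class [a-z0-9] contains no hyphen, so host names such as my-site.com are rejected as well. Both are worth knowing when adapting the rules to other sites.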


Configure the nutch-site.xml file (the copy in /server/nutch/conf)


Add the following properties between <configuration></configuration>

<configuration>


<property>

<name>http.agent.name</name>

<value>nutch Nutch agent</value>

</property>

<property>

<name>http.agent.version</name>

<value>1.0</value>

</property>


</configuration>


Configure the urls directory


Create a new urls directory under /server

$ cd /server/

$ mkdir urls

Create a new file named url in that directory and enter the address of the website you want to crawl (I used http://www.qq.com)

$ vi /server/urls/url
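The seed file can also be written without an editor. The following is a minimal sketch: the function name write_seeds is an illustrative helper, and the file name url follows the one used in this tutorial. It writes one URL per line.

```shell
# Hypothetical helper: create the seed directory if needed and write the
# given URLs, one per line, to a file named "url" inside it.
write_seeds() {
    seed_dir="$1"
    shift
    if [ "$#" -eq 0 ]; then
        echo "no URLs given" >&2
        return 1
    fi
    mkdir -p "$seed_dir" || return 1
    printf '%s\n' "$@" > "$seed_dir/url"
}

# Example: write_seeds /server/urls http://www.qq.com
```

Each call overwrites the file, so all seed URLs for a crawl should be passed in one call.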


Configure the directory where Nutch saves crawled pages


$ cd /server/

$ mkdir crawl.demo


Run the crawl command


$ cd /server/nutch

$ bin/nutch crawl /server/urls -dir /server/crawl.demo -depth 2 -threads 4 -topN <N> >& /server/crawl.demo/crawl.log

The -topN value is missing from the command as originally published; replace <N> with a number.

If you get the error "JAVA_HOME is not set", run export JAVA_HOME=/usr/java/jdk1.8.0_77 once in the nutch user's shell and try again.


# /server/urls is the directory that holds the URL file

# -dir /server/crawl.demo is where the crawled pages are stored; it must match the searcher.dir value set in nutch-site.xml in step 5

# -depth is the crawl depth; 2 is used here for testing, and a full crawl can use a value of about 10

# -threads is the number of concurrent fetch threads, set to 4 here

# -topN is the maximum number of pages to crawl at each depth level; a full crawl can use 10,000 to 1,000,000 depending on the size of the site

# the crawl log is written to /server/crawl.demo/crawl.log
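Because the crawl command takes several options, it can help to assemble it from named settings and review it before running. The sketch below only prints the command and never starts a crawl. The function name crawl_command is an illustrative helper, and the -topN value of 50 is an arbitrary placeholder, not a value from this tutorial.

```shell
# Hypothetical helper: print the crawl command for the given settings
# without running it, so the options can be reviewed first.
# Arguments: seed directory, crawl directory, depth, threads, topN.
crawl_command() {
    seed_dir="$1"
    crawl_dir="$2"
    depth="$3"
    threads="$4"
    topn="$5"
    printf 'bin/nutch crawl %s -dir %s -depth %s -threads %s -topN %s\n' \
        "$seed_dir" "$crawl_dir" "$depth" "$threads" "$topn"
}

# The topN value (50) is an assumed placeholder chosen for illustration.
crawl_command /server/urls /server/crawl.demo 2 4 50
```

The printed line is meant to be run from /server/nutch, with output redirected to a log file as in the command above.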


Once the crawl is complete, you can search the crawled pages from the Nutch search page in the browser.


At this point, the Nutch installation is complete.






This article is from the "Linux" blog; reprinting is not permitted.
