① Set the website crawling entry URL

[root@red-hat-9 nutch-0.9]# cd /zkl/IR/nutch-0.9/
[root@red-hat-9 nutch-0.9]# mkdir urls
[root@red-hat-9 nutch-0.9]# vi urls/urls_crawl.txt

You can create a urls directory and put the seed file urls_crawl.txt inside it, or simply create the file in the Nutch home directory; here we use the latter:

[root@red-hat-9 nutch-0.9]# vi urls_crawl.txt

Write the entry URL of the website to be crawled into this file; starting from this entry, any URL page under the current domain name is fetched. For example:

http://english.gu.cas.cn/ag/

② Specify the crawl filtering rules

Edit Nutch's URL filter rule file conf/crawl-urlfilter.txt:

[root@red-hat-9 nutch-0.9]# vi conf/crawl-urlfilter.txt

Modify the lines

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

so that MY.DOMAIN.NAME is the domain name of the website you want to crawl. This rule accepts all URL pages under the current website, starting from the entry URL given in ①.

③ Filter character settings

If the URLs of the site you crawl contain characters that are filtered out by default, such as ? and =, and you need those pages, change the filter line

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

to

-[*!@]

④ Modify conf/nutch-site.xml

Change it to:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>gucas.ac.cn</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>1.0</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>/zkl/IR/nutch-0.9/gucas</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>

The http.agent.name property sets the name the crawler reports for itself when fetching pages, and searcher.dir points to the root directory of the crawl, which is used by the Nutch search. If http.agent.name is not configured, an error complaining that the agent name is not configured appears during crawling.

⑤ Start crawling

Run the crawl command to fetch the website content:

[root@red-hat-9 nutch-0.9]# bin/nutch crawl urls_crawl.txt -dir gucas -depth 50 -threads 5 -topN 1000 >& logs/logs_crawl.log

· -dir: the directory in which the crawled web pages are saved.
· -depth: the link depth to which web pages are fetched.
· -delay: the delay between accesses to different hosts, in seconds.
· -threads: the number of threads to start.
· -topN 1000: only the first N URLs of each level are fetched.

In the command above, urls_crawl.txt is the seed file created earlier (or the directory containing it); -dir specifies the directory where the fetched content is stored, here gucas; -depth is the crawl depth starting from the top-level entry URL; -threads is the number of concurrent threads; -topN means only the first N URLs of each level are fetched. The final >& logs/logs_crawl.log saves the output produced during the crawl to the file logs_crawl.log under the logs directory, so that the program's run can be analyzed afterwards.

After this command completes, a gucas directory is generated under the nutch-0.9 directory, containing the fetched files and the generated indexes; in addition, the logs directory under nutch-0.9 contains the crawl log file logs_crawl.log. If gucas already exists before the run, an error such as "gucas already exists" is reported; delete that directory or specify a different directory to store the crawled pages. After completing the preceding steps, the data has been fetched successfully.
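For reference, steps ① and ⑤ can be strung together in a small shell script. This is only a sketch using the paths and the seed URL from this example; it assumes conf/crawl-urlfilter.txt and conf/nutch-site.xml have already been edited as described in ②–④, and the directory names in the final comment are the ones a Nutch 0.9 crawl typically produces.

#!/bin/bash
# Sketch of steps ① and ⑤ (conf/ must already be edited as in ②-④).
NUTCH_HOME=/zkl/IR/nutch-0.9
cd "$NUTCH_HOME"

# ① write the seed URL
echo "http://english.gu.cas.cn/ag/" > urls_crawl.txt

# ⑤ run the crawl and keep the log
mkdir -p logs
rm -rf gucas        # avoids the "gucas already exists" error on a re-run
bin/nutch crawl urls_crawl.txt -dir gucas -depth 50 -threads 5 -topN 1000 >& logs/logs_crawl.log

# sanity check: a Nutch 0.9 crawl directory typically contains
# crawldb, linkdb, segments, indexes and index
ls gucas

The search test below is the real check that the generated index is usable.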
Test the index from the command line:

[root@red-hat-9 nutch-0.9]# bin/nutch org.apache.nutch.searcher.NutchBean <keyword>

where <keyword> is a query keyword that occurs on the crawled site.

The above only crawls a single website and does not show the advantage of a web crawler fetching data from multiple websites. The following example shows how to crawl data from multiple websites.

Create a new file multiurls.txt in the Nutch home directory and write into it the list of URLs to be downloaded:

http://www.pcauto.com.cn/
http://www.xcar.com.cn/
http://auto.sina.com.cn

Modify the filter rule file crawl-urlfilter.txt so that any site may be downloaded:

# accept hosts in MY.DOMAIN.NAME
+^

# skip everything else
-.

With the accept rule reduced to +^, all website links are allowed by default.

Run the crawl command:

[root@red-hat-9 nutch-0.9]# bin/nutch crawl multiurls.txt -dir mutilweb -depth 50 -threads 5 -topN 1000 >& logs/logs_crawl.log

Change conf/nutch-site.xml to:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>*</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>1.0</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>/zkl/IR/nutch-0.9/mutilweb</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>

Here http.agent.name is the name of the web crawler, and searcher.dir points to the directory in which this crawl was stored.

Test:

[root@red-hat-9 nutch-0.9]# bin/nutch org.apache.nutch.searcher.NutchBean suv

Query the keyword "suv".

---------------------------------------------------------------------

6. Deploy the web front end

Copy the nutch-0.9.war package from the Nutch home directory to the Tomcat webapps directory:

[root@red-hat-9 nutch-0.9]# cp nutch-0.9.war /zkl/Program/apache-tomcat-6.0.18/webapps/

Then open http://localhost:8080/nutch-0.9/ in a browser; the war package is unpacked automatically and a nutch-0.9 folder appears under the Tomcat web application directory webapps.

7. Modify the web configuration of Nutch in Tomcat

vi /zkl/Program/apache-tomcat-6.0.18/webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml

Change the searcher.dir property value to the directory in which the index was generated:

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/zkl/IR/nutch-0.9/gucas</value>
    <description>
      Path to root of crawl. This directory is searched (in order) for either
      the file search-servers.txt, containing a list of distributed search
      servers, or the directory "index" containing merged indexes, or the
      directory "segments" containing segment indexes.
    </description>
  </property>
</configuration>
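The web application reads nutch-site.xml when it starts, so in practice the new searcher.dir value only takes effect after Tomcat (or at least the nutch-0.9 web application) is restarted. The lines below are only a sketch of that restart-and-check step; the shutdown.sh/startup.sh scripts are the standard ones shipped with Tomcat, and the paths follow the ones used in this article.

# restart Tomcat so the nutch-0.9 webapp re-reads nutch-site.xml
TOMCAT_HOME=/zkl/Program/apache-tomcat-6.0.18
$TOMCAT_HOME/bin/shutdown.sh
$TOMCAT_HOME/bin/startup.sh

# command-line cross-check against the same index
# (searcher.dir in /zkl/IR/nutch-0.9/conf/nutch-site.xml must point to the same crawl directory)
cd /zkl/IR/nutch-0.9
bin/nutch org.apache.nutch.searcher.NutchBean suv

# then search for the same keyword at http://localhost:8080/nutch-0.9/
# and compare the hits with the command-line output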