[Nutch] Single-machine configuration on Linux


1. Environment Introduction

Operating System: Red Hat Linux 9

Nutch version: nutch-0.9, download: http://apache.etoak.com/lucene/nutch/

JDK version: JDK 1.6.

Apache Tomcat: apache-tomcat-6.0.18

http://apache.etoak.com/tomcat/tomcat-6/v6.0.18/bin/apache-tomcat-6.0.18.tar.gz

2. Configure Prerequisites

2.1 Install JDK 1.6

First, download the JDK installation package jdk-1_6_0_13-linux-i586-rpm.bin.

Step 1: # chmod +x jdk-1_6_0_13-linux-i586-rpm.bin (grant execute permission)

Step 2: # ./jdk-1_6_0_13-linux-i586-rpm.bin (generates the RPM installation package)

Step 3: # rpm -ivh jdk-1_6_0_13-linux-i586.rpm (install the JDK)

After installation, the JDK is placed in the /usr/java/ directory by default.

Step 4: Configure Java environment variables.

Set the environment variables in /etc/profile:

[root@red-hat-9 root]# vi /etc/profile

Add the following statement:

JAVA_HOME=/usr/java/jdk1.6.0_13

export JAVA_HOME

CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib (note the leading "." and the colons)

export CLASSPATH

PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH

 

[root@red-hat-9 root]# chmod +x /etc/profile (grant execute permission)

[root@red-hat-9 root]# source /etc/profile (make the changes take effect)
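
To confirm that the variables are picked up, a quick check along these lines can be run (the exact output depends on the JDK actually installed):

# echo $JAVA_HOME
/usr/java/jdk1.6.0_13
# java -version
java version "1.6.0_13"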

2.2 Install Tomcat

Step 1: Set environment variables (optional)

[root@red-hat-9 program]# vi /etc/profile

export JDK_HOME=$JAVA_HOME

[root@red-hat-9 program]# source /etc/profile

Step 2: Install Tomcat by extracting the archive into a directory:

tar xf apache-tomcat-6.0.18.tar.gz

mv apache-tomcat-6.0.18 /zkl/program/

Step 3: How to Use Apache Tomcat

① To start Tomcat, run the following command:

# /zkl/program/apache-tomcat-6.0.18/bin/startup.sh

② Tomcat's web root is /zkl/program/apache-tomcat-6.0.18/webapps/. To make a web page accessible from a browser, simply add the corresponding application under the webapps directory. Tomcat's default application is the ROOT directory under webapps.

http://127.0.0.1:8080/ opens Tomcat's default home page (the ROOT application).

http://127.0.0.1:8080/luceneweb opens a luceneweb application placed under webapps.

③ The Apache HTTP Server uses port 80, so http://127.0.0.1 opens the Apache document root.

The Apache Tomcat server uses port 8080, so the two servers do not conflict. If a conflict does occur, you can modify Tomcat's configuration file server.xml:

# vi /zkl/program/apache-tomcat-6.0.18/conf/server.xml

<!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
<Connector port="8080" maxHttpHeaderSize="8192"
           maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
           enableLookups="false" redirectPort="8443" acceptCount="100"
           connectionTimeout="20000" disableUploadTimeout="true"
           URIEncoding="UTF-8" useBodyEncodingForURI="true"/>

The default service port is 8080. If it conflicts with another service (such as Apache), change the port attribute in this configuration file. If Chinese characters appear garbled after deployment, add the encoding attributes URIEncoding and useBodyEncodingForURI shown on the last line of the Connector element.
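
For example, if port 8080 is already in use, a minimal sketch of the change (using 8081 as an arbitrary free port) would be:

<!-- a non-SSL connector listening on 8081 instead of 8080; the other attributes stay as above -->
<Connector port="8081" maxHttpHeaderSize="8192"
           enableLookups="false" redirectPort="8443" acceptCount="100"
           URIEncoding="UTF-8" useBodyEncodingForURI="true"/>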

 

3. Configure and Apply Nutch

3.1 Configure Nutch

Download nutch-0.9.tar.gz;

Step 1: Unpack the installation package

# tar zxvf nutch-0.9.tar.gz
# mv nutch-0.9 /zkl/IR/nutch-0.9

Step 2: Test

# /zkl/IR/nutch-0.9/bin/nutch

If output like the following appears, the installation succeeded:

Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db

Step 3: Settings

① Set the website crawling entry URL

[root@red-hat-9 nutch-0.9]# cd /zkl/IR/nutch-0.9/

[root@red-hat-9 nutch-0.9]# mkdir urls

[root@red-hat-9 nutch-0.9]# vi urls/urls_crawl.txt

This creates a urls directory holding the file urls_crawl.txt. Alternatively, the file can be created directly in the Nutch home directory, which is the approach used here:

[root@red-hat-9 nutch-0.9]# vi urls_crawl.txt

Write the entry URL of the website to be crawled into the file; starting from this entry, every URL page under the same domain name will be fetched. For example:

http://english.gu.cas.cn/ag/

 

② Specify crawling filtering rules

Edit Nutch's URL filter rule file conf/crawl-urlfilter.txt:

[root@red-hat-9 nutch-0.9]# vi conf/crawl-urlfilter.txt

Modify

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

so that MY.DOMAIN.NAME is replaced by the domain name of the website you want to crawl. This tells Nutch to fetch every URL page under that website, starting from the entry URL given above.
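
As a sketch, for the entry URL http://english.gu.cas.cn/ag/ used above, the modified rule would plausibly read:

# accept hosts in gu.cas.cn
+^http://([a-z0-9]*\.)*gu.cas.cn/

This accepts any host ending in gu.cas.cn, while the file's final "-." rule skips all remaining URLs.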

③ Filter character settings

If the URLs of the site being crawled contain characters that are filtered out by default, such as ? and =, and those pages need to be fetched, change the filter rule

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

to

-[*!@]

④ Modify conf/nutch-site.xml

Change it to:

<configuration>
  <property>
    <!-- http.agent.name identifies the crawler; here the name of the crawled site, used by the Nutch search -->
    <name>http.agent.name</name>
    <value>gucas.ac.cn</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>1.0</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>/zkl/IR/nutch-0.9/gucas</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>
If http.agent.name is not configured, an "Agent name not configured" error appears during crawling.

⑤ Start crawling

Run the crawl command to fetch the website content:

[root@red-hat-9 nutch-0.9]# bin/nutch crawl urls_crawl.txt -dir gucas -depth 50 -threads 5 -topN 1000 >& logs/logs_crawl.log

 

· -dir dirName: the directory in which the crawled pages are saved.
· -depth depth: the link depth to which pages are fetched.
· -delay delay: the delay between accesses to different hosts, in seconds.
· -threads threads: the number of threads to start.
· -topN 1000: fetch only the first N URLs at each level.

In the preceding command, urls_crawl.txt is the file created above containing the URLs to crawl; -dir specifies the directory in which the fetched content is stored, here gucas; -depth is the crawl depth counted from the top-level URL; -threads specifies the number of concurrent threads; -topN means only the first N URLs at each level are fetched. The trailing >& logs/logs_crawl.log saves everything printed during the crawl to the file logs_crawl.log under the logs directory, so the run can be analysed afterwards.

After this command runs, a gucas directory is created under the nutch-0.9 directory, containing the fetched data and the generated indexes. A logs directory also appears under nutch-0.9, containing the crawl log file logs_crawl.log.
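
For reference, the crawl directory produced by Nutch 0.9 normally contains subdirectories along these lines:

gucas/crawldb    - the crawl database of known URLs and their fetch state
gucas/linkdb     - the link database (inverted links between pages)
gucas/segments   - one segment per fetch round (raw and parsed content)
gucas/indexes    - per-segment Lucene indexes
gucas/index      - the merged Lucene index used by the searcher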

If gucas already exists before the run, the error "gucas already exists" is reported. In that case delete the directory, or specify a different directory for the crawled pages.

After the preceding steps finish, the data has been crawled successfully.

Test: bin/nutch org.apache.nutch.searcher.NutchBean followed by the keyword to query.
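
As a sketch, with a hypothetical keyword such as "nutch", the test looks like this; NutchBean reads searcher.dir from conf/nutch-site.xml, so it searches the index under the gucas directory configured above:

[root@red-hat-9 nutch-0.9]# bin/nutch org.apache.nutch.searcher.NutchBean nutch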

 

The above crawls only a single website, which does not show the advantage of a web crawler fetching data from many sites. The following example shows how to crawl several websites.

Create a new multiurls.txt file in the Nutch home directory and write into it the list of URLs to be crawled:

http://www.pcauto.com.cn/

http://www.xcar.com.cn/

http://auto.sina.com.cn

Modify the filter rule file crawl-urlfilter.txt so that any site is allowed:

# accept hosts in MY.DOMAIN.NAME
+^

# skip everything else
-.

The rule +^ matches every URL, so links from all websites are allowed by default.

Run the capture command

[root@red-hat-9 nutch-0.9]# bin/nutch crawl multiurls.txt -dir mutilweb -depth 50 -threads 5 -topN 1000 >& logs/logs_crawl.log

Change conf/nutch-site.xml to:

<configuration>
  <property>
    <!-- http.agent.name here names the web crawler itself -->
    <name>http.agent.name</name>
    <value>*</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>1.0</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>/zkl/IR/nutch-0.9/gucas</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>
Test: bin/nutch org.apache.nutch.searcher.NutchBean SUV

This queries the keyword "SUV".

---------------------------------------------------------------------

⑥ Deploy the Web Front-End
Copy the nutch-0.9.war package from the Nutch home directory into Tomcat's webapps directory:
[root@red-hat-9 nutch-0.9]# cp nutch-0.9.war /zkl/program/apache-tomcat-6.0.18/webapps/
Then open http://localhost:8080/nutch-0.9/ in a browser; the war package is unpacked automatically and a nutch-0.9 folder appears under Tomcat's webapps directory.

⑦ Modify the Nutch web configuration in Tomcat
# vi /zkl/program/apache-tomcat-6.0.18/webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml

Change the searcher.dir property value to the directory in which the index was generated:

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/zkl/IR/nutch-0.9/gucas</value>
    <description>
      Path to root of crawl. This directory is searched (in
      order) for either the file search-servers.txt, containing a list of
      distributed search servers, or the directory "index" containing
      merged indexes, or the directory "segments" containing segment
      indexes.
    </description>
  </property>
</configuration>

3.2 Apply Nutch (an issue where searches return no results remains unresolved)

Restart Tomcat, then visit http://localhost:8080/nutch-0.9/
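
The restart can be done with Tomcat's own scripts, for example:

# /zkl/program/apache-tomcat-6.0.18/bin/shutdown.sh
# /zkl/program/apache-tomcat-6.0.18/bin/startup.sh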

Common errors:

① Entering a keyword and clicking Search produces an error:

HTTP Status 500 -
--------------------------------------------------------------------------------
type Exception report
message
description The server encountered an internal error () that prevented it from fulfilling this request.
exception
org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value  language + "/include/header.html" is quoted with " which must be escaped when used within the value
 org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:40)

 

This is because the JSP 2.0 syntax has changed. In search.jsp, change

Set "<% = language +"/include/header.html "%>"/>

Change

'<% = Language + "/include/header.html" %>'/>.
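
For reference, a sketch of the corrected line in search.jsp, assuming it is the <jsp:include> that pulls in the language-specific header:

<jsp:include page='<%= language + "/include/header.html" %>'/>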

② The following error occurs during crawling:

[root@red-hat-9 nutch-0.9]# bin/nutch crawl urls -dir gucas -depth 50 -
threads 5 >& logs/logs1.log

(note the stray break between "-" and "threads", which causes the error explained below)

[root@red-hat-9 nutch-0.9]# cat logs/logs1.log

crawl started in: gucas
rootUrlDir = 5
threads = 10
depth = 50
Injector: starting
Injector: crawlDb: gucas/crawldb
Injector: urlDir: 5
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /zkl/IR/nutch-0.9/5
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

 

 

The same kind of error is documented for Hadoop's own Quickstart: when you try the grep example in the Quickstart, you get an error like the following:

org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /user/ross/input

You haven't created an input directory containing one or more text files:

bin/hadoop dfs -put conf input

Here, however, the cause is how the crawl command above was entered: the stray break before "threads 5" means -threads is never parsed as an option, so "5" is treated as the URL directory and Nutch looks for the nonexistent input path /zkl/IR/nutch-0.9/5.
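
A corrected form of the command, with -threads written as a single option on one line, is:

[root@red-hat-9 nutch-0.9]# bin/nutch crawl urls -dir gucas -depth 50 -threads 5 >& logs/logs1.log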
