Getting started with Nutch


Because the index generated by Solr is stored on the local disk, and I wanted to put the search index on HDFS, I recently looked into how to set up distributed Nutch and the Nutch + Solr integration.

The crawling process of Nutch: after the target website is crawled, five sub-directories are generated under the crawl directory: crawldb, linkdb, segments, indexes, and index. The crawldb holds the set of known URLs and their fetch status. The linkdb holds the incoming links for each page, collected from the links found on pages as the crawler fetches them. The segments directory contains one sub-directory per fetch round, named by its timestamp; each is produced by a generate, fetch, and update cycle. The indexes directory contains the Lucene indexes generated during the process, and the index directory contains the merged Lucene index.
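
As a rough sketch, the resulting layout looks like this (the timestamped segment name is hypothetical; yours will differ, one per fetch round):

crawl/
├── crawldb/     # known URLs and their fetch status
├── linkdb/      # incoming links for each page
├── segments/
│   └── 20110801123456/   # hypothetical timestamp; one per generate/fetch/update round
├── indexes/     # Lucene indexes generated along the way
└── index/       # the merged Lucene index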

Running Nutch standalone:

1. Download the Nutch binary package from the official website and decompress it. I am using Nutch 1.2; I cannot guarantee that the installation process is the same for other versions.

2. In the Nutch installation directory, create a new folder named urls (you can choose any name). In this folder, create a new file named url.txt (again, any name works) and write http://www.baidu.com in it. This is the crawler's entry address.
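
For example, a minimal sketch, assuming you are inside the nutch-1.2 installation directory:

mkdir urls
echo "http://www.baidu.com" > urls/url.txt    # the crawler's seed URL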

3. Open nutch-1.2/conf/crawl-urlfilter.txt and find the line containing MY.DOMAIN.NAME:

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

Change it to:

+^http://([a-z0-9]*\.)*

With the domain name removed, the pattern matches any host, meaning the crawler may crawl any website.
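
Conversely, to restrict the crawl to a single site, keep a concrete domain in the pattern; for example, to stay within the seed site used above:

+^http://([a-z0-9]*\.)*baidu.com/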

4. Open nutch-1.2/conf/nutch-site.xml and add the following content:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MySearch</value>
  </property>
</configuration>

Otherwise, an error is returned:

Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property

Run the following command:

bin/nutch crawl urls -dir crawl -depth 2 -topN 100 -threads 2

-dir crawl: the path for storing the downloaded data; the directory is created automatically if it does not exist.
-depth 2: the download depth is 2.
-topN 100: download at most the first 100 eligible pages per round.
-threads 2: the number of fetcher threads to start.
When the crawler runs, it prints a large amount of log output. After the crawl finishes, you will find that the crawl directory has been generated, containing the five sub-directories described above: crawldb, linkdb, segments, indexes, and index.
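
To sanity-check the result, Nutch ships a readdb tool that prints statistics about the crawl database (an optional check, not required by the tutorial):

bin/nutch readdb crawl/crawldb -stats    # prints the total URL count and fetch status breakdown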

 

Run the nutch search:

1. Download and run Tomcat.

2. Copy nutch-1.2.war from the nutch-1.2 directory to tomcat6/webapps/. While running, Tomcat automatically unpacks this package. In the unpacked webapp, open WEB-INF/classes/nutch-site.xml and add:

<property>
  <name>searcher.dir</name>
  <value>/usr/local/nutch-1.2/crawl</value>
</property>

The value is the storage path of the crawled data; the search engine looks up results under this path.
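
After editing the configuration, restart Tomcat so the change takes effect (a sketch, assuming the tomcat6 directory from step 2):

tomcat6/bin/shutdown.sh
tomcat6/bin/startup.sh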

3. Run the nutch search on the Web:

Enter http://localhost:8080/nutch-1.2 in the address bar.

Enter a search keyword on the page that appears to get results. If the results contain garbled characters, open conf/server.xml in the Tomcat installation directory, locate the Connector element, and modify it as follows:

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" useBodyEncodingForURI="true"/>

References: Nutch-1.2 Configuration

 

Distributed configuration of Nutch

Nutch = Hadoop + Lucene. After downloading the package from the official website and decompressing it, you will indeed find the Hadoop and Lucene jars inside. Setting up distributed Nutch is similar to setting up a Hadoop cluster: configure SSH access between the machines and write the configuration files. For details, see the following references:

1. http://wiki.apache.org/nutch/NutchHadoopTutorial

2. Chinese translation of the NutchHadoopTutorial

3. Distributed installation of Nutch

4. Clustered search of Nutch

5. Nutch 1.0 distributed query deployment

Nutch + Solr

Solr must be version 1.4, but I used version 3.1. When the crawl results are mapped into the index, the following error occurs: java.io.IOException: Job failed!
Why? I could not tell from the error at which step it failed. If any experienced reader knows, please don't hesitate to give me some advice!

2011-08-01 supplement: the SolrJ client bundled with Nutch 1.2 is version 1.4.0, so an error is reported when mapping to Solr 3.1. Only matching versions work.
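
One quick way to confirm the bundled client version is to look at the jars shipped with Nutch (the exact jar name below is an assumption and may differ in your distribution):

ls nutch-1.2/lib | grep -i solr    # e.g. apache-solr-solrj-1.4.0.jar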

1. Configure schema.xml

Copy schema.xml from the nutch-1.2/conf directory to /home/test/solr-1.4/example/solr/conf, replacing the original file.
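
As a one-line sketch, assuming the nutch-1.2 installation path used earlier in this article:

cp /usr/local/nutch-1.2/conf/schema.xml /home/test/solr-1.4/example/solr/conf/schema.xml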
2. Add the following content at the end of /home/test/solr-1.4/example/solr/conf/solrconfig.xml:

<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

Note: do not write literal greater-than or less-than characters in the values; convert them into the corresponding escape entities. Otherwise, an error is reported when solrconfig.xml is loaded at server startup.
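
For example, the mm parameter above must use the &lt; entity; the unescaped form breaks the XML:

<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>    correct, escaped
<str name="mm">2<-1 5<-2 6<90%</str>             invalid XML, fails to load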

3. Start the Solr service: in solr/example/, run java -jar start.jar
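
To verify that Solr came up, you can hit the ping handler (an optional check; the URL assumes the default example port 8983):

curl http://127.0.0.1:8983/solr/admin/ping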

4. Map the Nutch index into Solr:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

5. Test the Solr index
Visit the Solr admin query interface at http://127.0.0.1:8983/solr/admin, submit a query form, and check whether the data can be found.
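
You can also query from the command line; the first request below uses the standard select handler, the second the /nutch handler defined above (the keyword baidu is only an example):

curl 'http://127.0.0.1:8983/solr/select?q=*:*&rows=5'    # are any documents indexed at all?
curl 'http://127.0.0.1:8983/solr/nutch?q=baidu'          # dismax search via the /nutch handler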

 

References:

1. Using Nutch with Solr
2. http://blog.csdn.net/laigood12345/article/details/6091813

 
