Because the index generated by Solr is stored on the local disk, I recently looked into how to set up a distributed Nutch deployment and a Nutch + Solr integration, in order to put the search index on HDFS.
The crawling process of Nutch: after the target website is crawled, five subdirectories are generated under the crawl directory: crawldb, linkdb, segments, indexes, and index. The crawldb holds the pages (URLs) known to the crawler; the linkdb holds the links pointing to each page, collected from the links discovered on pages during the crawl. The segments directory is divided into parts named by timestamp, each produced by one generate/fetch/update cycle. The indexes directory contains the Lucene indexes generated during the process, and the index directory contains the merged Lucene index.
Running Nutch standalone:
1. Download the binary package of Nutch from the official website and decompress it. I am using Nutch 1.2; I cannot guarantee the installation process is the same for other versions.
2. In the Nutch installation directory, create a new folder named urls (the name is up to you). In this folder, create a new file named url.txt (again, any name) and write http://www.baidu.com in it; this is the crawler's entry address.
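The step above amounts to the following commands (a sketch, run from the Nutch install directory; "urls" and "url.txt" are just the arbitrary names chosen here):

```shell
# create the seed-URL directory and file in the Nutch install directory;
# the URL written into url.txt is the crawler's entry point
mkdir -p urls
echo "http://www.baidu.com" > urls/url.txt
cat urls/url.txt
```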
3. Open nutch-1.2/conf/crawl-urlfilter.txt, find the MY.DOMAIN.NAME line, and change
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to
+^http://([a-z0-9]*\.)*
i.e. drop the domain name that follows the *, which means the crawler may crawl any website.
4. Open nutch-1.2/conf/nutch-site.xml and add the following content:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MySearch</value>
  </property>
</configuration>
Otherwise, an error is returned:
Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property
Run the following command:
bin/nutch crawl urls -dir crawl -depth 2 -topN 100 -threads 2
-dir crawl: the path for storing the downloaded data; if the directory does not exist, it is created automatically
-depth 2: the crawl depth is 2
-topN 100: download at most the first 100 eligible pages per level
-threads 2: the number of fetcher threads to start
While the crawler runs it outputs a large amount of log data. After the crawl finishes, you will find the generated crawl directory, which contains five subdirectories: crawldb, linkdb, segments, indexes, and index.
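As a quick sanity check after the crawl (a sketch; it assumes you run it from the directory containing crawl/):

```shell
# report whether each expected subdirectory of crawl/ exists
for d in crawldb linkdb segments indexes index; do
  if [ -d "crawl/$d" ]; then
    echo "crawl/$d: present"
  else
    echo "crawl/$d: MISSING"
  fi
done
```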
Running the Nutch search:
1. Download and run Tomcat.
2. Copy nutch-1.2.war from the Nutch directory to tomcat6/webapps/; a running Tomcat decompresses the package automatically. In the decompressed webapp, open WEB-INF/classes/nutch-site.xml and add:
<property>
  <name>searcher.dir</name>
  <value>/usr/local/nutch-1.2/crawl</value>
</property>
The value is the storage path of the crawled data. The search engine searches for the desired content based on this path.
3. Run the nutch search on the Web:
Enter http://localhost:8080/nutch-1.2 in the address bar.
Enter a search keyword on the displayed search page to get results. If garbled characters appear, open conf/server.xml in the Tomcat installation directory, find the Connector element, and modify it:
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
References: Nutch-1.2 Configuration
Distributed configuration of nutch
Nutch = Hadoop + Lucene. After downloading the package from the official website and decompressing it, you can see that it does indeed bundle the Hadoop and Lucene packages. The distributed setup of Nutch is similar to building a Hadoop cluster: set up SSH connectivity between the machines and write the configuration files. For details, see the following references:
1. http://wiki.apache.org/nutch/NutchHadoopTutorial
2. Chinese translation of NutchHadoopTutorial
3. Distributed installation of Nutch
4. Clustered search with Nutch
5. Nutch-1.0 distributed query deployment
Nutch + SOLR
Solr must be version 1.4. I used version 3.1, and when mapping the index to Solr the following error occurred: java.io.IOException: Job failed!
Why? I did not know how to track down which step reported the error. If anyone knows the reason, please don't hesitate to advise!
2011-8-1 update: the solrj bundled with Nutch 1.2 is version 1.4.0, so an error is reported when indexing to Solr 3.1; only matching versions work.
1. Configure schema.xml
Copy schema.xml from the nutch-1.2/conf directory to /home/test/solr-1.4/example/solr/conf, replacing the original file.
2. Add the following content at the end of /home/test/solr-1.4/example/solr/conf/solrconfig.xml:
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>
Note: do not leave raw greater-than or less-than characters inside element text (for example in the mm parameter); convert them into the corresponding escape sequences &lt; and &gt;. Otherwise, an error is reported when Tomcat loads solrconfig.xml at startup.
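For example, the mm value contains literal less-than signs; a quick way to escape such a string (a sketch using sed):

```shell
# replace raw XML-special characters with their escape sequences;
# ampersands must be escaped first so the other replacements are not re-escaped
printf '%s\n' '2<-1 5<-2 6<90%' | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
# prints: 2&lt;-1 5&lt;-2 6&lt;90%
```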
3. Start the Solr service: cd into solr-1.4/example and run java -jar start.jar
4. Map the Nutch index to Solr:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
5. Test the Solr indexes
Visit the Solr admin query interface at http://127.0.0.1:8983/solr/admin, submit a query form, and check whether the data can be found.
References:
1. Using Nutch with Solr
2. http://blog.csdn.net/laigood12345/article/details/6091813