[2.1 Basic tutorial of Nutch 2.2.1] Integrating Nutch, HBase, and Solr to build a search engine

Source: Internet
Author: User

1. Download and decompress the software.

The versions used are:

(1) apache-nutch-2.2.1

(2) hbase-0.90.4

(3) solr-4.9.0

Decompress the packages to /usr/search.

2. Configure Nutch

(1) vi /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml

Add the storage class property (its description reads "Default class for storing data"):

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

(2) vi /usr/search/apache-nutch-2.2.1/ivy.xml

By default the gora-hbase dependency is commented out; remove the comment markers around it for it to take effect.

(3) vi /usr/search/apache-nutch-2.2.1/conf/gora.properties

Add the following statement:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

The above three steps specify the use of HBase for storage.
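The gora.properties edit in step (3) can be scripted. This sketch works on a scratch copy so that the real conf directory (whose location varies per install) is untouched; point conf at your actual Nutch conf path to apply it for real:

```shell
# Append the Gora default-datastore setting to a scratch gora.properties;
# replace mktemp with your real Nutch conf directory to apply it.
conf=$(mktemp -d)
touch "$conf/gora.properties"
echo "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore" >> "$conf/gora.properties"
grep -c '^gora.datastore.default' "$conf/gora.properties"   # prints 1 when set
```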

The following steps are necessary to build a basic Nutch.

(4) Build the runtime

ant runtime

(5) Verify that the installation of Nutch is complete.

[root@jediael44 apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

(6) vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the crawler's agent name (required before fetching):

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>

(7) Create seed.txt

vi seed.txt

Put the start URLs in it, one per line, e.g. http://nutch.apache.org/ (matching the URL filter configured in the next step).

(8) Modify the URL filter: vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

Replace the catch-all rule at the end

# accept anything else
+.

with a rule limited to the site you want to crawl:

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
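As a quick sanity check, the new rule can be exercised with grep -E, which approximates the Java regex syntax Nutch uses here:

```shell
# URLs matching the filter pattern are kept; everything else is dropped.
re='^http://([a-z0-9]*\.)*nutch.apache.org/'
echo 'http://nutch.apache.org/' | grep -Ec "$re"          # prints 1 (kept)
echo 'http://www.example.com/'  | grep -Ec "$re" || true  # prints 0 (dropped)
```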

(9) add index content

By default, the fields covered by the core and index-basic entries in schema.xml are indexed. To index more fields, add the corresponding plugins as follows.

Modify nutch-default.xml, extending the plugin.includes value:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.</description>
</property>

Alternatively, you can add the plugin.includes property to nutch-site.xml with the value above. Note that properties in nutch-site.xml override those in nutch-default.xml, so you must copy the original value as well.

3. HBase Configuration

(1) vi /usr/search/hbase-0.90.4/conf/hbase-site.xml

Note: this step is optional. If you skip it, the default values from hbase-default.xml (/usr/search/hbase-0.90.4/src/main/resources/hbase-default.xml) are used.

The default value of hbase.rootdir is described as follows:

   The directory shared by region servers and into which HBase persists. The URL should be 'fully-qualified' to include the filesystem scheme. For example, to specify the HDFS directory '/hbase' where the HDFS instance's namenode is running at namenode.example.org on port 9000, set this value to: hdfs://namenode.example.org:9000/hbase. By default HBase writes into /tmp. Change this configuration else all data will be lost on machine restart.

That is, data is stored in the /tmp directory by default; if the machine is restarted, the data may be lost.

However, we recommend configuring these properties, especially the second one, the ZooKeeper data directory; otherwise various problems may occur. Point them at directories in the local file system.


Note: if the value is not prefixed with file://, an hdfs:// URL is assumed by default. In version 0.90.4, however, the local file system is used by default.
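A minimal hbase-site.xml along these lines might look as follows; the two directory paths are placeholders of my choosing, not values from the original article:

```xml
<configuration>
  <!-- Where HBase persists its data; file:// keeps it on the local FS. -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///usr/search/data/hbase</value>
  </property>
  <!-- Where the bundled ZooKeeper stores its data. -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/search/data/zookeeper</value>
  </property>
</configuration>
```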

4. Solr Configuration

(1) Overwrite Solr's schema.xml. (For Solr 4, schema-solr4.xml should be used.)

cp /usr/search/apache-nutch-2.2.1/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/

(2) With Solr 3.6 the configuration would now be complete, but Solr 4.9 requires further changes: modify the copied schema.xml file accordingly.



5. Start the crawl task

(1) Start HBase

[root@jediael44 bin]# cd /usr/search/hbase-0.90.4/bin/
[root@jediael44 bin]# ./start-hbase.sh

(2) Start Solr

[root@jediael44 bin]# cd /usr/search/solr-4.9.0/example/
[root@jediael44 example]# java -jar start.jar

(3) Start Nutch and run the crawl task

[root@jediael44 example]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./crawl seed.txt TestCrawl http://localhost:8983/solr 2

The crawl task now runs.

For some analysis of the above process, see:

Integrating with Nutch/Hbase/Solr to build a search engine 2: Content Analysis


When you use crontab to schedule a periodic Nutch crawl, the following error occurs:

JAVA_HOME is not set.

Therefore, create a script to run the crawl task:

#!/bin/bash
export JAVA_HOME=/usr/java/jdk1.7.0_51
/opt/jediael/apache-nutch-2.2.1/runtime/local/bin/crawl /opt/jediael/apache-nutch-2.2.1/runtime/local/urls/ main http://localhost:8080/solr/ 2 >> ~jediael/nutch.log

Then configure the cron job:

30 0,6,8,10,12,14,16,18,20,22 * * * bash /opt/jediael/apache-nutch-2.2.1/runtime/local/bin/myCrawl.sh
