1. Download and decompress the software.
The version number is as follows:
(1) apache-nutch-2.2.1
(2) hbase-0.90.4
(3) solr-4.9.0
Decompress the packages to /usr/search.
2. Configure Nutch
(1) vi /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
(2) vi /usr/search/apache-nutch-2.2.1/ivy.xml
By default, the gora-hbase dependency statement is commented out; remove the comment markers so it takes effect.
(3) vi /usr/search/apache-nutch-2.2.1/conf/gora.properties
Add the following line:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
The above three steps specify the use of HBase for storage.
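The gora.properties step can also be done from the shell; a minimal sketch, where GORA_PROPS is an assumed stand-in for /usr/search/apache-nutch-2.2.1/conf/gora.properties (defaulted to a local file so the snippet is safe to dry-run):

```shell
# Append the Gora datastore setting only if it is not already present.
# GORA_PROPS is an assumed stand-in path, not the real install path.
GORA_PROPS=${GORA_PROPS:-./gora.properties}
touch "$GORA_PROPS"
grep -q '^gora.datastore.default=' "$GORA_PROPS" || \
  echo 'gora.datastore.default=org.apache.gora.hbase.store.HBaseStore' >> "$GORA_PROPS"
grep '^gora.datastore.default=' "$GORA_PROPS"
```

Guarding with grep keeps the edit idempotent, so re-running the setup does not duplicate the line.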
The following steps are necessary to build a basic Nutch.
(4) Build the runtime
cd /usr/search/apache-nutch-2.2.1/
ant runtime
(5) Verify that the installation of Nutch is complete.
[root@jediael44 apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
  inject          inject new urls into the database
  hostinject      creates or updates an existing host table from a text file
  generate        generate new batches to fetch from crawl db
  fetch           fetch URLs marked during generate
  parse           parse URLs marked during fetch
  updatedb        update web table after parsing
  updatehostdb    update host table after parsing
  readdb          read/dump records from page database
  readhostdb      display entries from the hostDB
  elasticindex    run the elasticsearch indexer
  solrindex       run the solr indexer on parsed batches
  solrdedup       remove duplicates from solr
  parsechecker    check the parser for a given url
  indexchecker    check the indexing filters for a given url
  plugin          load a plugin and run one of its classes main()
  nutchserver     run a (local) Nutch server on a user defined port
  junit           runs the given JUnit test
or
  CLASSNAME       run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
(6) vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the crawler agent name:
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
(7) Create seed.txt
cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
vi seed.txt
http://nutch.apache.org/
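Equivalently, and more script-friendly than vi, the seed file can be written with echo; BIN_DIR below is an assumed stand-in for the runtime/local/bin directory:

```shell
# Write the single seed URL into seed.txt non-interactively.
# BIN_DIR is an assumed stand-in, defaulted to the current directory.
BIN_DIR=${BIN_DIR:-.}
echo "http://nutch.apache.org/" > "$BIN_DIR/seed.txt"
cat "$BIN_DIR/seed.txt"
```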
(8) Modify the URL filter: vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt
Change
# accept anything else
+.
to
# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
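Because regex-urlfilter.txt rules are ordinary regular expressions, the effect of the new + line can be checked with grep -E; the URLs below are illustrative:

```shell
# The pattern from the + rule above, with the dots escaped for strict matching.
pattern='^http://([a-z0-9]*\.)*nutch\.apache\.org/'
for url in http://nutch.apache.org/ http://wiki.nutch.apache.org/docs http://example.com/; do
  if printf '%s\n' "$url" | grep -Eq "$pattern"; then
    echo "ACCEPT $url"
  else
    echo "REJECT $url"
  fi
done
```

Only URLs on nutch.apache.org (including subdomains) are accepted; everything else is rejected, so the crawl stays within that site.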
(9) Add index content
By default, only the core and index-basic fields in the schema.xml file are indexed. To index more fields, add the corresponding plugins as follows.
Modify nutch-default.xml and extend the plugin.includes property:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>
Alternatively, you can add the plugin.includes property to nutch-site.xml and copy the content above into it. Note that properties in nutch-site.xml replace those in nutch-default.xml, so you must copy the original value as well, not just the additions.
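A sketch of what the override in nutch-site.xml would look like; note the value is the complete list copied from nutch-default.xml, since the override replaces the default value entirely:

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
</property>
```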
3. HBase Configuration
(1) vi /usr/search/hbase-0.90.4/conf/hbase-site.xml and set the hbase.rootdir and hbase.zookeeper.property.dataDir properties.
Note: This step is optional. If omitted, the default values in hbase-default.xml (/usr/search/hbase-0.90.4/src/main/resources/hbase-default.xml) are used.
Default value:
<property>
  <name>hbase.rootdir</name>
  <value>file:///tmp/hbase-${user.name}/hbase</value>
  <description>The directory shared by region servers and into which HBase persists. The URL should be 'fully-qualified' to include the filesystem scheme. For example, to specify the HDFS directory '/hbase' where the HDFS instance's namenode is running at namenode.example.org on port 9000, set this value to: hdfs://namenode.example.org:9000/hbase. By default HBase writes into /tmp. Change this configuration else all data will be lost on machine restart.</description>
</property>
That is, data is stored under /tmp by default, so it may be lost if the machine restarts.
We nevertheless recommend configuring both properties, especially the second (ZooKeeper) one; otherwise various problems may occur. Point them at directories in the local file system:
<property>
  <name>hbase.rootdir</name>
  <value>file:///home/jediael/hbaserootdir</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>file:///home/jediael/hbasezookeeperdataDir</value>
</property>
Note: without the file:// prefix, the hdfs:// scheme would normally be assumed by default; in version 0.90.4, however, the local file system is used by default.
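Since these values point at plain local directories, it is worth pre-creating them so HBase and ZooKeeper can write there; a sketch, with BASE as an assumed stand-in for /home/jediael (defaulted to a temp directory so it is safe to dry-run):

```shell
# Pre-create the two data directories referenced in hbase-site.xml.
# BASE is an assumed stand-in path, not the real home directory.
BASE=${BASE:-$(mktemp -d)}
mkdir -p "$BASE/hbaserootdir" "$BASE/hbasezookeeperdataDir"
ls "$BASE" | sort
```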
4. Solr Configuration
(1) Overwrite Solr's schema.xml file. (For Solr 4, schema-solr4.xml should be used.)
cp /usr/search/apache-nutch-2.2.1/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/
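If you want to keep Solr's stock schema around, back it up before overwriting; a sketch using stand-in paths (SRC and DST_DIR are assumptions, not the real install paths, so the snippet can be dry-run anywhere):

```shell
# Back up any existing schema.xml, then overwrite it with Nutch's copy.
# SRC and DST_DIR are assumed stand-ins for the Nutch and Solr conf paths.
SRC=${SRC:-./nutch-schema.xml}
DST_DIR=${DST_DIR:-./collection1/conf}
mkdir -p "$DST_DIR"
echo "<schema/>" > "$SRC"   # stand-in for Nutch's conf/schema.xml
if [ -f "$DST_DIR/schema.xml" ]; then
  cp "$DST_DIR/schema.xml" "$DST_DIR/schema.xml.bak"
fi
cp "$SRC" "$DST_DIR/schema.xml"
ls "$DST_DIR"
```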
(2) If Solr 3.6 were used, the configuration would now be complete; since 4.9 is used here, the copied schema.xml must also be modified as follows.
Delete:
Add:
5. Start the crawl task
(1) Start HBase
[root@jediael44 bin]# cd /usr/search/hbase-0.90.4/bin/
[root@jediael44 bin]# ./start-hbase.sh
(2) Start Solr
[root@jediael44 bin]# cd /usr/search/solr-4.9.0/example/
[root@jediael44 example]# java -jar start.jar
(3) Start Nutch and run the crawl
[root@jediael44 example]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./crawl seed.txt TestCrawl http://localhost:8983/solr 2
The task then runs.
For an analysis of the above process, see:
Integrating Nutch/HBase/Solr to build a search engine, part 2: Content Analysis
http://blog.csdn.net/jediael_lu/article/details/37738569
When crontab is used to schedule a routine Nutch task, the following error occurs:
JAVA_HOME is not set.
Therefore, a script is created to run the crawl task:
#!/bin/bash
export JAVA_HOME=/usr/java/jdk1.7.0_51
/opt/jediael/apache-nutch-2.2.1/runtime/local/bin/crawl /opt/jediael/apache-nutch-2.2.1/runtime/local/urls/ main http://localhost:8080/solr/ 2 >> ~jediael/nutch.log
Then configure the routine task
30 0,6,8,10,12,14,16,18,20,22 * * * bash /opt/jediael/apache-nutch-2.2.1/runtime/local/bin/myCrawl.sh
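The wrapper script is needed because cron runs jobs with an almost empty environment, so JAVA_HOME must be exported inside the script itself. A minimal demonstration of the pattern; the echo is a stand-in for the real crawl invocation, and the JDK path mirrors the script above:

```shell
# Build and run a cron-style wrapper that sets its own JAVA_HOME.
# The echo line is a stand-in for the actual crawl command.
cat > myCrawl-demo.sh <<'EOF'
#!/bin/bash
export JAVA_HOME=/usr/java/jdk1.7.0_51
echo "JAVA_HOME=$JAVA_HOME"
EOF
chmod +x myCrawl-demo.sh
./myCrawl-demo.sh
```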