Deployment and basic use of a Nutch 2.x + HBase environment
Our project needed Nutch for web crawling, and while researching it I found that the documentation available online is scattered and hard to learn from. I have therefore summarized my notes here to share with you.
1. Environment deployment
Nutch comes in a 1.x series and a 2.x series; the main difference is that 2.x uses Apache Gora as its persistence layer, which can persist data to various storage backends (such as HBase). For more information, see the official Nutch website.
The following describes a Nutch 2.3.1 + HBase deployment. HBase itself depends on HDFS and ZooKeeper: it stores files in HDFS and metadata (table information) in ZooKeeper. Nutch, however, only uses HBase as a persistence layer and does not care whether HBase runs in standalone or distributed mode. In standalone mode, the local file system takes the place of HDFS, and ZooKeeper can be the built-in one (an external ZooKeeper is used in distributed mode).
Install Nutch
Decompress the source package:
tar -zxf apache-nutch-2.3.1-src.tar.gz
Go to the decompressed directory.
cd apache-nutch-2.3.1
Run ant to compile:
ant
(If you decompressed a binary package, ant compilation is not required, and the path /.../apache-nutch-2.3.1 itself is $NUTCH_RUNTIME_HOME. In practice, though, 2.x ships as source only; binary packages exist only for 1.x.)
Compilation takes a long time, mainly because downloading the dependencies is very slow. (It is said that changing the download source address helps, but when I tried it the download failed.) After compilation, the runtime/local directory is generated and you can set the environment variables:
export NUTCH_RUNTIME_HOME=/home/apache/apache-nutch-2.3.1/runtime/local
export PATH=$PATH:$NUTCH_RUNTIME_HOME/bin
Run bin/nutch; if you see the usage prompt, the installation succeeded.
Install Ant
If running ant reports that the command is not found, install Ant first:
Decompress the tar package
tar -zxf apache-ant-1.9.6-bin.tar.gz
Add Environment Variables
export ANT_HOME=/home/apache-ant-1.9.6
export PATH=/home/apache-ant-1.9.6/bin:$PATH
Run ant -version; if the version number is printed, the installation succeeded.
Install HBase
Decompress the tar package
tar -zxf hbase-0.98.8-hadoop2-bin.tar.gz
Add Environment Variables
export HBASE_HOME=/home/hbase-0.98.8-hadoop2
2. Configuration
Configure HBase in standalone mode
In standalone mode, HDFS and Zookeeper are not required, and the local file system is used as the storage medium.
(Standalone mode uses port 2181 for the built-in ZooKeeper, so that port must not already be occupied.)
Modify $HBASE_HOME/conf/hbase-site.xml
Add the configuration inside the <configuration> tag:
<property>
  <name>hbase.rootdir</name>
  <value>file:///home/testuser/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/home/testuser/zookeeper</value>
</property>
The configured hbase.rootdir is the path where HBase stores its files, and hbase.zookeeper.property.dataDir is the path where the ZooKeeper used by HBase stores its metadata.
Run $HBASE_HOME/bin/start-hbase.sh to start HBase, then run jps to view the processes. If an HMaster process appears, the startup succeeded; you can find the run logs in the $HBASE_HOME/logs directory.
In standalone mode, only the local file system can be used as the storage medium and only the built-in ZooKeeper can be used; external HDFS and ZooKeeper are not available.
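The jps check above can be scripted. Below is a minimal sketch that polls for the HMaster process; the function name and the JPS_CMD override are illustrative additions (not part of HBase), included so the loop can be exercised without a running JVM.

```shell
# Sketch: poll jps until the HMaster process appears (not an official script).
# JPS_CMD is an illustrative override that defaults to the real jps command.
wait_for_hmaster() {
  tries=${1:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if ${JPS_CMD:-jps} | grep -q HMaster; then
      echo "HMaster is up"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "HMaster did not start; check \$HBASE_HOME/logs" >&2
  return 1
}
```

A typical use would be `$HBASE_HOME/bin/start-hbase.sh && wait_for_hmaster`.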
Configure HBase in cluster mode
1. Modify $HBASE_HOME/conf/hbase-site.xml
Add the configuration inside the <configuration> tag:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://192.168.1.92:9000/hbase</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
If hbase.cluster.distributed is false (the default), HBase runs in standalone mode; setting it to true enables distributed mode, in which case hbase.rootdir must be configured as an HDFS path. Here 192.168.1.92:9000 is the IP address and port of HDFS; if you do not have HDFS yet, you need to set up a Hadoop 2 environment first.
2. Modify $HBASE_HOME/conf/hbase-env.sh to set the JDK path for HBase.
Add (or uncomment):
export JAVA_HOME=/usr/java/jdk1.7.0_71/
Run $HBASE_HOME/bin/start-hbase.sh to start HBase; you will see that three components are started. Run jps to view the processes: if HMaster, HRegionServer, and HQuorumPeer are all present, the startup succeeded. The run logs are in the $HBASE_HOME/logs directory.
3. (Optional) Use an external ZooKeeper
In cluster mode, you can use either an external ZooKeeper or HBase's built-in one. The HQuorumPeer process seen above is the built-in ZooKeeper started by HBase. To use an external ZooKeeper, add the following configuration:
Modify $HBASE_HOME/conf/hbase-env.sh to tell HBase to use an external ZooKeeper.
Add (or uncomment):
export HBASE_MANAGES_ZK=false
Start your ZooKeeper before starting HBase. By default, HBase looks for ZooKeeper at localhost:2181; to customize the IP and port, add configuration to $HBASE_HOME/conf/hbase-site.xml:
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>192.168.1.92</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
Start the external ZooKeeper and then start HBase. This time only two processes start: HMaster and HRegionServer. Again, the run logs are in the $HBASE_HOME/logs directory.
Associate HBase with Nutch 2.x
We recommend using HBase 0.98.8 with Nutch 2.3.1.
Configure Nutch before associating it with HBase; this configuration is similar to that of Nutch 1.x:
1. Modify $NUTCH_RUNTIME_HOME/conf/nutch-site.xml
Add the following attributes:
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
2. (Optional) Set URL filter rules by modifying $NUTCH_RUNTIME_HOME/conf/regex-urlfilter.txt
Add a regular-expression rule similar to the following:
+^http://([a-z0-9]*\.)*nutch.apache.org/
A rule starting with + means that matching URLs are included in the crawl; a rule starting with - means that matching URLs are excluded.
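To make the +/- semantics concrete, here is a small shell sketch that mimics how a regex URL filter decides. This is only an illustration of the rule semantics; Nutch's actual filter is implemented in Java, and the function name here is invented for the example.

```shell
# Illustration of regex-urlfilter.txt semantics: rules are tried top to bottom,
# and the first matching rule wins. '+' accepts the URL, '-' rejects it, and a
# URL matching no rule at all is rejected.
url_passes_filter() {
  url=$1
  rules=$2
  while IFS= read -r rule; do
    case $rule in
      '#'*|'')
        continue ;;                                   # skip comments and blank lines
      +*)
        printf '%s' "$url" | grep -Eq "${rule#+}" && return 0 ;;
      -*)
        printf '%s' "$url" | grep -Eq "${rule#-}" && return 1 ;;
    esac
  done <<EOF
$rules
EOF
  return 1
}
```

With the rule shown above, http://nutch.apache.org/ passes the filter while http://example.com/ does not.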
3. Configure HBase attributes for Nutch
Modify $NUTCH_RUNTIME_HOME/conf/nutch-site.xml to add the following configuration:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
4. Add the HBase dependency for Nutch:
Modify $NUTCH_HOME/ivy/ivy.xml (note that $NUTCH_HOME here refers to the directory containing the Nutch source, not runtime/local)
Add (or uncomment) the following configuration
<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />
<dependency org="org.apache.hbase" name="hbase-common" rev="0.98.8-hadoop2" conf="*->default" />
5. Configure Gora to use HBase for Nutch:
Modify $NUTCH_RUNTIME_HOME/conf/gora.properties
Add (or uncomment) the following configuration
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
6. Re-compile with ant so that the configuration takes effect:
Run
ant runtime
3. Instructions for using HBase
Whether in standalone or distributed mode, you can run the following commands to verify that HBase works:
Start HBase:
$HBASE_HOME/bin/start-hbase.sh
Enter the HBase shell:
$HBASE_HOME/bin/hbase shell
Commands that can be executed in the shell:
create 'test', 'cf'
list 'test'
put 'test', 'row1', 'cf:a', 'value1'
put 'test', 'row2', 'cf:b', 'value2'
put 'test', 'row3', 'cf:c', 'value3'
scan 'test'
get 'test', 'row1'
disable 'test'
enable 'test'
drop 'test'
Stop HBase:
$HBASE_HOME/bin/stop-hbase.sh
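The shell can also be driven non-interactively by piping commands into it, which is handy in scripts. A minimal sketch, assuming $HBASE_HOME is set; run_hbase and the HBASE_SHELL override are illustrative names added here, not part of HBase:

```shell
# Pipe HBase shell commands from a script (sketch only).
# HBASE_SHELL is an illustrative override so the helper can be exercised
# without a live cluster; by default it runs the real HBase shell.
run_hbase() {
  printf '%s\n' "$@" | ${HBASE_SHELL:-$HBASE_HOME/bin/hbase shell}
}
# Example:
# run_hbase "create 'test', 'cf'" "put 'test', 'row1', 'cf:a', 'value1'" "scan 'test'"
```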
Instructions for using Nutch
The following takes version 2.3.1 as an example.
1. Create your seeds:
Create a urls directory, create a file seeds.txt inside it, and add your seed URL addresses to seeds.txt, one per line.
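For example (the URL here is only a placeholder seed):

```shell
# Create the urls directory and a seeds.txt with one seed URL per line.
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seeds.txt
```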
2. inject, which can be understood as creating a crawl task:
$NUTCH_RUNTIME_HOME/bin/nutch inject urls -crawlId yyy
Here, urls is the folder containing the seed file you created; an absolute path also works.
yyy is the ID of the crawl task. A table named yyy_webpage is created in HBase to record the data of this crawl task; you can use commands such as list and scan in the HBase shell to watch the crawl data change.
3. generate the links to be fetched:
$NUTCH_RUNTIME_HOME/bin/nutch generate -crawlId yyy
4. fetch the web pages:
$NUTCH_RUNTIME_HOME/bin/nutch fetch -all -crawlId yyy -threads 4
5. parse the fetched page data:
$NUTCH_RUNTIME_HOME/bin/nutch parse -crawlId yyy -all
6. updatedb updates the database:
$NUTCH_RUNTIME_HOME/bin/nutch updatedb -all -crawlId yyy
7. Repeat steps 3-6 to crawl the next depth.
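The generate → fetch → parse → updatedb cycle above can be scripted. A sketch, assuming $NUTCH_RUNTIME_HOME is set and the seeds were already injected; crawl_rounds and the NUTCH_BIN override are illustrative names, not part of Nutch:

```shell
# Run N generate/fetch/parse/updatedb rounds for a given crawl ID (sketch only).
# NUTCH_BIN is an illustrative override so the loop can be dry-run tested;
# it defaults to the real nutch binary.
crawl_rounds() {
  crawl_id=$1
  rounds=$2
  nutch=${NUTCH_BIN:-$NUTCH_RUNTIME_HOME/bin/nutch}
  i=1
  while [ "$i" -le "$rounds" ]; do
    $nutch generate -crawlId "$crawl_id"              || return 1
    $nutch fetch -all -crawlId "$crawl_id" -threads 4 || return 1
    $nutch parse -crawlId "$crawl_id" -all            || return 1
    $nutch updatedb -all -crawlId "$crawl_id"         || return 1
    i=$((i + 1))
  done
}
# Example: crawl_rounds yyy 3   # three rounds, i.e. crawl three levels deep
```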
For detailed optional command parameters, refer to the official website of Nutch.