Deployment and basic use of the Nutch2.x + Hbase environment


Our project needed Nutch for web crawling, and while researching it I found that the documentation available online is scattered and hard to learn from. I have therefore summarized my notes here to share with you.

1. Environment deployment

Nutch comes in a 1.x series and a 2.x series. The main difference is that 2.x uses Apache Gora as its persistence layer, which can persist data to a variety of storage backends such as HBase. For more information, see the official Nutch website.

The following describes how to deploy Nutch 2.3.1 + HBase. HBase normally depends on HDFS and ZooKeeper, but Nutch only uses HBase as a persistence layer and does not care whether HBase runs in standalone or distributed mode. In distributed mode, HBase stores its files in HDFS and its metadata (table information) in ZooKeeper; in standalone mode, the local file system takes the place of HDFS and HBase's built-in ZooKeeper is used.

Install Nutch

Decompress the source package:

tar -zxf apache-nutch-2.3.1-src.tar.gz

Go to the decompressed directory.

cd apache-nutch-2.3.1

Run ant to compile:

ant

(If you decompress a binary package, ant compilation is not required, and the path /.../apache-nutch-2.3.1 is $NUTCH_RUNTIME_HOME. In fact, 2.x has no binary package; only 1.x does.)

Compilation takes a long time (mainly because downloading the dependencies is very slow; it is said that modifying the download source address helps, but when I tried it the download failed). After compilation, the runtime/local directory is generated and you can set the environment variables:

export NUTCH_RUNTIME_HOME=/home/apache/apache-nutch-2.3.1/runtime/local

export PATH=$PATH:$NUTCH_RUNTIME_HOME/bin

Run bin/nutch; if the usage prompt appears, the installation was successful.

Install Ant

If running ant reports that the command is not found, install Ant first:

Decompress the tar package

tar -zxf apache-ant-1.9.6-bin.tar.gz

Add Environment Variables

export ANT_HOME=/home/apache-ant-1.9.6

export PATH=/home/apache-ant-1.9.6/bin:$PATH

Run ant -version; if the version number is printed, the installation was successful.

 

Install HBase

Decompress the tar package

tar -zxf hbase-0.98.8-hadoop2-bin.tar.gz

Add Environment Variables

export HBASE_HOME=/home/hbase-0.98.8-hadoop2

 

2. Configuration: HBase in standalone mode

In standalone mode, HDFS and Zookeeper are not required, and the local file system is used as the storage medium.

(Standalone mode uses port 2181, so this port must not already be occupied.)
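You can check up front whether something is already listening on that port. A minimal sketch, assuming a bash shell (port_in_use is my own helper name; /dev/tcp is a bash-specific feature):

```shell
#!/usr/bin/env bash
# port_in_use PORT - exit 0 if something accepts TCP connections on
# 127.0.0.1:PORT, non-zero otherwise (uses bash's /dev/tcp pseudo-device).
port_in_use() {
    (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

if port_in_use 2181; then
    echo "port 2181 is occupied - free it before starting HBase"
else
    echo "port 2181 is free"
fi
```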

Modify $HBASE_HOME/conf/hbase-site.xml

Add the following inside the <configuration> tag:

<property>
 <name>hbase.rootdir</name>
 <value>file:///home/testuser/hbase</value>
</property>
<property>
 <name>hbase.zookeeper.property.dataDir</name>
 <value>/home/testuser/zookeeper</value>
</property>

Here, hbase.rootdir is the path where HBase stores its files, and dataDir is the path where HBase's ZooKeeper stores its metadata.

Run $HBASE_HOME/bin/start-hbase.sh to start HBase, then run jps to view the processes. If the HMaster process appears, the startup was successful, and you can find the run logs in the $HBASE_HOME/logs directory.

 

In standalone mode, only the local file system can be used as the storage medium and only the built-in ZooKeeper can be used; external HDFS and ZooKeeper are not supported.

 

Configure HBase in cluster mode

1. Modify $HBASE_HOME/conf/hbase-site.xml

Add the following inside the <configuration> tag:

<property>
 <name>hbase.rootdir</name>
 <value>hdfs://192.168.1.92:9000/hbase</value>
</property>
<property>
 <name>hbase.cluster.distributed</name>
 <value>true</value>
</property>

If the hbase.cluster.distributed property is false (the default), HBase runs in standalone mode; setting it to true enables distributed mode. In that case, hbase.rootdir must be configured as an HDFS path; 192.168.1.92:9000 is the IP address and port of HDFS. If you do not have HDFS, you need to set up a Hadoop 2 environment first.

2. Modify $HBASE_HOME/conf/hbase-env.sh to set the JDK path for HBase:

Add (or uncomment)

export JAVA_HOME=/usr/java/jdk1.7.0_71/

 

Run $HBASE_HOME/bin/start-hbase.sh to start HBase; you will see that three components are started. Run jps to view the processes: if HMaster, HRegionServer, and HQuorumPeer are all present, the startup was successful, and you can find the run logs in the $HBASE_HOME/logs directory.

 

3. (Optional) Use an external ZooKeeper

In cluster mode, you can use either an external ZooKeeper or HBase's built-in ZooKeeper. The HQuorumPeer process seen above is the built-in ZooKeeper started by HBase. To use an external ZooKeeper, add the following configuration:

Modify $HBASE_HOME/conf/hbase-env.sh to tell HBase to use an external ZooKeeper:

Add (or uncomment)

export HBASE_MANAGES_ZK=false

Start your ZooKeeper before starting HBase. By default, HBase looks for ZooKeeper at localhost:2181; to use a custom IP and port, add the following to $HBASE_HOME/conf/hbase-site.xml:

<property>
 <name>hbase.zookeeper.quorum</name>
 <value>192.168.1.92</value>
</property>
<property>
 <name>hbase.zookeeper.property.clientPort</name>
 <value>2181</value>
</property>

Start the external ZooKeeper first, then start HBase. This time only two processes are started: HMaster and HRegionServer. Again, the run logs can be found in the $HBASE_HOME/logs directory.

Associate HBase with Nutch 2.x

We recommend using HBase 0.98.8 with Nutch 2.3.1.

Configure Nutch before associating it with HBase; the configuration is similar to that of Nutch 1.x:

1. Modify $NUTCH_RUNTIME_HOME/conf/nutch-site.xml

Add the following attributes:

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

2. (Optional) Set URL filter rules by modifying $NUTCH_RUNTIME_HOME/conf/regex-urlfilter.txt

Add a regular match similar to the following:

+^http://([a-z0-9]*\.)*nutch.apache.org/

A pattern prefixed with + means matching URLs are included in the crawl; a pattern prefixed with - means they are excluded.
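For reference, a regex-urlfilter.txt fragment might look like this (the patterns are illustrative; rules are evaluated top to bottom and the first matching rule decides):

```
# skip common image and archive file extensions
-\.(gif|jpg|png|zip|gz)$
# include pages under nutch.apache.org
+^http://([a-z0-9]*\.)*nutch.apache.org/
# exclude everything else
-.
```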

 

3. Configure HBase attributes for Nutch

Modify $NUTCH_RUNTIME_HOME/conf/nutch-site.xml and add the following configuration:

<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>

4. Add HBase dependency for Nutch:

Modify $NUTCH_HOME/ivy/ivy.xml (note that $NUTCH_HOME here refers to the directory where the Nutch source is located)

Add (or uncomment) the following configuration

<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />
<dependency org="org.apache.hbase" name="hbase-common" rev="0.98.8-hadoop2" conf="*->default" />

5. Configure Gora to use HBase for Nutch:

Modify $NUTCH_RUNTIME_HOME/conf/gora.properties

Add (or uncomment) the following configuration

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

6. Use ant to re-compile and make the configuration take effect:

Run

ant runtime

 

 

3. Instructions for using HBase

Whether in standalone or distributed mode, you can run the following commands to verify that HBase works correctly:

Start HBase

$HBASE_HOME/bin/start-hbase.sh

Enter the HBase shell:

$HBASE_HOME/bin/hbase shell

Commands that can be executed on the command line:

create 'test', 'cf'

list 'test'

put 'test', 'row1', 'cf:a', 'value1'

put 'test', 'row2', 'cf:b', 'value2'

put 'test', 'row3', 'cf:c', 'value3'

scan 'test'

get 'test', 'row1'

disable 'test'

enable 'test'

drop 'test'

Stop HBase

$HBASE_HOME/bin/stop-hbase.sh
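While HBase is running, the same commands can also be executed non-interactively by piping them into hbase shell. A sketch (hbase_smoke_commands is my own helper name; it only prints the command sequence):

```shell
#!/bin/sh
# hbase_smoke_commands - print a short create/put/scan/drop sequence that
# can be piped into the non-interactive HBase shell, for example:
#   hbase_smoke_commands | "$HBASE_HOME/bin/hbase" shell
hbase_smoke_commands() {
    cat <<'EOF'
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
disable 'test'
drop 'test'
EOF
}

hbase_smoke_commands
```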

Instructions for using Nutch

Take version 2.3.1 as an example.

1. Create your seeds:

Create a urls directory containing a file seeds.txt, and add seed URL addresses to seeds.txt.

2. inject can be understood as creating a crawling task:

$NUTCH_RUNTIME_HOME/bin/nutch inject urls -crawlId yyy

Here, urls is the folder containing the seed file; you can also use an absolute path.

yyy is the ID of the crawl task. In HBase, a table named yyy_webpage is created to record all of the task's crawl data; you can use commands such as list and scan in the HBase shell to watch the crawl data change.

3. generate the list of links to be crawled

$NUTCH_RUNTIME_HOME/bin/nutch generate -crawlId yyy

4. fetch crawls web pages

$NUTCH_RUNTIME_HOME/bin/nutch fetch -all -crawlId yyy -threads 4

5. parse webpage data

$NUTCH_RUNTIME_HOME/bin/nutch parse -crawlId yyy -all

6. updatedb updates the database

$NUTCH_RUNTIME_HOME/bin/nutch updatedb -all -crawlId yyy

7. Repeat steps 3-6 to crawl the next level of depth.
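The crawl cycle above can be wrapped in a small script. A minimal sketch, assuming $NUTCH_RUNTIME_HOME is set and reusing the example crawl ID yyy (crawl_round is my own helper name):

```shell
#!/bin/sh
# crawl_round NUTCH_BIN CRAWL_ID - run one generate -> fetch -> parse ->
# updatedb pass, i.e. one level of crawl depth.
crawl_round() {
    nutch_bin="$1"
    crawl_id="$2"
    "$nutch_bin" generate -crawlId "$crawl_id" &&
        "$nutch_bin" fetch -all -crawlId "$crawl_id" -threads 4 &&
        "$nutch_bin" parse -crawlId "$crawl_id" -all &&
        "$nutch_bin" updatedb -all -crawlId "$crawl_id"
}

# Typical use: inject the seeds once, then deepen the crawl, for example:
#   "$NUTCH_RUNTIME_HOME/bin/nutch" inject urls -crawlId yyy
#   for depth in 1 2 3; do
#       crawl_round "$NUTCH_RUNTIME_HOME/bin/nutch" yyy
#   done
```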

For detailed optional command parameters, refer to the official website of Nutch.

 
