Nutch + Hadoop Cluster Construction (reprint)


1. Apache Nutch

Apache Nutch is an open-source web search framework that provides all the tools needed to run your own search engine, including full-text search and a web crawler.

1.1. Nutch Component Structure

WebDB: stores web page data and link information.

Fetch lists: divide the links stored in the WebDB into multiple groups for distributed fetching.

Fetchers: fetch the contents of the fetch lists and download them locally; they produce two kinds of output: link/status updates and page content.

Updates: update the fetch status of pages in the WebDB.

The WebDB, updates, fetch lists, and fetchers form a loop that runs continuously to keep the resulting web image up to date.

Content: the fetched page content; once content is available, Nutch can build indexes and run queries against it.

Indexers: create an index over the fetched content. When the index grows large, it can be split into multiple index segments and assigned to different searchers for parallel retrieval.

Searchers: implement the query function and cache the content.

Web servers: play two roles:

1. handling user interaction requests (Nutch search client);

2. obtaining query results from the searchers (HTTP server).

Note: the work of the fetcher and searcher nodes can be carried out in a distributed environment (Hadoop).

Index creation and querying can also be implemented through the Solr framework.

1.2. Nutch Data Structure

Nutch data consists of three directory structures:

1. crawldb: stores the URLs that Nutch will fetch, along with their fetch status (whether and when each was fetched).

2. linkdb: stores the hyperlink information (including anchor text) contained in each URL.

3. segments: sets of URLs that are fetched as a unit and can be used for distributed fetching.

Each segment directory contains the following subdirectories:

(1) crawl_generate: defines the set of URLs to be fetched (file type: SequenceFile)

(2) crawl_fetch: stores the fetch status of each URL (file type: MapFile)

(3) content: stores the raw binary content of each URL (file type: MapFile)

(4) parse_text: stores the parsed text of each URL (file type: MapFile)

(5) parse_data: stores the parsed metadata of each URL (file type: MapFile)

(6) crawl_parse: used to update the crawldb in a timely manner (for example, when a URL no longer exists) (file type: SequenceFile)

Note: comparing the Nutch data structure with the component structure, the crawldb corresponds to the WebDB, and the segments correspond to the fetch lists.

During a distributed crawl, each MapReduce job generates a segment whose name is derived from the current timestamp; see the listing sketch below.
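For illustration only (assuming a crawl whose HDFS output directory is named crawl, as in the run example later in this article), the timestamp-named segments can be listed like this; the directory names shown in the comments are hypothetical:

$ bin/hadoop fs -ls crawl/segments
# one directory per crawl round, named by timestamp, e.g.
#   crawl/segments/20130901123045
#   crawl/segments/20130901130210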

2. Apache Hadoop

Crawling with Nutch on a single machine (local mode) is not complex, but when the data source to be crawled is large, one machine can hardly meet the performance requirements. The common practice is therefore to integrate Nutch into a Hadoop environment to achieve distributed crawling and distributed querying (deploy mode).

By function, the Hadoop framework consists of three sub-frameworks:

MapReduce: for distributed parallel computing

HDFS: for distributed storage

Common: the utility classes required by HDFS and MapReduce

2.1. MapReduce Workflow

1. Split the input (input files) into splits, each typically between 16 MB and 64 MB (configurable via parameters), then start copies of the program on the cluster machines.

2. The MapReduce program is deployed in a master/slaves fashion: one machine in the cluster runs the master, whose responsibilities include assigning tasks to the slaves and monitoring task execution.

3. The slaves act as workers. When a worker receives a map task, it reads the corresponding input split, parses out key/value pairs, and passes them to the user-defined map function; the map function's output is also key/value pairs, which are temporarily buffered in memory.

4. Periodically, the buffered key/value pairs are written to local disk, and their storage locations are passed back to the master so that it can record them for the reduce phase.

5. When a worker is notified to perform a reduce task, the master sends it the locations of the corresponding map output, which the worker fetches via remote calls. Once the data is obtained, the reduce worker groups records with the same key together, producing a sorted order.

6. The reduce worker passes the sorted data to the user-defined reduce function, and the function's output is persisted to an output file.

7. When all map and reduce tasks have finished, the master wakes up the user program again, and the MapReduce call returns. A streaming sketch of this flow follows the list.
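As a rough sketch of this flow (not part of the original article's steps), the Hadoop Streaming contrib module lets plain shell commands stand in for the map and reduce functions. The jar path below assumes a Hadoop 1.x tarball layout and may differ in other versions; the input directory is assumed to already exist on HDFS, and streaming-out is a hypothetical output name:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-*.jar \
    -input input \
    -output streaming-out \
    -mapper /bin/cat \
    -reducer /bin/wc

Here /bin/cat simply echoes each input record (step 3), the framework buffers, partitions, and sorts the map output by key (steps 4 and 5), and /bin/wc acts as the reduce step, counting the lines, words, and bytes of the records it receives (step 6); the result lands in the streaming-out directory on HDFS (step 7).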

2.2. HDFS Component Structure

Similar to the MapReduce deployment structure, HDFS also follows a master/slaves structure.

1. The NameNode plays the master role: it manages the file system namespace and regulates client access to files (the files themselves are stored on the DataNodes).

Note: the namespace maps the directory structure of the file system.

2. The DataNodes play the slave role; typically each machine runs a single DataNode, which stores the data needed by MapReduce programs.

The NameNode regularly receives heartbeat and block report feedback from the DataNodes.

Heartbeats confirm that a DataNode is still functioning normally;

A block report contains the list of blocks stored on that DataNode. A quick way to inspect this state is sketched below.
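Once the cluster from section 3 is running, the NameNode's view of its DataNodes (capacity, remaining space, last contact) can be inspected with the standard Hadoop 1.x admin command; this is an optional check rather than one of the article's steps:

$ bin/hadoop dfsadmin -report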

2.3. Hadoop Resources

1. http://wiki.apache.org/nutch/NutchHadoopTutorial: distributed crawling and distributed querying based on Nutch and Hadoop

3. Environment Construction

3.1. Preparation

3.1.1 Two or more Linux machines (two are assumed here)

Set one machine's host name to master and the other's to slave01. Both machines have the same login user name, nutch, and the /etc/hosts files on both machines contain the same entries, for example:

192.168.7.11 master

192.168.7.12 slave01

......

This way each machine can be reached by its host name; a quick check follows.
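As a simple sanity check (not one of the original steps), the host name mapping can be verified from the master machine:

$ ping -c 1 slave01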

3.1.2 Build SSH Environment

The installation of SSH can be done by the following command:

$ sudo apt-get install ssh

$ sudo apt-get install rsync

3.1.3 Installing the JDK

$ sudo apt-get install openjdk-6-jdk openjdk-6-jre

3.1.4 Download the latest versions of Hadoop and Nutch

Download links:

hadoop: http://www.apache.org/dyn/closer.cgi/hadoop/common/

nutch: http://www.apache.org/dyn/closer.cgi/nutch/

3.2. Configuration

3.2.1 SSH Login Configuration

(1) Generate the key file authorized_keys on the master machine with the following commands:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

(2) Copy the key file to the same user's home directory on the other machine:

$ scp /home/nutch/.ssh/authorized_keys nutch@slave01:/home/nutch/.ssh/authorized_keys

After these two steps, the master machine can SSH into the slave01 machine without a password; see the quick check below.
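To confirm that passwordless login works (an optional check), run the following from the master machine; it should print the slave's host name without prompting for a password:

$ ssh slave01 hostname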

3.2.2 Hadoop Configuration

Similar to the SSH login configuration, the Hadoop configuration is done on the master machine and then copied to the slave machines, so that every machine has an identical Hadoop environment.

In the $HADOOP_HOME/conf directory:

(1) hadoop-env.sh file

export HADOOP_HOME=/path/to/hadoop_home
export JAVA_HOME=/path/to/jdk_home
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs

(2) core-site.xml file

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>

(3) hdfs-site.xml file

<configuration>
<property>
<name>dfs.name.dir</name>
<value>/nutch/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
  <value>/nutch/filesystem/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

(4) mapred-site.xml file

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>2</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/nutch/filesystem/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/nutch/filesystem/mapreduce/local</value>
</property>
</configuration>
(5) masters and slaves configuration

Add the corresponding machine host names (or IPs) to each file; a minimal example follows. Note that in Hadoop 1.x the masters file actually lists the host that runs the secondary NameNode, while slaves lists the DataNode/TaskTracker hosts.
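A minimal sketch, assuming the host names configured in the hosts file above:

$ cat $HADOOP_HOME/conf/masters
master

$ cat $HADOOP_HOME/conf/slaves
slave01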
3.2.3 Nutch Configuration

In the $NUTCH_HOME/conf directory:

(1) nutch-site.xml file
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch spider</value>
</property>
</configuration>
(2) regex-urlfilter.txt file

Add the URL patterns that should be crawled, for example:

+^http://([a-z0-9]*\.)*nutch.apache.org/

(3) Make sure the modified configuration files are packed into $NUTCH_HOME/runtime/deploy/nutch-*.job, the job file that gets submitted to Hadoop; see the note below.
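One common way to do this (assuming Nutch was built from source with Ant, as in the standard 1.x source distribution) is simply to rebuild after editing the files under conf; the regenerated job file then carries the updated configuration:

$ cd $NUTCH_HOME
$ ant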

3.3. Start running

3.3.1 Start Hadoop

1. Format the NameNode

$ bin/hadoop namenode -format

2. Start the Hadoop daemons

$ bin/start-all.sh

After successful startup, the NameNode and MapReduce (JobTracker) status can be viewed at the following URLs:

NameNode: http://master:50070/

MapReduce: http://master:50030/

3. Put the test data into HDFS

$ bin/hadoop fs -put conf input
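To verify the upload (an optional check), list the directory on HDFS:

$ bin/hadoop fs -ls input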

4. Run the test

$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
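Once the job finishes, its results can be read directly from HDFS (part-* is the framework's default output file naming):

$ bin/hadoop fs -cat output/part-*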

5. Stop the Hadoop daemons

$ bin/stop-all.sh

3.3.2 Running Nutch

1. Start-up prerequisites:

(1) Hadoop has started successfully.

(2) Add the $HADOOP_HOME/bin path to the PATH environment variable so that Nutch can find the hadoop command.

This can be done by modifying the /etc/environment configuration file.

(3) Execute the export JAVA_HOME=/path/to/java command in the console.

2. Store the URLs to be crawled in HDFS

$ bin/hadoop fs -put urldir urldir

Note: the first urldir is a local folder containing the URL seed file(s), one URL per line;

the second urldir is the destination path on HDFS. A sketch of preparing the local folder follows.
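For example, the local urldir might be prepared like this before running the -put above (seed.txt is just an illustrative file name; any file name inside the folder will do):

$ mkdir urldir
$ echo "http://nutch.apache.org/" > urldir/seed.txt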
3. Start the Nutch crawl

Execute the following command under the $NUTCH_HOME/runtime/deploy directory:

$ bin/nutch crawl urldir -dir crawl -depth 3 -topN 10

After the command completes successfully, a crawl directory is generated in HDFS; see the listing check below.

Note: be sure to execute this command under the deploy directory; running it under the local directory performs a single-machine crawl that does not use the Hadoop environment.
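As an optional check, listing the crawl directory on HDFS should show the crawldb, linkdb, and segments subdirectories described in section 1.2:

$ bin/hadoop fs -ls crawl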
