Nutch and Big Data


I. Introduction to Nutch

Nutch is the well-known web crawler project initiated by Doug Cutting, and it incubated today's big data processing framework, Hadoop. Prior to Nutch v0.8.0, Hadoop was part of Nutch; starting with v0.8.0, HDFS and MapReduce were stripped out of Nutch and became Hadoop. After v0.8.0, Nutch was built entirely on top of Hadoop.

Nutch is an open-source web crawler that fetches web pages for search engines and automatically maintains URL information, such as page deduplication, periodic page updates, and redirect handling. Nutch crawls and parses with MapReduce in a distributed fashion and scales out well horizontally.
Current versions of Nutch no longer include a search capability of their own (since v1.2, Nutch focuses on crawling data), but they can automatically submit crawled pages to a search server such as Solr; whether Nutch submits pages to the index server is controlled through the commands that ship with Nutch.
Nutch is an excellent distributed crawler framework, but its entire design serves search engines. Developing on Hadoop's MapReduce framework is not well suited to fine-grained data extraction. If your business is data extraction (fine-grained extraction) rather than a search engine, Nutch is not necessarily the right choice.
Nutch is now split into two branches: the 1.x series and the 2.x series. The main difference is that 2.x introduces Gora as a storage abstraction layer, which supports a variety of NoSQL databases such as HBase and Cassandra.

II. Nutch Installation

1. Nutch runtime environment
JDK 1.7 or above
A Linux operating system

2. Unzip the Nutch release

Set the environment variables

Verification: run the nutch command with no arguments; it should print the available subcommands
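
A minimal sketch of these steps, assuming a 1.x binary release named apache-nutch-1.x-bin.tar.gz installed under /usr/local/nutch (adjust the file name and paths to your environment):

tar -zxvf apache-nutch-1.x-bin.tar.gz
mv apache-nutch-1.x /usr/local/nutch

# append to ~/.bashrc (or /etc/profile), then source it
export NUTCH_HOME=/usr/local/nutch
export PATH=$PATH:$NUTCH_HOME/bin

# verification: running nutch with no arguments should print the list of subcommands
nutch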

3. Directory structure

bin: contains two scripts, nutch and crawl.
crawl is a one-stop wrapper that chains together the individual commands provided by nutch.

conf: stores Nutch's basic configuration, e.g. nutch-default.xml, nutch-site.xml, parse-plugins.xml, regex-urlfilter.txt

docs: API documentation
lib: the jar packages Nutch depends on
plugins: the plugin jar packages used by Nutch

III. Nutch Crawler

Nutch Crawl Prep Work

1: Add the http.agent.name property to nutch-site.xml. If it is not configured, startup will fail with an error.
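
For example, a minimal property block inside the <configuration> element of nutch-site.xml (the agent name value is just a placeholder; use a name that identifies your crawler):

<property>
  <name>http.agent.name</name>
  <value>MyNutchSpider</value>
</property>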

2: Create a seed address directory, e.g. urls (it can be placed in the Nutch directory), create one or more seed files under it, and put the seed addresses in the seed files, one address per line, e.g. http://www.zhaozhiyong.cn

Note: each seed address must start with a protocol prefix such as http://
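
For example, one possible way to create the seed directory and a seed file, reusing the sample address above (the file name seed.txt is arbitrary):

cd /usr/local/nutch
mkdir urls
echo "http://www.zhaozhiyong.cn" > urls/seed.txt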

3 (optional): Restrict the range of URLs to crawl so that only pages within the site are fetched, by putting the following rule at the end of regex-urlfilter.txt (in place of the default catch-all "+." rule, which would otherwise accept every URL first): +^http://([a-z0-9]*\.)*bbs.superwu.cn/[\s\S]*

4: Crawl data with Nutch: bin/crawl urls crawl 1 (a complete example follows below)
urls: the seed directory
crawl: the directory where the crawled data is stored
1: the crawl depth (number of rounds)
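
Put together, a run might look like this (assuming the urls seed directory created above and the Nutch installation directory):

cd /usr/local/nutch
bin/crawl urls crawl 1

# when the run finishes, the crawl directory contains three subdirectories:
# crawl/crawldb  crawl/linkdb  crawl/segments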


5. Directories generated by a Nutch crawl
When crawling, several directories are created under the specified crawl directory:
crawldb: stores the URLs to be crawled
View the data: bin/nutch readdb crawl/crawldb -stats -sort
linkdb: stores the inlink (reverse link) information
View the data: bin/nutch readlinkdb crawl/linkdb -dump links
segments: stores all the crawled page data
View the data: bin/nutch readseg -dump crawl/segments/20150906090648 sgedb

A segment includes the following subdirectories:
crawl_generate: the list of URLs to be fetched
crawl_fetch: the fetch status of each page
content: the raw content of each fetched page
parse_text: the parsed text of each page
parse_data: the outlinks and metadata parsed from each page
crawl_parse: the outlink URLs used to update the crawldb
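
For example, to inspect a single segment (segment directories are named by timestamp, so the name below is only an illustration taken from the command above):

ls crawl/segments/
# e.g. 20150906090648

bin/nutch readseg -dump crawl/segments/20150906090648 sgedb
# the dumped text ends up under the sgedb output directory
ls sgedb/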


6. Viewing the data generated by Nutch
The data Nutch generates is stored in SequenceFile format. Besides the nutch commands above,
these files can also be read with Java code.
See <View Nutch generated intermediate files> for reference.
Note: when reading a file with the code provided there, first use the more command to check the type of the values stored in the corresponding file,
and change the value type on line 27 of that code accordingly.

IV. Indexing Crawled Data in Solr

Before using Solr for indexing, complete the following steps:

1): Copy the schema-solr4.xml that ships with Nutch into Solr

Command: cp /usr/local/nutch/conf/schema-solr4.xml /usr/local/solr-4.10.4/example/solr/collection1/conf

2): Delete Solr's default schema.xml and rename the file just copied over:
cd /usr/local/solr-4.10.4/example/solr/collection1/conf
rm schema.xml
mv schema-solr4.xml schema.xml

3): Add a field configuration in schema.xml:
<field name="location" type="string" stored="true" indexed="true"/>

Alternatively, the Solr server can be specified dynamically on the command line,
for example: bin/crawl -i -D "solr.server.url=http://192.168.1.170:8983/solr" urls crawl 1

1. Start Solr

cd /usr/local/solr-4.10.4/example
java -jar start.jar

2. Crawl and index the data

Command: bin/crawl -i urls crawl 1

-i: index the crawled data; by default it is indexed into the local Solr instance.
If Solr runs on a different server, you need to modify the value of solr.server.url (defined in nutch-default.xml).
It is recommended to override it in nutch-site.xml instead.
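
For example, an override in nutch-site.xml might look like this (the host and port are placeholders; the IP reuses the example above):

<property>
  <name>solr.server.url</name>
  <value>http://192.168.1.170:8983/solr</value>
</property>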

3. Inspection

Open http://127.0.0.1:8983/solr/ and query the content field to see the indexed pages.

Note: if you increase the crawl depth to 2, the amount of data fetched at first may still be very small, because URLs are filtered by regex-urlfilter.txt:
URLs containing characters such as "?" and "=" are ignored.
That rule can be commented out directly, or changed to a different rule.
For detailed modifications, refer to: <regex-urlfilter explanation.txt>
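
For reference, the rule in question in the stock regex-urlfilter.txt typically looks like the following; commenting it out lets URLs containing query characters pass the filter:

# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]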
