How to read the main process code of Nutch


Original link http://www.iteye.com/topic/570440

Main analysis:
1. org.apache.nutch.crawl.Injector:
1. Read the seed URL list (url.txt).
2. Normalize the URLs.
3. Filter the URLs with the configured URL filters (regex-urlfilter.txt).
4. In the map phase, build <url, CrawlDatum> pairs from the normalized URLs; during construction an initial score is given to each CrawlDatum. The score can affect the ranking of the URL's host in search results and its fetch priority.
5. The reduce phase does only one thing: check whether the URL already exists in the CrawlDb. If it does, keep the original CrawlDb entry; if it is new, store it with status STATUS_DB_UNFETCHED (see the sketch after this list).
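As a rough illustration of steps 4 and 5, here is a minimal sketch in plain Java. It is not Nutch's actual Injector code: the SeedDatum class and the normalization/filter rules are simplified assumptions standing in for CrawlDatum and the real plug-ins.

    import java.util.HashMap;
    import java.util.Map;

    // Simplified stand-in for Nutch's CrawlDatum: just a status and an initial score.
    class SeedDatum {
        static final int STATUS_DB_UNFETCHED = 1;
        final int status;
        final float score;
        SeedDatum(int status, float score) { this.status = status; this.score = score; }
    }

    public class InjectSketch {
        // "Map" side: normalize and filter the seed URL, then give it an initial score.
        static Map.Entry<String, SeedDatum> map(String rawUrl, float initialScore) {
            String url = rawUrl.trim().toLowerCase();   // stand-in for URL normalization
            if (!url.startsWith("http")) return null;   // stand-in for regex-urlfilter.txt
            return Map.entry(url, new SeedDatum(SeedDatum.STATUS_DB_UNFETCHED, initialScore));
        }

        // "Reduce" side: if the URL already exists in the CrawlDb, keep the old entry;
        // otherwise the new record (status STATUS_DB_UNFETCHED) is stored.
        static SeedDatum reduce(SeedDatum existing, SeedDatum injected) {
            return existing != null ? existing : injected;
        }

        public static void main(String[] args) {
            Map<String, SeedDatum> crawlDb = new HashMap<>();
            for (String seed : new String[]{"http://biaowen.iteye.com/", "ftp://filtered.example/"}) {
                Map.Entry<String, SeedDatum> e = map(seed, 1.0f);
                if (e != null) crawlDb.merge(e.getKey(), e.getValue(), InjectSketch::reduce);
            }
            crawlDb.forEach((u, d) -> System.out.println(u + " -> status=" + d.status + ", score=" + d.score));
        }
    }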

2. org.apache.nutch.crawl.Generator:
1. Filter out invalid URLs (using the URL filter plug-ins).
2. Check whether the URL is due for fetching (its fetch interval has expired).
3. Read the URL's metadata, which records the time of the last fetch.
4. Score the URL.
5. Put the URL into the corresponding task group (grouped by host).
6. Compute the URL's hash value.
7. Collect URLs until topN is reached (a sketch of the grouping and topN selection follows).
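Below is a minimal sketch of the per-host grouping and topN selection from steps 5-7. It is not the actual Generator code and the data is made up; the real job also applies the fetch-interval check, URL filters, and hash-based partitioning mentioned above.

    import java.net.URI;
    import java.util.*;

    public class GenerateSketch {
        // Simplified selection: group candidate URLs by host, sort each group by score,
        // and stop once topN URLs are collected.
        static List<String> select(Map<String, Float> scoredUrls, int topN) {
            Map<String, List<String>> byHost = new HashMap<>();
            for (String u : scoredUrls.keySet()) {
                byHost.computeIfAbsent(URI.create(u).getHost(), h -> new ArrayList<>()).add(u);
            }
            List<String> selected = new ArrayList<>();
            for (List<String> group : byHost.values()) {
                group.sort((a, b) -> Float.compare(scoredUrls.get(b), scoredUrls.get(a))); // best score first
                for (String u : group) {
                    if (selected.size() >= topN) return selected;
                    selected.add(u);
                }
            }
            return selected;
        }

        public static void main(String[] args) {
            Map<String, Float> urls = Map.of(
                    "http://biaowen.iteye.com/a", 0.9f,
                    "http://biaowen.iteye.com/b", 0.2f,
                    "http://www.iteye.com/topic/570440", 0.7f);
            System.out.println(select(urls, 2));
        }
    }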

3. org.apache.nutch.fetcher.Fetcher:
1. Read <url, CrawlDatum> pairs from the segment and put them into the corresponding queues. Queues are keyed by a queue ID built from protocol://host (or IP); if a queue does not exist yet, it is created. For example, all JavaEye addresses fall into the queue http://221.130.184.141 (see the queue sketch after this list). --> queues.addFetchItem(url, datum);
2. Check the robots rules to see whether the URL may be crawled (robots.txt). --> protocol.getRobotRules(fit.url, fit.datum);
3. Honor the crawl delay declared by the site. --> if (rules.getCrawlDelay() > 0)
4. Different protocol handlers are used for different protocols (HTTP, FTP, file, ...); this is where the content is actually fetched. --> protocol.getProtocolOutput(fit.url, fit.datum);
5. Once the content is retrieved, check the protocol/HTTP status (such as 200 or 404). --> case ProtocolStatus.SUCCESS:
6. If the fetch succeeded, the ProtocolStatus.SUCCESS branch is entered and the output is constructed there. --> output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS);
7. While building the output, the matching content parser plug-in (ParseUtil) is looked up, e.g. for MP3, HTML, PDF, Word, ZIP, JSP, SWF, ... --> this.parseUtil.parse(content); --> parsers[i].getParse(content);
8. Here we focus on HTML parsing, so we look at how HtmlParser extracts text, title, outlinks, and metadata:
text: all HTML tags stripped out; title: the page title; outlinks: all links found on the page; metadata: first check <meta name="robots"> in the page header to see whether spiders are allowed to crawl and index the page, then record attributes such as <meta http-equiv="refresh"> to see whether the page redirects elsewhere.
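To make the queueing in step 1 concrete, here is a small sketch. It is my own simplification, not Nutch's actual queueing code: the queue ID is derived from the protocol and host of the URL, and items of the same site end up in the same queue.

    import java.net.URI;
    import java.util.*;

    public class FetchQueueSketch {
        // Simplified queue ID: Nutch builds it from the protocol plus the host or IP,
        // so all URLs of one site share a single politeness queue.
        static String queueId(String url) {
            URI u = URI.create(url);
            return u.getScheme() + "://" + u.getHost();
        }

        public static void main(String[] args) {
            Map<String, Deque<String>> queues = new HashMap<>();
            for (String url : new String[]{
                    "http://biaowen.iteye.com/",
                    "http://biaowen.iteye.com/blog/1",
                    "http://www.iteye.com/topic/570440"}) {
                // Create the queue on first use, then append the fetch item.
                queues.computeIfAbsent(queueId(url), k -> new ArrayDeque<>()).add(url);
            }
            queues.forEach((id, q) -> System.out.println(id + " -> " + q));
        }
    }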

4. org.apache.nutch.parse.ParseSegment:
1. The logic of this class is much simpler, but it is still very valuable to us. It does only one thing: parse the fetched content (the raw HTML); the actual parsing is delegated to the parser plug-ins. This is also a natural place to hook in our own data analysis and statistics.
2. After execution it outputs three kinds of pairs: <url, ParseText> with the parsed text, <url, ParseData> with the parse metadata including all links, and <url, CrawlDatum> entries for the outlinks (a toy example follows).
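As a toy illustration of those three outputs, the sketch below runs a trivial regex "parser" over a hard-coded page. This is only to show the shape of the data; in Nutch the real parsing is done by the parser plug-ins, not by regular expressions.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ParseSketch {
        public static void main(String[] args) {
            // Toy page; in Nutch the content comes from the segment and is parsed by plug-ins.
            String url = "http://biaowen.iteye.com/";
            String html = "<html><head><title>biaowen - JavaEye</title></head>"
                        + "<body>welcome <a href=\"http://www.iteye.com/\">iteye</a></body></html>";

            // <url, ParseText>: the page text with all HTML tags stripped.
            String text = html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();

            // <url, ParseData>: the title plus the outlinks found on the page.
            Matcher title = Pattern.compile("<title>(.*?)</title>").matcher(html);
            List<String> outlinks = new ArrayList<>();
            Matcher a = Pattern.compile("href=\"(.*?)\"").matcher(html);
            while (a.find()) outlinks.add(a.group(1));

            System.out.println("<" + url + ", ParseText> " + text);
            System.out.println("<" + url + ", ParseData> title=" + (title.find() ? title.group(1) : "") + " outlinks=" + outlinks);

            // <url, CrawlDatum>: each outlink becomes a candidate entry for the crawldb update.
            outlinks.forEach(o -> System.out.println("<" + o + ", CrawlDatum> status=db_unfetched"));
        }
    }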

5. org.apache.nutch.crawl.CrawlDb:
Updates the CrawlDb from the crawl_fetch output.
1. The map phase normalizes (normalizer) and filters (filter) the URLs from crawl_fetch and the existing CrawlDb;
2. The reduce phase merges crawl_fetch and the CrawlDb and updates the entries (see the sketch below).
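A minimal sketch of that merge-and-update idea, heavily simplified: the real CrawlDb reduce handles many more fields, status transitions, and score updates than this Entry record assumes.

    import java.util.HashMap;
    import java.util.Map;

    public class CrawlDbUpdateSketch {
        // Simplified record: the real CrawlDatum carries status, score, fetch interval, metadata, ...
        record Entry(String status, long fetchTime) {}

        // "Reduce": for one URL, keep the fresher record between the old CrawlDb entry
        // and the new crawl_fetch result.
        static Entry merge(Entry old, Entry fetched) {
            return fetched.fetchTime() >= old.fetchTime() ? fetched : old;
        }

        public static void main(String[] args) {
            Map<String, Entry> crawlDb = new HashMap<>(Map.of(
                    "http://biaowen.iteye.com/", new Entry("db_unfetched", 0L)));
            Map<String, Entry> crawlFetch = Map.of(
                    "http://biaowen.iteye.com/", new Entry("db_fetched", 20090725003318265L),
                    "http://www.iteye.com/topic/570440", new Entry("db_fetched", 20090725003320000L));

            // Merge this round's fetch output into the CrawlDb; new URLs are simply added.
            crawlFetch.forEach((url, e) -> crawlDb.merge(url, e, CrawlDbUpdateSketch::merge));
            crawlDb.forEach((u, e) -> System.out.println(u + " -> " + e));
        }
    }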

6. org.apache.nutch.crawl.LinkDb:
This class maintains the inverted link map, listing the incoming links (inlinks) of each URL.
1. First take the outlinks of each URL, and in the map phase emit the URL as an incoming link of each of its outlinks;
2. In the reduce phase, collect all incoming links of a URL under that key into its Inlinks;
3. In this way the external links of each URL are accumulated. Note that only external links are counted, i.e. links coming from a different host; remember that iteye.com and biaowen.iteye.com count as two different hosts. --> boolean ignoreInternalLinks = true;
4. Finally, merge the newly added links into the existing LinkDb (a sketch of the inversion follows).
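A sketch of the link inversion, including the ignoreInternalLinks check on the host. The input data is made up, and the host comparison is a simplified assumption about what "same host" means here.

    import java.net.URI;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    public class InvertLinksSketch {
        static final boolean IGNORE_INTERNAL_LINKS = true;

        public static void main(String[] args) {
            // Sample parse output: page -> its outlinks (not real crawl data).
            Map<String, List<String>> outlinks = Map.of(
                    "http://www.iteye.com/", List.of("http://biaowen.iteye.com/", "http://www.iteye.com/topic/570440"),
                    "http://biaowen.iteye.com/", List.of("http://www.iteye.com/"));

            // "Map" emits (target, source); "reduce" collects all sources of one target as its inlinks.
            Map<String, Set<String>> inlinks = new HashMap<>();
            outlinks.forEach((from, targets) -> {
                String fromHost = URI.create(from).getHost();
                for (String to : targets) {
                    // Skip links that stay on the same host; www.iteye.com and biaowen.iteye.com
                    // count as different hosts, so the cross-subdomain links are kept.
                    if (IGNORE_INTERNAL_LINKS && fromHost.equals(URI.create(to).getHost())) continue;
                    inlinks.computeIfAbsent(to, k -> new TreeSet<>()).add(from);
                }
            });
            inlinks.forEach((url, in) -> System.out.println(url + " <- " + in));
        }
    }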

7. org.apache.nutch.indexer.Indexer:
This class has a job of a different kind: a distributed index build based on Hadoop and Lucene. It indexes the data fetched in the previous steps so that users can search it.
It takes more inputs than the other jobs: crawl_fetch, parse_data and parse_text under segments, the current directory of crawldb, and the current directory of linkdb.
1. In this class, the map phase loads all the inputs into one container;
2. the reduce phase then groups them by URL,
3. applies the indexing filters --> this.filters.filter(doc, parse, key, fetchDatum, inlinks);
4. and the scoring filters --> this.scfilters.indexerScore(key, doc, dbDatum, fetchDatum, parse, inlinks, boost);
5. Of course, all of this data must be assembled into a Lucene document before it can be indexed.
6. The assembled <url, doc> pairs are collected in reduce, and the actual indexing is done by the job's OutputFormat.
The doc contains the following fields:
content (page body)
site (site address)
title (page title)
host (host)
segment (which segment the page belongs to)
digest (MD5 hash of the content, used for deduplication)
tstamp (timestamp)
url (the current URL)
An example:
doc =
{content=[biaowen - JavaEye technical website homepage news forum blog recruitment more FAQ .................. (content omitted) ............ biaowen ICP registration No. 05023328],
site=[biaowen.iteye.com],
title=[biaowen - JavaEye technical website],
host=[biaowen.iteye.com],
segment=[20090725083125],
digest=[063ba8430fa84e614ce71276e176f4ce],
tstamp=[20090725003318265],
url=[http://biaowen.iteye.com/]}
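As a rough sketch of assembling such a document, the snippet below fills the same fields into a plain map and computes an MD5 digest of the content. The field values are taken from the example above; the plain map and the digest computation are my own simplifications rather than Nutch's exact code, which builds a real Lucene document.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class IndexDocSketch {
        // Hex-encoded MD5 of the page content, later used for deduplication.
        static String md5(String content) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5").digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b & 0xff));
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            // A plain map stands in for the Lucene document that Nutch actually builds.
            String content = "biaowen - JavaEye technical website homepage ...";
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("content", content);
            doc.put("site", "biaowen.iteye.com");
            doc.put("title", "biaowen - JavaEye technical website");
            doc.put("host", "biaowen.iteye.com");
            doc.put("segment", "20090725083125");
            doc.put("digest", md5(content));
            doc.put("tstamp", "20090725003318265");
            doc.put("url", "http://biaowen.iteye.com/");
            System.out.println("doc = " + doc);
        }
    }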

8. org.apache.nutch.indexer.DeleteDuplicates:
This class does exactly what its name says: it removes duplicates.
After the previous indexing step there will be duplicates (not from a single run, of course), so we have to deduplicate. Why? Within one index run there are no duplicates, but after several crawls duplicates do appear, and that is why deduplication is needed. There are two deduplication rules: by time and by content MD5 (see the sketch below).
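A toy sketch of deduplication by content digest, keeping the most recently fetched copy per digest. This is my own simplification of the two rules mentioned above, and the sample documents are made up.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class DedupSketch {
        // Made-up sample documents: the same content digest fetched in two different crawls.
        record IndexedDoc(String url, String digest, long tstamp) {}

        public static void main(String[] args) {
            List<IndexedDoc> docs = List.of(
                    new IndexedDoc("http://biaowen.iteye.com/", "d1", 20090725003318265L),
                    new IndexedDoc("http://biaowen.iteye.com/", "d1", 20090801000000000L));

            // Group by content digest and keep only the newest copy per digest;
            // everything else would be marked for deletion from the index.
            Map<String, IndexedDoc> keep = new HashMap<>();
            for (IndexedDoc d : docs) {
                keep.merge(d.digest(), d, (a, b) -> a.tstamp() >= b.tstamp() ? a : b);
            }
            keep.values().forEach(d -> System.out.println("keep " + d.url() + " @ " + d.tstamp()));
        }
    }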

9. org.apache.nutch.indexer.IndexMerger:
This class is relatively simple: its goal is to merge multiple indexes into one, which it does by calling Lucene directly.
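For reference, merging several indexes through Lucene looks roughly like the sketch below. It is written against a current Lucene API rather than the Lucene version Nutch used at the time, and the directory paths are placeholders.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class MergeIndexesSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder paths; adjust to the actual crawl/indexes layout.
            // Open (or create) the target index and pull in the part indexes,
            // which is essentially what an index merge step boils down to.
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("crawl/index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                writer.addIndexes(
                        FSDirectory.open(Paths.get("crawl/indexes/part-00000")),
                        FSDirectory.open(Paths.get("crawl/indexes/part-00001")));
                writer.forceMerge(1);   // optional: collapse into a single segment
            }
        }
    }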

References:

Directory structure (see "Lucene + Nutch Search Engine Development"):
1. crawldb: the URLs to download and their download dates, used for page re-crawling
2. segments: stores the fetched pages and the analysis results
1. crawl_generate: the URLs to be fetched
2. crawl_fetch: the fetch status of each URL
3. content: the content of each downloaded page
4. parse_text: the parsed text of each URL
5. parse_data: the outlinks and metadata parsed from each URL
6. crawl_parse: the outlink URLs, used to update the crawldb
3. linkdb: stores the link relationships (incoming links) between URLs
4. indexes: stores the independent index produced by each indexing run
5. index: the Lucene-format index directory, the complete index obtained by merging all the indexes under indexes
