Original post
http://www.diybl.com/course/3_program/java/javajs/20071018/77925.html
----------------------------------
1. Create a new webdb (admin db -create);
2. Inject the starting URLs to be fetched into the webdb (inject);
3. Generate a fetchlist from the webdb and write it into the corresponding segment (generate);
4. Fetch the web pages listed in the fetchlist (fetch);
5. Update the webdb with the pages that were fetched (updatedb).
By looping over steps 3-5, Nutch achieves deep crawling.
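As a rough illustration of this loop, the sketch below strings the five steps together: inject once, then repeat generate/fetch/updatedb to the desired depth. The WebDb and Fetcher interfaces here are hypothetical stand-ins for the description above, not the actual Nutch classes.

// A conceptual sketch of the crawl loop described above (hypothetical types,
// not the actual Nutch classes): inject once, then repeat steps 3-5.
import java.util.List;

public class CrawlLoopSketch {

    record FetchedPage(String url, String content) {}

    interface WebDb {
        void inject(List<String> seedUrls);               // step 2: write starting URLs
        List<String> generateFetchList();                  // step 3: build a fetchlist
        void update(List<FetchedPage> pages);              // step 5: fold results back in
    }

    interface Fetcher {
        List<FetchedPage> fetch(List<String> fetchList);   // step 4: fetch the pages
    }

    static void crawl(WebDb webdb, Fetcher fetcher, List<String> seeds, int depth) {
        webdb.inject(seeds);
        for (int round = 0; round < depth; round++) {      // looping 3-5 gives deep crawling
            List<String> fetchList = webdb.generateFetchList();
            if (fetchList.isEmpty()) break;                // nothing left to fetch
            webdb.update(fetcher.fetch(fetchList));        // new links feed the next round
        }
    }
}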
After the web crawler runs, the following five entries are generated in the webdb folder:
linksByMD5  linksByURL  pagesByMD5  pagesByURL  stats
The stats file stores the version information from the crawl, the number of pages processed, and the number of links;
Each of the other four folders, such as pagesByURL, contains two files: index and data. The data file stores key/value pairs ordered by key, all keys of the same type being compared with the same comparator; it also contains some additional information, for example in pagesByURL a sync marker (SYN) is written after every fixed amount of key/value data. The index file stores the index, and it is ordered as well: it holds keys together with their positions in the data file, but not every key that appears in the data file can be found in it. To save space, an index entry is created only for every Nth key/value pair. Because everything is ordered, a lookup performs a binary search on the index; if the key is not found there, the search ends at the position of the closest preceding indexed key, which is near the target, and the lookup then scans forward from that position in the data file to find the key.
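A minimal sketch of that lookup idea, assuming the sparse index (one entry per Nth key) has been loaded into two parallel in-memory arrays; this illustrates the binary-search-then-scan-forward scheme, not Nutch's actual file format code.

import java.util.Arrays;

// A toy model of the index/data lookup: the sparse index (one entry for every
// Nth key in the data file) is held in two parallel arrays.
public class SparseIndexLookup {

    private final String[] indexedKeys;   // ordered keys that made it into the index
    private final long[] dataPositions;   // where each indexed key sits in the data file

    SparseIndexLookup(String[] indexedKeys, long[] dataPositions) {
        this.indexedKeys = indexedKeys;
        this.dataPositions = dataPositions;
    }

    // Binary search on the index; if the key is not indexed, return the position of
    // the closest preceding indexed key, from which the caller scans forward in the
    // data file until it reaches (or passes) the key it is looking for.
    long seekPosition(String key) {
        int i = Arrays.binarySearch(indexedKeys, key);
        if (i >= 0) return dataPositions[i];
        int insertion = -i - 1;
        return insertion == 0 ? 0L : dataPositions[insertion - 1];
    }
}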
Maintaining the webdb (adding or deleting pages and links) is not done by writing to the webdb directly. Instead, the WebDBWriter class, through its internal classes PageInstructionWriter and LinkInstructionWriter, appends an instruction for each page or link operation; the accumulated instructions are then sorted and de-duplicated, and finally merged with the page and link data already stored in the webdb.
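A simplified sketch of that write-by-instruction model, using a made-up Instruction type rather than the real WebDBWriter classes: operations are collected as instructions, de-duplicated per URL, and merged against the existing sorted page records.

import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// A simplified model of the write-by-instruction scheme: operations are recorded
// as instructions, de-duplicated per URL, then merged with the existing records.
public class InstructionMergeSketch {

    enum Op { ADD_PAGE, DELETE_PAGE }

    record Instruction(String url, Op op) {}

    static SortedMap<String, String> merge(SortedMap<String, String> existingPages,
                                           List<Instruction> instructions) {
        // Sorting and de-duplicating: the last instruction for a URL wins.
        Map<String, Instruction> deduped = new TreeMap<>();
        for (Instruction ins : instructions) deduped.put(ins.url(), ins);

        SortedMap<String, String> merged = new TreeMap<>(existingPages);
        for (Instruction ins : deduped.values()) {
            switch (ins.op()) {
                case ADD_PAGE -> merged.put(ins.url(), "page data for " + ins.url());
                case DELETE_PAGE -> merged.remove(ins.url());
            }
        }
        return merged;
    }
}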
The Fetcher class runs during the actual page fetching; the files and folders produced by the crawl are generated by this class. Nutch provides an option that controls whether fetched pages are parsed; if it is set to false, no parse_data or parse_text folders are produced.
Segments File Analysis
Almost all of the files generated by the Nutch crawler are key/value pairs; they differ only in the types of the keys and values. After the crawler runs, the following folders are generated in each subfolder of the segments folder:
content  fetcher  fetchlist  index  parse_data  parse_text
The content folder corresponds to the Content class in the protocol package; the fetcher folder corresponds to the FetcherOutput class in the fetcher package; fetchlist corresponds to the FetchListEntry class in the pagedb package; parse_data and parse_text correspond to the ParseData and ParseText classes in the parse package.
Tip:
A segment in Lucene is not the same thing as a segment in Nutch. A Lucene segment is a part of an index, whereas a Nutch segment only holds the content and index of one portion of the pages in the webdb, and the index ultimately generated from the segments is independent of the segments themselves.
-----------------------------------------------------------------------------------------------------------
Details of the five steps of Nutch's deep crawl
http://hi.baidu.com/shirdrn/blog/item/16c8d33df49893e83d6d972d.html
-----------------------------------------------------------------------------------------------------------
Nutch workflow: creating the initial URL set
The initial URL set can be created in two ways: through hyperlinks and through webmaster submissions.
Hyperlinks
This refers to the robot program following the hyperlinks that link web pages to other web pages, much like the everyday saying "one tells ten, ten tell a hundred": starting from a handful of pages, the links lead to every other linked page in the database. In theory, if the web pages have appropriate hyperlinks, the robot can traverse the great majority of web pages.
Webmaster submission
In practice, crawlers cannot discover and capture every website. A webmaster can therefore submit a request to the search engine for indexing; after the search engine verifies it, the website is added to the URL set to be crawled.
Nutch workflow: inject operation analysis
The inject operation calls the org.apache.nutch.crawl.Injector class, one of the core classes in Nutch's crawl package. Its result is that the content of the crawl database is updated, including the URLs and their states.
The main work of the inject operation can be described in the following three aspects:
(1) Normalize and filter the URL set, eliminating invalid URLs, setting the URL status to unfetched, and initializing the score by a chosen method;
(2) Merge URLs to eliminate duplicate entries;
(3) Store the URLs together with their status and score in the crawldb database; if a URL already exists in the database, the old entry is removed and replaced by the new one.
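A toy sketch of steps (1)-(3), with a deliberately small normalizer and an in-memory map standing in for the crawldb; Nutch's own URL normalizers, filters, and scoring plugins are more elaborate than this.

import java.net.URI;
import java.util.List;
import java.util.Map;

// A toy inject: normalize and filter the seed URLs, then store them as
// unfetched entries with an initial score in an in-memory "crawldb" map.
public class InjectSketch {

    enum Status { UNFETCHED, FETCHED }

    record CrawlDatum(Status status, float score) {}

    static void inject(Map<String, CrawlDatum> crawlDb, List<String> seeds, float initialScore) {
        for (String raw : seeds) {
            String url = normalize(raw);
            if (url == null) continue;                 // (1) eliminate invalid URLs
            // (2)+(3) duplicates collapse onto the same key; a new entry replaces the old one
            crawlDb.put(url, new CrawlDatum(Status.UNFETCHED, initialScore));
        }
    }

    // A very small normalizer: trim, lower-case scheme and host, drop the fragment.
    static String normalize(String raw) {
        try {
            URI u = URI.create(raw.trim());
            if (u.getScheme() == null || u.getHost() == null) return null;
            return new URI(u.getScheme().toLowerCase(), null, u.getHost().toLowerCase(),
                           u.getPort(), u.getPath(), u.getQuery(), null).toString();
        } catch (Exception e) {
            return null;                               // unparsable URLs are filtered out
        }
    }
}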
Nutch workflow: generate operation analysis
The generate operation calls the org.apache.nutch.crawl.Generator class in the crawl package. Its result is that a fetch list is created and stored under the segments directory, in a subfolder named by the current time; the number of segments subfolders equals the number of crawl cycles.
The main work of the generate operation can be described in the following three aspects:
(1) Select and filter URLs from the crawldb database;
(2) Sort the URLs in descending order by a value combining the domain, the number of links, and a hash;
(3) Write the list to the segments directory.
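A rough sketch of the selection step, using a single per-URL score in place of the domain/link-count/hash combination mentioned in (2), and a timestamp-named segment folder as described above; the types are illustrative, not the Generator's actual code.

import java.text.SimpleDateFormat;
import java.util.Comparator;
import java.util.Date;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A rough generate: pick the highest-scored URLs and name the segment by timestamp.
public class GenerateSketch {

    record Entry(String url, float score) {}

    static List<String> generateFetchList(Map<String, Float> unfetchedUrls, int topN) {
        return unfetchedUrls.entrySet().stream()
                .map(e -> new Entry(e.getKey(), e.getValue()))
                .sorted(Comparator.comparingDouble((Entry e) -> e.score()).reversed()) // descending
                .limit(topN)
                .map(Entry::url)
                .collect(Collectors.toList());
    }

    // One segments subfolder per crawl cycle, named by the current time.
    static String segmentName() {
        return "segments/" + new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
    }
}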
Nutch workflow: fetch operation analysis
The fetch operation calls the org.apache.nutch.fetcher.Fetcher class in the fetcher package. It fetches page content and stores it in the segment directory.
The main work of the fetch operation can be described in the following four aspects:
(1) Perform fetching according to the fetch list under the segments directory;
(2) During fetching, a page's URL may change because of redirected links, so the URL needs to be updated;
(3) Fetch with multiple threads to increase crawling speed;
(4) The parse operation is invoked during the fetch operation.
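A minimal multi-threaded fetch sketch using the JDK's HttpClient rather than Nutch's protocol plugins; the redirect handling stands in for the URL updates mentioned in (2), and the thread pool for the multi-threaded fetching in (3).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// A minimal multi-threaded fetcher: each URL in the fetch list is downloaded by a
// worker thread, and redirects are followed so the final URL can be recorded.
public class FetchSketch {

    record Fetched(String requestedUrl, String finalUrl, String body) {}

    static List<Fetched> fetchAll(List<String> fetchList, int threads) throws InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)          // (2) handle changed URLs
                .build();
        ExecutorService pool = Executors.newFixedThreadPool(threads); // (3) multi-threaded fetch
        List<Callable<Fetched>> tasks = new ArrayList<>();
        for (String url : fetchList) {
            tasks.add(() -> {
                HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
                HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
                // resp.uri() is the final URL after redirects; the record keeps both forms
                return new Fetched(url, resp.uri().toString(), resp.body());
            });
        }
        List<Fetched> results = new ArrayList<>();
        for (Future<Fetched> f : pool.invokeAll(tasks)) {
            try {
                results.add(f.get());
            } catch (ExecutionException e) {
                // a failed fetch is simply skipped in this sketch
            }
        }
        pool.shutdown();
        return results;
    }
}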
Nutch workflow: parse operation analysis
The parse operation calls the org.apache.nutch.parse.ParseSegment class in the parse package. Its result is that the pages obtained by fetch are parsed into text and data, which are stored in the segments directory.
The main work of the parse operation can be described in the following three aspects:
(1) Parse the fetched pages in the segment and split the results into parse_data and parse_text;
(2) parse_data stores each page's title, author, date, outlinks, and similar metadata;
(3) parse_text stores each page's plain text content.
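A toy parse step, using simple regular expressions rather than Nutch's HTML parser plugin, that splits a fetched page into the two kinds of output described above: metadata for parse_data and plain text for parse_text.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A toy parse: extract the title for parse_data and strip tags for parse_text.
public class ParseSketch {

    record ParseData(String title) {}   // parse_data: title, author, date, outlinks, ...
    record ParseText(String text) {}    // parse_text: the plain text of the page

    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");

    static ParseData parseData(String html) {
        Matcher m = TITLE.matcher(html);
        return new ParseData(m.find() ? m.group(1).trim() : "");
    }

    static ParseText parseText(String html) {
        // Strip tags and collapse whitespace to approximate the page's text content.
        String text = TAGS.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
        return new ParseText(text);
    }
}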
Nutch workflow: updatedb operation analysis
The updatedb operation calls the org.apache.nutch.crawl.CrawlDb class in the crawl package. Its result is that the crawl database is updated in preparation for the next round of crawling.
The main work of the updatedb operation is as follows:
Update the crawldb according to the fetch and parse output under the segments directory, adding newly discovered URLs and replacing the entries of old ones.
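A small sketch of that update, reusing the in-memory crawldb map from the inject sketch above: URLs that were just fetched get their entries replaced, and newly discovered outlinks are added as unfetched so the next generate round can pick them up.

import java.util.Map;
import java.util.Set;

// A toy updatedb on the same in-memory "crawldb" map as the inject sketch above.
public class UpdateDbSketch {

    enum Status { UNFETCHED, FETCHED }

    record CrawlDatum(Status status, float score) {}

    static void updateDb(Map<String, CrawlDatum> crawlDb,
                         Set<String> fetchedUrls,
                         Set<String> discoveredOutlinks,
                         float defaultScore) {
        // Replace the entries of URLs that were just fetched.
        for (String url : fetchedUrls) {
            CrawlDatum old = crawlDb.get(url);
            float score = (old != null) ? old.score() : defaultScore;
            crawlDb.put(url, new CrawlDatum(Status.FETCHED, score));
        }
        // Add newly discovered URLs so the next generate round can pick them up.
        for (String url : discoveredOutlinks) {
            crawlDb.putIfAbsent(url, new CrawlDatum(Status.UNFETCHED, defaultScore));
        }
    }
}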
Two further processes
---------------------------
Nutch workflow: index process analysis
The index process consists of three operations: converting data into text, analyzing the text, and saving the analyzed text to the index database.
(1) Convert to text
Before data can be indexed, it must first be converted into a plain-text character stream that Nutch can process. In the real world, however, a great deal of content exists in rich media file formats such as PDF, Word, Excel, HTML, and XML. For this, Nutch uses a plugin mechanism: various document parsers convert rich media into plain text streams. There is a wide variety of parsers, and developers can choose among them as needed, or modify or write their own, which is flexible and convenient.
(2) Analyze the text
Before being indexed, the data must be preprocessed so that it is better suited to indexing. During analysis, the text is first split into tokens (vocabulary units), and then a number of optional operations are performed on them. For example, the tokens can be converted to lowercase before indexing so that searches become case-insensitive; the most typical operation is removing frequently used words that carry no real meaning, such as the English stop words (a, an, the, in, on, etc.). Similarly, the tokens may need to be stripped of unnecessary letters to reduce them to their stems. This whole procedure is called analysis. Analysis techniques are used in both indexing and searching, and they are important.
(3) Save the analyzed text to the database.
After the input data has been analyzed, the result can be written to the index files. Nutch uses the Lucene index format; for details, refer to the Lucene index mechanism. Lucene uses an "inverted index" data structure to store its index.
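A small, self-contained illustration of the "analyze text" step from (2): tokenizing, lower-casing, and dropping a few English stop words. The stop-word list is a tiny example, not Lucene's analyzer, and stemming is only marked as a placeholder.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// A tiny analysis pipeline: tokenize, lower-case, drop stop words.
public class AnalyzeSketch {

    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "in", "on", "and", "of", "to");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {   // split on non-letter/digit runs
            if (raw.isEmpty()) continue;
            String token = raw.toLowerCase();                  // makes search case-insensitive
            if (STOP_WORDS.contains(token)) continue;          // remove words with no real meaning
            tokens.add(token);                                 // (a stemmer would be applied here)
        }
        return tokens;
    }
}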
Nutch workflow: search program analysis
The Nutch search program executes as follows:
(1) The HTTP server receives the user's request. The Nutch code that executes it is a servlet called the query handler. The query handler is responsible for responding to the user's request and returning the corresponding HTML result page to the user.
(2) The query handler performs some light processing on the query and forwards the search terms to a group of machines running index searchers. Nutch's query system appears much simpler than Lucene's, mainly because a search engine user has a clear idea of what to query, whereas Lucene's architecture is very flexible and offers many different query types. A seemingly simple query is ultimately converted into a specific Lucene query type. Each index searcher works in parallel and returns an ordered list of document IDs.
(3) A large number of result streams are returned to the query handler. The query handler merges these result sets and picks out the best matches from all of them. If an index searcher fails to return results within 1~2 seconds, its results are ignored, so the final list is made up of the results returned by the searchers that responded successfully.
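A sketch of the merging in steps (2) and (3), with a hypothetical Searcher interface standing in for the machines running index searchers: each searcher is queried in parallel, searchers that do not answer within the time limit are ignored, and the remaining hits are merged by score.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.*;

// Query each index searcher in parallel, drop searchers that miss the deadline,
// and merge the remaining hits into one score-ordered list.
public class DistributedSearchSketch {

    record Hit(int docId, float score) {}

    interface Searcher {                      // hypothetical stand-in for one index searcher
        List<Hit> search(String query);
    }

    static List<Hit> search(List<Searcher> searchers, String query,
                            long timeoutMillis, int topN) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(searchers.size());
        List<Callable<List<Hit>>> tasks = new ArrayList<>();
        for (Searcher s : searchers) tasks.add(() -> s.search(query));

        // invokeAll cancels any task that has not finished within the timeout.
        List<Future<List<Hit>>> futures =
                pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);

        List<Hit> merged = new ArrayList<>();
        for (Future<List<Hit>> f : futures) {
            try {
                merged.addAll(f.get());       // only successfully returned results are kept
            } catch (CancellationException | ExecutionException e) {
                // a slow or failed searcher is left out of the final list
            }
        }
        pool.shutdown();
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
        return merged.subList(0, Math.min(topN, merged.size()));
    }
}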