Access Mechanism and Structure Analysis of Nutch/Lucene

Source: Internet
Author: User

The source must be indicated, and the content shall not be used for any form of commercial activity without the author's consent.

Subject: Solving the segments splitting problem of Nutch and the rebuild (re-crawl) problem of Nutch crawl.


Main Content

I. Lucene index mechanism and index file structure
II. Nutch web crawler analysis and file structure analysis
III. Implementation scheme for splitting Nutch segments indexes

I. Lucene index mechanism and index file structure
1. Lucene index mechanism
2. Lucene file format
- _0.f0, _0.f1 files (per-field normalization factors)
- _0.fnm field set information file
- _0.frq and _0.prx frequency and position files
- *.fdt and *.fdx, which together form the stored field value/index files
- segment1.nrm normalization factors
- segments: index segment record file
- deletable: records of deleted files
- *.tii and *.tis, which together form the term dictionary
- lock (no extension): controls read/write synchronization

II. Analysis of Nutch web crawlers
Nutch segments analysis
Nutch file structure analysis

III. Splitting solution for Nutch segments

Lucene builds indexes using an inverted index structure. When indexing, Lucene's analyzer (tokenizer) extracts from the documents the information to be retrieved, such as each term and its occurrence frequency, and the index writer then writes this information to the index files. The core lies in Lucene's index file structure, the inverted index. First, let us understand the concept of a "reverse index".

A reverse index is a way of organizing documents centered on index terms: each index term points to a sequence of documents, all of which contain that term. By contrast, in a forward index the document occupies the central position, and each document points to the sequence of index terms it contains. A reverse index makes it easy to find all documents that contain a given term, and Lucene uses it as its basic index structure.
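As a minimal sketch of this idea (the document texts, class name, and term extraction here are illustrative inventions, not Lucene's actual analyzer or on-disk format), a reverse index can be built by inverting the forward document-to-terms view:

```java
import java.util.*;

// Minimal sketch of a reverse (inverted) index: each term maps to the
// ordered list of document numbers that contain it.
public class InvertedIndexDemo {
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Forward view: a doc number and its terms. Inverting it builds the postings.
    public void addDocument(int docNum, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docNum);
        }
    }

    // Lookup is a single map access: every doc containing the term.
    public List<Integer> docsContaining(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexDemo idx = new InvertedIndexDemo();
        idx.addDocument(0, "lucene index structure");
        idx.addDocument(1, "nutch crawler");
        idx.addDocument(2, "lucene segments");
        System.out.println(idx.docsContaining("lucene")); // [0, 2]
    }
}
```

Note that the postings lists come out sorted by document number for free, because documents are added in increasing number order.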

A Lucene index is composed of several segments, each segment is composed of several documents, each document is composed of several fields, and each field is composed of several terms. A term is the smallest unit of indexing: it directly represents a string together with its position in the file, its number of occurrences, and other information. A field is an associated pair of field name and field value: the field name is a string, and the field value consists of terms; for example, a title field associates the name "title" with the terms of the actual title.

A document is the result of extracting all the information from a file. A segment is a sub-index; sub-indexes can be merged into an index, or into a new sub-index that contains all the elements of the merged parts.

Lucene treats an index as a directory (folder), and all the files it contains are the index's contents. These files are grouped by segment: files belonging to the same segment share a file name and differ only in extension. In addition, there are three files used to record all segments, to record deleted files, and to control read/write synchronization: the segments, deletable, and lock files, which have no extensions.

Each segment contains a group of files with the same name and different extensions; the segment names are recorded in the segments file. A segment mainly records two kinds of information: the field set and the term set. Because the index information is stored statically, the file groups of the field set and of the term set use a similar storage method: a small index file that is loaded into memory at runtime, and a corresponding information file that can be randomly accessed according to the offsets recorded in the index. The index file and the information file correspond implicitly by record order: the entries "index entry 1, index entry 2, ..." in the index file line up with "information entry 1, information entry 2, ..." in the information file. The field set and the term set are connected through the field numbers recorded in the field info file of the segment (for example, segment1.fnm); in this way segment1.fdx and segment1.tii stay in correspondence. Thus not only are the field set and the term set linked to each other, but the files within each set are linked as well, and the index information of the entire segment forms an organic whole.
This is the index file format used by Lucene. Fundamentally it is an inverted index, but Lucene has made some refinements in file arrangement, such as the paired index/information files, to improve search efficiency.

(_0.f0, _0.f1 files)
(Document numbers)
Lucene uses an integer document number to identify each document. The first document added to the index is numbered 0; each subsequently indexed document receives a number one greater than the previous one.
The standard technique is to assign each segment a base number according to the document counts of the segments before it. To convert a document number within a segment to a segment-external number, add the segment's base; to convert a segment-external number back, determine from the possible post-conversion ranges which segment the document belongs to, and subtract that segment's base. For example, if two segments of five documents each are merged, the base of the first segment is 0 and the base of the second is 5, so document number 3 in the second segment has segment-external number 8.
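The base-offset arithmetic above can be sketched as follows (a toy illustration of the numbering scheme, not Lucene's actual implementation; the class and method names are invented for this example):

```java
import java.util.*;

// Sketch of segment-merge renumbering: each segment's base equals the total
// document count of all preceding segments. Per-segment doc number + base
// gives the segment-external number; the reverse conversion finds the segment
// whose range contains the external number and subtracts its base.
public class SegmentDocNumbers {
    private final int[] bases; // base offset of each segment

    public SegmentDocNumbers(int... segmentSizes) {
        bases = new int[segmentSizes.length];
        int base = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = base;
            base += segmentSizes[i];
        }
    }

    public int toGlobal(int segment, int docInSegment) {
        return bases[segment] + docInSegment;
    }

    // Returns {segment, docInSegment} for a segment-external doc number.
    public int[] toSegment(int globalDoc) {
        for (int i = bases.length - 1; i >= 0; i--) {
            if (globalDoc >= bases[i]) {
                return new int[]{i, globalDoc - bases[i]};
            }
        }
        throw new IllegalArgumentException("negative doc number");
    }

    public static void main(String[] args) {
        // Two segments of five documents each, as in the text's example.
        SegmentDocNumbers merged = new SegmentDocNumbers(5, 5);
        System.out.println(merged.toGlobal(1, 3)); // 8
    }
}
```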
(_0.fnm field set information)
A document in the index consists of one or more fields, and this file contains information about the fields of each segment.
All field names are stored in this file, numbered according to the order in which they appear: field 0 is the first field in the file, field 1 the next, and so on, in the same way as document numbers.


(_0.frq, _0.prx frequency and position files)
Term frequency: the .frq file contains, for each term, the list of documents containing it and the term's frequency in each of those documents.
Position: the .prx file contains, for each term, the lists of positions at which the term occurs within each document.

(*.fdt and *.fdx, the stored field value/index files)
The stored fields table is represented by the field index (.fdx) and field value (.fdt) files.

(segment1.nrm normalization factors)
The .nrm file contains the normalization factor for each document, used mainly in the scoring and ranking mechanism.

(segments index record file)
The segments file records the segments that make up the index: it contains the name and size of each segment.

(deletable: records of deleted files)
The deletable file contains the names of files that are no longer used by the index but may not yet have been physically deleted.
The .del file is optional and exists only after documents have been deleted from a segment.

(*.tii and *.tis, the term dictionary)
The term dictionary is represented by two files: the term information file (.tis) and the term information index (.tii).
Term information index (.tii file):
This file contains every 128th entry of the .tis file, in the same order as the entries of the .tis file. The design allows the index information to be read into memory in one pass and then used to randomly access the .tis file. Its structure is very similar to that of the .tis file, with one additional variable, IndexDelta, in each entry record.
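The lookup path this enables, a binary search over the small in-memory index followed by a short scan of the full term list, can be sketched as follows (a toy in-memory model with invented names; the real .tii/.tis files store compressed records and byte offsets rather than list positions):

```java
import java.util.*;

// Sketch of the .tii/.tis idea: keep every 128th term (the index interval)
// in memory with its position in the full sorted term list, binary-search
// that small index, then scan at most 127 entries of the full list.
public class TermIndexSketch {
    static final int INDEX_INTERVAL = 128;
    private final List<String> allTerms; // stands in for the sorted .tis file
    private final List<String> indexTerms = new ArrayList<>();
    private final List<Integer> indexOffsets = new ArrayList<>();

    public TermIndexSketch(List<String> sortedTerms) {
        allTerms = sortedTerms;
        for (int i = 0; i < sortedTerms.size(); i += INDEX_INTERVAL) {
            indexTerms.add(sortedTerms.get(i));
            indexOffsets.add(i);
        }
    }

    // Returns the term's position in the full list, or -1 if absent.
    public int positionOf(String term) {
        int i = Collections.binarySearch(indexTerms, term);
        // If not found, start from the nearest preceding index entry.
        int start = (i >= 0) ? indexOffsets.get(i)
                             : indexOffsets.get(Math.max(0, -i - 2));
        int end = Math.min(start + INDEX_INTERVAL, allTerms.size());
        for (int p = start; p < end; p++) {
            if (allTerms.get(p).equals(term)) return p;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int k = 0; k < 300; k++) terms.add(String.format("term%04d", k));
        TermIndexSketch idx = new TermIndexSketch(terms);
        System.out.println(idx.positionOf("term0200")); // 200
    }
}
```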

(lock (no extension): controls read/write synchronization)
It mainly prevents other processes from modifying the index through file operations while the index is in use.

Analysis of web crawlers in Nutch

1. Create a new webdb (admin db -create);
2. Inject the starting URLs to be crawled into the webdb (inject);
3. Generate a fetchlist from the webdb and write it into a corresponding segment (generate);
4. Fetch the web pages listed in the fetchlist (fetch);
5. Update the webdb with the fetched pages (updatedb).
By looping over steps 3-5, Nutch achieves crawling to any desired depth.
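The loop over steps 3-5 can be sketched as follows (the step methods are hypothetical stand-ins for the real Nutch tools listed above and only record the order of operations; they perform no crawling):

```java
import java.util.*;

// Sketch of the Nutch crawl cycle: create webdb, inject seeds, then repeat
// generate -> fetch -> updatedb once per crawl depth level.
public class CrawlCycleSketch {
    private final List<String> log = new ArrayList<>();

    private void createWebDb()       { log.add("create webdb"); }
    private void inject(String seed) { log.add("inject " + seed); }
    private void generate(int round) { log.add("generate segment " + round); }
    private void fetch(int round)    { log.add("fetch segment " + round); }
    private void updateDb(int round) { log.add("updatedb " + round); }

    // Steps 3-5 repeat once per depth level; each round produces one segment.
    public List<String> crawl(String seed, int depth) {
        createWebDb();
        inject(seed);
        for (int round = 0; round < depth; round++) {
            generate(round);
            fetch(round);
            updateDb(round);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(new CrawlCycleSketch().crawl("http://example.com/", 2));
    }
}
```

Each pass through the loop discovers new links from the pages just fetched, so the next generate step can reach one level deeper.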

Nutch segments analysis
Analysis of the segments/segment file structure: after the Nutch crawler runs, the following five items are generated in the webdb folder:
Linksbymd5 linksbyurl pagesbymd5 pagesbyurl stats
Where:
The stats file stores the version information after crawling, the number of web pages processed, and the number of links;
Each of the other four folders, such as pagesbyurl, contains two files: index and data. The data file stores ordered key/value pairs; the ordering can be changed by choosing a different key and comparator, and some additional information is stored alongside the pairs. For example, in pagesbyurl, after every fixed number of key/value pairs a position marker (SYN) is written. The index file stores an index into the data file; it is also ordered and holds keys together with position information, but to save space only every Nth key/value entry of the data file receives an index entry. Because the index is ordered, lookup uses binary search; if the key is not found in the index, the search returns the position information of the nearest preceding entry, which is close to the target, so the key can then be found quickly by scanning forward from that position in the data file.
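That index/data lookup can be sketched as follows (an in-memory toy with invented names and a small index interval; the real files store records on disk and the positions are byte offsets):

```java
import java.util.*;

// Sketch of the index/data file pair: data holds all sorted key/value
// records; index holds every Nth key with its position in data. A lookup
// binary-searches the index for the nearest preceding position, then scans
// forward in data until the key is found or the next indexed block begins.
public class IndexedDataFileSketch {
    static final int INTERVAL = 4; // index every 4th record (toy value)
    private final List<Map.Entry<String, String>> data; // sorted key/value pairs
    private final List<String> indexKeys = new ArrayList<>();
    private final List<Integer> indexPositions = new ArrayList<>();

    public IndexedDataFileSketch(SortedMap<String, String> records) {
        data = new ArrayList<>(records.entrySet());
        for (int i = 0; i < data.size(); i += INTERVAL) {
            indexKeys.add(data.get(i).getKey());
            indexPositions.add(i);
        }
    }

    // Returns the value for key, or null if the key is absent.
    public String get(String key) {
        int i = Collections.binarySearch(indexKeys, key);
        // Not in the index: fall back to the nearest preceding index entry.
        int start = (i >= 0) ? indexPositions.get(i)
                             : indexPositions.get(Math.max(0, -i - 2));
        int end = Math.min(start + INTERVAL, data.size());
        for (int p = start; p < end; p++) {
            if (data.get(p).getKey().equals(key)) return data.get(p).getValue();
        }
        return null;
    }

    public static void main(String[] args) {
        TreeMap<String, String> records = new TreeMap<>();
        for (char c = 'a'; c <= 'j'; c++) records.put("url-" + c, "page-" + c);
        IndexedDataFileSketch file = new IndexedDataFileSketch(records);
        System.out.println(file.get("url-g")); // page-g
    }
}
```

The same sparse-index trade-off appears in the .tii/.tis term dictionary: a smaller index saves memory at the cost of a short sequential scan per lookup.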
In addition, when Nutch adds or deletes a web page or link, it does not modify the webdb directly. Instead, it appends a page or link operation command via the inner PageInstructionWriter or LinkInstructionWriter classes of WebDBWriter; the stored commands are then sorted and de-duplicated, and finally merged with the web page data already stored in the webdb.

The Fetcher class runs during the actual page fetching, and the files and folders produced by the crawler are generated by this class. Nutch provides an option that controls whether fetched pages are parsed; if it is set to false, the parse_data and parse_text folders are not produced.
Almost all files generated by the Nutch crawler hold key/value pairs; they differ only in the types of the keys and values. After the crawler runs, the files are generated in the subfolders of the segments folder.
These folders are: content, fetcher, fetchlist, index, parse_data, parse_text.

The content folder corresponds to the Content class in the protocol package; the fetcher folder corresponds to the FetcherOutput class in the fetcher package; fetchlist corresponds to the FetchList class in the pagedb package; parse_data and parse_text correspond to the ParseData and ParseText classes in the parse package, respectively.

(Nutch segments)
From the preceding analysis, we now have a clear understanding of Nutch's segments.
Tip: a segment in Lucene is different from a segment in Nutch. A Lucene segment is part of the index, whereas a Nutch segment only holds the content and index of one portion of the web pages in the webdb; the index that is finally generated from the segments is independent of them.

(Nutch crawldb)
It mainly holds the crawl_fetch and parse_data data of the segments, and the crawldb is used in the fetch loop to generate new segments for deeper crawling.

(Nutch linkdb)
It stores the crawl_fetch data of all the segments, together with their URL lists.

(Nutch indexes)
The indexes set is generated from the Nutch crawldb, linkdb, and segments. Its storage structure follows Lucene's index mechanism.

(Nutch index)
The index is extracted and merged from the Nutch indexes. Luke can be used to inspect the resulting index.

The main resources Nutch uses under Tomcat are segments, linkdb, and indexes.

Splitting solution for Nutch segments
From the above analysis, we can conclude that Nutch's crawl data stores its information in the segments.
We can reconstruct the crawl data source structure by splitting the segments into slices.
The corresponding Nutch operations can then be used to split and rebuild the Nutch segments.
