[TSE Study Notes: Peking University Tianwang ("Skynet") Search Engine] Section 2: Introduction to Important Data Files


This section briefly introduces the main data files in the system, to make it easier to read the source code and understand how the system works. Unless otherwise specified, all paths and files are relative to the index directory; that is, index is the current directory.

(1) ./chseg/words.dict

This is the dictionary file. It contains all Chinese words, characters, and punctuation marks supported by the system. The dictionary is the basis for Chinese word segmentation and directly determines the segmentation result. Each record in the file is one line with three fields: the first is a serial number, the second is the word itself, and the meaning of the third is unknown for now. Figure 1 shows part of the words.dict file (Note: the yellow numbers in the figure are line numbers displayed by Vim, not part of the file).

Figure 1

In addition, the ./words.dict file is a link file whose link target is ./chseg/words.dict.
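As a rough illustration of how such a dictionary could be loaded for lookups, here is a minimal Python sketch. It assumes the three fields are separated by whitespace and appear in the order described above; the function name load_dictionary and the sample lines are invented for illustration and are not taken from the TSE source.

```python
from typing import Dict, Iterable, Tuple


def load_dictionary(lines: Iterable[str]) -> Dict[str, Tuple[int, str]]:
    """Map each word to (serial number, third field kept as raw text)."""
    words = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        serial, word, third = fields
        words[word] = (int(serial), third)
    return words


# Invented sample lines; the real words.dict entries may look different.
sample = ["1 北京 100", "2 大学 200", ""]
dictionary = load_dictionary(sample)
print("北京" in dictionary)  # the membership test a segmenter would rely on
```

When reading the real file, it would need to be opened with whatever encoding it actually uses (a GB-family encoding is likely for this system, but that is an assumption).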

(2) ./data/Tianwang.raw.2559638448

This file holds the raw webpage data captured by the webpage collection module. It is stored in a fixed, well-defined format, the Tianwang ("Skynet") format. The following is part of a raw webpage data file in that format.

    Version: 1.0
    URL: http://***.105.138.175/default2.asp?Lang=GB
    Origin: http://***.105.138.175/
    Date: Fri, 23 May 2008 20:01:36 GMT
    IP: 162.105.138.175
    Length: 38413

    HTTP/1.1 200 OK
    Server: Microsoft-IIS/5.0
    Date: Fri, 23 May 2008 11:17:49 GMT
    Connection: keep-alive
    Connection: keep-alive
    Content-Length: 38088
    Content-Type: text/html; charset=gb2312
    Expires: Fri, 23 May 2008 11:17:49 GMT
    Set-Cookie: aspsessionidsstrdcab=imeombiaipdfckpaedjfhoih; Path=/
    Cache-control: private

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <HTML>
    <Head>
    ......

    Version: 1.0
    URL: ***
    ......

 

Each webpage record consists of a header, the webpage data, and blank lines, in this order: header + blank line + webpage data + blank line. In the original version of the preceding example, the headers were shown in red and black and the webpage data in blue. The first item in the header must be the version declaration, Version: 1.0, so each Version: 1.0 line in this file marks the boundary between two webpage records. For more information about the raw webpage data file, see the book these notes cite as "Search".

This file is produced by the webpage collection module. Webpage analysis and inverted index creation are then carried out on it.
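The record layout described above can be walked with a small reader. The following Python sketch is not the TSE implementation; it assumes that the Length field gives the number of bytes of webpage data following the blank line after the header, and it compares header names case-insensitively. The function name read_records and the example.com sample are made up for illustration.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def read_records(stream: BinaryIO) -> Iterator[Tuple[int, Dict[str, str], bytes]]:
    """Yield (offset, header fields, data) for each record in the stream."""
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank line separating two records
        header = {}
        while line.strip():  # header lines run until the first blank line
            key, _, value = line.decode("latin-1").partition(":")
            header[key.strip().lower()] = value.strip()
            line = stream.readline()
        # Assumption: Length counts the bytes of data after the blank line.
        data = stream.read(int(header["length"]))
        yield offset, header, data


# Invented two-record sample in the same overall shape as the example above.
page = b"<html>hi</html>"
raw = (
    b"version: 1.0\nurl: http://example.com/\nlength: "
    + str(len(page)).encode()
    + b"\n\n"
    + page
    + b"\n"
) * 2
for offset, header, data in read_records(io.BytesIO(raw)):
    print(offset, header["url"], len(data))
```

The offset yielded for each record is the byte position of its first header line, which is the kind of value a webpage index would need to store.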

(3) ./data/Doc.idx

This is the webpage index file, which is part of the indexed webpage database described in section 1 of chapter 4 of "Search". The job of the indexed webpage database is, given a URL, to locate the record that the URL points to in the raw webpage data. Because the raw webpage data files are very large, searching them sequentially without an index over the webpage records would be very inefficient.

Each record in the webpage index file is one line containing the serial number of the webpage (its sequence number within the raw webpage data file, referred to as the docid), the offset of the webpage record in the raw webpage data file, and the MD5 digest of the webpage content. Figure 2 shows part of the doc.idx file (Note: the yellow numbers in the figure are line numbers displayed by Vim, not part of the file).

Figure 2

With this file, it is easy to locate a webpage record in the raw webpage data file by its docid and so obtain the webpage data.
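The docid-to-record lookup can be sketched as follows in Python. This is only an illustration under stated assumptions: the three fields are taken to be whitespace-separated in the order docid, offset, digest, and a record is assumed to extend up to the next larger offset in the index (or to the end of the file). The names load_doc_index and fetch_record and the sample data are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO, Dict, Iterable, Tuple


def load_doc_index(lines: Iterable[str]) -> Dict[int, Tuple[int, str]]:
    """Map docid -> (offset in the raw file, MD5 digest as hex text)."""
    index = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        docid, offset, digest = fields
        index[int(docid)] = (int(offset), digest)
    return index


def fetch_record(raw: BinaryIO, index: Dict[int, Tuple[int, str]], docid: int) -> bytes:
    """Seek to the record's offset and read up to the next record's offset."""
    offset, _ = index[docid]
    later = [o for o, _ in index.values() if o > offset]
    raw.seek(offset)
    if later:
        return raw.read(min(later) - offset)
    return raw.read()  # last record: read to the end of the file


# Invented sample: two tiny "records" and digests computed over them.
first, second = b"record-zero\n\n", b"record-one\n\n"
raw = io.BytesIO(first + second)
idx_lines = [
    f"0 0 {hashlib.md5(first).hexdigest()}",
    f"1 {len(first)} {hashlib.md5(second).hexdigest()}",
]
index = load_doc_index(idx_lines)
print(fetch_record(raw, index, 1))
```

The point of the design is that a single seek replaces a sequential scan of a very large file.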

(4) ./data/url.idx.sort_uniq

This is the URL index file, which is also part of the indexed webpage database described in section 1 of chapter 4 of "Search". It is used to find the docid that corresponds to a URL. The file is very simple: each record is one line containing the MD5 digest of a URL and the corresponding docid. To find the docid for a given URL quickly, the records are sorted by URL digest, so that the matching entry can be located by binary search. In other words, url.idx.sort_uniq is the deduplicated mapping from URL digest to docid, sorted by URL digest. Figure 3 shows part of the url.idx.sort_uniq file (Note: the yellow numbers in the figure are line numbers displayed by Vim, not part of the file).

Figure 3

This makes it convenient to obtain a webpage's data from the raw webpage database: first compute the MD5 digest of the given URL, then find the corresponding docid in this file, then use the doc.idx file to obtain the offset of the webpage in the raw webpage database, and finally read the webpage data.
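The first two steps (digest the URL, then binary-search the sorted digest list) can be sketched in Python as below. This is an illustrative sketch rather than the TSE code: it assumes the digest is the hexadecimal MD5 of the URL bytes exactly as given, and that the index is sorted by that hex text. The helper names and the example URLs are invented.

```python
import bisect
import hashlib
from typing import Iterable, List, Optional, Tuple


def url_digest(url: str) -> str:
    """Hex MD5 of the URL text (assumption: no URL normalisation first)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def build_url_index(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Keep one entry per digest and sort by digest, like a sort + uniq pass."""
    unique = {url_digest(url): docid for url, docid in pairs}
    return sorted(unique.items())


def find_docid(index: List[Tuple[str, int]], url: str) -> Optional[int]:
    """Binary-search the sorted (digest, docid) list for the URL's digest."""
    digest = url_digest(url)
    # (digest,) sorts just before any (digest, docid) pair with the same digest.
    pos = bisect.bisect_left(index, (digest,))
    if pos < len(index) and index[pos][0] == digest:
        return index[pos][1]
    return None


url_index = build_url_index([
    ("http://a.example/", 0),
    ("http://b.example/", 1),
    ("http://a.example/", 0),  # duplicate URL, removed by deduplication
])
print(find_docid(url_index, "http://b.example/"))
```

The docid returned here would then be looked up in the webpage index file to get the record's offset.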

(5) ./data/Sun.iidx

This is the inverted index file, a key file in the system. Inverted indexes are also widely used by modern search engines. Where there is an inverted index file there is also a forward index file: the inverted index file is generated from the forward index file. In short, a forward index file maps webpages to keywords, while an inverted index file maps keywords to webpages. For details about the two, refer to "Search" or to relevant online materials.
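The relationship between the two mappings can be shown with a short Python sketch. It is a generic illustration of inverting a forward index, not the procedure TSE itself uses; the function name invert and the sample terms are chosen only for the example.

```python
from collections import defaultdict
from typing import Dict, List


def invert(forward: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Turn docid -> terms into term -> docids (docids in ascending order)."""
    inverted = defaultdict(list)
    for docid in sorted(forward):
        # dict.fromkeys drops repeated terms while keeping their first-seen order,
        # so each docid is listed at most once per term.
        for term in dict.fromkeys(forward[docid]):
            inverted[term].append(docid)
    return dict(inverted)


# Invented forward index: page 0 and page 1 with their segmented terms.
forward = {0: ["北京", "大学"], 1: ["大学", "搜索", "大学"]}
inverted = invert(forward)
print(inverted["大学"])
```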

The inverted index file records the mapping between every keyword that appears in the raw webpage data (that is, every word found in the dictionary used by the system, referred to as a term) and the webpages in which that keyword appears. Each record in the file is one line containing a term and the docids of the webpages where the term appears. The term is separated from the docids by \t, and the docids are stored in order, separated by spaces. Figure 4 shows part of the sun.iidx file (Note: the yellow numbers in the figure are line numbers displayed by Vim, not part of the file).

Figure 4

The inverted index file is critical because keyword search is carried out against it: first find the keyword in the file, then obtain the docids of the webpages where the keyword appears, and then read the webpage records to obtain the webpage content.
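A minimal Python sketch of reading records in this line format and looking up a keyword is given below. It follows only what is described above (a term, a tab, then space-separated docids); the function names and the sample lines are hypothetical, and the real file would need to be opened with its actual encoding.

```python
from typing import Dict, Iterable, List


def format_line(term: str, docids: List[int]) -> str:
    """Render one record: term, a tab, then docids separated by spaces."""
    return term + "\t" + " ".join(str(d) for d in docids)


def parse_inverted_index(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Read records of the form 'term<TAB>docid docid ...' into a dict."""
    index = {}
    for line in lines:
        term, sep, postings = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab found: not a valid record
        index[term] = [int(d) for d in postings.split()]
    return index


def search(index: Dict[str, List[int]], keyword: str) -> List[int]:
    """Return the docids of the pages containing the keyword (empty if none)."""
    return index.get(keyword, [])


# Invented sample records.
lines = [format_line("大学", [0, 1, 5]), format_line("搜索", [1])]
iidx = parse_inverted_index(lines)
print(search(iidx, "大学"))
```

Each docid returned can then be resolved to an offset through the webpage index file and read from the raw webpage data.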

Supplement:

When sun.iidx is opened in Vim, the Chinese keywords in it appear garbled. This is an encoding problem. To fix it, edit the current user's Vim configuration file (~/.vimrc) and add the following lines:

set fileencodings=utf-8,gb2312,gbk,gb18030

set termencoding=utf-8

set fileformats=unix

set encoding=prc

For more information about Vim encoding configuration, see this article (http://www.2cto.com/os/201111/110622.html).

 
