[TSE Study Notes: Peking University Skynet Search Engine] Section 5: Preparing Data

Source: Internet
Author: User
Tags: map, data structure

The previous section introduced the main function of tsesearch.cpp, the entry program for the search function, to give a rough picture of how a search is carried out. Starting with this section, the notes analyze in detail each of the main steps mentioned there: preparing data, obtaining user input, Chinese word segmentation, searching for keywords, sorting results, and displaying search results. This section analyzes the source code that prepares the data. (The content is quite simple; readers who already understand it can skip ahead.)

(1) Load the dictionary

The first line of the main function in the previous section defines the cdict object IDICT. cdict is a dictionary class whose source is in ./chseg/dict.cpp. Its constructor calls the opendict function, which is very simple: it opens the dictionary file, reads words from it line by line, and adds them to the class member mapdict. In other words, it loads the dictionary from the file into memory for later use.

The file opened is dictfilename, a constant defined as const string dictfilename("words.dict"). words.dict should be familiar: it is the dictionary file introduced in section 2.

It is also worth looking at the data structure used to store the dictionary. The declaration of the cdict class in ./chseg/dict.h defines map<string, int> mapdict;, so TSE stores the dictionary in an STL map, which is typically implemented as a red-black tree. The storage structure matters because the dictionary is large (words.dict has 108783 lines in total) and Chinese word segmentation must frequently check whether a given string exists in the dictionary, which makes lookup speed one of the important factors in the search engine's efficiency. If the dictionary were stored without a suitable data structure, every lookup would have to traverse the whole dictionary, which takes O(n) time. With a red-black tree, a lookup takes O(log n) time, which is far more efficient. Some Chinese word segmentation programs store the dictionary in a hash map, which gives average O(1) lookups and is usually faster still than a red-black tree.
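To make the three options concrete, here is a small sketch of the same membership test against an unsorted list, a std::map, and a std::unordered_map. The function names in_list, in_tree, and in_hash are hypothetical and are not part of TSE; the complexity figures in the comments are the usual ones for these standard containers.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// O(n): linear scan over an unsorted word list.
bool in_list(const std::vector<std::string>& words, const std::string& w) {
    return std::find(words.begin(), words.end(), w) != words.end();
}

// O(log n): std::map is an ordered balanced tree (typically red-black).
bool in_tree(const std::map<std::string, int>& dict, const std::string& w) {
    return dict.find(w) != dict.end();
}

// Average O(1), worst case O(n): std::unordered_map is a hash table.
bool in_hash(const std::unordered_map<std::string, int>& dict,
             const std::string& w) {
    return dict.find(w) != dict.end();
}
```

For a dictionary of about 108783 entries, a tree lookup needs on the order of 17 string comparisons (log2 of 108783 is roughly 16.7), whereas a linear scan may need up to 108783.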

(2) Load the inverted index

Line 50 of the main function in the previous section calls the getinvlists function, which is defined in ./query.cpp. The code is also very simple: it opens the inverted index file and reads it line by line in a loop into mapbuckets. In other words, it loads the inverted index from the file into memory for later use.

The file opened is inf_info_name, a constant defined as const string inf_info_name("./data/Sun.iidx"). Sun.iidx should be familiar: it is the inverted index file introduced in section 2.
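A rough sketch of this kind of loader is shown below. It is not the getinvlists source. The line layout (a term followed by whitespace-separated integer document IDs), the InvertedIndex type, and the name load_inverted_index are all assumptions made for illustration; the real file format and the real type of mapbuckets are not given in these notes.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory form: term -> list of document IDs.
using InvertedIndex = std::map<std::string, std::vector<int>>;

// Hypothetical loader. Assumed line format: "<term> <docid> <docid> ...".
// Lines for a term that was already seen are appended to its existing list.
void load_inverted_index(std::istream& in, InvertedIndex& index) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string term;
        if (!(fields >> term)) continue;  // blank line: nothing to read
        std::vector<int>& postings = index[term];
        int doc_id;
        while (fields >> doc_id) postings.push_back(doc_id);
    }
}
```

With the index held in a map like this, finding the documents for a keyword becomes a single in-memory lookup on the term instead of another pass over the file.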

(3) Load the web page index

The second line of the main function in the previous section calls the getdocidx function. Like the functions above, it loads data from the web page index file (Doc.idx) into memory for later use.

 

 
