The previous chapter described the query service subsystem in detail. As we saw, the query service subsystem implements its query function on top of a set of data files. These data files were described in Section 2: the dictionary file (words.dict), the raw web page database file (Tianwang.raw.2559638448), the web page index file (Doc.idx), the URL index file (URL.idx.sort_uniq), and the inverted index file (sun.iidx). Of these, words.dict already exists and Tianwang.raw.2559638448 is produced by the crawler; the remaining files must be generated by a program, and that is the job of the preprocessing subsystem. The preprocessing programs analyze the raw web page database file to generate these data files. The most important one is the inverted index file, because the query service subsystem looks up keywords in it.
So how do the preprocessing programs run to generate these data files? The TSE source code includes a document (TSE/index/readme.txt) that describes the procedure.
==================== readme.txt ====================
1. The document index (Doc.idx) keeps information about each document. It is a fixed width ISAM (index sequential access mode) index, ordered by docid.
The information stored in each entry includes a pointer into the repository, a document length, a document checksum. The URL index (URL.idx) is used to
convert URLs into docids. It is a list of URL checksums with their corresponding docids and is sorted by checksum. In order to find the docid of
a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file to find its docid.
./docindex
got Doc.idx, URL.idx, docid2url.idx
2. sort URL.idx | uniq > URL.idx.sort_uniq
3. Segment document to terms (with finding document according to the URL)
./docsegment Tianwang.raw.2559638448
got Tianwang.raw.2559638448.seg
4. Create forward index (docid --> termid)
./crtforwardidx Tianwang.raw.2559638448.seg > moon.fidx
5. # set | grep "LANG"
LANG=en; export LANG;
sort moon.fidx > moon.fidx.sort
6. Create inverted index (termid --> docid)
./crtinvertedidx moon.fidx.sort > sun.iidx
----------------------------------------------------
Providing service
at http://162.105.80.60/TSE/
tsesearch CGI program for query
snapshot CGI program for page snapshot
====================================================
This file describes how to generate the inverted file and the other important data files. Running make in the index directory produces the executables crtforwardidx, crtinvertedidx, docindex, docsegment, snapshot, and tsesearch. Of these, tsesearch and snapshot are the CGI programs for querying and for page snapshots, while the others are run locally to generate data files. Here is a brief walk-through of how to run them. The index directory contains a data directory, which holds the Tianwang.raw.2559638448 file; the query service subsystem also reads its data files from this directory, so we generate all the data files in index/data. The first step is therefore to copy the executables crtforwardidx, crtinvertedidx, docindex, and docsegment into index/data, and then run them from the command line in that directory.
1>./docindex
This command generates Doc.idx (the web page index), URL.idx (the URL index), and docid2url.idx (the docid-to-URL index). The program takes no arguments because it always reads the Tianwang.raw.2559638448 file in the current directory; the file name is fixed in the program, as will be seen later when analyzing the source code. Ideally, the program would accept the raw web page data file to analyze as a command-line argument, the way several of the other programs do.
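To make the readme's "fixed width index, ordered by docid" idea concrete, here is a minimal Python sketch. The record layout (8-byte offset, 4-byte length, 16-byte MD5 checksum) and the function names are assumptions made for illustration only; the actual on-disk format of Doc.idx is not given in this article and may well differ.

```python
import hashlib
import struct

# Hypothetical record layout (an assumption, not TSE's actual format):
# 8-byte offset into the raw file, 4-byte document length, 16-byte MD5 checksum.
RECORD = struct.Struct("<QI16s")


def build_doc_index(documents):
    """Pack one fixed-width record per document, in docid order."""
    index = bytearray()
    offset = 0
    for body in documents:
        index += RECORD.pack(offset, len(body), hashlib.md5(body).digest())
        offset += len(body)
    return bytes(index)


def lookup(index, docid):
    """Fetch the record for docid by offset arithmetic; no scan is needed."""
    return RECORD.unpack_from(index, docid * RECORD.size)


docs = [b"<html>first page</html>", b"<html>second page</html>"]
idx = build_doc_index(docs)
offset, length, checksum = lookup(idx, 1)
```

Because every record has the same size, the entry for a given docid sits at docid * RECORD.size, which is what makes an index ordered by docid cheap to consult.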
2> sort URL.idx | uniq > URL.idx.sort_uniq
sort is a Linux command. This step sorts URL.idx lexicographically; because the first field of each record in URL.idx is the MD5 value of the URL, the result is sorted by the URL's MD5 value. uniq then removes duplicate URL records. The final URL.idx.sort_uniq is therefore an index table that is de-duplicated and sorted by the URL's MD5 value. The sorting makes lookups more efficient (as described in Section 2).
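The following Python sketch shows why a sorted, de-duplicated table helps: once the records are ordered by MD5 value, a URL can be found with a binary search instead of a full scan. It assumes each record is a text line of the form "md5<TAB>docid"; that field separator, and the function names, are assumptions for illustration and not taken from the TSE source.

```python
import bisect
import hashlib


def url_checksum(url):
    """Hex MD5 of the URL, used as the sort key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def build_url_index(pairs):
    """Mimic `sort | uniq`: drop identical lines, then sort them."""
    lines = [f"{url_checksum(url)}\t{docid}" for url, docid in pairs]
    return sorted(set(lines))


def find_docid(index_lines, url):
    """Binary-search the sorted lines for the URL's checksum."""
    key = url_checksum(url)
    pos = bisect.bisect_left(index_lines, key)
    if pos < len(index_lines) and index_lines[pos].startswith(key + "\t"):
        return int(index_lines[pos].split("\t")[1])
    return None  # URL is not in the index


pairs = [
    ("http://example.com/a", 0),
    ("http://example.com/b", 1),
    ("http://example.com/a", 0),  # duplicate record, removed by set()
]
index_lines = build_url_index(pairs)
docid = find_docid(index_lines, "http://example.com/b")
```

The search works because the bare checksum sorts immediately before any line that starts with it, so bisect_left lands on the matching record if one exists.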
3>./docsegment Tianwang.raw.2559638448
This step segments the body text of each web page in the raw web page data file. The HTML tags are stripped first to obtain the body text, and then the Chinese word segmentation module is called to split the body into individual words. Running the command produces Tianwang.raw.2559638448.seg, whose content is shown in Figure 1. Each record occupies two lines: the first line is the docid of the web page, and the second line is the body text after word segmentation, with a separator between words.
Figure 1
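The two stages of this step (strip the tags, then segment) can be sketched in Python as follows. This is only an illustration under stated assumptions: the article does not say which segmentation algorithm TSE uses, so forward maximum matching against a small dictionary is swapped in here, the regex-based tag stripping is deliberately crude, and the "/" separator in the usage lines is a hypothetical choice.

```python
import re


def strip_tags(html):
    """Crudely drop script/style blocks and tags; real HTML needs a parser."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]*>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def segment(text, dictionary, max_len=4):
    """Forward maximum matching (an assumed algorithm, not confirmed for TSE):
    at each position take the longest dictionary word, else one character."""
    words = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def seg_record(docid, words, sep):
    """Two-line record: docid on the first line, segmented text on the second."""
    return f"{docid}\n{sep.join(words)}\n"


dictionary = {"搜索", "引擎", "搜索引擎"}  # stand-in for words.dict
body = strip_tags("<html><body><p>搜索引擎技术</p></body></html>")
record = seg_record(0, segment(body, dictionary), sep="/")
```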
4>./crtforwardidx Tianwang.raw.2559638448.seg > moon.fidx
This step builds a forward index file from the segmented web page text (Tianwang.raw.2559638448.seg). The command takes Tianwang.raw.2559638448.seg as its argument and writes to standard output; here the Linux redirection operator (>) sends the output to moon.fidx, which is the forward index file (the docid-to-keyword index). As shown in Figure 2, the first field of each line is a word from the web page and the second field is the docid of that web page; the two fields are separated by \t.
Figure 2
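As a rough illustration of what this step does, the Python sketch below turns two-line .seg records into "word<TAB>docid" lines. The function name is made up for this example, and the word separator is passed in by the caller because the article does not state which character the .seg file uses ("/" in the usage lines is a hypothetical choice).

```python
def forward_index_lines(seg_lines, sep):
    """Yield one 'word<TAB>docid' line per word of each two-line record."""
    lines = iter(seg_lines)
    # zip() over the same iterator pairs each docid line with its text line.
    for docid_line, words_line in zip(lines, lines):
        docid = docid_line.strip()
        for word in words_line.strip().split(sep):
            if word:
                yield f"{word}\t{docid}"


seg_lines = [
    "0\n", "search/engine/tech\n",
    "1\n", "engine/search\n",
]
forward = list(forward_index_lines(seg_lines, sep="/"))
```

Since the records are read in file order, the output is naturally grouped by docid, which is why a separate sort by keyword is needed before the inverted index can be built.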
5> sort moon.fidx > moon.fidx.sort
This step sorts the forward index file generated in the previous step by keyword (dictionary order); sort is a Linux command. Before sorting, the file is ordered by docid; after sorting, it is ordered by keyword, so all entries for the same keyword are adjacent, which makes it easy to build the inverted index in the next step. The content of the sorted forward index file moon.fidx.sort is shown in Figure 3.
Figure 3
6>./crtinvertedidx moon.fidx.sort > sun.iidx
This step builds the inverted index by merging all entries for the same keyword in the sorted forward index file from the previous step. The command takes moon.fidx.sort as its argument, and the output is redirected to sun.iidx. The content of sun.iidx is shown in Figure 4: each record occupies one line, the first field is the keyword, and the second field is the list of docids of the web pages in which the keyword appears (docids are separated by spaces). Figure 4 shows the inverted index entry for the word "Search". The query service subsystem can therefore look up a keyword in the inverted table to obtain the docids of all web pages that contain it.
Figure 4
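The merge described in this step is easy to sketch in Python: because the forward index is already sorted by keyword, equal keywords are adjacent and can be collapsed in a single pass. The function names are invented for this example, and the sketch assumes the keyword and the docid list are separated by a tab, which the article does not state for sun.iidx.

```python
from itertools import groupby


def inverted_index_lines(sorted_forward_lines):
    """Collapse adjacent 'word<TAB>docid' lines into 'word<TAB>docid docid ...'."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_forward_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        docids = [docid for _, docid in group]
        yield f"{word}\t{' '.join(docids)}"


def load_postings(inverted_lines):
    """Parse inverted index lines into a keyword -> list of docids mapping."""
    postings = {}
    for line in inverted_lines:
        word, docids = line.rstrip("\n").split("\t")
        postings[word] = docids.split(" ")
    return postings


forward = ["search\t0", "engine\t0", "tech\t0", "engine\t1", "search\t1"]
inverted = list(inverted_index_lines(sorted(forward)))
postings = load_postings(inverted)
hits = postings.get("search", [])
```

Note that groupby only merges neighbouring lines, so running it on an unsorted forward index would produce several partial entries for the same keyword; this is exactly why the sort in the previous step is required.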
All the data files required by the query service subsystem have now been generated, and the work of the preprocessing subsystem is done. The source code for the programs run above (crtforwardidx, crtinvertedidx, docindex, and docsegment) is in crtforwardidx.cpp, crtinvertedidx.cpp, docindex.cpp, and docsegment.cpp in the index directory. This code is very simple, consisting of straightforward text processing, so this series of notes will not walk through it in detail; readers can read it on their own. This concludes the introduction to the preprocessing subsystem.