[Project] Introduction to Minisearch Text retrieval

Source: Internet
Author: User
Tags: file handling, rewind, idf

1. Preprocessing

Preprocessing is mainly used to generate, in advance, the data that the program will need, in order to speed up processing later on.

The preprocessing stage mainly generates the three files required by the program: the web page library file, the web page location information file, and the inverted index file.

Web page library file

The web page library file Ripepage.lib stores a large number of web pages in a fixed format; the format of each page's data is:

<doc>

<docid>id</docid>

<docurl>url</docurl>

<doctitle>title</doctitle>

<doccontent>content</doccontent>

</doc>

Web location information file

The web page location information file Offset.lib stores the offset position of each page inside the page library, so that the program can quickly fetch a specified page. Each row of the file stores the location information of one page in the page library; the format of each line is:

DocId Offset Size

Where DocId is the ID of the web page (this ID is globally unique), Offset is the number of bytes from the beginning of the page library file to where the document starts, and Size is the size of the document in bytes.
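As a rough illustration of how this file can be used (a minimal sketch, not the project's exact code; the function name loadOffset is an assumption), the whole offset file can be loaded into a map keyed by DocId:

// Minimal sketch (assumed names): load "DocId Offset Size" lines from
// Offset.lib into a map keyed by DocId.
#include <fstream>
#include <unordered_map>
#include <utility>

std::unordered_map<int, std::pair<int, int> > loadOffset(const char* path) {
    std::unordered_map<int, std::pair<int, int> > offsetMap;
    std::ifstream in(path);
    int docid, offset, size;
    while (in >> docid >> offset >> size) {
        // offset: byte position of the document in the page library;
        // size: length of the document in bytes.
        offsetMap[docid] = std::make_pair(offset, size);
    }
    return offsetMap;
}

With the offset and size of a page known, the page can then be read with one fseek plus one fread of exactly Size bytes.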

Inverted index file

The inverted index file Invert.lib associates every word in the web page library (after word segmentation and stop-word removal) with the documents that contain that word.

The inverted index of each word occupies a row in the file, and each line is formatted as:

Word docid1 frequency1 weight1 ... docidi frequencyi weighti ...

Where word is a word from the web page library, followed by groups of three values: docid_i is a page that contains the word, frequency_i is the number of times the word appears in that document, and weight_i is the word's weight in that document (after normalization).
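For illustration, a line in this format could be parsed roughly as follows (a sketch with assumed names, not the project's code):

// Minimal sketch (assumed names): parse one inverted index line
// "word docid1 frequency1 weight1 docid2 frequency2 weight2 ...".
#include <sstream>
#include <string>
#include <vector>

struct Posting {
    int docid;
    int frequency;
    double weight;
};

bool parseInvertLine(const std::string& line,
                     std::string& word, std::vector<Posting>& postings) {
    std::istringstream sin(line);
    if (!(sin >> word)) {
        return false;          // empty line
    }
    Posting p;
    while (sin >> p.docid >> p.frequency >> p.weight) {
        postings.push_back(p); // one (docid, frequency, weight) triple
    }
    return true;
}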

2. Program Operation process

The program first reads the page location information from Offset.lib, then reads the page information from Ripepage.lib based on that information, and then reads the inverted index information from Invert.lib.

The program loops on a socket to accept requests from clients; once a request arrives, it forks a child process to handle the request while the main process continues to listen. The child process receives the query statement from the client, finds the results for that query, and returns them to the client.
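A bare-bones sketch of such an accept/fork loop is shown below; the function names serverLoop and doQuery, the port handling, and the stand-in JSON reply are assumptions for illustration, not the project's actual code:

// Minimal sketch (assumed names) of a fork-per-request server loop.
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Placeholder for the real query handling: receive the query string, search
// the in-memory index, and send the results back to the client.
void doQuery(int connfd) {
    char buf[1024] = "";
    recv(connfd, buf, sizeof(buf) - 1, 0);
    const char* reply = "{\"results\":[]}";   // stand-in JSON reply
    send(connfd, reply, strlen(reply), 0);
}

void serverLoop(unsigned short port) {
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(listenfd, (sockaddr*)&addr, sizeof(addr));
    listen(listenfd, 10);
    signal(SIGCHLD, SIG_IGN);   // let the kernel reap finished children
    while (1) {
        int connfd = accept(listenfd, NULL, NULL);
        if (connfd < 0) {
            continue;
        }
        pid_t pid = fork();
        if (pid == 0) {         // child process: handle this client
            close(listenfd);
            doQuery(connfd);
            close(connfd);
            _exit(0);
        }
        close(connfd);          // main process: keep listening
    }
}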

1. Building a Web page library

Generate the web page library Ripepage.lib and the page offset file Offset.lib.

Traverse the directory to find the files needed to build the web page library, stitch them into the standard format, write them into the library file, and build the library's offset index at the same time. The code is as follows:

#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vector>
#include <fstream>
#include <sys/types.h>
#include <dirent.h>
#include <sys/stat.h>
#include <stdexcept>
#include <unistd.h>
#include <map>
#include <utility>
#include <set>
#include <functional>
#include <algorithm>

// Directory scanning class: scans the items under the specified directory and
// saves the absolute paths of the items that are regular files.
class DirScan {
public:
    // Constructor: pass in the vector that will hold the absolute file paths.
    DirScan(std::vector<std::string>& vec) : m_vec(vec) {}

    // Overloaded function call operator: pass in the path to traverse.
    void operator()(const std::string& dir_name) { traverse(dir_name); }

    // Traversal algorithm: open the directory, enter it, and loop over its
    // entries; if an entry is a regular file, save its absolute path; if it is
    // a directory, recurse into it. After traversing a directory, go back up
    // to its parent.
    void traverse(const std::string& dir_name) {
        DIR* pdir = opendir(dir_name.c_str());
        if (pdir == NULL) {
            std::cout << "dir open error" << std::endl;
            exit(-1);
        }
        chdir(dir_name.c_str());
        struct dirent* mydirent;
        struct stat mystat;
        while ((mydirent = readdir(pdir)) != NULL) {
            // Get the attributes of this entry.
            stat(mydirent->d_name, &mystat);
            if (S_ISDIR(mystat.st_mode)) {
                // Skip "." and ".." (every directory contains these two
                // entries; without this check the program would recurse
                // forever).
                if (strcmp(mydirent->d_name, ".") == 0 ||
                    strcmp(mydirent->d_name, "..") == 0) {
                    continue;
                } else {
                    // A real subdirectory: traverse it recursively.
                    traverse(mydirent->d_name);
                }
            } else {
                // The entry is a regular file: save its absolute path.
                char cwd[1024] = "";
                getcwd(cwd, sizeof(cwd));
                std::string file_name = std::string(cwd) + "/" + mydirent->d_name;
                m_vec.push_back(file_name);
            }
        }
        chdir("..");
        closedir(pdir);
    }

private:
    // Reference to a vector outside the class that holds the file paths.
    std::vector<std::string>& m_vec;
};

// File processing class: formats each file as
// <doc><docid>id</docid><docurl>url</docurl><doctitle>title</doctitle><doccontent>content</doccontent></doc>
// and appends it to a single file, forming the web page library.
class FileProcess {
public:
    // First parameter: the vector holding the file paths.
    // Second parameter: the string used to locate the "title" line in a document.
    FileProcess(std::vector<std::string>& vec, std::string& str) : m_vec(vec) {
        m_title = str;
    }

    // Overloaded function call operator: pass in the name of the page library
    // file and of the offset file that records each document's position.
    void operator()(const std::string& file_name, const std::string& offset_file) {
        do_some(file_name, offset_file);
    }

    // Build the page library and save each document's offset in the library to
    // the offset file.
    void do_some(const std::string& file_name, const std::string& offset_file) {
        FILE* fp = fopen(file_name.c_str(), "w");          // page library file
        FILE* fp_offset = fopen(offset_file.c_str(), "w"); // offset file
        if (fp == NULL || fp_offset == NULL) {
            std::cout << "file open error" << std::endl;
            exit(0);
        }
        size_t index;
        char* mytxt = new char[1024 * 1024]();     // the whole formatted document
        int mydocid;
        char myurl[256] = "";
        char* mycontent = new char[1024 * 1024](); // document content
        char* mytitle = new char[1024]();          // document title
        // Process each document in turn: generate a globally unique document
        // ID, extract the title, use the absolute path as the URL, and read
        // the document content.
        for (index = 0; index != m_vec.size(); index++) {
            memset(mytxt, 0, 1024 * 1024);
            memset(myurl, 0, sizeof(myurl));
            memset(mycontent, 0, 1024 * 1024);
            memset(mytitle, 0, 1024);
            // Open the document, read its content and extract its title.
            FILE* fp_file = fopen(m_vec[index].c_str(), "r");
            read_file(fp_file, mycontent, mytitle);
            fclose(fp_file);
            mydocid = index + 1;
            strncpy(myurl, m_vec[index].c_str(), sizeof(myurl) - 1);
            // Format the document as a string in the specified format.
            sprintf(mytxt,
                    "<doc><docid>%d</docid><docurl>%s</docurl><doctitle>%s</doctitle><doccontent>%s</doccontent></doc>\n",
                    mydocid, myurl, mytitle, mycontent);
            // ftell returns the byte offset of the current file position
            // relative to the beginning of the file, i.e. where this document
            // starts inside the page library.
            int myoffset = ftell(fp);
            int mysize = strlen(mytxt);
            // One line of the offset file: document ID, offset, size.
            fprintf(fp_offset, "%d\t%d\t%d\n", mydocid, myoffset, mysize);
            // Write the formatted document to the page library.
            write_to_file(fp, mytxt);
        }
        fclose(fp);
        fclose(fp_offset);
        delete[] mytxt;
        delete[] mycontent;
        delete[] mytitle;
    }

    // Read the document content and extract its title, saving them into the
    // buffers pointed to by mycontent and mytitle respectively.
    void read_file(FILE* fp, char* mycontent, char* mytitle) {
        const int size = 1024 * 1024;
        char* line = new char[1024]();
        int pos = 0;
        // Read the whole document into mycontent.
        while (1) {
            int iret = fread(mycontent + pos, 1, size - pos, fp);
            if (iret == 0) {
                break;       // finished reading
            } else {
                pos += iret; // not finished: keep reading from where we stopped
            }
        }
        // Return the file pointer to the beginning of the document so that the
        // title can be extracted.
        rewind(fp);
        // count records the number of lines read so far; flag records whether
        // the title has been found (0 = not found, 1 = found).
        int count = 0, flag = 0;
        // Check the first 11 lines of the document for the word "title"; if a
        // line contains it, use that line as the title, otherwise use the next
        // line (line 12). If the whole document has fewer than 12 lines, use
        // the first line as the title.
        while (count <= 10 && fgets(line, 1024, fp) != NULL) {
            std::string str_line(line);
            // If the line contains the word "title"...
            if (str_line.find(m_title.c_str()) != std::string::npos) {
                // ...assign it to mytitle as the title.
                strncpy(mytitle, str_line.c_str(), str_line.size());
                flag = 1;
                break;
            }
            count++;
        }
        if (count < 11 && flag == 0) {
            // Fewer than 12 lines: the first line is the title.
            rewind(fp);
            fgets(mytitle, 1024, fp);
        } else if (count == 11 && flag == 0) {
            // At least 12 lines: line 12 is the title.
            fgets(mytitle, 1024, fp);
        }
        delete[] line;
    }

    // Write the formatted document to the page library file.
    void write_to_file(FILE* fp, char* mytxt) {
        int iret, pos = 0;
        int len = strlen(mytxt);
        // Loop until the whole buffer has been written.
        while (pos < len) {
            iret = fwrite(mytxt + pos, 1, len - pos, fp);
            pos += iret;
        }
    }

private:
    // Reference to the container where the file paths are saved.
    std::vector<std::string>& m_vec;
    // The string used to locate the title line.
    std::string m_title;
    std::map<int, std::pair<int, int> > m_offset;
};

// Small helper: print one file path.
void show(std::vector<std::string>::value_type& val) {
    std::cout << val << std::endl;
}

// Usage: exe src_txt_dir ripepage_filename offset_file_name
int main(int argc, char* argv[]) {
    // Container that holds the paths of the documents.
    std::vector<std::string> str_vec;
    // Define a directory scanning object and scan the source directory.
    DirScan mydirscan(str_vec);
    mydirscan(argv[1]);
    // The keyword used to locate the title line inside each document.
    std::string title = "title";
    // Define a file processing object and build the page library and offset file.
    FileProcess myfileprocess(str_vec, title);
    myfileprocess(argv[2], argv[3]);
    std::cout << "over" << std::endl;
    return 0;
}
2. Web page de-duplication

Web page de-duplication generates a new location offset file NewOffset.lib.

1. According to the page location offset file Offset.lib, the content of each page is read from the page library file Ripepage.lib into memory in turn.

A vector<Page> is used to store the web pages in memory.

Where Page is a custom class. For every web page file in the on-disk page library it extracts docid, doctitle, docurl and doccontent (these 4 items are stored as string) and the frequency of each word in the page (stored in unordered_map<string, int> mapwordfreq), and it encapsulates the method that computes the hash fingerprint of each page.
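For orientation, a Page class matching this description might be declared roughly as follows; the member and method names are assumptions based on the text above, not the project's actual header:

// Illustrative declaration of the Page class described above (assumed names).
#include <string>
#include <unordered_map>
#include <vector>

class Page {
public:
    // Fill mapwordfreq from the segmented, stop-word-filtered word list.
    void getWordFreq(std::vector<std::string>& vecword);

    // Compute the 64-bit SimHash fingerprint of this page from its word weights.
    void computeFingerprint();

    // Overloaded to decide, via the hash fingerprint, whether two pages are similar.
    bool operator==(const Page& rhs) const;

    std::string m_docid;       // document ID
    std::string m_doctitle;    // document title
    std::string m_docurl;      // document URL (absolute path)
    std::string m_doccontent;  // document content
    std::unordered_map<std::string, int> mapwordfreq;        // word -> frequency
    std::unordered_map<std::string, double> maptfidfofword;  // word -> normalized TF-IDF
};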

2. Use the word segmenter to segment the content of each page, put the segmentation result into a temporary vector<string>, and remove the stop words at the same time. The approximate code is as follows:
std::vector<std::string> Split::wordSplit(const char* pagecontent) {
    size_t pagecontentsize = strlen(pagecontent);
    char* contentaftersplit = new char[6 * pagecontentsize]();
    // The ICTCLAS (Chinese Academy of Sciences) word segmenter; the segmented
    // content is stored as a string in the contentaftersplit buffer.
    ICTCLAS_ParagraphProcess(pagecontent, pagecontentsize, contentaftersplit,
                             CODE_TYPE_GB, 0);
    std::istringstream sin(contentaftersplit);
    std::string word;
    // The words that survive stop-word filtering.
    std::vector<std::string> vecword;
    while (sin >> word) {
        if (!conf.setstoplist.count(word) && word[0] != '\r') {
            vecword.push_back(word);
        }
    }
    delete [] contentaftersplit;
    return vecword;
}
3. Count the frequency of words appearing on each page

Use unordered_map<string, int> mapwordfreq for storage

// Parameter vecword: the segmentation result of one page after stop-word removal.
void Page::getWordFreq(std::vector<std::string>& vecword) {
    // std::unordered_map<std::string, int> mapwordfreq is a data member of the
    // web page class Page.
    std::vector<std::string>::iterator iter;
    for (iter = vecword.begin(); iter != vecword.end(); iter++) {
        mapwordfreq[*iter]++;
    }
}
4. From the word frequency dictionary mapwordfreq of each page in vector<Page>, the set of words appearing across all pages can be obtained

Put every word of every page into a hash set, defined here as unordered_set<string> setallwords

Count, for each word in setallwords, the number of pages in which it appears: traverse each word in setallwords and check whether it occurs in the mapwordfreq of each page.

The result is stored in unordered_map<string, int> mapwordfreqinallpage, as in the sketch below.
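A minimal sketch of this step, assuming the Page class sketched earlier; the function name countDocFreq is an assumption:

// Minimal sketch (assumed names): collect all words and count, for each word,
// the number of pages that contain it.
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

void countDocFreq(const std::vector<Page>& vecpage,
                  std::unordered_set<std::string>& setallwords,
                  std::unordered_map<std::string, int>& mapwordfreqinallpage) {
    // Put every word of every page into the hash set.
    for (size_t i = 0; i != vecpage.size(); ++i) {
        std::unordered_map<std::string, int>::const_iterator it;
        for (it = vecpage[i].mapwordfreq.begin();
             it != vecpage[i].mapwordfreq.end(); ++it) {
            setallwords.insert(it->first);
        }
    }
    // For each word, count how many pages contain it.
    std::unordered_set<std::string>::const_iterator word;
    for (word = setallwords.begin(); word != setallwords.end(); ++word) {
        int pages = 0;
        for (size_t i = 0; i != vecpage.size(); ++i) {
            if (vecpage[i].mapwordfreq.count(*word)) {
                ++pages;
            }
        }
        mapwordfreqinallpage[*word] = pages;
    }
}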

5. Calculate the TF-IDF value for each word in each page

Use unordered_map<string, double> maptfidfofword for storage

Traversing the word frequency dictionary unordered_map<string, int> mapwordfreq obtained in step 3, and combining it with unordered_map<string, int> mapwordfreqinallpage, the TF-IDF value of each word is easy to obtain. In its standard form the calculation is:

    weight = tf_doc * log(N / df_word)

where tf_doc is the number of times the word appears in this page (its frequency), N is the total number of pages, and df_word is the number of pages in which the word appears.

The TF-IDF value indicates how important a word is in a web page: the stronger a word's ability to characterize the topic of the page, the larger its weight (TF-IDF value).

The TF-IDF values of the words in a page are then normalized; a standard choice is L2 normalization, dividing each weight by the square root of the sum of the squared weights of all words in that page:

    weight_i' = weight_i / sqrt(weight_1^2 + weight_2^2 + ... + weight_n^2)
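A self-contained sketch of these two steps (TF-IDF followed by L2 normalization); the free-function form, the name computeTfIdf, and the exact log-based formula are assumptions rather than the project's verbatim code:

// Minimal sketch (assumed names): compute normalized TF-IDF weights for one
// page from its word-frequency map and the global document-frequency map.
#include <cmath>
#include <string>
#include <unordered_map>

std::unordered_map<std::string, double> computeTfIdf(
    const std::unordered_map<std::string, int>& mapwordfreq,          // tf in this page
    const std::unordered_map<std::string, int>& mapwordfreqinallpage, // df over all pages
    int totalPages)                                                   // N
{
    std::unordered_map<std::string, double> maptfidfofword;
    double sumSquares = 0.0;
    std::unordered_map<std::string, int>::const_iterator it;
    for (it = mapwordfreq.begin(); it != mapwordfreq.end(); ++it) {
        std::unordered_map<std::string, int>::const_iterator dfIt =
            mapwordfreqinallpage.find(it->first);
        int df = (dfIt != mapwordfreqinallpage.end()) ? dfIt->second : 1;
        // Standard TF-IDF: tf * log(N / df).
        double w = it->second * std::log(static_cast<double>(totalPages) / df);
        maptfidfofword[it->first] = w;
        sumSquares += w * w;
    }
    // L2 normalization so that weights from pages of different lengths are comparable.
    double norm = std::sqrt(sumSquares);
    if (norm > 0.0) {
        std::unordered_map<std::string, double>::iterator wi;
        for (wi = maptfidfofword.begin(); wi != maptfidfofword.end(); ++wi) {
            wi->second /= norm;
        }
    }
    return maptfidfofword;
}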

6. Calculate the hash fingerprint for each page (Simhash method)

The SimHash method first maps each word, using the MD5 algorithm, to a 64-bit binary vector, and then folds the word's weight into that vector to form a real-valued vector. Assuming the weight (TF-IDF value) of a word is w, the binary vector is rewritten as follows: if a bit of the binary vector is 1, the corresponding position of the real vector becomes w; if the bit is 0, the corresponding position becomes -w, the negative of the weight. Through these rules the binary vector is rewritten as a real vector that reflects the word's weight.

After every word in the page has been rewritten in this way, the real vectors of all the words are added together to obtain a single real vector that represents the document as a whole.

In the final step this real vector is converted back into a binary vector: if the value at a position is greater than 0, that bit is set to 1; otherwise it is set to 0.

The hash fingerprints are stored in unordered_map<string (docid), string (fingerprint)> fingerprint.
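A compact sketch of the SimHash computation described above. The text specifies an MD5-based 64-bit word hash; to keep the sketch self-contained, std::hash<std::string> is used here as a stand-in, and the function name simhash is an assumption:

// Illustrative SimHash sketch; std::hash stands in for the MD5-derived
// 64-bit word hash described in the text.
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

uint64_t simhash(const std::unordered_map<std::string, double>& maptfidfofword) {
    double v[64] = {0.0};
    std::unordered_map<std::string, double>::const_iterator it;
    for (it = maptfidfofword.begin(); it != maptfidfofword.end(); ++it) {
        uint64_t h = std::hash<std::string>()(it->first);  // 64-bit hash of the word
        double w = it->second;                              // TF-IDF weight
        for (int bit = 0; bit < 64; ++bit) {
            // A 1 bit contributes +w to that position, a 0 bit contributes -w.
            v[bit] += ((h >> bit) & 1) ? w : -w;
        }
    }
    // Convert the accumulated real vector back to a 64-bit fingerprint:
    // positions with a value greater than 0 become 1, the rest become 0.
    uint64_t fingerprint = 0;
    for (int bit = 0; bit < 64; ++bit) {
        if (v[bit] > 0) {
            fingerprint |= (uint64_t(1) << bit);
        }
    }
    return fingerprint;
}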

7. Use the hash fingerprint to de-duplicate pages

If the Hamming distance between the hash fingerprints of two web pages is less than 3, the two pages are judged to be the same (similar) page.
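The duplicate test itself is then just a Hamming distance between the two fingerprints, for example (a sketch assuming the fingerprints are held as 64-bit integers; with the string representation mentioned above, the comparison would count differing characters instead):

// Minimal sketch: Hamming distance between two 64-bit fingerprints and the
// "fewer than 3 differing bits" duplicate test.
#include <cstdint>

int hammingDistance(uint64_t a, uint64_t b) {
    uint64_t x = a ^ b;     // bits that differ
    int dist = 0;
    while (x) {
        x &= (x - 1);       // clear the lowest set bit
        ++dist;
    }
    return dist;
}

bool isDuplicate(uint64_t fp1, uint64_t fp2) {
    return hammingDistance(fp1, fp2) < 3;
}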

8. During de-duplication, update vector<Page>
// Web page de-duplication.
void removeDupPage(std::vector<Page>& vecpage) {
    int i, j;
    for (i = 0; i != vecpage.size() - 1; i++) {
        for (j = i + 1; j != vecpage.size(); j++) {
            // operator== of the Page class is overloaded to use the hash
            // fingerprint to decide whether two pages are similar.
            if (vecpage[i] == vecpage[j]) {
                // Move the duplicate to the end and pop it off.
                Page tmp = vecpage[j];
                vecpage[j] = vecpage[vecpage.size() - 1];
                vecpage[vecpage.size() - 1] = tmp;
                vecpage.pop_back();
                j--;
            }
        }
    }
}

After the pages have been de-duplicated, generate the new position offset file NewOffset.lib.

Note: the configuration class Conf provides a method that loads the original position offset file Offset.lib into memory.

Offset.lib is stored in memory as: std::unordered_map<int, std::pair<int, int> > m_offset

void updateOffset(const std::vector<Page>& vecpage) {
    std::ofstream of(conf.m_conf["Mynewoffset"].c_str());
    if (!of) {
        std::cout << "open Mynewoffset fail" << std::endl;
        exit(0);
    }
    // Write the offset information of the remaining (de-duplicated) documents
    // back to the new offset file.
    size_t page_index;
    for (page_index = 0; page_index != vecpage.size(); page_index++) {
        int docid = atoi(vecpage[page_index].m_docid.c_str());
        of << docid << "   "
           << conf.m_offset[docid].first << "   "
           << conf.m_offset[docid].second << std::endl;
    }
    of.close();
}
3. Creating an inverted index file
Build the word-document inverted index file Invert.lib

Format: word1 <doc1, weight> <doc2, weight> ... <docn, weight>

        word2 <doc1, weight> <doc2, weight> ... <docn, weight>

        ...

        wordm <doc1, weight> <doc2, weight> ... <docn, weight>

Inverted index storage format in memory: std::unordered_map<std::string, std::vector<std::pair<int, double> > > mapreverseindex (for each word, a list of (docid, weight) pairs; the weight is the normalized TF-IDF value, hence a double).

// Generate the inverted index.
void invert_index(std::vector<Page>& vecpage,
                  std::unordered_map<std::string,
                      std::vector<std::pair<int, double> > >& mapreverseindex) {
    size_t index;
    // Iterate through every Page object.
    for (index = 0; index != vecpage.size(); index++) {
        // maptfidfofword: unordered_map<string, double>, the normalized TF-IDF
        // weight of every word in this page.
        std::unordered_map<std::string, double>::iterator iter;
        for (iter = vecpage[index].maptfidfofword.begin();
             iter != vecpage[index].maptfidfofword.end();
             iter++) {
            mapreverseindex[iter->first].push_back(
                std::make_pair(atoi(vecpage[index].m_docid.c_str()),
                               iter->second));
        }
    }
}

Note that these in-memory structures need to be written back from memory to files for persistence (backup).

4. Program Query logic

First, the query statement is segmented into words and the stop words are removed, giving a group of query words, and the weight of each word in the group is calculated (via TF*IDF). Then, using the inverted index of the page library (already loaded into memory), the documents that contain the query words are found. Next, the cosine similarity between each found document and the query statement (the query statement is treated as a document itself) is computed, and the found documents are sorted from large to small (the larger the cosine value, the higher the similarity). Finally the results are packaged as JSON-formatted data and returned.
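The ranking step can be sketched as a plain cosine similarity between two word-to-weight maps, with the query treated as a small document; the function name and representation here are assumptions:

// Minimal sketch (assumed names): cosine similarity between the query and a
// candidate document, both as word -> TF-IDF weight maps.
#include <cmath>
#include <string>
#include <unordered_map>

double cosineSimilarity(const std::unordered_map<std::string, double>& query,
                        const std::unordered_map<std::string, double>& doc) {
    double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
    std::unordered_map<std::string, double>::const_iterator it;
    for (it = query.begin(); it != query.end(); ++it) {
        qNorm += it->second * it->second;
        std::unordered_map<std::string, double>::const_iterator d = doc.find(it->first);
        if (d != doc.end()) {
            dot += it->second * d->second;   // shared word contributes to the dot product
        }
    }
    for (it = doc.begin(); it != doc.end(); ++it) {
        dNorm += it->second * it->second;
    }
    if (qNorm == 0.0 || dNorm == 0.0) {
        return 0.0;
    }
    return dot / (std::sqrt(qNorm) * std::sqrt(dNorm));
}

The candidate documents are then sorted by this value in descending order and the top results are serialized to JSON.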

To be continued.
