Back-end design, implementation, and optimization of a lightweight text search engine


Overall architecture diagram

See: HTTP://R.PHOTO.STORE.QQ.COM/PSB?/V12VVUOZ2VXBMG/M2GZPWFNBLS8BUBT*16Y2XM9QKAAP8TMEPOLIPC1MLM!/R/DFMAAAAAAAAA

1.1 Build Library--word frequency library, word index library

Process:

Project Package:

1.1.1 Build Library--Chinese corpus file

Main process:

Word segmentation uses ICTCLAS from the Chinese Academy of Sciences. Example:

Hangzhou Changchun Pharmacy. -> "Hangzhou/ns Changchun/nz pharmacy/n."

1.1.2 Build Library--word frequency library

Data:

hash_map<string, int, myhashfn>
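
A minimal sketch of how the word frequency library could be filled from segmented text. It assumes that tokens are separated by whitespace and have the form word/tag, as in the ICTCLAS example. `std::unordered_map` stands in for `hash_map` with a custom hash functor, and the function name `countWords` is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

// Count how often each word occurs in one line of segmented text.
// Assumption: tokens look like "word/tag" and are separated by whitespace.
void countWords(const std::string& segmentedLine,
                std::unordered_map<std::string, int>& freq) {
    std::istringstream in(segmentedLine);
    std::string token;
    while (in >> token) {
        // Drop the part-of-speech tag after the last '/', if there is one.
        std::string::size_type slash = token.rfind('/');
        std::string word = token.substr(0, slash);
        if (!word.empty()) ++freq[word];
    }
}
```

Calling `countWords` once per corpus line accumulates the counts in a single map, which can then be written out as the word frequency library.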

Example :

1.1.3 Build Library--word index library

Data:

hash_map<string, set<string>, myhashfn>

Example :

1.1.3 Build Library--word index library, UTF-8 encoding

To extract one Chinese character, derive its length in bytes from the first byte of its UTF-8 encoding:

1 byte 0xxxxxxx
2 bytes 110xxxxx 10xxxxxx
3 bytes 1110xxxx 10xxxxxx 10xxxxxx
4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

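    // Compare the whole prefix of the lead byte, not a single bit: testing
    // bit 4 alone would also match ASCII bytes such as '0' (0x30).
    unsigned char c = static_cast<unsigned char>(word[i]);
    if ((c & 0xF8) == 0xF0)
        key = word.substr(i, 4);    // 11110xxx
    else if ((c & 0xF0) == 0xE0)
        key = word.substr(i, 3);    // 1110xxxx
    else if ((c & 0xE0) == 0xC0)
        key = word.substr(i, 2);    // 110xxxxx
    else
        key = word.substr(i, 1);    // 0xxxxxxx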
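
The lead-byte rule can be wrapped into a helper that splits a word into UTF-8 characters, which is then enough to fill the word index library. This is only a sketch: it assumes the index maps each character to the set of words containing it (suggested by the `key` extraction above), uses `std::unordered_map` in place of `hash_map`, and the names `utf8Length`, `splitUtf8`, and `addToIndex` are hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Byte length of the UTF-8 character that starts with `lead`.
// Stray continuation bytes (10xxxxxx) are treated as length 1.
std::size_t utf8Length(unsigned char lead) {
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    return 1;                             // 0xxxxxxx
}

// Split a UTF-8 string into its characters.
std::vector<std::string> splitUtf8(const std::string& word) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < word.size();) {
        std::size_t len = utf8Length(static_cast<unsigned char>(word[i]));
        chars.push_back(word.substr(i, len));  // substr clamps at the end of the string
        i += len;
    }
    return chars;
}

using WordIndex = std::unordered_map<std::string, std::set<std::string>>;

// Register `word` under every character it contains.
void addToIndex(const std::string& word, WordIndex& index) {
    for (const std::string& key : splitUtf8(word)) index[key].insert(word);
}
```

With such an index, the candidates for correcting a misspelled word can be limited to the words that share at least one character with it.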

1.2 Build Library--web page library, web page offset library

Process:

Project Package :

1.2.1 Build Library--Web page library

Page record:

Each page record occupies one line (a single string).

Page record format:

<doc><docid>page ID</docid><docurl>page URL</docurl><doctitle>page title</doctitle><doccontent>page content</doccontent></doc>

Data Structure :

    struct Page
    {
        int ID;
        string url, title, content;
    };
…<int, Page>

Example:

1.2.1 Build Library--page offset Library

Offset record:

Each offset record occupies one line.

Offset record format:

page_id offset page_size

Data Structure:

hash_map<int, std::pair<int, int> >

Example:

1.3 Build Library--web page deduplication, building the inverted index

Process:

Project Package:

1.3.1 Build Library--web page deduplication

Top-k algorithm:

• Use the 10 most frequent words (Top 10) of each page as the feature of that page. If two words have the same frequency, order them by character.
• If the feature similarity of two pages is >= 60%, the two pages are treated as duplicates.

1.3.2 Build Library--create inverted index
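
Stepping back to the deduplication rule of 1.3.1 (Top 10 words as the page feature, 60% similarity means duplicate), a minimal sketch could look like this. The source does not say how the similarity is computed, so the code assumes it is the number of shared feature words divided by the size of the smaller feature set. The names `topWords` and `isDuplicate` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Top-k words of a page: higher frequency first, ties broken by the word itself.
std::set<std::string> topWords(const std::unordered_map<std::string, int>& freq,
                               std::size_t k = 10) {
    std::vector<std::pair<std::string, int>> items(freq.begin(), freq.end());
    std::sort(items.begin(), items.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    if (items.size() > k) items.resize(k);
    std::set<std::string> features;
    for (const auto& item : items) features.insert(item.first);
    return features;
}

// Assumption: similarity = shared features / size of the smaller feature set.
bool isDuplicate(const std::set<std::string>& a, const std::set<std::string>& b) {
    if (a.empty() || b.empty()) return false;
    std::size_t shared = 0;
    for (const std::string& w : a) shared += b.count(w);
    // shared / smaller >= 0.6, written without floating point
    return shared * 10 >= std::min(a.size(), b.size()) * 6;
}
```

Each new page would be compared against the features of the pages already accepted, and dropped when `isDuplicate` returns true.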

Index:

Each index entry occupies one line.

Index format:

word page_id weight page_id weight ...

Data:

hash_map<std::string, std::set<std::pair<int, double> >, myhashfn>

Example:

Calculation of the weight value:

TF*IDF framework:
• tf (term frequency): how many times the word occurs in a document
• df (document frequency): the number of documents in which the word occurs
• N: the total number of documents
• idf (inverse document frequency)

IDF reflects how a feature word is distributed over the whole document collection: the more documents a word appears in, the lower its IDF value and the weaker its ability to distinguish between documents.

A number of experiments have shown that using the following formula works better:

Vector space Model:

Calculation of cosine similarity

Suppose the word w appears 50 times in page A, which contains 100 words, and 100 times in page B, which contains 1000 words. The weight of w in A should clearly be greater, but the raw result is the opposite. Therefore, the weights should be normalized:
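
The weighting and normalization can be sketched as follows. The exact formulas are not preserved in this text, so the code assumes the common choices w = tf * log(N / df) and division of each page vector by its Euclidean length. `std::unordered_map` replaces `hash_map`, and the names `TermCounts`, `InvertedIndex`, and `buildIndex` are hypothetical.

```cpp
#include <cmath>
#include <map>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using TermCounts = std::unordered_map<std::string, int>;  // word -> tf within one page
using InvertedIndex =
    std::unordered_map<std::string, std::set<std::pair<int, double>>>;  // word -> (page_id, weight)

// pages: page_id -> term counts.
// Assumed weighting: w = tf * log(N / df), then each page vector is
// divided by its Euclidean length so that long pages do not dominate.
InvertedIndex buildIndex(const std::map<int, TermCounts>& pages) {
    std::unordered_map<std::string, int> df;
    for (const auto& page : pages)
        for (const auto& term : page.second) ++df[term.first];

    const double n = static_cast<double>(pages.size());
    InvertedIndex index;
    for (const auto& page : pages) {
        std::vector<std::pair<std::string, double>> weights;
        double squares = 0.0;
        for (const auto& term : page.second) {
            double w = term.second * std::log(n / df[term.first]);
            weights.push_back(std::make_pair(term.first, w));
            squares += w * w;
        }
        const double norm = std::sqrt(squares);
        for (const auto& w : weights)
            index[w.first].insert(
                std::make_pair(page.first, norm > 0.0 ? w.second / norm : 0.0));
    }
    return index;
}
```

With page A holding w 50 times among 100 words and page B holding it 100 times among 1000 words, the raw tf * idf value is larger for B, while the normalized weight is larger for A, which is the behaviour the paragraph above asks for.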

            

2.1 Thread pool resources

• Web pages
…<int, Page>

• Page offsets
hash_map<int, std::pair<int, int> >

• Inverted index
hash_map<std::string, std::set<std::pair<int, double> >, myhashfn>

• Word frequency
map<string, int>

• Word index
map<string, set<string> >

• Stop words
hash_map<std::string, std::string, myhashfn>

• epoll events, socket worker threads, cache-saving thread
• Thread pool error correction cache, thread pool query result cache
• Task queue
    class Task
    {
    public:
        int ...;   // socket descriptor
        ...        // next task
    };
vector<Task>

2.2 Socket job details

• Create the listening socket and register it with epoll; when the listening descriptor fires, accept the client and register a read event for it
• When a read event fires, add a task to the task queue
• When a write event fires, output the query result
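
The epoll bookkeeping behind these steps can be sketched with three small helpers. This is Linux-specific and only illustrates registration and readiness polling; accepting clients, reading the query, and the task queue are left out. The names `watch`, `rewatch`, and `readableFds` are hypothetical.

```cpp
#include <sys/epoll.h>
#include <unistd.h>  // read(), write(), close() used by the event handlers
#include <cstdint>
#include <vector>

// Register `fd` with the epoll instance for the given events (EPOLLIN, EPOLLOUT, ...).
bool watch(int epfd, int fd, std::uint32_t events) {
    epoll_event ev{};
    ev.events = events;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0;
}

// Switch an already registered descriptor to a different event set,
// e.g. from EPOLLIN to EPOLLOUT once the query result is ready.
bool rewatch(int epfd, int fd, std::uint32_t events) {
    epoll_event ev{};
    ev.events = events;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) == 0;
}

// Wait up to timeoutMs and return the descriptors that are ready for reading.
std::vector<int> readableFds(int epfd, int timeoutMs) {
    epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, timeoutMs);
    std::vector<int> ready;
    for (int i = 0; i < n; ++i)  // n is -1 on error, so the loop is skipped
        if (events[i].events & EPOLLIN) ready.push_back(events[i].data.fd);
    return ready;
}
```

In the flow described above, the listening socket and every accepted client would be passed to `watch` with `EPOLLIN`, and a worker that has finished a query would call `rewatch` with `EPOLLOUT` so that the socket thread sends the result.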

2.3.1 Worker thread details--error correction module

• Edit distance algorithm

The edit distance between two strings is the minimum number of edit operations needed to turn one into the other. The permitted operations are replacing one character with another, inserting a character, and deleting a character.

Example:

Compute the edit distance between string X and string Y.

dp[i][j] is the edit distance between the first i characters of X and the first j characters of Y.

    if (X[i-1] == Y[j-1])
        dp[i][j] = dp[i-1][j-1];                       // same last character
    else
    {
        int t1 = dp[i-1][j];                           // delete the i-th character of X
        t1 = t1 < dp[i][j-1] ? t1 : dp[i][j-1];        // delete the j-th character of Y
        t1 = t1 < dp[i-1][j-1] ? t1 : dp[i-1][j-1];    // replace the last character so both match
        dp[i][j] = t1 + 1;
    }

Because the edit distance has to be computed over Chinese text, char is replaced with string and string with vector<string>, so that each element holds one complete character.
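
Put together, a character-level edit distance might look like the sketch below. It splits both strings into UTF-8 characters first, so a Chinese character counts as one edit rather than three. The `correct` helper additionally applies the tie-break used later in this section (on equal distance, prefer the word with the higher frequency); it scans the whole frequency map for simplicity, whereas a real system would presumably narrow the candidates first. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Split a UTF-8 string into characters (the lead byte determines the length).
std::vector<std::string> toChars(const std::string& s) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = (c & 0xF8) == 0xF0 ? 4
                        : (c & 0xF0) == 0xE0 ? 3
                        : (c & 0xE0) == 0xC0 ? 2 : 1;
        chars.push_back(s.substr(i, len));
        i += len;
    }
    return chars;
}

// Edit distance counted in characters, not bytes.
int editDistance(const std::string& a, const std::string& b) {
    const std::vector<std::string> x = toChars(a), y = toChars(b);
    std::vector<std::vector<int>> dp(x.size() + 1, std::vector<int>(y.size() + 1, 0));
    for (std::size_t i = 0; i <= x.size(); ++i) dp[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= y.size(); ++j) dp[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= x.size(); ++i)
        for (std::size_t j = 1; j <= y.size(); ++j)
            dp[i][j] = x[i - 1] == y[j - 1]
                ? dp[i - 1][j - 1]
                : 1 + std::min({dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]});
    return dp[x.size()][y.size()];
}

// Pick the known word closest to `word`; on equal distance prefer the higher frequency.
// Returns `word` unchanged when the frequency map is empty.
std::string correct(const std::string& word,
                    const std::unordered_map<std::string, int>& freq) {
    std::string best = word;
    int bestDist = -1, bestFreq = -1;
    for (const auto& entry : freq) {
        int d = editDistance(word, entry.first);
        if (bestDist < 0 || d < bestDist || (d == bestDist && entry.second > bestFreq)) {
            best = entry.first;
            bestDist = d;
            bestFreq = entry.second;
        }
    }
    return best;
}
```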

• Take a task and receive the client's query
• Segment the query into words
• Optimize the segmentation result and remove stop words

ICTCLAS may segment misspelled input incorrectly, for example:

1. Remove stop words from the segmentation result, such as "the" and "?".

2. A simple algorithm for repairing the segmentation of misspelled words:

(1) When several single characters appear in a row, merge them into one word

(2) When an isolated single character appears and is not the first word, merge it into the word on its left
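
One possible reading of these two steps in code, as a sketch only. It assumes that a "single character" is a token made of exactly one UTF-8 character, and it uses a `std::unordered_set` for the stop words instead of the `hash_map<std::string, std::string>` listed among the thread pool resources. The names `isSingleChar` and `cleanTokens` are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// True when the token consists of exactly one UTF-8 character.
bool isSingleChar(const std::string& token) {
    if (token.empty()) return false;
    unsigned char c = static_cast<unsigned char>(token[0]);
    std::size_t len = (c & 0xF8) == 0xF0 ? 4
                    : (c & 0xF0) == 0xE0 ? 3
                    : (c & 0xE0) == 0xC0 ? 2 : 1;
    return token.size() == len;
}

std::vector<std::string> cleanTokens(const std::vector<std::string>& tokens,
                                     const std::unordered_set<std::string>& stopWords) {
    // Step 1: drop stop words.
    std::vector<std::string> kept;
    for (const std::string& t : tokens)
        if (!t.empty() && stopWords.count(t) == 0) kept.push_back(t);

    // Step 2: merge single characters.
    std::vector<std::string> out;
    for (std::size_t i = 0; i < kept.size();) {
        if (!isSingleChar(kept[i])) {
            out.push_back(kept[i++]);
            continue;
        }
        // Collect the run of consecutive single characters starting at i.
        std::size_t j = i;
        std::string run;
        while (j < kept.size() && isSingleChar(kept[j])) run += kept[j++];
        if (j - i >= 2 || out.empty())
            out.push_back(run);   // rule (1), or an isolated character in first position
        else
            out.back() += run;    // rule (2): attach to the word on the left
        i = j;
    }
    return out;
}
```

For example, the tokens "ab", "c", "d", "ef" become "ab", "cd", "ef" under rule (1), and "ab", "c", "ef" become "abc", "ef" under rule (2).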

Optimization results:

• Look up the query result cache
• If found, generate JSON from the query result set, register a write event, and finish
• If not found, look up each word in the error correction cache
• If still not found, correct the word (when several candidates have the same edit distance, choose the one with the highest word frequency) and update the error correction cache

2.3.2 Worker thread details--query module

• Look up the query result cache again
• If found, generate JSON from the query result set, register a write event, and finish
• If not found, find the pages that contain all of the keywords
• Build the vector space model, compute the cosine similarity, sort, and generate the query result set
• Generate JSON, register a write event, and finish

Cosine similarity calculation

The weights of the feature words in the query form vector a.

The weights of the corresponding feature words in the web page form vector b.

The cosine similarity of the query to the page:

cos(a, b) = (a · b) / (|a| * |b|)
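
A short sketch of the cosine similarity and of ranking candidate pages by it. It assumes that both vectors list the weights of the same feature words in the same order; the page list uses the `vector<pair<int, vector<double>>>` shape of the query result cache, and the names `cosineSimilarity` and `rankPages` are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Cosine of the angle between a and b.
// Returns 0 when the sizes differ or either vector has zero length.
double cosineSimilarity(const std::vector<double>& a, const std::vector<double>& b) {
    if (a.size() != b.size()) return 0.0;
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Return the page ids ordered from most to least similar to the query.
std::vector<int> rankPages(const std::vector<double>& query,
                           const std::vector<std::pair<int, std::vector<double>>>& pages) {
    std::vector<std::pair<double, int>> scored;
    for (const auto& p : pages)
        scored.push_back(std::make_pair(cosineSimilarity(query, p.second), p.first));
    std::stable_sort(scored.begin(), scored.end(),
                     [](const std::pair<double, int>& l, const std::pair<double, int>& r) {
                         return l.first > r.first;
                     });
    std::vector<int> ids;
    for (const auto& s : scored) ids.push_back(s.second);
    return ids;
}
```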

2.4 Cache-saving thread work details

• Error correction cache
hash_map<string, string, myhashfn>  // word before correction -> word after correction

• Query result cache
…<set<string>, vector<pair<int, vector<double> > > >

• Periodically scan the worker threads in the thread pool

• For each worker thread scanned, write the worker's cache content over the thread pool cache, then write the thread pool cache content over the worker's cache

• After the scan is complete, write the thread pool cache back to disk
