Backend design, implementation, and optimization of a lightweight text search engine

Last Update:2016-03-06 Source: Internet

Author: User


Main frame diagram

See: HTTP://R.PHOTO.STORE.QQ.COM/PSB?/V12VVUOZ2VXBMG/M2GZPWFNBLS8BUBT*16Y2XM9QKAAP8TMEPOLIPC1MLM!/R/DFMAAAAAAAAA

1.1 Build Library--Word frequency library and word index library

Process:

Project Package:

1.1.1 Build Library--Chinese corpus file

Main process:

Word segmentation uses ICTCLAS, the segmenter from the Chinese Academy of Sciences. Example:

"Hangzhou Changchun Pharmacy." -> "Hangzhou/ns Changchun/nz pharmacy/n ."

1.1.2 Build Library--Word frequency library

Data:

hash_map<string, int, myhashfn>

Example:

1.1.3 Build Library--Word index library

Data:

hash_map<string, set<string>, myhashfn>

Example:

1.1.3 Build Library--Word index library: UTF-8 encoding

To extract a single Chinese character, read the first byte of its UTF-8 encoding; the leading bits give the length of the character in bytes:

1 byte 0xxxxxxx
2 bytes 110xxxxx 10xxxxxx
3 bytes 1110xxxx 10xxxxxx 10xxxxxx
4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

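unsigned char c = word[i];          // first byte of the current character
if ((c & 0xF8) == 0xF0)             // 11110xxx: 4 bytes
    key = word.substr(i, 4);
else if ((c & 0xF0) == 0xE0)        // 1110xxxx: 3 bytes
    key = word.substr(i, 3);
else if ((c & 0xE0) == 0xC0)        // 110xxxxx: 2 bytes
    key = word.substr(i, 2);
else                                // 0xxxxxxx: 1 byte
    key = word.substr(i, 1);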
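The length rule in the table above can be packaged as a small helper that splits a whole UTF-8 string into characters. This is only a sketch: the names utf8_char_len and split_utf8 are hypothetical and do not come from the original project, and malformed lead bytes are simply treated as one-byte characters.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Byte length of a UTF-8 character, read from its first byte.
// Assumption: a malformed lead byte (e.g. a stray 10xxxxxx) counts as length 1.
std::size_t utf8_char_len(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

// Splits a UTF-8 string into characters, each returned as its own string.
std::vector<std::string> split_utf8(const std::string& word) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < word.size();) {
        std::size_t len = utf8_char_len(static_cast<unsigned char>(word[i]));
        chars.push_back(word.substr(i, len));  // substr clamps at the end of the string
        i += len;
    }
    return chars;
}
```

Testing the full bit pattern of the lead byte, rather than one bit at a time, keeps ASCII and two-byte characters from being mistaken for longer ones.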

1.2 Build Library--Web page library and page offset library

Process:

Project Package :

1.2.1 Build Library--Web page library

Page record:

Each page record is one line (a string).

Page record format:

<doc><docid>page ID</docid><doctitle>page title</doctitle><doccontent>page content</doccontent></doc>

Data structure:

struct Page
{
    int    id;
    string url, title, content;
};

1 int

Example:

1.2.2 Build Library--Page offset library

Offset record:

Each offset record is one line.

Offset record format:

page_id offset page_size

Data Structure :

hash_map<int, std::pair<int, int> >

Example:

1.3 Build Library--Page deduplication and inverted index

Process:

Project Package:

1.3.1 Build Library--Page deduplication

Top-k algorithm:

• Use the 10 most frequent words of each page as the features of that page. If two words have the same frequency, order them by character.
• If the feature similarity of two pages is >= 60%, the two pages are treated as duplicates.

1.3.2 Build Library--Create inverted index

Index:

Each index entry is one line.

Index format:

word page_id weight page_id weight ...

Data:

hash_map<std::string, std::set<std::pair<int, double> >, myhashfn>
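To make the index line format concrete, the following sketch parses one line of the form "word page_id weight page_id weight ..." into the structure above. It is an assumption-laden illustration: std::unordered_map replaces hash_map with the default hash, and Postings, InvertedIndex, and add_index_line are hypothetical names.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>

// word -> set of (page_id, weight); std::unordered_map stands in for hash_map.
typedef std::set<std::pair<int, double> > Postings;
typedef std::unordered_map<std::string, Postings> InvertedIndex;

// Parses one index line and adds its postings to the index.
// Returns false when the line contains no word at all.
bool add_index_line(const std::string& line, InvertedIndex& index) {
    std::istringstream in(line);
    std::string word;
    if (!(in >> word)) return false;
    Postings& postings = index[word];
    int page_id;
    double weight;
    while (in >> page_id >> weight)  // stops at the first incomplete pair
        postings.insert(std::make_pair(page_id, weight));
    return true;
}
```

Because the postings are a std::set of pairs, they stay ordered by page_id, which makes it straightforward to intersect the page lists of several query words.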

Example:

Calculation of the weight value:

• TF*IDF framework
• tf (term frequency): how often the word occurs in a document
• df (document frequency): the number of documents in which the word occurs
• N: the total number of documents
• idf (inverse document frequency)

IDF reflects how a feature word is distributed over the whole document collection: the more documents a word appears in, the lower its IDF and the weaker its ability to distinguish one document from another.

A number of experiments have shown that using the following formula works better:

Vector space model:

Calculation of cosine similarity

Suppose the word w appears 50 times in page A, which contains 100 words, and 100 times in page B, which contains 1000 words. The weight of w should clearly be greater in A, but the raw counts say the opposite. The weights should therefore be normalized:
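The weighting and normalization formulas are not preserved in the text, so the sketch below substitutes a common choice and should be read as an assumption, not as the original method: each weight is tf * log(N / df), and the weights of one page are then divided by the Euclidean norm of that page's weight vector. The function name normalized_tfidf is hypothetical.

```cpp
#include <cmath>
#include <map>
#include <string>

// Assumed formula: weight = tf * log(N / df), followed by division by the
// Euclidean norm of the page's weight vector. Not taken from the source.
std::map<std::string, double> normalized_tfidf(
        const std::map<std::string, int>& tf,  // word -> occurrences in this page
        const std::map<std::string, int>& df,  // word -> number of pages containing it
        int total_pages) {
    std::map<std::string, double> weights;
    double sum_sq = 0.0;
    for (std::map<std::string, int>::const_iterator it = tf.begin(); it != tf.end(); ++it) {
        std::map<std::string, int>::const_iterator d = df.find(it->first);
        if (d == df.end() || d->second <= 0) continue;  // no document frequency: skip
        double w = it->second * std::log(static_cast<double>(total_pages) / d->second);
        weights[it->first] = w;
        sum_sq += w * w;
    }
    if (sum_sq > 0.0) {
        double norm = std::sqrt(sum_sq);
        for (std::map<std::string, double>::iterator it = weights.begin(); it != weights.end(); ++it)
            it->second /= norm;
    }
    return weights;
}
```

After normalization every page's weight vector has length 1, so a long page no longer outweighs a short one merely because its raw counts are larger.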

2.1 Thread pool resources

• Web page library

1 int , page>

• Page offset

hash_map<int, std::pair<int, int> >

• Inverted index

hash_map<std::string, std::set<std::pair<int, double> >, myhashfn>

• Word frequency

map<string, int>

• Word index

map<string, set<string> >

• Stop words

hash_map<std::string, std::string, myhashfn>

• epoll events, socket, worker threads, cache-saving thread
• Thread pool error correction cache, thread pool query result cache
• Task queue

class Task
{
public:
    int ...;   // socket descriptor
    ...        // next task
};

vector<Task>

2.2 Socket work details

• Create the socket and register it with epoll
• When the listening socket descriptor fires, accept the client and register a read operation
• When a read event fires, add a task
• When a write event fires, output the query result

2.3.1 Worker thread details--Error correction module

• Edit distance algorithm

The edit distance between two strings is the minimum number of edit operations needed to transform one into the other. The permitted operations are replacing one character with another, inserting a character, and deleting a character.

Example:

Compute the edit distance between string X and string Y.

dp[i][j]: the edit distance between the first i characters of X and the first j characters of Y

if (X[i-1] == Y[j-1])
    dp[i][j] = dp[i-1][j-1];                       // same last character
else
{
    int t1 = dp[i-1][j];                           // delete the i-th character of X
    t1 = t1 < dp[i][j-1] ? t1 : dp[i][j-1];        // delete the j-th character of Y
    t1 = t1 < dp[i-1][j-1] ? t1 : dp[i-1][j-1];    // change the last character to match
    dp[i][j] = t1 + 1;
}

Because the edit distance has to be computed over Chinese text, each char becomes a string (one UTF-8 character) and each string becomes a vector<string>.
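Putting the recurrence and the vector<string> representation together, a complete function might look like the sketch below. It assumes the inputs have already been split into UTF-8 characters, one string per character; the name edit_distance and the base-case initialization are additions for illustration, not code from the original project.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Edit distance between two sequences of UTF-8 characters
// (each element of x and y holds exactly one character).
int edit_distance(const std::vector<std::string>& x, const std::vector<std::string>& y) {
    const std::size_t n = x.size(), m = y.size();
    std::vector<std::vector<int> > dp(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = 0; i <= n; ++i) dp[i][0] = static_cast<int>(i);  // delete i characters
    for (std::size_t j = 0; j <= m; ++j) dp[0][j] = static_cast<int>(j);  // insert j characters
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            if (x[i - 1] == y[j - 1]) {
                dp[i][j] = dp[i - 1][j - 1];  // same last character, no extra cost
            } else {
                // cheapest of replace, delete from x, delete from y, plus one edit
                dp[i][j] = 1 + std::min(dp[i - 1][j - 1],
                                        std::min(dp[i - 1][j], dp[i][j - 1]));
            }
        }
    }
    return dp[n][m];
}
```

Comparing whole characters rather than bytes means that replacing one Chinese character costs one edit instead of up to three.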

• Take a task and receive the client's query
• Word segmentation
• Word segmentation optimization and stop word removal

When the query contains a misspelling, ICTCLAS may segment it incorrectly, for example:

1. Remove stop words, such as "the" and "?", from the segmentation result.

2. A simple segmentation optimization algorithm for misspelled words:

(1) When consecutive single characters appear, merge them into one word.

(2) When an isolated single character appears and is not the first word, merge it into the word on its left.
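The two merging rules can be sketched as follows. The text does not say what happens to a merged run afterwards, so this version assumes a run of consecutive single characters becomes a word of its own and is not merged further to the left. The names utf8_chars and merge_single_chars are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of UTF-8 characters in s: every byte that is not a
// continuation byte (10xxxxxx) starts a new character.
static std::size_t utf8_chars(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) ++n;
    return n;
}

// Applies rule (1) and rule (2) to a segmented query.
std::vector<std::string> merge_single_chars(const std::vector<std::string>& words) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < words.size()) {
        if (utf8_chars(words[i]) != 1) {  // ordinary multi-character word
            out.push_back(words[i++]);
            continue;
        }
        std::size_t j = i;
        std::string run;
        while (j < words.size() && utf8_chars(words[j]) == 1) run += words[j++];
        if (j - i >= 2) out.push_back(run);        // rule (1): consecutive singles form one word
        else if (!out.empty()) out.back() += run;  // rule (2): isolated single joins the left word
        else out.push_back(run);                   // isolated single in first position stays
        i = j;
    }
    return out;
}
```

For instance, the segmentation {"ab", "c", "de"} becomes {"abc", "de"}, while {"a", "b", "cd"} becomes {"ab", "cd"}.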

Optimization results:

• Look up the query in the query result cache
• If found, generate JSON from the query result set, register a write operation, and finish
• If not found, look up each word in the error correction cache
• If the word is not found there, correct it (when several candidates have the same edit distance, choose the one with the highest word frequency) and update the error correction cache

2.3.2 Worker thread details--Query module

• Look up the query result cache again
• If found, generate JSON from the query result set, register a write operation, and finish
• If not found, find the pages that contain all keywords
• Build the vector space model, compute the cosine similarity, sort, and generate the query result set
• Generate JSON, register a write operation, and finish

Cosine similarity calculation

The weights of the feature words of the query form vector a.

The weights of the corresponding feature words in the web page form vector b.

The cosine similarity of the query and the page:

cos(a, b) = (a · b) / (|a| |b|)
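A minimal sketch of the cosine similarity computation, assuming a and b are given as equal-length vectors of weights in the same word order. The function name and the choice to return 0 for mismatched lengths or zero-length vectors are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity of two weight vectors: dot(a, b) / (|a| * |b|).
// Returns 0 when the sizes differ or either vector has zero norm.
double cosine_similarity(const std::vector<double>& a, const std::vector<double>& b) {
    if (a.size() != b.size()) return 0.0;
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Pages can then be sorted by this value in descending order, so the page whose weight vector points in the direction closest to the query comes first.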

2.4 Cache-saving thread work details

• Error correction cache

hash_map<string, string, myhashfn>  // word before correction -> word after correction

• Query Results Cache

1 Set<string>,vector<pair<int, vector<double> > >  //

• Periodically scan the worker threads in the thread pool

• For each worker thread scanned, write the cache content of the worker thread into the thread pool cache, then write the cache content of the thread pool back into the worker thread

• After the scan completes, write the thread pool cache back to disk
