Algorithm and principle of English word segmentation
Document-relevance calculation formulas are based on word frequency, so the quality of word segmentation is extremely important to any frequency-based relevance calculation.
In English (and other Western languages) the basic unit of language is the word, so word segmentation is particularly easy; it takes only three steps:
Split the text into words on spaces/punctuation/paragraph breaks
Filter out stop words
Extract stems (stemming)
Step one: split on spaces and punctuation
This is easy to do with regular expressions:
import re

pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():-_`]        # these are separate tokens
'''
re.findall(pattern, text)   # text: the string to be tokenized
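For example (sample sentence borrowed from the NLTK book listed in the references; because the groups above are non-capturing, re.findall returns the whole tokens):

text = "That U.S.A. poster-print costs $12.40..."
print(re.findall(pattern, text))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']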
Step two: remove stop words
Stop words are high-frequency words such as a/an/and/are/then. Because the relevance formulas are driven by word frequency, these words cause a great deal of interference in the calculation, so they need to be filtered out.
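A minimal sketch of this filtering step, assuming a small hand-picked stop list (a real system would use a fuller list, for example the one shipped with NLTK):

STOP_WORDS = {"a", "an", "and", "are", "then", "the", "is", "of", "to"}

def remove_stop_words(tokens):
    # Drop any token that appears in the stop list (case-insensitive)
    return [t for t in tokens if t.lower() not in STOP_WORDS]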
Step three: stemming
Stemming is a processing step specific to Western languages. English words have singular/plural forms and -ing/-ed inflections, but for relevance calculation they should count as the same word: apple and apples, or doing and done, for example. The purpose of stemming is to merge these variants.
There are three major mainstream stemming algorithms:
Porter stemmer
Lovins stemmer
Lancaster stemmer
Lucene's English analysis comes with three stemmers:
EnglishMinimalStemmer
the famous Porter stemmer
KStemmer
Stemming algorithms are not complex: they are essentially a set of rules, or a mapping table, and are easy to program. However, designing them requires an expert in the language who understands its word formation.
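For example, the Porter and Lancaster stemmers listed above are available in NLTK (see the references below); a minimal sketch:

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
for word in ["apples", "doing", "driving"]:
    # Each stemmer applies its own rule set, so the stems can differ
    print(word, porter.stem(word), lancaster.stem(word))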
http://text-processing.com/demo/stem/ is a web site where you can test stemming algorithms online.
Lemmatisation
Lemmatisation is a linguistic term closely related to stemming; it can be called word-form reduction: a word is restored to its dictionary form by lookup, e.g. "drove" is restored to "drive".
Stemming, by contrast, simply shortens the word: "apples" and "apple" both become "appl" after processing.
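To see the difference, here is a sketch using NLTK's WordNet lemmatizer alongside the Porter stemmer (it assumes the WordNet data has been downloaded with nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# The lemmatizer restores dictionary forms by lookup (e.g. "drove" -> "drive");
# the stemmer just chops suffixes by rule, which is why "apples" can end up as "appl".
print(lemmatizer.lemmatize("drove", pos="v"), porter.stem("drove"))
print(lemmatizer.lemmatize("apples", pos="n"), porter.stem("apples"))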
Computational-linguistics work involves lemmatisation, but personally I think search can ignore it entirely; stemming already solves most of the problem.
Reference
http://text-processing.com/
www.nltk.org — NLTK, Python's natural language processing package; very useful
Natural Language Processing with Python (Chinese edition, PDF)
Search relevance algorithm formula: BM25
The full name of the BM25 algorithm is Okapi BM25. It is an extension of the binary independence model and can be used to rank search results by relevance.
Sphinx's default relevance algorithm is BM25. Lucene 4.0 and later also lets you choose BM25 (the default is TF-IDF). If you are using Solr, just modify schema.xml and add the following line:
<similarity class="solr.BM25SimilarityFactory"/>
BM25 is also a word-frequency-based formula, so the quality of the tokenization it is fed is likewise very important to its results.
The BM25 formula combines an IDF formula with a TF-based part; the quantities involved are:
f(qi,D): the frequency of term qi in document D
|D|: the length of the given document D, in words
avgdl: the average length of the documents in the index
Viewed abstractly, the BM25 formula is similar in shape to the TF-IDF formula: it can also be written as score = Σ IDF(qi) * f(TF).
However, BM25 uses modified forms of IDF and TF. The TF part in particular adds two empirical parameters, k1 and b, to tune precision; generally we take k1 = 2 and b = 0.75.
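To make the shape of the formula concrete, here is a minimal sketch of the commonly published Okapi BM25 score in Python (the toy data structures, and the +1 inside the log that keeps IDF non-negative, are assumptions of this sketch, not part of the original post):

import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avgdl, k1=2.0, b=0.75):
    """Okapi BM25 score of one document for a query.

    query_terms: tokenized (stemmed, stop-word-filtered) query terms
    doc_terms:   list of terms in the document
    doc_freq:    dict mapping term -> number of documents containing it
    num_docs:    total number of documents in the index
    avgdl:       average document length, in terms
    """
    dl = len(doc_terms)                        # |D|
    score = 0.0
    for q in query_terms:
        tf = doc_terms.count(q)                # f(q, D)
        if tf == 0:
            continue
        n_q = doc_freq.get(q, 0)               # documents containing q
        idf = math.log(1 + (num_docs - n_q + 0.5) / (n_q + 0.5))
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        score += idf * tf_part
    return score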
As for whether BM25 or TF-IDF gives better relevance, I think it depends on the criteria you use to evaluate search quality.
Lucene's TF-IDF relevance calculation formula
By default, when you query a keyword, Lucene uses the TF-IDF algorithm to calculate the relevance between the keyword and each document, and sorts the results by that score.
TF: term frequency; IDF: inverse document frequency. TF-IDF is a statistical method, also called a vector space model. The name sounds complex, but it actually contains only two simple rules:
The more often a word or phrase appears in an article, the more relevant it is
The fewer documents in the whole collection contain a word, the more important that word is
So a term's TF-IDF relevance equals TF * IDF
These two rules are very simple, and they are the core of TF-IDF. The second rule is actually flawed: it simply assumes that the rarer a word is in the collection the more important it is, and the more common the less useful, which is clearly not entirely correct. It cannot effectively reflect a word's importance or the distribution of its occurrences. For example, when searching web documents, text in different parts of the HTML structure reflects the content of the page to different degrees and should carry different weights.
The advantage of TF-IDF is that the algorithm is simple and runs fast.
To make scoring programmable, Lucene extends these rules with a number of programming interfaces that apply weighting and normalization to different queries, but the core formula is still TF * IDF.
Lucene's scoring formula is as follows:
score(q,d) = coord(q,d) · queryNorm(q) · Σ over t in q ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
tf(t in d) = √frequency, i.e. the square root of the term's frequency in the document
idf(t) = 1 + log( total number of documents / (number of documents containing t + 1) )
coord(q,d) is a score factor: the more of the query's terms a document contains, the better the document matches. For example, for the query "A B C", a document containing all three of A, B and C scores 3, while a document containing only A and B scores 2. coord can be turned off in the query.
queryNorm(q) is a normalization factor for the query, so that scores from different queries can be compared.
t.getBoost() and norm(t,d) are programmable interfaces for adjusting the weights of fields, documents, and query terms.
All these programming hooks may look cumbersome, and you can get by without them, so the Lucene formula can be simplified to:
score(q,d) = coord(q,d) · Σ( tf(t in d) · idf(t)² )
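A minimal sketch of this simplified formula in Python (the data structures are illustrative, not the real Lucene implementation; coord is taken here as the share of query terms present in the document):

import math

def simplified_lucene_score(query_terms, doc_terms, doc_freq, num_docs):
    matched = 0
    total = 0.0
    for t in set(query_terms):
        freq = doc_terms.count(t)                # raw frequency of t in the document
        if freq == 0:
            continue
        matched += 1
        tf = math.sqrt(freq)                     # tf(t in d) = sqrt(frequency)
        idf = 1 + math.log(num_docs / (doc_freq.get(t, 0) + 1))  # idf(t)
        total += tf * idf ** 2
    coord = matched / len(set(query_terms))      # share of query terms matched
    return coord * total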
Conclusion
The TF-IDF algorithm works on terms, and a term is the smallest unit produced by the tokenizer. This shows that the tokenization algorithm is very important for statistics-based ranking: if, say, Chinese text is simply split character by character, all semantic relevance is lost and search becomes nothing more than an efficient full-text matching method.
According to rule 1 (the more often a word or phrase appears in a document, the more relevant it is), you must remove stop words: their frequencies, and therefore their TF values, are so large that they would seriously distort the results.
TF and IDF are computed when the index is generated: TF is stored with each docId (as part of the term's docId list), and IDF = total number of documents / the length of the current term's docId list.
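A toy sketch of that index-time bookkeeping (a plain dict standing in for the inverted index; not how Lucene stores postings on disk):

from collections import defaultdict

# term -> {docId: term frequency in that document}
postings = defaultdict(dict)
documents = {1: ["apple", "pie", "apple"], 2: ["apple", "juice"]}

for doc_id, terms in documents.items():
    for term in terms:
        postings[term][doc_id] = postings[term].get(doc_id, 0) + 1

num_docs = len(documents)
for term, doc_tfs in postings.items():
    idf = num_docs / len(doc_tfs)    # the simplified idf described above
    print(term, doc_tfs, idf)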