Algorithm and principle of English word segmentation


Formulas for calculating document relevance based on word frequency:

    • TF-IDF: http://lutaf.com/210.htm

    • BM25: http://lutaf.com/211.htm

The quality of word segmentation is extremely important for relevance calculations based on word frequency.

In English (and Western languages generally) the basic unit of the language is the word, so word segmentation is particularly easy; it takes only 3 steps:

    1. Split the text into tokens on spaces/punctuation/paragraph breaks

    2. Filter out stop words

    3. Extract stems (stemming)

Step one: split on spaces and punctuation

This is easy to do with a regular expression:

    import re

    pattern = r'''(?x)          # set flag to allow verbose regexps
          (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
        | \w+(?:-\w+)*          # words with optional internal hyphens
        | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
        | \.\.\.                # ellipsis
        | [][.,;"'?():-_`]      # these are separate tokens
        '''
    # Non-capturing groups (?:...) so that re.findall returns whole tokens
    # rather than tuples of subgroups.
    re.findall(pattern, text_to_tokenize)   # text_to_tokenize: the text to segment
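
For example, applying the pattern above to a sample sentence (the input is just an illustration):

    >>> re.findall(pattern, "That U.S.A. poster-print costs $12.40...")
    ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']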

Step two: remove stop words

Stop words are high-frequency words like a/an/and/are/then. Such high-frequency words badly distort formulas based on word frequency, so they need to be filtered out.
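
A minimal sketch of this filtering step (the stop list here is a tiny hand-picked sample, not a standard list):

    stop_words = {'a', 'an', 'and', 'are', 'then', 'the', 'is', 'of'}

    tokens = ['the', 'apple', 'and', 'the', 'orange', 'are', 'fruits']
    filtered = [t for t in tokens if t.lower() not in stop_words]
    # filtered == ['apple', 'orange', 'fruits']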

Step three: extract stems (stemming)

Stemming is a processing step specific to Western languages. English words have singular/plural variation and -ing/-ed inflections, but for relevance calculation these should count as the same word: for example, apple and apples, or doing and done. The purpose of stemming is to merge these variants.

There are 3 mainstream stemming algorithms:

    • Porter stemmer

    • Lovins stemmer

    • Lancaster stemmer

Lucene's English analysis module ships with 3 stemmers:

    1. EnglishMinimalStemmer

    2. The famous Porter stemmer

    3. KStemmer

A stemming algorithm is not complex; it is a collection of rules, or a mapping table, and is easy to program, but designing one requires an expert in the language who understands its word formation.

http://text-processing.com/demo/stem/ is a website where you can test stemming algorithms online.
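
NLTK (see the references below) includes implementations of the Porter and Lancaster stemmers, which makes a quick comparison easy; this is only a sketch:

    from nltk.stem import PorterStemmer, LancasterStemmer

    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    for word in ['apples', 'apple', 'doing']:
        print(word, '->', porter.stem(word), '/', lancaster.stem(word))
    # Porter maps both 'apples' and 'apple' to 'appl' (see the next section);
    # Lancaster's rules are more aggressive and often produce shorter stems.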

Lemmatisation

Lemmatisation is a linguistic term closely related to stemming; it can be translated as lemma restoration, meaning restoring "drove" to "drive" by looking it up in a dictionary.
Stemming, by contrast, shortens the word: after processing, "apples" and "apple" both become "appl".

    • Wikipedia's introduction to lemmatisation

    • Lemmatizer for European languages, a C-language library

Research in computational linguistics involves lemmatization, but personally I think search can skip it entirely; stemming already solves most of the problem.
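
For comparison, a sketch of the difference using NLTK's WordNet-based lemmatizer (it needs the wordnet corpus, installed via nltk.download('wordnet')):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem('apples'))                  # 'appl'  -- a stem, not a real word
    print(lemmatizer.lemmatize('apples'))          # 'apple' -- the dictionary form
    print(lemmatizer.lemmatize('drove', pos='v'))  # 'drive' -- the example above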

Reference

    • http://text-processing.com/

    • http://www.nltk.org/: NLTK, Python's natural language toolkit, very useful

    • Natural Language Processing with Python (Chinese edition, PDF)




Search relevance algorithm: BM25

The full name of the BM25 algorithm is Okapi BM25. It is an extension of the binary independence model and can be used to rank search results by relevance.

Sphinx's default relevance algorithm is BM25, and since Lucene 4.0 you can also choose BM25 (the default is TF-IDF). If you are using Solr, just modify schema.xml and add the following line:

<similarity class="solr.BM25Similarity"/>

BM25 is also a formula based on word frequency, so the quality of word segmentation is also very important to its results.

The BM25 formula

score(D, Q) = ∑ over qi in Q of IDF(qi) · f(qi, D) · (k1 + 1) / ( f(qi, D) + k1 · (1 − b + b · |D| / avgdl) )

IDF(qi) = log( (N − n(qi) + 0.5) / (n(qi) + 0.5) )

    • f(qi, D): the frequency of the term qi in document D

    • |D|: the length of the given document D

    • avgdl: the average length of all documents in the index

    • N: the total number of documents in the index; n(qi): the number of documents containing qi

Viewed abstractly, the BM25 formula is in fact similar to the TF-IDF formula; it too can be written as score = ∑ IDF(q) · fx(TF).

However, BM25 uses variants of both IDF and TF, especially the TF formula, which adds two empirical parameters, k1 and b, used to tune the scoring; commonly k1 = 2 and b = 0.75.

As to whether BM25 or TF-IDF is the better relevance algorithm, I think it depends on your criteria for evaluating search quality.
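
A minimal sketch of the formula above in Python (using k1 = 2 and b = 0.75 as suggested; the corpus statistics passed in are hypothetical):

    import math

    def bm25(query_terms, doc_tf, doc_len, avgdl, N, df, k1=2.0, b=0.75):
        """Score one document against a query with the BM25 formula above.

        doc_tf: term -> frequency of the term in this document
        N: total number of documents; df: term -> number of documents containing it
        """
        score = 0.0
        for q in query_terms:
            f = doc_tf.get(q, 0)
            if f == 0:
                continue
            # Note: this classic IDF can go negative for terms in more than
            # half the documents; some implementations clamp it.
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl))
        return score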

Reference

    • http://hi.baidu.com/hontlong/item/466b8c4e023084eda5c06684

    • http://en.wikipedia.org/wiki/Okapi_BM25


Lucene's TF-IDF relevance formula

By default, when you query a keyword, Lucene uses the TF-IDF algorithm to calculate the relevance of the keyword to each document, and sorts the results by that score.

TF: term frequency; IDF: inverse document frequency. TF-IDF is a statistical method, also called the vector space model. The name sounds complex, but it actually contains only two simple rules:

    1. The more often a word or phrase appears in a document, the more relevant that document is

    2. The fewer documents in the whole collection contain a given word, the more important that word is

So a term's TF-IDF relevance equals TF · IDF.

These two rules are very simple, and they are the core of TF-IDF. The second rule is actually flawed: it simply assumes that words with low document frequency are more important and words with high document frequency are more useless, which is clearly not entirely correct. It cannot effectively capture a word's importance or the distribution of its positions. For example, when searching web documents, content carries different weight in different parts of an HTML structure, so the same word should receive different weights depending on where it appears.

The advantage of TF-IDF is that the algorithm is simple and fast.

To improve programmability, Lucene extends the rules above by adding a number of programming interfaces that apply weighting and normalization to different queries, but the core formula is still TF · IDF.

The Lucene scoring formula is as follows:

score(q,d) = coord(q,d) · queryNorm(q) · ∑ over t in q ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

    • tf(t in d) = √frequency, the square root of the term's frequency in document d

    • idf(t) = 1 + log( total documents / (number of documents containing t + 1) )

    • coord(q,d) is a scoring factor: the more of the query's terms a document contains, the better the document matches. For example, for the query "A B C", a document containing all three words A/B/C gets 3 points, while a document containing only A/B gets 2 points. coord can be turned off in a query.

    • queryNorm(q) normalizes the query so that scores from different queries are comparable.

    • t.getBoost() and norm(t,d) are both programmable interfaces for adjusting the weights of fields, documents, and query terms.

The various programming hooks may seem cumbersome and can be left unused, so the Lucene formula can be simplified to:

score(q,d) = coord(q,d) · ∑ ( tf(t in d) · idf(t)² )
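
A small sketch of this simplified formula (hypothetical inputs; tf and idf follow the definitions above):

    import math

    def lucene_score(query_terms, doc_tf, total_docs, doc_freq):
        # Simplified Lucene score: coord(q,d) * sum over t of tf(t in d) * idf(t)^2
        matched = [t for t in query_terms if doc_tf.get(t, 0) > 0]
        if not matched:
            return 0.0
        coord = len(matched) / len(query_terms)              # coord(q,d)
        score = 0.0
        for t in matched:
            tf = math.sqrt(doc_tf[t])                        # tf(t in d) = sqrt(frequency)
            idf = 1 + math.log(total_docs / (doc_freq[t] + 1))
            score += tf * idf ** 2
        return coord * score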

Conclusion
    1. The TF-IDF algorithm is based on terms, and a term is the smallest unit produced by the word segmenter. This shows that the segmentation algorithm matters a great deal for statistics-based ranking: with single-character (unigram) segmentation of Chinese, for example, all semantic relevance is lost, and search becomes nothing more than an efficient full-text matching method.

    2. Because of rule 1 (the more often a word or phrase appears in a document, the more relevant that document is), stop words must be removed: their frequency, and therefore their TF value, is so high that they seriously interfere with the calculated results.

    3. TF and IDF are computed when the index is generated: TF is stored along with the docId (as part of the postings), and IDF = total number of documents / the number of docIds owned by the current term.

