Algorithm and principle of English word segmentation
Document-relevance calculation formulas are based on word frequency, so the quality of word segmentation is extremely important to any frequency-based relevance calculation.
In English (and other Western languages) the basic unit of language is the word, so word segmentation is particularly easy; it takes only three steps:
Split the text into words on spaces/punctuation/paragraph breaks
Filter out stop words
Extract stems (stemming)
Step one: split on spaces and punctuation
This is easy to do with regular expressions:
import re

pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():-_`]        # these are separate tokens
'''
re.findall(pattern, text)   # text: the string to be tokenized
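For example (sample sentence borrowed from the NLTK book listed in the references; because the groups above are non-capturing, re.findall returns the whole tokens):

text = "That U.S.A. poster-print costs $12.40..."
print(re.findall(pattern, text))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']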
Step two: remove stop words
Stop words are high-frequency words such as a/an/and/are/then. Because the relevance formulas are driven by word frequency, these words cause a great deal of interference in the calculation, so they need to be filtered out.
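A minimal sketch of this filtering step, assuming a small hand-picked stop list (a real system would use a fuller list, for example the one shipped with NLTK):

STOP_WORDS = {"a", "an", "and", "are", "then", "the", "is", "of", "to"}

def remove_stop_words(tokens):
    # Drop any token that appears in the stop list (case-insensitive)
    return [t for t in tokens if t.lower() not in STOP_WORDS]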
Step three: stemming
Stemming is a processing step specific to Western languages. English words have singular/plural forms and -ing/-ed inflections, but for relevance calculation they should count as the same word: apple and apples, or doing and done, for example. The purpose of stemming is to merge these variants.
There are three major mainstream stemming algorithms:
Porter stemmer
Lovins stemmer
Lancaster stemmer
Lucene's English analysis comes with three stemmers:
EnglishMinimalStemmer
the famous Porter stemmer
KStemmer
Stemming algorithms are not complex: they are essentially a set of rules, or a mapping table, and are easy to program. However, designing them requires an expert in the language who understands its word formation.
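For example, the Porter and Lancaster stemmers listed above are available in NLTK (see the references below); a minimal sketch:

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
for word in ["apples", "doing", "driving"]:
    # Each stemmer applies its own rule set, so the stems can differ
    print(word, porter.stem(word), lancaster.stem(word))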
http://text-processing.com/demo/stem/ is a web site where you can test stemming algorithms online.
Lemmatisation
Lemmatisation is a linguistic term closely related to stemming; it can be called word-form reduction: a word is restored to its dictionary form by lookup, e.g. "drove" is restored to "drive".
Stemming, by contrast, simply shortens the word: "apples" and "apple" both become "appl" after processing.
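To see the difference, here is a sketch using NLTK's WordNet lemmatizer alongside the Porter stemmer (it assumes the WordNet data has been downloaded with nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# The lemmatizer restores dictionary forms by lookup (e.g. "drove" -> "drive");
# the stemmer just chops suffixes by rule, which is why "apples" can end up as "appl".
print(lemmatizer.lemmatize("drove", pos="v"), porter.stem("drove"))
print(lemmatizer.lemmatize("apples", pos="n"), porter.stem("apples"))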
Computational-linguistics work involves lemmatisation, but personally I think search can ignore it entirely; stemming already solves most of the problem.
Reference
http://text-processing.com/
www.nltk.org — NLTK, Python's natural language processing package; very useful
Natural Language Processing with Python (Chinese edition, PDF)
Search relevance algorithm formula: BM25
The full name of the BM25 algorithm is Okapi BM25. It is an extension of the binary independence model and can be used to rank search results by relevance.
Sphinx's default relevance algorithm is BM25. Lucene 4.0 and later also lets you choose BM25 (the default is TF-IDF). If you are using Solr, just modify schema.xml and add the following line:
<similarity class="solr.BM25SimilarityFactory"/>
BM25 is also a word-frequency-based formula, so the quality of the tokenization it is fed is likewise very important to its results.
The BM25 formula combines an IDF formula with a TF-based part; the quantities involved are:
f(qi,D): the frequency of term qi in document D
|D|: the length of the given document D, in words
avgdl: the average length of the documents in the index
Viewed abstractly, the BM25 formula is similar in shape to the TF-IDF formula: it can also be written as score = Σ IDF(qi) * f(TF).
However, BM25 uses modified forms of IDF and TF. The TF part in particular adds two empirical parameters, k1 and b, to tune precision; generally we take k1 = 2 and b = 0.75.
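To make the shape of the formula concrete, here is a minimal sketch of the commonly published Okapi BM25 score in Python (the toy data structures, and the +1 inside the log that keeps IDF non-negative, are assumptions of this sketch, not part of the original post):

import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avgdl, k1=2.0, b=0.75):
    """Okapi BM25 score of one document for a query.

    query_terms: tokenized (stemmed, stop-word-filtered) query terms
    doc_terms:   list of terms in the document
    doc_freq:    dict mapping term -> number of documents containing it
    num_docs:    total number of documents in the index
    avgdl:       average document length, in terms
    """
    dl = len(doc_terms)                        # |D|
    score = 0.0
    for q in query_terms:
        tf = doc_terms.count(q)                # f(q, D)
        if tf == 0:
            continue
        n_q = doc_freq.get(q, 0)               # documents containing q
        idf = math.log(1 + (num_docs - n_q + 0.5) / (n_q + 0.5))
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        score += idf * tf_part
    return score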
As for whether BM25 or TF-IDF gives better relevance, I think it depends on the criteria you use to evaluate search quality.
Lucene's TF-IDF relevance calculation formula
By default, when you query a keyword, Lucene uses the TF-IDF algorithm to calculate the relevance between the keyword and each document, and sorts the results by that score.
TF: term frequency; IDF: inverse document frequency. TF-IDF is a statistical method, also called a vector space model. The name sounds complex, but it actually contains only two simple rules:
The more often a word or phrase appears in an article, the more relevant it is
The fewer documents in the whole collection contain a word, the more important that word is
So a term's TF-IDF relevance equals TF * IDF
These two rules are very simple, and they are the core of TF-IDF. The second rule is actually flawed: it simply assumes that the rarer a word is in the collection the more important it is, and the more common the less useful, which is clearly not entirely correct. It cannot effectively reflect a word's importance or the distribution of its occurrences. For example, when searching web documents, text in different parts of the HTML structure reflects the content of the page to different degrees and should carry different weights.
The advantage of TF-IDF is that the algorithm is simple and runs fast.
To make scoring programmable, Lucene extends these rules with a number of programming interfaces that apply weighting and normalization to different queries, but the core formula is still TF * IDF.
Lucene's scoring formula is as follows:
score(q,d) = coord(q,d) · queryNorm(q) · Σ over t in q ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
tf(t in d) = √frequency, i.e. the square root of the term's frequency in the document
idf(t) = 1 + log( total number of documents / (number of documents containing t + 1) )
coord(q,d) is a score factor: the more of the query's terms a document contains, the better the document matches. For example, for the query "A B C", a document containing all three of A, B and C scores 3, while a document containing only A and B scores 2. coord can be turned off in the query.
queryNorm(q) is a normalization factor for the query, so that scores from different queries can be compared.
t.getBoost() and norm(t,d) are programmable interfaces for adjusting the weights of fields, documents, and query terms.
All these programming hooks may look cumbersome, and you can get by without them, so the Lucene formula can be simplified to:
score(q,d) = coord(q,d) · Σ( tf(t in d) · idf(t)² )
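A minimal sketch of this simplified formula in Python (the data structures are illustrative, not the real Lucene implementation; coord is taken here as the share of query terms present in the document):

import math

def simplified_lucene_score(query_terms, doc_terms, doc_freq, num_docs):
    matched = 0
    total = 0.0
    for t in set(query_terms):
        freq = doc_terms.count(t)                # raw frequency of t in the document
        if freq == 0:
            continue
        matched += 1
        tf = math.sqrt(freq)                     # tf(t in d) = sqrt(frequency)
        idf = 1 + math.log(num_docs / (doc_freq.get(t, 0) + 1))  # idf(t)
        total += tf * idf ** 2
    coord = matched / len(set(query_terms))      # share of query terms matched
    return coord * total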
Conclusion
The TF-IDF algorithm works on terms, and a term is the smallest unit produced by the tokenizer. This shows that the tokenization algorithm is very important for statistics-based ranking: if, say, Chinese text is simply split character by character, all semantic relevance is lost and search becomes nothing more than an efficient full-text matching method.
According to rule 1 (the more often a word or phrase appears in a document, the more relevant it is), you must remove stop words: their frequencies, and therefore their TF values, are so large that they would seriously distort the results.
TF and IDF are computed when the index is generated: TF is stored with each docId (as part of the term's docId list), and IDF = total number of documents / the length of the current term's docId list.
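A toy sketch of that index-time bookkeeping (a plain dict standing in for the inverted index; not how Lucene stores postings on disk):

from collections import defaultdict

# term -> {docId: term frequency in that document}
postings = defaultdict(dict)
documents = {1: ["apple", "pie", "apple"], 2: ["apple", "juice"]}

for doc_id, terms in documents.items():
    for term in terms:
        postings[term][doc_id] = postings[term].get(doc_id, 0) + 1

num_docs = len(documents)
for term, doc_tfs in postings.items():
    idf = num_docs / len(doc_tfs)    # the simplified idf described above
    print(term, doc_tfs, idf)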