Full-Text Search, Data Mining, and Recommendation Engine Series (5): Article Term Vectors

Source: Internet
Author: User
Tags: text processing, TF-IDF

Whether you want to perform full-text search or automatic cluster analysis on articles, you must first represent each article as a term vector. In Lucene, term vectors are used to index and search articles; however, Lucene does not provide a suitable interface for computing term vectors, so we must implement one ourselves.

Term Vectors

As we all know, an article is composed of words. When processing text, we first perform Chinese word segmentation, which includes removing frequently used stop words (such as 的, 地, and 和) and adding synonyms for keywords, such as abbreviations and full names. English words may need to be converted to lowercase, with plural and past-tense inflections removed, and the word stem may also need to be extracted. In short, after this preprocessing, the article becomes a string array composed of a series of words.
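As a rough illustration, a minimal preprocessing routine for English text might look like the sketch below. The stop-word list and the plural-stripping rule are simplified placeholders invented for this example; they are not the segmenter actually used later in this article.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimplePreprocessor {
    // A tiny illustrative stop-word list; a real system would use a much larger one
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "and", "of", "to"));

    public static List<String> preprocess(String text) {
        List<String> result = new ArrayList<String>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // skip empty tokens and stop words
            }
            // Naive plural stripping as a stand-in for real stemming
            if (token.endsWith("s") && token.length() > 3) {
                token = token.substring(0, token.length() - 1);
            }
            result.add(token);
        }
        return result;
    }
}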

For each article in the system, we first calculate the term frequency (TF) of each word: the number of times the word occurs in the article divided by the total number of words in the article. We then calculate the word's inverse document frequency (IDF): count the number of articles in which the word appears, and divide the total number of articles by that count, taking the logarithm of the quotient. From these definitions we can see that the more frequently a word appears in an article, and the fewer other articles it appears in, the more important the word is to that article. Combining TF and IDF with a formula gives the weight of each word in each article, and all the words together with their corresponding weights form a multidimensional term vector.
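Written out, this is a small hypothetical helper (not one of the article's classes) that matches the Math.log-based IDF used in the code below:

// TF  = occurrences of the term in the document / total words in the document
// IDF = log(total number of documents / number of documents containing the term)
// The raw weight is TF * IDF; a document's weights are normalized afterwards.
public static double tfIdf(int termCountInDoc, int totalTermsInDoc,
                           int totalDocs, int docsContainingTerm) {
    double tf = termCountInDoc / (double) totalTermsInDoc;
    double idf = Math.log((double) totalDocs / docsContainingTerm);
    return tf * idf;
}

For example, a word occurring 4 times in a 100-word article and appearing in 10 of 1000 articles gets a raw weight of 0.04 * log(100) ≈ 0.184.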

Calculate TF * IDF

There is currently no single mature algorithm for computing term vectors; in practice, empirical algorithms are commonly used. Some papers propose algorithms claimed to be more accurate, but these have not yet been widely validated. Here we use the most common algorithm, which is also quite simple to adjust based on actual needs.

First, the system needs to maintain several global variables: the total number of articles, all the words in the system, and the number of articles each word appears in.

// Words appearing in different positions of an article have different importance,
// so a different weight can be assigned to each position.
public static final int TITLE_WEIGHT = 1;
public static final int KEYWORD_WEIGHT = 1;
public static final int TAG_WEIGHT = 1;
public static final int ABSTRACT_WEIGHT = 1;
public static final int BODY_WEIGHT = 1;

private static int docsNum = 0; // total number of documents in the system (to be stored in a database later)
private static Map<String, Integer> wordDocs = null; // for each word, the number of documents containing it
private static Vector<Integer> termNumDoc = null; // total number of words in each document
private static Vector<Vector<TermInfo>> termVectors = null; // term vector of each document
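The article never shows the TermInfo class itself; reconstructed from the getters and setters used below, it is presumably a plain bean along these lines:

public class TermInfo {
    private String termStr;      // the term itself
    private double mountPerDoc;  // weighted occurrence count within the document
    private double tf;           // term frequency
    private double rawWeight;    // TF * IDF before normalization
    private double weight;       // normalized weight

    public String getTermStr() { return termStr; }
    public void setTermStr(String termStr) { this.termStr = termStr; }
    public double getMountPerDoc() { return mountPerDoc; }
    public void setMountPerDoc(double mountPerDoc) { this.mountPerDoc = mountPerDoc; }
    public double getTf() { return tf; }
    public void setTf(double tf) { this.tf = tf; }
    public double getRawWeight() { return rawWeight; }
    public void setRawWeight(double rawWeight) { this.rawWeight = rawWeight; }
    public double getWeight() { return weight; }
    public void setWeight(double weight) { this.weight = weight; }
}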

 

The following program generates the raw term vector for a piece of text:
/**
 * An article consists of a title, keywords, an abstract, tags, and a body; words in each
 * part have a different weight. This function performs Chinese word segmentation on one
 * part and accumulates its words into the term vector.
 * @param text the text to be processed
 * @param termArray the term vector to accumulate into
 * @param weight the weight of words in this part
 * @return the weighted word count of this part (used to calculate the term frequency TF)
 */
private static int procDocPart(String text, Vector<TermInfo> termArray, int weight) {
    Collection<String> words = FteEngine.tokenize(text);
    int termMount = 0;
    for (String word : words) {
        TermInfo termInfo = null;
        // Look for an existing entry for this word
        for (TermInfo ti : termArray) {
            if (ti.getTermStr().equals(word)) {
                termInfo = ti;
                break;
            }
        }
        if (termInfo != null) {
            termInfo.setMountPerDoc(termInfo.getMountPerDoc() + weight);
        } else {
            termInfo = new TermInfo();
            termInfo.setMountPerDoc(weight);
            termInfo.setTermStr(word);
            termInfo.setRawWeight(0.0);
            termInfo.setWeight(0.0);
            termArray.add(termInfo);
        }
        termMount += weight;
    }
    return termMount;
}

The following program computes TF * IDF and normalizes it to generate the final term vector:

/**
 * The title, keywords, tags, abstract, and body are superimposed in turn to generate
 * the term vector of a document.
 * @param docIdx the document index; -1 indicates a newly added document
 * @param text the text to be processed
 * @param weight the weight of words in this text segment
 * @return the document index
 */
public static int genTermVector(int docIdx, String text, int weight) {
    Vector<TermInfo> termVector = null;
    if (docIdx < 0) {
        docIdx = docsNum;
        termNumDoc.add(0);
        termVector = new Vector<TermInfo>();
        termVectors.add(termVector);
        docsNum++;
    }
    termVector = termVectors.elementAt(docIdx);
    int termMount = procDocPart(text, termVector, weight);
    termNumDoc.set(docIdx, termNumDoc.elementAt(docIdx).intValue() + termMount);
    TermInfo termInfo = null;
    String termStr = null;
    Iterator<TermInfo> termInfoItr = termVector.iterator();
    // Count the number of documents each word appears in, and compute TF
    while (termInfoItr.hasNext()) {
        termInfo = termInfoItr.next();
        termStr = termInfo.getTermStr();
        if (wordDocs.get(termStr) != null) {
            wordDocs.put(termStr, wordDocs.get(termStr).intValue() + 1);
        } else {
            wordDocs.put(termStr, 1);
        }
        termInfo.setTf(termInfo.getMountPerDoc() / (double) termNumDoc.elementAt(docIdx).intValue());
    }
    Iterator<Vector<TermInfo>> docItr = termVectors.iterator();
    // Recompute TF * IDF for every document, since IDF changes as documents are added
    double rwPSum = 0.0;
    while (docItr.hasNext()) {
        termVector = docItr.next();
        termInfoItr = termVector.iterator();
        rwPSum = 0.0;
        while (termInfoItr.hasNext()) {
            termInfo = termInfoItr.next();
            termInfo.setRawWeight(termInfo.getTf()
                    * Math.log((double) docsNum / wordDocs.get(termInfo.getTermStr()).intValue()));
            rwPSum += termInfo.getRawWeight() * termInfo.getRawWeight();
        }
        // Normalize TF * IDF so that each document's vector has unit length
        termInfoItr = termVector.iterator();
        while (termInfoItr.hasNext()) {
            termInfo = termInfoItr.next();
            termInfo.setWeight(termInfo.getRawWeight() / Math.sqrt(rwPSum));
        }
    }
    return docIdx;
}

 

Article Similarity Calculation

The similarity of two articles is computed as the distance between their term vectors: the square root of the sum of squared differences between the normalized TF * IDF weights of corresponding terms, just like the distance between two-dimensional vectors. Since each vector is normalized to unit length, this Euclidean distance is directly related to the cosine similarity (dist² = 2 − 2·cos θ). The implementation is as follows:

/**
 * Calculate the distance between two term vectors. A smaller value indicates a higher
 * similarity between the two articles.
 * @param termVector1
 * @param termVector2
 * @return the distance
 */
public static double calTermVectorDist(Collection<TermInfo> termVector1, Collection<TermInfo> termVector2) {
    double dist = 0.0;
    Vector<TermInfo> tv1 = (Vector<TermInfo>) termVector1;
    Vector<TermInfo> tv2 = (Vector<TermInfo>) termVector2;
    Hashtable<String, TermInfo> tv2Tbl = new Hashtable<String, TermInfo>();
    Iterator<TermInfo> tvItr = null;
    TermInfo termInfo = null;
    TermInfo ti2 = null;
    // Index the second vector by term string for fast lookup
    tvItr = tv2.iterator();
    while (tvItr.hasNext()) {
        termInfo = tvItr.next();
        tv2Tbl.put(termInfo.getTermStr(), termInfo);
    }
    // Terms in the first vector: add the squared weight difference if the term is shared,
    // or the squared weight itself if the term is missing from the second vector
    tvItr = tv1.iterator();
    while (tvItr.hasNext()) {
        termInfo = tvItr.next();
        ti2 = tv2Tbl.get(termInfo.getTermStr());
        if (ti2 != null) {
            dist += (termInfo.getWeight() - ti2.getWeight()) * (termInfo.getWeight() - ti2.getWeight());
            tv2Tbl.remove(termInfo.getTermStr()); // already handled; leave only unshared terms in the table
        } else {
            dist += termInfo.getWeight() * termInfo.getWeight();
        }
    }
    // Terms that appear only in the second vector
    tvItr = tv2Tbl.values().iterator();
    while (tvItr.hasNext()) {
        termInfo = tvItr.next();
        dist += termInfo.getWeight() * termInfo.getWeight();
    }
    return Math.sqrt(dist);
}
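Assuming two documents have been indexed with genTermVector as above, comparing them is then a single call (a hypothetical snippet):

double dist = calTermVectorDist(termVectors.elementAt(0), termVectors.elementAt(1));
System.out.println("Distance between doc 0 and doc 1: " + dist);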

As a test, the term vectors of the following three sentences are calculated:

Java programming

C++ programming guide

Gay websites transform into e-commerce websites

The calculated term vectors are:

############# Doc1 ############
Java: 0.5527962688403749
Language: 0.20402065516569604
Program: 0.20402065516569604
Technical: 0.5527962688403749
Details: 0.5527962688403749
############# Doc2 ############
C: 0.6633689723434504 (Note: Our dictionary does not contain C ++)
Language: 0.24482975009584626
Program: 0.24482975009584626
Guide: 0.6633689723434504
############# Doc3 ############
Homosexuality: 0.531130184813292
Network: 0.196024348194679
Site: 0.196024348194679
Transformation: 0.531130184813292
E-commerce: 0.531130184813292
Network: 0.196024348194679
Site: 0.196024348194679

The calculated distances are then:

Article 1 and Article 2: 1.3417148340558687

Article 1 and Article 3: 1.3867764455130116
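As a sanity check, the first distance can be reproduced by hand from the vectors above: the two shared terms ("language" and "program") contribute 2 × (0.2448 − 0.2040)², the three terms unique to article 1 contribute 3 × 0.5528², and the two terms unique to article 2 contribute 2 × 0.6634², for a total of about 1.8002, whose square root is 1.3417.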

The system therefore considers the first and second articles to be closer, which matches the actual situation: the first and second articles share two words ("language" and "program"), while the first and third articles share none.

 
