Leveraging Term Vectors
A term vector represents a field of a document (such as title or body) as a point in a multi-dimensional word-frequency space: each distinct word is one dimension, and the value along that dimension is the word's frequency in that field. To use term vectors, you must enable them for the field during indexing (in Lucene 1.9/2.x, by passing one of the Field.TermVector options to the Field constructor):
Field options for term vectors:
TermVector.YES - Record the unique terms that occurred, and their counts, in each document, but do not store any positions or offsets information.
TermVector.WITH_POSITIONS - Record the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.
TermVector.WITH_OFFSETS - Record the unique terms and their counts, with the offsets (start and end character positions) of each occurrence of every term, but no positions.
TermVector.WITH_POSITIONS_OFFSETS - Store the unique terms and their counts, along with both positions and offsets.
TermVector.NO - Do not store any term vector information.
If Index.NO is specified for a field, you must also specify TermVector.NO.
Once indexing is complete, given a document ID and a field name, we can read that field's term vector back from the IndexReader (provided term vectors were enabled during indexing):

TermFreqVector vector =
    reader.getTermFreqVector(id, "subject");

You can traverse the TermFreqVector to retrieve each term and its frequency; if you chose to store offsets and positions during indexing, you can obtain them here as well.
With this term vector we can build some interesting applications:
1) Books like this

To compare the similarity of two books, abstract each book as a document with author and subject fields; these two fields are then used to measure how alike the books are. The author field is multi-valued (a book can have several authors), so the first step is to compare the authors:
String[] authors = doc.getValues("author");
BooleanQuery authorQuery = new BooleanQuery(); // #3
for (int i = 0; i < authors.length; i++) { // #3
    String author = authors[i]; // #3
    authorQuery.add(new TermQuery(new Term("author", author)),
        BooleanClause.Occur.SHOULD); // #3
}
authorQuery.setBoost(2.0f);
Finally, set a high boost value on this query to indicate that this clause is important and carries extra weight: books sharing an author are considered very similar.
The second step uses the term vector. Here we keep it very simple and merely check whether the terms in the subject field's term vector match:
TermFreqVector vector = // #4
    reader.getTermFreqVector(id, "subject"); // #4
BooleanQuery subjectQuery = new BooleanQuery(); // #4
for (int j = 0; j < vector.size(); j++) { // #4
    TermQuery tq = new TermQuery(new Term("subject", vector.getTerms()[j])); // #4
    subjectQuery.add(tq, BooleanClause.Occur.SHOULD); // #4
}
2) What category?

This goes a little further than the previous example. How does classification work? For each document's subject field we have a term vector, so we can compare the term vectors of two documents: the smaller the angle between the two vectors in the vector space, the more similar the documents are. Since classification is a training process, we must first build the term vector of each category to serve as the standard against which other documents are compared.
Here a Map of (term, frequency) pairs implements a term vector; n entries represent an n-dimensional vector. We need a term vector for each category, so a second map associates each category with its term vector. To build a category's term vector, traverse every document in that category, take the document's term vector, and add it into the category's term vector:
private void addTermFreqToMap(Map vectorMap, TermFreqVector termFreqVector) {
    String[] terms = termFreqVector.getTerms();
    int[] freqs = termFreqVector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        String term = terms[i];
        if (vectorMap.containsKey(term)) {
            Integer value = (Integer) vectorMap.get(term);
            vectorMap.put(term, new Integer(value.intValue() + freqs[i]));
        } else {
            vectorMap.put(term, new Integer(freqs[i]));
        }
    }
}
First retrieve the lists of terms and frequencies from the document's term vector; then, for every term in the vector, add the document's term frequency to the category's running total. Once every category has its vector, we can compute the angle between a document and a category's vector:
cos θ = (A · B) / (|A| × |B|)

where A · B is the dot product: multiply the two vectors dimension by dimension and sum the products.
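As a quick numeric check of the formula, here is a self-contained sketch with two made-up frequency vectors (the values are illustrative, not from any real index):

```java
public class CosineDemo {
    public static void main(String[] args) {
        // Hypothetical term-frequency vectors over the same three terms
        int[] a = {1, 1, 0}; // document vector: first two terms present once
        int[] b = {2, 1, 1}; // category vector

        int dot = 0, sumA = 0, sumB = 0;
        for (int i = 0; i < a.length; i++) {
            dot  += a[i] * b[i]; // A . B
            sumA += a[i] * a[i]; // |A|^2
            sumB += b[i] * b[i]; // |B|^2
        }
        double cos = dot / (Math.sqrt(sumA) * Math.sqrt(sumB));
        System.out.println(cos);           // 3 / (sqrt(2) * sqrt(6)) ≈ 0.866
        System.out.println(Math.acos(cos)); // ≈ 0.5236 rad, i.e. 30 degrees
    }
}
```

The smaller the resulting angle, the more the two vectors point in the same direction, i.e. the more similar the two term distributions are.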
For ease of calculation, assume the term frequency in the query document is either 0 or 1, i.e. we only record whether a term appears:
private double computeAngle(String[] words, String category) {
    // assume words are unique and only occur once
    Map vectorMap = (Map) categoryMap.get(category);
    int dotProduct = 0;
    int sumOfSquares = 0;
    for (int i = 0; i < words.length; i++) {
        String word = words[i];
        int categoryWordFreq = 0;
        if (vectorMap.containsKey(word)) {
            categoryWordFreq = ((Integer) vectorMap.get(word)).intValue();
        }
        dotProduct += categoryWordFreq; // optimized because we assume frequency in words is 1
        sumOfSquares += categoryWordFreq * categoryWordFreq;
    }
    double denominator;
    if (sumOfSquares == words.length) {
        // avoid precision issues for special case
        denominator = sumOfSquares; // sqrt(x) * sqrt(x) = x
    } else {
        denominator = Math.sqrt(sumOfSquares) * Math.sqrt(words.length);
    }
    double ratio = dotProduct / denominator;
    return Math.acos(ratio);
}
This function is a straightforward implementation of the formula above.
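To actually classify a document, one would call computeAngle for every category and pick the one with the smallest angle. Below is a self-contained sketch of that selection step, with the angle math inlined so it runs standalone; the category names and frequencies are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class CategoryPicker {
    // Same math as computeAngle above; assumes each word in `words`
    // occurs exactly once in the document being classified.
    static double angle(String[] words, Map<String, Integer> categoryVector) {
        int dotProduct = 0, sumOfSquares = 0;
        for (String word : words) {
            int freq = categoryVector.containsKey(word) ? categoryVector.get(word) : 0;
            dotProduct += freq;               // document-side frequency is assumed 1
            sumOfSquares += freq * freq;
        }
        if (sumOfSquares == 0) return Math.PI / 2; // no overlap: 90 degrees
        return Math.acos(dotProduct / (Math.sqrt(sumOfSquares) * Math.sqrt(words.length)));
    }

    public static void main(String[] args) {
        // Hypothetical category vectors, as addTermFreqToMap would build them
        Map<String, Map<String, Integer>> categoryMap = new HashMap<>();
        Map<String, Integer> javaCat = new HashMap<>();
        javaCat.put("lucene", 3);
        javaCat.put("index", 2);
        Map<String, Integer> cookingCat = new HashMap<>();
        cookingCat.put("recipe", 5);
        categoryMap.put("java", javaCat);
        categoryMap.put("cooking", cookingCat);

        // Classify: smallest angle wins
        String[] words = {"lucene", "index"};
        String best = null;
        double bestAngle = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : categoryMap.entrySet()) {
            double a = angle(words, e.getValue());
            if (a < bestAngle) { bestAngle = a; best = e.getKey(); }
        }
        System.out.println(best); // prints "java"
    }
}
```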
3) MoreLikeThis

Lucene also provides an efficient interface for finding similar documents, the MoreLikeThis interface:
http://lucene.apache.org/java/1_9_1/api/org/apache/lucene/search/similar/MoreLikeThis.html
With the method above, we could compute the cosine between every pair of documents and sort by it to find the most similar ones. The biggest problem with that approach is the computational cost: once the number of documents is large, it becomes practically unacceptable. There are specialized techniques that optimize the cosine method and greatly reduce the amount of computation while remaining accurate, but their barrier to entry is high.
The principle of MoreLikeThis is very simple: for a document, extract its interesting terms (i.e. the terms with high TF × IDF), then use Lucene to search for documents containing those same terms and treat them as similar documents. The advantage is efficiency; the disadvantage is reduced accuracy. The interface offers many parameters for configuring how the interesting terms are selected.
MoreLikeThis mlt = new MoreLikeThis(ir); // ir is an IndexReader
Reader target = ...
// orig source of doc you want to find similarities to
Query query = mlt.like(target);
Hits hits = is.search(query); // is is an IndexSearcher
Usage is very simple, and this gets you the similar documents. The interface is also flexible: instead of calling like(), you can call retrieveInterestingTerms(Reader r) to obtain just the interesting terms, and then process them however your own application requires.
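The "interesting terms" idea can be illustrated without Lucene: score each term of a document by TF × IDF and keep the top scorers. A minimal sketch under that assumption (the corpus size and term statistics below are made up; the real MoreLikeThis adds further heuristics such as minimum frequency cutoffs):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InterestingTerms {
    public static void main(String[] args) {
        int numDocs = 1000; // hypothetical corpus size

        // term -> frequency in this document (made-up values)
        Map<String, Integer> tf = new HashMap<>();
        tf.put("the", 20);
        tf.put("lucene", 5);
        tf.put("vector", 3);

        // term -> number of corpus documents containing it (made-up values)
        Map<String, Integer> df = new HashMap<>();
        df.put("the", 990);
        df.put("lucene", 10);
        df.put("vector", 40);

        // score = tf * idf, with idf = log(numDocs / df)
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) numDocs / df.get(e.getKey()));
            score.put(e.getKey(), e.getValue() * idf);
        }

        // sort terms by descending score; the top terms are the "interesting" ones
        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort((a, b) -> Double.compare(score.get(b), score.get(a)));
        System.out.println(terms); // "lucene" ranks first, "the" last
    }
}
```

Rare-but-frequent-in-this-document terms like "lucene" dominate, while a stopword like "the" scores near zero despite its high frequency, which is exactly why these terms make good query seeds for finding similar documents.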