Lucene in Action notes: term vectors (the word-frequency vector space built for a specific field, and using the cosine to calculate document similarity on that field)

Excerpt from: http://blog.csdn.net/fxjtoday/article/details/5142661

Leveraging term vectors
A term vector is built for one field of a document, such as a text field like title or body, and represents that field as a multidimensional vector of word frequencies. Each term is one dimension, and the value along that dimension is the number of times the term occurs in the field.
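
To make the idea concrete, here is a minimal sketch in plain Java that counts term frequencies for one field's text. It is not how Lucene builds term vectors (Lucene runs the field through its analyzer at indexing time); the class name TermVectorSketch and the lowercase, whitespace-based tokenization are assumptions made only for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: the tokenization below (lowercase, split on whitespace)
// is an assumption, not Lucene's analysis chain.
class TermVectorSketch {

    // Maps each distinct term in the field text to its number of occurrences.
    static Map<String, Integer> termFrequencies(String fieldText) {
        Map<String, Integer> vector = new TreeMap<>();
        for (String token : fieldText.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                // Each distinct term is one dimension; its value is the count.
                vector.merge(token, 1, Integer::sum);
            }
        }
        return vector;
    }
}
```

Calling TermVectorSketch.termFrequencies on the text of a subject field returns one entry per distinct term, which is the (term, frequency) view used in the rest of these notes.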

To use term vectors, you must enable the term vector option on the Field at indexing time:

Field options for term vectors
TermVector.YES: record the unique terms that occurred, and their counts, in each document, but do not store any position or offset information.
TermVector.WITH_POSITIONS: record the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.
TermVector.WITH_OFFSETS: record the unique terms and their counts, with the offsets (start and end character position) of each occurrence of every term, but no positions.
TermVector.WITH_POSITIONS_OFFSETS: store the unique terms and their counts, along with positions and offsets.
TermVector.NO: do not store any term vector information.
If Index.NO is specified for a field, then you must also specify TermVector.NO.
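
The difference between positions and offsets is easy to mix up, so here is a small plain-Java sketch. A position is the index of the token in the token stream, while offsets are the start and end character positions in the original text. The class PositionOffsetSketch, its Occurrence type, and the whitespace tokenization are hypothetical and used only for illustration; the end offset here is exclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: whitespace tokenization is an assumption, not Lucene's analyzer.
class PositionOffsetSketch {

    // One occurrence of a term: its token position and its character offsets.
    static final class Occurrence {
        final String term;
        final int position;     // index of the token in the token stream
        final int startOffset;  // index of the first character of the token
        final int endOffset;    // index one past the last character (exclusive)

        Occurrence(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    static List<Occurrence> occurrences(String text) {
        List<Occurrence> result = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                result.add(new Occurrence(text.substring(start, i), position++, start, i));
            }
        }
        return result;
    }
}
```

For the text "to be or not to be", the second "to" is the token at position 4 and covers characters 13 to 15. Storing positions alone keeps the first number, storing offsets alone keeps the character range, and storing both keeps everything.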

After indexing, given a document ID and a field name, we can read the term vector from the IndexReader (provided term vectors were stored at indexing time):
TermFreqVector termFreqVector = reader.getTermFreqVector(id, "subject");
You can iterate over the TermFreqVector to get each term and its frequency. If you chose to store offsets and positions at indexing time, you can read them here as well.

With term vectors we can build some interesting applications:
1) Books like this
To judge whether two books are similar, represent each book as a document with author and subject fields, then compare the two books on those two fields.
The author field is multivalued, since a book can have several authors. The first step is to look for books that share an author:
String[] authors = doc.getValues("author");
BooleanQuery authorQuery = new BooleanQuery();
// Add one optional (SHOULD) clause per author of the source book.
for (int i = 0; i < authors.length; i++) {
    String author = authors[i];
    authorQuery.add(new TermQuery(new Term("author", author)), BooleanClause.Occur.SHOULD);
}
// Weight author matches more heavily than subject matches.
authorQuery.setBoost(2.0f);
Finally, the boost of this query is raised to signal that the condition carries more weight: if the authors match, the books are very likely to be similar.
The second step uses the term vector. It is very simple: look for books whose subject field contains the same terms as the subject term vector of the source book:
TermFreqVector vector = reader.getTermFreqVector(id, "subject");
BooleanQuery subjectQuery = new BooleanQuery();
// Add one optional (SHOULD) clause per term in the subject term vector.
for (int j = 0; j < vector.size(); j++) {
    TermQuery tq = new TermQuery(new Term("subject", vector.getTerms()[j]));
    subjectQuery.add(tq, BooleanClause.Occur.SHOULD);
}

2) What category?
This is a step up from the previous example: how do we classify documents? Again we work with the term vector of each document's subject field.
For two documents, we can compare the angle between their term vectors in the vector space; the smaller the angle, the more similar the two documents.
Classification needs a training step, so we must build a term vector for each category to serve as the standard that other documents are compared against.
Here a Map implements the term vector as (term, frequency) pairs, so a map with N entries represents an N-dimensional vector. We generate one term vector per category, and a second map can link each category to its term vector. To build a category's term vector:
Traverse each document in the category, take the document's term vector, and add it to the category's term vector.
private void addTermFreqToMap(Map vectorMap, TermFreqVector termFreqVector) {
    String[] terms = termFreqVector.getTerms();
    int[] freqs = termFreqVector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        String term = terms[i];
        if (vectorMap.containsKey(term)) {
            // Term already seen in this category: add this document's frequency.
            Integer value = (Integer) vectorMap.get(term);
            vectorMap.put(term, new Integer(value.intValue() + freqs[i]));
        } else {
            // First occurrence of the term in this category.
            vectorMap.put(term, new Integer(freqs[i]));
        }
    }
}
First get the term list and the frequency list from the document's term vector, then look up each term in the category's term vector and add the document's term frequency to it.
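
The notes mention that a second map can link each category to its term vector. Here is a minimal sketch of that two-level structure in plain Java, using generics and Map.merge. The class CategoryVectorSketch and its method names are hypothetical, and the parallel terms/freqs arrays stand in for what a term vector would supply.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: names are hypothetical; the terms/freqs arrays mimic the
// parallel arrays that a term vector exposes.
class CategoryVectorSketch {

    // category name -> (term -> accumulated frequency)
    private final Map<String, Map<String, Integer>> categoryMap = new HashMap<>();

    // Adds one document's term vector to the vector of the given category.
    void addDocument(String category, String[] terms, int[] freqs) {
        Map<String, Integer> vector =
                categoryMap.computeIfAbsent(category, k -> new HashMap<>());
        for (int i = 0; i < terms.length; i++) {
            // Sum frequencies for terms already present in the category.
            vector.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    // Returns the accumulated frequency of a term in a category, or 0 if absent.
    int frequency(String category, String term) {
        Map<String, Integer> vector = categoryMap.get(category);
        return vector == null ? 0 : vector.getOrDefault(term, 0);
    }
}
```

Training then amounts to calling addDocument once for every document in every category.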

With a term vector for each category, we can compute the angle between a document's vector and the category's vector:
cos θ = (A · B) / (|A| |B|)
A · B is the dot product: multiply the two vectors dimension by dimension and add up the products.
To simplify the calculation, assume that a term's frequency in the document takes only two values, 0 or 1, meaning the term either does not appear or appears.
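
The notes stop short of the code for this step, so here is a sketch of the cosine calculation over Map-based term vectors, in plain Java. It shows both the general formula and the simplified 0/1 case. The class CosineSketch and its method names are hypothetical, and this is not the book's implementation. In the 0/1 case the dot product reduces to summing the category frequencies of the document's words, and |A| is the square root of the number of distinct words.

```java
import java.util.Map;
import java.util.Set;

// Illustrative only: a sketch of cos = A.B / (|A| |B|) over sparse vectors.
class CosineSketch {

    // Euclidean length of a sparse term-frequency vector.
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int freq : vector.values()) {
            sumOfSquares += (double) freq * freq;
        }
        return Math.sqrt(sumOfSquares);
    }

    // General case: both vectors carry real term frequencies.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            // Terms missing from b contribute 0 to the dot product.
            dot += (double) entry.getValue() * b.getOrDefault(entry.getKey(), 0);
        }
        double denominator = norm(a) * norm(b);
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    // Simplified case: every distinct document word has frequency 1.
    static double cosineBinary(Set<String> documentWords, Map<String, Integer> category) {
        double dot = 0.0;
        for (String word : documentWords) {
            dot += category.getOrDefault(word, 0);
        }
        double denominator = Math.sqrt(documentWords.size()) * norm(category);
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }
}
```

Math.acos applied to the returned value gives the angle itself. To classify a document, compute the value against every category vector and pick the category with the largest cosine, which is the smallest angle.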

3) MoreLikeThis

Lucene also provides a more efficient interface for finding similar documents, the MoreLikeThis class:

http://lucene.apache.org/java/1_9_1/api/org/apache/lucene/search/similar/MoreLikeThis.html

With the method above we can compute the cosine between every pair of documents and sort by cosine to find the most similar ones. The biggest problem is the amount of computation, which becomes almost unacceptable when the number of documents is very large. There are specialized ways to optimize the cosine method that greatly reduce the computation; that approach is accurate, but it is harder to implement.

The principle behind this interface is simple: for a document, extract its interesting terms (terms with a high TF×IDF score), then use Lucene to search for documents containing the same terms and treat them as similar documents. The advantage of this method is efficiency; the disadvantage is lower accuracy. The interface exposes a number of parameters that control how the interesting terms are selected.
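
As a rough illustration of picking interesting terms, the sketch below ranks a document's terms by TF×IDF in plain Java and keeps the top k. It does not reproduce MoreLikeThis itself. The class InterestingTermsSketch, its parameters, and the particular idf formula, 1 + ln(numDocs / (docFreq + 1)), are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: names and the idf formula are assumptions for this sketch.
class InterestingTermsSketch {

    // TF x IDF score: frequent in this document, rare across the collection.
    static double score(int termFreq, int docFreq, int numDocs) {
        double idf = 1.0 + Math.log(numDocs / (double) (docFreq + 1));
        return termFreq * idf;
    }

    // Returns up to k terms with the highest score, best first.
    static List<String> topTerms(Map<String, Integer> termFreqs,
                                 Map<String, Integer> docFreqs,
                                 int numDocs, int k) {
        List<String> terms = new ArrayList<>(termFreqs.keySet());
        terms.sort((x, y) -> {
            double sx = score(termFreqs.get(x), docFreqs.getOrDefault(x, 0), numDocs);
            double sy = score(termFreqs.get(y), docFreqs.getOrDefault(y, 0), numDocs);
            int byScore = Double.compare(sy, sx); // descending by score
            return byScore != 0 ? byScore : x.compareTo(y); // ties: alphabetical
        });
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }
}
```

A very common word such as "the" can have a high term frequency and still rank low, because it appears in almost every document and its idf is close to 1. The selected terms would then be combined into a query, which is the step that MoreLikeThis automates.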

MoreLikeThis mlt = new MoreLikeThis(ir);
Reader target = ... // orig source of doc you want to find similarities to
Query query = mlt.like(target);
Hits hits = is.search(query);

The usage is simple, and the search returns the similar documents.

The interface is also flexible: instead of calling like directly, you can call
retrieveInterestingTerms(Reader r)
to get the interesting terms and then process them according to your own needs.
