Lucene in Action notes: term vectors

Source: Internet
Author: User

Leveraging term vectors
A term vector is, for a text field of a document (such as title or body), a frequency vector in a multidimensional term space: each distinct term is one dimension, and the value of that dimension is the term's frequency in the field.
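As a plain-Java illustration of the idea (no Lucene involved; the whitespace tokenizer is a simplification of real analysis), a field's text reduces to a term → frequency map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TermVectorDemo {
    // Build a simple term-frequency vector: each distinct term maps to its count.
    static Map<String, Integer> termVector(String fieldText) {
        Map<String, Integer> vector = new LinkedHashMap<>();
        for (String term : fieldText.toLowerCase().split("\\s+")) {
            vector.merge(term, 1, Integer::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> v = termVector("the quick fox and the lazy dog");
        System.out.println(v.get("the")); // "the" occurs twice in this field
    }
}
```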

If you want to use term vectors, you need to enable the term vector option on the field when indexing:

Field options for term vectors
TermVector.YES – record the unique terms that occurred, and their counts, in each document, but do not store any positions or offsets information.
TermVector.WITH_POSITIONS – record the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.
TermVector.WITH_OFFSETS – record the unique terms and their counts, with the offsets (start and end character) of each occurrence of every term, but no positions.
TermVector.WITH_POSITIONS_OFFSETS – store unique terms and their counts, along with positions and offsets.
TermVector.NO – do not store any term vector information.
If Index.NO is specified for a field, then you must also specify TermVector.NO.

So, given a document ID and field name, after indexing we can read the term vector from an IndexReader (provided term vectors were enabled at index time):
TermFreqVector termFreqVector = reader.getTermFreqVector(id, "subject");
You can traverse this TermFreqVector to get each term and its frequency, and if you chose to store offsets and positions when indexing, you can retrieve those here as well.

With term vectors we can build some interesting applications:
1) Books like this
To compare the similarity of two books, abstract each book into a document with author and subject fields; the two books are then compared using these two fields.
The author field is multi-valued, meaning there can be several authors, so the first step is to match on the same authors:
String[] authors = doc.getValues("author");
BooleanQuery authorQuery = new BooleanQuery();
for (int i = 0; i < authors.length; i++) {
    String author = authors[i];
    authorQuery.add(new TermQuery(new Term("author", author)), BooleanClause.Occur.SHOULD);
}
authorQuery.setBoost(2.0f);
Finally, the boost of this query is set high, indicating that this condition is important and carries more weight: if the authors are the same, the books are very likely similar.
The second step uses the term vector: we simply check whether the terms in the subject field's term vector are the same.
TermFreqVector vector = reader.getTermFreqVector(id, "subject");
BooleanQuery subjectQuery = new BooleanQuery();
for (int j = 0; j < vector.size(); j++) {
    TermQuery tq = new TermQuery(new Term("subject", vector.getTerms()[j]));
    subjectQuery.add(tq, BooleanClause.Occur.SHOULD);
}

2) What category?
This is a bit more advanced than the previous example: how do we classify a document by subject, given that we have its term vector?
For any two documents, we can compare their term vectors in vector space: the smaller the angle between them, the more similar the two documents.
Since classification requires a training phase, we first build a term vector for each category to serve as the standard against which other documents are compared.
Here a Map is used to represent a term vector, as (term, frequency) pairs; a map with n entries represents an n-dimensional vector. We generate one term vector per category, and the categories can be connected to their term vectors with another map. To build a category's term vector:
iterate over each document in the class, take the document's term vector, and add it to the category's term vector.
private void addTermFreqToMap(Map vectorMap, TermFreqVector termFreqVector) {
    String[] terms = termFreqVector.getTerms();
    int[] freqs = termFreqVector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        String term = terms[i];
        if (vectorMap.containsKey(term)) {
            Integer value = (Integer) vectorMap.get(term);
            vectorMap.put(term, new Integer(value.intValue() + freqs[i]));
        } else {
            vectorMap.put(term, new Integer(freqs[i]));
        }
    }
}
First the term and frequency lists are taken from the document's term vector; then, for each term, the document's term frequency is added to the corresponding entry in the category's term vector.
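A runnable, stdlib-only sketch of this accumulation step (the terms and freqs arrays stand in for what TermFreqVector.getTerms() and getTermFrequencies() would return):

```java
import java.util.HashMap;
import java.util.Map;

public class CategoryVectorDemo {
    // Add one document's (term, frequency) pairs into the category's vector.
    static void addTermFreqToMap(Map<String, Integer> vectorMap,
                                 String[] terms, int[] freqs) {
        for (int i = 0; i < terms.length; i++) {
            vectorMap.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> category = new HashMap<>();
        // Two documents belonging to the same category:
        addTermFreqToMap(category, new String[] {"java", "lucene"}, new int[] {3, 1});
        addTermFreqToMap(category, new String[] {"lucene", "search"}, new int[] {2, 5});
        System.out.println(category.get("lucene")); // frequencies 1 + 2 accumulated
    }
}
```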

With a term vector for each category, we can now compute the angle between a document's vector and a class's vector:
cos θ = (A · B) / (|A| |B|)
A · B is the dot product: the two vectors are multiplied dimension by dimension and the products are summed.
For simplicity, assume a term's frequency in the document is only ever 0 or 1, meaning the term either appears or it does not.
private double computeAngle(String[] words, String category) {
    // Assume words are unique and only occur once
    Map vectorMap = (Map) categoryMap.get(category);
    int dotProduct = 0;
    int sumOfSquares = 0;
    for (int i = 0; i < words.length; i++) {
        String word = words[i];
        int categoryWordFreq = 0;
        if (vectorMap.containsKey(word)) {
            categoryWordFreq = ((Integer) vectorMap.get(word)).intValue();
        }
        dotProduct += categoryWordFreq; // optimized because we assume frequency in words is 1
        sumOfSquares += categoryWordFreq * categoryWordFreq;
    }
    double denominator;
    if (sumOfSquares == words.length) {
        // avoid precision issues for special case
        denominator = sumOfSquares; // sqrt(x) * sqrt(x) = x
    } else {
        denominator = Math.sqrt(sumOfSquares) * Math.sqrt(words.length);
    }
    double ratio = dotProduct / denominator;
    return Math.acos(ratio);
}
This function is a straightforward implementation of the formula above.
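Without the 0/1 frequency assumption, the same formula can be applied to two arbitrary term-frequency maps. This is a stdlib-only sketch (not from the book):

```java
import java.util.HashMap;
import java.util.Map;

public class CosineDemo {
    // cos θ = (A · B) / (|A| |B|) over sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        long dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (long) e.getValue() * e.getValue();
            Integer bFreq = b.get(e.getKey());
            if (bFreq != null) {
                dot += (long) e.getValue() * bFreq; // only shared dimensions contribute
            }
        }
        for (int f : b.values()) {
            normB += (long) f * f;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = new HashMap<>();
        doc.put("lucene", 2);
        doc.put("search", 1);
        Map<String, Integer> category = new HashMap<>();
        category.put("lucene", 4);
        category.put("search", 2);
        // Vectors pointing in the same direction give a cosine of (almost exactly) 1
        System.out.println(cosine(doc, category));
    }
}
```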

3) MoreLikeThis

Lucene also provides a more efficient interface for finding similar documents: the MoreLikeThis interface.

Http://lucene.apache.org/java/1_9_1/api/org/apache/lucene/search/similar/MoreLikeThis.html

With the method above we can compute the cosine between every pair of documents and then sort by cosine to find the most similar ones. The biggest problem with this approach is the amount of computation: when the number of documents is very large, it becomes almost unacceptable. There are, of course, specialized techniques for optimizing the cosine method that greatly reduce the computation; the method stays accurate, but the barrier to entry is higher.

The principle of this interface is simple: for a given document, we extract only the interesting terms (those with high TF×IDF), then use Lucene to search for documents containing the same terms and treat those as similar documents. The advantage of this method is efficiency; the disadvantage is that it is less accurate. The interface provides a number of parameters you can configure to control how the interesting terms are selected.
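To illustrate the TF×IDF idea behind "interesting terms" (a hypothetical stdlib-only sketch; MoreLikeThis's actual scoring and cutoff heuristics differ):

```java
public class TfIdfDemo {
    // tf * idf with idf = ln(N / df): terms frequent in the document
    // but rare in the corpus score high and are "interesting".
    static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // "lucene": 5 occurrences in the doc, but found in only 10 docs overall.
        double rare = tfIdf(5, 10, numDocs);
        // "the": 20 occurrences, but found in nearly every doc.
        double common = tfIdf(20, 990, numDocs);
        System.out.println(rare > common); // the rare term is the interesting one
    }
}
```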

MoreLikeThis mlt = new MoreLikeThis(ir);

Reader target = ... // original source of the doc you want to find similarities to

Query query = mlt.like(target);

Hits hits = is.search(query);

The usage is this simple, and you get the similar documents.

The interface is also flexible: instead of calling the like() method directly, you can call
retrieveInterestingTerms(Reader r)
to obtain the interesting terms themselves and then process them according to your own needs.
