Lucene in Action notes: term vectors

Source: Internet
Author: User

Leveraging term vectors
A term vector is, for a text field of a document (such as title or body), a frequency vector in a multidimensional term space: each distinct term is one dimension, and the value of that dimension is the term's frequency in the field.
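As a plain-Java illustration of the idea (no Lucene involved; the whitespace tokenizer is a simplification of real analysis), a field's text reduces to a term → frequency map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TermVectorDemo {
    // Build a simple term-frequency vector: each distinct term maps to its count.
    static Map<String, Integer> termVector(String fieldText) {
        Map<String, Integer> vector = new LinkedHashMap<>();
        for (String term : fieldText.toLowerCase().split("\\s+")) {
            vector.merge(term, 1, Integer::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> v = termVector("the quick fox and the lazy dog");
        System.out.println(v.get("the")); // "the" occurs twice in this field
    }
}
```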

If you want to use term vectors, you need to enable the term vector option on the field when indexing:

Field options for term vectors
TermVector.YES – record the unique terms that occurred, and their counts, in each document, but do not store any positions or offsets information.
TermVector.WITH_POSITIONS – record the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.
TermVector.WITH_OFFSETS – record the unique terms and their counts, with the offsets (start and end character) of each occurrence of every term, but no positions.
TermVector.WITH_POSITIONS_OFFSETS – store unique terms and their counts, along with positions and offsets.
TermVector.NO – do not store any term vector information.
If Index.NO is specified for a field, then you must also specify TermVector.NO.

So, given a document ID and field name, after indexing we can read the term vector from an IndexReader (provided term vectors were enabled at index time):
TermFreqVector termFreqVector = reader.getTermFreqVector(id, "subject");
You can traverse this TermFreqVector to get each term and its frequency, and if you chose to store offsets and positions when indexing, you can retrieve those here as well.

With term vectors we can build some interesting applications:
1) Books like this
To compare the similarity of two books, abstract each book into a document with author and subject fields; the two books are then compared using these two fields.
The author field is multi-valued, meaning there can be several authors, so the first step is to match on the same authors:
String[] authors = doc.getValues("author");
BooleanQuery authorQuery = new BooleanQuery();
for (int i = 0; i < authors.length; i++) {
    String author = authors[i];
    authorQuery.add(new TermQuery(new Term("author", author)), BooleanClause.Occur.SHOULD);
}
authorQuery.setBoost(2.0f);
Finally, the boost of this query is set high, indicating that this condition is important and carries more weight: if the authors are the same, the books are very likely similar.
The second step uses the term vector: we simply check whether the terms in the subject field's term vector are the same.
TermFreqVector vector = reader.getTermFreqVector(id, "subject");
BooleanQuery subjectQuery = new BooleanQuery();
for (int j = 0; j < vector.size(); j++) {
    TermQuery tq = new TermQuery(new Term("subject", vector.getTerms()[j]));
    subjectQuery.add(tq, BooleanClause.Occur.SHOULD);
}

2) What category?
This is a bit more advanced than the previous example: how do we classify a document by subject, given that we have its term vector?
For any two documents, we can compare their term vectors in vector space: the smaller the angle between them, the more similar the two documents.
Since classification requires a training phase, we first build a term vector for each category to serve as the standard against which other documents are compared.
Here a Map is used to represent a term vector, as (term, frequency) pairs; a map with n entries represents an n-dimensional vector. We generate one term vector per category, and the categories can be connected to their term vectors with another map. To build a category's term vector:
iterate over each document in the class, take the document's term vector, and add it to the category's term vector.
private void addTermFreqToMap(Map vectorMap, TermFreqVector termFreqVector) {
    String[] terms = termFreqVector.getTerms();
    int[] freqs = termFreqVector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        String term = terms[i];
        if (vectorMap.containsKey(term)) {
            Integer value = (Integer) vectorMap.get(term);
            vectorMap.put(term, new Integer(value.intValue() + freqs[i]));
        } else {
            vectorMap.put(term, new Integer(freqs[i]));
        }
    }
}
First the term and frequency lists are taken from the document's term vector; then, for each term, the document's term frequency is added to the corresponding entry in the category's term vector.
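A runnable, stdlib-only sketch of this accumulation step (the terms and freqs arrays stand in for what TermFreqVector.getTerms() and getTermFrequencies() would return):

```java
import java.util.HashMap;
import java.util.Map;

public class CategoryVectorDemo {
    // Add one document's (term, frequency) pairs into the category's vector.
    static void addTermFreqToMap(Map<String, Integer> vectorMap,
                                 String[] terms, int[] freqs) {
        for (int i = 0; i < terms.length; i++) {
            vectorMap.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> category = new HashMap<>();
        // Two documents belonging to the same category:
        addTermFreqToMap(category, new String[] {"java", "lucene"}, new int[] {3, 1});
        addTermFreqToMap(category, new String[] {"lucene", "search"}, new int[] {2, 5});
        System.out.println(category.get("lucene")); // frequencies 1 + 2 accumulated
    }
}
```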

With a term vector for each category, we can now compute the angle between a document's vector and a class's vector:
cos θ = (A · B) / (|A| |B|)
A · B is the dot product: the two vectors are multiplied dimension by dimension and the products are summed.
For simplicity, assume a term's frequency in the document is only ever 0 or 1, meaning the term either appears or it does not.
private double computeAngle(String[] words, String category) {
    // Assume words are unique and only occur once
    Map vectorMap = (Map) categoryMap.get(category);
    int dotProduct = 0;
    int sumOfSquares = 0;
    for (int i = 0; i < words.length; i++) {
        String word = words[i];
        int categoryWordFreq = 0;
        if (vectorMap.containsKey(word)) {
            categoryWordFreq = ((Integer) vectorMap.get(word)).intValue();
        }
        dotProduct += categoryWordFreq; // optimized because we assume frequency in words is 1
        sumOfSquares += categoryWordFreq * categoryWordFreq;
    }
    double denominator;
    if (sumOfSquares == words.length) {
        // avoid precision issues for special case
        denominator = sumOfSquares; // sqrt(x) * sqrt(x) = x
    } else {
        denominator = Math.sqrt(sumOfSquares) * Math.sqrt(words.length);
    }
    double ratio = dotProduct / denominator;
    return Math.acos(ratio);
}
This function is a straightforward implementation of the formula above.
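Without the 0/1 frequency assumption, the same formula can be applied to two arbitrary term-frequency maps. This is a stdlib-only sketch (not from the book):

```java
import java.util.HashMap;
import java.util.Map;

public class CosineDemo {
    // cos θ = (A · B) / (|A| |B|) over sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        long dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (long) e.getValue() * e.getValue();
            Integer bFreq = b.get(e.getKey());
            if (bFreq != null) {
                dot += (long) e.getValue() * bFreq; // only shared dimensions contribute
            }
        }
        for (int f : b.values()) {
            normB += (long) f * f;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = new HashMap<>();
        doc.put("lucene", 2);
        doc.put("search", 1);
        Map<String, Integer> category = new HashMap<>();
        category.put("lucene", 4);
        category.put("search", 2);
        // Vectors pointing in the same direction give a cosine of (almost exactly) 1
        System.out.println(cosine(doc, category));
    }
}
```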

3) MoreLikeThis

Lucene also provides a more efficient interface for finding similar documents: the MoreLikeThis interface.

Http://lucene.apache.org/java/1_9_1/api/org/apache/lucene/search/similar/MoreLikeThis.html

With the method above we can compute the cosine between every pair of documents and then sort by cosine to find the most similar ones. The biggest problem with this approach is the amount of computation: when the number of documents is very large, it becomes almost unacceptable. There are, of course, specialized techniques for optimizing the cosine method that greatly reduce the computation; the method stays accurate, but the barrier to entry is higher.

The principle of this interface is simple: for a given document, we extract only the interesting terms (those with high TF×IDF), then use Lucene to search for documents containing the same terms and treat those as similar documents. The advantage of this method is efficiency; the disadvantage is that it is less accurate. The interface provides a number of parameters you can configure to control how the interesting terms are selected.
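To illustrate the TF×IDF idea behind "interesting terms" (a hypothetical stdlib-only sketch; MoreLikeThis's actual scoring and cutoff heuristics differ):

```java
public class TfIdfDemo {
    // tf * idf with idf = ln(N / df): terms frequent in the document
    // but rare in the corpus score high and are "interesting".
    static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // "lucene": 5 occurrences in the doc, but found in only 10 docs overall.
        double rare = tfIdf(5, 10, numDocs);
        // "the": 20 occurrences, but found in nearly every doc.
        double common = tfIdf(20, 990, numDocs);
        System.out.println(rare > common); // the rare term is the interesting one
    }
}
```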

MoreLikeThis mlt = new MoreLikeThis(ir);

Reader target = ... // original source of the doc you want to find similarities to

Query query = mlt.like(target);

Hits hits = is.search(query);

The usage is this simple, and you get the similar documents.

The interface is also flexible: instead of calling the like() method directly, you can call
retrieveInterestingTerms(Reader r)
to obtain the interesting terms themselves and then process them according to your own needs.
