MoreLikeThis: Similar-Document Search

Last Update:2018-08-02 Source: Internet

Author: User


Source: http://www.cnblogs.com/huangfox/archive/2012/07/05/2578179.html

MoreLikeThis performs similarity search: given a document, it finds other documents that resemble it. It is common in features such as "similar news" and "related articles", and it relies entirely on content-based analysis.

1) Using MoreLikeThis

    FSDirectory directory = SimpleFSDirectory.open(new File("D:/nrttest2"));
    IndexReader reader = IndexReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] {"AB"}); // fields used for the calculation

    int docNum = 1;
    TermFreqVector vector = reader.getTermFreqVector(docNum, "AB");
    System.out.println(vector.toString());
    Query query = mlt.like(docNum); // try to find documents similar to docNum = 1
    System.out.println(reader.document(docNum));
    System.out.println(query.toString()); // view the constructed query

    // From here on it is the normal Lucene retrieval process.
    TopDocs topDocs = searcher.search(query, 10);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc sdoc : scoreDocs) {
        Document doc = reader.document(sdoc.doc);
        System.out.println(doc.get("Ti"));
        // System.out.println(doc.get("Ti"));
    }

2) How MoreLikeThis works: reading the source code

Running MLT is easy, so let's take a closer look at how it works.

The key call is Query query = mlt.like(docNum); so we start there.

2.1) like

    public Query like(int docNum) throws IOException {
        if (fieldNames == null) {
            // Gather list of valid fields from Lucene
            Collection<String> fields = ReaderUtil.getIndexedFields(ir);
            fieldNames = fields.toArray(new String[fields.size()]);
        }
        return createQuery(retrieveTerms(docNum));
    }

fieldNames holds the fields that take part in the "more like this" computation. It is set through the setFieldNames method of the MoreLikeThis object; if it is left unset, like() falls back to all indexed fields.

2.2) retrieveTerms

    public PriorityQueue<Object[]> retrieveTerms(int docNum) throws IOException {
        Map<String, Int> termFreqMap = new HashMap<String, Int>();
        for (int i = 0; i < fieldNames.length; i++) {
            String fieldName = fieldNames[i];
            TermFreqVector vector = ir.getTermFreqVector(docNum, fieldName); // fetch the term vector

            // If the field did not store a term vector, the terms must be recomputed:
            // the stored text is tokenized and the term frequencies are counted.
            // Note that this uses the default analyzer, StandardAnalyzer.
            if (vector == null) {
                Document d = ir.document(docNum);
                String text[] = d.getValues(fieldName);
                if (text != null) {
                    for (int j = 0; j < text.length; j++) {
                        addTermFrequencies(new StringReader(text[j]), termFreqMap, fieldName);
                    }
                }
            } else {
                // If a term vector was stored at index time, this is much simpler.
                addTermFrequencies(termFreqMap, vector);
            }
        }
        return createQueue(termFreqMap);
    }
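When a field has no stored term vector, the fallback path boils down to tokenizing the stored text and counting each token. The following is a minimal sketch of that idea in plain Java, without Lucene. The class name TextTermCounter and its tokenizer (lowercase, split on anything that is not a letter or digit) are assumptions made for illustration; they only roughly stand in for what StandardAnalyzer does.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts term frequencies in raw text.
final class TextTermCounter {

    // Adds the term counts of `text` into `termFreqMap`; existing counts are kept.
    static void addTermFrequencies(String text, Map<String, Integer> termFreqMap) {
        // Assumption: lowercase, then split on anything that is not a letter or digit.
        // A real analyzer such as StandardAnalyzer applies more elaborate rules.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a delimiter
            }
            termFreqMap.merge(token, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new HashMap<>();
        // A multi-valued field is handled by calling the method once per stored value.
        addTermFrequencies("Lucene search, Lucene index", freqs);
        addTermFrequencies("index merge", freqs);
        System.out.println(freqs);
    }
}
```

Because every call writes into the same map, the values of a multi-valued field simply add up, which mirrors the loop over text[j] in retrieveTerms.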

2.3) addTermFrequencies

Terms are accumulated without regard to the field they came from: whether a term appears in the title or in the body, the frequencies of identical term text are added together. That is the job of addTermFrequencies.

The accumulated results are stored in termFreqMap.

    private void addTermFrequencies(Map<String, Int> termFreqMap, TermFreqVector vector) {
        String[] terms = vector.getTerms();
        int freqs[] = vector.getTermFrequencies();
        for (int j = 0; j < terms.length; j++) {
            String term = terms[j];

            if (isNoiseWord(term)) {
                continue;
            }
            // Increment frequency
            Int cnt = termFreqMap.get(term);
            if (cnt == null) {
                cnt = new Int();
                termFreqMap.put(term, cnt);
                cnt.x = freqs[j];
            } else {
                cnt.x += freqs[j];
            }
        }
    }
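The effect of keying the map on term text alone can be seen in a small standalone sketch. It is not Lucene code: the class FieldFrequencyMerger is hypothetical, and it uses Integer values with Map.merge in place of the mutable Int counter. The parallel arrays mimic what getTerms() and getTermFrequencies() return for one field.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the accumulation step; not Lucene code.
final class FieldFrequencyMerger {

    // Adds one field's (term, frequency) pairs into the shared map.
    static void accumulate(Map<String, Integer> termFreqMap, String[] terms, int[] freqs) {
        for (int j = 0; j < terms.length; j++) {
            // The key is the term text only, so the same word from different fields is summed.
            termFreqMap.merge(terms[j], freqs[j], Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        accumulate(termFreqMap, new String[] {"lucene", "search"}, new int[] {1, 1}); // e.g. a title field
        accumulate(termFreqMap, new String[] {"index", "lucene"}, new int[] {2, 3});  // e.g. a body field
        System.out.println(termFreqMap.get("lucene")); // prints 4 (1 from the title + 3 from the body)
    }
}
```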

At this point, the terms of the specified document (the document being matched) in the specified fields, together with their frequencies, have been stored in the map. Along the way there is one step worth a closer look: noise removal.

So how is a term judged to be noise?

    private boolean isNoiseWord(String term) {
        int len = term.length();
        if (minWordLen > 0 && len < minWordLen) {
            return true;
        }
        if (maxWordLen > 0 && len > maxWordLen) {
            return true;
        }
        if (stopWords != null && stopWords.contains(term)) {
            return true;
        }
        return false;
    }

The criteria are very simple. First, is the term too short or too long? The allowed range is controlled by minWordLen and maxWordLen, and a value of 0 disables the corresponding bound. Second, is the term one of the configured stop words?

2.4) createQueue
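To see how these settings interact, here is a small standalone version of the check that can be run without Lucene. The class NoiseWordFilter is hypothetical, and the limits (3 and 10) and the stop-word list are arbitrary values chosen for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical standalone version of the noise check, for experimentation only.
final class NoiseWordFilter {
    private final int minWordLen;        // 0 disables the lower bound
    private final int maxWordLen;        // 0 disables the upper bound
    private final Set<String> stopWords; // null disables the stop-word check

    NoiseWordFilter(int minWordLen, int maxWordLen, Set<String> stopWords) {
        this.minWordLen = minWordLen;
        this.maxWordLen = maxWordLen;
        this.stopWords = stopWords;
    }

    boolean isNoiseWord(String term) {
        int len = term.length();
        if (minWordLen > 0 && len < minWordLen) {
            return true;
        }
        if (maxWordLen > 0 && len > maxWordLen) {
            return true;
        }
        return stopWords != null && stopWords.contains(term);
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "and"));
        NoiseWordFilter filter = new NoiseWordFilter(3, 10, stop);
        System.out.println(filter.isNoiseWord("an"));                   // true: shorter than 3
        System.out.println(filter.isNoiseWord("the"));                  // true: stop word
        System.out.println(filter.isNoiseWord("internationalization")); // true: longer than 10
        System.out.println(filter.isNoiseWord("lucene"));               // false
    }
}
```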

The queue here is a priority queue. The previous step produced all <term, frequency> pairs; even after noise removal there are still too many terms, so we need to pick out the top N terms by relative importance.

    private PriorityQueue<Object[]> createQueue(Map<String, Int> words) throws IOException {
        // Get the total number of documents in the current index.
        int numDocs = ir.numDocs();
        FreqQ res = new FreqQ(words.size()); // ordered by term score

        Iterator<String> it = words.keySet().iterator();
        while (it.hasNext()) { // iterate over all terms
            String word = it.next();

            int tf = words.get(word).x; // TF of the term
            if (minTermFreq > 0 && tf < minTermFreq) {
                continue; // similar to noise removal: terms whose TF is too small are dropped
            }

            // For the same term, find the field with the largest DF and store it in topField.
            String topField = fieldNames[0];
            int docFreq = 0;
            for (int i = 0; i < fieldNames.length; i++) {
                int freq = ir.docFreq(new Term(fieldNames[i], word));
                topField = (freq > docFreq) ? fieldNames[i] : topField;
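The listing stops before the scoring step, so the following is only a sketch of how a "top N terms" selection with a priority queue could look, not the library's code. It assumes a tf * idf score with idf = ln(numDocs / (docFreq + 1)) + 1; the class TopTermSelector, the ScoredTerm type, and the choice to skip terms with a document frequency of 0 are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch of top-N term selection; not the Lucene implementation.
final class TopTermSelector {

    static final class ScoredTerm {
        final String term;
        final double score;
        ScoredTerm(String term, double score) { this.term = term; this.score = score; }
    }

    // Assumed scoring: tf * idf, with idf = ln(numDocs / (docFreq + 1)) + 1.
    static double score(int tf, int docFreq, int numDocs) {
        double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        return tf * idf;
    }

    // Returns up to n terms with the highest scores, best first.
    static List<String> topTerms(Map<String, Integer> termFreqs,
                                 Map<String, Integer> docFreqs,
                                 int numDocs, int n) {
        // Min-heap on score: the weakest kept term sits at the head and is evicted first.
        PriorityQueue<ScoredTerm> heap =
                new PriorityQueue<>(Comparator.comparingDouble((ScoredTerm t) -> t.score));
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int docFreq = docFreqs.getOrDefault(e.getKey(), 0);
            if (docFreq == 0) {
                continue; // assumption: ignore terms the index does not know
            }
            heap.add(new ScoredTerm(e.getKey(), score(e.getValue(), docFreq, numDocs)));
            if (heap.size() > n) {
                heap.poll(); // drop the current lowest score
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().term); // ascending by score
        }
        Collections.reverse(result);      // best first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = new HashMap<>();
        tf.put("lucene", 3);
        tf.put("search", 5);
        tf.put("index", 2);
        Map<String, Integer> df = new HashMap<>();
        df.put("lucene", 10);
        df.put("search", 500);
        df.put("index", 50);
        System.out.println(topTerms(tf, df, 1000, 2)); // prints [lucene, search]
    }
}
```

In this example "search" has the highest raw frequency, yet "lucene" ranks first because it occurs in far fewer documents; that is the sense in which the queue keeps the relatively important terms rather than simply the most frequent ones.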

