MoreLikeThis: Similar-Document Search

Source: http://www.cnblogs.com/huangfox/archive/2012/07/05/2578179.html

MoreLikeThis (MLT) performs similar-document search: given one document, it finds other documents that resemble it. This is common in features such as "more news like this" and "related articles", and it relies entirely on content-based analysis.

1) Using MoreLikeThis

    // Lucene 3.x API (IndexReader.open, TermFreqVector)
    FSDirectory directory = SimpleFSDirectory.open(new File("D:/nrttest2"));
    IndexReader reader = IndexReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] {"AB"}); // fields used for the calculation

    int docNum = 1;
    TermFreqVector vector = reader.getTermFreqVector(docNum, "AB");
    System.out.println(vector.toString());

    Query query = mlt.like(docNum); // try to find documents similar to docNum = 1
    System.out.println(reader.document(docNum));
    System.out.println(query.toString()); // inspect the constructed query; from here on it is the normal Lucene search process

    TopDocs topDocs = searcher.search(query, 10);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc sdoc : scoreDocs) {
        Document doc = reader.document(sdoc.doc);
        System.out.println(doc.get("Ti"));
        // System.out.println(doc.get("Ti"));
    }

2) Reading the MoreLikeThis source code

Running MLT is easy, so let's take a closer look at how it works.

The key call is Query query = mlt.like(docNum); so we start there.

2.1) like

    public Query like(int docNum) throws IOException {
        if (fieldNames == null) {
            // gather list of valid fields from lucene
            Collection<String> fields = ReaderUtil.getIndexedFields(ir);
            fieldNames = fields.toArray(new String[fields.size()]);
        }
        return createQuery(retrieveTerms(docNum));
    }

fieldNames lists the fields that take part in the "more like this" computation. It is set through the setFieldNames method of the MoreLikeThis object; if it was never set, like() falls back to all indexed fields, as the code above shows.

2.2) retrieveTerms

    public PriorityQueue<Object[]> retrieveTerms(int docNum) throws IOException {
        Map<String,Int> termFreqMap = new HashMap<String,Int>();
        for (int i = 0; i < fieldNames.length; i++) {
            String fieldName = fieldNames[i];
            TermFreqVector vector = ir.getTermFreqVector(docNum, fieldName); // fetch the term vector

            // If the field was indexed without a term vector, the terms have to be
            // recomputed: the stored text is tokenized again and the term frequencies
            // are counted. Note that the default analyzer here is StandardAnalyzer.
            if (vector == null) {
                Document d = ir.document(docNum);
                String text[] = d.getValues(fieldName);
                if (text != null) {
                    for (int j = 0; j < text.length; j++) {
                        addTermFrequencies(new StringReader(text[j]), termFreqMap, fieldName);
                    }
                }
            } else {
                // If the term vector was stored at index time, this is much simpler.
                addTermFrequencies(termFreqMap, vector);
            }
        }
        return createQueue(termFreqMap);
    }
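To make the fallback branch more concrete, here is a minimal sketch of "tokenize the stored text again and count term frequencies" in plain Java, without Lucene. The class FallbackTermCounter and its method countTerms are hypothetical names made up for this illustration, and the regex-based splitting is only a crude stand-in for a real analyzer such as StandardAnalyzer; it also uses Integer values rather than the Int holder of the original code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of the "no term vector" fallback: re-tokenize and count. */
final class FallbackTermCounter {

    /**
     * Counts term frequencies over all stored values of one field.
     * Assumption: lower-casing and splitting on non-letter/non-digit characters
     * stands in for a real analyzer.
     */
    static Map<String, Integer> countTerms(String[] fieldValues) {
        Map<String, Integer> termFreqMap = new HashMap<String, Integer>();
        if (fieldValues == null) {
            return termFreqMap; // the field has no stored values
        }
        for (String text : fieldValues) {
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (token.isEmpty()) {
                    continue; // split() can yield an empty leading token
                }
                Integer old = termFreqMap.get(token);
                termFreqMap.put(token, old == null ? 1 : old + 1);
            }
        }
        return termFreqMap;
    }

    public static void main(String[] args) {
        String[] values = { "Lucene search, more Lucene", "similar search" };
        System.out.println(countTerms(values));
    }
}
```

The point of the sketch is only that this branch costs an extra analysis pass per field value, which is why storing term vectors at index time makes the lookup cheaper.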

2.3) addTermFrequencies

The map is keyed by term text only, not by field. So whether a term comes from the title or the body, the frequencies of identical terms are added together. That is the job of addTermFrequencies.

The accumulated results are stored in termFreqMap.

    private void addTermFrequencies(Map<String,Int> termFreqMap, TermFreqVector vector) {
        String[] terms = vector.getTerms();
        int freqs[] = vector.getTermFrequencies();
        for (int j = 0; j < terms.length; j++) {
            String term = terms[j];

            if (isNoiseWord(term)) {
                continue;
            }
            // increment frequency
            Int cnt = termFreqMap.get(term);
            if (cnt == null) {
                cnt = new Int();
                termFreqMap.put(term, cnt);
                cnt.x = freqs[j];
            } else {
                cnt.x += freqs[j];
            }
        }
    }
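The cross-field accumulation can be tried out in isolation. The following is a small sketch in plain Java, not the Lucene code itself: TermFreqAccumulator and Counter are hypothetical names, Counter plays the role of the Int holder in the listing above, and the noise-word check is left out to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of summing term frequencies across fields. */
final class TermFreqAccumulator {

    /** Mutable counter, mirroring the role of the Int holder in the listing above. */
    static final class Counter {
        int x;
    }

    /** Adds one field's (term, frequency) pairs into the shared map, keyed by term text only. */
    static void addTermFrequencies(Map<String, Counter> termFreqMap, String[] terms, int[] freqs) {
        for (int j = 0; j < terms.length; j++) {
            Counter cnt = termFreqMap.get(terms[j]);
            if (cnt == null) {
                cnt = new Counter();
                termFreqMap.put(terms[j], cnt);
            }
            cnt.x += freqs[j]; // same term text from another field lands in the same counter
        }
    }

    public static void main(String[] args) {
        Map<String, Counter> map = new HashMap<String, Counter>();
        // e.g. a title field
        addTermFrequencies(map, new String[] { "lucene", "search" }, new int[] { 1, 1 });
        // e.g. a body field
        addTermFrequencies(map, new String[] { "lucene", "index" }, new int[] { 3, 2 });
        System.out.println("lucene -> " + map.get("lucene").x); // prints "lucene -> 4"
    }
}
```

Using a mutable holder avoids re-inserting a boxed Integer on every increment, which is presumably why the original code uses its own Int class.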

At this point, the terms of the specified document in the specified fields, together with their frequencies, are stored in the map. Along the way we saw a nice-sounding operation: noise removal.

So how does it decide whether a term is noise?

    private boolean isNoiseWord(String term) {
        int len = term.length();
        if (minWordLen > 0 && len < minWordLen) {
            return true;
        }
        if (maxWordLen > 0 && len > maxWordLen) {
            return true;
        }
        if (stopWords != null && stopWords.contains(term)) {
            return true;
        }
        return false;
    }

The criteria are very simple. First, is the term a configured stop word? Second, is the term too short or too long? The allowed range is controlled by minWordLen and maxWordLen, and a value of 0 disables the corresponding limit.
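The same three checks can be exercised on their own. Below is a minimal standalone sketch; NoiseWordFilter is a hypothetical class name, and the limits 3 and 10 and the stop words "the" and "and" are illustrative values chosen for this example, not defaults taken from Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical standalone version of the noise-word check described above. */
final class NoiseWordFilter {
    private final int minWordLen;        // 0 disables the lower bound
    private final int maxWordLen;        // 0 disables the upper bound
    private final Set<String> stopWords; // null disables the stop-word check

    NoiseWordFilter(int minWordLen, int maxWordLen, Set<String> stopWords) {
        this.minWordLen = minWordLen;
        this.maxWordLen = maxWordLen;
        this.stopWords = stopWords;
    }

    boolean isNoiseWord(String term) {
        int len = term.length();
        if (minWordLen > 0 && len < minWordLen) {
            return true; // too short
        }
        if (maxWordLen > 0 && len > maxWordLen) {
            return true; // too long
        }
        return stopWords != null && stopWords.contains(term);
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<String>(Arrays.asList("the", "and"));
        NoiseWordFilter filter = new NoiseWordFilter(3, 10, stops); // illustrative limits
        for (String t : new String[] { "of", "the", "lucene", "internationalization" }) {
            System.out.println(t + " is noise: " + filter.isNoiseWord(t));
        }
    }
}
```

With these settings "of" is rejected for length, "the" as a stop word, and "internationalization" for exceeding the upper bound, while "lucene" passes.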

2.4) createQueue

The queue here is a priority queue. The previous step gave us all <term, frequency> pairs. Even after noise removal there are still too many terms, so we need to pick out the top N relatively most important ones.

    private PriorityQueue<Object[]> createQueue(Map<String,Int> words) throws IOException {
        // total number of documents in the current index
        int numDocs = ir.numDocs();
        FreqQ res = new FreqQ(words.size()); // ordered by term score

        Iterator<String> it = words.keySet().iterator();
        while (it.hasNext()) { // iterate over all terms
            String word = it.next();

            int tf = words.get(word).x; // tf of the term
            if (minTermFreq > 0 && tf < minTermFreq) {
                continue; // similar to noise removal: terms whose tf is too small are dropped
            }

            // for the same term, find the field with the largest df and store it in topField
            String topField = fieldNames[0];
            int docFreq = 0;
            for (int i = 0; i < fieldNames.length; i++) {
                int freq = ir.docFreq(new Term(fieldNames[i], word));
                topField = (freq > docFreq) ? fieldNames[i] : topField;
                // ...
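The listing above stops before the scoring step, so the following is only a sketch of how "keep the top N most important terms" could look, not a reconstruction of the missing code. It assumes a tf × idf score with idf = ln(numDocs / (docFreq + 1)) + 1, skips terms whose document frequency is unknown or 0, and uses java.util.PriorityQueue in place of Lucene's own queue class. TopTermSelector, ScoredTerm, topTerms and idf are hypothetical names, and the numbers in main are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch: select the n highest-scoring terms by tf * idf. */
final class TopTermSelector {

    static final class ScoredTerm {
        final String term;
        final double score;

        ScoredTerm(String term, double score) {
            this.term = term;
            this.score = score;
        }
    }

    /** Assumed idf formula: ln(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** Returns up to n terms, best first. tf = frequency in the source document, df = document frequency. */
    static List<String> topTerms(Map<String, Integer> tf, Map<String, Integer> df, int numDocs, int n) {
        // Min-heap on score: the weakest retained term sits at the head and is evicted first.
        PriorityQueue<ScoredTerm> heap =
                new PriorityQueue<ScoredTerm>(Math.max(1, n), (a, b) -> Double.compare(a.score, b.score));
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer docFreq = df.get(e.getKey());
            if (docFreq == null || docFreq == 0) {
                continue; // assumption: terms with no document frequency are skipped
            }
            heap.add(new ScoredTerm(e.getKey(), e.getValue() * idf(docFreq, numDocs)));
            if (heap.size() > n) {
                heap.poll(); // drop the current lowest score
            }
        }
        List<String> result = new ArrayList<String>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().term); // ascending by score
        }
        Collections.reverse(result); // best first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        Map<String, Integer> df = new HashMap<String, Integer>();
        tf.put("lucene", 3); df.put("lucene", 9);
        tf.put("search", 5); df.put("search", 499);
        tf.put("index", 2);  df.put("index", 99);
        tf.put("also", 4);   df.put("also", 999);
        System.out.println(topTerms(tf, df, 1000, 2));
    }
}
```

Keeping a bounded min-heap means only n entries are held at any time; a term that is frequent in the source document but rare in the index ends up ranked highest.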