Positional Inverted Index: Implementing a K-Word Proximity Search Algorithm in Java

Source: Internet
Author: User

A positional index is built on top of an inverted index: each posting additionally records the positions at which the term occurs in the document. Positional information is typically stored in a posting in the following form:

docID: (position 1, position 2, ...)
A complete postings list containing positional information is shown in the following figure:

Take the postings list in the figure as an example. The term is to and 993427 is its document frequency, meaning that to appears in 993,427 documents. Inside the outermost brackets, 1, 2, 4, 5, 7 are the IDs of documents that contain to (only 5 are listed). The number that follows each document ID is the term frequency, that is, the number of times to appears in that document. The innermost brackets hold the positions of to in the document, expressed as offsets from the beginning of the document.

With this "enhanced" inverted index we can extend what an inverted-index query can do. Here I implement a k-word proximity search algorithm based on the positional information: when searching for two words, you can specify how far apart the two words may be in a document.
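The postings layout described above can be modelled in memory roughly as follows. This is only a minimal sketch: the class `PositionalPostings` and its methods are hypothetical names introduced for illustration, not part of the article's code, and the document frequency it reports counts only the documents added to it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory model of one positional postings list (illustration only).
class PositionalPostings {
	final String term;
	// docID -> positions of the term in that document, kept sorted by docID
	final Map<Integer, List<Integer>> positionsByDoc = new TreeMap<>();

	PositionalPostings(String term) {
		this.term = term;
	}

	// Record one occurrence; positions are assumed to be added in increasing order.
	void add(int docId, int position) {
		positionsByDoc.computeIfAbsent(docId, id -> new ArrayList<>()).add(position);
	}

	// Document frequency: how many documents contain the term.
	int documentFrequency() {
		return positionsByDoc.size();
	}

	// Term frequency: how many times the term occurs in one document.
	int termFrequency(int docId) {
		List<Integer> positions = positionsByDoc.get(docId);
		return positions == null ? 0 : positions.size();
	}

	// Renders "term, df: <docID, tf: <pos, pos, ...>; ...>".
	@Override
	public String toString() {
		return term + ", " + documentFrequency() + ": "
			+ positionsByDoc.entrySet().stream()
				.map(e -> e.getKey() + ", " + e.getValue().size() + ": "
					+ e.getValue().stream().map(String::valueOf)
						.collect(Collectors.joining(", ", "<", ">")))
				.collect(Collectors.joining("; ", "<", ">"));
	}

	public static void main(String[] args) {
		PositionalPostings to = new PositionalPostings("to");
		to.add(4, 429); // two positions of "to" in document 4, as in the article's example
		to.add(4, 433);
		System.out.println(to); // to, 1: <4, 2: <429, 433>>
	}
}
```

Storing the positions per document in a sorted map keeps document IDs in ascending order, which is what a merge-style intersection of two postings lists relies on.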

For example, suppose the query is to be or not to be and we look only at to be, where to and be are adjacent, that is, there are 0 words between them. First, find the documents that contain both terms. Then check the positions in the postings to see whether some occurrence of be has an occurrence of to at the position immediately before it. One possible match in the postings lists is shown below:

to: 〈...; 4: 〈..., 429, 433〉; ...
be: 〈...; 4: 〈..., 430, 434〉; ...
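The position check for this to/be example can be sketched as below. It is a simplified illustration rather than the book's pseudocode: `ProximityCheck` and `pairsWithin` are hypothetical names, and both position lists are assumed to be sorted in ascending order. Like the Java implementation in this article, it compares the difference between two positions with k, so under that convention two adjacent words (0 words in between) correspond to k = 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: finds position pairs of two terms that lie within k of each other.
class ProximityCheck {
	// Returns every {p1, p2} with |p1 - p2| <= k; both lists must be sorted ascending.
	static List<int[]> pairsWithin(List<Integer> positions1, List<Integer> positions2, int k) {
		List<int[]> pairs = new ArrayList<>();
		for (int p1 : positions1) {
			for (int p2 : positions2) {
				if (Math.abs(p1 - p2) <= k) {
					pairs.add(new int[] {p1, p2});
				} else if (p2 > p1) {
					break; // positions2 is sorted, so later entries are even farther away
				}
			}
		}
		return pairs;
	}

	public static void main(String[] args) {
		List<Integer> to = Arrays.asList(429, 433); // positions of "to" in document 4
		List<Integer> be = Arrays.asList(430, 434); // positions of "be" in document 4
		for (int[] pair : pairsWithin(to, be, 1)) {
			System.out.println(pair[0] + "," + pair[1]); // prints 429,430 then 433,434
		}
	}
}
```

The check accepts either word order. To match the exact phrase to be, one would instead require that the position of be equals the position of to plus 1.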

The algorithm in this article comes from the book "Introduction to Information Retrieval" (Chinese translation by Wang Bin), which gives its pseudocode. I wrote this implementation in Java after finishing the course. It is a very simple implementation and the code is fairly rough, so it is not suitable for research or production use; I am posting it only for reference. The code and the index file are given below.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class PositionalIndex {

	/******************** Look up the postings list for a term ******************/
	public static String FindPostingList(String term) {
		String szPostingList = "";
		String szTermAndPosting;
		// Read the postings file, one term per line
		try (BufferedReader br = new BufferedReader(new FileReader("bin/index.txt"))) {
			while ((szTermAndPosting = br.readLine()) != null) {
				int iSplitIndex = szTermAndPosting.indexOf(","); // the term ends at the first ','
				if (iSplitIndex < 0)
					continue; // skip lines that do not look like an index entry
				String szSub = szTermAndPosting.substring(0, iSplitIndex).trim(); // get the term
				if (term.equals(szSub)) { // does the query term match this entry?
					iSplitIndex = szTermAndPosting.indexOf(":"); // ':' separates the document frequency from the postings
					szPostingList = szTermAndPosting.substring(iSplitIndex + 1); // get the postings
					break;
				}
			}
		} catch (IOException e) {
			System.err.println("Cannot read index file: " + e.getMessage());
		}
		return szPostingList; // the postings, or "" if the term was not found
	}

	/*********** Find position pairs that are at most k apart in the same document ***********/
	public static List<String> PosIntersect(String termAndPList_1, String termAndPList_2, int k) {
		List<String> lAns = new ArrayList<>();
		List<String> lTermAndList_1 = new ArrayList<>(), lTermAndList_2 = new ArrayList<>();
		StringTokenizer strToken = new StringTokenizer(termAndPList_1, ";"); // split the postings list by ';'
		while (strToken.hasMoreTokens())
			lTermAndList_1.add(strToken.nextToken()); // document ID + positions
		strToken = new StringTokenizer(termAndPList_2, ";");
		while (strToken.hasMoreTokens())
			lTermAndList_2.add(strToken.nextToken());

		for (int i = 0, j = 0; i < lTermAndList_1.size() && j < lTermAndList_2.size();) {
			// Split "docID,tf" from the positions. The delimiter was lost in the
			// original listing; a single space is assumed here.
			strToken = new StringTokenizer(lTermAndList_1.get(i), " ");
			String szTerm_1 = strToken.nextToken();
			String szPList_1 = strToken.nextToken(); // positions
			strToken = new StringTokenizer(lTermAndList_2.get(j), " ");
			String szTerm_2 = strToken.nextToken();
			String szPList_2 = strToken.nextToken();
			int iDocID_1 = Integer.parseInt(szTerm_1.substring(0, szTerm_1.indexOf(","))); // get the document ID
			int iDocID_2 = Integer.parseInt(szTerm_2.substring(0, szTerm_2.indexOf(",")));

			/****************** Compare the document IDs of the two postings *********************/
			if (iDocID_1 == iDocID_2) { // both terms occur in the same document
				List<String> lTemp = new ArrayList<>();
				List<String> lPosting_1 = new ArrayList<>();
				List<String> lPosting_2 = new ArrayList<>();
				StringTokenizer szPostingToken = new StringTokenizer(szPList_1, ","); // split the positions
				while (szPostingToken.hasMoreTokens())
					lPosting_1.add(szPostingToken.nextToken());
				szPostingToken = new StringTokenizer(szPList_2, ",");
				while (szPostingToken.hasMoreTokens())
					lPosting_2.add(szPostingToken.nextToken());
				for (int p = 0; p < lPosting_1.size(); p++) {
					int iPos_1 = Integer.parseInt(lPosting_1.get(p));
					for (int q = 0; q < lPosting_2.size(); q++) {
						int iPos_2 = Integer.parseInt(lPosting_2.get(q));
						if (Math.abs(iPos_1 - iPos_2) <= k) // distance between the two terms, compared with k
							lTemp.add(lPosting_2.get(q)); // keep the matching position from lPosting_2
						else if (iPos_2 > iPos_1)
							break; // positions are sorted, so later ones are even farther away
					}
					// Combine the position from lPosting_1 with each matching position in lTemp
					for (int x = 0; x < lTemp.size(); x++)
						lAns.add(iDocID_1 + "," + lPosting_1.get(p) + "," + lTemp.get(x));
					lTemp.clear();
				}
				i++;
				j++;
			} else if (iDocID_1 < iDocID_2)
				i++; // iDocID_1 is smaller: advance the first list
			else
				j++; // otherwise advance the second list and continue comparing
		}
		return lAns;
	}

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		String szReadLine;
		BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
		try {
			// Console input of the form "term1 term2 k"; type "exit" to quit
			while ((szReadLine = stdin.readLine()) != null && !szReadLine.equals("exit")) {
				StringTokenizer szArgsToken = new StringTokenizer(szReadLine, " "); // split by spaces
				if (szArgsToken.countTokens() < 3) {
					System.out.println("Usage: term1 term2 k");
					continue;
				}
				String szTerm_1 = szArgsToken.nextToken(); // the two terms
				String szTerm_2 = szArgsToken.nextToken();
				int k = Integer.parseInt(szArgsToken.nextToken()); // the k value
				String szTermAndPList_1 = FindPostingList(szTerm_1); // postings list of each term
				String szTermAndPList_2 = FindPostingList(szTerm_2);
				List<String> lAnswer = PosIntersect(szTermAndPList_1, szTermAndPList_2, k); // store the result
				if (lAnswer.size() == 0) // output
					System.out.println("Not found...");
				for (int i = 0; i < lAnswer.size(); i++)
					System.out.println(lAnswer.get(i));
			}
			stdin.close();
		} catch (IOException e) {
			System.err.println("Cannot read input: " + e.getMessage());
		}
	}
}

Below is a screenshot of the index file index.txt. The separators in the postings were chosen to make it easy for the program to split each part. The index was built by hand, not generated by a program.
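Although the index here was written by hand, a positional index can also be generated from text. The sketch below shows one way this might look. It is not the author's method: `IndexBuilder` and `build` are hypothetical names, tokens are simply split on whitespace and lowercased, and document IDs and positions are both assumed to start at 1. It builds an in-memory map and does not write the index.txt format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical builder: turns a list of documents into term -> (docID -> positions).
class IndexBuilder {
	// Assumptions: doc IDs and positions start at 1; tokens are whitespace-separated.
	static Map<String, Map<Integer, List<Integer>>> build(List<String> docs) {
		Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
		for (int d = 0; d < docs.size(); d++) {
			String[] tokens = docs.get(d).trim().toLowerCase().split("\\s+");
			int position = 0;
			for (String token : tokens) {
				if (token.isEmpty())
					continue; // an empty document yields one empty token
				position++;
				index.computeIfAbsent(token, t -> new TreeMap<>())
					.computeIfAbsent(d + 1, id -> new ArrayList<>())
					.add(position);
			}
		}
		return index;
	}

	public static void main(String[] args) {
		Map<String, Map<Integer, List<Integer>>> index =
			build(Arrays.asList("to be or not to be"));
		System.out.println(index.get("to")); // {1=[1, 5]}
		System.out.println(index.get("be")); // {1=[2, 6]}
	}
}
```

Because documents are processed in order and positions only increase, both the document IDs and the position lists come out sorted, which the intersection step depends on.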

