Positional inverted index: implementing the k-word proximity search algorithm in Java

Last Update:2018-07-27 Source: Internet

Author: User

A positional index is built on top of an inverted index: for each term, the postings list also records the positions at which the term occurs in each document. Positional information is typically stored in a posting in the following form:

Document ID: (position 1, position 2, ...)

A complete postings list that contains positional information is shown in the following illustration:

Take the postings list for to in the figure as an example. The number 993427 is the document frequency of to, meaning that the term appears in 993,427 documents. Inside the outermost brackets, 1, 2, 4, 5, and 7 are the IDs of documents that contain to (only five are listed). Each document ID is followed by the term frequency, which is the number of times to appears in that document. The innermost brackets hold the positions of the term in the document, expressed as offsets from the beginning of the document. This "enhanced" inverted index lets us extend what an inverted-index query can do. Here I implement the k-word proximity search algorithm based on positional information: when searching for two words, you can specify how far apart they may be in the document.
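The structure just described can be modeled in memory roughly as follows. This is a minimal sketch rather than the author's data structure: the class name `PositionalPostings` and its methods are hypothetical, and the offsets in `main` are the two offsets of to in document 4 that remain visible in this article's to/be example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of one term's positional postings list. */
final class PositionalPostings {
    private final String term;
    // docId -> word offsets of the term within that document, in insertion order
    private final Map<Integer, List<Integer>> positionsByDoc = new LinkedHashMap<>();

    PositionalPostings(String term) {
        this.term = term;
    }

    /** Records one occurrence of the term at the given offset in the given document. */
    void addOccurrence(int docId, int position) {
        positionsByDoc.computeIfAbsent(docId, id -> new ArrayList<>()).add(position);
    }

    /** Document frequency: the number of documents that contain the term. */
    int documentFrequency() {
        return positionsByDoc.size();
    }

    /** Term frequency: the number of occurrences of the term in one document. */
    int termFrequency(int docId) {
        return positions(docId).size();
    }

    /** Offsets of the term in one document, or an empty list if the term is absent. */
    List<Integer> positions(int docId) {
        return positionsByDoc.getOrDefault(docId, List.of());
    }

    String term() {
        return term;
    }

    public static void main(String[] args) {
        PositionalPostings to = new PositionalPostings("to");
        to.addOccurrence(4, 429);
        to.addOccurrence(4, 433);
        System.out.println(to.term() + ": df=" + to.documentFrequency()
                + ", tf(4)=" + to.termFrequency(4)
                + ", positions(4)=" + to.positions(4));
    }
}
```

Deriving the term frequency from the length of the position list avoids storing it separately, at the cost of not matching a file format that writes it out explicitly.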

For example, for the query to be or not to be, consider only the phrase to be, in which to and be are adjacent, that is, there are 0 words between them. First find the documents that contain both terms, then check the positions in the postings to see whether an occurrence of be comes immediately after an occurrence of to. Note that the code below compares the difference between the two offsets with k, so adjacent words (offset difference 1) are matched by entering k = 1. One possible match in the postings lists is shown below:

to: 〈...; 4: 〈..., 429, 433〉; ...
be: 〈...; 4: 〈..., 430, 434〉; ...
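The match above can be checked by hand with a few lines of Java. This is only an illustrative sketch: `ToBeExample` and `adjacentPairs` are hypothetical names, and the arrays contain only the offsets that are visible in the example (the elided ones are left out).

```java
import java.util.ArrayList;
import java.util.List;

final class ToBeExample {
    /** Returns "p1,p2" for every pair where the second term directly follows the first. */
    static List<String> adjacentPairs(int[] first, int[] second) {
        List<String> pairs = new ArrayList<>();
        for (int p1 : first) {
            for (int p2 : second) {
                if (p2 - p1 == 1) { // offset difference 1 means no word in between
                    pairs.add(p1 + "," + p2);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        int[] toPositions = {429, 433}; // visible offsets of "to" in document 4
        int[] bePositions = {430, 434}; // visible offsets of "be" in document 4
        System.out.println(adjacentPairs(toPositions, bePositions)); // prints [429,430, 433,434]
    }
}
```

Both occurrences of to in document 4 are followed directly by be, so document 4 matches the phrase to be twice.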

The algorithm in this article comes from the book Introduction to Information Retrieval (Chinese translation by Wang Bin), which gives its pseudocode. I wrote my version in Java after finishing the course. It is a very simple implementation and the code quality is rather poor, so it is not suitable for research or production use; I am posting it only for reference. The code and the index file are given below.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class PositionalIndex {

    /** Finds the postings list of a term in the index file; returns "" if the term is absent. */
    public static String findPostingList(String term) {
        String szPostingList = "";
        // Each line of the index file holds one term followed by its postings
        try (BufferedReader br = new BufferedReader(new FileReader("bin/index.txt"))) {
            String szTermAndPosting;
            while ((szTermAndPosting = br.readLine()) != null) {
                int iSplitIndex = szTermAndPosting.indexOf(','); // the term ends at the first ','
                if (iSplitIndex < 0) {
                    continue; // skip blank or malformed lines
                }
                String szSub = szTermAndPosting.substring(0, iSplitIndex).trim();
                if (term.equals(szSub)) { // does the search term match this entry?
                    // ':' separates "term,document frequency" from the postings
                    iSplitIndex = szTermAndPosting.indexOf(':');
                    szPostingList = szTermAndPosting.substring(iSplitIndex + 1);
                    break;
                }
            }
        } catch (IOException e) {
            System.err.println("Cannot read index file: " + e.getMessage());
        }
        return szPostingList;
    }

    /** Splits a string on the given delimiter, trimming tokens and dropping empty ones. */
    private static List<String> split(String text, String delimiter) {
        List<String> parts = new ArrayList<>();
        StringTokenizer strToken = new StringTokenizer(text, delimiter);
        while (strToken.hasMoreTokens()) {
            String token = strToken.nextToken().trim();
            if (!token.isEmpty()) {
                parts.add(token);
            }
        }
        return parts;
    }

    /**
     * Returns "docID,position1,position2" for every pair of positions of the two
     * terms that lie in the same document and are at most k apart.
     */
    public static List<String> posIntersect(String termAndPList_1, String termAndPList_2, int k) {
        List<String> lAns = new ArrayList<>();
        // ';' separates the postings of different documents
        List<String> lTermAndList_1 = split(termAndPList_1, ";");
        List<String> lTermAndList_2 = split(termAndPList_2, ";");

        int i = 0, j = 0;
        while (i < lTermAndList_1.size() && j < lTermAndList_2.size()) {
            // Assumption: a space separates "docID,term frequency" from the position list
            List<String> record_1 = split(lTermAndList_1.get(i), " ");
            List<String> record_2 = split(lTermAndList_2.get(j), " ");
            String szTerm_1 = record_1.get(0), szPList_1 = record_1.get(1);
            String szTerm_2 = record_2.get(0), szPList_2 = record_2.get(1);
            int iDocID_1 = Integer.parseInt(szTerm_1.substring(0, szTerm_1.indexOf(',')));
            int iDocID_2 = Integer.parseInt(szTerm_2.substring(0, szTerm_2.indexOf(',')));

            if (iDocID_1 == iDocID_2) { // both terms occur in the same document
                List<String> lPosting_1 = split(szPList_1, ",");
                List<String> lPosting_2 = split(szPList_2, ",");
                for (String sp1 : lPosting_1) {
                    int p1 = Integer.parseInt(sp1);
                    for (String sp2 : lPosting_2) {
                        int p2 = Integer.parseInt(sp2);
                        if (Math.abs(p1 - p2) <= k) { // distance between the two terms is within k
                            lAns.add(iDocID_1 + "," + p1 + "," + p2);
                        } else if (p2 > p1) {
                            break; // positions are ascending, so later ones are even farther away
                        }
                    }
                }
                i++;
                j++;
            } else if (iDocID_1 < iDocID_2) {
                i++; // advance the list with the smaller document ID
            } else {
                j++;
            }
        }
        return lAns;
    }

    public static void main(String[] args) {
        // Console input: "term1 term2 k" per line; type "exit" to quit
        try (BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in))) {
            String szReadLine;
            while ((szReadLine = stdin.readLine()) != null && !szReadLine.equals("exit")) {
                StringTokenizer szArgsToken = new StringTokenizer(szReadLine, " ");
                if (szArgsToken.countTokens() < 3) {
                    System.out.println("Usage: term1 term2 k");
                    continue;
                }
                String szTerm_1 = szArgsToken.nextToken();
                String szTerm_2 = szArgsToken.nextToken();
                int k = Integer.parseInt(szArgsToken.nextToken());

                // Look up the postings of each term, then intersect them
                String szTermAndPList_1 = findPostingList(szTerm_1);
                String szTermAndPList_2 = findPostingList(szTerm_2);
                List<String> lAnswer = posIntersect(szTermAndPList_1, szTermAndPList_2, k);

                if (lAnswer.isEmpty()) {
                    System.out.println("Not found...");
                }
                for (String answer : lAnswer) {
                    System.out.println(answer);
                }
            }
        } catch (IOException e) {
            System.err.println("Cannot read from console: " + e.getMessage());
        }
    }
}
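The nested loops in the code above compare the position lists pair by pair. A single pass with a sliding window over the second term's positions gives the same triples, in the spirit of the book's pseudocode, though it is not a transcription of it. The sketch below assumes postings are already parsed and sorted by document ID and by position; `WindowedProximity` and `Posting` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class WindowedProximity {
    /** One posting: a document ID plus the ascending offsets of the term in that document. */
    static final class Posting {
        final int docId;
        final int[] positions;

        Posting(int docId, int... positions) {
            this.docId = docId;
            this.positions = positions;
        }
    }

    /** Returns {docId, pos1, pos2} for every pair of offsets with |pos1 - pos2| <= k. */
    static List<int[]> positionalIntersect(List<Posting> p1, List<Posting> p2, int k) {
        List<int[]> answer = new ArrayList<>();
        int i = 0, j = 0;
        while (i < p1.size() && j < p2.size()) {
            Posting a = p1.get(i);
            Posting b = p2.get(j);
            if (a.docId < b.docId) {
                i++;
            } else if (a.docId > b.docId) {
                j++;
            } else {
                Deque<Integer> window = new ArrayDeque<>(); // offsets of term 2 near pos1
                int q = 0;
                for (int pos1 : a.positions) {
                    // Pull in offsets of term 2 up to pos1 + k
                    while (q < b.positions.length && b.positions[q] <= pos1 + k) {
                        window.addLast(b.positions[q++]);
                    }
                    // Drop offsets that fell more than k behind pos1
                    while (!window.isEmpty() && window.peekFirst() < pos1 - k) {
                        window.removeFirst();
                    }
                    for (int pos2 : window) {
                        answer.add(new int[] {a.docId, pos1, pos2});
                    }
                }
                i++;
                j++;
            }
        }
        return answer;
    }

    public static void main(String[] args) {
        List<Posting> to = List.of(new Posting(4, 429, 433));
        List<Posting> be = List.of(new Posting(4, 430, 434));
        for (int[] hit : positionalIntersect(to, be, 1)) {
            System.out.println(hit[0] + "," + hit[1] + "," + hit[2]);
        }
    }
}
```

Because both position lists are ascending, each offset of the second term enters and leaves the window at most once per document, so the cost per document is linear in the list lengths plus the number of matches.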

The following is a screenshot of the index file index.txt. The separators between the parts of each posting were chosen so that the program can split them easily. The index was built by hand, not generated by a program.
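Since the screenshot is not reproduced here, the exact layout of index.txt is not known. The sketch below assumes a layout inferred from the delimiters the code splits on: `term,documentFrequency:docID,termFrequency positions;...`, with a space before the comma-separated positions. That layout, the sample line, and the names `IndexLineParser`, `parseTerm`, and `parsePostings` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IndexLineParser {
    /** Returns the term, i.e. the text before the first ','. */
    static String parseTerm(String line) {
        return line.substring(0, line.indexOf(',')).trim();
    }

    /** Parses the postings after the first ':' into docId -> offsets. */
    static Map<Integer, List<Integer>> parsePostings(String line) {
        Map<Integer, List<Integer>> result = new LinkedHashMap<>();
        String postings = line.substring(line.indexOf(':') + 1);
        for (String record : postings.split(";")) {
            record = record.trim();
            if (record.isEmpty()) {
                continue;
            }
            // Assumed record layout: "docId,termFrequency pos,pos,..."
            String[] parts = record.split("\\s+", 2);
            if (parts.length < 2) {
                throw new IllegalArgumentException("Malformed posting: " + record);
            }
            int docId = Integer.parseInt(parts[0].substring(0, parts[0].indexOf(',')).trim());
            List<Integer> positions = new ArrayList<>();
            for (String pos : parts[1].split(",")) {
                positions.add(Integer.parseInt(pos.trim()));
            }
            result.put(docId, positions);
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "to,2:1,2 7,18;4,2 429,433"; // hypothetical sample line
        System.out.println(parseTerm(line) + " -> " + parsePostings(line));
    }
}
```

Parsing each line into a map once keeps later lookups free of string splitting, which is the main difference from re-tokenizing the raw postings on every query.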

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
