[Common algorithms] k-d tree and locality-sensitive hashing (LSH) for nearest-neighbor-based algorithms when n is very large (TODO)

Source: Internet
Author: User
Tags rand

Nearest-neighbor-based algorithms are used in many situations.
For example, given 100,000 users, find the most similar user for each one.
When n is very large, such as n = 10^5, this becomes inefficient, because the brute-force approach has O(n^2) time complexity.
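As a baseline, the brute-force approach can be sketched as follows. This is a minimal illustration rather than the author's code: `Point2`, `dist2d`, and `bruteForceNearest` are hypothetical names, and 2-D points with Euclidean distance are assumed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical 2-D point type used only for this illustration.
struct Point2 {
    double x, y;
};

// Euclidean distance between two points.
double dist2d(const Point2& a, const Point2& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// For every point, find the index of its nearest other point by scanning
// all points: n scans of n points each, hence O(n^2) distance computations.
std::vector<std::size_t> bruteForceNearest(const std::vector<Point2>& pts) {
    assert(pts.size() >= 2);  // each point needs at least one other point
    std::vector<std::size_t> nearest(pts.size());
    for (std::size_t i = 0; i < pts.size(); ++i) {
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t j = 0; j < pts.size(); ++j) {
            if (j == i) continue;  // skip the point itself
            const double d = dist2d(pts[i], pts[j]);
            if (d < best) {
                best = d;
                nearest[i] = j;
            }
        }
    }
    return nearest;
}
```

With n = 10^5 the two nested loops perform about 10^10 distance computations, which is why the tree and hashing approaches below are worth the extra machinery.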


This calls for special techniques. Two are commonly used: the k-d tree (and the related ball tree), and locality-sensitive hashing (an approximate algorithm whose result is correct only with a certain confidence).
k-d tree: O(n log n) on average to build the tree and query all n points
Locality-sensitive hashing (LSH): depends on the bucket size


# k-dimensional tree (k-d tree), https://en.wikipedia.org/wiki/K-d_tree
Build a binary tree from the original samples.

The layer at depth deep splits the sample space on feature deep % p, where p is the number of features; repeating this yields a binary tree. A lookup that follows the pruning rule described below takes O(log n) time on average (as with most tree structures, the cost is basically log n).
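The construction described above could look roughly like the sketch below. It is not the implementation given later in this post: `KdNode`, `buildKdTree`, `kdHeight`, and the fixed dimension `P = 2` are assumptions made for illustration, each node stores a single point, and `std::nth_element` is used to find the median.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

constexpr std::size_t P = 2;  // number of features (assumed 2-D here)
using Pt = std::array<double, P>;

// Hypothetical node type: one point per node plus two children.
struct KdNode {
    Pt point;
    std::unique_ptr<KdNode> left, right;
};

// Build a subtree from pts[lo, hi). The node at depth `depth` splits on
// feature depth % P; the median along that feature becomes the node.
std::unique_ptr<KdNode> buildKdTree(std::vector<Pt>& pts, std::size_t lo,
                                    std::size_t hi, std::size_t depth) {
    assert(hi <= pts.size());
    if (lo >= hi) return nullptr;
    const std::size_t axis = depth % P;
    const std::size_t mid = lo + (hi - lo) / 2;
    // Partially sort so that pts[mid] is the median along `axis`, with
    // smaller-or-equal values before it and larger-or-equal values after it.
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Pt& a, const Pt& b) { return a[axis] < b[axis]; });
    auto node = std::make_unique<KdNode>();
    node->point = pts[mid];
    node->left = buildKdTree(pts, lo, mid, depth + 1);
    node->right = buildKdTree(pts, mid + 1, hi, depth + 1);
    return node;
}

// Height of the tree, to check that median splits keep it balanced.
std::size_t kdHeight(const KdNode* n) {
    return n ? 1 + std::max(kdHeight(n->left.get()), kdHeight(n->right.get())) : 0;
}
```

Calling `buildKdTree(pts, 0, pts.size(), 0)` on n points gives a tree of height about log2(n), which is where the logarithmic lookup cost comes from.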






Todo

I did not understand this well at first, so it took me a long time to write.


The core idea: if the distance from the target point to the splitting axis is >= the current minimum distance, a closer point cannot lie in the other half, so that half can be pruned.
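The pruning test itself is tiny. A minimal sketch, assuming a hypothetical helper named `canPruneFarSide` that receives only the coordinates on the current split axis:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: returns true when the far side of a split can be skipped.
//   targetCoord: the target point's coordinate on the split axis
//   splitCoord:  the splitting value of the current node on that axis
//   best:        smallest distance found so far
bool canPruneFarSide(double targetCoord, double splitCoord, double best) {
    assert(best >= 0.0);
    // Every point on the far side is at least this far from the target.
    const double distToSplit = std::fabs(targetCoord - splitCoord);
    return distToSplit >= best;
}
```

For example, with a target x-coordinate of 6.5 and a split at x = 7, the far side is 0.5 away: it must be searched while the best distance is 1.8, and can be skipped once the best distance drops to 0.3.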



Example:

Points: {{7,7},{3,4},{5,3},{1,9},{8,3},{8,2},{10,10}}

Target point: (6.5, 1)


Lookup process:
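One way to trace the lookup for this example is the sketch below. It is an illustration under assumptions, not the implementation from this post: `ExNode`, `exBuild`, `exQuery`, `exNearest`, and `ExResult` are hypothetical names, each node holds a single point, and the median is chosen with `std::nth_element`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <memory>
#include <vector>

using Vec2 = std::array<double, 2>;

struct ExNode {  // hypothetical node type for this example
    Vec2 p;
    std::unique_ptr<ExNode> l, r;
};

double exDist(const Vec2& a, const Vec2& b) {
    return std::hypot(a[0] - b[0], a[1] - b[1]);
}

// Median-split construction over pts[lo, hi), alternating x and y by depth.
std::unique_ptr<ExNode> exBuild(std::vector<Vec2>& pts, std::size_t lo,
                                std::size_t hi, int depth) {
    if (lo >= hi) return nullptr;
    const int axis = depth % 2;
    const std::size_t mid = lo + (hi - lo) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Vec2& a, const Vec2& b) { return a[axis] < b[axis]; });
    auto n = std::make_unique<ExNode>();
    n->p = pts[mid];
    n->l = exBuild(pts, lo, mid, depth + 1);
    n->r = exBuild(pts, mid + 1, hi, depth + 1);
    return n;
}

// Nearest-neighbor search. `best` and `bestPt` carry the current optimum;
// `visited` counts the nodes whose distance was actually computed.
void exQuery(const ExNode* n, const Vec2& target, int depth,
             double& best, Vec2& bestPt, int& visited) {
    if (!n) return;
    ++visited;
    const double d = exDist(n->p, target);
    if (d < best) { best = d; bestPt = n->p; }
    const int axis = depth % 2;
    const double diff = target[axis] - n->p[axis];
    const ExNode* nearSide = diff <= 0 ? n->l.get() : n->r.get();
    const ExNode* farSide = diff <= 0 ? n->r.get() : n->l.get();
    exQuery(nearSide, target, depth + 1, best, bestPt, visited);
    // Visit the far side only if the splitting line is closer than `best`.
    if (std::fabs(diff) < best)
        exQuery(farSide, target, depth + 1, best, bestPt, visited);
}

struct ExResult { Vec2 point; double dist; int visited; };

ExResult exNearest(std::vector<Vec2> pts, const Vec2& target) {
    assert(!pts.empty());
    auto root = exBuild(pts, 0, pts.size(), 0);
    ExResult r{{0.0, 0.0}, std::numeric_limits<double>::infinity(), 0};
    exQuery(root.get(), target, 0, r.dist, r.point, r.visited);
    return r;
}
```

With the seven points above and the target (6.5, 1), this sketch visits (7,7), (3,4), (5,3), (8,3), and (8,2) in that order and returns (8,2) at distance sqrt(3.25), about 1.80. The subtrees holding (1,9) and (10,10) are pruned because their splitting lines are 3 and 2 away, which is no closer than the best distance at that moment.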



k-d tree code:

Pseudocode:

// Parameters: root node pointer, all points under the current node, depth (axis)
void insert(Node* &root, vector<point> xlist, int deep) {
    // If the current node is empty, create a new node: its point list plus null left and right child pointers
    // Find the median of xlist along axis deep
    // Partition xlist: < median goes left, == median stays in the current node, > median goes right
    // If the left (right) set is not empty, recursively insert in that direction
}

// Parameters: root node pointer, target point, current best distance, depth (axis)
float query(Node* root, point p, float best, int deep) {
    // Three parts of the recursion: terminate, recurse downward, maintain on the way up

    // (1) Termination
    // If the current node is empty, return infinity
    // If both children are empty (a leaf node), compute the distance and return the updated distance

    // (2) Recurse downward, treating the result of query as already known
    // Based on the split on axis deep, recurse left if <, right if >

    // (3) Use the recursive result to process the current layer (maintenance on the way up, backtracking)
    // Compute the distance between the target point and the current node
    // Check whether the distance from the target point to the current splitting line is <= the current minimum distance
    // If it is, recurse into the other child of the current node
    // Otherwise do not expand the other child, because no smaller distance can exist on that side
    // ************ pruning happens here ************
    // Take the minimum of the current distance, the left-subtree result, and the right-subtree result
    // Return the minimum distance
}

Implementation:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <algorithm>
#include <vector>
using namespace std;

#define MAXDIST 1e30f  // sentinel meaning "no distance found yet"
int countkdt = 0;      // number of distance computations during queries

struct point {
    float x[2];
};

struct Node {
    // All points whose split coordinate equals this node's median.
    // Node is allocated with malloc, which runs no constructors, so the
    // vector is held through a pointer and created with new.
    vector<point>* xlist;
    Node* l;
    Node* r;
};

// Compare by x
bool cmp0(const point& p1, const point& p2) { return p1.x[0] < p2.x[0]; }

// Compare by y
bool cmp1(const point& p1, const point& p2) { return p1.x[1] < p2.x[1]; }

// Distance between two points. Identical points return MAXDIST so that a
// point is never reported as its own nearest neighbor.
float getdist(const point& p1, const point& p2) {
    if (p1.x[0] == p2.x[0] && p1.x[1] == p2.x[1]) return MAXDIST;
    float dx = p1.x[0] - p2.x[0];
    float dy = p1.x[1] - p2.x[1];
    return sqrt(dx * dx + dy * dy);
}

// Quickselect: k-th smallest coordinate on axis deep within a[l..r].
// Works on a copy because it moves only the x[deep] values.
float getmedian(vector<point> a, int l, int r, int k, int deep) {
    if (l == r && k == 0) return a[l].x[deep];
    int pl = l;
    int pr = r;
    float tmp = a[l].x[deep];
    while (pl < pr) {
        while (pl < pr && a[pr].x[deep] > tmp) pr--;
        if (pl >= pr) break;
        a[pl++].x[deep] = a[pr].x[deep];
        while (pl < pr && a[pl].x[deep] < tmp) pl++;
        if (pl >= pr) break;
        a[pr--].x[deep] = a[pl].x[deep];
    }
    a[pl].x[deep] = tmp;
    if (pl - l == k) return tmp;
    if (pl - l > k) {
        return getmedian(a, l, pl - 1, k, deep);
    } else {
        return getmedian(a, pl + 1, r, k - (pl - l + 1), deep);
    }
}

// Build the k-d tree
void insert(Node*& root, vector<point> xlist, int deep) {
    size_t i;
    int mid = xlist.size() >> 1;
    if (root == NULL) {
        root = (Node*)malloc(sizeof(Node));
        root->xlist = NULL;
        root->l = NULL;
        root->r = NULL;
    }
    vector<point> cur;
    vector<point> left;
    vector<point> right;
    float median;
    // Get the median by sorting on the current axis
    if (deep == 0) {
        sort(xlist.begin(), xlist.end(), cmp0);
    } else if (deep == 1) {
        sort(xlist.begin(), xlist.end(), cmp1);
    }
    median = xlist[mid].x[deep];
    // Alternative based on the quicksort partition idea:
    // median = getmedian(xlist, 0, xlist.size() - 1, mid, deep);
    for (i = 0; i < xlist.size(); i++) {
        if (xlist[i].x[deep] == median) {
            cur.push_back(xlist[i]);
        } else if (xlist[i].x[deep] < median) {
            left.push_back(xlist[i]);
        } else {
            right.push_back(xlist[i]);
        }
    }
    // malloc(sizeof(vector<point>)) would not construct the vector, so use new
    root->xlist = new vector<point>(cur);
    if (left.size() > 0) insert(root->l, left, (deep + 1) % 2);
    if (right.size() > 0) insert(root->r, right, (deep + 1) % 2);
}

// Print the tree (in-order)
void showtree(Node* root) {
    if (root == NULL) return;
    printf("\nL:");
    showtree(root->l);
    printf("\nM:");
    for (size_t i = 0; i < root->xlist->size(); i++) {
        printf("%.2f,%.2f\n", (*root->xlist)[i].x[0], (*root->xlist)[i].x[1]);
    }
    printf("\nR:");
    showtree(root->r);
}

// Find the smallest distance from p to any point in the tree
float query(Node* root, point p, float best, int deep) {
    // An empty subtree cannot improve the result; keep the current best
    if (root == NULL) return best;
    size_t i;
    float dist;
    float split = (*root->xlist)[0].x[deep];
    // Leaf node: compare against the points stored here and return
    if (root->l == NULL && root->r == NULL) {
        for (i = 0; i < root->xlist->size(); i++) {
            countkdt++;
            dist = getdist((*root->xlist)[i], p);
            best = dist < best ? dist : best;
        }
        return best;
    }
    // Left or right: descend into the side that contains p
    if (p.x[deep] <= split) {
        best = query(root->l, p, best, (deep + 1) % 2);
    } else {
        best = query(root->r, p, best, (deep + 1) % 2);
    }
    // Points stored at the current node
    for (i = 0; i < root->xlist->size(); i++) {
        countkdt++;
        dist = getdist((*root->xlist)[i], p);
        best = dist < best ? dist : best;
    }
    // The other side is searched only when the splitting line is no farther
    // away than the current best; otherwise it is pruned
    if (best >= fabs(p.x[deep] - split)) {
        float distanother = MAXDIST;
        if (p.x[deep] <= split) {
            distanother = query(root->r, p, best, (deep + 1) % 2);
        } else {
            distanother = query(root->l, p, best, (deep + 1) % 2);
        }
        if (distanother < best) best = distanother;
    }
    return best;
}

float a[][2] = {{7,7},{3,4},{5,3},{1,9},{8,3},{8,2},{10,10}};  // p = (6.5, 1)
// float a[][2] = {{2,3}, {5,4}, {9,6}, {4,7}, {8,1}, {7,2}};

int main() {
    int i, n;
    n = 200000;
    // Build the k-d tree
    Node* root = NULL;
    vector<point> xlist;
    for (i = 0; i < n; i++) {
        point p;
        p.x[0] = rand() % n;
        p.x[1] = rand() % n;
        // p.x[0] = a[i][0];
        // p.x[1] = a[i][1];
        xlist.push_back(p);
    }
    printf("\n");
    clock_t t1 = clock();
    insert(root, xlist, 0);
    clock_t t2 = clock();
    printf("Build KDT time = %ld\n", (long)(t2 - t1));
    // showtree(root);

    // k-d tree search: nearest distance for every point
    point p;
    p.x[0] = 7;
    p.x[1] = 7;
    float best = MAXDIST;
    float ans = MAXDIST;
    int deep = 0;
    t1 = clock();
    for (i = 0; i < (int)xlist.size(); i++) {
        p = xlist[i];
        best = query(root, p, MAXDIST, deep);
        ans = ans < best ? ans : best;
    }
    t2 = clock();
    printf("Kdtree best = %f\n", best);
    printf("countkdt = %d\n", countkdt);
    printf("KDT time = %ld\n", (long)(t2 - t1));

    // Brute force
    /*
    t1 = clock();
    float best2 = MAXDIST;
    long long count2 = 0;
    for (int j = 0; j < n; j++) {
        p = xlist[j];
        best2 = MAXDIST;
        for (i = 0; i < n; i++) {
            count2++;
            float dist2 = getdist(p, xlist[i]);
            if (dist2 < best2) best2 = dist2;
        }
    }
    printf("O(n^2): best2 = %f\n", best2);
    t2 = clock();
    printf("O(n^2) time = %ld\n", (long)(t2 - t1));
    printf("%lld\n", count2);
    */
    return 0;
}

/*
n = 10^4, nearest distance from each point to any other point
k-d tree: O(n log n)
    KDT build time = 102 ms
    KDT time = 25 ms
    exe count = O(n log n) = 24 * 10^4
Brute force: O(n^2)
    time = 4351
    exe count = O(n^2) = 10^8
*/



# Locality-sensitive hashing (LSH), https://en.wikipedia.org/wiki/Locality-sensitive_hashing
Locality-sensitive hashing is essentially a bucketing method. The core idea is that the more similar two samples are, the more likely they are to fall into the same bucket.
It has two requirements:
If similarity(x1, x2) >= sim1, then x1 and x2 fall into the same bucket with probability >= p1 (large, for example 0.95).
If similarity(x1, x2) <= sim2 (with sim2 < sim1), then x1 and x2 fall into the same bucket with probability <= p2 (small, for example 0.05).


This is the opposite of how hash functions are normally used: an ordinary hash function tries to spread samples across buckets as evenly as possible, whereas the hash here tries to put similar samples into the same bucket.
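To make the bucketing idea concrete, here is a minimal sketch of one well-known LSH family, random hyperplanes for cosine similarity. The post does not say which family it has in mind, so this choice is an assumption, and `HyperplaneLsh` and `groupByBucket` are hypothetical names. Each hyperplane contributes one bit of the bucket id, and two vectors at angle theta agree on a given bit with probability 1 - theta/pi.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

// Hypothetical random-hyperplane hasher. Each of the `bits` hyperplanes
// contributes one bit: which side of the hyperplane the vector lies on.
class HyperplaneLsh {
public:
    HyperplaneLsh(std::size_t dim, std::size_t bits, unsigned seed)
        : planes_(bits, std::vector<double>(dim)) {
        assert(bits > 0 && bits <= 64);
        std::mt19937 gen(seed);
        std::normal_distribution<double> gauss(0.0, 1.0);
        for (auto& plane : planes_)
            for (auto& c : plane) c = gauss(gen);
    }

    // Bucket id: vectors at a small angle agree on most bits, so they are
    // likely to share a bucket; vectors at a large angle rarely do.
    std::uint64_t bucket(const std::vector<double>& v) const {
        std::uint64_t key = 0;
        for (std::size_t b = 0; b < planes_.size(); ++b) {
            assert(v.size() == planes_[b].size());
            double dot = 0.0;
            for (std::size_t i = 0; i < v.size(); ++i) dot += planes_[b][i] * v[i];
            if (dot > 0.0) key |= (std::uint64_t{1} << b);
        }
        return key;
    }

private:
    std::vector<std::vector<double>> planes_;
};

// Group sample indices by bucket; only samples that share a bucket are
// then compared exactly.
std::unordered_map<std::uint64_t, std::vector<std::size_t>>
groupByBucket(const HyperplaneLsh& lsh, const std::vector<std::vector<double>>& xs) {
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> buckets;
    for (std::size_t i = 0; i < xs.size(); ++i)
        buckets[lsh.bucket(xs[i])].push_back(i);
    return buckets;
}
```

Using more bits makes buckets smaller and lowers the chance that dissimilar samples collide (p2), but it also lowers the chance that similar samples collide (p1). That trade-off is why the cost of LSH depends on the bucket size.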

Todo