[E2LSH source code analysis] E2LSH source code overview and main data structures

Source: Internet
Author: User

The previous post introduced the basic principle of LSH based on p-stable distributions (http://blog.csdn.net/jasonding1354/article/details/38237353). In this and the following posts I annotate the open-source E2LSH code, to lay a foundation for understanding how LSH works and for further study of similarity search.


1. Code Overview

The E2LSH core code can be divided into three parts:

  • LocalitySensitiveHashing.cpp -- Mainly contains the LSH-based RNN (R-near neighbor) data structure. Its main functions build the data structure from a set of parameters and query it for data objects;
  • BucketHashing.cpp -- Contains the general-purpose hash tables used for the hash buckets. Its main functions build a hash table, add a bucket to the table, and look up a bucket;
  • SelfTuning.cpp -- Contains the functions that compute the optimal parameters of the RNN data structure.

Other files:

  • Geometry.h -- Contains the definition of a data point (the PPointT data type);
  • NearNeighbors.cpp, NearNeighbors.h -- Contain the function interface to the E2LSH core code;
  • Random.cpp, Random.h -- Contain the pseudo-random number generators;
  • BasicDefinitions.h -- General type definitions and macro definitions;
  • Utils.cpp, Utils.h -- Contain some common utility functions (such as copying vectors).

2. Main Data Structures

(1) RNearNeighborStructT (defined in LocalitySensitiveHashing.h) -- the R-near neighbor data structure. This structure contains the parameters used to build the data structure, the description of the hash functions g_i, the index of the data points held in the structure, and pointers to the L hash tables that store the hash buckets.

typedef struct _RNearNeighborStructT {
  IntT dimension; // dimension of points.
  IntT parameterK; // parameter K of the algorithm.
  IntT parameterL; // parameter L of the algorithm.
  RealT parameterW; // parameter W of the algorithm.
  IntT parameterT; // parameter T of the algorithm.
  RealT parameterR; // parameter R of the algorithm.
  RealT parameterR2; // = parameterR^2

  // Whether to use <u> hash functions instead of usual <g>
  // functions. When this flag is set to TRUE, <u> functions are
  // generated (which are roughly k/2-tuples of LSH), and a <g>
  // function is a pair of 2 different <u> functions.
  BooleanT useUfunctions;

  // the number of tuples of hash functions used (= # of rows of
  // <lshFunctions>). When useUfunctions == FALSE, this field is equal
  // to parameterL, otherwise, to <m>, the number of <u> hash
  // functions (in this case, parameterL = m*(m-1)/2 = nHFTuples*(nHFTuples-1)/2
  IntT nHFTuples;

  // How many LSH functions each of the tuple has (it is <k> when
  // useUfunctions == FALSE, and <k/2> when useUfunctions == TRUE).
  IntT hfTuplesLength;

  // number of points in the data set
  Int32T nPoints;

  // The array of pointers to the points that are contained in the
  // structure. Some types of this structure (of UHashStructureT,
  // actually) use indeces in this array to refer to points (as
  // opposed to using pointers).
  PPointT *points;

  // The size of the array <points>
  Int32T pointsArraySize;

  // If <reportingResult> == FALSE, no points are reported back in a
  // <get*> function. In particular any point that is found in the
  // bucket is considered to be outside the R-ball of the query point
  // (the distance is still computed).  If <reportingResult> == TRUE,
  // then the structure behaves normally.
  BooleanT reportingResult;

  // This table stores the LSH functions. There are <nHFTuples> rows
  // of
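To make the roles of parameterW, hfTuplesLength and the table of LSH functions more concrete, here is a minimal sketch of how one row of that table could be evaluated on a point. It assumes the standard p-stable form h(v) = floor((a·v + b) / w) from Datar et al.; the names LshFunction, hashOne and hashTuple are hypothetical and are not taken from the E2LSH source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one p-stable LSH function
// h(v) = floor((a . v + b) / w). In the scheme of Datar et al.,
// a has i.i.d. Gaussian entries and b is uniform in [0, w).
struct LshFunction {
    std::vector<double> a;  // projection vector, one entry per dimension
    double b;               // random offset
};

// Evaluate a single hash function on point v with bucket width w
// (w plays the role of parameterW).
long hashOne(const LshFunction& h, const std::vector<double>& v, double w) {
    assert(h.a.size() == v.size() && w > 0.0);
    double dot = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) dot += h.a[i] * v[i];
    return static_cast<long>(std::floor((dot + h.b) / w));
}

// Evaluate one row of the function table: a tuple of hfTuplesLength
// functions, giving one integer per function.
std::vector<long> hashTuple(const std::vector<LshFunction>& row,
                            const std::vector<double>& v, double w) {
    std::vector<long> key;
    key.reserve(row.size());
    for (const LshFunction& h : row) key.push_back(hashOne(h, v, w));
    return key;
}
```

Calling hashTuple once per row would yield nHFTuples integer vectors for a point; how E2LSH itself stores and combines them is not shown in the listing above, so this part is only an illustration.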

(2) UHashStructureT (defined in BucketHashing.h) -- this structure defines the hash table used to map the hash buckets. Collisions are resolved by chaining.

There are two main types of hash table: HT_LINKED_LIST and HT_HYBRID_CHAINS.

HT_LINKED_LIST corresponds to the linked-list version of the hash table; HT_HYBRID_CHAINS corresponds to the hash table with hybrid storage.

Both kinds of hash table hold pointers to the hash functions h1(·) and h2(·).

typedef struct _UHashStructureT {
  // The type of the hash table (can take values HT_*). when
  // <typeHT>=HT_LINKED_LIST, chains&buckets are linked lists. when
  // <typeHT>=HT_PACKED, chains&buckets are static arrays. when
  // <typeHT>=HT_STATISTICS, chains are static arrays and buckets only
  // count # of elements.  when <typeHT>=HT_HYBRID_CHAINS, a chain is
  // a "hybrid" array that contains both the buckets and the points
  // (the an element of the chain array is of type
  // <HybridChainEntryT>). all chains are conglamerated in the same
  // array
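The chaining idea can be illustrated with a small sketch. It assumes that h1 selects the chain (table slot) and that h2 acts as a fingerprint distinguishing the buckets that share a chain; that division of labour is my reading of the E2LSH design rather than something stated in the listing above. ChainedBucketTable and Bucket are hypothetical names, and std::vector stands in for the raw linked lists of the HT_LINKED_LIST variant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: h1 picks a chain, h2 is a fingerprint that tells
// apart different buckets that landed in the same chain.
struct Bucket {
    std::uint32_t fingerprint;      // value of h2 for this bucket
    std::vector<int> pointIndices;  // indices of points in the data set
};

class ChainedBucketTable {
public:
    explicit ChainedBucketTable(std::size_t tableSize) : chains_(tableSize) {
        assert(tableSize > 0);
    }

    // Add a point index to the bucket identified by (h1, h2),
    // creating the bucket if it does not exist yet.
    void addPoint(std::uint32_t h1, std::uint32_t h2, int pointIndex) {
        std::vector<Bucket>& chain = chains_[h1 % chains_.size()];
        for (Bucket& b : chain) {
            if (b.fingerprint == h2) {
                b.pointIndices.push_back(pointIndex);
                return;
            }
        }
        chain.push_back(Bucket{h2, {pointIndex}});
    }

    // Return the bucket for (h1, h2), or nullptr if there is none.
    const Bucket* findBucket(std::uint32_t h1, std::uint32_t h2) const {
        const std::vector<Bucket>& chain = chains_[h1 % chains_.size()];
        for (const Bucket& b : chain) {
            if (b.fingerprint == h2) return &b;
        }
        return nullptr;
    }

private:
    std::vector<std::vector<Bucket>> chains_;  // one chain per table slot
};
```

With a table of size 4, the keys (1, 100) and (5, 200) fall into the same chain but remain separate buckets because their h2 values differ. The sketch stores point indices rather than pointers, mirroring the remark in RNearNeighborStructT that some hash table types refer to points by index.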

(3) RNNParametersT (defined in LocalitySensitiveHashing.h) -- a struct holding the parameters needed to construct the RNearNeighborStructT data structure.

typedef struct _RNNParametersT {
  RealT parameterR; // parameter R of the algorithm.
  RealT successProbability; // the success probability 1-\delta
  IntT dimension; // dimension of points.
  RealT parameterR2; // = parameterR^2

  // Whether to use <u> hash functions instead of usual <g>
  // functions. When this flag is set to TRUE, <u> functions are
  // generated (which are roughly k/2-tuples of LSH), and a <g>
  // function is a pair of 2 different <u> functions.
  BooleanT useUfunctions;

  IntT parameterK; // parameter K of the algorithm.

  // parameter M (# of independent tuples of LSH functions)
  // if useUfunctions==TRUE, parameterL = parameterM * (parameterM - 1) / 2
  // if useUfunctions==FALSE, parameterL = parameterM
  IntT parameterM;

  IntT parameterL; // parameter L of the algorithm.
  RealT parameterW; // parameter W of the algorithm.
  IntT parameterT; // parameter T of the algorithm.

  // The type of the hash table used for storing the buckets (of the
  // same <g> function).
  IntT typeHT;
} RNNParametersT, *PRNNParametersT;
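The comments in the two structs describe how several fields depend on one another. The sketch below collects those relations in one place: parameterR2 = R^2, L = M or M(M-1)/2, and the tuple layout (nHFTuples, hfTuplesLength). DerivedParams and deriveParams are hypothetical names; the requirement that K be even when u functions are used is my assumption, since the source only says "roughly k/2".

```cpp
#include <cassert>

// Hypothetical helper type; only the field names mirror the E2LSH structs.
struct DerivedParams {
    int parameterL;      // number of g functions (one hash table each)
    int nHFTuples;       // number of rows in the LSH function table
    int hfTuplesLength;  // number of LSH functions per row
    double parameterR2;  // R squared
};

DerivedParams deriveParams(double parameterR, int parameterK, int parameterM,
                           bool useUfunctions) {
    assert(parameterK > 0 && parameterM > 0);
    DerivedParams d;
    d.parameterR2 = parameterR * parameterR;
    d.nHFTuples = parameterM;  // M independent tuples in both modes
    if (useUfunctions) {
        // Assumption: K is even so that each u function has exactly K/2 parts.
        assert(parameterK % 2 == 0);
        // Each g is a pair of two different u functions: L = M choose 2.
        d.parameterL = parameterM * (parameterM - 1) / 2;
        d.hfTuplesLength = parameterK / 2;
    } else {
        // Each g is its own independent K-tuple: L = M.
        d.parameterL = parameterM;
        d.hfTuplesLength = parameterK;
    }
    return d;
}
```

For example, with K = 10 and M = 5, using u functions gives L = 10 tables from only 5 stored tuples of length 5, whereas the plain mode gives L = 5 tables from 5 tuples of length 10.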

(4) PPointT (a pointer to PointT, defined in Geometry.h) -- the struct used to store a data point. It contains the point's coordinates, the square of its norm, and the index of the point in the data set.

typedef struct _PointT {
  IntT index; // the index of this point in the dataset list of points
  RealT *coordinates;
  RealT sqrLength; // the square of the length of the vector
} PointT, *PPointT;
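One plausible reason for caching sqrLength is that the squared distance between two points can then be obtained from a single dot product, via ||p - q||^2 = ||p||^2 + ||q||^2 - 2 p·q, and compared directly against parameterR2. The sketch below illustrates this; whether E2LSH computes distances exactly this way is an assumption, and Point, makePoint, squaredDistance and isWithinR are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical C++ analogue of PointT; the original uses a raw RealT* array.
struct Point {
    int index;                        // position in the data set
    std::vector<double> coordinates;  // the point itself
    double sqrLength;                 // cached squared Euclidean norm
};

// Build a point and cache its squared norm once.
Point makePoint(int index, const std::vector<double>& coords) {
    double sq = 0.0;
    for (double c : coords) sq += c * c;
    return Point{index, coords, sq};
}

// ||p - q||^2 = ||p||^2 + ||q||^2 - 2 p.q, reusing the cached norms.
double squaredDistance(const Point& p, const Point& q) {
    assert(p.coordinates.size() == q.coordinates.size());
    double dot = 0.0;
    for (std::size_t i = 0; i < p.coordinates.size(); ++i)
        dot += p.coordinates[i] * q.coordinates[i];
    return p.sqrLength + q.sqrLength - 2.0 * dot;
}

// True if q lies within distance R of p; the comparison uses R^2
// (the role of parameterR2), so no square root is needed.
bool isWithinR(const Point& p, const Point& q, double parameterR2) {
    return squaredDistance(p, q) <= parameterR2;
}
```

For the points (0, 0) and (3, 4), squaredDistance returns 25, so the pair counts as R-near for R = 5 but not for any smaller radius.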

When reprinting, please credit the author and cite the source: http://blog.csdn.net/jasonding1354/article/details/38331229

E2lsh Source: http://download.csdn.net/detail/jasonding1354/7704277

References:

1. M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," Proc. Symp. Computational Geometry, 2004.

2. A. Andoni and P. Indyk, "E2LSH: Exact Euclidean Locality-Sensitive Hashing," http://web.mit.edu/andoni/www/LSH/, 2004.

