1. Introduction
Traditional hashing methods such as LSH, SSH and PCA-ITQ all partition the data points with hyperplanes. In a $d$-dimensional space, however, at least $d+1$ hyperplanes are needed to enclose a closed, compact region. Spherical hashing partitions the data with hyperspheres instead, and in any dimension a single hypersphere is enough to form a closed region. With spherical hashing, the average of the maximum distance between samples in the same region is smaller, which indicates that the samples in each region are more tightly clustered. This matches the notion of a neighborhood more closely and makes the method better suited to similarity search.
2. Binary Code Embedding Function
The spherical hashing function family is $H(x) = (h_1(x), h_2(x), h_3(x), \ldots, h_l(x))$, where $l$ is the number of bits in a hash code. Each hash function corresponds to a hypersphere that splits the space into two parts, the inside and the outside of the sphere. The hash function is defined as follows:
$$h_k(x) = \begin{cases} -1, & d(p_k, x) > t_k \\ +1, & d(p_k, x) \le t_k \end{cases}$$
Here $p_k$ and $t_k$ are the center and radius of the $k$-th sphere, and $d(p_k, x)$ is the Euclidean distance between the point $x$ and the center $p_k$. If the distance from the point to the center is greater than the radius, the bit is encoded as $-1$; otherwise it is encoded as $+1$.
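The encoding rule above can be sketched in a few lines of NumPy. This is only an illustration: the function name `spherical_hash` and the toy centers, radii and points are made up for the example and do not come from the paper.

```python
import numpy as np

def spherical_hash(X, centers, radii):
    """Encode each row of X as +1/-1 bits, one bit per hypersphere."""
    # Euclidean distance from every point to every center, shape (n_points, n_spheres).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # +1 when the point is inside or on the sphere, -1 when it is outside.
    return np.where(dists <= radii[None, :], 1, -1)

# Toy data (hypothetical): two circles in the plane and four points.
centers = np.array([[0.0, 0.0], [2.0, 0.0]])
radii = np.array([1.5, 1.5])
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [5.0, 5.0]])
codes = spherical_hash(X, centers, radii)
print(codes)
```

The point `(1, 0)` lies inside both circles and gets the code `(+1, +1)`, while `(5, 5)` lies outside both and gets `(-1, -1)`.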
To compare how compact hyperplane-based regions and hypersphere-based regions are, the paper runs the following two experiments:
In the left plot, the y-axis is the average of the maximum distance between data points that share the same hash code, and the x-axis is the code length. The results show that regions bounded by hyperspheres are more compact, so the original data can be encoded well with shorter codes. In the right plot, the y-axis is the maximum distance in the original space between the corresponding data points, and the x-axis is the number of bit positions at which two points are both $+1$. Besides confirming the compactness of spherical regions, this experiment also shows that the more $+1$ bits two data points have in common, the closer (more similar) they are.
An intuitive reading of the right-hand result: suppose a class A is characterized by three features a, b and c. If $x_1$ and $x_2$ both have all three features (the corresponding bits are all $+1$), we can be fairly sure that both belong to class A. If neither $x_1$ nor $x_2$ has these features (the corresponding bits are all $-1$), we can conclude only that neither belongs to class A, not that the two belong to the same class.
3. Distance between Binary Codes
Traditional hashing methods use the Hamming distance to measure the distance between data points, but the Hamming distance does not capture the compactness of the regions well. Spherical hashing therefore uses a new distance measure, the spherical Hamming distance (SHD):
$$d_{shd}(b_i, b_j) = \frac{|b_i \oplus b_j|}{|b_i \wedge b_j|}$$
Here the numerator is the number of bits that differ between the two codes $b_i$ and $b_j$, and the denominator is the number of bit positions at which both codes are $+1$. The more $+1$ bits two data points have in common, the smaller their SHD, and vice versa, which reflects the compactness of hypersphere-based regions.
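As a small illustration of this ratio, the sketch below computes the SHD between two codes stored as arrays of $+1$/$-1$. The function name is hypothetical, and the handling of a zero denominator (no common $+1$ bit) is an assumption of this sketch, not something the text specifies.

```python
import numpy as np

def spherical_hamming_distance(b_i, b_j):
    """Differing bits divided by bits that are +1 in both codes."""
    b_i = np.asarray(b_i)
    b_j = np.asarray(b_j)
    differing = int(np.sum(b_i != b_j))
    common_plus = int(np.sum((b_i == 1) & (b_j == 1)))
    if common_plus == 0:
        # Assumption of this sketch: identical codes are at distance 0,
        # otherwise codes with no common +1 bit are treated as infinitely far.
        return 0.0 if differing == 0 else float("inf")
    return differing / common_plus

a = [1, 1, -1, 1]
b = [1, -1, -1, 1]
c = [-1, -1, 1, -1]
print(spherical_hamming_distance(a, b))  # one differing bit, two common +1 bits
print(spherical_hamming_distance(a, c))  # no common +1 bit
```

A practical implementation would more likely pack the codes into machine words and use bitwise XOR, AND and population counts; the array version here is chosen only for readability.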
4. Independence between Hashing Functions
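Spherical hashing also imposes balance and independence constraints on the hash functions.
Balance: each bit is $+1$ for half of the data,
$$\Pr[h_k(x) = +1] = \frac{1}{2}, \quad 1 \le k \le l$$
Independence: any two bits are pairwise independent,
$$\Pr[h_i(x) = +1,\ h_j(x) = +1] = \Pr[h_i(x) = +1] \cdot \Pr[h_j(x) = +1] = \frac{1}{4}, \quad 1 \le i < j \le l$$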
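These two conditions can be checked empirically on a set of $m$ codes. The sketch below assumes that balance means each bit is $+1$ for about $m/2$ samples and that independence means each pair of bits is jointly $+1$ for about $m/4$ samples. The symbol $o_i$ (count of samples with bit $i$ equal to $+1$) and the function name are introduced here for illustration only; $o_{i,j}$ is the pairwise count used in the optimization section.

```python
import numpy as np

def inclusion_counts(codes):
    """Return (o, O): o[i] counts +1 in bit i, O[i, j] counts +1 in both bits i and j."""
    inside = (np.asarray(codes) == 1).astype(np.int64)  # shape (m, l)
    o = inside.sum(axis=0)
    O = inside.T @ inside
    return o, O

# Four hypothetical 2-bit codes that are perfectly balanced and independent.
codes = np.array([[ 1,  1],
                  [ 1, -1],
                  [-1,  1],
                  [-1, -1]])
o, O = inclusion_counts(codes)
m = codes.shape[0]
print(o, m / 2)        # each o[i] should be close to m/2
print(O[0, 1], m / 4)  # each off-diagonal O[i, j] should be close to m/4
```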
The specific illustrations are as follows:
5. Iterative Optimization
At initialization, a subset $S$ of size $m$ is sampled from the original dataset, and $c$ data points chosen at random from $S$ serve as the initial sphere centers (one per hash function, so $c = l$). The initial centers should roughly follow the distribution of the dataset in space, which reduces the cost of the subsequent optimization. Once the centers are chosen, the radii can be obtained from the balance and independence constraints. Training of the spherical hashing functions can then be divided into two stages.
First stage: adjust the sphere centers so that each $o_{i,j}$, the number of samples in $S$ that fall inside both sphere $i$ and sphere $j$, is as close as possible to $m/4$, the value implied by the independence constraint. In this process, the force that sphere $j$ exerts on sphere $i$ is defined as follows:
$$f_{i \leftarrow j} = \frac{1}{2} \cdot \frac{o_{i,j} - m/4}{m/4} \, (p_i - p_j)$$
To satisfy this condition, two spheres that overlap too much should repel each other so that they separate, and two spheres that are too far apart should attract each other so that they move closer. The formula achieves repulsion and attraction through the sign of $(o_{i,j} - m/4)$ applied to the direction $(p_i - p_j)$. The $m/4$ in the denominator ensures that the magnitude of the force does not depend on the size $m$ of the sample subset.
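The repulsion and attraction described above can be sketched as follows. The function name is hypothetical. The factor $1/2$ and the averaging of the pairwise forces over all $l$ spheres follow my reading of the original paper and should be treated as assumptions of this sketch.

```python
import numpy as np

def pairwise_forces(centers, O, m):
    """Average force acting on each center; O[i, j] is the overlap count o_ij."""
    l = centers.shape[0]
    target = m / 4.0
    # weight[i, j] > 0 means too much overlap (repel), < 0 means too little (attract).
    weight = 0.5 * (np.asarray(O, dtype=float) - target) / target
    np.fill_diagonal(weight, 0.0)  # a sphere exerts no force on itself
    # diff[i, j] = p_i - p_j, the direction pointing from center j to center i.
    diff = centers[:, None, :] - centers[None, :, :]
    return (weight[:, :, None] * diff).sum(axis=1) / l

# Two hypothetical centers whose spheres share 4 of m = 8 samples (target is 2).
centers = np.array([[0.0, 0.0], [1.0, 0.0]])
O = np.array([[4, 4], [4, 4]])
forces = pairwise_forces(centers, O, m=8)
print(forces)            # the two centers are pushed apart along the x-axis
print(centers + forces)  # updated centers
```

With an overlap of 0 instead of 4, the weights change sign and the same two centers are pulled toward each other.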
Second stage: once the centers have been updated by the forces, adjust each radius $t_k$ so that the balance constraint holds, that is, so that half of the samples in $S$ fall inside sphere $k$.
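One simple way to set the radii, assuming the goal of this stage is that about half of the samples in $S$ fall inside each sphere, is to take the median of the distances from the samples to each center. Using `np.median` is a choice of this sketch, and the function name is hypothetical.

```python
import numpy as np

def balanced_radii(S, centers):
    """Radius per sphere such that roughly half of the rows of S lie inside it."""
    # Distances from every sample to every center, shape (m, l).
    dists = np.linalg.norm(S[:, None, :] - centers[None, :, :], axis=2)
    # The median distance splits the samples into two halves.
    return np.median(dists, axis=0)

# Hypothetical 1-D example: four samples and one center at the origin.
S = np.array([[0.0], [1.0], [2.0], [3.0]])
centers = np.array([[0.0]])
radii = balanced_radii(S, centers)
inside = (np.abs(S[:, 0]) <= radii[0]).sum()
print(radii, inside)
```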
In the first stage, the ideal outcome is that the mean and standard deviation of the $o_{i,j}$ are $m/4$ and $0$ respectively, but insisting on this is prone to overfitting. Two tolerance thresholds are therefore used, 10% for the mean and 15% for the standard deviation, and the algorithm is reported to perform best with these two values.
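The whole spherical hashing algorithm can be summarized as follows: sample the subset $S$, choose the initial centers, then repeat the two stages above until the mean and standard deviation of the $o_{i,j}$ fall within the thresholds.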
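The overall procedure might be sketched as below. This is not the authors' reference implementation. The function name, `max_iter` and `seed` are hypothetical, the radii are set by the median distance, the force uses a factor of $1/2$ averaged over the $l$ spheres, and the two thresholds are interpreted as bounds relative to $m/4$ on the mean absolute deviation and on the standard deviation of the $o_{i,j}$. All of these are assumptions of the sketch.

```python
import numpy as np

def train_spherical_hashing(S, l, eps_mean=0.10, eps_std=0.15, max_iter=100, seed=0):
    """Sketch of the two-stage loop; returns (centers, radii). Needs l >= 2."""
    rng = np.random.default_rng(seed)
    m = S.shape[0]
    target = m / 4.0
    # Initial centers: l distinct points drawn at random from the subset S.
    centers = S[rng.choice(m, size=l, replace=False)].astype(float)
    pairs = np.triu_indices(l, k=1)  # index pairs with i < j

    def distances_and_radii(c):
        # The median distance puts about half of S inside each sphere.
        d = np.linalg.norm(S[:, None, :] - c[None, :, :], axis=2)
        return d, np.median(d, axis=0)

    for _ in range(max_iter):
        dists, radii = distances_and_radii(centers)
        inside = (dists <= radii).astype(float)
        O = inside.T @ inside  # O[i, j] = o_ij
        o_pairs = O[pairs]
        # Stop once both tolerances are met (assumed reading of the thresholds).
        if (np.mean(np.abs(o_pairs - target)) <= eps_mean * target
                and np.std(o_pairs) <= eps_std * target):
            break
        # Repel spheres that overlap too much, attract those that overlap too little.
        weight = 0.5 * (O - target) / target
        np.fill_diagonal(weight, 0.0)
        diff = centers[:, None, :] - centers[None, :, :]
        centers = centers + (weight[:, :, None] * diff).sum(axis=1) / l

    # Radii that match the final centers.
    _, radii = distances_and_radii(centers)
    return centers, radii

# Usage on synthetic data (hypothetical sizes).
S = np.random.default_rng(1).normal(size=(200, 8))
centers, radii = train_spherical_hashing(S, l=4)
print(centers.shape, radii.shape)
```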