Background
When clustering large sample sets, the computational cost of K-means is often reduced by running it on a random subset of the samples to obtain the cluster centers. Afterwards, each sample still needs to be assigned to its nearest cluster center, a step that is common in index construction, e.g. OPQ (PAMI 2014) and the Inverted Multi-Index (PAMI 2014).
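For concreteness, this workflow might look as follows. This is a minimal sketch: scikit-learn and NumPy are my own choices (the post names no library), and all sizes are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.random.randn(10_000, 2000).astype(np.float32)  # full dataset (illustrative size)
idx = np.random.choice(len(data), size=2_000, replace=False)
km = KMeans(n_clusters=256, n_init=1).fit(data[idx])      # cluster only a random subsample
C = km.cluster_centers_                                   # (256, 2000) cluster center matrix
# Remaining task: assign every sample in `data` to its nearest row of C,
# which is what the steps below compute efficiently for a single sample p.
```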
Algorithm Steps
Let p be a feature vector of shape 1*2000, where 2000 is the feature dimension, and let C be the cluster center matrix of shape 256*2000, where 256 is the number of centers.
1. Squared norm of the sample: compute the sum of squares of p's 2000 dimensions, obtaining p_norm, a single floating-point scalar.
2. Squared norms of the cluster centers: for each row of C, compute the sum of squares over its dimensions, obtaining c_norm (256*1).
3. Extend the scalar p_norm to a length-256 vector (one copy per center).
4. Compute p_norm = p_norm + c_norm element-wise, so that entry i holds ||p||^2 + ||c_i||^2.
5. Compute the vector of squared distances from p to the 256 centers: dis = -2 * C * p^T + p_norm.
6. Take the index of the smallest entry of dis; that index identifies the cluster center nearest to p.
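Putting the six steps together, a minimal NumPy sketch might look like this (the function and variable names are mine; the post describes the computation only in prose):

```python
import numpy as np

def nearest_center(p: np.ndarray, C: np.ndarray) -> int:
    """Index of the row of C with the smallest Euclidean distance to p.

    p: (d,) sample vector, e.g. d = 2000
    C: (k, d) cluster center matrix, e.g. k = 256
    """
    p_norm = np.sum(p ** 2)           # step 1: scalar sum of squares of p
    c_norm = np.sum(C ** 2, axis=1)   # step 2: (k,) row-wise sums of squares of C
    norms = p_norm + c_norm           # steps 3-4: broadcasting replicates p_norm k times
    dis = -2.0 * (C @ p) + norms      # step 5: (k,) squared distances ||p - c_i||^2
    return int(np.argmin(dis))        # step 6: index of the nearest center

# Usage with the shapes from the post:
rng = np.random.default_rng(0)
C = rng.standard_normal((256, 2000))
p = rng.standard_normal(2000)
print(nearest_center(p, C))
```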
Why does the formula in step 5 yield a distance?
A: Let c = (c_1, c_2, ..., c_{2000}) be one cluster center (a row of C) and let the sample point be p = (p_1, p_2, ..., p_{2000}). Step 1 computes p_1^2 + ... + p_{2000}^2 and step 2 computes c_1^2 + ... + c_{2000}^2, so the formula in step 5 actually evaluates

-2 * (p_1*c_1 + ... + p_{2000}*c_{2000}) + p_1^2 + ... + p_{2000}^2 + c_1^2 + ... + c_{2000}^2

which is exactly

(p_1 - c_1)^2 + (p_2 - c_2)^2 + ... + (p_{2000} - c_{2000})^2,

the squared Euclidean distance between p and c. Minimizing this squared distance over the 256 centers is equivalent to minimizing the distance itself.
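A quick numerical check of this identity (my own verification sketch, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.standard_normal(2000)
c = rng.standard_normal(2000)

expanded = -2.0 * np.dot(p, c) + np.sum(p ** 2) + np.sum(c ** 2)
direct = np.sum((p - c) ** 2)         # squared Euclidean distance, computed directly
assert np.allclose(expanded, direct)  # the expanded form matches
```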