How to search for similar data!

Source: Internet
Author: User
The data format is as follows:
"10101010001001101100100011000100100100001011100001000010010101000101010101000101" ...... a total of 256 bits
Each record is an identifier made up only of the characters 1 and 0. Besides the 256-bit identifiers there are also 64-bit and 1024-bit ones.
So far I have 256-bit and 64-bit records. On the order of a million records have been generated, and more are being generated continuously.

String 1 = "10101101001010010111010101100001011101000101010010001000111001101010010101" ...
String 2 = "10101001001011010111010101000001011101000101110010001001111001101010010101" ...
Similarity calculation is as follows:

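$len = strlen($hash1);
$count = 0; // number of positions where the two strings differ
for ($i = 0; $i < $len; $i++) {
    if ($hash1[$i] !== $hash2[$i]) {
        $count++;
    }
}
return 1 - ($count / $len); // share of positions that are identical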

This gives the similarity.
Search requirements:
Read from the database every record whose similarity to a given string s is greater than 0.9. The data is currently stored in MySQL. This is a personal project, so I cannot buy a commercial database; NoSQL storage and memcache can be used. My main programming languages are PHP and JavaScript, and I will use these two for any pre-processing.

How can I search for such data?

Reply content:

I can think of two optimizations to the algorithm.

First, your code compares the strings position by position. As soon as $count exceeds 10% of $len you can stop looping, because the similarity is then certain to be below 0.9.
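
The early exit can be sketched roughly as follows. The function name isSimilarEnough is made up for illustration, and the sketch assumes the threshold means "at least 0.9" and that both strings have the same length; it uses an integer mismatch budget so that no floating-point rounding is involved.

```php
<?php
// Hypothetical helper, not from the original post.
// Returns true when two equal-length bit strings differ in at most one
// tenth of their positions, i.e. when their similarity is at least 0.9.
function isSimilarEnough(string $hash1, string $hash2): bool
{
    $len = strlen($hash1);
    $maxDiff = intdiv($len, 10); // largest mismatch count that keeps similarity >= 0.9
    $count = 0;
    for ($i = 0; $i < $len; $i++) {
        if ($hash1[$i] !== $hash2[$i]) {
            $count++;
            if ($count > $maxDiff) {
                return false; // budget exceeded, the rest cannot change the answer
            }
        }
    }
    return true;
}
```

For a 256-bit string the budget is 25 mismatches, so a dissimilar pair is usually rejected long before the loop reaches the end.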

Second, since you generate the data yourself, you can also store a hexadecimal form of each string, split into units, at generation time.

For example, take a 1024-bit string and treat every 16 bits as one unit, which gives 64 units. Converting the 16 bits of each unit into 4 hexadecimal digits turns the 1024-bit string into a 256-digit hexadecimal string.
For each comparison, compare the 64 units one by one. If 58 or more of them are identical, the similarity is above 0.9 (58/64 is about 0.906) and nothing more needs to be checked.

What if only 57 are identical? The remaining 7 units contain 4*7 = 28 hexadecimal digits; compare those one by one.
If a of them are identical, the similarity is at least ((57*4)+a) / (64*4). So once a is greater than or equal to 0.9*64*4 - 57*4 = 2.4, that is, once 3 digits match, there is no need to go on: the similarity is above 0.9.

What if only a-1 of them are identical? Then convert the 28-(a-1) differing digits to binary and compare them bit by bit in the same way.
If b of those bits are identical, the similarity is ((57*16)+(a-1)*4+b) / 1024, so it is above 0.9 when b is greater than or equal to 0.9*1024 - 57*16 - (a-1)*4.

The 57 and a-1 above stand for whatever counts x and y actually occur. This takes some bookkeeping, but it is far less work than comparing every bit one by one.
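
A minimal sketch of the unit, digit, bit comparison described above. The names bitsToHex and isSimilarHex are hypothetical, the threshold is taken as "at least 0.9", and both hex strings are assumed to have the same length, the same letter case, and a length that is a multiple of 4 digits. For brevity it checks digits inside a differing unit straight away instead of finishing all units first.

```php
<?php
// Hypothetical helper: convert a string of '0'/'1' characters to hex digits.
function bitsToHex(string $bits): string
{
    $hex = '';
    foreach (str_split($bits, 4) as $nibble) {
        $hex .= dechex(bindec($nibble));
    }
    return $hex;
}

// Hypothetical helper: true when at least 90% of the underlying bits match.
function isSimilarHex(string $hex1, string $hex2): bool
{
    $needed = intdiv(strlen($hex1) * 4 * 9 + 9, 10); // ceil(0.9 * bit length)
    $matched = 0;     // lower bound on the identical bits found so far
    $diffDigits = []; // positions of hex digits that differ

    // Compare 4-digit units; inside a differing unit, compare digit by digit.
    foreach (str_split($hex1, 4) as $u => $unit) {
        $other = substr($hex2, $u * 4, 4);
        if ($unit === $other) {
            $matched += 16;
            continue;
        }
        for ($d = 0; $d < 4; $d++) {
            if ($unit[$d] === $other[$d]) {
                $matched += 4;
            } else {
                $diffDigits[] = $u * 4 + $d;
            }
        }
    }
    if ($matched >= $needed) {
        return true; // enough identical bits without looking inside any digit
    }

    // Last stage: count the identical bits inside the digits that differ.
    foreach ($diffDigits as $p) {
        $xor = hexdec($hex1[$p]) ^ hexdec($hex2[$p]);
        $matched += 4 - substr_count(decbin($xor), '1');
    }
    return $matched >= $needed;
}
```

The hex form would be produced once with bitsToHex when a record is generated and stored next to the original string, so that only isSimilarHex runs at search time.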

In addition, you could cache the similarity of every pair of 16-bit units and reuse that cache when comparing the 256-bit and 64-bit strings as well. On reflection, that looks like a lot of entries ...... so I will stop here.
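
One possible way to keep such a cache small: the number of differing bits between two units depends only on their XOR, so a table with one entry per 16-bit value (65,536 entries) is enough, rather than one entry per pair of units. The sketch below assumes the hex storage from the second point; buildPopcountTable and differingBits are hypothetical names.

```php
<?php
// Hypothetical helper: number of 1 bits for every 16-bit value, built once.
function buildPopcountTable(): array
{
    $table = [0];
    for ($v = 1; $v < 65536; $v++) {
        // popcount(v) = popcount(v >> 1) + lowest bit of v
        $table[$v] = $table[$v >> 1] + ($v & 1);
    }
    return $table;
}

// Hypothetical helper: differing bits between two equal-length hex strings
// whose length is assumed to be a multiple of 4 digits (one 16-bit unit).
function differingBits(string $hex1, string $hex2, array $table): int
{
    $diff = 0;
    foreach (str_split($hex1, 4) as $u => $unit) {
        $other = substr($hex2, $u * 4, 4);
        $diff += $table[hexdec($unit) ^ hexdec($other)];
    }
    return $diff;
}
```

The similarity is then 1 minus differingBits divided by the bit length, and the same table serves the 64-bit, 256-bit and 1024-bit strings.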

XOR the two strings directly and then count the 1s in the result, just as you would count the 1s in a single string. That count is the number of bits in which the two strings differ.
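
In PHP this can be tried directly on the '0'/'1' character strings, because the ^ operator applied to two strings XORs them byte by byte. A minimal sketch, assuming equal-length strings; hammingDistance and similarity are names chosen here for illustration.

```php
<?php
// Hypothetical helper: number of positions where two equal-length strings
// made only of the characters '0' and '1' differ.
function hammingDistance(string $hash1, string $hash2): int
{
    // '0' ^ '0' and '1' ^ '1' give "\x00"; '0' ^ '1' gives "\x01".
    $xor = $hash1 ^ $hash2;
    return substr_count($xor, "\x01");
}

// Same measure as in the question: share of identical positions.
function similarity(string $hash1, string $hash2): float
{
    return 1 - hammingDistance($hash1, $hash2) / strlen($hash1);
}
```

If the 64-bit identifiers were stored as BIGINT UNSIGNED instead of text (an assumption, the question does not say how they are stored), MySQL's BIT_COUNT() applied to the XOR of two such values would compute the same count on the database side.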

Read up on Hamming distance and cosine similarity.
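
For bit strings, the Hamming distance is simply the number of differing positions, which is what the similarity formula in the question already counts. Cosine similarity treats each string as a 0/1 vector instead. A small sketch, assuming equal-length '0'/'1' strings; the function name cosineSimilarity and the choice to return 0.0 for an all-zero string are conventions picked here, not part of the original answers.

```php
<?php
// Hypothetical helper: cosine similarity of two equal-length strings made
// only of the characters '0' and '1', viewed as 0/1 vectors.
function cosineSimilarity(string $hash1, string $hash2): float
{
    $ones1 = substr_count($hash1, '1');
    $ones2 = substr_count($hash2, '1');
    if ($ones1 === 0 || $ones2 === 0) {
        return 0.0; // undefined for an all-zero vector; 0.0 is a convention chosen here
    }
    // Bytewise AND: '1' & '1' is '1', every other combination is '0',
    // so counting '1' gives the dot product of the two vectors.
    $dot = substr_count($hash1 & $hash2, '1');
    return $dot / sqrt($ones1 * $ones2);
}
```

Unlike the measure in the question, this ignores positions where both strings are 0, so the two measures can rank the same pair of records differently.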
