I've been collecting material on this for almost a month. I don't fully understand it yet, but let me start writing it down slowly; perhaps a train of thought will emerge.
The biggest benefit of open source is that it gives an author a sense of shame about dirty, smelly code.
When a recommendation-system team starts to take seriously the so-called dirty work of "data cleaning, data alignment, effect evaluation, data statistics, data analysis", that recommendation system can be saved.
To do: learn how to use GitHub.
Simplicity is not equal to stupidity.
Why do I say I'm tired: I'm in the habit of chasing the cause behind every cause, so my whole brain runs at high load. But this really isn't good; it makes me learn more slowly.
One of my biggest realizations is that the various National Natural Science Foundation projects I used to admire can actually muddle through: not only is the efficiency low, there may also be many mistakes. Well, I'll say no more.
I. Source of the problem
I found SimHash while looking into LSH; it is the algorithm Google currently uses to deduplicate web pages. But what is the connection between SimHash and LSH? To say it up front: SimHash is essentially a kind of LSH. Because it is locally sensitive (local sensitivity here means that for very similar texts, differing in even one character, the MD5 digests may be completely different, while the SimHash values may differ in only a few bits), we can use the Hamming distance to measure the similarity of SimHash values.
SimHash is the algorithm Google uses to deduplicate massive amounts of text. Made by Google, you know. The best thing about SimHash is that it converts a document into a 64-bit value, call it a fingerprint, and then by judging whether the Hamming distance between two fingerprints is < n (empirically, n is usually 3), you can tell whether two documents are similar.
Made by Google: simple and practical.
II. Algorithm analysis

2.1 Algorithm pseudo-code
1. Segmentation: segment the text to be judged into words, forming the feature words of the document. Remove noise words and assign each remaining word a weight; assume the weights are divided into 5 levels. For example: "Employees at the U.S. 'Area 51' said there are 9 flying saucers inside, and that they have seen gray aliens" ==> after segmentation: "U.S. (4) Area 51 (5) employees (3) said (1) inside (2) have (1) 9 (3) flying saucers (5) ever (1) seen (3) gray (4) aliens (5)". The number in parentheses represents the importance of the word in the whole sentence; the larger the number, the more important.
2. Hash: use a hash function to turn each word into a hash value. For example, "U.S." hashes to 100101 and "Area 51" hashes to 101011. Our string of words thus becomes a string of numbers. Remember the point from the beginning of the article: turning an article into numbers improves the performance of similarity computation; this begins the dimensionality-reduction process.
3. Weighting: from the hash produced in step 2, form a weighted sequence according to each word's weight: a 1 bit becomes +weight and a 0 bit becomes -weight. For example, the hash of "U.S." is 100101, which weighted by 4 becomes "4 -4 -4 4 -4 4"; the hash of "Area 51" is 101011, which weighted by 5 becomes "5 -5 5 -5 5 5".
4. Merge: add up the sequences of all words position by position into a single sequence. For example, "4 -4 -4 4 -4 4" for "U.S." and "5 -5 5 -5 5 5" for "Area 51" added position by position give "4+5, -4-5, -4+5, 4-5, -4+5, 4+5" == "9 -9 1 -1 1 9". Only two words are shown here as an example; the real computation accumulates the sequences of all the words.
5. Dimensionality reduction: turn the "9 -9 1 -1 1 9" computed in step 4 into a 0/1 string, forming our final SimHash signature: each position greater than 0 becomes 1, and each position less than 0 becomes 0. The final result is "1 0 1 0 1 1".
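The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not Google's implementation: the per-word hash is derived from MD5 here purely for convenience, and any fixed-width hash function would do.

```python
import hashlib

def simhash(weighted_words, bits=64):
    """Compute a SimHash fingerprint from (word, weight) pairs.

    A minimal sketch of the five steps described above.
    """
    v = [0] * bits  # step 4: accumulator, one slot per bit position
    for word, weight in weighted_words:
        # step 2: hash each word to a `bits`-wide integer (MD5 is an
        # arbitrary choice here, truncated to the low `bits` bits)
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            # step 3: +weight where the hash bit is 1, -weight where it is 0
            v[i] += weight if (h >> i) & 1 else -weight
    # step 5: reduce to 0/1 bits: positive -> 1, otherwise 0
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

fp = simhash([("U.S.", 4), ("Area 51", 5), ("aliens", 5)])
```

Because the merge step is a plain sum, the order of the words does not affect the fingerprint; only the words and their weights do.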
2.2 The difference between SimHash and a traditional hash
People may wonder: after all this trouble, isn't the result just a 0/1 string anyway? Couldn't I simply feed the whole text into an ordinary hash function and get a 0/1 string directly? In fact, traditional hash functions solve the problem of generating unique values, e.g. MD5 or the hashes behind a HashMap. MD5 is for generating unique signatures: two strings differing by even one character get MD5 digests that look wildly different. A HashMap's hash serves key-value lookup, a data structure for convenient insertion and retrieval. What we need to solve here, however, is text similarity: whether two articles resemble each other, and the hash code we produce by dimensionality reduction serves exactly that purpose. So the SimHash of an article, even reduced to a 0/1 string, can still be used to compute similarity, while a traditional hash code cannot. We can run a test with two text strings that differ in only one character (in the original Chinese): "Your mom calls you home for dinner, go home, go home" and "Your mom tells you to come home to eat, go home, go home".
Calculated with SimHash:
1000010010101101111111100000101011010001001111100001001011001011
1000010010101101011111100000101011010001001111100001101010001011
Calculated with an ordinary hash code:
1111111111111111111111111111111110001000001100110100111011011110
1010010001111111110010110011101
As you can see, for similar texts only part of the 0/1 string changes, which an ordinary hash code cannot achieve. This is the charm of locality-sensitive hashing. At present, Broder's shingling algorithm and Charikar's SimHash algorithm are regarded as the better algorithms in the industry.
The paper by Charikar, SimHash's inventor, does not give the concrete SimHash algorithm or a proof; the blog "Quantum Turing" derives the proof that SimHash evolved from the random hyperplane hash algorithm.
Now, with this conversion, we turn every text in the library into a SimHash code stored as a long, greatly reducing the space needed. Space is solved, but how do we compute the similarity of two SimHash values? By comparing how many bits of the two 0/1 strings differ? Exactly: we can measure the similarity of two SimHash values by their Hamming distance. The number of positions at which the corresponding bits of the two binary (0/1) strings differ is called their Hamming distance. For example: 10101 and 00110 differ at the first, fourth, and fifth positions, so the Hamming distance is 3. For binary strings a and b, the Hamming distance equals the number of 1s in the result of a XOR b (the universal algorithm).
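In Python the XOR popcount trick is a one-liner; a quick sketch:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits: popcount of a XOR b."""
    return bin(a ^ b).count("1")

# The example from the text: 10101 vs 00110 differ in positions 1, 4 and 5.
d = hamming_distance(0b10101, 0b00110)  # 3
```

Applied to two 64-bit SimHash fingerprints stored as longs, this is a single XOR plus a bit count, which is why the comparison is so cheap.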
The biggest difference between SimHash and an ordinary hash: a traditional hash function can also be used for mapping and comparing texts for exact duplicates, but two documents differing by as little as one byte may be mapped to completely different hash results, whereas SimHash maps similar texts to similar hash results.
http://www.lanceyan.com/tech/arch/simhash_hamming_distance_similarity.html
http://blog.sina.com.cn/s/blog_81e6c30b0101cpvu.html
How are the weights assigned? I only know of using the TF-IDF algorithm.
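For what it's worth, here is a toy sketch of TF-IDF weighting (my own illustration, not from any of the referenced posts): tf is the word's count in the document, idf is log(N / df) over a small corpus of pre-segmented documents.

```python
import math
from collections import Counter

def tf_idf_weights(doc_words, corpus):
    """Toy TF-IDF: weight(w) = tf(w) * log(N / df(w)).

    `corpus` is a list of documents, each a list of words; `doc_words`
    is the document being weighted. A sketch only; a real system would
    use a proper tokenizer and idf smoothing.
    """
    n_docs = len(corpus)
    tf = Counter(doc_words)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)  # document frequency
        idf = math.log(n_docs / df) if df else 0.0
        weights[word] = count * idf
    return weights
```

The resulting weights would then be bucketed into the 5 importance levels used in step 1 above.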
III. Implementation of the algorithm
If anyone has easy-to-understand Java or MATLAB code, please message me, and we can make it available for everyone.
IV. Comparison with other deduplication algorithms

4.1 Baidu
Baidu's deduplication algorithm is the simplest: directly take the article's n longest sentences and hash them into a signature. n is usually 3. The engineering implementation is extremely simple, and reportedly precision and recall can both exceed 80%.
Baidu's deduplication algorithm can't really be that naive, can it? According to a colleague who left Baidu, that's how it is. And I personally think simple does not equal stupid.
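As the post relays it, that scheme is easy to sketch. The sentence splitter and the use of MD5 below are my own assumptions for illustration; any deterministic splitter and hash would do.

```python
import hashlib
import re

def baidu_style_signature(text, n=3):
    """Sketch of the relayed Baidu-style dedup: hash the n longest sentences.

    Splitting on simple end punctuation is an assumption for illustration.
    """
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    longest = sorted(sentences, key=len, reverse=True)[:n]
    longest.sort()  # make the signature independent of sentence order
    return hashlib.md5("\u0001".join(longest).encode("utf-8")).hexdigest()
```

Two articles that share their longest sentences collide on the signature even if the rest of the text was shuffled or lightly edited, which is presumably where the reported recall comes from.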
4.2 Shingle algorithm
The principle of shingling is slightly more complicated, so I won't elaborate. In my view the shingle algorithm is too academic: unfriendly for engineering, slow, and basically unable to handle massive data.
V. Extending the problem
Problem: a set Q consists of 8 billion 64-bit fingerprints. Given a 64-bit fingerprint F, how do you find, within a few milliseconds, all fingerprints in Q that differ from F in at most k (k=3) bits?
Read the literature.
I think one could borrow from the Aho-Corasick family of automaton algorithms to do the matching, but with a matching rule of Hamming distance less than 3. What I mean is an optimized exact match.
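For the record, the approach in the literature (Google's WWW'07 near-duplicate paper by Manku et al., as I understand it) exploits a pigeonhole argument rather than an automaton: split the 64-bit fingerprint into k+1 = 4 blocks of 16 bits; if two fingerprints differ in at most k=3 bits, at least one block must match exactly, so exact-match tables keyed on the blocks prune the candidates. A minimal in-memory sketch (the real system shards permuted tables across machines):

```python
from collections import defaultdict

BLOCKS = 4       # k + 1 blocks for k = 3
BLOCK_BITS = 16  # 64 / 4

def blocks_of(fp):
    """Split a 64-bit fingerprint into 4 blocks of 16 bits each."""
    return [(fp >> (i * BLOCK_BITS)) & 0xFFFF for i in range(BLOCKS)]

def build_index(fingerprints):
    """Index every fingerprint under each of its 4 block values."""
    index = defaultdict(set)
    for fp in fingerprints:
        for i, b in enumerate(blocks_of(fp)):
            index[(i, b)].add(fp)
    return index

def query(index, f, k=3):
    """Find indexed fingerprints within Hamming distance k of f.

    Pigeonhole: any fingerprint within distance k shares at least one
    block with f exactly, so only those buckets need to be scanned.
    """
    candidates = set()
    for i, b in enumerate(blocks_of(f)):
        candidates |= index.get((i, b), set())
    return {c for c in candidates if bin(c ^ f).count("1") <= k}
```

Each query touches only 4 buckets instead of the whole set, which is how the millisecond budget becomes plausible even at 8 billion fingerprints.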
http://grunt1223.iteye.com/blog/964564
http://blog.csdn.net/lgnlgn/article/details/6008498
http://blog.sina.com.cn/s/blog_81e6c30b0101cpvu.html
VI. Accidental harvest
1. Python's computing power is really strong: its integer type can represent numbers of arbitrary length, whereas Java and C++ have to resort to other means, e.g. Java's BigInteger, where even the bit operations can only be done through class methods... sweat...
Also, bitwise operations only suit integers, because the floating-point storage format rules them out. If you insist on bit-manipulating a float in Java, you have to go through Float.floatToIntBits, do the operation, and then convert back with Float.intBitsToFloat. (Java's default hashCode for Float/Double is in fact derived from the corresponding floatToIntBits/doubleToLongBits value.)
2. Baidu's paid-ranking (bidding) system: the Phoenix Nest system.
3. A great expert's reflections on employment and study: http://yanyiwu.com/life/2014/10/11/choices-change-my-life.html
VII. Editor's note
The references are interleaved: a reference cited in section 1 may also be used in section 2 without being listed there again. Why do it this way? Putting all the references at the end would be inconvenient for later searching, but this blog platform does not make referencing as convenient as Word does, so I resorted to this trick.
For future posts, I'll have no choice but to draft them in Word as well.
The SimHash algorithm: collaborative filtering based on locality-sensitive hashing.