This article environment:
Third party libraries:
The code is on GitHub: https://github.com/w392807287/angelo_tools.git
Simhash Introduction
I will soon be writing my graduation thesis, and it reportedly has to pass a duplicate-content check. I was curious how duplicate documents are detected, so I read up on the topic and found Simhash to be both useful and easy to implement.
As the name suggests, Simhash is a hash algorithm. My earlier impression of hash algorithms was that they map an object to a hash value and only guarantee that two identical objects get the same hash value; the hash values of two similar objects need not be related at all. Two inputs that differ by a single character can have wildly different hashes. However, if the hash function is designed cleverly, similar objects can get the same or similar hash values, and searching for similar items by hash then becomes convenient and fast.
Simhash is such a magical algorithm. It satisfies:
- When the distance between two objects is at most d1, the probability that their hash values are equal is at least p1: if D(x, y) ≤ d1, then P(hash(x) = hash(y)) ≥ p1.
- When the distance between two objects is at least d2, the probability that their hash values are equal is at most p2: if D(x, y) ≥ d2, then P(hash(x) = hash(y)) ≤ p2.
Simhash can hash a document into a 64-bit binary number such that similar documents get similar binary numbers. For a document, we can treat each word or phrase in the text as a feature and count how often each feature occurs (you can also add weights, for example by part of speech; how to set the weights and count the features depends on the situation). In the example below we use Jieba for word segmentation.
Take the target document "Gourd Doll, Gourd Doll, seven flowers on one vine". Segmenting it gives the features and their frequencies: (Gourd Doll, 0.33), (one, 0.17), (on the vine, 0.17), (seven, 0.17), (flower, 0.17). We then hash each feature; to keep the demonstration simple, we use 6-bit hashes here:
- Gourd Doll: 100100
- one: 010101
- on the vine: 101010
- seven: 111010
- flower: 001010
Then, for each feature, we construct a vector from the bits of its binary hash. If a bit is 1, the component at that position is the feature's frequency; otherwise it is the negative of the frequency. For example:
Gourd Doll: (0.33,-0.33,-0.33,0.33,-0.33,-0.33)
......
Adding the vectors of all features gives (0.33, -0.33, 0, 0, 0, -0.67).
For each component, take 1 if it is greater than 0 and 0 otherwise. This yields the binary Simhash, which here is 100000.
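The toy example above can be reproduced in a few lines of Python. This is only a sketch: the English token labels are glosses standing in for the segmenter output, the 6-bit hashes are copied from the example rather than computed by a real hash function, and exact fractions are used (an assumption made here) so that components that cancel come out as exactly 0.

```python
from collections import Counter
from fractions import Fraction

# Tokens of the example sentence (English glosses of the segmented words).
tokens = ["Gourd Doll", "Gourd Doll", "one", "on the vine", "seven", "flower"]

# 6-bit demonstration hashes copied from the example, not computed.
toy_hash = {
    "Gourd Doll": "100100",
    "one": "010101",
    "on the vine": "101010",
    "seven": "111010",
    "flower": "001010",
}

def toy_simhash(tokens, hashes, nbits=6):
    """Return the summed weight vector and the resulting fingerprint."""
    counts = Counter(tokens)
    total = len(tokens)
    sums = [Fraction(0)] * nbits
    for tok, n in counts.items():
        weight = Fraction(n, total)  # frequency of the feature
        for i, bit in enumerate(hashes[tok]):
            # +weight where the feature's hash bit is 1, -weight where it is 0
            sums[i] += weight if bit == "1" else -weight
    fingerprint = "".join("1" if s > 0 else "0" for s in sums)
    return sums, fingerprint

sums, fingerprint = toy_simhash(tokens, toy_hash)
print([str(s) for s in sums])  # ['1/3', '-1/3', '0', '0', '0', '-2/3']
print(fingerprint)             # 100000
```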
Features that occur more frequently in the text have vector components with larger absolute values, so they have more influence on the final summed vector. Therefore, if two documents are similar, their high-frequency features should be close, and the resulting hash values will be close as well. In Google's web-page deduplication, two documents whose 64-bit hashes differ in at most 3 bits are judged to be similar.
Algorithm implementation
import re
import numpy as np

@classmethod
def simhash(cls, s, RE=None, cut_func=None):
    # Pattern selecting the tokens we care about (default: Chinese characters).
    if RE:
        REX = RE
    else:
        REX = re.compile(u'[\u4e00-\u9fa5]+')
    if not cut_func:
        cut_func = cls.cut_func  # jieba.cut
    cut = [x for x in cut_func(s) if re.match(REX, x)]
    # One row per feature: +frequency where the hash bit is 1, -frequency where it is 0.
    ver = [[v * (int(x) if int(x) > 0 else -1) for x in k] for k, v in cls.hist(cut).items()]
    ver = np.array(ver)
    ver_sum = ver.sum(axis=0)
    sim = ''.join(['1' if x > 0 else '0' for x in ver_sum])
    return sim

First we use a regular expression to define the content we are interested in; here we keep only Chinese text. Then we set the function used for word segmentation; here we use Jieba.
Then we get the segmentation result:
cut = [x for x in cut_func(s) if re.match(REX, x)]
Get the vector matrix:
ver = [[v * (int(x) if int(x) > 0 else -1) for x in k] for k, v in cls.hist(cut).items()]
To facilitate the calculation we introduce NumPy to help us do the matrix calculation:
ver = np.array(ver)
ver_sum = ver.sum(axis=0)
Finally, the result is converted to a binary hash. Because we hash each token with MD5, whose hexadecimal digest has 32 characters, and map each character to one bit, the final hash value is also 32 bits long:
11111101011001101110111100101101
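We have used several helper functions:

@classmethod
def hist(cls, cut):
    # Count each token, then key its frequency by the token's binary hash.
    _cut = {x: 0 for x in set(cut)}
    for i in cut:
        _cut[i] += 1
    return {cls.hash_bin(k): v / len(cut) for k, v in _cut.items()}

The hist function converts the segmentation result into a mapping from each feature's binary hash to its frequency.

import hashlib

@classmethod
def hash2bin(cls, hash):
    d = ''
    for i in hash:
        try:
            # Digits 0-7 map to '0', digits 8-9 map to '1'.
            if int(i) > 7:
                d = d + '1'
            else:
                d = d + '0'
        except ValueError:
            # Hex letters a-f are not decimal digits; they map to '1'.
            d = d + '1'
    return d

@classmethod
def hash_bin(cls, s):
    h = hashlib.md5(s.encode()).hexdigest()
    return cls.hash2bin(h)

The hash_bin function hashes a token string into a binary hash value; the underlying hash algorithm is MD5 with its 32-character hexadecimal digest.
The hash2bin function maps a hexadecimal hash value to a binary one: each hex digit greater than 7 (8-9 and a-f) becomes 1, and every other digit becomes 0.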
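As a quick sanity check on this hex-to-binary mapping, the sketch below re-implements the two helpers as plain functions (placing them outside the class is an assumption made for brevity) and confirms that "digit greater than 7" is the same as keeping the highest bit of each hexadecimal digit.

```python
import hashlib

def hash2bin(hex_digest):
    """Map each hex digit to one bit: 8-9 and a-f -> '1', 0-7 -> '0'."""
    return "".join("1" if int(c, 16) > 7 else "0" for c in hex_digest)

def hash_bin(s):
    """MD5 the UTF-8 encoded string and reduce the hex digest to 32 bits."""
    return hash2bin(hashlib.md5(s.encode()).hexdigest())

print(hash2bin("0789af"))  # 001111

bits = hash_bin("simhash")
print(len(bits))  # 32, one bit per hex character of the MD5 digest

# Same result when written as "keep the top bit of every hex digit".
digest = hashlib.md5("simhash".encode()).hexdigest()
assert bits == "".join(str(int(c, 16) >> 3) for c in digest)
```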
To facilitate comparison we use Hamming distance to determine the similarity of two hash values:
@staticmethod
def haiming(s1, s2):
    # Count the positions at which the two bit strings differ.
    x = 0
    for i in zip(s1, s2):
        if i[0] != i[1]:
            x += 1
    return x
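Putting the pieces together, here is a minimal, self-contained sketch of the whole pipeline using only the standard library. It is not the code from the repository: whitespace splitting stands in for Jieba, plain lists stand in for NumPy, and the names `simhash32` and `hamming` are chosen here purely for illustration.

```python
import hashlib
from collections import Counter

def hash_bin(token):
    """32-bit string: the top bit of each hex digit of the token's MD5 digest."""
    digest = hashlib.md5(token.encode()).hexdigest()
    return "".join("1" if int(c, 16) > 7 else "0" for c in digest)

def simhash32(text, cut_func=str.split):
    """Frequency-weighted Simhash of `text`; `cut_func` is the tokenizer."""
    tokens = list(cut_func(text))  # list() so generator tokenizers also work
    if not tokens:
        raise ValueError("text contains no tokens")
    sums = [0.0] * 32
    for token, count in Counter(tokens).items():
        weight = count / len(tokens)
        for i, bit in enumerate(hash_bin(token)):
            # Add the frequency where the bit is 1, subtract it where it is 0.
            sums[i] += weight if bit == "1" else -weight
    return "".join("1" if s > 0 else "0" for s in sums)

def hamming(s1, s2):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(s1, s2))

a = simhash32("the quick brown fox jumps over the lazy dog")
b = simhash32("the quick brown fox jumped over the lazy dog")
c = simhash32("completely unrelated words about cooking dinner tonight")
print(hamming(a, b), hamming(a, c))
```

For Chinese text, a segmenter such as `jieba.cut` could be passed as `cut_func` in place of the whitespace split used in this sketch.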
Effect
1993, Nanjing University has such a male dormitory, four boys do not have a girlfriend, so made a combination called "Four kings of famous grass without the Lord". The four Kings insisted on holding a "crouching talk" every night, from various academic discussions on how to get rid of the status of Bachelor. This year in November, the campus of the Sycamore tree leaves withered, making them exceptionally injury. When they were lying on the night of 11th, the symbolism of semiotics suddenly came to visit. November 11, four a 1-word row, not exactly like four bare sticks? These four singles are not exactly in the cleverly told the "famous Four kings of unknown," the bleak?
*
Know that there is a question, childhood lack of love girl, grow up what to do? Maybe in my place, just want to always have someone to accompany. Hi Bao said, I want a lot of love, or is a lot of money, really not, there is health is good. I have a bad habit, often hungry in the middle of the night, climb up to find food. is really hungry to stomach pain, sometimes directly hungry wake up, every time to see the movie lines, sleep is not hungry, I do not believe. Why are you hungry in the middle of the night? The reason, is the university when no one to accompany me to eat, every time is waiting for someone to accompany me, I will go to dinner, and finally to my stomach pain, the passage of time, gradually become accustomed to endure until very late to eat. I do not like a person to eat, do not like a person to go shopping, but also do not like a person to stay, but growth ah, often the less you like the more you have to learn to accept it. (b) Tell me about the last relationship. When I met him, it was because of the bar dinner, he volunteered to find me, accompanied by a spring-like smile. I always thought that he was touched by my beauty, and then asked him why. He said, he first saw so can eat the girl, he was shocked, but there is a feeling to see me eat very mean, as if the food has a soul, let a person's mood inexplicable good up. We first met, because he saw me starving ghosts reincarnation of eating. We are together because he is a good cook, how good is it? Is the kind of you have eaten a meal, you can remember the feeling of a lifetime. Even if I recall him now, my taste buds will respond. He always made me a lot of delicious, afternoon sunshine from the window, the curtains are light green small flowers, the air filled with the smell of rice, we two people sitting in front of the table, while eating, while chatting. 
I like to go to the market with him to buy vegetables, tomato potato cucumber cabbage, hand carrying these fruits and vegetables food, as I have the world. Once, we from the market back on the road, is clearly the sun high-shine weather, but suddenly under the hail, that is the first time he saw hail, was smashed a bit, then immediately lost his hand in the dishes, both hands to protect me, I silly smacking to pick vegetables, was smashed a. He immediately scolded me and said I was the most delicious girl he had ever seen.
The two passages above are excerpts taken from Jianshu.
Their Simhash values are:
11111101011001101110111100101101
00101101001010110001100000101110
The Hamming distance is 16.
Know that there is a question, childhood lack of love girl, grow up what to do? Maybe in my place, just want to always have someone to accompany. Hi Bao said, I want a lot of love, or is a lot of money, really not, I have a bad problem, often will be hungry at midnight, climb up to find food. is really hungry to stomach pain, sometimes directly hungry wake up, every time to see the movie lines, sleep is not hungry, I do not believe. The reason, is the university when no one to accompany me to eat, every time is waiting for someone to accompany me, I will go to dinner, finally I am hungry to stomach pain, as time goes by I do not like a person to eat, also do not like a person to go shopping, do not like a person to stay, but growth ah, often the more do not like to learn to When I met him, it was because of the bar dinner, he volunteered to find me, accompanied by a spring-like smile. I always thought that he was touched by my beauty, and then asked him why. He said, he first saw so can eat the girl, he was shocked, but there is a feeling to see me eat very mean, as if the food has a soul, let a person's mood inexplicable good up. We first met, because he saw me starving ghosts reincarnation of eating. We are together because he is a good cook, how good is it? Is the kind of you have eaten a meal, you can remember the feeling of a lifetime. Even if I recall him now, my taste buds will respond. He always made me a lot of delicious, afternoon sunshine from the window, the curtains are light green small flowers, the air filled with the smell of rice, we two people sitting in front of the table, while eating, while chatting. 
I like to go to the market with his hand to buy vegetables, tomato potato cucumber cabbage, hand carry these fruits and vegetables food, once, we go back from the market on the road, is clearly the sun, the weather, but suddenly under the hail, that is the first time he saw hail, was smashed a bit, then immediately lost his hand in the dishes, Hands to protect me, I silly smacking to pick vegetables, was smashed a. He immediately scolded me and said I was the most delicious girl he had ever seen.
This passage is a slightly modified version of the second excerpt. Its Simhash is:
00100101001010110000100000101110
Its Hamming distance to the second excerpt is 2.
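The distances quoted in this section can be double-checked directly from the three hash strings. The snippet below is a small sketch that computes the Hamming distance with an integer XOR instead of a character-by-character loop; the variable names are chosen here for illustration.

```python
def hamming(s1, s2):
    """Hamming distance between two equal-length bit strings via integer XOR."""
    if len(s1) != len(s2):
        raise ValueError("bit strings must have the same length")
    return bin(int(s1, 2) ^ int(s2, 2)).count("1")

first    = "11111101011001101110111100101101"  # first excerpt
second   = "00101101001010110001100000101110"  # second excerpt
modified = "00100101001010110000100000101110"  # modified second excerpt

print(hamming(first, second))     # 16
print(hamming(second, modified))  # 2
```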
As you can see, the effect is quite clear.
Anything that can be serialized can be hashed, and the hashes can then be compared for similarity. Simhash is a form of locality-sensitive hashing (LSH). Next time I will discuss how to compare the similarity of images using perceptual hashing.
-- Simhash: hash-based detection of duplicate documents