In simple terms, the Simhash algorithm is used to quickly find, in a massive collection of texts, the set of simhash values that lie within k bits of a given one. Each text is represented by a single simhash value. A simhash is 64 bits long, and similar texts have similar 64-bit values; the empirical value of k given in the paper is 3. The disadvantages of this method are as obvious as its advantages, and there are two main ones. First, for short texts the result is very sensitive to the value of k. Second, the algorithm trades space for time, so its memory consumption can be more than the system can afford.
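To make the "space for time" trade concrete, here is a minimal sketch of one common way to search for near fingerprints. It is an assumption for illustration, not something taken from the listing in this post: each 64-bit fingerprint is cut into k + 1 = 4 blocks of 16 bits, and every block is used as a hash-table key. If two fingerprints differ in at most 3 bits, at least one of the 4 blocks must be identical, so only fingerprints sharing a block need to be compared. "Within k bits" is read here as a Hamming distance of at most k, and the names `NearDuplicateIndex` and `block_keys` are hypothetical.

```python
from collections import defaultdict

K = 3                        # distance threshold, the empirical value quoted above
BITS = 64                    # fingerprint width
BLOCKS = K + 1               # at most K differing bits leave one block untouched
BLOCK_BITS = BITS // BLOCKS  # 16 bits per block


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def block_keys(fingerprint):
    """Split a fingerprint into (block index, block value) keys."""
    mask = (1 << BLOCK_BITS) - 1
    return [(i, (fingerprint >> (i * BLOCK_BITS)) & mask) for i in range(BLOCKS)]


class NearDuplicateIndex:
    """Hypothetical helper: every fingerprint is stored once per block."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def add(self, fingerprint):
        for key in block_keys(fingerprint):
            self.buckets[key].add(fingerprint)

    def query(self, fingerprint):
        """Return the stored fingerprints within K bits of the given one."""
        candidates = set()
        for key in block_keys(fingerprint):
            candidates |= self.buckets.get(key, set())
        return {c for c in candidates if hamming_distance(c, fingerprint) <= K}


index = NearDuplicateIndex()
stored = 0x0123456789ABCDEF
index.add(stored)

near = stored ^ 0b111               # flips 3 bits, all in the lowest block
far = stored ^ 0x0001000100010001   # flips 1 bit in each of the 4 blocks

print(index.query(near))  # the stored fingerprint is found
print(index.query(far))   # nothing is found: the distance is 4
```

Every fingerprint is stored four times, which is where the extra memory goes; in exchange, a query only touches the few buckets that share a block with it instead of scanning the whole collection.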
The code is as follows:
#!/usr/bin/python
# coding=utf-8

class simhash:
    # Constructor
    def __init__(self, tokens='', hashbits=128):
        self.hashbits = hashbits
        self.hash = self.simhash(tokens)

    # String representation of the fingerprint
    def __str__(self):
        return str(self.hash)

    # Generate a simhash value
    def simhash(self, tokens):
        v = [0] * self.hashbits
        for t in [self._string_hash(x) for x in tokens]:  # t is the ordinary hash value of a token
            for i in range(self.hashbits):
                bitmask = 1 << i
                if t & bitmask:
                    v[i] += 1  # bit i of the token hash is 1: add 1 to counter i
                else:
                    v[i] -= 1  # otherwise subtract 1 from counter i
        fingerprint = 0
        for i in range(self.hashbits):
            if v[i] >= 0:
                fingerprint += 1 << i
        return fingerprint  # the document fingerprint has bit i set wherever counter i >= 0

    # Compute the Hamming distance
    def hamming_distance(self, other):
        x = (self.hash ^ other.hash) & ((1 << self.hashbits) - 1)
        tot = 0
        while x:
            tot += 1
            x &= x - 1  # clear the lowest set bit
        return tot

    # Similarity: ratio of the smaller fingerprint value to the larger
    def similarity(self, other):
        a = float(self.hash)
        b = float(other.hash)
        if a > b:
            return b / a
        else:
            return a / b

    # Generate a hash value for source (a variable-length version of Python's built-in string hash)
    def _string_hash(self, source):
        if source == "":
            return 0
        else:
            x = ord(source[0]) << 7
            m = 1000003
            mask = 2 ** self.hashbits - 1
            for c in source:
                x = ((x * m) ^ ord(c)) & mask  # keep x within hashbits bits
            x ^= len(source)
            if x == -1:
                x = -2
            return x


if __name__ == '__main__':
    s = 'this is a test string for testing'
    hash1 = simhash(s.split())

    s = 'this is a test string for testing also'
    hash2 = simhash(s.split())

    s = 'nai nai ge xiong cao'
    hash3 = simhash(s.split())

    print(hash1.hamming_distance(hash2), "", hash1.similarity(hash2))
    print(hash1.hamming_distance(hash3), "", hash1.similarity(hash3))
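
The listing above defaults to 128 bits, while the description at the top talks about 64-bit fingerprints and warns that short texts are sensitive. The following is a minimal sketch, not the author's code, of a 64-bit variant that can be used to try this out. It assumes Python 3 and uses the first 8 bytes of SHA-256 from `hashlib` as the token hash (an assumption made only so that results are reproducible between runs); `token_hash` and `simhash64` are hypothetical names, and the sample sentences are made up.

```python
import hashlib

HASHBITS = 64  # fingerprint width used in the description


def token_hash(token):
    """64-bit token hash: first 8 bytes of SHA-256 (an assumption, see above)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash64(tokens):
    """Same per-bit voting scheme as the class above, on 64-bit token hashes."""
    votes = [0] * HASHBITS
    for t in map(token_hash, tokens):
        for i in range(HASHBITS):
            votes[i] += 1 if (t >> i) & 1 else -1
    # Bit i of the fingerprint is set wherever the vote is non-negative.
    return sum(1 << i for i in range(HASHBITS) if votes[i] >= 0)


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Two pairs that each differ by a single appended token.
long_a = ("the quick brown fox jumps over the lazy dog near the river bank "
          "while the sun sets behind the distant hills").split()
long_b = long_a + ["slowly"]
short_a = "nai nai ge".split()
short_b = short_a + ["xiong"]

print("long pair: ", hamming_distance(simhash64(long_a), simhash64(long_b)))
print("short pair:", hamming_distance(simhash64(short_a), simhash64(short_b)))
```

In a short text each token carries a large share of every per-bit vote, so one extra word can flip many bits, whereas in a long text the same change is diluted. Running the sketch on your own data is a quick way to check whether a threshold such as k = 3 is usable for texts of a given length.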