Judging content similarity using Simhash and Hamming distance

Last Update:2015-09-29 Source: Internet

Author: User

Tags bitmask

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Introduction to Algorithms

Simhash is similar to hash, is a kind of special information fingerprint, commonly used to compare the similarity of the article, compared with the traditional hash, the traditional hash is responsible for the original content as random as possible to map to a characteristic value, and ensure that the same content must have the same eigenvalues. And if the two hash values are equal, then the original data is equal to a certain probability. But it is very difficult to judge the content of the article by the traditional hash, because the traditional hash only indicates its particularity, and it can't be used as the basis of the similarity comparison.

Simhash was originally used by Google, and its value not only provides information about whether the original value is equal, but also calculates the degree of difference in the content.

Algorithm principle

Simhash was proposed by Charikar in 2002, referring to the similarity estimation techniques from rounding algorithms. This paper introduces the main principle of the algorithm, in order to facilitate understanding as far as possible without using mathematical formula, divided into these steps:

1, participle, the need to judge the text Word segmentation form the characteristics of this article word. Finally, to form a sequence of words to remove the noise word and add weights to each word, we assume that the weights are divided into 5 levels. For example: "U.S." 51 district "Employees said there were 9 flying saucers, had seen the gray aliens" ==> participle after "the United States (4) 51 District (5) Employees (3) said (1) Internal (2) has (1) 9 (3) UFO (5) Zeng (1) See (3) Gray (4) Aliens 5), brackets is to represent the importance of the word in the whole sentence, the larger the number the more important.

2, hash, through the hash algorithm to turn each word into a hash value, such as "the United States" through the hash algorithm is calculated as 100101, "51" by the hash algorithm calculated as 101011. So our string becomes a string of numbers, remember the beginning of the article, to turn the article into a digital calculation to improve the similarity computing performance, now is the process of dimensionality reduction.

3, weighted, through the 2-step hash generation results, you need to follow the weight of the word to form a weighted number of strings, such as "the United States" hash value of "100101", through the weighted calculation of "4-4-4 4-4 4"; "51" has a hash value of "101011", weighted to "5-5 5-5 5 5 ".

4, merges, the above each word calculates the sequence value to accumulate, becomes only one sequence string. For example "The United States" of "4-4-4 4-4 4", "51-zone" of "5-5 5-5 5 5", each one to accumulate, "4+5-4+-5-4+5 4+-5-4+5 4+5" = = "9-9 1-1 1 9". Here, as an example, only two words are counted, and the real calculation needs to accumulate the serial strings of all the words.

5, reduced dimension, the 4 step calculated "9-9 1-1 1 9" into 0 1 strings, forming our final simhash signature. If each bit is greater than 0 is 1, less than 0 is recorded as 0. Finally, the result is: "1 0 1 0 1 1".

Schematic diagram:

We can do a test, two different text strings with only one character, "Your mother called you home for dinner Oh, go home Luo home luo" and "your mother told you to go home to eat, go home Luo home luo."

The result is calculated by Simhash:

1000010010101101111111100000101011010001001111100001001011001011

1000010010101101011111100000101011010001001111100001101010001011

By comparing the number of bits of the difference, we can get two strings of text difference, the number of differences, called "Hamming distance", usually think the Hamming distance <3 is a highly similar text.

Algorithm implementation

The code here is quoted from the blog: http://my.oschina.net/leejun2005/blog/150086, here to express thanks.

The code implementation uses the HANLP instead of the original word breaker.

 PackageCom.emcc.changedig.extractengine.util;/*** Function:simhash To determine text similarity, the sample process supports Chinese <br/> * date:2013-8-6 am 1:11:48 <br/> *@authorJune *@version0.1*/Importjava.io.IOException;ImportJava.math.BigInteger;Importjava.util.ArrayList;ImportJava.util.HashMap;Importjava.util.List;Importcom.hankcs.hanlp.seg.common.Term; Public classsimhash{PrivateString tokens; PrivateBigInteger Intsimhash; PrivateString Strsimhash; Private intHashbits = 64;  PublicSimhash (String tokens)throwsIOException { This. Tokens =tokens;  This. Intsimhash = This. Simhash (); }     PublicSimhash (String tokens,intHashbits)throwsIOException { This. Tokens =tokens;  This. hashbits =hashbits;  This. Intsimhash = This. Simhash (); } HashMap<string, integer> wordmap =NewHashmap<string, integer>();  PublicBigInteger Simhash ()throwsIOException {//defining feature vectors/arrays        int[] v =New int[ This. Hashbits]; String Word=NULL; List<Term> terms = SEGMENTATIONUTIL.PPL ( This. Tokens);  for(term term:terms) {word=Term.word; //hash Each word as a set of fixed-length sequences. For example, an integer of 64bit.BigInteger T = This. hash (word);  for(inti = 0; I < This. hashbits; i++) {BigInteger bitmask=NewBigInteger ("1"). Shiftleft (i); //create an integer array of length 64 (assuming that you want to generate a 64-bit digital fingerprint, or another number),//for each word after the hash of the sequence to judge, if it is 1000 ... 1, then the first bit of the array and the end one plus 1,//the middle of the 62-bit minus one, that is, every 1 plus 1, every 0 minus 1. Until all the hash of the word sequence is judged complete.                if(T.and (bitmask). Signum ()! = 0)                {                    //here is a vector that calculates all the characteristics of the entire document and//The actual use here requires +-weights, such as word frequency, rather than simple +1/-1,V[i] + = 1; }                Else{V[i]-= 1; }}} BigInteger fingerprint=NewBigInteger ("0"); StringBuffer Simhashbuffer=NewStringBuffer ();  for(inti = 0; I < This. hashbits; i++)        {            //4, the last array to judge, greater than 0 of the record is 1, less than or equal to 0 of 0, get a 64bit digital fingerprint/signature.            if(V[i] >= 0) {fingerprint= Fingerprint.add (NewBigInteger ("1"). Shiftleft (i)); Simhashbuffer.append ("1"); }            Else{simhashbuffer.append ("0"); }        }         This. Strsimhash =simhashbuffer.tostring (); returnfingerprint; }    PrivateBigInteger Hash (String source) {if(Source = =NULL|| Source.length () = = 0)        {            return NewBigInteger ("0"); }        Else        {            Char[] Sourcearray =Source.tochararray (); BigInteger x= Biginteger.valueof ((Long) sourcearray[0]) << 7); BigInteger m=NewBigInteger ("1000003"); BigInteger Mask=NewBigInteger ("2"). POW ( This. hashbits). Subtract (NewBigInteger ("1"));  for(CharItem:sourcearray) {BigInteger temp= Biginteger.valueof ((Long) item); X=x.multiply (M). XOR (Temp). and (mask); } x= X.xor (NewBigInteger (string.valueof (Source.length ())); if(X.equals (NewBigInteger ("-1")) {x=NewBigInteger ("-2"); }            returnx; }    }    /*** Calculate Hamming distance * *@paramOther * By comparison value *@returnHamming distance*/      Public inthammingdistance (Simhash other) {BigInteger x= This. Intsimhash.xor (Other.intsimhash); intTot = 0;  while(X.signum ()! = 0) {tot+ = 1; X= X.and (X.subtract (NewBigInteger ("1"))); }        returntot; }     Public intgetdistance (String str1, String str2) {intdistance; if(Str1.length ()! =str2.length ()) {Distance=-1; }        Else{distance= 0;  for(inti = 0; I < str1.length (); i++)            {                if(Str1.charat (i)! =Str2.charat (i)) {Distance++; }            }        }        returndistance; }     PublicList<biginteger> subbydistance (Simhash Simhash,intdistance) {        //divided into several groups to check        intNumeach = This. hashbits/(distance + 1); List<BigInteger> characters =NewArraylist<biginteger>(); StringBuffer Buffer=NewStringBuffer ();  for(inti = 0; I < This. Intsimhash.bitlength (); i++)        {            //returns True when and only if the specified bit is set            BooleanSR =simHash.intSimHash.testBit (i); if(SR) {Buffer.append ("1"); }            Else{buffer.append ("0"); }            if((i + 1)% Numeach = = 0)            {                //turn binary into BigIntegerBigInteger Eachvalue =NewBigInteger (Buffer.tostring (), 2); Buffer.delete (0, Buffer.length ());            Characters.add (Eachvalue); }        }        returncharacters; }     Public Static voidMain (string[] args)throwsIOException {String s= "The traditional hash algorithm is only responsible for mapping the original content randomly to a signature value," + "is equivalent to the pseudo-random number generation algorithm. The resulting two signatures, if equal, indicate that the original content is equal under a certain probability; "+" if unequal, no information is provided except that the original content is unequal, because even if the original content is only one byte apart, the signature generated by "+" is also likely to vary enormously. In this sense, to design a hash algorithm, "+" for similar content generated by the signature is also similar, is a more difficult task, because its signature value in addition to provide the original content is equal information, "+" can also provide additional unequal Information about the degree of variance of the original content. "; Simhash Hash1=NewSimhash (S, 64); //Delete the first sentence and add two interfering stringss = "is equivalent to the pseudo-random number generation algorithm in principle. The resulting two signatures, if equal, indicate that the original content is equal under a certain probability; "+" if unequal, no information is provided except that the original content is unequal, because even if the original content is only one byte apart, the signature generated by "+" is also likely to vary enormously. In this sense, to design a hash algorithm, "+" for similar content generated by the signature is also similar, is a more difficult task, because its signature value in addition to provide the original content is equal information, "+" interference 1 can also provide additional Information that is not equal to the degree of variance of the original content. "; Simhash Hash2=NewSimhash (S, 64); //add a sentence before the first sentence and join four interfering stringss = "The input of the imhash algorithm is a vector, and the output is an F-bit signature value." For the sake of presentation, "+" assumes that the input is a collection of features of a document, with a certain weighting for each feature. "+" Traditional 4 hash algorithm is only responsible for the original content as evenly as possible to randomly map to a signature value, "+" the principle of how big this difference 3 is equivalent to the pseudo-random number generation algorithm. The resulting two signatures, if equal, "+" means that the original content is equal to a certain probability, and if not equal, no information is provided except that the original content is unequal, "+" because even if the original content is only one byte apart, the resulting signature is also likely to vary enormously. In this sense, "+" to design a hash algorithm, similar to the resulting signature is also similar, is a more difficult task, because its signature value in addition to provide the original "+" content is equal information, interference 1 can also provide additional Unequal primitive again to interfere with the information of the degree of difference of 2 content. "; Simhash Hash3=NewSimhash (S, 64); intDIS12 =hash1.getdistance (Hash1.strsimhash, Hash2.strsimhash);        System.out.println (Hash1.strsimhash);        System.out.println (Hash2.strsimhash);        System.out.println (DIS12); System.out.println ("============================================"); intDIS13 =hash1.getdistance (Hash1.strsimhash, Hash3.strsimhash);        System.out.println (Hash1.strsimhash);        System.out.println (Hash3.strsimhash);    System.out.println (DIS13); }}

Judging content similarity using Simhash and Hamming distance

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More