Simhash: a similarity hashing algorithm

Source: Internet
Author: User

Preface

I have recently been reading Dr. Wu's book "The Beauty of Mathematics" and have taken a lot of inspiration from it. One concept it mentions is the information fingerprint. Most people who hear this term will first think of a hash mapping: any object is mapped to a single value, generally a unique number, although hash collisions cannot be ruled out. Hashing an individual object and comparing hashes to check whether two objects are identical certainly works. For measuring how similar two objects are, however, a conventional hash algorithm is not necessarily the best choice. If the entire article is hashed as one long string, accuracy is not guaranteed, because changing any single character can change the resulting value drastically. And if each part is hashed separately, what valid criterion can combine these seemingly unrelated hashes into a measure of similarity? Fortunately we stand on the shoulders of giants: an algorithm called Simhash solves exactly this problem.
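To make the sensitivity of a conventional hash concrete, here is a minimal sketch. It assumes SHA-256 as the "conventional" hash (the article does not name one), and the class and method names are hypothetical. It hashes two sentences that differ in a single character and counts how many of the 256 output bits differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class, not part of the article's code.
final class HashSensitivityDemo {

    /** SHA-256 digest of the UTF-8 bytes of the text (assumed stand-in for a conventional hash). */
    static byte[] sha256(String text) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    /** Number of bit positions at which two equal-length digests differ. */
    static int differingBits(byte[] a, byte[] b) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            count += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] h1 = sha256("the quick brown fox jumps over the lazy dog");
        byte[] h2 = sha256("the quick brown fox jumps over the lazy cog");
        System.out.println("differing bits out of 256: " + differingBits(h1, h2));
    }
}
```

With a hash of this kind, roughly half of the bits are expected to differ even for a one-character change, so the number of differing bits says nothing about how close the two texts are.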

Background of Simhash

Simhash was used early on in Google's web crawler to deduplicate web pages. In other words, Simhash computes an information fingerprint for a web page. The algorithm was proposed by Moses Charikar in 2002.

Principle of the Simhash algorithm

Simhash's design is ingenious. Here is a practical scenario to give an intuitive feel for the algorithm. Suppose we want to compare the similarity of 2 web pages. The key to the comparison is the words they contain, say t1, t2, t3 and so on. Each word has its own information fingerprint, the simplest being the hashCode value of the string. The role of Simhash is to aggregate these values into a single hash for the entire web page. The process consists of 2 major modules.

I. Expansion. First, each word's hash is reduced to a fixed number of bits by taking a remainder. For example, if the final fingerprint should be an 8-bit binary number, take the hash value modulo 2^8 = 256, which makes the later weighted additions and subtractions convenient. Then initialize an array with 8 entries, one per bit, each with a default value of 0. The steps below are the core of this first module.
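The bit-limiting step can be sketched as follows. This is only an illustration: the class name, the method names, and the sample hash value 386 are assumptions chosen so that the result is the fingerprint 10000010 used in the walkthrough, and the hash is assumed to be non-negative with at most 31 bits kept.

```java
// Hypothetical helper, not part of the article's code.
final class FingerprintBits {

    /** Keeps the lowest `bits` bits of a non-negative hash, i.e. hash mod 2^bits. */
    static int truncate(long hash, int bits) {
        return (int) (hash % (1L << bits));
    }

    /** Expands a value into an array of 0/1 entries, most significant bit first. */
    static int[] toBits(int value, int bits) {
        int[] out = new int[bits];
        for (int i = 0; i < bits; i++) {
            out[bits - 1 - i] = (value >> i) & 1;
        }
        return out;
    }

    public static void main(String[] args) {
        long wordHash = 386L; // made-up word hash, for illustration only
        int fingerprint = truncate(wordHash, 8);
        StringBuilder sb = new StringBuilder();
        for (int bit : toBits(fingerprint, 8)) {
            sb.append(bit);
        }
        System.out.println(fingerprint + " -> " + sb); // prints "130 -> 10000010"
    }
}
```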

With the hash values of t1, t2, t3 and so on each reduced to an 8-bit binary integer, we now traverse the words. Take the information fingerprint of the first word and suppose (arbitrarily) that it is 10000010. Starting from 0 in every position, the totals after this word are:

R1: bit 1, total 0 + w1
R2: bit 0, total 0 - w1
R3: bit 0, total 0 - w1
R4: bit 0, total 0 - w1
R5: bit 0, total 0 - w1
R6: bit 0, total 0 - w1
R7: bit 1, total 0 + w1
R8: bit 0, total 0 - w1

The rule is simple: going bit by bit, add the weight where the bit is 1 and subtract it where the bit is 0. Here w is the weight of the word in the web page. It can be obtained by running a word segmenter over the page and counting word frequencies, then using the frequency as the weight; by default all weights are equal to 1. The second word is then processed in the same way, except that its weight is applied on top of the totals left by t1 rather than the initial zeros.
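The add-on-1, subtract-on-0 rule can be sketched like this. The class and method names are hypothetical, and the second fingerprint (11000000 with weight 2) is made up purely to show how a second word builds on the totals of the first.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the article's code.
final class SimhashExpansion {

    /**
     * Updates the running totals for one word: +weight where the fingerprint
     * bit is 1, -weight where it is 0. totals[0] corresponds to the leftmost bit.
     */
    static void accumulate(double[] totals, int fingerprint, double weight) {
        int bits = totals.length;
        for (int i = 0; i < bits; i++) {
            int bit = (fingerprint >> (bits - 1 - i)) & 1;
            totals[i] += (bit == 1) ? weight : -weight;
        }
    }

    public static void main(String[] args) {
        double[] totals = new double[8];        // all positions start at 0
        accumulate(totals, 0b10000010, 1.0);    // first word, w1 = 1
        System.out.println(Arrays.toString(totals));
        // [1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 1.0, -1.0]
        accumulate(totals, 0b11000000, 2.0);    // made-up second word, w2 = 2
        System.out.println(Arrays.toString(totals));
        // [3.0, 1.0, -3.0, -3.0, -3.0, -3.0, -1.0, -3.0]
    }
}
```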

II. Contraction. The accumulated totals are converted back into a binary hash for the whole web page: if the total at a position is greater than 0, that bit is set to 1; otherwise it is set to 0. For example, suppose that after adding and subtracting the weights of all the words the totals are

0.1, 0.1, 0.1, 0.1, -0.8, -0.3, 0.4, 0.5

Then the final result will be

11110011
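As a small sketch of the contraction step (class and method names are hypothetical), the sign of each total decides the bit:

```java
// Hypothetical helper, not part of the article's code.
final class SimhashContraction {

    /** Maps each total to '1' if it is greater than 0, otherwise to '0'. */
    static String contract(double[] totals) {
        StringBuilder sb = new StringBuilder(totals.length);
        for (double total : totals) {
            sb.append(total > 0 ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[] totals = {0.1, 0.1, 0.1, 0.1, -0.8, -0.3, 0.4, 0.5};
        System.out.println(contract(totals)); // prints 11110011
    }
}
```

Note that a total of exactly 0 maps to 0 under this rule, the same as a negative total.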

The final similarity comparison counts the positions at which the two final hashes hold the same value. If 2 pages are identical, their similarity hashes must be identical; and if they differ only in a very few low-weight words, their similarity hashes may still be identical.
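One way to sketch this comparison, assuming each fingerprint is stored in the low bits of an int (the names below are hypothetical): XOR the two fingerprints, count the 1 bits to get the number of differing positions, and report the fraction of positions that agree.

```java
// Hypothetical helper, not part of the article's code.
final class SimhashCompare {

    /** Number of bit positions at which the two fingerprints differ (Hamming distance). */
    static int hammingDistance(int a, int b) {
        return Integer.bitCount(a ^ b);
    }

    /** Fraction of the `bits` positions at which the two fingerprints agree. */
    static double similarity(int a, int b, int bits) {
        return (bits - hammingDistance(a, b)) / (double) bits;
    }

    public static void main(String[] args) {
        int page1 = 0b11110011;
        int page2 = 0b11010001; // made-up fingerprint differing in 2 positions
        System.out.println(similarity(page1, page2, 8)); // prints 0.75
    }
}
```

A threshold on this fraction (or equivalently on the Hamming distance) then decides whether two pages count as similar.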

Code implementation of the algorithm

The main implementation class of the algorithm is given below. The complete code and the test data are at: https://github.com/linyiqun/lyq-algorithms-lib/tree/master/Simhash

package Simhash;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Note: the News class (word frequency statistics) is defined elsewhere in the
// linked repository, and ICTCLAS50 comes from the ICTCLAS word segmenter's
// Java binding. Neither is shown here.

/**
 * Similar hash algorithm tool class
 *
 * @author lyq
 */
public class SimHashTool {
    // Number of bits in the binary hash
    private int hashBitNum;
    // Minimum fraction of matching bits for two articles to count as similar
    private double minSupportValue;

    public SimHashTool(int hashBitNum, double minSupportValue) {
        this.hashBitNum = hashBitNum;
        this.minSupportValue = minSupportValue;
    }

    /**
     * Compare article similarity
     *
     * @param newsPath1
     *            path of article 1
     * @param newsPath2
     *            path of article 2
     */
    public void compareArticals(String newsPath1, String newsPath2) {
        String content1;
        String content2;
        int sameNum;
        int[] hashArray1;
        int[] hashArray2;

        // Read the word segmentation results
        content1 = readDataFile(newsPath1);
        content2 = readDataFile(newsPath2);
        hashArray1 = calSimHashValue(content1);
        hashArray2 = calSimHashValue(content2);

        // Count the positions at which the two hashes agree
        sameNum = 0;
        for (int i = 0; i < hashBitNum; i++) {
            if (hashArray1[i] == hashArray2[i]) {
                sameNum++;
            }
        }

        // Compare with the minimum threshold
        if (sameNum > this.hashBitNum * this.minSupportValue) {
            System.out.println(String.format(
                    "The similarity is %s, exceeding the threshold of %s, so news 1 is similar to news 2",
                    sameNum * 1.0 / hashBitNum, minSupportValue));
        } else {
            System.out.println(String.format(
                    "The similarity is %s, less than the threshold %s, so news 1 is not similar to news 2",
                    sameNum * 1.0 / hashBitNum, minSupportValue));
        }
    }

    /**
     * Calculate the similar hash value of a text
     *
     * @param content
     *            news content data (segmented words in "word/tag" form)
     * @return the hash as an array of 0/1 values, most significant bit first
     */
    private int[] calSimHashValue(String content) {
        int index;
        long hashValue;
        double weight;
        int[] binaryArray;
        int[] resultValue;
        double[] hashArray;
        String w;
        String[] words;
        News news;

        news = new News(content);
        news.statWords();
        hashArray = new double[hashBitNum];
        resultValue = new int[hashBitNum];

        words = content.split(" ");
        for (String str : words) {
            index = str.indexOf('/');
            if (index == -1) {
                continue;
            }
            w = str.substring(0, index);

            // Get the weight of the word from its frequency
            weight = news.getWordFrequentValue(w);
            if (weight == -1) {
                continue;
            }
            // Compute the hash value of the word
            hashValue = BKDRHash(w);
            // Take the remainder so that only n bits are kept
            hashValue %= Math.pow(2, hashBitNum);

            // Convert to binary form
            binaryArray = new int[hashBitNum];
            numToBinaryArray(binaryArray, (int) hashValue);

            for (int i = 0; i < binaryArray.length; i++) {
                if (binaryArray[i] == 1) {
                    // Bit is 1: add the weight
                    hashArray[i] += weight;
                } else {
                    // Bit is 0: subtract the weight
                    hashArray[i] -= weight;
                }
            }
        }

        // Contraction: turn the totals back into binary form according to their sign
        for (int i = 0; i < hashArray.length; i++) {
            if (hashArray[i] > 0) {
                resultValue[i] = 1;
            } else {
                resultValue[i] = 0;
            }
        }

        return resultValue;
    }

    /**
     * Convert a number into binary form
     *
     * @param binaryArray
     *            array that receives the binary digits
     * @param num
     *            number to convert
     */
    private void numToBinaryArray(int[] binaryArray, int num) {
        int index = 0;
        int temp = 0;
        while (num != 0) {
            binaryArray[index] = num % 2;
            index++;
            num /= 2;
        }

        // Reverse the array so that the most significant bit comes first
        for (int i = 0; i < binaryArray.length / 2; i++) {
            temp = binaryArray[i];
            binaryArray[i] = binaryArray[binaryArray.length - 1 - i];
            binaryArray[binaryArray.length - 1 - i] = temp;
        }
    }

    /**
     * BKDR string hash algorithm
     *
     * @param str
     * @return
     */
    public static long BKDRHash(String str) {
        int seed = 31; /* 31 131 1313 13131 131313 etc.. */
        long hash = 0;
        int i = 0;

        for (i = 0; i < str.length(); i++) {
            hash = (hash * seed) + (str.charAt(i));
        }

        hash = Math.abs(hash);
        return hash;
    }

    /**
     * Read data from a file
     */
    private String readDataFile(String filePath) {
        File file = new File(filePath);
        // Created before the try block so that a read failure yields an empty string
        StringBuilder strBuilder = new StringBuilder();

        try {
            BufferedReader in = new BufferedReader(new FileReader(file));
            String str;
            while ((str = in.readLine()) != null) {
                strBuilder.append(str);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        return strBuilder.toString();
    }

    /**
     * Segment the news content into words using the word segmenter
     *
     * @param srcPath
     *            news file path
     */
    private void parseNewsContent(String srcPath) {
        int index;
        String dirApi;
        String desPath;

        dirApi = System.getProperty("user.dir") + "\\lib";
        // Assemble the output path
        index = srcPath.indexOf('.');
        desPath = srcPath.substring(0, index) + "-split.txt";

        try {
            ICTCLAS50 testICTCLAS50 = new ICTCLAS50();
            // Initialize with the path of the libraries the segmenter needs
            if (testICTCLAS50.ICTCLAS_Init(dirApi.getBytes("GB2312")) == false) {
                System.out.println("Init Fail!");
                return;
            }
            // Convert the input file name from String to byte[]
            byte[] inputFileNameb = srcPath.getBytes();
            // Output file name after segmentation, also converted to byte[]
            byte[] outputFileNameb = desPath.getBytes();
            // Segment the file (1st parameter: input file name; 2nd: file encoding type;
            // 3rd: whether to tag parts of speech, 1 yes, 0 no; 4th: output file name)
            testICTCLAS50.ICTCLAS_FileProcess(inputFileNameb, 0, 1, outputFileNameb);
            // Exit the segmenter
            testICTCLAS50.ICTCLAS_Exit();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

Result output:

The similarity is 0.75, exceeding the threshold of 0.5, so news 1 is similar to news 2
The similarity is 0.875, exceeding the threshold of 0.5, so news 1 is similar to news 2

References

Baidu Encyclopedia

"The Beauty of Mathematics", Second Edition, by Dr. Wu

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
