A Systematic Study of Hash Algorithms

Please credit the source when reprinting.

Preface:

This "systematic study of hash algorithms" series grew out of the article "11, thoroughly parse hash table algorithm from beginning to end". That article neither analyzes why the Blizzard hash is fast nor measures how much faster it is than other hash methods. Since my earlier study of the MonetDB database also touched on hash joins, I decided to implement a simple hash and compare it with the Blizzard hash. While searching for material I came across the article "String hash function evaluation", which I found very interesting. Referring to the source code in that article, I wrote my own experimental test program and found a problem, which is explained in detail below.


One, a summary of the basics.

There are three questions to consider when preparing to implement a hash algorithm:

First: the choice of hash function.

Second: how to resolve hash collisions.

Third: the choice of load factor. The load factor is a = n/m, where m is the number of buckets in the hash table and n is the number of keys. The larger the load factor, the more severe the hash collisions.


On the first point, you generally do not need to design a hash function yourself; it is enough to find an existing one that suits your needs. The design of hash functions is therefore not considered here (at least, that is my current understanding).

On the second point, textbooks describe a number of methods. The one seen most often in practice is separate chaining (the "zipper method"); the Blizzard hash should probably be classified as linear probing with rehashing. There is also rehashing proper, which is presumably what cuckoo hashing uses. Which is better depends on the situation; they will be implemented, compared, and analyzed later.

On the third point, as I recall from a Java textbook, the best range is said to be 0.75 to 0.8. I am not sure of this, but the data structures textbook gives a mathematical derivation relating the performance of hashing to the load factor a, which I will study in detail in a follow-up.

Two, evaluating string hash functions.

(Note: this part draws on the article "String hash function evaluation" from Liu's column, but it also contains my own thinking, and my test results differ from the original in places.)

Hash lookup is well known for its O(1) search performance and is widely used wherever lookups must be fast. The basic idea is:

(1) Create a fixed-length linear hash table, whose length can usually be specified at initialization;

(2) Design a hash function that scatters the keys over the hash table. The hash function is the most important part: uniform distribution and a low collision probability both depend on it;

(3) Hash collisions are usually resolved with the zipper method (separate chaining): keys that hash to the same slot are stored in a linked list, also known as a bucket;

(4) Given a key, the target can be located in O(1) + O(m) time, where m is the chain length, that is, the bucket depth.
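
The four steps above can be sketched in C roughly as follows. This is a minimal illustration and not the test program used in this article: the names `Node`, `Table`, `bucket_of`, `table_find`, `table_insert` and the tiny bucket count `NBUCKETS` are made up for the example, and the BKDR-style hash with seed 131 is just one possible choice of hash function.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 8  /* hypothetical fixed table length, kept small for the demo */

/* One node of a bucket's linked list (the "zipper"). */
typedef struct Node {
    char *key;
    struct Node *next;
} Node;

/* Step 1: a fixed-length table of bucket heads. */
typedef struct Table {
    Node *buckets[NBUCKETS];
} Table;

/* Step 2: a hash function that scatters keys over the buckets
 * (BKDR-style with seed 131, chosen only for illustration). */
static unsigned int bucket_of(const char *key)
{
    unsigned int hash = 0;
    while (*key)
        hash = hash * 131 + (unsigned char)*key++;
    return (hash & 0x7FFFFFFF) % NBUCKETS;
}

/* Step 4: O(1) to find the bucket, then O(m) to walk its chain. */
static Node *table_find(const Table *t, const char *key)
{
    Node *p = t->buckets[bucket_of(key)];
    while (p != NULL && strcmp(p->key, key) != 0)
        p = p->next;
    return p;
}

/* Step 3: keys that collide are chained at the head of the same bucket.
 * Returns 1 if inserted, 0 if already present, -1 if allocation fails. */
static int table_insert(Table *t, const char *key)
{
    unsigned int b = bucket_of(key);
    size_t len = strlen(key) + 1;
    Node *n;

    if (table_find(t, key) != NULL)
        return 0;
    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->key = malloc(len);
    if (n->key == NULL) {
        free(n);
        return -1;
    }
    memcpy(n->key, key, len);
    n->next = t->buckets[b];
    t->buckets[b] = n;
    return 1;
}
```

A caller would start from a zero-initialized `Table` (all bucket heads NULL), insert keys with `table_insert`, and look them up with `table_find`, which returns NULL for a key that is not present.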

In hash applications, strings are the most common kind of key, and nearly every current programming language provides hash tables keyed by strings. There are many string hash functions; the common ones include Simple_hash, Rs_hash, Js_hash, Pjw_hash, Elf_hash, Bkdr_hash, Sdbm_hash, Djb_hash, Ap_hash, Crc_hash, and so on. Their C implementations are listed in the source code below (HashFunction.c, HashTests.c). So which of these string hash functions is the best? Hash functions are evaluated here on the following two indicators:

(1) Distribution of hashes

That is, the bucket usage: backet_usage = (number of buckets used) / (total number of buckets). The higher the ratio, the better the distribution and the better the hash design.

(2) Average bucket length

That is, avg_backet_len, the average length of all used buckets. Ideally this value is 1; the smaller it is, the fewer collisions occur and the better the hash design.
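
As a small illustration of the two indicators, the sketch below computes them from an array holding the chain length of each bucket. The function names follow the indicator names used above, but the array-based interface is an assumption made for this example and is not how the test program further down stores its data.

```c
#include <stddef.h>

/* Percentage of buckets that hold at least one key. */
static double backet_usage(const unsigned int *chain_len, size_t nbuckets)
{
    size_t used = 0, i;

    for (i = 0; i < nbuckets; i++)
        if (chain_len[i] > 0)
            used++;
    return nbuckets ? 100.0 * used / nbuckets : 0.0;
}

/* Average chain length, taken over the used buckets only. */
static double avg_backet_len(const unsigned int *chain_len, size_t nbuckets)
{
    size_t used = 0, i;
    unsigned long total = 0;

    for (i = 0; i < nbuckets; i++) {
        if (chain_len[i] > 0) {
            used++;
            total += chain_len[i];
        }
    }
    return used ? (double)total / used : 0.0;
}
```

For example, four buckets with chain lengths 2, 0, 1, 0 give a usage of 50% and an average length of 1.5, since three keys share two used buckets.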

Hash functions are generally very cheap to compute, so they differ little in computational cost, and that aspect is not compared here.


Design of the evaluation scheme:

Step one: randomly generate 1000 strings, each with a length of 10, and write them to a file test.txt. This file is the input for building the hash table in the next step. (The number of generated strings is up to you.)

Step two: run a hashing simulation with each of the string hash functions mentioned above. (Note: Crc_hash has not been adapted yet and is commented out in the code.)

Step three: output the statistics, and evaluate and analyze the two indicators, hash distribution and average bucket length. (Could variance and standard deviation be used for the evaluation instead? That is an open question for now.)


The results of the experiment are as follows.

The columns of the table correspond to the values printed by the following statements; see the comments:

printf("bucket_len = %u\n", pHashTable->mBucketLen);       // number of buckets in the hash table
printf("hit_count = %d\n", hit_count);                     // number of keys hashed into the table (repeats included)
printf("bucket conflict count = %d\n", conflict_count);    // number of buckets with collisions
printf("longest hash entry = %u\n", max_link);             // length of the longest chain
printf("average hash entry length = %.2f\n", avg_link);    // average length of the chains
printf("bucket usage = %.2f%%\n", backet_usage);           // percentage of buckets in use

Hash_function_name  Bucketcount  Bucket_len  Hit_count  Bucket conflict count  Longest hash entry  Average hash entry length  Bucket usage  String count
Simple_hash         1000         1000        1000       264                    5                   1.59                       62.8%         1000
Rs_hash             1000         1000        1000       259                    5                   1.58                       63.2%         1000
Js_hash             1000         1000        1000       267                    5                   1.59                       62.9%         1000
Pjw_hash            1000         1000        1000       124                    18                  8                          12.5%         1000
Elf_hash            1000         1000        1000       124                    18                  8                          12.5%         1000
Bkdr_hash           1000         1000        1000       267                    5                   1.56                       63.9%         1000
Sdbm_hash           1000         1000        1000       274                    5                   1.59                       62.7%         1000
Djb_hash            1000         1000        1000       270                    6                   1.57                       63.5%         1000
Ap_hash             1000         1000        1000       271                    6                   1.60                       62.5%         1000

The load factor used for the results above is 1 (1000 strings in 1000 buckets). The hash functions can also be evaluated with a smaller load factor.
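
As a sanity check on these numbers, consider an idealized hash that throws n keys uniformly at random into m buckets. A given bucket stays empty with probability (1 - 1/m)^n, so the expected bucket usage is 1 - (1 - 1/m)^n, and the average length of the used buckets is approximately n divided by the expected number of used buckets. The sketch below evaluates both quantities under that assumption; the function names are hypothetical and m is assumed to be greater than 0.

```c
/* Expected fraction of non-empty buckets when n keys are placed
 * uniformly at random into m buckets: 1 - (1 - 1/m)^n.
 * Assumes m > 0. */
static double expected_usage(unsigned int n, unsigned int m)
{
    double p_empty = 1.0;
    unsigned int i;

    for (i = 0; i < n; i++)
        p_empty *= 1.0 - 1.0 / m;
    return 1.0 - p_empty;
}

/* Approximate average chain length over the used buckets:
 * n keys divided by the expected number of used buckets. */
static double expected_avg_len(unsigned int n, unsigned int m)
{
    return n / (m * expected_usage(n, m));
}
```

For n = m = 1000 this gives a usage of roughly 63.2% and an average length of about 1.58, which is in line with every row of the table except Pjw_hash and Elf_hash. In other words, at load factor 1 those seven functions behave about as well as a uniformly random assignment would.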

In these results, Pjw_hash and Elf_hash perform very poorly.
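
One plausible explanation, not stated with the results and offered only as a reading of the test code: the test program hashes each line exactly as fgets returns it, so every key ends in '\n' (0x0A). In both Pjw_hash and Elf_hash each step shifts the hash left by 4 bits and adds the next character, and the high-bit folding only touches bits 4 and above, so the low 4 bits of the result are simply the low 4 bits of the last character. With 1000 buckets (a multiple of 8), hash % 1000 then always has the same remainder modulo 8, so only 1000 / 8 = 125 buckets (12.5%) can ever be reached. The sketch below lets you check this for Elf_hash; it assumes a 32-bit unsigned int, and `bucket_index` is a hypothetical helper.

```c
/* ELF hash, written as in this article's HashFunction.c
 * (assumes unsigned int is 32 bits wide). */
static unsigned int elf_hash(const char *str)
{
    unsigned int hash = 0;
    unsigned int x = 0;

    while (*str) {
        hash = (hash << 4) + (unsigned char)*str++;
        if ((x = hash & 0xF0000000u) != 0) {
            hash ^= (x >> 24);  /* folds the top nibble into bits 4..7 */
            hash &= ~x;         /* clears the top nibble */
        }
    }
    return hash & 0x7FFFFFFF;
}

/* Bucket index for a table of 1000 buckets, as in the experiment. */
static unsigned int bucket_index(const char *key)
{
    return elf_hash(key) % 1000;
}
```

Under this reading, any key ending in '\n' has a hash whose low 4 bits are 0xA, so its bucket index is always congruent to 2 modulo 8. Stripping the trailing newline before hashing, or using a bucket count that is not a multiple of 8, would be two ways to test whether this is really the cause.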

The origins of these hash functions are interesting and remain to be dug into. If any reader knows where these functions come from, please let me know, thank you.

Also, if you disagree with the results above, please leave a comment, thank you.


Here is the source code for the experiment:

The first part: the code that randomly generates the strings. (Note: this part is also adapted from someone's code found online. I cannot find the source at the moment; if anyone recognizes it, please let me know and I will add the citation, out of respect for other people's work.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <time.h>

#define STRINGSIZE 10      // buffer size per string, matching text[10] below
#define STRINGCOUNT 1000

// If this function is called repeatedly in a loop, seeding inside it is
// ineffective: it does use the system time to initialize the random number
// generator, but the program runs so fast that all 1000 iterations may fall
// within the same second, so time() returns the same timestamp every time.
/*
void get_rand_str(char s[], int num)
{
    // table of characters the random string is drawn from
    char *str = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ";
    int i, lstr;
    lstr = strlen(str);                         // length of the character table
    srand((unsigned int)time((time_t *)NULL));  // seed the generator with the system time
    for (i = 0; i < num - 2; i++)               // build a string of the requested size
    {
        s[i] = str[(rand() % lstr)];
    }
    s[i++] = '\n';
    s[i] = '\0';
    printf("%s", s);
}
*/

int main()
{
    FILE *fp1;        // file stream pointer for the output file
    char text[10];    // buffer for one generated string
    int i = 0, j = 0, lstr;
    char *str = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ";

    lstr = strlen(str);                         // length of the character table
    fp1 = fopen("D:\\test.txt", "w");           // open the file for writing
    if (fp1 == NULL)
    {
        printf("open D:\\test.txt failed\n");
        return -1;
    }
    srand((unsigned int)time((time_t *)NULL));  // seed once, outside the loops
    for (j = 0; j < STRINGCOUNT; j++)
    {
        for (i = 0; i < STRINGSIZE - 2; i++)    // build a string of the requested size
        {
            text[i] = str[(rand() % lstr)];
        }
        text[i++] = '\n';
        text[i] = '\0';
        fputs(text, fp1);                       // write the string to the file
    }
    fclose(fp1);                                // whatever is opened must be closed
    return 0;
}


The second part: the string hash function evaluation code:

HashFunction.c

#include <stdio.h>
#include <string.h>
#include "hashTest.h"

/* A simple Hash Function */
unsigned int simple_hash(char *str)
{
    register unsigned int hash;
    register unsigned char *p;

    for (hash = 0, p = (unsigned char *)str; *p; p++)
        hash = 31 * hash + *p;

    return (hash & 0x7FFFFFFF);
}

/* RS Hash Function */
unsigned int rs_hash(char *str)
{
    unsigned int b = 378551;
    unsigned int a = 63689;
    unsigned int hash = 0;

    while (*str) {
        hash = hash * a + (*str++);
        a *= b;
    }
    return (hash & 0x7FFFFFFF);
}

/* JS Hash Function */
unsigned int js_hash(char *str)
{
    unsigned int hash = 1315423911;

    while (*str) {
        hash ^= ((hash << 5) + (*str++) + (hash >> 2));
    }
    return (hash & 0x7FFFFFFF);
}

/* P. J. Weinberger Hash Function */
unsigned int pjw_hash(char *str)
{
    unsigned int BitsInUnignedInt = (unsigned int)(sizeof(unsigned int) * 8);
    unsigned int ThreeQuarters = (unsigned int)((BitsInUnignedInt * 3) / 4);
    unsigned int OneEighth = (unsigned int)(BitsInUnignedInt / 8);
    unsigned int HighBits = (unsigned int)(0xFFFFFFFF) << (BitsInUnignedInt - OneEighth);
    unsigned int hash = 0;
    unsigned int test = 0;

    while (*str) {
        hash = (hash << OneEighth) + (*str++);
        if ((test = hash & HighBits) != 0) {
            hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
        }
    }
    return (hash & 0x7FFFFFFF);
}

/* ELF Hash Function */
unsigned int elf_hash(char *str)
{
    unsigned int hash = 0;
    unsigned int x = 0;

    while (*str) {
        hash = (hash << 4) + (*str++);
        if ((x = hash & 0xF0000000L) != 0) {
            hash ^= (x >> 24);
            hash &= ~x;
        }
    }
    return (hash & 0x7FFFFFFF);
}

/* BKDR Hash Function */
unsigned int bkdr_hash(char *str)
{
    unsigned int seed = 131; // 31 131 1313 13131 131313 etc...
    unsigned int hash = 0;

    while (*str) {
        hash = hash * seed + (*str++);
    }
    return (hash & 0x7FFFFFFF);
}

/* SDBM Hash Function */
unsigned int sdbm_hash(char *str)
{
    unsigned int hash = 0;

    while (*str) {
        hash = (*str++) + (hash << 6) + (hash << 16) - hash;
    }
    return (hash & 0x7FFFFFFF);
}

/* DJB Hash Function */
unsigned int djb_hash(char *str)
{
    unsigned int hash = 5381;

    while (*str) {
        hash += (hash << 5) + (*str++);
    }
    return (hash & 0x7FFFFFFF);
}

/* AP Hash Function */
unsigned int ap_hash(char *str)
{
    unsigned int hash = 0;
    int i;

    for (i = 0; *str; i++) {
        if ((i & 1) == 0) {
            hash ^= ((hash << 7) ^ (*str++) ^ (hash >> 3));
        } else {
            hash ^= (~((hash << 11) ^ (*str++) ^ (hash >> 5)));
        }
    }
    return (hash & 0x7FFFFFFF);
}

/* CRC Hash Function */
/*
unsigned int crc_hash(char *str)
{
    unsigned int nleft = strlen(str);
    unsigned long long sum = 0;
    unsigned short int *w = (unsigned short int *)str;
    unsigned short int answer = 0;

    // Our algorithm is simple, using a 32 bit accumulator (sum), we add
    // sequential 16 bit words to it, and at the end, fold back all the
    // carry bits from the top 16 bits into the lower 16 bits.
    while (nleft > 1) {
        sum += *w++;
        nleft -= 2;
    }
    // mop up an odd byte, if necessary
    if (1 == nleft) {
        *(unsigned char *)(&answer) = *(unsigned char *)w;
        sum += answer;
    }
    // add back carry outs from top 16 bits to low 16 bits
    // add hi 16 to low 16
    sum = (sum >> 16) + (sum & 0xFFFF);
    // add carry
    sum += (sum >> 16);
    // truncate to 16 bits
    answer = ~sum;
    return (answer & 0xFFFFFFFF);
}
*/


HashTests.c

#include <stdio.h>
#include <stdlib.h>
/// not sure where these header files are used
//#include <sys/types.h>
//#include <sys/stat.h>
//#include <fcntl.h>
//#include <errno.h>
#include <string.h>
#include "hashTest.h"
//#include "md5.h"

#define STRING_LEN 255

/// one atom of the chain when building the hash table
struct atomOfBucketChain {
    char *pKey;
    struct atomOfBucketChain *pNext;
};

struct chainOfHashTable {
    unsigned int mHitCount;
    unsigned int mEntryCount;
    struct atomOfBucketChain *pKeys;
};

struct HashTable {
    unsigned int mBucketLen;
    struct chainOfHashTable *pTable;
};

unsigned int (*pHashFunc)(char *str);

/// choose which hash function to be used
void chooseHashFunc(char *pHashFuncName)
{
    if (0 == strcmp(pHashFuncName, "Simple_hash"))
        pHashFunc = simple_hash;
    else if (0 == strcmp(pHashFuncName, "Rs_hash"))
        pHashFunc = rs_hash;
    else if (0 == strcmp(pHashFuncName, "Js_hash"))
        pHashFunc = js_hash;
    else if (0 == strcmp(pHashFuncName, "Pjw_hash"))
        pHashFunc = pjw_hash;
    else if (0 == strcmp(pHashFuncName, "Elf_hash"))
        pHashFunc = elf_hash;
    else if (0 == strcmp(pHashFuncName, "Bkdr_hash"))
        pHashFunc = bkdr_hash;
    else if (0 == strcmp(pHashFuncName, "Sdbm_hash"))
        pHashFunc = sdbm_hash;
    else if (0 == strcmp(pHashFuncName, "Djb_hash"))
        pHashFunc = djb_hash;
    else if (0 == strcmp(pHashFuncName, "Ap_hash"))
        pHashFunc = ap_hash;
    // else if (0 == strcmp(pHashFuncName, "Crc_hash"))
    //     pHashFunc = crc_hash;
    else
        pHashFunc = NULL;
}

/// build the hash table
void buildHashTable(char *pKey, struct HashTable *pHashTable)
{
    unsigned int mHashValue = pHashFunc(pKey) % pHashTable->mBucketLen;
    struct atomOfBucketChain *p = NULL;

    p = pHashTable->pTable[mHashValue].pKeys;
    while (p) {
        if (0 == strcmp(pKey, p->pKey)) {
            break;
        }
        p = p->pNext;
    }
    if (p == NULL) {
        p = (struct atomOfBucketChain *)malloc(sizeof(struct atomOfBucketChain));
        if (p == NULL) {
            printf("malloc in buildHashTable failed");
            return; /// must return here, otherwise the failure would not stop
        }
        p->pKey = strdup(pKey);
        p->pNext = pHashTable->pTable[mHashValue].pKeys;
        pHashTable->pTable[mHashValue].pKeys = p;
        pHashTable->pTable[mHashValue].mEntryCount++;
    }
    pHashTable->pTable[mHashValue].mHitCount++;
}

/// initialize the hash table
void hashTableInit(struct HashTable *pHashTable)
{
    unsigned int i;

    if ((NULL == pHashTable) || (NULL == pHashTable->pTable)) {
        printf("hashTableInit: malloc pHashTable or pTable failed");
        return;
    }
    for (i = 0; i < pHashTable->mBucketLen; i++) {
        pHashTable->pTable[i].mHitCount = 0;
        pHashTable->pTable[i].mEntryCount = 0;
        pHashTable->pTable[i].pKeys = NULL;
    }
}

/// free the space used by the hash table
void freeHashTable(struct HashTable *pHashTable)
{
    unsigned int i;
    struct atomOfBucketChain *pFront, *pBack;

    if ((NULL == pHashTable) || (NULL == pHashTable->pTable)) {
        printf("hash table has been freed");
        return;
    }
    for (i = 0; i < pHashTable->mBucketLen; i++) {
        pFront = pHashTable->pTable[i].pKeys;
        while (pFront) {
            pBack = pFront->pNext;
            if (pFront->pKey)
                free(pFront->pKey);
            free(pFront);
            pFront = pBack;
        }
    }
    free(pHashTable->pTable);
}

/// show the statistics
void showTestsResult(struct HashTable *pHashTable)
{
    int backet = 0, sum = 0;
    unsigned i = 0, max_link = 0;
    int conflict_count = 0, hit_count = 0;
    double avg_link, backet_usage;

    for (i = 0; i < pHashTable->mBucketLen; i++) {
        if (pHashTable->pTable[i].mHitCount > 0) {
            backet++;
            sum += pHashTable->pTable[i].mEntryCount;
            if (pHashTable->pTable[i].mEntryCount > max_link) {
                max_link = pHashTable->pTable[i].mEntryCount;
            }
            if (pHashTable->pTable[i].mEntryCount > 1) {
                conflict_count++;
            }
            hit_count += pHashTable->pTable[i].mHitCount;
        }
    }

    backet_usage = backet / 1.0 / pHashTable->mBucketLen * 100;
    avg_link = backet ? sum / 1.0 / backet : 0.0;

    printf("bucket_len = %u\n", pHashTable->mBucketLen);       // number of buckets in the hash table
    /// printf("hash_call_count = %d\n", hash_call_count);     // number of strings used to build the hash table
    printf("hit_count = %d\n", hit_count);                     // number of keys hashed into the table (repeats included)
    printf("bucket conflict count = %d\n", conflict_count);    // number of buckets with collisions
    printf("longest hash entry = %u\n", max_link);             // length of the longest chain
    printf("average hash entry length = %.2f\n", avg_link);    // average length of the chains
    printf("bucket usage = %.2f%%\n", backet_usage);           // percentage of buckets in use
}

void usage()
{
    printf("Usage: hash_func_name [backet_len]\n");
    printf("hash_func_name:\n");
    printf("\tSimple_hash\n");
    printf("\tRs_hash\n");
    printf("\tJs_hash\n");
    printf("\tPjw_hash\n");
    printf("\tElf_hash\n");
    printf("\tBkdr_hash\n");
    printf("\tSdbm_hash\n");
    printf("\tDjb_hash\n");
    printf("\tAp_hash\n");
    printf("\tCrc_hash\n");
}

int main(int argc, char *argv[])
{
    FILE *fp;
    int mStringCount = 0;
    char pKey[10];
    struct HashTable *pHashTable = NULL;
    // parameter input; the buffers must be large enough for names such as "Simple_hash"
    char hashFunctionName[32], bucketCount[32];

    printf("input hashFunctionName\n");
    if (fgets(hashFunctionName, sizeof(hashFunctionName), stdin) == NULL)
        return -1;
    hashFunctionName[strcspn(hashFunctionName, "\n")] = '\0';
    printf("input bucketCount\n");
    if (fgets(bucketCount, sizeof(bucketCount), stdin) == NULL)
        return -1;

    pHashTable = (struct HashTable *)malloc(sizeof(struct HashTable));
    if (NULL == pHashTable) {
        printf("malloc hash table failed");
        return -1;
    }

    /*
    if (argc <= 1) {
        usage();
        return -1;
    }
    if (2 == argc) {
        usage();
    }
    */
    // pHashTable->mBucketLen = atoi(argv[1]);
    if (atoi(bucketCount) <= 0) {
        printf("bucketCount must be a positive number");
        free(pHashTable);
        return -1;
    }
    pHashTable->mBucketLen = atoi(bucketCount);
    pHashTable->pTable = (struct chainOfHashTable *)malloc(sizeof(struct chainOfHashTable) * pHashTable->mBucketLen);
    if (NULL == pHashTable->pTable) {
        printf("malloc pTable failed");
        free(pHashTable);
        return -1;
    }
    hashTableInit(pHashTable);

    chooseHashFunc(hashFunctionName);
    if (pHashFunc == NULL) {
        usage();
        freeHashTable(pHashTable);
        free(pHashTable);
        return -1;
    }

    // Assumes the file has already been generated. A function that generates
    // the strings automatically and saves them to a file still needs to be added.
    if (!(fp = fopen("D:\\test.txt", "r"))) {
        printf("open source file failed");
        freeHashTable(pHashTable);
        free(pHashTable);
        return -1;
    }
    while (fgets(pKey, 10, fp) != NULL) { // read the file line by line into pKey
        mStringCount++;
        buildHashTable(pKey, pHashTable);
    }
    fclose(fp);

    showTestsResult(pHashTable);
    printf("String count: %d", mStringCount); // number of strings used to build the hash table
    freeHashTable(pHashTable);
    free(pHashTable);
    return 0;
}

Three, implementing the Blizzard hash, with a comparison against and analysis of the hash functions above.

Four, the mathematical derivation relating the load factor to the performance of the hash algorithm.

Five, implementation and discussion of the cuckoo hash algorithm.

Six, the origins of the string hash functions in part two.

July 6, 2014

To be continued .....





