ELFHash: an excellent string hashing algorithm

Source: Internet
Author: User
Tags: hash
1. String hash:
Let's start with string hashing. We often have to deal with a large number of strings, any of which may or may not repeat. Unlike Python, C has no built-in dictionary type, so we cannot directly use a string as a key. Instead we need a hash function that maps each string to an integer, with as few collisions as possible, so that the string is easy to store and look up. Here we introduce one such string hash algorithm.
There are many good string hash algorithms; this article focuses on ELFHash, which is relatively easy to explain.
2. ELFHash:
First, I should state that I will not prove the uniformity of the distribution that ELFHash produces. According to descriptions by other experts, ELFHash works well for both long and short strings. The following material is quoted from the experimental data of an expert named Liu:

In hash applications, strings are the most common kind of key, and nearly every modern programming language provides a hash table keyed by strings. There are many string hash functions; common ones include Simple_hash, Rs_hash, Js_hash, Pjw_hash, Elf_hash, Bkdr_hash, Sdbm_hash, Djb_hash, Ap_hash, Crc_hash and so on. Their C language implementations are shown in Appendix code: HASH.H, Hash.c. So which of these string hash functions is the best? Hash functions are evaluated against the following two indicators:

(1) Hash distribution

That is, the bucket usage: bucket_usage = (number of buckets used) / (total number of buckets). The higher the ratio, the better the distribution and the better the hash design.

(2) Average bucket length

That is, avg_bucket_len, the average length of all used buckets. Ideally this value is 1; the smaller it is, the fewer collisions occur and the better the hash design.

Hash functions are generally very cheap to compute, so their running times differ very little and are not compared here.
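The two indicators above can be computed from an array of per-bucket key counts. The following is a minimal sketch, not the code from the appendix: the names bucket_stats and summarize_buckets are hypothetical, and the maximum bucket length is included only because the results table also reports it.

```c
#include <stddef.h>

/* Summary of one bucket array; the field names are illustrative. */
struct bucket_stats {
    double usage;    /* used buckets / total buckets */
    double avg_len;  /* total keys / used buckets    */
    size_t max_len;  /* longest single bucket        */
};

/* counts[i] holds the number of keys that landed in bucket i. */
struct bucket_stats summarize_buckets(const size_t *counts, size_t nbuckets)
{
    struct bucket_stats s = {0.0, 0.0, 0};
    size_t used = 0, total = 0;

    for (size_t i = 0; i < nbuckets; i++) {
        if (counts[i] > 0)
            used++;
        total += counts[i];
        if (counts[i] > s.max_len)
            s.max_len = counts[i];
    }
    if (nbuckets > 0)
        s.usage = (double)used / (double)nbuckets;
    if (used > 0)
        s.avg_len = (double)total / (double)used;  /* averaged over used buckets only */
    return s;
}
```

For the counts {2, 0, 1, 3}, three of the four buckets are used, so the usage is 0.75 and the average length is 6 / 3 = 2.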

The evaluation scheme is designed like this:

(1) A 200 MB video file is used as the input source; the MD5 value of each 4 KB block is computed and used as a hash key;

(2) Each of the string hash functions mentioned above is used to simulate hashing these keys into buckets;

(3) The results are collected, and the two indicators, hash distribution and average bucket length, are evaluated and analyzed.

The test program is in Appendix code HASHTEST.C, and the test results are shown in the table below. The results show that these string hash functions perform about equally and are hard to rank, so in practice you can choose according to preference. Of course, it is best to test with your own data, since application characteristics differ. Several other sets of test results are similar and are not given here.
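Step (2) of the scheme can be sketched as follows. This is an illustration under assumptions, not the HASHTEST.C program: count_buckets is a hypothetical name, the keys are plain strings instead of MD5 values, and elf_hash is written out here only so the block is self-contained. The hash is passed as a function pointer so that other string hash functions could be plugged in.

```c
#include <stddef.h>

/* ELF hash, assuming a 32-bit unsigned int. */
unsigned int elf_hash(const char *str)
{
    unsigned int hash = 0, x;
    while (*str) {
        hash = (hash << 4) + (unsigned char)*str++;
        if ((x = hash & 0xf0000000u) != 0) {
            hash ^= x >> 24;
            hash &= ~x;
        }
    }
    return hash & 0x7fffffffu;
}

/* Drop every key into bucket hash(key) % nbuckets and count the hits. */
void count_buckets(unsigned int (*hash)(const char *),
                   const char *const *keys, size_t nkeys,
                   size_t *counts, size_t nbuckets)
{
    if (nbuckets == 0)
        return;                       /* nothing to count into */
    for (size_t i = 0; i < nbuckets; i++)
        counts[i] = 0;
    for (size_t i = 0; i < nkeys; i++)
        counts[hash(keys[i]) % nbuckets]++;
}
```

With 16 buckets and ASCII keys "a", "b", "a", the key "a" hashes to 97 and lands in bucket 1 twice, while "b" hashes to 98 and lands in bucket 2 once.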

Hash function   Buckets   Total hash calls   Max bucket length   Avg bucket length   Bucket utilization (%)
Simple_hash     10240     47198              16                  4.63                99
Rs_hash         10240     47198              16                  4.63                98.91
Js_hash         10240     47198              15                  4.64                98.87
Pjw_hash        10240     47198              16                  4.63                99
Elf_hash        10240     47198              16                  4.63                99
Bkdr_hash       10240     47198              16                  4.63                99
Sdbm_hash       10240     47198              16                  4.63                98.9
Djb_hash        10240     47198              15                  4.64                98.85
Ap_hash         10240     47198              16                  4.63                98.96
Crc_hash        10240     47198              16                  4.64                98.77

So in practice we can choose freely; this article uses ELFHash.
3. Principle:
First of all, we need to be clear about a few things:
1. An unsigned int has 4 bytes (32 bits) on most current platforms.
2. For XOR, 0 is the identity element: XORing a bit with 0 leaves it unchanged, and XORing a bit with 1 inverts it.
3. A right shift of an unsigned value is a logical right shift (the vacated high bits are filled with 0).
4. The core idea of ELFHash is "influence": every character should affect the final result.
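The first three points can be checked with a few small helpers. This is only an illustrative sketch; the function names are hypothetical, and the shift example assumes a 32-bit unsigned int.

```c
#include <limits.h>

/* Point 1: width of unsigned int on this platform (32 on most). */
unsigned int uint_bits(void)
{
    return (unsigned int)(sizeof(unsigned int) * CHAR_BIT);
}

/* Point 2: bits XORed with 0 are kept, bits XORed with 1 are flipped. */
unsigned int xor_with(unsigned int value, unsigned int mask)
{
    return value ^ mask;
}

/* Point 3: extract the high four bits and move them down by 24.
 * Because the operand is unsigned, zeros are shifted in from the left. */
unsigned int high_nibble_shifted(unsigned int value)
{
    return (value & 0xf0000000u) >> 24;
}
```

For example, xor_with(0xA5, 0xFF) flips all eight low bits and gives 0x5A, and high_nibble_shifted(0x6789ABC7) gives 0x60, which is exactly the value ELFHash XORs back into the low byte.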
Here is the code first:
unsigned int elfhash(const char *str)
{
	unsigned int hash = 0;
	unsigned int x = 0;
	while (*str)
	{
		// The cast avoids sign extension when char is signed.
		hash = (hash << 4) + (unsigned char)*str;   // 1
		if ((x = hash & 0xf0000000) != 0)           // 2
		{
			hash ^= (x >> 24);   // 3: fold the high four bits into bits 5-8
			hash &= ~x;          // 4: clear the high four bits
		}
		str++;                   // 5
	}
	return (hash & 0x7fffffff);  // 6
}

Explanation: Our hash result is an unsigned int, which starts as 32 zero bits: 0000 0000 0000 0000 0000 0000 0000 0000
1. hash is shifted left by 4 bits and the next character of str is added (a char has eight bits). I was skeptical about this at first: doesn't it scramble the high four bits of the byte? In fact this is our first blend, and it is deliberate. After the shift, the low four bits of the previous character sit in bits 5-8, and the high four bits of the new character are added on top of them.
2. x = hash & 0xf0000000 extracts the high four bits of the fourth (highest) byte of hash; x then serves as the mask for the second blend. ELFHash insists that every character influences the final result. Because repeated left shifts would eventually push the highest four bits out of the value, we let those four bits influence the rest of the hash first and only then let them be dropped. The effects accumulate, and repeating this many times keeps the hash uniform and prevents a large number of collisions.
3. x is shifted right by 24 bits, which moves the four bits exactly onto bits 5-8, where they are XORed into hash. This is the second blend.
4. The high four bits are cleared. Inside the loop this looks unnecessary, because the next left shift would discard these four bits anyway, but the algorithm requires it. I have a doubt here: doesn't this reduce the range of our hash?
5. str is incremented to bring in the next character for blending.
6. The value is returned with its highest (sign) bit cleared, to avoid problems if the result is later stored in a signed type.
4. Code:
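To see the two blends happen, the loop body can be pulled out into a single-step function and applied to one character at a time. This is a sketch for illustration: elf_step and elf_hash_by_steps are hypothetical names, and the values quoted below assume ASCII and a 32-bit unsigned int.

```c
/* One iteration of the ELFHash loop: steps 1 to 4 for a single character. */
unsigned int elf_step(unsigned int hash, unsigned char c)
{
    unsigned int x;

    hash = (hash << 4) + c;        /* step 1: shift and add (first blend)  */
    x = hash & 0xf0000000u;        /* step 2: isolate the high four bits   */
    if (x != 0) {
        hash ^= x >> 24;           /* step 3: fold them into bits 5-8      */
        hash &= ~x;                /* step 4: clear the high four bits     */
    }
    return hash;
}

/* The whole hash is the step applied to every character in turn. */
unsigned int elf_hash_by_steps(const char *str)
{
    unsigned int hash = 0;
    while (*str)
        hash = elf_step(hash, (unsigned char)*str++);
    return hash & 0x7fffffffu;     /* step 6 */
}
```

For the string "abcdefg", nothing is folded for the first six characters, and the hash reaches 0x6789AB6. Adding 'g' gives 0x6789ABC7, whose high four bits are 0x6; XORing 0x60 into the low byte and clearing the top gives 0x0789ABA7. The high four bits of the result are therefore always zero after each step.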
/* #include <iostream>
#include <cstdio>
#include <cstring>

using namespace std;

unsigned int a = 0x80;

int main()
{
	printf("%u\n", a >> 1);   // logical right shift of an unsigned number: prints 64
	return 0;
} */

#include <iostream>
#include <cstdio>
#include <cstring>

using namespace std;

// ELF hash of a NUL-terminated string.
unsigned int elfhash(const char *str)
{
	unsigned int hash = 0;
	unsigned int x = 0;
	while (*str)
	{
		hash = (hash << 4) + (unsigned char)*str;
		if ((x = hash & 0xf0000000) != 0)
		{
			hash ^= (x >> 24);   // fold the high four bits into bits 5-8
			hash &= ~x;          // clear the high four bits
		}
		str++;
	}
	return (hash & 0x7fffffff);
}

int main()
{
	char data[100];
	memset(data, 0, sizeof(data));
	if (scanf("%99s", data) != 1)    // read at most 99 characters into the buffer
		return 1;
	printf("%u\n", elfhash(data));   // %u because the result is unsigned
	return 0;
}

Finally, by my reasoning, the high four bits of the result are always cleared, so elfhash can produce at most 2^28 (about 268 million) distinct values rather than the full range of an unsigned int. Would removing the statement hash &= ~x enlarge the range of our hash and use the space more fully? I will ask my data structures teacher next week.
5. Application:
When we work with memory addresses, we can hash the address of a piece of data, because no two live objects share an address. One step is enough to get a text (typically hexadecimal) representation of the address:
sprintf(data, "%p", (void *)&now_data);
The first argument, data, is the character array we reserved to hold the string; the middle argument is the format in which the value is written; the last argument is the address of our data.
Using this idea, we get a new solution to the linked-list intersection problem: hash the memory address of each node, and the search can be completed in O(n) expected time.
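The linked-list idea might look like the sketch below. It is one possible reading of the paragraph above, not the author's code: struct node, struct entry, bucket_of, find_intersection and the table size NBUCKETS are all assumptions made for illustration. Each address is formatted with "%p", hashed with ELFHash, and stored in a chained table; candidate matches are confirmed by comparing the pointers themselves, so hash collisions cannot produce a wrong answer.

```c
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 1024u   /* assumed table size, not from the article */

struct node { int value; struct node *next; };

/* One chain entry per remembered node address. */
struct entry { const struct node *addr; struct entry *next; };

unsigned int elf_hash(const char *str)
{
    unsigned int hash = 0, x;
    while (*str) {
        hash = (hash << 4) + (unsigned char)*str++;
        if ((x = hash & 0xf0000000u) != 0) {
            hash ^= x >> 24;
            hash &= ~x;
        }
    }
    return hash & 0x7fffffffu;
}

/* Hash a node address through its "%p" text form. */
unsigned int bucket_of(const struct node *p)
{
    char text[64];
    snprintf(text, sizeof text, "%p", (void *)p);
    return elf_hash(text) % NBUCKETS;
}

/* Return the first node of list b that is also a node of list a, or NULL. */
struct node *find_intersection(struct node *a, struct node *b)
{
    struct entry *table[NBUCKETS] = {0};
    struct node *found = NULL;

    /* Pass 1: remember the address of every node in list a. */
    for (struct node *p = a; p != NULL; p = p->next) {
        struct entry *e = malloc(sizeof *e);
        if (e == NULL)
            abort();                  /* keeps the sketch short */
        unsigned int i = bucket_of(p);
        e->addr = p;
        e->next = table[i];
        table[i] = e;
    }

    /* Pass 2: the first node of b whose address was remembered is shared. */
    for (struct node *q = b; q != NULL && found == NULL; q = q->next)
        for (struct entry *e = table[bucket_of(q)]; e != NULL; e = e->next)
            if (e->addr == q) {
                found = q;
                break;
            }

    /* Release the table. */
    for (unsigned int i = 0; i < NBUCKETS; i++)
        while (table[i] != NULL) {
            struct entry *next = table[i]->next;
            free(table[i]);
            table[i] = next;
        }
    return found;
}
```

Each list is walked once, so with a reasonably even distribution the work is proportional to the total number of nodes. Going through a string is not strictly required, since an address can also be hashed as an integer; the text form is used here only because it matches the sprintf step described above.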
