Classic hash algorithms for strings

Source: Internet
Author: User
Tags md5 blizzard

1 Overview

Looking up an item in a list takes O(N) time. Binary search takes O(log2 N), and so does a lookup in a B+ tree. A hash table lookup, by contrast, takes O(1) time on average.

Efficient algorithms therefore often rely on hash tables, since no other structure matches their constant-time lookup. The layout of the table and the method used to resolve collisions both affect efficiency, but the hash function is the most important part of a hash table. This article analyzes the string hash functions used in several classic pieces of software in terms of execution speed, dispersion, and space utilization.

Building the fastest hash table (a dialogue with Blizzard)

Let's start with a simple problem. You have a huge array of strings, and you are given a single string and asked whether it occurs in the array and, if so, where. What would you do?

The simplest approach is to compare the string honestly against each element, from first to last, until a match is found. Anyone who has studied programming can write such a program.
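The brute-force scan described above can be sketched as follows. The name find_linear is illustrative and does not come from any of the software discussed here.

```c
#include <stddef.h>
#include <string.h>

/* Returns the index of `target` in `items`, or -1 if it is absent.
 * A lookup may compare against all `count` strings, hence O(N). */
int find_linear(const char *const items[], size_t count, const char *target)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(items[i], target) == 0)
            return (int)i;
    }
    return -1;
}
```

Every comparison is itself a character-by-character strcmp, so the cost grows with both the number of strings and their length.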

The most appropriate approach is, of course, a hash table. First, some basics. A hash is generally an integer: some algorithm "compresses" a string into an integer, and that number is called the hash. A 32-bit integer obviously cannot be mapped back to the string, but in practice the probability that two different strings produce the same hash value is very small. Here is the hash algorithm used in MPQ:

    #include <ctype.h>

    /* crypttable is a precomputed lookup table that is defined elsewhere. */
    extern unsigned long crypttable[];

    unsigned long hashstring(char *lpszfilename, unsigned long dwhashtype)
    {
        unsigned char *key = (unsigned char *)lpszfilename;
        unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;
        int ch;

        while (*key != 0)
        {
            ch = toupper(*key++);   /* the hash is case-insensitive */
            seed1 = crypttable[(dwhashtype << 8) + ch] ^ (seed1 + seed2);
            seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
        }
        return seed1;
    }

Blizzard's algorithm is very efficient and is known as a "one-way hash". For example, the string "unitneutralacritter.grp" hashes to 0xA26067F3 with this algorithm.
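The hash function above reads from a table, crypttable, that this article never shows. The sketch below fills such a table using the initializer commonly published alongside the MPQ format. That initializer and its constants are an assumption here, since they do not appear in this article. The sketch also uses uint32_t, because the algorithm relies on 32-bit wraparound and unsigned long is 64 bits wide on many current platforms. The names prepare_crypt_table and mpq_hash are illustrative.

```c
#include <ctype.h>
#include <stdint.h>

static uint32_t crypt_table[0x500];

/* Fills the table with pseudo-random 32-bit values.
 * Assumed initializer: it is not shown in this article. */
void prepare_crypt_table(void)
{
    uint32_t seed = 0x00100001;

    for (uint32_t index1 = 0; index1 < 0x100; index1++) {
        uint32_t index2 = index1;
        for (int i = 0; i < 5; i++, index2 += 0x100) {
            seed = (seed * 125 + 3) % 0x2AAAAB;
            uint32_t high = (seed & 0xFFFF) << 16;
            seed = (seed * 125 + 3) % 0x2AAAAB;
            uint32_t low = seed & 0xFFFF;
            crypt_table[index2] = high | low;
        }
    }
}

/* Same loop as hashstring, with 32-bit arithmetic made explicit.
 * hash_type must be in the range 0..4 to stay inside the table. */
uint32_t mpq_hash(const char *name, uint32_t hash_type)
{
    const unsigned char *key = (const unsigned char *)name;
    uint32_t seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;

    while (*key != 0) {
        uint32_t ch = (uint32_t)toupper(*key++);
        seed1 = crypt_table[(hash_type << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    return seed1;
}
```

prepare_crypt_table must be called once before the first call to mpq_hash. Because each character goes through toupper, names that differ only in case hash to the same value.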

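A first attempt at a lookup uses the hash value directly as the position in the table:

    #include <string.h>

    int Gethashtablepos(char *lpszstring, somestructure *lptable, int ntablesize)
    {
        /* Hash type 0 is assumed here; hashstring takes the type as its second argument. */
        unsigned long nhash = hashstring(lpszstring, 0);
        int nhashpos = (int)(nhash % ntablesize);

        if (lptable[nhashpos].bexists && !strcmp(lptable[nhashpos].pstring, lpszstring))
            return nhashpos;
        else
            return -1;  /* Error value */
    }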

At this point, I expect everyone is asking a very serious question: "What happens if two strings map to the same position in the hash table?" After all, an array has limited capacity, so this is quite likely. There are many ways to solve the problem. The first that comes to mind is a linked list, a never-failing tool that I owe to the data structures course at university; many of the algorithms I have run into can be recast in terms of linked lists. Simply hang a linked list off each entry of the hash table and store in it every string that maps to that entry, and the problem is solved.
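The linked-list idea can be sketched as below. This is a minimal illustration under stated assumptions: the bucket count, the struct and function names, and the stand-in hash (multiply by 31 and add) are all hypothetical and are not Blizzard's code.

```c
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 64   /* illustrative size */

struct chain_node {
    const char *key;            /* not copied; the caller keeps it alive */
    struct chain_node *next;
};

struct chain_table {
    struct chain_node *buckets[NUM_BUCKETS];
};

/* Stand-in hash, not the MPQ function. */
static unsigned int bucket_of(const char *s)
{
    unsigned int h = 0;
    while (*s)
        h = 31 * h + (unsigned char)*s++;
    return h % NUM_BUCKETS;
}

/* Returns 1 if key is stored in the table, 0 otherwise. */
int chain_contains(const struct chain_table *t, const char *key)
{
    const struct chain_node *n = t->buckets[bucket_of(key)];
    for (; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}

/* Adds key at the head of its bucket's list unless it is already present.
 * Returns 0 on success and -1 if memory allocation fails. */
int chain_insert(struct chain_table *t, const char *key)
{
    if (chain_contains(t, key))
        return 0;
    struct chain_node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    unsigned int b = bucket_of(key);
    n->key = key;
    n->next = t->buckets[b];
    t->buckets[b] = n;
    return 0;
}
```

Strings that collide simply share a list, so a lookup only walks the strings in one bucket instead of the whole array.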

However, Blizzard's programmers use a more ingenious method. The basic idea is that the hash table verifies a string with three hash values instead of one. Two different strings may well map to the same entry under one hash algorithm, but for them to produce identical results under three different hash algorithms is almost impossible: the odds are 1 in 1.88894659314786e+22, roughly 1 in 10^22.3, which is safe enough for a game program.

Now back to the data structure. Blizzard's hash table does not use linked lists. Instead, it resolves collisions by moving on to the next slot. Here is the algorithm:

    int Gethashtablepos(char *lpszstring, mpqhashtable *lptable, int ntablesize)
    {
        const int hash_offset = 0, hash_a = 1, hash_b = 2;
        unsigned long nhash = hashstring(lpszstring, hash_offset);
        unsigned long nhasha = hashstring(lpszstring, hash_a);
        unsigned long nhashb = hashstring(lpszstring, hash_b);
        int nhashstart = (int)(nhash % ntablesize), nhashpos = nhashstart;

        while (lptable[nhashpos].bexists)
        {
            /* Compare the two check hashes rather than the strings themselves. */
            if (lptable[nhashpos].nhasha == nhasha && lptable[nhashpos].nhashb == nhashb)
                return nhashpos;
            else
                nhashpos = (nhashpos + 1) % ntablesize;  /* try the next slot */

            if (nhashpos == nhashstart)
                break;  /* wrapped around: every slot has been checked */
        }
        return -1;  /* Error value */
    }

1. Compute three hash values for the string (one to determine the position and the other two for verification).

2. Look at that position in the hash table.

3. Is that position empty? If it is, the string does not exist; return.

4. If it is occupied, check whether the other two hash values also match. If they do, the string has been found; return.

5. Move to the next position, wrapping around to the start of the table when the end is passed.

6. Check whether we are back at the original position. If so, the string was not found; return.

7. Go back to step 3.
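The steps above only cover lookup. The sketch below adds a matching insert so that the whole scheme can be exercised end to end. It is an illustration under stated assumptions: demo_hash is a stand-in seeded hash and not the MPQ function, and TABLE_SIZE, struct slot, table_insert and table_find are hypothetical names.

```c
#include <stdint.h>

#define TABLE_SIZE 16   /* illustrative; a real table would be larger */

struct slot {
    uint32_t hash_a, hash_b;   /* the two verification hashes */
    int exists;
};

/* Stand-in seeded hash (FNV-1a style), NOT the MPQ hashstring function. */
static uint32_t demo_hash(const char *s, uint32_t type)
{
    uint32_t h = 2166136261u ^ (type * 0x9E3779B9u);
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Stores the key's two check hashes in the first free slot of its
 * probe sequence. Returns the slot index, or -1 if the table is full.
 * Duplicate keys are not detected in this sketch. */
int table_insert(struct slot *table, const char *key)
{
    uint32_t start = demo_hash(key, 0) % TABLE_SIZE, pos = start;

    while (table[pos].exists) {
        pos = (pos + 1) % TABLE_SIZE;
        if (pos == start)
            return -1;
    }
    table[pos].hash_a = demo_hash(key, 1);
    table[pos].hash_b = demo_hash(key, 2);
    table[pos].exists = 1;
    return (int)pos;
}

/* Follows steps 1 to 7: probe from the start slot until an empty slot,
 * a matching pair of check hashes, or a full wrap-around. */
int table_find(const struct slot *table, const char *key)
{
    uint32_t a = demo_hash(key, 1), b = demo_hash(key, 2);
    uint32_t start = demo_hash(key, 0) % TABLE_SIZE, pos = start;

    while (table[pos].exists) {
        if (table[pos].hash_a == a && table[pos].hash_b == b)
            return (int)pos;
        pos = (pos + 1) % TABLE_SIZE;
        if (pos == start)
            break;
    }
    return -1;
}
```

Note that the table stores no strings at all, only the two check hashes. That keeps each slot small and every comparison a pair of integer tests, at the cost of the tiny false-match probability discussed earlier.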

/////////////////////////////////////////////////////////

Some of the other hash functions:

/////////////////////////////////////////////////////////

2 Introduction to classic string hash functions

The author has read a large amount of source code from classic software. The following describes string hash functions that appear in some of it.

2.1 The string hash function that appears in PHP

    static unsigned long hashpjw(char *arkey, unsigned int nkeylength)
    {
        unsigned long h = 0, g;
        char *arend = arkey + nkeylength;

        while (arkey < arend) {
            h = (h << 4) + *arkey++;
            if ((g = (h & 0xF0000000))) {
                /* fold the top four bits back into the low bits, then clear them */
                h = h ^ (g >> 24);
                h = h ^ g;
            }
        }
        return h;
    }

2.2 The string hash functions that appear in OpenSSL

    /* Older variant; it is commented out in the OpenSSL source. */
    unsigned long lh_strhash(char *str)
    {
        int i, l;
        unsigned long ret = 0;
        unsigned short *s;

        if (str == NULL) return (0);
        l = (strlen(str) + 1) / 2;
        s = (unsigned short *)str;
        for (i = 0; i < l; i++)
            ret ^= (s[i] << (i & 0x0f));
        return (ret);
    }

    /* The following hash seems to work very well on normal text strings
     * no collisions on /usr/dict/words and it distributes on %2^n quite
     * well, not as good as MD5, but still good.
     */
    unsigned long lh_strhash(const char *c)
    {
        unsigned long ret = 0;
        long n;
        unsigned long v;
        int r;

        if ((c == NULL) || (*c == '\0'))
            return (ret);
    /*
        unsigned char b[16];
        MD5(c, strlen(c), b);
        return (b[0] | (b[1] << 8) | (b[2] << 16) | (b[3] << 24));
    */

        n = 0x100;
        while (*c)
        {
            v = n | (*c);
            n += 0x100;
            r = (int)((v >> 2) ^ v) & 0x0f;
            ret = (ret << r) | (ret >> (32 - r));   /* rotate left by r bits */
            ret &= 0xFFFFFFFFL;
            ret ^= v * v;
            c++;
        }
        return ((ret >> 16) ^ ret);
    }

In the measurements below we refer to the two functions above as OPENSSL_HASH1 and OPENSSL_HASH2 respectively. We do not test the MD5-based implementation that appears, commented out, inside the second function.

2.3 The string hash function that appears in MySQL

    #ifndef NEW_HASH_FUNCTION

    /* Calc hashvalue for a key */

    static uint calc_hashnr(const byte *key, uint length)
    {
        register uint nr = 1, nr2 = 4;
        while (length--)
        {
            nr ^= (((nr & 63) + nr2) * ((uint) (uchar) *key++)) + (nr << 8);
            nr2 += 3;
        }
        return ((uint) nr);
    }

    /* Calc hashvalue for a key, case independently */

    static uint calc_hashnr_caseup(const byte *key, uint length)
    {
        register uint nr = 1, nr2 = 4;
        while (length--)
        {
            nr ^= (((nr & 63) + nr2) * ((uint) (uchar) toupper(*key++))) + (nr << 8);
            nr2 += 3;
        }
        return ((uint) nr);
    }

    #else

    /*
     * Fowler/Noll/Vo hash
     *
     * The basis of the hash algorithm was taken from an idea sent by email to the
     * IEEE Posix P1003.2 mailing list from Phong Vo (kpv at research.att.com) and
     * Glenn Fowler (gsf at research.att.com). Landon Curt Noll (chongo at toad.com)
     * later improved on their algorithm.
     *
     * The magic is in the interesting relationship between the special prime
     * 16777619 (2^24 + 403) and 2^32 and 2^8.
     *
     * This hash produces the fewest collisions of any function that we've seen so
     * far, and works well on both numbers and strings.
     */

    uint calc_hashnr(const byte *key, uint len)
    {
        const byte *end = key + len;
        uint hash;
        for (hash = 0; key < end; key++)
        {
            hash *= 16777619;
            hash ^= (uint) *(uchar*) key;
        }
        return (hash);
    }

    uint calc_hashnr_caseup(const byte *key, uint len)
    {
        const byte *end = key + len;
        uint hash;
        for (hash = 0; key < end; key++)
        {
            hash *= 16777619;
            hash ^= (uint) (uchar) toupper(*key);
        }
        return (hash);
    }

    #endif

MySQL also provides separate case-sensitive and case-insensitive versions of its string hash function. We use the case-insensitive string hash functions in our tests, and we refer to the two functions above as MYSQL_HASH1 and MYSQL_HASH2 respectively.
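The Fowler/Noll/Vo variant listed above starts from hash = 0. The published 32-bit FNV-1 algorithm uses the same prime but starts from a nonzero offset basis, 2166136261. A minimal sketch of that standard form follows for comparison. The function and macro names are illustrative, and this is not MySQL's code.

```c
#include <stddef.h>
#include <stdint.h>

#define FNV_OFFSET_BASIS_32 2166136261u
#define FNV_PRIME_32        16777619u

/* 32-bit FNV-1: multiply by the prime, then XOR in the next byte. */
uint32_t fnv1_32(const unsigned char *key, size_t len)
{
    uint32_t hash = FNV_OFFSET_BASIS_32;

    for (size_t i = 0; i < len; i++) {
        hash *= FNV_PRIME_32;
        hash ^= key[i];
    }
    return hash;
}
```

One practical difference: with a starting value of 0, any run of leading zero bytes leaves the hash at 0, so keys that differ only in the number of leading zero bytes collide. The nonzero offset basis avoids that.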

2.4 Another classic string hash function

    unsigned int hash(char *str)
    {
        register unsigned int h;
        register unsigned char *p;

        /* The multiplier is assumed to be 31; it is missing from the garbled source line. */
        for (h = 0, p = (unsigned char *)str; *p; p++)
            h = 31 * h + *p;
        return h;
    }

3 Testing and Results

3.1 Test Description

As the classic string hash functions above show, some of them come in case-sensitive and case-insensitive versions; our tests consider only the case-sensitive functions. In addition, some of the functions take a length parameter and some do not, which affects the efficiency of the function itself to some degree. For the tests we modify the functions slightly so that all of them take a length parameter, and we remove any length computation from inside the function.

The hash table used in our tests resolves collisions with classic separate chaining. We also build the table with a statically allocated number of buckets (the length of the hash chain table). This is mainly to simplify our implementation and does not affect the test results.

The test text is a word list. The test reads every word from an input file and builds a hash table from them. We measure the total number of function calls, the total function call time, the maximum chain length, the average chain length, and the bucket utilization (the proportion of buckets in use). The total number of function calls is the number of times the hash function is invoked; to make the execution time measurable, this count is scaled up during the test. The total function call time is the total time spent executing the hash function. The maximum chain length is the longest chain that appears while the table is built with separate chaining, and the average chain length is the mean length of the chains.
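The chain-length metrics described above can be computed without building real chains, by counting how many words land in each bucket. The sketch below is one way to do that and is not the author's test harness. It assumes the average chain length is taken over non-empty buckets only, which the text does not specify. The names measure, chain_stats and hash31 are hypothetical, and timing is left out.

```c
#include <stddef.h>

struct chain_stats {
    size_t max_chain;      /* longest chain in any bucket */
    double avg_chain;      /* mean chain length over non-empty buckets */
    double utilization;    /* non-empty buckets / all buckets */
};

typedef unsigned int (*hash_fn)(const char *key, size_t len);

/* Example hash to plug in: multiply by 31 and add, taking a length. */
unsigned int hash31(const char *key, size_t len)
{
    unsigned int h = 0;
    for (size_t i = 0; i < len; i++)
        h = 31 * h + (unsigned char)key[i];
    return h;
}

/* Hashes every word into `counts` (one counter per bucket, zeroed here)
 * and summarizes the resulting chain lengths. */
struct chain_stats measure(hash_fn fn, const char *const words[],
                           const size_t lens[], size_t nwords,
                           size_t counts[], size_t nbuckets)
{
    struct chain_stats s = {0, 0.0, 0.0};
    size_t used = 0;

    if (nbuckets == 0)
        return s;
    for (size_t b = 0; b < nbuckets; b++)
        counts[b] = 0;
    for (size_t i = 0; i < nwords; i++)
        counts[fn(words[i], lens[i]) % nbuckets]++;

    for (size_t b = 0; b < nbuckets; b++) {
        if (counts[b] == 0)
            continue;
        used++;
        if (counts[b] > s.max_chain)
            s.max_chain = counts[b];
    }
    if (used > 0)
        s.avg_chain = (double)nwords / (double)used;
    s.utilization = (double)used / (double)nbuckets;
    return s;
}
```

Because every hash function is passed in through the same hash_fn signature, each candidate can be measured against the same word list and bucket count.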

The machine used for the tests is configured as follows:

A PIII 600 notebook with 128 MB of memory, running a Windows Server operating system.

3.2 Test Results

The following are the results of building a hash table from all the repeated words in two different text files. In these results the number of function calls is scaled up by a factor of 100, and the function call time is scaled up by the same factor.
