The fastest hash table algorithm

Let's start with a simple question: given a huge array of strings and one separate string, how would you determine whether that string is in the array, and locate it if it is? The simplest approach is the honest one: start at the beginning and compare one entry at a time until you find a match. Anyone who has studied programming can write that. But if a programmer shipped such a program to users, I would be left speechless. It might really work, but ... that is about all it does.

The most suitable approach is, of course, a hash table. First, some background: a hash is generally an integer, and a hash algorithm "compresses" a string into that integer. Of course, a 32-bit integer cannot be mapped back to a unique string, but in practice the probability that two different strings produce the same hash value is very small. Here is the hash algorithm used in MPQ:

Function 1: the following function fills crypttable[0x500], an array of length 0x500 (decimal 1280):

/* table of values used by hashstring; filled once by preparecrypttable */
unsigned long crypttable[0x500];

void preparecrypttable(void)
{
    unsigned long seed = 0x00100001, index1 = 0, index2 = 0, i;

    for (index1 = 0; index1 < 0x100; index1++)
    {
        /* each index1 fills five entries, spaced 0x100 apart */
        for (index2 = index1, i = 0; i < 5; i++, index2 += 0x100)
        {
            unsigned long temp1, temp2;

            seed = (seed * 125 + 3) % 0x2AAAAB;
            temp1 = (seed & 0xFFFF) << 0x10;

            seed = (seed * 125 + 3) % 0x2AAAAB;
            temp2 = (seed & 0xFFFF);

            crypttable[index2] = (temp1 | temp2);
        }
    }
}

Function 2: the following function computes the hash value of the string lpszfilename. dwhashtype is the type of hash and can take the values 0, 1, or 2. The gethashtablepos function (function 3 below) calls this function. It returns the hash value of lpszfilename:

unsigned long hashstring(char *lpszfilename, unsigned long dwhashtype)
{
    unsigned char *key = (unsigned char *)lpszfilename;
    unsigned long seed1 = 0x7FED7FED;
    unsigned long seed2 = 0xEEEEEEEE;
    int ch;

    while (*key != 0)
    {
        /* the hash is case-insensitive */
        ch = toupper(*key++);

        /* dwhashtype selects one 0x100-entry block of crypttable */
        seed1 = crypttable[(dwhashtype << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    /* keep the low 32 bits in case unsigned long is wider than 32 bits */
    return seed1 & 0xFFFFFFFF;
}


Blizzard's algorithm is very efficient. It is what is known as a "one-way hash": an algorithm constructed in such a way that deriving the original string (a set of strings, actually) from the hash is virtually impossible. For example, the string "UNITNEUTRALACRITTER.GRP" hashes to 0xA26067F3 with this algorithm.

Is it enough to improve on the first approach by comparing the hash values of the strings one by one? The answer is: far from it. To get the fastest algorithm you cannot compare entries one at a time. The usual solution is to build a hash table. A hash table is a large array whose capacity is chosen according to the program's needs, for example 1024. Each hash value is mapped to a position in the array by a modulo operation (mod), so you only need to check whether the position corresponding to the string's hash value is occupied to get the final answer. How fast is that? Yes, the fastest possible: O(1). Now take a closer look at the algorithm:

/* one possible definition of the struct */
typedef struct
{
    int nhasha;
    int nhashb;
    char bexists;
    /* ...... */
} somestructure;

Function 3: the following function looks up the target string in the hash table. If the string is present, it returns the string's position in the table; otherwise it returns -1.

/* lpszstring: the string to look for in the hash table.
 * lptable: the hash table that holds the stored strings.
 * ntablesize: the number of entries in the table. */
int gethashtablepos(char *lpszstring, somestructure *lptable, int ntablesize)
{
    /* call function 2 to get the hash of the string being looked up;
     * hash type 0 is assumed here, the listing does not name a type */
    unsigned long nhash = hashstring(lpszstring, 0);
    int nhashpos = (int)(nhash % ntablesize);

    /* the slot is occupied and holds exactly the string being looked up */
    if (lptable[nhashpos].bexists && !strcmp(lptable[nhashpos].pstring, lpszstring))
    {
        return nhashpos;    /* position of the string in the table */
    }
    else
    {
        return -1;
    }
}


Seeing this, I think everyone will have thought of a serious problem: "What if two strings land on the same position in the hash table?" After all, the array's capacity is limited, so this is quite likely. There are many ways to solve this problem. My first thought was a linked list. Thanks to the data structures course at university for teaching me this never-failing trick: many of the algorithms I have met can be solved by turning them into linked lists. Just hang a linked list off each entry of the hash table and keep all the strings that map to that entry in it. That looks like a perfect ending, and if the problem had been left to me alone, I would probably have started defining data structures and writing code.
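
The linked-list idea (separate chaining) could be sketched roughly as follows. This is only an illustration and not code from MPQ: the names chain_node, chain_table, chain_hash, chain_find, and chain_insert are hypothetical, the table size of 1024 is just the example capacity mentioned earlier, and the simple multiply-by-33 hash is a stand-in for whatever hash function the table would really use.

```c
#include <stdlib.h>
#include <string.h>

#define CHAIN_BUCKETS 1024  /* illustrative table size */

typedef struct chain_node {
    char *key;                  /* private copy of the stored string */
    struct chain_node *next;    /* next string that landed in this bucket */
} chain_node;

typedef struct {
    chain_node *buckets[CHAIN_BUCKETS];
} chain_table;

/* Stand-in hash (hash * 33 + c); this is not the MPQ hash. */
static unsigned long chain_hash(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h = (h << 5) + h + (unsigned char)*s++;
    return h;
}

/* Returns 1 if key is stored in the table, 0 otherwise. */
static int chain_find(const chain_table *t, const char *key)
{
    const chain_node *n = t->buckets[chain_hash(key) % CHAIN_BUCKETS];
    for (; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}

/* Adds key to the front of its bucket's list.
 * Returns 0 on success (or if already present), -1 if allocation fails. */
static int chain_insert(chain_table *t, const char *key)
{
    unsigned long pos = chain_hash(key) % CHAIN_BUCKETS;
    size_t len = strlen(key) + 1;
    chain_node *n;

    if (chain_find(t, key))
        return 0;
    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->key = malloc(len);
    if (n->key == NULL) {
        free(n);
        return -1;
    }
    memcpy(n->key, key, len);
    n->next = t->buckets[pos];
    t->buckets[pos] = n;
    return 0;
}
```

A zero-initialized chain_table is an empty table. Note that this design has to store every string and walk a list with strcmp on each lookup, which is the cost the three-hash scheme avoids.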

However, Blizzard's programmers use a more ingenious method. The basic idea is that they verify a string with three hash values instead of one.

MPQ uses a file name hash table to keep track of all the files inside it. But the format of this table differs somewhat from a normal hash table. First, instead of using a hash as the subscript and storing the actual file name in the table for verification, it does not store the file name at all. Instead it uses three different hashes: one for the hash table subscript and two for verification. These two verification hashes take the place of the actual file name.
Of course, it is still possible for two different file names to hash to the same three hashes. But the average probability of this happening is 1 : 1.88894659314786e+22, which should be small enough for anyone. Now back to the data structure: the hash table Blizzard uses has no linked lists. It resolves collisions by moving on to the next slot instead. Look at this algorithm:

Function 4: lpszstring is the string to find in the hash table; lptable is the hash table that stores the strings' hash values; ntablesize is the length of the hash table:

int gethashtablepos(char *lpszstring, mpqhashtable *lptable, int ntablesize)
{
    const int hash_offset = 0, hash_a = 1, hash_b = 2;

    /* kept unsigned so that the modulo below can never be negative */
    unsigned long nhash = hashstring(lpszstring, hash_offset);
    int nhasha = (int)hashstring(lpszstring, hash_a);
    int nhashb = (int)hashstring(lpszstring, hash_b);
    int nhashstart = (int)(nhash % ntablesize);
    int nhashpos = nhashstart;

    while (lptable[nhashpos].bexists)
    {
        /* If we only need to know whether the string is in the table,
         * comparing these two hash values is enough; there is no need to
         * compare against a string stored in the struct. Does this speed
         * things up? Does it reduce the space the hash table occupies?
         * In what situations is this method generally used? */
        if (lptable[nhashpos].nhasha == nhasha
            && lptable[nhashpos].nhashb == nhashb)
        {
            return nhashpos;
        }
        else
        {
            /* move to the next slot, wrapping around at the end */
            nhashpos = (nhashpos + 1) % ntablesize;
        }

        /* back at the starting slot: the whole table has been searched */
        if (nhashpos == nhashstart)
            break;
    }
    return -1;
}

The procedure above works as follows:

1. Compute the three hashes of the string (one to determine the position and two for verification).
2. Look at that position in the hash table.
3. Is that position in the hash table empty? If it is empty, the string is definitely not in the table; return -1.
4. If the position is occupied, check whether the other two hashes match. If they do, the string has been found; return its position.
5. Otherwise move to the next position. If you have moved past the end of the table, wrap around to the beginning and continue.
6. Check whether you are back at the starting position. If so, return "not found".
7. Go back to step 3.
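
The article only shows lookup. As a rough sketch of what the matching insertion step could look like under the same scheme (linear probing, with two verification hashes stored in place of the string), consider the following. The names probe_entry, probe_hash, probe_find, and probe_insert are hypothetical, and probe_hash is a simple seeded stand-in rather than the MPQ hashstring, used only so that the block compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t hash_a;    /* first verification hash */
    uint32_t hash_b;    /* second verification hash */
    int exists;         /* nonzero if the slot is occupied */
} probe_entry;

/* Stand-in for hashstring(s, type): a seeded multiply-by-33 hash. */
static uint32_t probe_hash(const char *s, uint32_t type)
{
    uint32_t h = 5381u + type * 0x9E3779B9u;
    while (*s)
        h = h * 33u + (unsigned char)*s++;
    return h;
}

/* Returns the slot holding s, or -1 if it is absent (steps 1 to 7 above). */
static int probe_find(const probe_entry *table, int size, const char *s)
{
    uint32_t a = probe_hash(s, 1), b = probe_hash(s, 2);
    int start = (int)(probe_hash(s, 0) % (uint32_t)size);
    int pos = start;

    while (table[pos].exists) {
        if (table[pos].hash_a == a && table[pos].hash_b == b)
            return pos;
        pos = (pos + 1) % size;
        if (pos == start)
            break;
    }
    return -1;
}

/* Stores s in the first free slot at or after its home position.
 * Returns the slot used, or -1 if the table is full. */
static int probe_insert(probe_entry *table, int size, const char *s)
{
    uint32_t a = probe_hash(s, 1), b = probe_hash(s, 2);
    int start = (int)(probe_hash(s, 0) % (uint32_t)size);
    int pos = start;

    while (table[pos].exists) {
        if (table[pos].hash_a == a && table[pos].hash_b == b)
            return pos;     /* already stored */
        pos = (pos + 1) % size;
        if (pos == start)
            return -1;      /* wrapped around: no free slot */
    }
    table[pos].hash_a = a;
    table[pos].hash_b = b;
    table[pos].exists = 1;
    return pos;
}
```

Because lookup stops at the first empty slot, simply clearing a slot on deletion would hide entries stored after it on the same probe path; a table that supports deletion needs a separate "deleted" marker for that case.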

OK, that is the fastest hash table algorithm described in this article. What, not fast enough? :D Criticism and corrections are welcome.

--------------------------------------------
Supplement 1: a simple hash function:

/* key is a string; ntablelength is the length of the hash table.
 * The hash values produced by this function are distributed fairly evenly. */
unsigned long gethashindex(const char *key, int ntablelength)
{
    unsigned long nhash = 0;

    while (*key)
    {
        /* (nhash << 5) + nhash is nhash * 33 */
        nhash = (nhash << 5) + nhash + *key++;
    }

    return (nhash % ntablelength);
}
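
As a small worked check of the arithmetic: for the key "ab" and a table length of 17, the hash goes from 0 to 97 ('a'), then to 97 * 33 + 98 = 3299, and 3299 % 17 = 1. The sketch below restates the function under the hypothetical name hash33_index so that it is self-contained, and adds a hypothetical helper print_slots for inspecting where keys land.

```c
#include <stdio.h>

/* Restates the simple hash above: hash = hash * 33 + next character. */
static unsigned long hash33_index(const char *key, int ntablelength)
{
    unsigned long nhash = 0;
    while (*key)
        nhash = (nhash << 5) + nhash + (unsigned char)*key++;
    return nhash % (unsigned long)ntablelength;
}

/* Prints the slot each key maps to in a table of the given length. */
static void print_slots(const char *keys[], int count, int ntablelength)
{
    int i;
    for (i = 0; i < count; i++)
        printf("%-8s -> %lu\n", keys[i], hash33_index(keys[i], ntablelength));
}
```

Calling print_slots on a handful of real keys with a candidate table length is a quick way to see how many of them share a slot.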

Supplement 2: a complete test program:
The array behind a hash table has a fixed length: if it is too large, space is wasted; if it is too small, the table cannot deliver its efficiency. A suitable array size is key to hash table performance. The size of a hash table is preferably a prime number. Of course, different amounts of data call for different table sizes. For applications whose data volume varies, the best design is a dynamically resizable hash table: when the table turns out to be too small, for example when the number of elements reaches twice the table size, enlarge the table, generally to about double its size.


Here are possible values for the hash table size:

17, 37, 79, 163, 331,
673, 1361, 2729, 5471, 10949,
21911, 43853, 87719, 175447, 350899,
701819, 1403641, 2807303, 5614657, 11229331,
22458671, 44917381, 89834777, 179669557, 359339171,
718678369, 1437356741, 2147483647
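
The growth policy described above (enlarge once the element count reaches twice the table size, roughly doubling each time) could be sketched by stepping through this list of primes. The names table_primes, needs_grow, and next_table_size are hypothetical and only illustrate the idea.

```c
#include <stddef.h>

/* The prime sizes listed above; each is roughly double the previous one. */
static const unsigned long table_primes[] = {
    17, 37, 79, 163, 331,
    673, 1361, 2729, 5471, 10949,
    21911, 43853, 87719, 175447, 350899,
    701819, 1403641, 2807303, 5614657, 11229331,
    22458671, 44917381, 89834777, 179669557, 359339171,
    718678369, 1437356741, 2147483647
};

#define PRIME_COUNT (sizeof table_primes / sizeof table_primes[0])

/* Returns nonzero when the element count has reached twice the table size. */
static int needs_grow(unsigned long elements, unsigned long table_size)
{
    return elements >= 2 * table_size;
}

/* Returns the smallest listed prime larger than current, or current itself
 * if it is already at or beyond the largest listed size. */
static unsigned long next_table_size(unsigned long current)
{
    size_t i;
    for (i = 0; i < PRIME_COUNT; i++)
        if (table_primes[i] > current)
            return table_primes[i];
    return current;
}
```

After the table grows, every stored entry has to be inserted again, because its position depends on the table size used in the modulo operation.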

The following is the complete source code for the program, which has been tested under Linux:

#include <stdio.h>
#include <ctype.h>  /* for toupper; thanks, citylove */

/* crypttable[] holds data used by hashstring; it is initialized in
 * preparecrypttable */
unsigned long crypttable[0x500];

/* Fills crypttable[0x500], an array of length 0x500 (decimal 1280) */
void preparecrypttable(void)
{
    unsigned long seed = 0x00100001, index1 = 0, index2 = 0, i;

    for (index1 = 0; index1 < 0x100; index1++)
    {
        for (index2 = index1, i = 0; i < 5; i++, index2 += 0x100)
        {
            unsigned long temp1, temp2;

            seed = (seed * 125 + 3) % 0x2AAAAB;
            temp1 = (seed & 0xFFFF) << 0x10;

            seed = (seed * 125 + 3) % 0x2AAAAB;
            temp2 = (seed & 0xFFFF);

            crypttable[index2] = (temp1 | temp2);
        }
    }
}

/* Computes the hash value of the string lpszfilename. dwhashtype is the
 * type of hash and can take the values 0, 1, or 2; gethashtablepos calls
 * this function. Returns the hash value of lpszfilename. */
unsigned long hashstring(char *lpszfilename, unsigned long dwhashtype)
{
    unsigned char *key = (unsigned char *)lpszfilename;
    unsigned long seed1 = 0x7FED7FED;
    unsigned long seed2 = 0xEEEEEEEE;
    int ch;

    while (*key != 0)
    {
        ch = toupper(*key++);

        seed1 = crypttable[(dwhashtype << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    /* keep the low 32 bits in case unsigned long is wider than 32 bits */
    return seed1 & 0xFFFFFFFF;
}

/* Prints the three hash values of argv[1], for example:
 *   ./hash "arr\units.dat"
 *   ./hash "unit\neutral\acritter.grp" */
int main(int argc, char **argv)
{
    unsigned long ulhashvalue;
    int i = 0;

    if (argc != 2)
    {
        printf("Please input the arguments\n");
        return -1;
    }

    /* initialize the array crypttable[0x500] */
    preparecrypttable();

    /* print the values in the array crypttable[0x500] */
    for (; i < 0x500; i++)
    {
        if (i % 10 == 0)
        {
            printf("\n");
        }

        printf("%-12lx", crypttable[i]);
    }

    ulhashvalue = hashstring(argv[1], 0);
    printf("\n----%lx----\n", ulhashvalue);

    ulhashvalue = hashstring(argv[1], 1);
    printf("----%lx----\n", ulhashvalue);

    ulhashvalue = hashstring(argv[1], 2);
    printf("----%lx----\n", ulhashvalue);

    return 0;
}
July, Wuliming, Pkuoliver
Source: Http://blog.csdn.net/v_JULY_v.
