Advanced Hash Table

This section involves a great deal of mathematics and assorted number theory, much of it completely unfamiliar to me! I read it several times and only barely understood part of it, so what follows is just a brief introduction to two kinds of hash tables, lest I ramble on and get something wrong.

The main topics of this lecture are: 1. Universal hashing and its construction 2. Perfect hashing

1. Universal hashing and its construction

Before introducing universal hashing, consider a weakness of an ordinary (fixed) hash function. Take an example: suppose you and a competitor are each writing a compiler for a company, and the company asks both of you to hand over your code o(╯-╰)o. The company's standard for judging is that each of you gets to feed the other's code a test sample, and whoever runs more efficiently wins the deal.

This is exactly where the weakness of an ordinary hash function shows: for any fixed hash function h, there always exists a set of keys that all map to the same slot i. In other words, an adversary can always find a set of key values that collide into a single slot, at which point the table's efficiency degrades to little better than a plain linked list.

The solution: select the hash function at random, independently of the keys. This is similar to how randomized quicksort avoids its worst case. But the family of hash functions we draw from cannot be chosen carelessly, or the scheme will not deliver the ideal performance.

The definition of a universal hash family is given below:

Let U be the universe of keys, and let \(\mathcal{H}\) be a finite family of hash functions, each mapping U to {0, 1, ..., m-1}, the slots of the table. \(\mathcal{H}\) is called universal if for all unequal \(x, y \in U\),

\(|\{h \in \mathcal{H} : h(x) = h(y)\}| = |\mathcal{H}|/m\)

In other words, for any two unequal keys x and y, if we select a hash function at random from the family, the probability that the two keys collide is 1/m.

More intuitively, randomly selecting a hash function is like throwing a dart at a board: only if the dart lands in the red region (the functions that make x and y collide) is there a conflict, and that region makes up a 1/m fraction of the whole board, so the probability is 1/m.

Here is a theorem that explains why a universal family is good:

Let h be a function chosen at random from a universal family \(\mathcal{H}\), and use h to map n arbitrary keys into the m slots of a table T. Then for a given key x we have:

Theorem: E[#collisions with x] < n/m

Proof: Let \(C_x\) be the random variable counting the number of keys that collide with x, and let \(c_{xy}\) be the indicator variable that equals 1 when h(x) = h(y) and 0 otherwise. Then \(E[c_{xy}] = 1/m\) and \(C_x = \sum_{y \in T - \{x\}} c_{xy}\), so by linearity of expectation

\(E[C_x] = \sum_{y \in T - \{x\}} E[c_{xy}] = (n-1)/m < n/m\)

QED!

This theorem shows that randomly selecting from a universal family achieves exactly the performance we want from a hash table. Note that n/m is the load factor defined earlier.

A method for constructing a universal hash family is now given:

First choose a prime p large enough that every key value lies between 0 and p-1. Let \(Z_p\) denote {0, 1, ..., p-1} and \(Z_p^*\) denote {1, 2, ..., p-1}. Because the number of slots m is less than the number of possible keys, m < p.

We can then design the hash functions: for any \(a \in Z_p^*\) and \(b \in Z_p\), let

\(h_{a,b}(k) = ((ak + b) \bmod p) \bmod m\)

The full family of these hash functions is:

\(\mathcal{H}_{p,m} = \{h_{a,b} : a \in Z_p^*,\ b \in Z_p\}\)

For example: choosing p = 17 and m = 6 gives \(h_{3,4}(8) = ((3 \cdot 8 + 4) \bmod 17) \bmod 6 = 5\). Each hash function maps \(Z_p\) to \(Z_m\). We can also see that this family contains p(p-1) hash functions in total.

The proof that this construction really is universal is skipped; it involves quite a bit of number theory and is not easy to present briefly.
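To make the construction concrete, here is a minimal Go sketch of the family \(h_{a,b}(k) = ((ak+b) \bmod p) \bmod m\). It is only an illustration under the small p = 17, m = 6 from the example above; the type and function names are my own, not from any source.

```go
package main

import (
	"fmt"
	"math/rand"
)

// universalHash is one member h_{a,b} of the family
// H_{p,m} = { h_{a,b} : a in Z_p^*, b in Z_p }.
type universalHash struct {
	a, b, p, m uint64
}

// newUniversalHash draws a and b at random: this is exactly the
// "randomly select the hash function" step described in the text.
func newUniversalHash(p, m uint64) universalHash {
	return universalHash{
		a: 1 + uint64(rand.Int63n(int64(p-1))), // a in {1, ..., p-1} = Z_p^*
		b: uint64(rand.Int63n(int64(p))),       // b in {0, ..., p-1} = Z_p
		p: p,
		m: m,
	}
}

// hash computes h_{a,b}(k) = ((a*k + b) mod p) mod m.
func (h universalHash) hash(k uint64) uint64 {
	return ((h.a*k + h.b) % h.p) % h.m
}

func main() {
	// Reproduce the example from the text: p = 17, m = 6, h_{3,4}(8) = 5.
	h := universalHash{a: 3, b: 4, p: 17, m: 6}
	fmt.Println(h.hash(8)) // prints 5

	// In practice: draw the function at random once, then keep it fixed.
	hr := newUniversalHash(17, 6)
	fmt.Println(hr.hash(8)) // some slot in {0, ..., 5}
}
```

Note that with a realistically large prime p the product a*k can overflow a 64-bit word; a production version would use 128-bit arithmetic (e.g. math/bits.Mul64) rather than this toy modular step.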

2. Perfect Hashing

When the set of keys is static (that is, fixed in advance), we can design the scheme so that even the worst-case lookup performance is excellent; this is perfect hashing. Static sets of keywords actually appear in many places: the reserved words of a programming language, or the collection of file names on a CD-ROM. Since a perfect hash can answer a lookup in O(1) time even in the worst case, its performance is excellent.

The idea of perfect hashing is to use a two-level structure, with universal hashing at each level.

Specifically, the first level looks much like a hash table with chaining, except that the keys colliding in a first-level slot are stored not in a linked list but in a second, smaller hash table. Each first-level slot records the basic parameters of its secondary table: m, the number of slots, and a and b, the two values that determine its universal hash function (chosen at random and then fixed), followed by the secondary table itself.

To guarantee no collisions at the second level, each secondary table is given a number of slots equal to the square of the number of elements the first level maps into that slot, which keeps every secondary table very sparse. The following theorem shows more clearly the effect of setting m = n².

Theorem: Let \(\mathcal{H}\) be a universal family of hash functions, and let the number of slots be m = n². If we use a random function \(h \in \mathcal{H}\) to map n keys into the table, then the expected number of collisions is less than 1/2.

Proof: By the definition of universality, the probability that a given pair of keys collides is 1/m, i.e. 1/n². There are \(\binom{n}{2}\) possible pairs of keys, so the expected number of collisions is

\(\binom{n}{2} \cdot \frac{1}{n^2} = \frac{n(n-1)}{2} \cdot \frac{1}{n^2} < \frac{1}{2}\)

QED!

To turn this statement about the expectation into one about probability, the following corollary is introduced.

Corollary: The probability that a perfect hash has no collision is at least 1/2.

Proof: The main tool here is Markov's inequality: for any nonnegative random variable X,

\(\Pr\{X \ge t\} \le E[X]/t\)

Applying this inequality with t = 1, the probability of having at least one collision is at most E[X]/1 < 1/2, so the probability of no collision is at least 1/2.

Because each second-level table holding \(n_i\) elements has \(n_i^2\) slots, you might worry that the total storage is huge. In fact it can be proven that \(E\left[\sum_{i=0}^{m-1} \Theta(n_i^2)\right] = \Theta(n)\); the proof is rather involved, so I skip it. %>_<%
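For concreteness, here is a rough Go sketch of how one second-level table is built in this two-level (FKS-style) scheme. It reuses the hypothetical universalHash type from the earlier sketch, and simply redraws the hash function until the \(n_i^2\)-slot table is collision-free, which by the corollary above succeeds with probability at least 1/2 per attempt; the first-level logic and the overall Θ(n) space bound are not shown.

```go
// buildSecondary builds a collision-free second-level table for the
// keys that the first level mapped into a single slot. It allocates
// m_i = n_i^2 slots, so by the theorem above a random universal hash
// function is collision-free with probability > 1/2, and we just retry
// until one works (expected number of attempts < 2).
func buildSecondary(keys []uint64, p uint64) (universalHash, []uint64) {
	n := uint64(len(keys))
	m := n * n // m_i = n_i^2
	for {
		h := newUniversalHash(p, m)
		table := make([]uint64, m)
		used := make([]bool, m)
		ok := true
		for _, k := range keys {
			s := h.hash(k)
			if used[s] { // a collision: redraw h and start over
				ok = false
				break
			}
			used[s] = true
			table[s] = k
		}
		if ok {
			return h, table // lookups now cost O(1) in the worst case
		}
	}
}
```

A lookup then hashes twice: once with the first-level function to find the slot, once with that slot's own function to find the key, both in O(1) worst-case time.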

Let's look at a top-K problem: a search engine uses log files to record every query string users submit, each string between 1 and 255 bytes long. Suppose there are currently 10 million records (the query strings are highly repetitive; although the total is 10 million, removing duplicates leaves no more than 3 million. The more repeated a query string is, the more users issued it, i.e. the more popular it is). Count the 10 most popular query strings, using no more than 1 GB of memory.

How do we answer this? We have covered top-K before: finding the K smallest numbers. But how do we handle the query strings? 10 million records at up to 255 bytes each occupy about 2.375 GB of memory, so plain in-memory sorting is clearly impossible, whatever internal sort we pick. External merge sort could handle it, but the problem also says that after deduplication there are at most 3 million strings, and 3 million strings fit comfortably in memory. So how do we get from 10 million strings down to those 3 million? That is what we discuss next: a hash table solves it completely. Don't worry, listen carefully.

Before the hash table, let's talk about the direct-address table, which is similar in spirit to Bloom filters and bit vectors. If the key universe is small, that is, there are not many keywords and they all fall within a limited range, we can simply use each keyword as an array subscript and place each keyword into its own slot of the table. This is a one-to-one mapping, and the mapping never changes.

This looks pretty good, and the operations are simple: each operation costs O(1).
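As a minimal illustration, here is what a direct-address table amounts to in Go, assuming the keys are distinct integers in a small range [0, size); the names are my own.

```go
// directAddressTable stores each key at the array index equal to the
// key itself: a fixed one-to-one mapping with O(1) insert/delete/search.
type directAddressTable struct {
	slots []bool // slots[k] == true means key k is present
}

func newDirectAddressTable(size int) *directAddressTable {
	return &directAddressTable{slots: make([]bool, size)}
}

func (t *directAddressTable) insert(k int)      { t.slots[k] = true }
func (t *directAddressTable) remove(k int)      { t.slots[k] = false }
func (t *directAddressTable) search(k int) bool { return t.slots[k] }
```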

It really is good. But it only works when the keywords are distributed over a small range, and it also requires that no keyword repeats. Suppose there are 100 integers, 64 bits each, some tiny and some enormous. Would you define an array of size 2^64 - 1? Could your memory bear it? Would you still use this method? Even with just the two numbers 1 and 100000000, the array you define must have size 100000000 to follow the direct-address method. That is a huge waste of memory. So direct addressing is nice, but its restrictions are too severe: the keywords must not repeat, and their range must be small.

Next I formally introduce the hash table. What is a hash table? A hash table is also called a hash map: a data structure accessed directly by key value. It maps the key to a position in the table to speed up lookups, and the array holding the records is called the hash table. The lookup cost of a well-built hash table is constant, O(1), which is what makes it so nice. Hashing uses a hash function to map the keyword key to a slot of the table. This mapping is no longer one-to-one; even a string can be mapped to an integer. Where the direct-address table maps key to slot key, the hash function maps key to slot h(key). The size of the table we define need only match the number of keywords, not the size of the keyword range, and repetition of keywords no longer matters, so no memory is wasted. Insertion takes O(1) time, and, since we mostly use hash tables for queries, a query also takes O(1).

Unfortunately there may be a problem. What problem? Different keywords can be mapped into the same slot, and a collision appears. What then? Surely we can't just give up on the keyword. Can we expect collisions never to happen? That is unrealistic. But don't worry: we can hang a linked list off that slot and put every keyword that maps to the slot into the list. This is called the zipper method, also known as chaining.

Each slot gets its own linked list; all keys with the same hash value, that is, mapped to the same slot, are placed in that slot's list, and the list grows and shrinks as needed (see the sketch below).
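A minimal Go sketch of chaining, reusing the hypothetical universalHash type from the earlier sketch; a slice per slot plays the role of the linked list.

```go
// chainedTable resolves collisions by keeping, per slot, a list of all
// keys that hash there (the zipper / chaining method).
type chainedTable struct {
	h     universalHash
	slots [][]uint64 // slot i holds the chain of keys with hash value i
}

func newChainedTable(p, m uint64) *chainedTable {
	return &chainedTable{h: newUniversalHash(p, m), slots: make([][]uint64, m)}
}

// insert appends to the chain: O(1).
func (t *chainedTable) insert(k uint64) {
	s := t.h.hash(k)
	t.slots[s] = append(t.slots[s], k)
}

// search walks one chain: O(1 + alpha) on average, O(n) if everything
// collides into one slot, exactly the degenerate case discussed next.
func (t *chainedTable) search(k uint64) bool {
	for _, key := range t.slots[t.h.hash(k)] {
		if key == k {
			return true
		}
	}
	return false
}
```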

With chaining, the time complexity of query and insertion is still constant, O(1). Problem solved? No. It is generally good, but if all the hash values come out the same, that is, every keyword maps to one identical slot, the hash table degenerates into a linked list, consumes more memory than a plain linked list, and query time becomes O(n). Isn't that a pit? So what do we do? There are always exceptions; this cannot be blamed on the hash table, only on a poor hash function. Choose a good hash function and this situation will not appear (well, not quite, as explained below, but hold that thought). An ideal hash function maps the keywords uniformly across the different slots of the hash table. In other words, the degenerate situation is the worst case; on average, chaining is still very good.

Let's analyze the average case of chaining. Assume each keyword is equally likely to map to any slot of the table, independently of all other keywords; this is the simple uniform hashing assumption. Suppose there are n keywords and the hash table has m slots. What is the probability that two keys are mapped to the same slot by the hash function? 1/m, independently for each pair. Define the load factor α = n/m, the average number of keywords per slot (α can be greater than, less than, or equal to 1). Under the assumption, the expected length \(n_i\) of the chain at slot i is α, so a query takes O(1 + α) time, counting both successful and unsuccessful searches.

Theorem: Under the simple uniform hashing assumption, in a hash table resolving collisions by chaining, both a successful and an unsuccessful search take O(1 + α) time on average.

Here α is the average number of keywords per slot. If n = O(m), then α = n/m = O(m)/m = O(1), and the query time is constant, O(1); in practice m and n are usually of the same order of magnitude. To avoid the worst case of chaining, choosing a good hash function is essential.
Let's take a look at the usual hash functions. First the division method, also called the remainder method: it maps a keyword k into one of m slots by taking the remainder of k divided by the number of slots. The hash function is h(k) = k mod m. As an example, with m = 12 and k = 100, h(100) = 4. But if m is even, h(k) has the same parity as k, so if the keys happen to all be even, the odd-numbered slots are wasted. If m = 2^r, then h(k) is just the low r bits of k (in binary), and the result can be that one slot receives a great many keywords. So m is usually chosen to be a prime not too close to an integer power of 2.

That works well, but division is, after all, not a fast operation on a computer, so next we discuss multiplicative hashing. The multiplication method: multiply the keyword k by a constant A (0 < A < 1), take the fractional part of the result, multiply it by m, and round down. The hash function is h(k) = ⌊m (kA mod 1)⌋. Its advantage is that there is no requirement on m; one generally chooses m to be an integer power of 2 (convenient!). Assuming the computer word length is w bits, k fits in w bits, A = s/2^w (0 < s < 2^w), and m = 2^p, then writing k·s = r₁·2^w + r₀, the hash h(k) is the p most significant bits of r₀.

This is better: there is no strict requirement on the value of A, though A = (√5 - 1)/2 is considered the best choice. There are also other methods, such as the mid-square method, the folding method, and so on. A sketch of both functions follows.
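Here is a small Go sketch of the two hash functions just described; choosing w = 64 and the constant s = ⌊A·2^64⌋ with A = (√5-1)/2 is the usual textbook instantiation, my choice rather than anything mandated by the text.

```go
package main

import "fmt"

// divisionHash is the division (remainder) method: h(k) = k mod m.
func divisionHash(k, m uint64) uint64 { return k % m }

// multiplicationHash is the multiplication method h(k) = floor(m*(k*A mod 1))
// for m = 2^p. With s = floor(A * 2^64), the low 64 bits of k*s hold the
// fractional part of k*A scaled by 2^64, and its top p bits are the hash.
func multiplicationHash(k uint64, p uint) uint64 {
	const s = 11400714819323198485 // floor(((sqrt(5)-1)/2) * 2^64)
	return (k * s) >> (64 - p)     // k*s naturally wraps mod 2^64 in Go
}

func main() {
	fmt.Println(divisionHash(100, 12))      // 4, the example from the text
	fmt.Println(multiplicationHash(100, 3)) // a slot in {0, ..., 7} (m = 2^3)
}
```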
So much for hash functions; now let's look at how to avoid collisions. Is there any way other than chaining? Of course: open addressing. In open addressing there are no linked lists; all keywords are placed directly in the slots. If the slot for a key's hash value is already occupied, we hash (probe) again until we find an empty slot for the key. The probe sequence is critical, and it depends on the key's first hash value. The probe sequence need not be 0, 1, 2, ..., m-1; ideally each key's probe sequence would be one of the m! permutations of the slots, so a hash table could have up to m! probe sequences. For each keyword the probe sequence is h(k,0), h(k,1), h(k,2), ..., h(k,m-1). Query and insertion are easy, but deletion is a real hassle.

Where is the trouble with deletion? Because we may probe several times. For example, with k = 496: the first probe h(496,0) = 586 finds slot 586 already holding keyword 370; the second probe h(496,1) = 204 finds slot 204 holding keyword 37; the third probe h(496,2) = 304 finds slot 304 empty, so keyword 496 is placed there. Now suppose we delete 370 by simply emptying slot 586. Then a query for 496 probes 586 first, finds the slot empty, and concludes that 496 does not exist, even though 496 was obviously just inserted. So deletion cannot simply clear the slot: we leave a DEL mark so later probing is not disturbed. On insertion the mark indicates the slot may be reused; a query that sees the mark just skips past it.

So how do we construct an open-addressing probe function? Linear probing: h(k,i) = (h1(k) + i) mod m, where h and h1 may be the same or different. The probe sequence starts from the first hash value and checks slot after slot until an empty one is found; only the first hash value matters, which is very simple. But this function causes a problem: (primary) clustering. Long runs of consecutive occupied slots build up, and if a key's hash value lands inside such a run, we must traverse the whole run uselessly, perhaps all the way to slot m-1, even while the slots near the start of the table include plenty of empty ones. This wastes a lot of time. A sketch with deletion marks follows.
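Here is a hedged Go sketch of linear probing with the DEL tombstone just described; it reuses the hypothetical universalHash from earlier, and for simplicity reserves the values 0 and 1 as the empty and DEL markers, so real keys are assumed to be ≥ 2.

```go
const (
	slotEmpty = 0 // never written by insert
	slotDel   = 1 // tombstone left behind by remove
)

// probeTable: open addressing with linear probing h(k,i) = (h1(k)+i) mod m.
type probeTable struct {
	h1    universalHash
	slots []uint64 // slotEmpty, slotDel, or a key (keys must be >= 2)
}

func (t *probeTable) insert(k uint64) {
	m := uint64(len(t.slots))
	for i := uint64(0); i < m; i++ {
		s := (t.h1.hash(k) + i) % m
		if t.slots[s] == slotEmpty || t.slots[s] == slotDel {
			t.slots[s] = k // a tombstone may be reused on insert
			return
		}
	}
	// table is full; a real implementation would grow and rehash
}

func (t *probeTable) search(k uint64) bool {
	m := uint64(len(t.slots))
	for i := uint64(0); i < m; i++ {
		s := (t.h1.hash(k) + i) % m
		switch t.slots[s] {
		case slotEmpty:
			return false // a truly empty slot ends the probe sequence
		case k:
			return true
		}
		// slotDel or a different key: keep probing past it
	}
	return false
}

// remove leaves a DEL mark rather than emptying the slot, so that keys
// inserted past this point (like 496 in the story above) stay findable.
func (t *probeTable) remove(k uint64) {
	m := uint64(len(t.slots))
	for i := uint64(0); i < m; i++ {
		s := (t.h1.hash(k) + i) % m
		if t.slots[s] == slotEmpty {
			return
		}
		if t.slots[s] == k {
			t.slots[s] = slotDel
			return
		}
	}
}
```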
Quadratic probing: instead of checking slots one by one, probe at growing intervals, replacing i with i², or using h(k,i) = (h(k) + c₁·i + c₂·i²) mod m. This is much better, but it also produces clustering, of a secondary kind: if two keywords have the same initial probe value, their entire probe sequences coincide. Secondary clusters are shorter and do less harm. Still, both of these probing methods yield only m distinct probe sequences, far short of the ideal m!. Next comes a better method with m² distinct probe sequences.
Double hashing: h(k,i) = (h1(k) + i·h2(k)) mod m. This method greatly reduces clustering. h1 and h2 each contribute m possibilities, so h has m² distinct probe sequences. Here m is typically an integer power of 2, and h2 is made to always produce an odd value.
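A short sketch of the double-hashing probe function in Go; forcing h2's value odd when m is a power of 2 makes the step coprime to m, so each key's probe sequence visits every slot, which is one way to realize the requirement just stated.

```go
// doubleHashProbe builds h(k, i) = (h1(k) + i*h2(k)) mod m from two hash
// functions. m is assumed to be a power of 2; OR-ing the step with 1
// makes it odd, hence coprime to m, so the probe sequence for any key
// is a full permutation of the m slots.
func doubleHashProbe(h1, h2 func(uint64) uint64, m uint64) func(k, i uint64) uint64 {
	return func(k, i uint64) uint64 {
		step := h2(k) | 1
		return (h1(k) + i*step) % m
	}
}
```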

Although open addressing is good, its worst case is still bad, and we want to avoid the worst case, so let's analyze the average case and look at the expectations. α = n/m is the average number of keywords per slot, and for open addressing α must be at most 1, because each slot holds at most one keyword. Assume each keyword is equally likely to map to any slot, independently of the others (uniform hashing), with n keywords and m slots.

For an unsuccessful search, the probability that the first probe hits an occupied slot is n/m (n of the m slots are occupied). What is the probability that the second probe also hits an occupied slot? (n-1)/(m-1), since the slot already probed is excluded from further probes. In general the i-th probe hits an occupied slot with probability (n-i+1)/(m-i+1) < n/m = α. So the expected number of probes is

E = 1 + α(1 + α(1 + α(1 + ...))) ≤ 1 + α + α² + α³ + ... = 1/(1-α)

Hence an unsuccessful search takes about 1/(1-α) probes, and a successful search is bounded similarly. (And an unsuccessful search is exactly what an insertion performs: it probes until it finds an empty slot.)

Theorem: Under the uniform hashing assumption, in an open-addressing hash table the expected number of probes for both successful and unsuccessful searches is at most 1/(1-α).

If 1/(1-α) = 2 (α = 0.5), a search needs about 2 probes; at α = 0.9 it needs about 10. So in general we want 1/(1-α) to stay small, so that searches stay cheap. Don't think that the fuller the hash table, the better; a very dense table makes queries slow. What do we use a hash table for if not fast search and insertion? If the speed is gone, what else do we want from it?

A good hash function really matters, but even a good hash function inevitably collides: for any given hash function one can always find a set of keywords that all map to the same slot, making the worst-case query time O(n). This has nothing to do with the hash function's quality; after all that talk, is it useless? It is a bit of a pit. Here we cannot help but think of randomness: if the hash function is handed to you at random, an adversary can no longer force the worst case on you. This is universal hashing.

The idea of universal hashing: at the start of execution the algorithm randomly selects one function from a carefully designed family of hash functions, so that no fixed set of keywords can cause the worst case. If the universal family is H and the key universe is U, then for any two distinct keys of U the probability of a collision is 1/m, which means exactly |H|/m functions in the family collide any given pair.

Theorem: If h is selected at random from the universal family H and the n keywords are mapped into m slots, the expected cost of an unsuccessful search is α, and of a successful search α + 1. Universal hashing thus achieves constant expected time, because n = O(m) gives α = O(1); the reason we use hashing at all is that its average time complexity reaches O(1).

Most of the time we use a hash table for queries. If all we need is a static lookup table, we can do even better: perfect hashing. Perfect hashing: a hashing technique whose worst-case lookup time is O(1). It is implemented as a two-level hash table, with each level using a universal hash function.
The first level is no different from an ordinary hash table: keywords map to slots. But on a collision we borrow the chaining idea, except that instead of a list we build, for each colliding slot i, a small second-level hash table h_i whose size m_i is the square of the number n_i of keys colliding there: m_i = n_i².

Theorem: For a hash function h selected from a universal family, mapping n keywords into a hash table of m = n² slots, the probability of any collision is less than 1/2.

Simple proof: for m slots, two distinct keywords collide with probability 1/m = 1/n², and the number of ways to choose 2 keywords from n is n(n-1)/2, so the expected number of collisions is n(n-1)/2 · 1/n² < 1/2. So for a second-level table, as long as m_i = n_i² is satisfied, we achieve constant-time queries with a low probability of collision.

But won't the second-level tables cost too much space? No: if the first level is built well, the second level's space complexity is also fine. Theorem: for a hash function h selected from a universal family, mapping n keywords into a first-level table of m = n slots, with each second-level table sized m_i = n_i², the expected total space of the perfect hash table is less than 2n, i.e. O(n).

Let me extend this with one more hashing scheme: d-left hashing. The d here means "multiple"; look first at 2-left hashing. 2-left hashing divides a hash table into two halves of equal length, called T1 and T2, and gives T1 and T2 each its own hash function, h1 and h2. When a new key is stored, both hash functions are computed, yielding two candidate addresses, h1[key] in T1 and h2[key] in T2. We then check which of the two positions already stores more (colliding) keys, and store the new key in the less loaded position. What is compared is the number of keys already stored at the two mapped positions, not the number of keys in the two subtables overall. If the two sides are equal, for example both positions are empty or both store one key, the new key goes into the left subtable T1; that is where the "2-left" comes from. When looking up a key, both hashes must be computed and both positions checked.

Once 2-left hashing is understood, d-left hashing is easy: it is just the generalization. Where 2-left hashing fixes the number of subtables at 2, d-left hashing is more flexible: the number of subtables is a variable d, which also means d hash functions. In d-left hashing the whole table is divided into d subtables from left to right, each with its own independent hash function. When a new key is added, it is hashed by all d functions at once, producing d independent positions, and the key is added to the least loaded position (bucket). If several positions tie for the lightest load, the key goes into the leftmost of the tied subtables. Similarly, a lookup must check all d positions.

OK, that's it for hash tables for now; I'll update after learning more. Hashing runs deep, and this is just a drop in the bucket; there is more to learn later. With all this understood, the top-K problem from the beginning has become easy to solve; think it over, and see the sketch below.
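To close the loop on the top-K problem that opened this half, here is a hedged Go sketch of the approach the text points at: a hash table (Go's built-in map) collapses the 10 million queries into at most 3 million counts, then a size-10 min-heap scans the counts. The names and the container/heap boilerplate are mine; this is a sketch of the idea, not a tuned implementation.

```go
package main

import "container/heap"

// queryCount pairs a query string with its number of occurrences.
type queryCount struct {
	query string
	count int
}

// minHeap keeps the k most frequent queries seen so far, least on top.
type minHeap []queryCount

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].count < h[j].count }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(queryCount)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// topK counts duplicates with a hash table (O(1) amortized per record,
// at most ~3 million distinct entries in memory), then keeps a size-k
// min-heap over the counts: O(n + d*log k) total for d distinct queries.
func topK(queries []string, k int) []queryCount {
	counts := make(map[string]int)
	for _, q := range queries {
		counts[q]++ // 10M increments, but <= 3M entries survive
	}
	h := &minHeap{}
	for q, c := range counts {
		if h.Len() < k {
			heap.Push(h, queryCount{q, c})
		} else if c > (*h)[0].count {
			(*h)[0] = queryCount{q, c} // evict the current minimum
			heap.Fix(h, 0)
		}
	}
	return *h
}
```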
