Algorithms in Plain Language (6): Hash Tables from Theory to Practice (Part 1)

Source: Internet
Author: User

The general mathematical method for dealing with practical problems is to first extract the essential elements of the problem and then regard it as one of a system of possibilities infinitely wider than reality; the essential relationships in this system can be demonstrated and understood through general reasoning and summarized in general formulas, which apply to any special case.
                                        -- R.A. Fisher

In the complexity of a solution, the theoretical or conceptual part usually accounts for only a small share. Theory cannot do practical work; if it could, it would not be theory. Getting from theory to practice requires a series of inventions, and getting from practical to more practical and more general often requires still more complexity. Sometimes this process goes far beyond the scope of science and becomes a playground for artists. Sometimes it introduces a great deal of unnecessary complexity, simply because of human selfishness, stupidity, and shortsightedness.
Science does not, and cannot, deal with miracles. Science can only deal with repeatable events, but art is different. Art says "this is how it is". Before a creation is born it is nothing: there was no reason for it and no sign of it. After its birth it simply exists, reasonable, natural, and beautiful. Algorithms, as a practical science, have both a scientific side and an artistic side. As science, an algorithm's structure can be analyzed, its behavior predicted, its properties quantified, and its correctness proved. As art, once an algorithm is born, sometimes all we can say is "it works", and that is all. We know nothing about how it came into this world: there is no "because ... therefore ...", and it is not a simple step from the general to the special. Creation seems as mysterious as life itself. We can dress a creation in a handsome scientific coat and admire its internal consistency, but the most fascinating part of creativity cannot be described at all.
So as we travel from theory to practice with the hash table, don't be surprised if you notice some unexplained leaps. If there were no such leaps, we could write a program to invent these algorithms, and what we would need to learn would be something else entirely.

O(n) search and O(1) search: two models

What should I do if I want to know whether the word "ghost" appears in the book "Selected Essays of Elia"? I can only start from the first line on the first page and read word by word until I find it. If I reach the last word on the last page without finding it, I know the word is not in this book. The complexity of this work is therefore O(n).
Now suppose there is a copybook for practicing the formal ("capital") numerals used in accounting. It has only nine pages, and each page holds one numeral.

When the accountant wants to practice the numeral for seven ("柒"), she can turn directly to page 7, as long as she knows the relationship between page number and content in advance. That lookup has O(1) complexity. This model tells us that three conditions must be met to achieve an O(1) search:
1. The content stored in a storage unit (such as a numeral on a page) must correspond one-to-one with the address of the storage unit (such as the page number).
2. This one-to-one correspondence (for example, the numeral for seven is on page 7) must be known in advance.
3. The storage units must support random reads. Here, "random read" means that the storage units can be read in any order and that every read takes the same amount of time. In contrast, finding a song on a tape is not random: reaching the 5th song takes longer than reaching the first.
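The two models above can be put side by side in a few lines of C#. This is only a minimal sketch for illustration: the class name SearchModels and both method names are made up here and do not appear in the original article.

```csharp
using System;

public static class SearchModels
{
    // O(n) model: scan the words one by one, like reading the book from page one.
    public static bool LinearContains(string[] words, string target)
    {
        foreach (string word in words)
        {
            if (word == target)
                return true;
        }
        return false;
    }

    // O(1) model: the page number is known in advance, so jump straight to it.
    // pages[0] holds page 1 (hypothetical layout chosen for this sketch).
    public static string ReadPage(string[] pages, int pageNumber)
    {
        if (pageNumber < 1 || pageNumber > pages.Length)
            throw new ArgumentOutOfRangeException("pageNumber");
        return pages[pageNumber - 1];
    }
}
```

LinearContains may have to look at every element, while ReadPage always performs a single array access, whatever the size of the array.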

Implementing O(1) search on a computer

First, look at the hardware. A computer's memory supports random access. Judging from its name, RAM (random-access memory), it is even a little proud of this.
Now that the hardware is on our side, we can simulate the accountant's copybook on the computer. The first task is to request storage units from the operating system. There is a small problem here: the addresses of the storage units we get may not run from 1 to 9 but may start at, say, 134456. Fortunately, we don't need to deal with the operating system directly; high-level languages handle these things for us. When we create an array in a high-level language, we are in effect requesting a contiguous block of storage, and the array index is the (abstract) address of each storage unit. In this way our first O(1) container, SingleIntSet, is easy to write. It can only store the 10 digits 0 to 9:

public class SingleIntSet
{
    private object[] _values = new object[10];

    public void Add(int item)
    {
        _values[item] = item;
    }

    public void Remove(int item)
    {
        _values[item] = null;
    }

    public bool Contains(int item)
    {
        if (_values[item] == null)
            return false;
        else
            return (int)_values[item] == item;
    }
}

Test:

static void Main(string[] args)
{
    SingleIntSet set = new SingleIntSet();
    set.Add(3);
    set.Add(7);
    Console.WriteLine(set.Contains(3)); // outputs True
    Console.WriteLine(set.Contains(5)); // outputs False
}

New term: when an integer array is created in a high-level language (for example, int[] values = new int[10]), values[7] can no longer be called "one storage unit", because a storage unit is one byte in size while an int element such as values[7] occupies 4 bytes. So we need a new term: we call values[7] a slot.

SingleIntSet2 (to be honest, I really don't like this name. Who would?!)

A new requirement! We still only need to store 10 numbers, but this time they are 10 to 19 instead of 0 to 9. What should we do? It is easy: implement a mapping function H() between the value stored in a slot and the slot's address:

public class SingleIntSet2
{
    private object[] _values = new object[10];

    private int H(int value)
    {
        return value - 10;
    }

    public void Add(int item)
    {
        _values[H(item)] = item;
    }

    public void Remove(int item)
    {
        _values[H(item)] = null;
    }

    public bool Contains(int item)
    {
        if (_values[H(item)] == null)
            return false;
        else
            return (int)_values[H(item)] == item;
    }
}

Test with numbers in the range 10 to 19:

static void Main(string[] args)
{
    SingleIntSet2 set = new SingleIntSet2();
    set.Add(13);
    set.Add(17);
    Console.WriteLine(set.Contains(13)); // outputs True
    Console.WriteLine(set.Contains(15)); // outputs False
}

Not enough rooms. Should someone sleep in the street?

This time we still have 10 slots, but the numbers range from 0 to 19. How can 20 possible numbers be stored in 10 slots? What else can we do? Let two people share one room. Modify the H() function slightly and leave the rest of the code unchanged:

public class SingleIntSet3
{
    private object[] _values = new object[10];

    private int H(int value)
    {
        if (value >= 0 && value <= 9)
            return value;
        else
            return value - 10;
    }

    // ...
}

Test:

static void Main(string[] args)
{
    SingleIntSet3 set = new SingleIntSet3();
    set.Add(3);
    set.Add(17);
    Console.WriteLine(set.Contains(3));  // outputs True
    Console.WriteLine(set.Contains(17)); // outputs True
    Console.WriteLine(set.Contains(13)); // outputs False
    set.Add(13);
    Console.WriteLine(set.Contains(13)); // outputs True
    Console.WriteLine(set.Contains(3));  // outputs False, but True is the correct answer!
}

The result on the last line is wrong! Two people cannot share one room after all; data does not put up with that kind of hardship. Yet there is no way around the problem: unless 1) we know all 10 inputs in advance, and 2) these 10 inputs never change once they are determined, no H() function, however it is designed, can avoid the situation where two people are assigned the same room. When that happens, we say that a collision has occurred.
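Why no H() can avoid this is a matter of counting: 20 possible keys cannot fit into 10 slots without sharing. The sketch below makes that visible by counting collisions for any mapping function passed in. The class CollisionCounter and its method are hypothetical helpers written for this illustration, not part of the article's code.

```csharp
using System;

public static class CollisionCounter
{
    // Counts how many keys land in a slot that an earlier key has already claimed.
    // h must map every key to an index in 0..slotCount-1.
    public static int CountCollisions(int[] keys, int slotCount, Func<int, int> h)
    {
        bool[] occupied = new bool[slotCount];
        int collisions = 0;
        foreach (int key in keys)
        {
            int slot = h(key);
            if (occupied[slot])
                collisions++;
            else
                occupied[slot] = true;
        }
        return collisions;
    }
}
```

Feeding it all 20 keys from 0 to 19 with 10 slots gives at least 10 collisions whatever function is supplied, because at most 10 keys can ever find an empty slot. With the H() of SingleIntSet3 (value for 0 to 9, value - 10 otherwise) the count is exactly 10.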

Handling collisions with chaining

The simplest way to handle collisions is chaining. Chaining gives each of the colliding occupants a room of their own, but lets them share one public address. For simplicity, each slot of the array can point to a linked list:

public class SingleIntSet4
{
    private object[] _values = new object[10];

    private int H(int value)
    {
        if (value >= 0 && value <= 9)
            return value;
        else
            return value - 10;
    }

    public void Add(int item)
    {
        if (_values[H(item)] == null)
        {
            // First value in this slot: create the list
            LinkedList<int> ls = new LinkedList<int>();
            ls.AddFirst(item);
            _values[H(item)] = ls;
        }
        else
        {
            // Collision: append to the existing list
            LinkedList<int> ls = _values[H(item)] as LinkedList<int>;
            ls.AddLast(item);
        }
    }

    public void Remove(int item)
    {
        LinkedList<int> ls = _values[H(item)] as LinkedList<int>;
        if (ls != null) // an empty slot has no list, so there is nothing to remove
            ls.Remove(item);
    }

    public bool Contains(int item)
    {
        if (_values[H(item)] == null)
        {
            return false;
        }
        else
        {
            LinkedList<int> ls = _values[H(item)] as LinkedList<int>;
            return ls.Contains(item);
        }
    }
}

The test result is as follows:

static void Main(string[] args)
{
    SingleIntSet4 set = new SingleIntSet4();
    set.Add(3);
    set.Add(17);
    Console.WriteLine(set.Contains(3));  // outputs True
    Console.WriteLine(set.Contains(17)); // outputs True
    Console.WriteLine(set.Contains(13)); // outputs False
    set.Add(13);
    Console.WriteLine(set.Contains(13)); // outputs True
    Console.WriteLine(set.Contains(3));  // outputs True
}

How can 2.1 billion people use 10 addresses?

Good. With chaining we have enough rooms to deal with any collision. Still, we want the chance of a collision to be as small as possible, especially once the range of values grows from 0 to 19 into 0 to int.MaxValue. Is there a way to map 2.1 billion values onto 10 slots while keeping collisions to a minimum?

Division hashing

H (k) = k mod m
Here, k is the value stored in the slot and m is the size of the array (fixed at 10 in this example for simplicity). This gives us our first IntSet that covers the whole range of non-negative int values:

public class IntSet
{
    private object[] _values = new object[10];

    private int H(int value)
    {
        return value % 10;
    }

    // The other parts are the same as SingleIntSet4
}

Test how IntSet.H() works:

Console.WriteLine(H(3));  // outputs 3
Console.WriteLine(H(13)); // outputs 3
Console.WriteLine(H(17)); // outputs 7

Only one collision occurred! It works just as well as the hand-written SingleIntSet4.H(). Why is division hashing effective? Once a magic trick is revealed, it always looks plain:
First, if you have not handed your elementary school lessons back to the teacher, you will remember that the remainder of any non-negative number divided by 10 lies between 0 and 9. Using it as an array index, you never have to worry about going out of bounds.
Second, the number of k values for which h(k) = 1 is the same as the number of k values for which h(k) = 2, so no slot is more prone to collisions than another.
Third, the k values for which h(k) = 1 are 1, 11, 21, 31, ..., 101, 111, 121, ... In other words, the k values that collide with each other are scattered. This matters a great deal, because the values stored in an IntSet are often adjacent, such as ages, serial numbers, or ID card numbers.
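The second and third points can be checked by counting. The following sketch tallies how many keys from a consecutive run land in each slot under h(k) = k mod m. DivisionHashDemo and SlotCounts are names invented for this example, and the sketch assumes non-negative keys, as the article does.

```csharp
public static class DivisionHashDemo
{
    // Division method from the text: h(k) = k mod m (non-negative k only).
    public static int H(int k, int m)
    {
        return k % m;
    }

    // Counts how many of the keys first, first+1, ..., first+count-1
    // land in each of the m slots.
    public static int[] SlotCounts(int first, int count, int m)
    {
        int[] counts = new int[m];
        for (int k = first; k < first + count; k++)
        {
            counts[H(k, m)]++;
        }
        return counts;
    }
}
```

For the keys 0 to 999 and m = 10, every slot receives exactly 100 keys. For ten adjacent keys such as the ages 20 to 29, every slot receives exactly one key, so adjacent values never collide with each other.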
Note that m should not be a power of 2, that is, of the form 2^p. If it is, h(k) is simply the lowest p bits of the binary representation of k. Take m = 2^3 = 8 and k = 170 as an example:

h(k) = 170 mod 8
     = (2^7 + 2^5 + 2^3 + 0*2^2 + 2^1 + 0*2^0) mod 2^3
     = (2^4*2^3 + 2^2*2^3 + 2^3 + 0*2^2 + 2^1 + 0*2^0) mod 2^3
     = 0*2^2 + 2^1 + 0*2^0
     = 2
That is, every term except the lowest p bits is divisible by 2^p, so only the lowest p bits survive in the remainder. What is wrong with that? The problem is that we do not want to assume anything about the distribution of k, so we normally want h(k) to depend on all the bits of k rather than only the lowest p bits. Heaven knows whether k will turn out to look like "11010000, 00110000, 10010000, ..." (imagine an idiotic operating system that prefers to allocate object IDs from the high-order bits first; if we used such an ID as k, tragedy would follow).
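A short sketch can confirm that taking the remainder by 2^p is the same as masking off the lowest p bits. The class and method names below are hypothetical, and the sketch assumes non-negative k and p between 0 and 30.

```csharp
public static class PowerOfTwoModulus
{
    // k mod 2^p, computed with the remainder operator (k must be non-negative).
    public static int ModPowerOfTwo(int k, int p)
    {
        return k % (1 << p);
    }

    // The lowest p bits of k, extracted with a bit mask.
    public static int LowBits(int k, int p)
    {
        return k & ((1 << p) - 1);
    }
}
```

Both methods return 2 for k = 170 and p = 3. The three keys from the text, 11010000, 00110000 and 10010000 in binary (208, 48 and 144 in decimal), all end in four zero bits, so with m = 16 they all hash to slot 0. With the prime modulus 17 they hash to 4, 14 and 8 instead.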
To avoid this, m is usually chosen to be a prime. After the user specifies the size of the array, we look up the smallest prime that is not less than that size and use it as the actual value of m. For speed, commonly used primes are pre-filled in a prime table. The new IntSet2 lets the user specify its capacity:

public class IntSet2
{
    private object[] _values;

    public IntSet2(int capacity)
    {
        int size = GetPrime(capacity);
        _values = new object[size];
    }

    private int H(int value)
    {
        return value % _values.Length;
    }

    // Prime number table
    private readonly int[] primes = {
        3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197, 239, 293, 353, 431, 521, 631, 761, 919,
        1103, 1327, 1597, 1931, 2333, 2801, 3371, 4049, 4861, 5839, 7013, 8419, 10103, 12143, 14591,
        17519, 21023, 25229, 30293, 36353, 43627, 52361, 62851, 75431, 90523, 108631, 130363, 156437,
        187751, 225307, 270371, 324449, 389357, 467237, 560689, 672827, 807403, 968897, 1162687, 1395263,
        1674319, 2009191, 2411033, 2893249, 3471899, 4166287, 4999559, 5999471, 7199369 };

    // Determines whether candidate is a prime number
    private bool IsPrime(int candidate)
    {
        if ((candidate & 1) != 0) // candidate is odd
        {
            int limit = (int)Math.Sqrt(candidate);
            for (int divisor = 3; divisor <= limit; divisor += 2) // divisor = 3, 5, 7, ... up to the square root of candidate
            {
                if ((candidate % divisor) == 0)
                    return false;
            }
            return true;
        }
        return (candidate == 2); // 2 is the only even prime
    }

    // Returns the smallest prime that is greater than or equal to min
    private int GetPrime(int min)
    {
        // Look the prime up in the table first
        for (int i = 0; i < primes.Length; i++)
        {
            int prime = primes[i];
            if (prime >= min)
                return prime;
        }
        // min is beyond the table: test every odd number from min upward until a prime is found
        for (int i = (min | 1); i < Int32.MaxValue; i += 2)
        {
            if (IsPrime(i))
                return i;
        }
        return min;
    }

    // The other parts are the same as IntSet
}

Note: the prime table primes and the functions IsPrime() and GetPrime() all come from Hashtable.cs in the .NET Framework 2.0 source code.

Multiplication hashing

H(k) = ⌊m (kA mod 1)⌋
A is a constant greater than 0 and less than 1; for example, A = 2654435769 / 2^32. "kA mod 1" means taking the fractional part of kA. The C# code can look like this:

private readonly double A = 2654435769 / Math.Pow(2, 32);

int H(int value)
{
    return (int)(_values.Length * (value * A % 1));
}

For the history of this magic number, and for how to use bit operations to implement H() faster, see page 138 of "Introduction to Algorithms".
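For readers without the book at hand, here is a rough sketch of the bit-operation variant as it is commonly presented. It assumes a 32-bit word, a non-negative key, and a table size that is a power of two, m = 2^p, which is a different setup from the prime-sized array used above. The names MultiplicationHash, S and Hash are made up for this sketch.

```csharp
using System;

public static class MultiplicationHash
{
    // S = A * 2^32 with A = 2654435769 / 2^32, the constant used in the text.
    private const uint S = 2654435769u;

    // Returns a slot index in 0 .. 2^p - 1 for a table of size m = 2^p.
    // Assumes k >= 0 and 1 <= p <= 31.
    public static int Hash(int k, int p)
    {
        if (k < 0) throw new ArgumentOutOfRangeException("k");
        if (p < 1 || p > 31) throw new ArgumentOutOfRangeException("p");
        unchecked
        {
            // The 32-bit wraparound keeps only the low word of k * S,
            // which corresponds to the fractional part of k * A.
            uint lowWord = (uint)k * S;
            // The top p bits of that word are floor(m * (kA mod 1)).
            return (int)(lowWord >> (32 - p));
        }
    }
}
```

With k = 123456 and p = 14 (so m = 16384) this yields 67: the product 123456 * 2654435769 is 327706022297664, its low 32 bits are 17612864, and the top 14 bits of that word are 67.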
The disadvantage of multiplication hashing is that its distribution is not as uniform as that of division hashing. For m = 100, compare the k values between 0 and 1000 that satisfy h(k) = 1 under each method:

Division hashing, h(k) = k mod 100

k      h(k)   span
1      1      -
101    1      100
201    1      100
301    1      100
401    1      100
501    1      100
601    1      100
701    1      100
801    1      100
901    1      100

Multiplication hashing, h(k) = ⌊100 * (kA mod 1)⌋

k      h(k)   span
34     1      -
123    1      89
178    1      55
267    1      89
411    1      144
500    1      89
644    1      144
733    1      89
788    1      55
877    1      89
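The two lists can be regenerated with a few lines of code. The sketch below is a hypothetical helper (HashSpread and KeysHashingTo are not names from the article) that collects every k in a range whose hash equals a given slot, using the same double-based multiplication formula shown earlier.

```csharp
using System;
using System.Collections.Generic;

public static class HashSpread
{
    private static readonly double A = 2654435769 / Math.Pow(2, 32);

    // Division method: h(k) = k mod m.
    public static int Division(int k, int m)
    {
        return k % m;
    }

    // Multiplication method: h(k) = floor(m * (kA mod 1)).
    public static int Multiplication(int k, int m)
    {
        return (int)(m * (k * A % 1));
    }

    // Collects every k in 0..maxKey (inclusive) whose hash equals target.
    public static List<int> KeysHashingTo(int target, int m, int maxKey, Func<int, int, int> hash)
    {
        List<int> keys = new List<int>();
        for (int k = 0; k <= maxKey; k++)
        {
            if (hash(k, m) == target)
                keys.Add(k);
        }
        return keys;
    }
}
```

Calling KeysHashingTo(1, 100, 1000, HashSpread.Division) returns the evenly spaced keys 1, 101, ..., 901, while passing HashSpread.Multiplication returns 34, 123, 178, 267, 411, 500, 644, 733, 788 and 877, whose spacing varies between 55, 89 and 144.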

So far, three shortcomings remain:
1. Only positive integers are supported.
2. Chaining is simple and direct, but it is not the only way to deal with collisions. The Hashtable of the .NET Framework uses a better method, open addressing.
3. The container size can only be specified when the container is created; it cannot grow automatically.

Let's catch our breath here and continue the fight in the next article.
