Data Structures: Hashing


Note: These are my study notes for Data Structures and Algorithm Analysis (third edition) by Clifford A. Shaffer. The code is based on the book's sample code.

Hashing Method

Mapping key values to positions in an array in order to access records is called hashing.
A function that maps key values to positions is called a hash function, usually denoted h.
The array that holds the records is called the hash table, denoted HT.
A position in the hash table is called a slot.
The number of slots in the hash table HT is denoted by the variable M, and the slots are numbered from 0 to M-1.
Hashing is generally not suitable for applications that allow multiple records with the same key value, nor for range searches.
For a hash function h and two key values k1 and k2, if h(k1) = b = h(k2), where b is a slot in the table, then k1 and k2 are said to collide at slot b under h.
Collisions are resolved by either open hashing or closed hashing.

Hash Function

Technically, any function that maps every possible key value to a slot in the hash table is a hash function.
Here are some common hash functions.

Mid-Square Method

The mid-square method is a hash method for numeric keys. It squares the key, takes the middle few digits of the result, and uses them as the slot position.
This works well because most or all digits of the key contribute to the result.
For example, consider four-digit base-10 keys hashed into a table of length 100. If the key is 4567, then

4567 * 4567 = 20857489
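
As a rough sketch, the same calculation can be written in C++. The function name midSquare is hypothetical, and the code assumes four-digit base-10 keys (so the square has at most eight digits) and a table of length 100.

```cpp
#include <cassert>

// Mid-square hash for four-digit base-10 keys and a table of length 100.
// Assumption: key is in [0, 9999], so key * key has at most 8 digits.
int midSquare(int key)
{
    assert(key >= 0 && key <= 9999);
    long long square = static_cast<long long>(key) * key;
    // Treat the square as an 8-digit number: drop the lowest 3 digits,
    // then keep the next 2, which are the 4th and 5th digits from the left.
    return static_cast<int>((square / 1000) % 100);
}
```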

The middle two digits are 57. (The number of digits taken is log10(100) = 2.)

Folding Method

This is a hash method for strings. It sums the ASCII values of the characters in the string and takes the result modulo M.

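int h(const char* str)
{
    int sum = 0;
    // Add up the character codes until the terminating '\0'.
    for (int i = 0; str[i] != '\0'; ++i)
        sum += static_cast<unsigned char>(str[i]);  // unsigned char keeps each term non-negative
    return sum % M;
}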

A drawback of this approach is that if the sums are small compared with M, the keys are distributed poorly, since only the low-numbered slots are used.

Interpreting the String as Unsigned Integers

I could not find a name for this method, so this is what I will call it.
This method interprets the string as a sequence of 4-byte unsigned integers, sums them, and takes the result modulo M.

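#include <cstring>

int sfold(const char* str)
{
    unsigned int sum = 0;
    std::size_t total = std::strlen(str);
    std::size_t len = total / 4;  // number of complete 4-byte chunks
    for (std::size_t i = 0; i < len; ++i) {
        unsigned int chunk;
        // memcpy avoids the alignment and aliasing problems of casting char* to unsigned int*.
        std::memcpy(&chunk, str + i * 4, 4);
        sum += chunk;
    }

    // Fold in the remaining 0 to 3 characters, zero-padded to 4 bytes.
    std::size_t extra = total - len * 4;
    unsigned char temp[4] = {0, 0, 0, 0};
    std::memcpy(temp, str + len * 4, extra);
    unsigned int last;
    std::memcpy(&last, temp, 4);
    sum += last;
    return static_cast<int>(sum % M);
}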

Using an unsigned integer avoids a negative result when taking the modulus.
If the sum grows too large, it overflows. For a hash function this does not matter, so we simply let it wrap around.

Open Hashing

Although the goal of a hash function is to minimize collisions, some collisions are unavoidable.
Collision resolution techniques fall into two categories: open hashing (also called separate chaining) and closed hashing (also called open addressing).
Open hashing resolves a collision by storing the colliding record outside the table, while closed hashing stores it in another empty slot inside the table.
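
To illustrate the idea, here is a minimal sketch of open hashing in which every slot holds a linked list of keys. It is not the implementation discussed in this article: the class name ChainedTable, the int key type, and the placeholder hash function (key modulo the table size, for non-negative keys) are all assumptions made for the example.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Open hashing (separate chaining) for int keys.
// Each slot holds a linked list of the keys that hash to it.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t m) : slots_(m) {}

    // Returns false if the key is already present.
    bool insert(int key)
    {
        std::list<int>& chain = slots_[h(key)];
        for (int k : chain)
            if (k == key) return false;
        chain.push_back(key);
        return true;
    }

    bool contains(int key) const
    {
        for (int k : slots_[h(key)])
            if (k == key) return true;
        return false;
    }

    // Number of keys stored in the chain of a given slot.
    std::size_t chainLength(std::size_t slot) const { return slots_[slot].size(); }

private:
    // Placeholder hash function; assumes non-negative keys.
    std::size_t h(int key) const { return static_cast<std::size_t>(key) % slots_.size(); }

    std::vector<std::list<int>> slots_;
};
```

With a table of 11 slots, the keys 5 and 16 both hash to slot 5 and simply share that slot's chain, so no other slot is affected by the collision.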

Open hashing looked easy to implement, so I implemented it.
In my implementation, M is 11, and the hash function is as follows:

    int h(const Key& k) const
    {
        return k * 7 % M;
    }

To be honest, this is not a good hash function.

Closed Hashing

Closed hashing stores all records directly in the hash table.
There are several closed hashing methods: bucket hashing, linear probing, quadratic probing, and double hashing.

Bucket Hashing

Bucket hashing groups the slots of the hash table into buckets.
The M slots of the table are divided into B buckets, each containing M/B slots.
The hash function assigns each record to the first slot of a bucket. If that slot is already occupied, the bucket is searched sequentially until an empty slot is found.
If the bucket has no empty slot, the record is assigned to an overflow bucket with unlimited capacity.

Consider inserting the following keys in order:

9 30 27 4 8

into a hash table with 6 slots and 3 buckets, using the hash function

int h(int i)
{
    return i % B;
}
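
To make the example concrete, here is a minimal sketch of bucket hashing. The struct name BucketTable, the EMPTY marker, and the use of a std::vector as the overflow bucket are assumptions for illustration, and keys are assumed to be non-negative.

```cpp
#include <vector>

const int EMPTY = -1;  // marker for an unused slot (assumes non-negative keys)

struct BucketTable {
    int M, B;                   // number of slots and number of buckets
    std::vector<int> slots;     // the M slots, grouped into B buckets of M/B slots
    std::vector<int> overflow;  // overflow bucket with unlimited capacity

    BucketTable(int m, int b) : M(m), B(b), slots(m, EMPTY) {}

    int h(int key) const { return key % B; }  // picks a bucket, as in the text

    void insert(int key)
    {
        int size = M / B;           // slots per bucket
        int first = h(key) * size;  // first slot of the chosen bucket
        for (int i = first; i < first + size; ++i) {
            if (slots[i] == EMPTY) {
                slots[i] = key;
                return;
            }
        }
        overflow.push_back(key);    // the bucket is full
    }
};
```

Assuming the insertion order is 9, 30, 27, 4, 8, a BucketTable(6, 3) ends up with 9 and 30 in bucket 0, 4 in the first slot of bucket 1, and 8 in the first slot of bucket 2. The key 27 also hashes to bucket 0, which is already full, so it goes to the overflow bucket.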

I think this should be clear.
A simple variant of bucket hashing first hashes the key to a home slot anywhere in the table, as if there were no buckets. If the home slot is occupied, the other slots in the same bucket are searched. If the bucket has no empty slot, the record goes into the overflow bucket.

The hash function then becomes:

int h(int i)
{
    return i % M;
}

Linear Probing

Linear probing is a more commonly used closed hashing method. It does not use buckets; a record may be stored in any empty slot of the hash table.
The collision resolution strategy generates a sequence of slots that may hold the record. The first slot in the sequence is the home slot for the key. If the home slot is occupied, the next slot in the sequence is tried, and so on until the record is stored.
This sequence of slots is called the probe sequence generated by the collision resolution strategy.
The probe sequence is generated by a probe function, denoted p.
Note that the probe function returns an offset from the home position, not a slot in the hash table.
The probe function for linear probing looks like this:

int p (int k, int i)
{
    return a*i+b;
}

Here i is the probe number (the i-th probe), k is the key, and a and b are constants.
The probe function is used as follows:

return (h(k) + p(k, i)) % M;

For the probe sequence to visit every slot in the table, a must be relatively prime to M.
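
The effect of the constant a can be checked with a small sketch. The helper names probeSequence and distinctSlots are hypothetical, and for simplicity the sketch assumes b = 0 and passes a as an extra parameter.

```cpp
#include <vector>

// Probe function for linear probing with step a and b = 0:
// returns an offset from the home slot, not a slot index.
int p(int /*k*/, int i, int a) { return a * i; }

// The M slots visited for a key whose home slot is `home`.
std::vector<int> probeSequence(int home, int M, int a)
{
    std::vector<int> seq;
    for (int i = 0; i < M; ++i)
        seq.push_back((home + p(0, i, a)) % M);
    return seq;
}

// Number of distinct slots in a probe sequence.
int distinctSlots(const std::vector<int>& seq, int M)
{
    std::vector<bool> seen(M, false);
    int count = 0;
    for (int s : seq)
        if (!seen[s]) { seen[s] = true; ++count; }
    return count;
}
```

For M = 6 and home slot 3, the steps a = 1 and a = 5 (both relatively prime to 6) reach all 6 slots, while a = 2 reaches only slots 3, 5, 1 and a = 3 reaches only slots 3 and 0.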

Linear probing leads to primary clustering.
Consider inserting the following keys in order:

9 30 27 4 8

Assume the simplest probe function, p(k, i) = i, with M = 6.
Then the probe sequence for 9 is:

3 4 5 0 1 2

The probe sequence for 27 is the same.
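
The example can be traced with a short sketch. It assumes the hash function h(k) = k % M, non-negative keys, and an EMPTY marker for unused slots; the function name insertLinear is hypothetical.

```cpp
#include <vector>

const int EMPTY = -1;  // marker for an unused slot (assumes non-negative keys)

// Inserts key with linear probing, p(k, i) = i, and returns the slot used,
// or -1 if the table is full. Assumes the hash function h(k) = k % M.
int insertLinear(std::vector<int>& table, int key)
{
    int M = static_cast<int>(table.size());
    int home = key % M;
    for (int i = 0; i < M; ++i) {
        int slot = (home + i) % M;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

Starting from a table of 6 empty slots and inserting 9, 30, 27, 4, 8 in that order places them in slots 3, 0, 4, 5, 2. The key 27 is pushed from its home slot 3 to slot 4, which in turn pushes 4 from its home slot 4 to slot 5.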
As more records are inserted, most of them end up clustered together.
This tendency of records to gather in one region is called primary clustering.
A good probe function should make the probe sequences diverge.
Primary clustering can be avoided by using quadratic probing or pseudo-random probing.

Quadratic Probing

The probe function for quadratic probing is as follows:

int p (int k, int i)
{
    return a*i*i+b*i+c;
}

Here a, b, and c are constants.
The drawback of quadratic probing is that, in some cases, only certain slots are ever probed.
Consider M = 3 and p(k, i) = i*i.
Then a key that hashes to slot 0 will only ever probe slots 0 and 1, never slot 2.
However, it is possible to get good results at low cost.
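
A small sketch can count how many slots a given probe function actually reaches. The names reachableSlots, square, and triangular are hypothetical, and the home slot is assumed to be 0.

```cpp
#include <functional>
#include <set>

// Counts the distinct slots reached from home slot 0 in the first 2*M probes,
// for a table of size M and a probe function p(i) (an offset from the home slot).
int reachableSlots(int M, const std::function<int(int)>& p)
{
    std::set<int> seen;
    for (int i = 0; i < 2 * M; ++i)
        seen.insert(p(i) % M);
    return static_cast<int>(seen.size());
}

int square(int i) { return i * i; }                // p(k, i) = i*i
int triangular(int i) { return (i * i + i) / 2; }  // p(k, i) = (i*i + i)/2
```

With these definitions, reachableSlots(3, square) returns 2, which matches the example above. The same helper can be used to try other table sizes and probe functions.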

When the table length is a prime number and the probe function is p(k, i) = i*i, at least half of the slots in the table can be reached.
When the table length is a power of 2 and the probe function is p(k, i) = (i*i + i)/2, every slot in the table can be reached by the probe sequence.

Pseudo-Random Probing

In pseudo-random probing, the i-th slot in the probe sequence is (h(k) + r_i) mod M, where r_i is the i-th value in a random permutation of the numbers from 1 to M-1.
All insertions and searches must use the same pseudo-random sequence.
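
One way to build such a sequence is sketched below. The function names makeOffsets and probeSlot are hypothetical, the fixed seed is an assumption that keeps the sequence identical across insertions and searches, and r_0 is set to 0 so that the first probe lands on the home slot.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Builds the offsets r_0 .. r_{M-1} for pseudo-random probing:
// r_0 = 0 (the home slot), followed by a shuffled permutation of 1 .. M-1.
// A fixed seed is used so that insertion and search see the same sequence.
std::vector<int> makeOffsets(int M, unsigned seed)
{
    std::vector<int> r(M);
    std::iota(r.begin(), r.end(), 0);           // 0, 1, ..., M-1
    std::mt19937 gen(seed);
    std::shuffle(r.begin() + 1, r.end(), gen);  // leave r[0] = 0 in place
    return r;
}

// Slot examined on the i-th probe for a key with home slot `home`.
int probeSlot(int home, int i, const std::vector<int>& r)
{
    int M = static_cast<int>(r.size());
    return (home + r[i]) % M;
}
```

Because the offsets are a permutation of 0 to M-1, the probe sequence for any home slot visits every slot exactly once.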

Although quadratic probing and pseudo-random probing solve the primary clustering problem, keys that hash to the same home slot still follow the same probe sequence and so remain clustered. This problem is called secondary clustering.
Secondary clustering can be avoided by using double hashing.

Double Hashing

Double hashing uses a probe function of the following form:

int p (int k, int i)
{
    return i*h2 (k);
}

Here h2 is a second hash function.
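
As an illustration, the sketch below builds full probe sequences with double hashing. The table size M = 11 (a prime), the home-slot function h1(k) = k % M, and the second hash function h2(k) = 1 + k % (M - 1) are all assumptions chosen for the example, as is the helper name probeSequence.

```cpp
#include <vector>

const int M = 11;  // assumed table size, chosen to be prime

int h1(int k) { return k % M; }            // home slot
int h2(int k) { return 1 + k % (M - 1); }  // step size, always in 1 .. M-1

// Probe function from the text: offset of the i-th probe.
int p(int k, int i) { return i * h2(k); }

// The full probe sequence for key k (assumes k >= 0).
std::vector<int> probeSequence(int k)
{
    std::vector<int> seq;
    for (int i = 0; i < M; ++i)
        seq.push_back((h1(k) + p(k, i)) % M);
    return seq;
}
```

The keys 3 and 14 share home slot 3, but they step by 4 and 5 respectively, so their probe sequences diverge from the second probe onward (slot 7 versus slot 8).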
A good double hashing implementation should ensure that every probe sequence constant is relatively prime to the table size M.
One way is to choose M to be a prime number and have h2 return a value in the range 1 <= h2(k) <= M-1.
Another way is to set M = 2^m for some value m and have h2 return an odd value between 1 and 2^m.

Summary of Closed Hashing

The extra search cost of each new insertion rises sharply once the hash table is about half full.
If access patterns are also taken into account, records should ideally be ordered along each probe sequence by how frequently they are accessed.

Deletion

When deleting a record, two points must be considered: deletion must not hinder later searches, and the freed slot should be reusable by later insertions.

Both requirements can be met by placing a special mark in the slot of the deleted record. This mark is called a tombstone.
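
A minimal sketch of tombstones is shown below. It assumes non-negative int keys, the hash function h(k) = k % M, and linear probing with p(k, i) = i. The names findSlot, insertKey, and removeKey and the two marker values are hypothetical.

```cpp
#include <vector>

const int EMPTY = -1;      // slot never used
const int TOMBSTONE = -2;  // slot whose record was deleted
// Assumes non-negative keys, h(k) = k % M, and linear probing p(k, i) = i.

// Returns the slot holding key, or -1 if absent. Tombstones do not stop the search.
int findSlot(const std::vector<int>& table, int key)
{
    int M = static_cast<int>(table.size());
    for (int i = 0; i < M; ++i) {
        int slot = (key % M + i) % M;
        if (table[slot] == EMPTY) return -1;
        if (table[slot] == key) return slot;
    }
    return -1;
}

// Inserts key into the first tombstone or empty slot on its probe sequence.
// Returns false if key is already present or no slot is free.
bool insertKey(std::vector<int>& table, int key)
{
    if (findSlot(table, key) != -1) return false;
    int M = static_cast<int>(table.size());
    for (int i = 0; i < M; ++i) {
        int slot = (key % M + i) % M;
        if (table[slot] == EMPTY || table[slot] == TOMBSTONE) {
            table[slot] = key;
            return true;
        }
    }
    return false;
}

// Replaces key with a tombstone. Returns false if key is absent.
bool removeKey(std::vector<int>& table, int key)
{
    int slot = findSlot(table, key);
    if (slot == -1) return false;
    table[slot] = TOMBSTONE;
    return true;
}
```

For example, in a table of 6 slots, inserting 9 and then 27 puts them in slots 3 and 4. After 9 is removed, a search for 27 still passes over the tombstone in slot 3 and finds the key in slot 4, and a later insertion of 15 (home slot 3) reuses the tombstone slot.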

If you do not want to use tombstones, you can instead do a local reorganization at deletion time, or periodically reorganize the whole hash table.

I implemented both open hashing and closed hashing for this article.
The code can be found on my GitHub.