Search Algorithms: Hash Tables

Hashing differs from sequential search, binary search, binary search trees, and B-tree search. It uses direct addressing instead of key comparison. Under ideal conditions, the desired key is found without any comparison, so the expected search time is O(1).

**Hash table concepts**

1. Hash tables

Let U be the set of all possible keys (the universe for short), and let K be the set of keys actually stored, where |K| is much smaller than |U|.

The hash method uses a function h to map U onto the indices of a table T[0..m-1], where m = O(|U|). A key in U is the argument, and the value of h is the storage address of the corresponding node, so that a search can finish in O(1) time.

Where:

① h: U → {0, 1, 2, ..., m-1} is called the hash function. It compresses the range of indices to be handled from |U| values down to m, which reduces the space overhead.

② T is the hash table.

③ h(Ki) (Ki ∈ U) is the storage address of the node whose key is Ki, also known as the hash value or hash address.

④ Storing a node in the hash table at the hash address of its key is called hashing.

**3. Hash collisions**

(1) Collisions

When two different keys have the same hash value, they are mapped to the same table slot. This is called a collision (or conflict), and the colliding keys are called synonyms.

In the example, K2 ≠ K5 but h(K2) = h(K5), so K2 and K5 have the same storage address.

(2) Conditions for avoiding collisions entirely

The best way to handle collisions is to avoid them. That requires two conditions:

① |U| ≤ m

② An appropriately chosen hash function.

This only works when |U| is small and the keys are known in advance, in which case h can be designed carefully enough to avoid collisions completely.

(3) Collisions cannot be completely avoided

In general h is a compressing map: although |K| ≤ m, |U| > m, so no matter how h is designed, at least two keys in U share an address and collisions cannot be ruled out. The design of h can therefore only try to minimize collisions, and a collision resolution method must also be chosen so that colliding synonyms can still be stored in the table.

(4) Factors affecting collisions

Besides h, how often collisions occur also depends on how full the table is.

Let m be the table length and n the number of nodes stored in the table. Then α = n/m is the load factor of the hash table. The larger α is, the fuller the table and the greater the chance of a collision. Usually α ≤ 1 is chosen.

**Construction of hash functions**

1. Two criteria for choosing a hash function: simple and uniform

Simple means that the hash function is easy and fast to compute.

Uniform means that the hash function maps any key in the key set to every position of the table with equal probability. In other words, the hash function spreads the subset K randomly and evenly over the address set {0, 1, ..., m-1}, which minimizes collisions.

**2. Common Hash Functions**

For simplicity, the keys are assumed to be natural numbers.

(1) Mid-square method

Method: square the key to magnify the differences between similar keys, then take a number of middle digits of the square, chosen according to the table length, as the hash value. Because the middle digits of a product depend on every digit of the multiplier, the resulting hash addresses are fairly uniform.

[Example] Squaring the keys (0100, 0110, 1010, 1001, 0111) gives

(0010000, 0012100, 1020100, 1002001, 0012321)

If the table length is 1000, the middle three digits can be used as the hash addresses:

(100, 121, 201, 020, 123).

The corresponding hash function is easy to implement in C:

    int Hash(int key) {      // assumes key is a four-digit integer
        key *= key;          // square the key
        key /= 100;          // drop the last two digits
        return key % 1000;   // the middle three digits are the hash address
    }

(2) Division method

This is the simplest and most commonly used method. The key is divided by the table length m and the remainder is taken as the hash address, that is, h(key) = key % m

The crucial step is choosing m: it should make the hash value depend on as many digits of the key as possible. m is preferably a prime number.

[Example] If m is a power of the radix of the keys, the address is just the low-order digits of the key and is independent of the high-order digits. As a result, all keys with the same low-order digits but different high-order digits are synonyms.

[Example] If the keys are decimal integers (radix 10) and m = 100, then 159, 259, 359, and so on are all synonyms.
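The example above can be checked with a few lines of C. This is only a sketch: the function name `division_hash` is hypothetical, and the prime 97 is an illustrative alternative table length that does not appear in the original text.

```c
#include <assert.h>

/* Division method: h(key) = key % m. */
unsigned division_hash(unsigned key, unsigned m)
{
    assert(m > 0);      /* the table length must be positive */
    return key % m;
}
```

With m = 100, `division_hash(159, 100)`, `division_hash(259, 100)`, and `division_hash(359, 100)` all return 59, because only the last two decimal digits matter. With the prime m = 97 the same keys map to 62, 65, and 68.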

(3) Multiplication method

This method takes two steps: first multiply the key by a constant A (0 < A < 1) and extract the fractional part of key·A, then multiply that by m and round down. That is:

h(key) = ⌊m · (key·A − ⌊key·A⌋)⌋

The main advantage of this method is that the choice of m is no longer as critical as in the division method. For example, m can be a power of 2. Although the method works for any value of A, some values work better than others. Knuth recommends A ≈ (√5 − 1)/2 ≈ 0.6180339.

The C code of this function is:

    int Hash(int key) {
        double d = key * A;               // A and M are assumed to be defined already
        return (int)(M * (d - (int)d));   // the (int) cast truncates the expression after it to an integer
    }
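A self-contained variant of the function above might look as follows. It is a sketch under stated assumptions: the table length `TABLE_SIZE` = 1024 and the constant A = (√5 − 1)/2 are chosen here for illustration, and the name `mult_hash` is hypothetical.

```c
#include <assert.h>

#define TABLE_SIZE 1024u   /* assumed table length m (a power of 2) */

/* Multiplication method: h(key) = floor(m * frac(key * A)). */
unsigned mult_hash(unsigned key)
{
    const double A = 0.6180339887498949;               /* assumed constant, (sqrt(5) - 1) / 2 */
    double d = key * A;
    double frac = d - (double)(unsigned long long)d;   /* fractional part of key * A */
    unsigned addr = (unsigned)(TABLE_SIZE * frac);     /* scale to 0..m-1 and truncate */
    assert(addr < TABLE_SIZE);                         /* holds because frac is in [0, 1) */
    return addr;
}
```

Consecutive keys are scattered widely: under these assumptions the keys 1, 2, and 3 map to 632, 241, and 874.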

(4) Random number method

Choose a random function and take its value on the key as the hash address, that is

h(key) = random(key)

Here random is a pseudo-random function, and its value must lie between 0 and m-1.
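For such a function to serve as a hash, the same key must always produce the same address. One way to sketch this is to use the key as the seed of a single pseudo-random step. The helper `pseudo_random`, the function `random_hash`, and the generator constants (a widely used 32-bit linear congruential pair) are assumptions for illustration, not part of the original text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: one step of a 32-bit linear congruential generator,
   seeded with the key. The arithmetic wraps modulo 2^32. */
static uint32_t pseudo_random(uint32_t seed)
{
    return seed * 1664525u + 1013904223u;
}

/* Random number method: h(key) = random(key), reduced to 0..m-1. */
unsigned random_hash(uint32_t key, unsigned m)
{
    assert(m > 0);
    /* The high-order bits of this kind of generator are better mixed than
       the low-order bits, so the low 16 bits are dropped before reduction. */
    return (pseudo_random(key) >> 16) % m;
}
```

Because the result depends only on the key, repeated calls such as `random_hash(42, 13)` always return the same address below 13.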

**Collision handling methods**

There are usually two ways to handle collisions: open addressing and chaining. The former stores all nodes in the hash table T[0..m-1] itself. The latter usually links the nodes that are synonyms of one another into a singly linked list and stores the head pointer of that list in T[0..m-1].

**1. Open addressing**

(1) How open addressing resolves collisions

When a collision occurs, a probing technique generates a probe sequence in the hash table. The slots are examined one by one along that sequence until the given key is found or an open address is reached (that is, an empty slot; on insertion, the new node can be stored there). On a search, reaching an open address means that the key is not in the table, so the search fails.

Note:

① When a hash table is built with open addressing, all slots (strictly speaking, the keys stored in the slots) must be set to empty before the table is built.

② How an empty slot is represented depends on the application.

[Example] When the keys are non-negative integers, -1 can represent an empty slot. When the keys are strings, the empty string can.

In short, an empty slot should be marked with a key value that never occurs.

(2) General form of open addressing

The general form of open addressing is: hi = (h(key) + di) % m, 1 ≤ i ≤ m-1

Where:

① h(key) is the hash function, di is the increment sequence, and m is the table length.

② h(key) is the initial probe position, followed by h1, h2, ..., hm-1. That is, h(key), h1, h2, ..., hm-1 form the probe sequence.

③ If i starts from 0 and d0 = 0, then h0 = h(key) and:

hi = (h(key) + di) % m, 0 ≤ i ≤ m-1

The probe sequence can then be abbreviated as hi (0 ≤ i ≤ m-1).

(3) Load factor requirement for open addressing

Open addressing requires the load factor of the hash table to satisfy α ≤ 1. In practice, α is usually chosen between 0.5 and 0.9.

(4) Methods for generating the probe sequence

Depending on how the probe sequence is formed, open addressing is divided into linear probing, quadratic probing, and double hashing.

**① Linear probing**

The basic idea of this method is:

Treat the hash table T[0..m-1] as a circular array. If the initial probe address is d (that is, h(key) = d), the longest probe sequence is:

d, d+1, d+2, ..., m-1, 0, 1, ..., d-1

That is, probing starts at T[d], continues with T[d+1], ..., T[m-1], then wraps around to T[0], T[1], ..., until T[d-1] has been probed.

**Probing ends in one of three cases:**

(1) The slot being probed is empty: the search fails (on insertion, key is written there).

(2) The slot being probed contains key: the search succeeds, which means that an insertion fails.

(3) T[d-1] has been probed without finding an empty slot or key: both search and insertion fail (the table is full).

**In the general form of open addressing, the probe sequence of linear probing is:**

hi = (h(key) + i) % m, 0 ≤ i ≤ m-1 // that is, di = i

**Building a hash table with linear probing**

[Example 9.1] Given the keys (26, 36, 41, 38, 44, 15, 68, 12, 06, 51), construct the hash function with the division method and build the hash table for these keys, resolving collisions with linear probing.

Answer: To reduce collisions, the load factor is usually chosen so that α < 1. Here the number of keys is n = 10. Taking m = 13 gives α ≈ 0.77, and the hash table is T[0..12]. The hash function is h(key) = key % 13.

The hash addresses of the key sequence under this hash function are (0, 10, 2, 12, 5, 2, 3, 12, 6, 12).

When the first five keys are inserted, their addresses are all open, so they go directly into T[0], T[10], T[2], T[12], and T[5].

When the 6th key, 15, is inserted, its hash address 2 (h(15) = 15 % 13 = 2) is already occupied by 41 (15 and 41 are synonyms). Probing h1 = (2 + 1) % 13 = 3 finds an open address, so 15 is placed in T[3].

When the 7th key, 68, is inserted, its hash address 3 is already occupied by the non-synonym 15, so it is placed in T[4].

When the 8th key, 12, is inserted, hash address 12 is occupied by its synonym 38. Probing h1 = (12 + 1) % 13 = 0 finds T[0] occupied by 26. Probing h2 = (12 + 2) % 13 = 1 finds an open address, so 12 is placed in T[1].

Similarly, the 9th key, 06, goes directly into T[6]. When the last key, 51, is inserted, the probed addresses 12, 0, 1, ..., 6 are all occupied, so 51 is placed in T[7].
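The construction walked through above can be replayed in a few lines of C. This is a minimal sketch, assuming the key sequence 26, 36, 41, 38, 44, 15, 68, 12, 6, 51, which is consistent with the addresses worked out in the example. The names `lp_insert` and `lp_build_example` are hypothetical, and the insert routine assumes the key is not already in the table.

```c
#include <assert.h>

#define M   13      /* table length of the example */
#define NIL (-1)    /* empty-slot marker; keys are non-negative */

/* Insert key with linear probing. Returns the slot used, or -1 if the
   table is full. Assumes key is not already present. */
int lp_insert(int table[M], int key)
{
    assert(key >= 0);
    for (int i = 0; i < M; i++) {
        int pos = (key % M + i) % M;   /* h_i = (h(key) + i) % m */
        if (table[pos] == NIL) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}

/* Clear the table, insert the assumed key sequence in order, and
   return the slot in which the last key (51) ends up. */
int lp_build_example(int table[M])
{
    static const int keys[] = {26, 36, 41, 38, 44, 15, 68, 12, 6, 51};
    int last = -1;
    for (int i = 0; i < M; i++)
        table[i] = NIL;
    for (int i = 0; i < 10; i++)
        last = lp_insert(table, keys[i]);
    return last;
}
```

Under these assumptions `lp_build_example` returns 7, and the table holds 26, 12, 41, 15, 68, 44, 6, 51 in slots 0 to 7, 36 in slot 10, and 38 in slot 12.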


**Clustering**

With linear probing, once positions i, i+1, ..., i+k of the table are occupied, a new node whose hash address is i, i+1, ..., i+k, or i+k+1 is inserted at position i+k+1. This phenomenon, in which nodes with different hash addresses compete for the same subsequent address, is called clustering. It puts non-synonyms into the same probe sequence, which lengthens the probe sequences and therefore the search time. A poor hash function or a large load factor makes clustering worse.

[Example] In Example 9.1, h(15) = 2 and h(68) = 3, so 15 and 68 are not synonyms. However, resolving the collision between 15 and 41 put 15 into T[3] first, so when 68 is inserted, two non-synonyms that should not have collided do collide.

To reduce clustering, the probe sequence should jump around the whole table instead of walking through consecutive addresses in order (which amounts to sequential search), as linear probing does.

**② Quadratic probing**

The probe sequence of quadratic probing is:

hi = (h(key) + i*i) % m, 0 ≤ i ≤ m-1 // that is, di = i²

That is, the probe sequence is d = h(key), d + 1², d + 2², ..., each taken modulo m.

The drawback of this method is that it does not easily probe the whole table.
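The drawback can be made concrete by counting how many distinct slots the quadratic sequence actually reaches. The function below is an illustrative sketch. Its name `quadratic_coverage` and the limit of 64 slots are assumptions made to keep the example small.

```c
#include <assert.h>

/* Count the distinct slots visited by h_i = (d + i*i) % m for 0 <= i < m. */
int quadratic_coverage(int d, int m)
{
    assert(d >= 0 && m > 0 && m <= 64);   /* small tables only in this sketch */
    int visited[64] = {0};
    int count = 0;
    for (int i = 0; i < m; i++) {
        int pos = (d + i * i) % m;
        if (!visited[pos]) {
            visited[pos] = 1;
            count++;
        }
    }
    return count;
}
```

For a table of length 13 the sequence reaches only 7 of the 13 slots, whatever the start address, and for length 16 it reaches only 4, so an insertion can fail even though empty slots remain.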

**③ Double hashing**

This is one of the best open addressing methods. Its probe sequence is:

hi = (h(key) + i*h1(key)) % m, 0 ≤ i ≤ m-1 // that is, di = i*h1(key)

That is, the probe sequence is:

d = h(key), (d + h1(key)) % m, (d + 2h1(key)) % m, ...

The method uses two hash functions, h(key) and h1(key), hence the name double hashing.

Note:

h1(key) can be defined in many ways, but whichever is used, the value of h1(key) must be relatively prime to m so that the addresses of colliding synonyms are spread evenly over the whole table. Otherwise the synonym addresses may cycle through only part of the table.

[Example] If m is prime, any value of h1(key) from 1 to m-1 is relatively prime to m, so it can simply be defined as:

h1(key) = key % (m-2) + 1

[Example] For Example 9.1, we can take h(key) = key % 13 and h1(key) = key % 11 + 1.

[Example] If m is a power of 2, h1(key) can be any odd number between 1 and m-1.
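The effect of keeping the step relatively prime to the table length can be checked directly. The sketch below uses the pair h(key) = key % 13 and h1(key) = key % 11 + 1 from the example. The function names with the `dh_` prefix are hypothetical.

```c
#include <assert.h>

#define M 13   /* table length of the example, a prime */

/* Primary hash function: division method. */
int dh_h(int key)  { return key % M; }

/* Secondary hash function: always in 1..M-2, hence relatively prime to M. */
int dh_h1(int key) { return key % (M - 2) + 1; }

/* i-th probe address: h_i = (h(key) + i * h1(key)) % M. */
int dh_probe(int key, int i)
{
    assert(key >= 0 && i >= 0);
    return (dh_h(key) + i * dh_h1(key)) % M;
}

/* Return 1 if the M probes for key visit every slot exactly once, else 0. */
int dh_covers_table(int key)
{
    int visited[M] = {0};
    for (int i = 0; i < M; i++) {
        int pos = dh_probe(key, i);
        if (visited[pos])
            return 0;          /* a slot repeated before all were seen */
        visited[pos] = 1;
    }
    return 1;
}
```

For the key 51, the step is 51 % 11 + 1 = 8, so the probe sequence starts 12, 7, 2, and `dh_covers_table(51)` returns 1: all 13 slots are reached.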

**2. Chaining**

(1) How chaining (the zipper method) resolves collisions

Chaining resolves collisions by linking all nodes that are synonyms into the same singly linked list. If the chosen table length is m, the hash table can be defined as an array T[0..m-1] of m head pointers. Every node whose hash address is i is inserted into the singly linked list whose head pointer is T[i]. All entries of T are initially null pointers. With chaining, the load factor α can exceed 1, but usually α ≤ 1.

[Example 9.2] Given the same keys and hash function as in Example 9.1, build the hash table for these keys using chaining to resolve collisions.

Answer: As in Example 9.1, the table length is 13, so the hash function is h(key) = key % 13 and the hash table is T[0..12].

Note:

A key with h(key) = i can be inserted at either the head or the tail of the i-th linked list. Tail insertion costs nothing extra, because a key may be inserted only after a search has confirmed that it is not in the i-th list, and that search ends at the tail node, so the tail's address is already known. If new keys are appended at the tail and the given keys are inserted in order, the resulting chains are T[0]: 26; T[2]: 41 → 15; T[3]: 68; T[5]: 44; T[6]: 06; T[10]: 36; T[12]: 38 → 12 → 51. All other entries stay null.
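Tail insertion as described in the note can be sketched as follows. The node type `ChainNode` and the functions `chain_insert` and `chain_length` are hypothetical names, keys are assumed to be non-negative integers, and the table length 13 is taken from the example.

```c
#include <assert.h>
#include <stdlib.h>

#define M 13   /* table length of the example */

typedef struct ChainNode {
    int key;
    struct ChainNode *next;
} ChainNode;

/* Append key at the tail of its chain. Returns the key's 1-based position
   in the chain, or 0 if the key is already present or memory runs out. */
int chain_insert(ChainNode *table[M], int key)
{
    assert(key >= 0);
    ChainNode **link = &table[key % M];
    int pos = 1;
    while (*link != NULL) {            /* the duplicate check walks to the tail */
        if ((*link)->key == key)
            return 0;
        link = &(*link)->next;
        pos++;
    }
    ChainNode *node = malloc(sizeof *node);
    if (node == NULL)
        return 0;
    node->key = key;
    node->next = NULL;
    *link = node;                      /* link now points at the tail's next field */
    return pos;
}

/* Number of nodes in the chain at address addr. */
int chain_length(ChainNode *table[M], int addr)
{
    int len = 0;
    for (ChainNode *p = table[addr]; p != NULL; p = p->next)
        len++;
    return len;
}
```

The table must start as an array of null pointers. The search for a duplicate ends exactly where the new node has to be linked in, which is why appending at the tail needs no second pass. Freeing the nodes is omitted here for brevity.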

(2) Advantages of chaining

Compared with open addressing, chaining has the following advantages:

(1) Chaining handles collisions simply and has no clustering: non-synonyms never collide, so the average search length is shorter.

(2) Because the nodes of each linked list are allocated dynamically, chaining is better suited to cases where the table length cannot be determined before the table is built.

(3) To reduce collisions, open addressing needs a small load factor α, which wastes a lot of space when nodes are large. With chaining, α ≥ 1 is acceptable, and when nodes are large the added pointer field is negligible, so space is saved.

(4) Deleting a node is easy in a hash table built with chaining: simply remove the node from its linked list. In a table built with open addressing, the slot of a deleted node cannot simply be set to empty, because that would cut off the probe path to synonym nodes inserted after it. The reason is that in every open addressing method an empty slot (an open address) is the condition for a failed search. Deletion in an open addressing table can therefore only mark the node as deleted; it cannot actually remove it.

(3) Disadvantages of chaining

The disadvantage of chaining is that the pointers need extra space. When nodes are small, open addressing is therefore more space-efficient. If the space saved on pointers is used to enlarge the table, the load factor drops, which reduces collisions in open addressing and speeds up the average search.

**Operations on hash tables**

The operations on a hash table are search, insertion, and deletion. Search is the main one, because fast search is the purpose of a hash table and both insertion and deletion must search first.

**1. Hash table type definitions:**

    #define NIL (-1)         // empty-slot marker; depends on the key type. Keys are assumed to be non-negative integers in this section
    #define M 997            // table length; depends on the application, but M is generally chosen as a prime according to α
    typedef int KeyType;     // assumed here: non-negative integer keys
    typedef int InfoType;    // placeholder; the real type depends on the application
    typedef struct {         // hash table node type
        KeyType key;
        InfoType otherinfo;  // depends on the application
    } NodeType;
    typedef NodeType HashTable[M];   // hash table type

**2. Search with open addressing**

Searching a hash table follows the same process as building it. Given a value K, compute the hash address h(K) with the hash function used when the table was built. If that slot is empty, the search fails. Otherwise compare the key stored there with K. If they are equal, the search succeeds. Otherwise move to the next address according to the collision resolution method used when the table was built. Repeat until an empty slot is reached (search fails) or an equal key is found (search succeeds).

(1) The general form of open addressing as a function

    int h(KeyType K);        // hash function, defined below
    int Increment(int i);    // increment sequence, defined below

    int Hash(KeyType K, int i)
    {   // compute the probe address hi, 0 ≤ i ≤ M-1, in the hash table T[0..M-1]
        // h is the hash function; Increment evaluates the increment sequence and depends on the collision resolution method
        return (h(K) + Increment(i)) % M;   // Increment(i) corresponds to di
    }

If the hash function is built with the division method and collisions are handled by open addressing with linear probing, h(K) and Increment(i) in the function above can be defined as:

    int h(KeyType K) {       // hash address of K by the division method
        return K % M;
    }

    int Increment(int i) {   // i-th increment di for linear probing
        return i;            // for quadratic probing, return i * i instead
    }

(2) General search algorithm for open addressing:

    int HashSearch(HashTable T, KeyType K, int *pos)
    {   // Search for K in the hash table T[0..M-1]. Returns 1 on success. There are two kinds of failure:
        // 0 is returned if an open address is found, and -1 if the table is full and K is not found.
        // *pos records the position in the table where K or the empty slot was found.
        int i = 0;                              // number of probes so far
        do {
            *pos = Hash(K, i);                  // probe address hi
            if (T[*pos].key == K) return 1;     // search succeeded
            if (T[*pos].key == NIL) return 0;   // open address found: search failed
        } while (++i < M);                      // at most M probes
        return -1;                              // table full and K not found
    }   // HashSearch

Note:

The algorithm above works for any open addressing method, provided the hash function h(K) and the increment function Increment(i) used by Hash are supplied. To make the search faster, the chosen hash function and increment method can be written directly into HashSearch. The corresponding algorithms are left as exercises.

**3. Insertion and table creation with open addressing**

To build a table, first clear the key of every slot so that every address is open, then call the insertion algorithm to insert the given keys into the table one by one.

The insertion algorithm first calls the search algorithm. If the key is already in the table or the table is full, the insertion fails. If an open address is found, the new node is stored there and the insertion succeeds.

    #include <stdio.h>                         // for printf

    void HashInsert(HashTable T, NodeType new)
    {   // insert the node new into the hash table T[0..M-1]
        int pos, sign;
        sign = HashSearch(T, new.key, &pos);   // find where new belongs in T
        if (!sign)                             // open address pos found
            T[pos] = new;                      // store the node: insertion succeeded
        else if (sign > 0)                     // insertion failed
            printf("duplicate key!");          // the key is already in the table
        else                                   // sign < 0
            Error("hash table overflow!");     // table full; Error is assumed to report the message and terminate the program
    }   // HashInsert

    void CreateHashTable(HashTable T, NodeType A[], int n)
    {   // build the hash table T[0..M-1] from the nodes in A[0..n-1]
        int i;
        if (n > M)                             // with open addressing, the load factor α must not exceed 1
            Error("load factor > 1");
        for (i = 0; i < M; i++)
            T[i].key = NIL;                    // clear every key so that address i is open
        for (i = 0; i < n; i++)                // insert A[0..n-1] into T[0..M-1] in order
            HashInsert(T, A[i]);
    }   // CreateHashTable

**4. Deletion**

A node cannot simply be removed from a hash table built with open addressing. If a node must be deleted, its key cannot be set to NIL. It must be set to a special marker, DELETED.

The search operation must then be modified to keep probing when it meets this marker, and the insertion operation must be modified to treat a slot carrying the marker as empty and store the new node there. This adds time overhead, and the search time no longer depends only on the load factor.

Therefore, when nodes must be deleted from a hash table, chaining is usually used to resolve collisions.
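The modified operations can be sketched for linear probing as follows. The marker value -2 for DELETED, the table length 13, and the function names with the `tomb_` prefix are assumptions for illustration. Keys are taken to be non-negative integers.

```c
#include <assert.h>

#define M       13
#define NIL     (-1)   /* slot that has never been used */
#define DELETED (-2)   /* assumed marker for a slot whose node was deleted */

/* Mark every slot as never used. */
void tomb_init(int table[M])
{
    for (int i = 0; i < M; i++)
        table[i] = NIL;
}

/* Return the slot holding key, or -1 if key is absent. */
int tomb_search(const int table[M], int key)
{
    for (int i = 0; i < M; i++) {
        int pos = (key % M + i) % M;
        if (table[pos] == key)
            return pos;
        if (table[pos] == NIL)
            return -1;             /* a truly empty slot ends the search */
        /* DELETED or another key: keep probing */
    }
    return -1;
}

/* Insert key, reusing the first NIL or DELETED slot on its probe path.
   Returns the slot used, or -1 on a duplicate key or a full table. */
int tomb_insert(int table[M], int key)
{
    assert(key >= 0);
    if (tomb_search(table, key) >= 0)
        return -1;
    for (int i = 0; i < M; i++) {
        int pos = (key % M + i) % M;
        if (table[pos] == NIL || table[pos] == DELETED) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}

/* Replace key by the DELETED marker. Returns its former slot, or -1. */
int tomb_delete(int table[M], int key)
{
    int pos = tomb_search(table, key);
    if (pos >= 0)
        table[pos] = DELETED;
    return pos;
}
```

After inserting 41 and 15 (both hash to 2) and deleting 41, a search for 15 still succeeds because it probes past the marker in slot 2, and a later insertion of 28 (which also hashes to 2) reuses that slot.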

Note:

The hash table algorithms for chaining are left as exercises.

**5. Performance analysis**

The time for insertion and deletion is dominated by the search, so only the time performance of search is analyzed here.

A hash table sets up a direct mapping between keys and storage locations, and ideally the desired key is found without any key comparison. Because of collisions, however, searching a hash table is still a process of comparing keys. Even so, the average search length of a hash table is much smaller than that of sequential search or binary search, which rely entirely on key comparisons.

(1) ASL of successful searches

Search in a hash table outperforms sequential search and binary search.

[Example] For the hash tables of Example 9.1 and Example 9.2, assuming every node is equally likely to be searched for, the average search lengths of linear probing and chaining are:

ASL = (1×6 + 2×2 + 3×1 + 9×1)/10 = 2.2 // linear probing

ASL = (1×7 + 2×2 + 3×1)/10 = 1.4 // chaining

With n = 10, the average search lengths of sequential search and binary search (for successful searches) are:

ASL = (10 + 1)/2 = 5.5 // sequential search

ASL = (1×1 + 2×2 + 3×4 + 4×3)/10 = 2.9 // binary search, obtained from the decision tree

(2) ASL of unsuccessful searches

For an unsuccessful search, the number of key comparisons needed by sequential search and binary search depends only on the table length, whereas for a hash table it depends on the key being searched for. The average search length of a hash table for unsuccessful searches can therefore be defined, under the assumption of equal probability, as the average number of key comparisons performed when the search fails.

[Example] For the hash tables of Example 9.1 and Example 9.2, the average search lengths of linear probing and chaining for unsuccessful searches, with all 13 addresses equally likely, are:

ASLunsucc = (9+8+7+6+5+4+3+2+1+1+2+1+10)/13 = 59/13 ≈ 4.54 // linear probing

ASLunsucc = (1+0+2+1+0+1+1+0+0+0+1+0+3)/13 = 10/13 ≈ 0.77 // chaining
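The four totals in these formulas can be recomputed mechanically. The sketch below assumes the key sequence 26, 36, 41, 38, 44, 15, 68, 12, 6, 51 with h(key) = key % 13, and it follows the counting convention of the formulas: for linear probing the probe that hits the empty slot is counted, and for chaining only comparisons with stored keys are counted. All function names are hypothetical.

```c
#include <assert.h>

#define M   13    /* table length */
#define N   10    /* number of keys */
#define NIL (-1)  /* empty-slot marker */

/* Assumed key sequence of the two examples. */
static const int KEYS[N] = {26, 36, 41, 38, 44, 15, 68, 12, 6, 51};

/* Build the linear-probing table and return the total number of probes
   needed to find every key once (the probes used while inserting it). */
int linear_success_total(int table[M])
{
    assert(N < M);                    /* an empty slot always exists */
    int total = 0;
    for (int i = 0; i < M; i++)
        table[i] = NIL;
    for (int k = 0; k < N; k++) {
        int i = 0;
        while (table[(KEYS[k] % M + i) % M] != NIL)
            i++;
        table[(KEYS[k] % M + i) % M] = KEYS[k];
        total += i + 1;
    }
    return total;
}

/* Total probes of failed searches started at each of the M addresses.
   Assumes the table has at least one empty slot. */
int linear_fail_total(const int table[M])
{
    int total = 0;
    for (int d = 0; d < M; d++) {
        int i = 0;
        while (table[(d + i) % M] != NIL)
            i++;
        total += i + 1;               /* occupied slots plus the empty one */
    }
    return total;
}

/* Chaining with tail insertion: total comparisons of successful searches. */
int chain_success_total(void)
{
    int len[M] = {0};
    int total = 0;
    for (int k = 0; k < N; k++)
        total += ++len[KEYS[k] % M];  /* position of the key in its chain */
    return total;
}

/* Chaining: total key comparisons of failed searches, one per address. */
int chain_fail_total(void)
{
    int len[M] = {0};
    int total = 0;
    for (int k = 0; k < N; k++)
        len[KEYS[k] % M]++;
    for (int d = 0; d < M; d++)
        total += len[d];              /* a failed search walks the whole chain */
    return total;
}
```

Under these assumptions the totals are 22 and 14 for successful searches, giving 22/10 = 2.2 and 14/10 = 1.4, and 59 and 10 for unsuccessful searches, giving 59/13 and 10/13.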

Note:

① Hash tables built with the same hash function but different collision resolution methods have different average search lengths.

② The average search length of a hash table is not a function of the number of nodes n but of the load factor α. When designing a hash table, α can therefore be chosen to control the average search length.

③ Choosing α

The smaller α is, the lower the chance of a collision, but too small an α wastes space. As long as α is chosen appropriately, the average search length of the hash table is bounded by a constant, that is, the average search time is O(1).

④ Difference between hashing and other search methods

All search methods other than hashing share one feature: they are based on key comparisons. Sequential search works on unordered sets. Each key comparison has two possible outcomes, "=" or "!=", and the average time is O(n). The other methods search ordered sets. Each key comparison has three possible outcomes, "=", "<", or ">", and each comparison narrows the remaining search range, so the search is faster, with an average time of O(lg n). Hashing computes the address directly from the key, and its expected time is O(1).