Notes Collation--algorithm

Source: Internet
Author: User

Basic concepts

A hash table is a data structure that locates a storage position directly from a key. The function that establishes the correspondence between keys and storage positions is called the hash function, and the table in which data elements are stored at the positions it computes is called the hash table.

Method of constructing hash function

To construct a hash table for n data elements, allocate a contiguous block of m storage units (m ≥ n). The key of each data element is passed as the argument to a hash function, which maps it to the address of one storage unit, and the element is stored in that unit.

Mathematically, a hash function is a mapping from keys to storage units. We want the function to use simple operations and still spread the computed hash addresses as evenly as possible over the storage units. There are three key points in constructing a hash function. First, the computation should be as simple and efficient as possible, so that insertion and lookup in the hash table stay fast. Second, the function should distribute keys well, to reduce the probability of hash collisions. Third, the function should compress the key space strongly, to save memory. Common construction methods include:

    1. Direct addressing method: the hash address is the value of a linear function of the key, which can be written as H(key) = a × key + b. The advantage is that no collisions occur; the disadvantage is that the space cost may be very high, so the method suits cases with few elements.
    2. Division (remainder) method: the hash address is the remainder of the key divided by a constant, which can be written as H(key) = key mod p. It is simple to compute and widely applicable, and it is the most frequently used hash function. The key decision is the choice of the constant p: it should generally be close to or equal to the length of the hash table, and it works best when it is a prime number.
    3. Digit analysis method: the hash address is formed from those digit positions of the key whose values are distributed most uniformly, which helps avoid collisions. The method only applies when all keys are known in advance, so it is not suitable for a general-purpose hash table.
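The three construction methods above can be sketched in a few lines of Python. This is a minimal illustration, not code from any library: the function names, the default coefficients a=1 and b=0, and the convention of counting digit positions from the right are all assumptions made for the example.

```python
def direct_address(key, a=1, b=0):
    """Direct addressing: a linear function of the key is the address."""
    return a * key + b


def division_hash(key, p):
    """Division (remainder) method: key mod p, with p ideally a prime
    close to the table length."""
    return key % p


def digit_analysis(key, positions):
    """Digit analysis: build the address from selected decimal digits.

    positions are counted from the right (0 = units digit); this
    convention is an assumption of the sketch. The key must have at
    least max(positions) + 1 digits.
    """
    digits = str(key)[::-1]  # reversed, so digits[0] is the units digit
    return int("".join(digits[i] for i in sorted(positions, reverse=True)))


print(direct_address(5, a=2, b=1))    # 2 * 5 + 1
print(division_hash(20, 3))           # 20 mod 3
print(digit_analysis(13579, [0, 2]))  # hundreds digit followed by units digit
```

Which digit positions to pass to the digit analysis function depends on knowing the full key set in advance, which is why the method does not generalize.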
Hash conflict resolution

When constructing a hash table, two different keys may be mapped to the same hash address by the hash function. This is called a hash collision.

Hash collisions depend mainly on two factors. The first is the filling factor (also called the load factor), the ratio of the number of data elements in the hash table to the size of the hash address space: α = n/m. The smaller α is, the less likely a collision; the larger α is, the more likely a collision. However, a small α also means low utilization of storage space, while a large α means high utilization. To balance collisions against space utilization, α is usually kept between 0.6 and 0.9. Hashtable in .NET fixes the maximum value of α at 0.72 (note: although the official MSDN documentation states that Hashtable has a default filling factor of 1.0, the filling factor actually used is always the specified value multiplied by 0.72).

The second factor is the hash function itself. A well-chosen hash function distributes hash addresses as evenly as possible over the address space and so reduces collisions. Finding a good hash function depends heavily on practical experience, but many efficient hash functions have already been worked out; see Angel Lucifer's article Data structure: Hash Table [I], listed under Resources.
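To make the filling factor arithmetic concrete, here is a small sketch. The helper names are hypothetical; the 0.72 constant is the value quoted above, and truncating the product to an integer is an assumption about how the threshold is derived from it.

```python
def filling_factor(n, m):
    """alpha = n / m: stored elements divided by address-space size."""
    return n / m


def max_elements(m, factor=0.72):
    """Largest element count tolerated before the table must grow,
    assuming the product is truncated to an integer."""
    return int(m * factor)


print(filling_factor(2, 3))  # two elements in a table of size 3
print(max_elements(3))       # threshold for a table of size 3
print(max_elements(7))       # threshold for a table of size 7
```

Under these assumptions a table of size 3 tolerates only 2 elements, even though 2/3 is below 0.72, because the threshold is an integer count rather than a ratio.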

Hash collisions are often difficult to avoid, and there are a number of ways to resolve the conflict, usually divided into two main categories:

    1. Open addressing: when a collision occurs, the colliding hash address is used as the argument of a collision-resolution function that yields the address of a new, free storage unit. In open addressing the collision-resolution function is usually a family of functions.
    2. Chaining (linked list method): when no collision occurs, the data element is stored directly; when a collision occurs, the colliding data elements are stored in a singly linked list.
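The two categories can be contrasted with a short sketch. It assumes non-negative integer keys and uses key mod m as the hash function. The open-addressing half uses linear probing, chosen only because it is the simplest probe family, and the class and function names are hypothetical.

```python
class ChainedTable:
    """Chaining: every slot holds a list of (key, value) pairs."""

    def __init__(self, m=7):
        self.slots = [[] for _ in range(m)]

    def put(self, key, value):
        chain = self.slots[key % len(self.slots)]
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))       # colliding keys share the chain

    def get(self, key):
        for k, v in self.slots[key % len(self.slots)]:
            if k == key:
                return v
        raise KeyError(key)


def probe_insert(table, key, value):
    """Open addressing with linear probing: H_k(key) = (key + k) mod m.

    table is a list of None or (key, value) tuples; returns the slot used.
    """
    m = len(table)
    for k in range(m):
        i = (key + k) % m
        if table[i] is None or table[i][0] == key:
            table[i] = (key, value)
            return i
    raise RuntimeError("table is full")


chained = ChainedTable(7)
chained.put(33, "a")
chained.put(40, "b")                     # 33 and 40 both map to slot 5

open_table = [None] * 7
first = probe_insert(open_table, 33, "a")
second = probe_insert(open_table, 40, "b")
print(len(chained.slots[5]), first, second)
```

With chaining both keys live in the same slot's list; with open addressing the second key is pushed to the next free slot.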
Hashtable and Dictionary

The hash table classes implemented in .NET are Hashtable and Dictionary<TKey, TValue>. A Hashtable consists of buckets that contain the elements of the collection; a bucket is a virtual subgroup of the elements in the Hashtable, and buckets make searching easier than in most other collections. Dictionary is the generic version of the hash table and offers the same functionality as Hashtable. For value types, a Dictionary of a specific type (other than Object) performs better than a Hashtable, because Hashtable elements are of type Object, so boxing and unboxing usually occur when value types are stored or retrieved.

In addition, although Microsoft describes Hashtable as thread-safe for multiple reader threads or a single writer thread, in practice it is not fully thread-safe. Dictionary, newly introduced in .NET Framework 2.0, does not solve this problem either: only its public static members are thread-safe, so Dictionary must be regarded as not thread-safe. Enumerating the whole collection is not thread-safe for either class; the entire collection must be locked for the duration of the enumeration, because enumeration and write access compete with each other. With .NET Framework 4.0 and later, the thread-safe ConcurrentDictionary can be used instead.

Another important difference is that, although both classes implement a hash table, they resolve collisions in completely different ways: Hashtable uses open addressing, while Dictionary uses chaining.

Implementation principle of Hashtable

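The hash function of the Hashtable class can be expressed by the following recursive formula, where h_1(key) is the hash code of the key, m is the size of the hash table, and k counts the collisions encountered so far:

$$H_0(key) = h_1(key) \bmod m, \qquad H_k(key) = \bigl(H_{k-1}(key) + h_2(key)\bigr) \bmod m$$

$$h_2(key) = 1 + \bigl((h_1(key) \gg 5) + 1\bigr) \bmod (m - 1)$$

A simple derivation shows that Hashtable therefore uses the following family of hash functions:

$$H_k(key) = \bigl(h_1(key) + k \cdot h_2(key)\bigr) \bmod m, \qquad k = 0, 1, \dots, m - 1$$

So we have a series of hash functions H_0, H_1, and so on. When an element is added to the hash table, these functions are tried in order until a free storage unit is found. This technique is called double hashing (a two-degree hash).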
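A minimal sketch of the probe sequence that double hashing produces. It assumes integer keys whose hash code equals the key, and the step formula 1 + ((h >> 5) + 1) mod (m − 1) published for early versions of the .NET Framework; later versions may compute the step differently. The function name is hypothetical.

```python
def probe_sequence(hash_code, hashsize):
    """Return the bucket indices H_0 .. H_(m-1) tried for one key."""
    h1 = hash_code & 0x7FFFFFFF                   # clear the sign bit
    step = 1 + ((h1 >> 5) + 1) % (hashsize - 1)   # h2: always in 1 .. m-1
    return [(h1 + k * step) % hashsize for k in range(hashsize)]


print(probe_sequence(40, 7))
```

Because the table size is prime and the step lies between 1 and m − 1, the sequence visits every bucket exactly once before repeating, so a free bucket is always found if one exists.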

In the Hashtable class, the bucket containing the element is defined in the structure bucket:

    private struct bucket
    {
        public Object key;
        public Object val;
        public int hash_coll;
    }

The first two fields are easy to understand: they hold the key and the value of the hash table entry. The third field, hash_coll, carries two pieces of information: the hash code of the key and whether a collision has occurred (coll is short for collision). The field is a 32-bit integer whose highest bit is the sign bit. When the highest bit is 0, the number is non-negative and there is no collision; when it is 1, the number is negative and there is a collision. The remaining bits store the hash code.
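The packing of hash_coll can be illustrated with a few bit operations. This is a sketch with hypothetical helper names. Python integers are unbounded, so to_int32 is included only to show the value a signed 32-bit field would hold.

```python
COLLISION_BIT = 0x80000000   # highest bit of a 32-bit field
HASH_MASK = 0x7FFFFFFF       # bits 0-30: the hash code


def pack(hash_code, collided):
    """Combine a hash code and a collision flag into one 32-bit pattern."""
    return (hash_code & HASH_MASK) | (COLLISION_BIT if collided else 0)


def hash_part(hash_coll):
    return hash_coll & HASH_MASK


def has_collision(hash_coll):
    return bool(hash_coll & COLLISION_BIT)


def to_int32(pattern):
    """Interpret a 32-bit pattern as a signed integer."""
    return pattern - (1 << 32) if pattern & COLLISION_BIT else pattern


marked = pack(33, collided=True)
print(to_int32(marked) < 0, hash_part(marked), has_collision(pack(20, False)))
```

A marked field reads as a negative number when treated as a signed integer, which is why a collision mark is described as hash_coll being negative.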

Let's walk through a simple sequence of insertions and deletions to get a more intuitive understanding of how the hash table works. When no capacity is specified, the Hashtable class automatically initializes its capacity to the default minimum value of 3 before the first insertion.

  1. Insert element [20, "elem1"]. According to the hash function of the Hashtable class, its hash address is 20 mod 3 = 2. This is the first insertion, so there is no collision and the element goes directly to bucket[2]. Because there is no collision, hash_coll holds the hash code of the key.
  2. Insert element [33, "elem2"]. Again there is no collision: the hash address is 33 mod 3 = 0, so the element is stored in bucket[0].
  3. Insert element [40, "elem3"]. The hash table must now be expanded. Why expand when the filling factor is only 2/3 ≈ 0.66, which does not exceed 0.72? In .NET, Microsoft converts the filling factor into a maximum element count, the product of the filling factor and the hash table size: 3 × 0.72 = 2.16, truncated to 2. The table already holds 2 elements, so it is expanded before the insertion. The new size is the smallest prime number greater than twice the original size, which here is 7. After the expansion, the hash addresses of the elements already stored must be recomputed for the new table. The hash function itself has not changed, but the length of the hash table has, so the elements move: 20 goes to bucket[6] (20 mod 7 = 6) and 33 goes to bucket[5] (33 mod 7 = 5).

    After the expansion is complete, [40, "elem3"] is inserted. A collision occurs, because 40 mod 7 = 5 and bucket[5] already holds an element, so a double hash is performed: with step 1 + ((40 >> 5) + 1) mod (7 − 1) = 3, the next address is (5 + 3) mod 7 = 1.

    Position 1 of the hash table is still free, so the element is inserted there. Before the insertion, bucket[5] must be marked because a collision occurred at it: the highest bit of its hash_coll is set to 1.
  4. Insert element ["elem4"]. In the same way, a collision occurs and a double hash is performed to find a free bucket, where the element is then stored.
  5. Delete element ["elem1"]. Deletion also addresses the element through the hash function, applying a double hash if there is a collision. Note that deleting an element with a collision mark (its hash_coll is negative) differs from deleting one without. For an element without a collision mark, the key and value are set to null and hash_coll is set to 0. For an element with a collision mark, the hash code portion of hash_coll (bits 0-30) is set to 0 and the value is set to null, but the key must also be pointed at the buckets array of the hash table itself. The reason is that an element whose hash code is 0 may also carry a collision mark; without this sentinel it would be impossible to tell whether the position is empty or occupied, and a later insertion could overwrite that element.
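The walkthrough above can be replayed with a small simulation. This is a sketch under stated assumptions, not the .NET source. It assumes non-negative integer keys, the early .NET Framework step formula 1 + ((h >> 5) + 1) mod (m − 1), and growth to the smallest prime above twice the current size. It does not check for duplicate keys. The class name MiniHashtable and the DELETED sentinel (standing in for "key points at the buckets array") are hypothetical.

```python
EMPTY = None
DELETED = object()      # sentinel for a deleted bucket that keeps its mark
COLL = 0x80000000       # collision mark: highest bit of hash_coll


def next_prime(n):
    """Smallest prime >= n (trial division; fine for tiny tables)."""
    while n < 2 or any(n % d == 0 for d in range(2, int(n ** 0.5) + 1)):
        n += 1
    return n


class MiniHashtable:
    def __init__(self, size=3):
        self.size = size
        self.count = 0
        # each bucket is [key, val, hash_coll]
        self.buckets = [[EMPTY, None, 0] for _ in range(size)]

    def _probe(self, key):
        """Yield (hash code, bucket index) along the double-hashing sequence."""
        h1 = hash(key) & 0x7FFFFFFF
        step = 1 + ((h1 >> 5) + 1) % (self.size - 1)
        for k in range(self.size):
            yield h1, (h1 + k * step) % self.size

    def _expand(self):
        live = [b for b in self.buckets
                if b[0] is not EMPTY and b[0] is not DELETED]
        self.size = next_prime(2 * self.size + 1)
        self.count = 0
        self.buckets = [[EMPTY, None, 0] for _ in range(self.size)]
        for key, val, _ in live:
            self.add(key, val)           # re-address every stored element

    def add(self, key, val):
        """Insert a new key and return the bucket index that was used."""
        if self.count >= int(self.size * 0.72):
            self._expand()
        for h1, i in self._probe(key):
            b = self.buckets[i]
            if b[0] is EMPTY or b[0] is DELETED:
                b[0], b[1] = key, val
                b[2] = (b[2] & COLL) | h1   # keep any existing collision mark
                self.count += 1
                return i
            b[2] |= COLL                    # occupied: mark it, keep probing
        raise RuntimeError("no free bucket")

    def find(self, key):
        for _, i in self._probe(key):
            b = self.buckets[i]
            if b[0] is not EMPTY and b[0] is not DELETED and b[0] == key:
                return i
            if not b[2] & COLL:
                return -1                   # no mark: the probe chain ends
        return -1

    def remove(self, key):
        i = self.find(key)
        if i < 0:
            return False
        b = self.buckets[i]
        mark = b[2] & COLL
        b[0] = DELETED if mark else EMPTY
        b[1] = None
        b[2] = mark                         # clear bits 0-30, keep the mark
        self.count -= 1
        return True


table = MiniHashtable()
placed = [table.add(20, "elem1"), table.add(33, "elem2"), table.add(40, "elem3")]
print(placed, table.size)
```

Under these assumptions, 20 and 33 land in buckets 2 and 0 of the size-3 table, and 40 lands in bucket 1 after the table has grown to size 7. Removing 33 afterwards leaves the sentinel in bucket 5, so a lookup of 40 still probes past it to bucket 1.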
Implementation principle of Dictionary

Starting with .NET Framework 2.0, along with the introduction of generics, the class library provides a new namespace, System.Collections.Generic, and generic classes such as Dictionary were added under it.

Dictionary's hash function is simple: it is the plain division (remainder) method. Collisions are resolved by chaining, but the buckets array has been reduced to an integer array that only stores element positions (indices). The elements themselves are held in a struct called Entry and stored in an array of that type named entries. Entry is defined as follows:

    private struct Entry
    {
        public int hashCode;
        public int next;
        public TKey key;
        public TValue value;
    }

The next field holds the index, within the entries array, of the next element in the same chain; a value of -1 marks the end of the chain.

Let's walk through inserting and deleting Dictionary elements in the same way:

  1. Insert element [20, "elem1"]. As with Hashtable, the initial capacity of Dictionary is also 3 when none is specified. Dictionary's hash function is simple: the division method gives the hash address directly, 20 mod 3 = 2. The key, value, and hash code of the element are stored in entries[0], and the index of the element in the entries array (0) is assigned to buckets[2].
  2. Insert element [33, "elem2"]. Its hash address is 33 mod 3 = 0, so the element is stored in entries[1] and buckets[0] is set to 1.
  3. Insert element [40, "elem3"]. The computed hash address is 40 mod 3 = 1, so there is again no collision. Because Dictionary's filling factor is 1, no expansion is needed at this point; the element is stored in entries[2] and buckets[1] is set to 2.
  4. Insert element ["elem4"]. The Dictionary is now full and must be expanded. Dictionary follows the same expansion strategy as Hashtable: the new capacity is the smallest prime number greater than twice the original capacity, which is again 7. The elements are then re-addressed for the expanded size, which may introduce collisions and cause existing chains to be broken up and rebuilt. Dictionary first copies all elements of the old entries array to the new entries array and then re-addresses them. For the first element [20, "elem1"], the new hash address is 20 mod 7 = 6, so buckets[6] is set to 0 (the index of "elem1" in entries). Likewise for 33: 33 mod 7 = 5, so buckets[5] = 1. Finally 40 is processed: 40 mod 7 = 5, which collides. To resolve the collision by chaining, Dictionary first points the new element's next at the index currently stored in buckets[5] (1) and then points buckets[5] at the new element (2), forming an array-based linked list with just two elements.

    Then element ["elem4"] is inserted and its hash address is computed; it collides again and is linked into the chain in the same way.
  5. Finally, insert element ["elem5"]. A collision occurs once more, and the element is linked into its chain in the same way.
  6. Deleting elements is simple for Dictionary. To delete an element that is not on a collision chain, locate it through the hash function, delete it, and set the corresponding value in buckets to -1. Deleting an element on a collision chain is an ordinary linked-list deletion, which is left to the reader.
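The buckets/entries layout described above can be replayed with a short simulation. This is a sketch, not the .NET source. It assumes non-negative integer keys, expansion to the smallest prime above twice the current capacity, and no duplicate-key check. Deletion is omitted. The class name MiniDictionary is hypothetical, and the key 54 for "elem4" is an illustrative value chosen only because 54 mod 7 = 5 collides; it does not come from the original example.

```python
def next_prime(n):
    """Smallest prime >= n (trial division; fine for tiny tables)."""
    while n < 2 or any(n % d == 0 for d in range(2, int(n ** 0.5) + 1)):
        n += 1
    return n


class MiniDictionary:
    def __init__(self, size=3):
        self.buckets = [-1] * size   # index of the first entry of each chain
        self.entries = []            # each entry is [hash_code, next, key, value]

    def _resize(self):
        size = next_prime(2 * len(self.buckets) + 1)
        self.buckets = [-1] * size
        # entries keep their positions; only the chains are rebuilt
        for i, entry in enumerate(self.entries):
            b = entry[0] % size
            entry[1] = self.buckets[b]   # next -> previous head of the chain
            self.buckets[b] = i          # this entry becomes the new head

    def add(self, key, value):
        if len(self.entries) == len(self.buckets):
            self._resize()
        h = hash(key) & 0x7FFFFFFF
        b = h % len(self.buckets)
        self.entries.append([h, self.buckets[b], key, value])
        self.buckets[b] = len(self.entries) - 1

    def get(self, key):
        i = self.buckets[(hash(key) & 0x7FFFFFFF) % len(self.buckets)]
        while i >= 0:                    # follow next until -1
            if self.entries[i][2] == key:
                return self.entries[i][3]
            i = self.entries[i][1]
        raise KeyError(key)


d = MiniDictionary()
d.add(20, "elem1")
d.add(33, "elem2")
d.add(40, "elem3")
before = list(d.buckets)
d.add(54, "elem4")                       # hypothetical key; triggers the resize
print(before, d.buckets)
```

New elements are always linked in at the head of their chain, so after the resize buckets[5] points at the most recently added colliding entry and the older entries are reached through next.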
Resources
    • Chenhao, Visualization of Data Structures and Algorithms
    • Zhang Yicheng (translation), Examining Data Structures, Part II: Queues, Stacks, and Hash Tables
    • Abatei, C# and Data Structures: Hash Table (Hashtable)
    • Wikipedia, Hash table
    • Angel Lucifer, Data structure: Hash Table [I]
