Algorithm series (12): Hashing

Source: Internet
Author: User
Tags: rehash

Overview

Hashing (the term is sometimes translated and sometimes transliterated directly from "hash") transforms an input of arbitrary length, also called the pre-image, into a fixed-length output by means of a hash algorithm; that output is the hash value. The transformation is a compression mapping: the space of hash values is usually much smaller than the input space, so different inputs may hash to the same output, and the input cannot be uniquely determined from the hash value. Put simply, a hash function compresses a message of any length into a message digest of fixed length.
Hashing is widely used in information security, where it turns messages of different lengths into a seemingly random fixed-length code (128 bits for MD5, for example); these codes are called hash values. Hashing can also be described as finding a mapping between the content of a piece of data and the address where it is stored.

A hash table is a data structure that performs inserts, deletes, and lookups in constant average time.

Hash function

Each key is mapped to a value in the range 0 to TableSize-1; this mapping is called the hash function. Because the number of cells is finite, two keys may be mapped to the same value (a collision), so some way of resolving collisions is needed.

As an example, consider designing a simple hash function for string keys.

The first version computes the hash value as the sum of the ASCII codes of the characters in the string.

public static int hash1(String key, int tableSize) {
    int hashVal = 0;
    // Add up the character codes of the key.
    for (int i = 0; i < key.length(); i++) {
        hashVal += key.charAt(i);
    }
    return hashVal % tableSize;
}

This approach has an obvious flaw: the hash values are unevenly distributed. When the table is large and the keys are short, the sums stay small, so the keys crowd into the low-numbered cells.

A better hashing method

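public static int hash2(String key, int tableSize) {
    int hashVal = 0;
    // Horner's rule: treat the characters as coefficients of a polynomial in 37.
    for (int i = 0; i < key.length(); i++) {
        hashVal = 37 * hashVal + key.charAt(i);
    }
    hashVal = hashVal % tableSize;
    // Integer overflow can make the remainder negative; shift it back into range.
    if (hashVal < 0) {
        hashVal += tableSize;
    }
    return hashVal;
}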
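To make the difference between the two functions concrete, the following sketch restates them in a small class and compares them on a few keys. The class name HashCompare and the table size 10007 are assumptions chosen for this illustration; they do not come from the original code.

```java
public class HashCompare {
    // Sum of character codes, as in the first version above.
    public static int hash1(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++) {
            hashVal += key.charAt(i);
        }
        return hashVal % tableSize;
    }

    // Polynomial in 37 evaluated with Horner's rule, as in the second version.
    public static int hash2(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++) {
            hashVal = 37 * hashVal + key.charAt(i);
        }
        hashVal %= tableSize;
        if (hashVal < 0) {
            hashVal += tableSize;
        }
        return hashVal;
    }

    public static void main(String[] args) {
        int tableSize = 10007; // hypothetical prime table size
        // Anagrams always collide under hash1 because addition ignores order.
        System.out.println(hash1("abc", tableSize) + " " + hash1("cba", tableSize));
        // hash2 weights each position differently, so these two keys separate.
        System.out.println(hash2("abc", tableSize) + " " + hash2("cba", tableSize));
        // Eight ASCII characters sum to at most 8 * 127 = 1016, so hash1
        // never reaches most of a table with 10007 cells.
        System.out.println(hash1("zzzzzzzz", tableSize));
    }
}
```

Running main prints "294 294", then "6427 9163", then "976": the anagrams collide under hash1 but not under hash2, and even a key made of eight high-valued letters lands in the first tenth of the table under hash1.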

Separate chaining for resolving collisions

All elements that hash to the same value are kept in one list.

Idea: elements that hash to the same value are stored in a linked list.
Disadvantage: allocating memory for each new list node takes time and slows the algorithm down.
Analysis: with load factor λ (the number of elements divided by the table size), an unsuccessful lookup examines λ nodes on average, and a successful lookup traverses about 1 + λ/2 links.

Simple implementation

import java.util.LinkedList;

public class SeparateChainHashTable<T> {
    private static final int DEFAULT_SIZE = 100;
    protected LinkedList<T>[] lists;

    public SeparateChainHashTable() {
        this(DEFAULT_SIZE);
    }

    @SuppressWarnings("unchecked")
    public SeparateChainHashTable(int size) {
        // PrimeNumber.nextPrime is a helper that is not shown in this article;
        // it is expected to return a prime table size no smaller than size.
        lists = new LinkedList[PrimeNumber.nextPrime(size)];
        for (int i = 0; i < lists.length; i++) {
            lists[i] = new LinkedList<T>();
        }
    }

    public boolean contains(T x) {
        return lists[hashCode(x)].contains(x);
    }

    public void insert(T x) {
        LinkedList<T> linkedList = lists[hashCode(x)];
        // Keep each element at most once.
        if (!linkedList.contains(x)) {
            linkedList.add(x);
        }
    }

    public void remove(T x) {
        LinkedList<T> linkedList = lists[hashCode(x)];
        if (linkedList.contains(x)) {
            linkedList.remove(x);
        }
    }

    /**
     * Maps x to a list index in the range 0 to lists.length - 1.
     */
    private int hashCode(T x) {
        int hashVal = x.hashCode();
        hashVal = hashVal % lists.length;
        // The remainder is negative when x.hashCode() is negative.
        if (hashVal < 0) {
            hashVal += lists.length;
        }
        return hashVal;
    }
}
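The constructor above calls PrimeNumber.nextPrime, which is not shown in the article. A minimal sketch of such a helper follows, under the assumption that it returns the smallest prime greater than or equal to its argument; the actual helper in the author's repository may differ.

```java
public class PrimeNumber {
    // Trial division: n is prime if no d with d * d <= n divides it.
    public static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        if (n % 2 == 0) {
            return n == 2;
        }
        for (int d = 3; (long) d * d <= n; d += 2) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Assumed contract: smallest prime >= n (and 2 for any n below 2).
    public static int nextPrime(int n) {
        int candidate = Math.max(n, 2);
        while (!isPrime(candidate)) {
            candidate++;
        }
        return candidate;
    }
}
```

With this helper, the default size of 100 gives a table of 101 lists. A prime table size helps spread keys whose hash codes share a common factor.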

Open addressing

A load factor below 0.5 is usually required, both to preserve the time bounds and to guarantee that an insertion succeeds; otherwise rehash() is needed.
Idea: another way to resolve a collision is to try other cells until an empty one is found. That is, the cells h0(x), h1(x), h2(x), ... are tried in turn, where hi(x) = (hash(x) + F(i)) mod TableSize, F(0) = 0, and F is the collision resolution function. Such tables are also called probing hash tables.
Features: this method needs a larger table than separate chaining, because colliding elements are placed in other cells of the same table; the load factor for such tables is typically kept below 0.5.
Implementations:
1) Linear probing: F(i) = i, so cells are tried one after another. This easily produces primary clustering: occupied cells form blocks that grow like snowballs, and any key that hashes into a block needs several probes to resolve the collision. One remedy is a random collision resolution strategy, in which each probe is independent of the previous one (how to obtain such a sequence in practice is left open here). Linear probing is a poor choice once the load factor exceeds 0.5.
2) Quadratic probing: F(i) = i². The problem is that once the load factor exceeds 0.5, there is no guarantee that an empty cell will be found (and if the table size is not prime, an empty cell may not be found even below 0.5). Quadratic probing causes secondary clustering, but the effect is much smaller than that of primary clustering. It can be eliminated with double hashing, at the cost of extra computation.
3) Double hashing: a popular choice is F(i) = i * hash2(x). The choice of the second hash function is critical; one option is hash2(x) = R - (x mod R), where R is a prime smaller than TableSize. Experiments show that the expected number of probes for double hashing is almost the same as for random collision resolution, so it performs well. However, it needs two hash functions, which is time-consuming to compute (especially for String keys), so in practice the simpler and faster quadratic probing is used more often.
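The three strategies differ only in the offset F(i) that is added on the i-th probe. The sketch below lists the cell visited by each of the first few probes. It assumes the usual textbook forms F(i) = i, F(i) = i * i, and F(i) = i * hash2(x) with hash2(x) = R - (x mod R), as well as non-negative integer keys; the class name, method names, and sample values are invented for this illustration.

```java
public class ProbeSequences {
    // Cell tried on the i-th probe with linear probing: F(i) = i.
    public static int linear(int hash, int i, int tableSize) {
        return (hash + i) % tableSize;
    }

    // Cell tried on the i-th probe with quadratic probing: F(i) = i * i.
    public static int quadratic(int hash, int i, int tableSize) {
        return (hash + i * i) % tableSize;
    }

    // Cell tried on the i-th probe with double hashing: F(i) = i * hash2(x),
    // where hash2(x) = r - (x mod r) and r is a prime smaller than tableSize.
    public static int doubleHash(int x, int i, int tableSize, int r) {
        int hash1 = x % tableSize;
        int hash2 = r - (x % r); // in the range 1..r for x >= 0, so never zero
        return (hash1 + i * hash2) % tableSize;
    }

    public static void main(String[] args) {
        int tableSize = 11; // hypothetical prime table size
        int r = 7;          // hypothetical prime smaller than tableSize
        int x = 49;         // sample non-negative key; 49 % 11 == 5
        for (int i = 0; i < 4; i++) {
            System.out.println(i + ": linear=" + linear(x % tableSize, i, tableSize)
                    + " quadratic=" + quadratic(x % tableSize, i, tableSize)
                    + " double=" + doubleHash(x, i, tableSize, r));
        }
    }
}
```

For the key 49 the linear sequence is 5, 6, 7, 8, the quadratic sequence is 5, 6, 9, 3, and the double hashing sequence is 5, 1, 8, 4. Keys that share a home cell follow the same path under the first two strategies, while under double hashing the step size depends on the key itself.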
Analysis: a standard delete cannot be performed in a probing hash table, because emptying a cell breaks the probe sequences that pass through it, and the elements that collided there earlier could no longer be found. Lazy deletion is required instead: each element carries an extra field that records its state (active, empty, or deleted).
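Lazy deletion can be sketched as follows. This is a minimal illustration with quadratic probing, String keys, and a fixed capacity, not the author's implementation; the class name LazyDeleteTable and the decision to throw instead of rehashing are assumptions made to keep the example short.

```java
import java.util.Arrays;

public class LazyDeleteTable {
    private enum State { EMPTY, ACTIVE, DELETED }

    private final String[] keys;
    private final State[] states;
    private int used; // cells that are ACTIVE or DELETED

    // capacity is assumed to be a prime number.
    public LazyDeleteTable(int capacity) {
        keys = new String[capacity];
        states = new State[capacity];
        Arrays.fill(states, State.EMPTY);
    }

    // Follows the quadratic probe sequence until it reaches x or an EMPTY cell.
    // DELETED cells are stepped over, so keys stored beyond them stay reachable.
    private int findPos(String x) {
        int home = Math.floorMod(x.hashCode(), keys.length);
        int pos = home;
        int i = 0;
        while (states[pos] != State.EMPTY && !x.equals(keys[pos])) {
            i++;
            if (i > keys.length) {
                throw new IllegalStateException("no free cell found");
            }
            pos = (int) ((home + (long) i * i) % keys.length);
        }
        return pos;
    }

    public boolean contains(String x) {
        return states[findPos(x)] == State.ACTIVE;
    }

    public void insert(String x) {
        int pos = findPos(x);
        if (states[pos] == State.ACTIVE) {
            return; // already present
        }
        if (states[pos] == State.EMPTY) {
            // Deleted cells still count toward the load factor.
            if ((used + 1) * 2 > keys.length) {
                throw new IllegalStateException("more than half full; rehash needed");
            }
            used++;
        }
        keys[pos] = x;
        states[pos] = State.ACTIVE;
    }

    public void remove(String x) {
        int pos = findPos(x);
        if (states[pos] == State.ACTIVE) {
            states[pos] = State.DELETED; // mark the cell, do not empty it
        }
    }
}
```

The strings "Aa" and "BB" have the same hashCode in Java (2112), so they share a home cell. After inserting both and removing "Aa", contains("BB") still returns true because the search steps over the deleted cell rather than stopping at it.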

Rehashing

When open addressing with quadratic probing is used and the table holds too many elements, operations take longer and an insertion may fail. The table then has to be enlarged (to roughly twice its size): the original table is scanned, and each element is placed in the new table according to the hash value computed with the new hash function.
Rehashing can be triggered in several ways:

1) when the table is half full;

2) when an insertion fails;

3) when the table reaches a certain load factor (a middle-of-the-road strategy).

The third strategy, with a well-chosen cutoff, is usually the one to adopt.
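The third strategy can be sketched as follows. This is a minimal illustration for int keys with quadratic probing and no deletion; the class name RehashingIntSet, the initial capacity of 11, and the cutoff of 0.5 are assumptions made for the example.

```java
public class RehashingIntSet {
    private Integer[] cells = new Integer[11]; // null marks an empty cell
    private int size;

    public int capacity() {
        return cells.length;
    }

    public boolean contains(int x) {
        return cells[findPos(x)] != null;
    }

    public void insert(int x) {
        int pos = findPos(x);
        if (cells[pos] != null) {
            return; // already present
        }
        cells[pos] = x;
        size++;
        // Rehash as soon as the load factor passes the chosen cutoff of 0.5.
        if (size * 2 > cells.length) {
            rehash();
        }
    }

    // Quadratic probing; terminates because the table size is prime and the
    // table is never more than half full when a search starts.
    private int findPos(int x) {
        int home = Math.floorMod(x, cells.length);
        int pos = home;
        for (int i = 1; cells[pos] != null && cells[pos] != x; i++) {
            pos = (int) ((home + (long) i * i) % cells.length);
        }
        return pos;
    }

    // Allocates a table about twice as large and reinserts every element,
    // because each position depends on the table size.
    private void rehash() {
        Integer[] old = cells;
        cells = new Integer[nextPrime(2 * old.length)];
        size = 0;
        for (Integer key : old) {
            if (key != null) {
                insert(key);
            }
        }
    }

    // Smallest prime >= n, found by trial division.
    private static int nextPrime(int n) {
        for (int candidate = Math.max(n, 2); ; candidate++) {
            boolean prime = true;
            for (int d = 2; (long) d * d <= candidate; d++) {
                if (candidate % d == 0) {
                    prime = false;
                    break;
                }
            }
            if (prime) {
                return candidate;
            }
        }
    }
}
```

Starting from 11 cells, the sixth insertion pushes the load factor past 0.5 and the table grows to 23 cells; the twelfth insertion grows it to 47.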

Extendible hashing

Extendible hashing is used when the amount of data is too large to fit in main memory at once. The implementation is analogous to a B-tree: the root D is kept in main memory and indexes the contents on disk. The key questions are how to reduce the branching coefficients and how to split and extend the structure. The details are not analyzed further here.
Note that if M (the number of elements per leaf) is too small, the directory may become too large. A second-level index is then needed (which in effect increases M), but this may make two disk accesses unavoidable if main memory cannot hold the second-level index.

Hash tables in the standard library

The standard library includes hash table implementations of Set and Map: the HashSet class and the HashMap class. HashSet is implemented directly on top of HashMap. The JDK implementation uses separate chaining.
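A short usage sketch of the two classes follows. The class name StandardLibraryDemo, the helper countWords, and the sample words are invented for this illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StandardLibraryDemo {
    // Counts how often each word occurs, using HashMap as the hash table.
    public static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Collects the distinct words; HashSet.add ignores duplicates.
    public static Set<String> distinctWords(String[] words) {
        Set<String> distinct = new HashSet<>();
        for (String word : words) {
            distinct.add(word);
        }
        return distinct;
    }

    public static void main(String[] args) {
        String[] words = {"hash", "table", "hash", "map"};
        System.out.println(distinctWords(words).size());   // 3
        System.out.println(countWords(words).get("hash")); // 2
    }
}
```

Both classes rely on the hashCode and equals methods of the key type, so keys of a user-defined class need consistent implementations of both.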


The code for this article is on GitHub: https://github.com/robertjc/simplealgorithm. The GitHub code is still being improved and may have problems in places; suggestions are welcome.

