The principle and cost of hashing

Source: Internet
Author: User

In one sentence: hashing is a classic trade of space for time. The price is extra space: besides the source data, you must also store the additional data that maps keys to positions.


Hash tables and hash functions are covered in university data structures courses, and in real development we often use the Hashtable structure. For key-value storage, Hashtable lookups perform better than ArrayList lookups. Why? And while we enjoy this performance, what does it cost? (I have been watching "The Red-Top Merchant Hu Xueyan" these days; a classic line from it: before you enjoy what others cannot, you must endure the suffering and humiliation that others cannot bear.) Is using a Hashtable really all profit and no cost? The analysis below addresses these questions and will hopefully start a discussion.
1) Why does hashing give high performance for key-value lookups?
Anyone who has studied data structures knows that in linear lists and trees, the position of a record in the structure is arbitrary: there is no fixed relationship between a record and its key. Finding a record therefore takes a series of key comparisons. In Java, collection structures such as arrays, ArrayList, and List store data this way.
For example, suppose we have data for a class of students, including name, gender, age, and student number:

Name      Gender  Age  Student Number
Tom       Male    15   1
John Doe  Female  14   2
Harry     Male    14   3

Suppose we look up by name with a lookup function FindByName(string name).
1) Find "Tom"
     match in first row, succeeded after just one comparison.
2) Find "Harry"
     match in first row, failed,
     match in second row, failed,
     match in third row, succeeded.
These two cases are the best case and the worst case respectively, so the average number of comparisons is (1+3)/2 = 2. In general, the average number of comparisons is (total records + 1)/2.
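The comparison counts above can be checked with a short sketch. It is only an illustration: the function names and the (name, number) record layout are assumptions made for this example, not part of any library.

```python
# Hypothetical records modeled on the student table: (name, student number).
STUDENTS = [("Tom", 1), ("John Doe", 2), ("Harry", 3)]

def find_by_name(records, name):
    """Linear search. Returns (record, comparisons); record is None if absent."""
    comparisons = 0
    for record in records:
        comparisons += 1
        if record[0] == name:
            return record, comparisons
    return None, comparisons

def average_comparisons(records):
    """Average comparisons for a successful search, each record equally likely."""
    total = sum(find_by_name(records, r[0])[1] for r in records)
    return total / len(records)

print(find_by_name(STUDENTS, "Tom")[1])    # 1  (best case)
print(find_by_name(STUDENTS, "Harry")[1])  # 3  (worst case)
print(average_comparisons(STUDENTS))       # 2.0, i.e. (3 + 1) / 2
```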
Some optimized comparison-based algorithms (for example, searching data that has been sorted) make lookups more efficient, but their complexity is still on the order of log2 n.
How can we find a record faster? Ideally we locate the record's position in a single step, with time complexity O(1), the fastest possible. Suppose we assign each record a number, store the records in order of that number, and know the rule used to assign the numbers. To find a record, we first compute its number from the rule, and then use the number to go straight to the record in the linear sequence.
Note that this description contains two concepts. The rule that assigns numbers to students is called a hash function in data structures. The structure in which the students are arranged according to that rule is called a hash table.
Take the students above as an example again, and assume the student number is the rule. The teacher has a rule table and arranges the seats according to it. To find John Doe, the teacher first applies the rule: John Doe's number is 2, so he sits in seat 2. The teacher walks straight there: "John Doe, there you are!"
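The seating example can be sketched in a few lines. This is a minimal illustration, assuming the rule "seat number = student number"; the names seat_of, seats, and find_by_number are made up for the example.

```python
def seat_of(student_number):
    """The 'rule' (hash function): the seat index is the student number itself."""
    return student_number

# The 'hash table': seats indexed by number. Seat 0 is unused in this example.
seats = [None] * 4
for name, number in [("Tom", 1), ("John Doe", 2), ("Harry", 3)]:
    seats[seat_of(number)] = name

def find_by_number(student_number):
    """One computation and one index access, with no key comparisons."""
    return seats[seat_of(student_number)]

print(find_by_number(2))  # John Doe
```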
Look at the general flow:
  
From the figure above, you can see that a hash table can be described as two sets, one holding the position numbers and one holding the records, plus a set of rules that describes the relationship between records and numbers. How is such a rule usually made?
a) Direct addressing method:
     In my previous article comparing GetHashCode() performance, the GetHashCode() function for integer data returned the integer itself. That is in fact direct addressing. For example, suppose a set of data in the range 0-100 represents people's ages.
Direct addressing then produces this hash table:

Address  0            1           2            3            4            5
Record   0 years old  1 year old  2 years old  3 years old  4 years old  5 years old

.....
This addressing scheme is simple and convenient. It suits source data that can be expressed as numbers or that has a clear sequential relationship.
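A minimal sketch of direct addressing for the age example, assuming ages are integers from 0 to 100 and that each cell stores how many people have that age (the counting is an assumption added for illustration):

```python
TABLE_SIZE = 101  # one cell for each age from 0 to 100

def h(age):
    """Direct addressing: the key itself is the address."""
    return age

def build_age_table(ages):
    """Each cell counts how many people have the age equal to its address."""
    table = [0] * TABLE_SIZE
    for age in ages:
        table[h(age)] += 1
    return table

table = build_age_table([3, 5, 3, 0, 100])
print(table[3])    # 2
print(table[100])  # 1
```

Note that the table needs 101 cells even though only five values were stored, which is the space cost of this scheme.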
b) Digit analysis method:
Suppose a set of data describes the dates of birth of some people:

Year  Month  Day
75    10     1
75    12     10
75    02     14

Analysis shows that the leading digits (the year and the first digit of the month) are mostly the same, so using them would give a very high probability of collision. The last three digits differ considerably, so the last three digits are used as the hash address.
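A small sketch of digit analysis on these dates. It assumes each date is encoded as a zero-padded six-character "YYMMDD" string, which is an assumption of this example rather than something the table states.

```python
def digit_analysis_hash(birth_date):
    """Use the last three digits of a 'YYMMDD' string as the hash address."""
    return int(birth_date[-3:])

dates = ["751001", "751210", "750214"]  # the three rows, zero-padded

print([d[:3] for d in dates])                    # ['751', '751', '750']: nearly identical
print([digit_analysis_hash(d) for d in dates])   # [1, 210, 214]: well separated
```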
c) Mid-square method:
Square the key and take the middle digits of the result as the hash address.
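A minimal sketch of the mid-square idea, assuming decimal digits and a fixed number of middle digits (the digits parameter is an assumption of this example):

```python
def mid_square_hash(key, digits=3):
    """Square the key and return the middle `digits` decimal digits."""
    squared = str(key * key)
    start = max(0, (len(squared) - digits) // 2)
    return int(squared[start:start + digits])

# 1234 * 1234 = 1522756; the middle three digits are 227.
print(mid_square_hash(1234))  # 227
```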
d) Folding method:
Split the key into parts with the same number of digits (one part may be shorter), add the parts together, and discard the final carry. The result is the hash address. For example, the key 20-1445-4547-3 can be folded as:
        5473
+       4454
+        201
=      10128
Discarding the carry 1 gives 0128 as the hash address.
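The folding arithmetic above can be reproduced with a short sketch. It assumes four-digit parts taken from the right-hand end of the key, which is how the worked example splits 20-1445-4547-3 into 5473, 4454, and 201.

```python
def folding_hash(key, part_len=4):
    """Add fixed-width digit groups (taken from the right) and drop the carry."""
    digits = "".join(ch for ch in str(key) if ch.isdigit())
    total = 0
    end = len(digits)
    while end > 0:
        start = max(0, end - part_len)
        total += int(digits[start:end])
        end = start
    return total % (10 ** part_len)  # discard any carry beyond part_len digits

print(folding_hash("20-1445-4547-3"))  # 128, i.e. address 0128
```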
e) Division (remainder) method:
Divide the key by a number p that is no greater than the hash table length m, and use the remainder as the hash address: H(key) = key MOD p (p <= m).
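A minimal sketch of H(key) = key MOD p. The table length m = 13 and the sample keys are assumptions chosen for illustration; choosing p as a prime is a common textbook recommendation, not something the text requires.

```python
def division_hash(key, p):
    """H(key) = key MOD p, where p must not exceed the table length m."""
    return key % p

m = 13  # assumed table length
p = 13  # p <= m
keys = [19, 27, 36]
print([division_hash(k, p) for k in keys])  # [6, 1, 10]
```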
f) Random number method:
Choose a random function and take its value for the key as the hash address, that is, H(key) = random(key), where random is a pseudo-random function. This method is usually used when the keys have unequal lengths.
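One way to read H(key) = random(key) is to seed a pseudo-random generator with the key, so the same key always yields the same address. The sketch below makes that assumption and uses Python's standard random module; it is an interpretation, not the only possible one.

```python
import random

def random_hash(key, m):
    """Seed a generator with the key and draw one address in [0, m)."""
    rng = random.Random(key)  # the same seed always gives the same sequence
    return rng.randrange(m)

m = 16  # assumed table length
for key in ["a", "hash", "a much longer keyword"]:
    # Keys of different lengths are handled uniformly.
    print(key, random_hash(key, m))
```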

In summary, a hash function is a rule that, through some conversion, spreads the keys reasonably evenly over a sequential structure of a specified size. The more sparsely the keys are spread, the lower the lookup time complexity later, and the higher the space complexity.
2) What do we pay for using a hash?
Hashing is a classic space-for-time technique. Take an array of length 100: to search it, we simply traverse it and compare each record. In terms of space complexity, if the array stores byte-typed data, it occupies 100 bytes. Now suppose we use hashing. As said earlier, a hash needs a rule that constrains the relationship between key and storage position, so we need a fixed-length hash table. The data is still a 100-byte array; assuming we need another 100 bytes to record the relationship between keys and positions, the total space is 200 bytes. The size of the table that records this relationship depends on the rule, so it may vary.
Note: the most prominent problem with hash tables is collision: two different keys may be mapped by the hash function to the same index position.
Note: this brief introduction to hashing is meant as preparation for learning the LSH algorithm.
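The collision mentioned in the first note is easy to reproduce. The sketch below assumes the division method with p = 13 and a handful of made-up keys, and groups the keys by address to show which ones collide.

```python
from collections import defaultdict

def find_collisions(keys, p):
    """Group keys by key % p and keep only addresses claimed by several keys."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[key % p].append(key)
    return {addr: ks for addr, ks in buckets.items() if len(ks) > 1}

# 12 % 13 == 12 and 25 % 13 == 12, so these two keys collide.
print(find_collisions([12, 25, 7, 40], 13))  # {12: [12, 25]}
```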

Primary approaches to collision resolution
Although we would rather not have collisions, the possibility always exists. When the range of possible keys is much larger than the length of the hash table, and the specific keys are not known in advance, collisions are inevitable. In addition, when the number of actual keys exceeds the length of the hash table and the table is already full, inserting a new record causes not only a collision but also an overflow. Handling collisions and handling overflow are therefore two important issues in hashing.
1. Open addressing method
Open addressing resolves a collision by using some probing technique to form a probe sequence in the hash table. The search follows this sequence cell by cell until it either finds the given key or reaches an open address (an empty cell). For an insertion, reaching an open address means the new record can be stored in that cell. For a lookup, reaching an open address means the key is not in the table, so the lookup fails.
Note:
① When creating a hash table with open addressing, all cells in the table (more strictly, the keys stored in the cells) must be set to empty before the table is built.
② How an empty cell is represented depends on the specific application.
Depending on how the probe sequence is formed, open addressing can be divided into linear probing, linear compensation probing, random probing, and so on.
(1) Linear probing
The basic idea of this method is:
Treat the hash table T[0..m-1] as a circular array. If the initial probe address is d (that is, H(key) = d), the longest probe sequence is:
d, d+1, d+2, ..., m-1, 0, 1, ..., d-1
That is, probing starts at address d: first probe T[d], then T[d+1], ..., up to T[m-1], then wrap around to T[0], T[1], ..., until T[d-1] has been probed.
The probing process terminates in one of three cases:
(1) If the cell currently being probed is empty, the lookup fails (for an insertion, the key is written into this cell);
(2) If the cell currently being probed contains the key, the lookup succeeds, but an insertion fails;
(3) If probing reaches T[d-1] without finding either an empty cell or the key, both lookup and insertion fail (the table is full).
In the general form of open addressing, the probe sequence of linear probing is:
Hi = (H(key) + i) % m,  0 ≤ i ≤ m-1   // i.e., di = i
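The probe sequence and the three termination cases can be put together in a short sketch. It assumes integer keys, H(key) = key % m, and None as the representation of an empty cell; the class and method names are chosen for this example only.

```python
EMPTY = None  # assumed representation of an empty cell

class LinearProbingTable:
    def __init__(self, m):
        self.m = m
        self.cells = [EMPTY] * m  # all cells start empty

    def _h(self, key):
        return key % self.m  # assumed hash function

    def insert(self, key):
        """Return the index used, or None if the key exists or the table is full."""
        d = self._h(key)
        for i in range(self.m):
            idx = (d + i) % self.m       # Hi = (H(key) + i) % m
            if self.cells[idx] is EMPTY:  # case (1): empty cell, write the key
                self.cells[idx] = key
                return idx
            if self.cells[idx] == key:    # case (2): key present, insertion fails
                return None
        return None                       # case (3): table is full

    def search(self, key):
        """Return the index of the key, or None if it is not in the table."""
        d = self._h(key)
        for i in range(self.m):
            idx = (d + i) % self.m
            if self.cells[idx] is EMPTY:  # case (1): lookup fails
                return None
            if self.cells[idx] == key:    # case (2): lookup succeeds
                return idx
        return None                       # case (3): probed every cell

table = LinearProbingTable(7)
for key in (10, 17, 24, 6, 13):
    table.insert(key)
print(table.cells)       # [13, None, None, 10, 17, 24, 6]
print(table.search(24))  # 5
print(table.search(31))  # None
```

Keys 10, 17, and 24 all hash to address 3, so they end up in cells 3, 4, and 5; key 13 hashes to 6, finds it taken, and wraps around to cell 0.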
Linear probing is conceptually clear and algorithmically simple, but it has the following drawbacks:
① Handling overflow requires a separate procedure. Typically an overflow table is set up to store records that do not fit in the hash table. The simplest structure for the overflow table is a sequential list, which can be searched sequentially.
② Deleting from a hash table built this way is difficult. To delete a record from hash table HT, the natural move is to set its cell to empty, but we cannot do that; we can only mark the cell as deleted, otherwise later lookups would be affected.
③ Linear probing is prone to clustering. Clustering means that the records stored in the hash table form contiguous runs. When collisions are resolved by linear probing, the longer a run of consecutive occupied addresses (that is, the more hash addresses of different keys sit next to each other), the more likely a new record collides with it on insertion. A long run therefore grows faster than a short one, which means that once clustering appears (along with collisions), it causes further clustering.
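Drawback ② can be illustrated with a sketch of deletion marks (often called tombstones). It assumes integer keys, H(key) = key % m, and a sentinel object for deleted cells; all names are hypothetical, and insert assumes the key is not already in the table.

```python
EMPTY = None        # a cell that has never held a key
DELETED = object()  # mark left behind by a deletion

def probe_sequence(key, m):
    d = key % m  # assumed hash function
    return [(d + i) % m for i in range(m)]

def insert(cells, key):
    """Store the key in the first empty or deleted cell (key assumed absent)."""
    for idx in probe_sequence(key, len(cells)):
        if cells[idx] is EMPTY or cells[idx] is DELETED:
            cells[idx] = key
            return idx
    return None  # table is full

def search(cells, key):
    """Skip deleted cells; stop only at a truly empty cell."""
    for idx in probe_sequence(key, len(cells)):
        if cells[idx] is EMPTY:
            return None
        if cells[idx] is not DELETED and cells[idx] == key:
            return idx
    return None

def delete(cells, key):
    idx = search(cells, key)
    if idx is None:
        return False
    cells[idx] = DELETED  # mark the cell, do not empty it
    return True

cells = [EMPTY] * 7
for key in (10, 17, 24):  # all hash to 3, so they occupy cells 3, 4, 5
    insert(cells, key)
delete(cells, 17)

naive = list(cells)
naive[4] = EMPTY          # what happens if the cell is simply emptied

print(search(cells, 24))  # 5: the mark keeps the probe sequence intact
print(search(naive, 24))  # None: the search stops early at cell 4
```

A later insertion can reuse the marked cell, so the marks cost lookup time rather than permanent space.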
