I. Overview
This article gives a rough account of the hash algorithm, then uses the load balancer of a distributed system as an example to discuss consistent hashing in depth. It also discusses how widely the hash algorithm is used in schemes for processing massive data. Finally, starting from the source code, it analyzes concretely how the hash algorithm is applied in the MapReduce framework.
II. The Hash Algorithm
A hash function maps an input of arbitrary length to a fixed-length output, and it may map different inputs to the same output; the range of outputs is controllable. A hash therefore provides both a compressing mapping and an equivalence mapping (equal inputs always produce equal outputs). These properties are used by encryption algorithms in the field of information security, and the equivalence-mapping property in particular plays a major role in massive data solutions, especially throughout the MapReduce framework, which is described in detail in the following sections. Why does a hash have this compressing and equivalence-mapping behavior? Mainly because hash functions are implemented with the modulo operation. Let's look at several commonly used hash functions:
• Direct modulo method: f(x) := x mod maxM; maxM is generally a prime number that is not too close to a power of two (2^t).
• Multiply-and-round method: f(x) := trunc((x / maxX) * maxLong) mod maxM, mainly used for real-valued keys.
• Mid-square method: f(x) := (x * x div 1000) mod 1000000; square the key and take the middle digits, which carry information from every digit of the original key.
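As a rough illustration of these three functions (a minimal sketch, not taken from the original article; the constants such as maxM = 1021 and the digit window in the mid-square method are arbitrary choices):

// Minimal sketches of the three hash functions described above.
public class SimpleHashes {

    // Direct modulo: f(x) = x mod maxM, with maxM a prime not too close to a power of two.
    static int directModulo(long x, int maxM) {
        return (int) Math.floorMod(x, (long) maxM);
    }

    // Multiply-and-round for real-valued keys x in [0, maxX):
    // f(x) = trunc((x / maxX) * maxLong) mod maxM
    static int multiplyRound(double x, double maxX, int maxM) {
        long scaled = (long) ((x / maxX) * Long.MAX_VALUE);
        return (int) Math.floorMod(scaled, (long) maxM);
    }

    // Mid-square: square the key and keep middle digits, which mix information from all digits.
    static int midSquare(long x) {
        long squared = x * x;
        return (int) ((squared / 1000) % 1000000);
    }

    public static void main(String[] args) {
        System.out.println(directModulo(123456789L, 1021));   // 1021 is prime
        System.out.println(multiplyRound(0.7312, 1.0, 1021));
        System.out.println(midSquare(4567L));
    }
}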
III. Applications of the Hash Algorithm in Massive Data Processing
The main idea of processing massive data on a single machine is the same as in the MapReduce framework: divide and conquer, splitting the massive data into several small parts and processing them while keeping an eye on memory usage and processing concurrency. In more detail, the processing generally breaks down into the following steps (in most cases; weigh them against your own situation and other workarounds to pick the most practical approach):
Step 1: Split the data by hash modulo, an equivalence mapping. In this way a huge file can be divided into a number of small files to be processed separately (note: data that satisfies the same rule, i.e. hashes to the same value, must end up in the same small file). This method is very effective when the data is large and memory is limited. A sketch of this step follows the list below.
Step 2: After the large file has been divided into small files by hash mapping, use a structure such as a HashMap to compute frequency statistics for the items of interest in each small file. Concretely, use the item as the HashMap key and its number of occurrences as the value.
Step 3: After the statistics from the previous step, the scenario often requires the data in the HashMap to be sorted by occurrence count. Heap sort, quick sort, merge sort and so on can all be used.
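A minimal sketch of Step 1 (illustrative only; the input file name, the bucket count of 1024, and the assumption of one record per line are mine, not from the original; Steps 2 and 3 appear in the examples below):

import java.io.*;
import java.nio.charset.StandardCharsets;

// Sketch of Step 1: split a huge line-oriented file into 1024 buckets by hash modulo,
// so that identical keys always land in the same small file.
public class HashSplit {
    public static void main(String[] args) throws IOException {
        int buckets = 1024;
        BufferedWriter[] out = new BufferedWriter[buckets];
        for (int i = 0; i < buckets; i++) {
            out[i] = new BufferedWriter(new FileWriter("bucket_" + i + ".txt"));
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("huge_input.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                int bucket = (line.hashCode() & Integer.MAX_VALUE) % buckets; // non-negative bucket index
                out[bucket].write(line);
                out[bucket].newLine();
            }
        }
        for (BufferedWriter w : out) {
            w.close();
        }
    }
}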
Now let's take a look at specific examples:
"Example 1" massive log data, extract the most visited Baidu one day the most number of the IP
Idea: When we see a business scenario like this, we should immediately ask: how large are these gateway logs? How many distinct IPs can there be, and how much storage would they take at most? Before solving such a problem we need to know the size of the data; only then can we outline a solution. So let's assume these gateway logs amount to 3 TB. Following the steps above, the scenario can be analyzed roughly as follows:
(1) First, filter out of this massive data the IPs of users who visited Baidu on the specified day, and write them to a single large file.
(2) Following the divide-and-conquer idea, use hash mapping to split the large file and reduce the data size. Compute hash(IP) % 1024 for each IP address and store the huge number of IP records in 1024 small files, where the value produced by the hash function is the number of the small file the record goes to.
(3) Read the small files one by one. For each small file, build a HashMap whose key is the IP and whose value is its number of occurrences. Recording IP counts with a HashMap is straightforward: as the program reads the small file it looks each IP up as a key; if the IP is not yet present, insert it with a count of 1; if it is already stored, add 1 to its value. Finally, sort the data in the HashMap by occurrence count with any sorting algorithm and record the IP that occurs most often in this file.
(4) At this point we have the most frequent IP of each of the 1024 small files; a conventional sort (or a simple scan) over these 1024 candidates then yields the IP with the most occurrences overall. A sketch of steps (3) and (4) follows.
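A minimal sketch of steps (3) and (4), assuming the split from step (2) has already produced files named ip_bucket_0.txt ... ip_bucket_1023.txt with one IP per line (the file names and line format are my own assumptions):

import java.io.*;
import java.util.*;

// Sketch of steps (3) and (4): count IP frequencies in each small file with a HashMap,
// keep the per-file winner, then take the overall maximum.
public class TopIp {
    public static void main(String[] args) throws IOException {
        String bestIp = null;
        long bestCount = 0;
        for (int i = 0; i < 1024; i++) {
            Map<String, Long> counts = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("ip_bucket_" + i + ".txt"))) {
                String ip;
                while ((ip = in.readLine()) != null) {
                    counts.merge(ip, 1L, Long::sum);   // insert with 1, or add 1 to the existing count
                }
            }
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {        // keep the best candidate seen so far
                    bestCount = e.getValue();
                    bestIp = e.getKey();
                }
            }
        }
        System.out.println("Most frequent IP: " + bestIp + " (" + bestCount + " hits)");
    }
}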
Here there are a few points we need to understand specifically:
First: we use the hash function hash(IP) % 1024 to map the large file into 1024 small files. Will the sizes of these 1024 small files be uniform? Also, we use a HashMap for the IP frequency statistics; is the memory consumption acceptable?
How evenly the data is divided among the small files depends on the hash function we use, which for this scenario is hash(IP) % 1024. A well-designed hash function reduces collisions and divides the data fairly evenly across the 1024 small files. And although the data is mapped to different locations, it is still the original data; only the form in which it is organized has changed.
Now look at the second problem: the memory usage of the HashMap used to count IP frequencies.
To judge the memory needed by the HashMap, we have to consider how many distinct IPs there can be. A 32-bit IP has at most 2^32, roughly 4G, distinct values. In this scenario we have already divided the large file into 1024 small files by the hash value of the IP, which means those roughly 4G possible IPs are spread across 1024 files. With a reasonably designed hash function, the HashMap of each small file holds at most about 4G / 1024 distinct IPs plus their occurrence counts, so memory is more than sufficient.
Second: hash modulo is an equivalence mapping; in other words, identical elements always land in the same small file after the mapping. For this scenario, the same IP will always be placed into the same one of the 1024 small files by the hash function.
"Example 2" given a, b two files, each store 5 billion URLs, each URL accounted for 64 bytes, memory limit is 4G, let you find a, b file common URL?
Idea: The same as before: first use hash mapping to reduce the data size, then do the statistics and comparison.
Concrete steps:
(1) Analyze the size of existing data.
Each file has 5 billion URLs at 64 bytes per URL, so each file is about 5G * 64 = 320 GB. 320 GB far exceeds the 4 GB memory limit, so the files cannot be loaded into memory and must be handled with a divide-and-conquer approach.
(2) Split the files by hash mapping. Read file a line by line and use the hash function hash(URL) % 1000 to distribute its URLs into 1000 small files, denoted f1_1, f1_2, f1_3, ..., f1_1000. Ideally each small file is about 320 MB. Then apply the same operation to large file b to obtain 1000 small files, denoted f2_1, f2_2, f2_3, ..., f2_1000.
(3) After this round of splitting, the same URL from the two original files must have landed in the pair of small files with the same index, so we can treat each pair as a unit: f1_1 & f2_1, f1_2 & f2_2, f1_3 & f2_3, ..., f1_1000 & f2_1000. The problem then becomes finding the common URLs within each of these 1000 pairs of small files. For each pair, first load the URLs of the smaller file into a HashSet, then traverse the other file of the pair and check whether each URL is in the HashSet just built; if it is, it is a common URL and can be written directly to the result file. A sketch of this pairwise comparison follows.
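A minimal sketch of the pairwise comparison (the file names f1_i / f2_i follow the text above; the output file name and the assumption of one URL per line are mine):

import java.io.*;
import java.util.*;

// Sketch for Example 2: for each pair (f1_i, f2_i), load the URLs of one file into a HashSet
// and stream the other file against it; matches are common URLs.
public class CommonUrls {
    public static void main(String[] args) throws IOException {
        try (BufferedWriter result = new BufferedWriter(new FileWriter("common_urls.txt"))) {
            for (int i = 1; i <= 1000; i++) {
                Set<String> urls = new HashSet<>();
                try (BufferedReader a = new BufferedReader(new FileReader("f1_" + i))) {
                    String url;
                    while ((url = a.readLine()) != null) {
                        urls.add(url);
                    }
                }
                try (BufferedReader b = new BufferedReader(new FileReader("f2_" + i))) {
                    String url;
                    while ((url = b.readLine()) != null) {
                        if (urls.contains(url)) {     // the URL appears in both halves of the pair
                            result.write(url);
                            result.newLine();
                        }
                    }
                }
            }
        }
    }
}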
"Example 3" has 10 files, each file 1G, each file is stored in each line of the user's query, each file can be repeated query. Ask you to sort by the frequency of the query.
"Example 4" has a 1G size of a file, each line is a word, the size of the word does not exceed 16 bytes, memory limit size is 1M. Returns the highest frequency of 100 words.
Scenarios like Examples 3 and 4 can all be solved with the usual old trick: first use hash mapping to reduce the data size, then load the statistics into memory, and finally sort. For the details, refer to the two examples above.
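For Example 4, once the hash split and the per-file HashMap counting are done, the final "sort" only needs the 100 most frequent words, so a min-heap of size 100 is enough. A minimal sketch of that last step (it assumes the per-file counts have already been merged into a single Map; the helper name is mine):

import java.util.*;

// Sketch of the final step of Example 4: keep the 100 highest-frequency words
// with a min-heap of size 100 instead of sorting everything.
public class Top100Words {
    static List<String> top100(Map<String, Long> counts) {
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue())); // smallest count on top
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > 100) {
                heap.poll();              // evict the current minimum, keeping only the 100 largest
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);      // most frequent first
        return result;
    }
}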
IV. Application of the Hash Algorithm in the MapReduce Framework
The hash algorithm plays a central role in the distributed computing framework MapReduce. Let's look at the whole operation of MapReduce. First, the raw data is split and fed to the map function; the map output is sorted inside a circular buffer, and then, based on the key (by default, although this can be customized), a hash mapping divides the map output into n parts, where n is the number of reducers, so that the data can be processed in parallel. This is the partition phase. In the MapReduce framework the partitioning often determines how skewed the data is, so it is best to understand the distribution of the data before processing it.
Next, let's study how partitioning is implemented from the perspective of the MapReduce source code:
The main Partitioner implementations are HashPartitioner, BinaryPartitioner, KeyFieldBasedPartitioner, and TotalOrderPartitioner, with HashPartitioner being the default. First look at the core implementation of HashPartitioner:
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.mapreduce.lib.partition;

import org.apache.hadoop.mapreduce.Partitioner;

/** Partition keys by their {@link Object#hashCode()}. */
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
Look at the return statement of getPartition: there is our familiar hash mapping, and by now the reason for it should be clear. In addition, TotalOrderPartitioner, BinaryPartitioner, and the other Partitioner implementations are also based on hash mapping, but each adds its own custom logic on top; for example, TotalOrderPartitioner can produce a totally ordered output. The source code of the other partitioners is not posted here; have a look yourself if you are interested.
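As an illustration of "custom logic on top of hash mapping" (a hypothetical example of my own, not Hadoop source code; the pipe-separated key format is invented), a custom Partitioner might route records by only part of the key:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical custom partitioner: partition by the first "|"-separated field of the key
// instead of the whole key, so all records sharing that field go to the same reducer.
public class FirstFieldPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String firstField = key.toString().split("\\|", 2)[0];
        // Same masking trick as HashPartitioner to keep the result non-negative.
        return (firstField.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

In a job this would typically be enabled with job.setPartitionerClass(FirstFieldPartitioner.class); note that a skewed choice of partition key here is exactly what produces the data skew mentioned above.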
V. The Consistent Hashing Algorithm
This section is the last part of the article. It is included mainly for the completeness of the discussion of the hash algorithm; it is not essential to the massive data solutions above, being aimed mostly at the design of distributed caches. Because a number of experts have already studied this topic very thoroughly and explained it well, I quote their work directly: this section is taken from Sparkliang's blog.
The consistent hashing algorithm was put forward in the 1997 paper "Consistent Hashing and Random Trees" and is widely used in cache systems.
1 Basic Scenarios
Suppose you have N cache servers (hereafter simply "caches"). How do you map an object to one of the N caches? You would most likely use a common method like the following: compute the hash value of the object and map it uniformly onto the N caches:
hash(object) % N
Everything runs normally. Now consider the following two cases:
1. A cache server m goes down (this must be considered in a real deployment). All objects mapped to cache m become invalid. What to do? Cache m has to be removed; the number of caches becomes N - 1 and the mapping formula changes to hash(object) % (N - 1).
2. Access load grows and new cache servers must be added. The number of caches becomes N + 1 and the mapping formula changes to hash(object) % (N + 1).
What do cases 1 and 2 mean? They mean that suddenly almost all of the cached objects become invalid. For the backend servers this is a disaster: a flood of requests rushes straight through to them.
Consider a third problem as well: as hardware gets stronger, you may want newly added nodes to do more of the work; obviously the hash algorithm above cannot accommodate that either.
Is there any way to avoid this situation? Yes: consistent hashing.
2 Hash Algorithm and Monotonicity
One measure of a hash algorithm is monotonicity (Monotonicity), defined as follows:
Monotonicity means that if some content has already been allocated to corresponding buffers by hashing, and a new buffer is then added to the system, the result of the hash should guarantee that previously allocated content is mapped either to its original buffer or to the new buffer, and never to another buffer in the old buffer set.
It is easy to see that the simple hash algorithm hash(object) % N above fails to satisfy this monotonicity requirement.
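A tiny sketch of how badly hash(object) % N breaks when N changes (the key count and the drop from 100 to 99 caches are arbitrary illustrative numbers):

// Sketch: count how many of 100000 keys change cache when N drops from 100 to 99.
public class ModuloRemap {
    public static void main(String[] args) {
        int keys = 100_000;
        int moved = 0;
        for (int k = 0; k < keys; k++) {
            if (k % 100 != k % 99) {
                moved++;
            }
        }
        System.out.println(moved + " of " + keys + " keys map to a different cache");
        // Prints a number close to 99000: almost every key moves.
    }
}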
3 Principle of the Consistent Hashing Algorithm
Consistent hashing is a hash algorithm that, in a nutshell, changes the existing key-to-cache mappings as little as possible when a cache is removed or added, thereby satisfying the monotonicity requirement as far as possible.
Here are the basic principles of the consistent hashing algorithm in 5 steps.
3.1 Ring Hash Space
Consider that the usual hash algorithm maps a value to a 32-bit key, i.e., into the value space 0 ~ 2^32 - 1. We can think of this space as a ring whose head (0) joins its tail (2^32 - 1), as shown in Figure 1 below.
Figure 1 Ring Hash space
3.2 Mapping objects to the hash space
Next consider 4 objects, object1 ~ object4. The hash keys computed for them by the hash function are distributed on the ring as shown in Figure 2.
hash(object1) = key1;
... ...
hash(object4) = key4;
Figure 2 Key value distributions for 4 objects
3.3 Mapping the cache to the hash space
The basic idea of consistent hashing is to map both the object and the cache to the same hash value space, and use the same hash algorithm.
Assume there are currently three caches, A, B, and C. Their mapping results are shown in Figure 3: they are arranged in the hash space according to their corresponding hash values.
hash(cache A) = key A;
... ...
hash(cache C) = key C;
Figure 3 Key value distributions for cache and objects
Speaking of which, a note on computing the hash of a cache: the usual approach is to use the cache machine's IP address or host name as the input to the hash function.
3.4 Mapping objects to the cache
Now that both the cache and the object have been mapped to the hash value space using the same hash algorithm, the next thing to consider is how to map the object to the cache.
In this ring space, start from an object's key and move clockwise until a cache is met; the object is stored on that cache. Because the hash values of the object and of the caches are fixed, this cache is unique and deterministic. And with that we have found the mapping from objects to caches!
Continuing the example above (see Figure 3): by this rule object1 will be stored on cache A, object2 and object3 on cache C, and object4 on cache B. A minimal sketch of this lookup follows.
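A minimal sketch of the clockwise lookup (the cache names follow the example; using String.hashCode as the ring hash is an illustrative shortcut only, since a real implementation would use a better-distributed hash):

import java.util.*;

// Sketch of the object-to-cache mapping on the ring: a sorted map of cache hash -> cache name;
// a lookup walks "clockwise" to the first cache at or after the object's hash, wrapping around.
public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addCache(String cacheName) {
        ring.put(hash(cacheName), cacheName);
    }

    void removeCache(String cacheName) {
        ring.remove(hash(cacheName));
    }

    String cacheFor(String objectKey) {
        int h = hash(objectKey);
        Map.Entry<Integer, String> e = ring.ceilingEntry(h);   // first cache clockwise from the object
        if (e == null) {
            e = ring.firstEntry();                              // wrap around past 2^32 - 1 back to 0
        }
        return e.getValue();
    }

    private int hash(String s) {
        return s.hashCode() & Integer.MAX_VALUE;                // illustrative hash only
    }

    public static void main(String[] args) {
        HashRing r = new HashRing();
        r.addCache("cache A");
        r.addCache("cache B");
        r.addCache("cache C");
        System.out.println(r.cacheFor("object1"));
        r.removeCache("cache B");                               // only objects that were on cache B move
        System.out.println(r.cacheFor("object1"));
    }
}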
3.5 Examining Cache Changes
As said before, the biggest problem with the hash-then-modulo method is that it cannot satisfy monotonicity: when the cache set changes, the cached objects fail wholesale, which has a huge impact on the backend servers. Now let's analyze how the consistent hashing algorithm handles this.
3.5.1 Removing the cache
Suppose cache B goes down. By the mapping method described above, the only affected objects are those lying between cache B and the next cache reached by walking counterclockwise from it (cache C), i.e., exactly the objects that had been mapped to cache B.
So only the object object4 needs to change: it is remapped onto cache C; see Figure 4.
Figure 4 Cache Map after cache B has been removed
3.5.2 Add Cache
Now consider adding a new cache D, and assume that on this ring hash space cache D is mapped between objects object2 and object3. The only affected objects are those between cache D and the next cache reached by walking counterclockwise from it (cache B); they are a portion of the objects that had been mapped to cache C, and they simply need to be remapped onto cache D.
So only the object object2 needs to change: it is remapped onto cache D; see Figure 5.
Figure 5 Mapping relationships after adding cache D
4 Virtual nodes
Another measure of a hash algorithm is balance (Balance), defined as follows:
Balance means that the result of the hash should be distributed among all the buffers as evenly as possible, so that all of the buffer space is utilized.
A hash algorithm does not guarantee absolute balance; if there are only a few caches, objects cannot be mapped onto them evenly. In the example above, with only cache A and cache C deployed, among the 4 objects cache A stores only object1 while cache C stores object2, object3, and object4; the distribution is very uneven.
To remedy this, consistent hashing introduces the concept of the "virtual node", defined as follows:
A "virtual node" is a replica of an actual node in the hash space; one real node corresponds to several "virtual nodes", and this number of replicas is called the "replica count". "Virtual nodes" are arranged in the hash space by their own hash values.
Take again the case where only cache A and cache C are deployed; we saw in Figure 4 that the distribution is not uniform. Now introduce virtual nodes and set the "replica count" to 2. This gives 4 "virtual nodes": cache A1 and cache A2 represent cache A, and cache C1 and cache C2 represent cache C. A fairly ideal arrangement is shown in Figure 6.
Figure 6 Mapping relationship after the introduction of "Virtual Node"
At this point the mapping from objects to "virtual nodes" is:
object1 -> cache A2; object2 -> cache A1; object3 -> cache C1; object4 -> cache C2;
So objects object1 and object2 are mapped onto cache A, while object3 and object4 are mapped onto cache C; the balance has improved greatly.
After the "Virtual node" is introduced, the mapping relationship is transformed from {object---node} to {Object-and-virtual node}. The mapping relationship 7 is shown when querying the cache of an object.
Figure 7 The cache where the object is queried
The hash of a "virtual node" can be computed from the IP address of the corresponding real node plus a numeric suffix. For example, assume the IP address of cache A is 202.168.14.241.
Before introducing "virtual nodes", the hash value of cache A is computed as:
hash("202.168.14.241");
After introducing "virtual nodes", the hash values of the "virtual nodes" cache A1 and cache A2 are computed as:
hash("202.168.14.241#1"); // cache A1
hash("202.168.14.241#2"); // cache A2
Reposted from http://zengzhaozheng.blog.51cto.com/8219051/1409705