How the Go runtime implements maps efficiently (without generics)

This article is based on my talk at [GoCon Spring 2018](https://gocon.connpass.com/event/82515/) in Tokyo, Japan, and discusses how the map type is implemented in Go.

## What is a map function?

To understand how a map works, we first need to talk about the idea of a *map function*. A map function maps one value to another. Given one value, called a *key*, it returns a second, called the *value*: `map(key) → value`

Now, a map isn't very useful unless we can put some data into it. We need a function that adds data to the map, `insert(map, key, value)`, and a function that removes data from the map, `delete(map, key)`.

There are other interesting aspects of a map implementation, such as querying whether a key is currently present in the map, but they are outside the scope of today's discussion. Instead we will focus on these points: insertion, deletion, and how keys are mapped to values.

## Go's map is a hashmap

The specific map implementation I'm going to talk about is the *hashmap*, because this is the implementation the Go runtime uses. A hashmap is a classic data structure offering O(1) lookups on average and O(n) in the worst case. That is, when things are working well, the time to execute the map function is close to constant.

The size of this constant depends in part on how the hashmap is designed, and whether map access time degrades from O(1) to O(n) depends on its *hash function*.

### The hash function

What is a *hash function*? A hash function takes a key of unknown length and returns a value of fixed length: `hash(key) → integer`

This *hash value* is almost always an integer, for reasons we will see in a moment.

Hash functions and map functions are similar. Both take a key and return a value. However, a hash function returns a value *derived* from the key, not the value *associated* with the key.
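
As a concrete illustration of the `hash(key) → integer` shape, here is a minimal sketch using FNV-1a from Go's standard `hash/fnv` package. This is not the hash function the Go runtime uses for maps; `hashKey` is a hypothetical helper, chosen only because it is small and easy to follow.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashKey is a hypothetical helper: it turns a string key of any length
// into a fixed-size 64-bit integer using FNV-1a.
func hashKey(key string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

func main() {
	keys := []string{"a", "moby/moby", "a much longer key than the other two"}
	for _, k := range keys {
		// Whatever the key length, the result is always 64 bits wide.
		fmt.Printf("%-40q -> %#016x\n", k, hashKey(k))
	}
}
```

Calling `hashKey` twice with the same key always yields the same integer, and the result has the same width no matter how long the key is.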
### Important properties of a hash function

It's worth discussing the properties of a good hash function, because the quality of the hash function determines how close the map function's operations come to O(1).

Two properties matter when a hash function is used with a hashmap. The first is *stability*. The hash function must be stable: given the same key, it must return the same answer. Otherwise you will not be able to find what you put into the map.

The second property is *good distribution*. Given two near-identical keys, the results should be wildly different. This matters for two reasons. First, as we will see, values in a hashmap should be distributed evenly across buckets, otherwise access time is not O(1). Second, because users can control the input to the hash function to some extent, they can also control its output. This can lead to poor distribution, which in some languages has been exploited as a form of denial-of-service (DoS) attack. This property is also known as *collision resistance*.

### The hashmap data structure

The second part of a hashmap is the way the data is stored.

![Hashmap-data-structure](https://raw.githubusercontent.com/studygolang/gctt-images/master/go-hashmap/Gocon-2018-Maps.021-624x351.png)

The classical hashmap structure is an array of buckets, each of which holds a pointer to an array of key/value entries. In this example our hashmap has eight buckets (the value the Go implementation uses), and each bucket can hold up to eight key/value entries (again taken from the Go implementation). Using powers of two allows cheap bit masks and shifts rather than expensive division.

As entries are added to the map, assuming a well-distributed hash function, the buckets fill at roughly the same rate.
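
The power-of-two trick mentioned above can be sketched as follows. This is only an illustration; `bucketCount` and `bucketIndex` are hypothetical names, not identifiers from the Go runtime.

```go
package main

import "fmt"

// bucketCount is an assumption for this sketch. It must be a power of two,
// otherwise the mask below does not select a valid bucket range.
const bucketCount = 8

// bucketIndex keeps only the low-order bits of the hash. For a power-of-two
// bucketCount this gives the same result as hash % bucketCount, without a
// division.
func bucketIndex(hash uint64) uint64 {
	return hash & (bucketCount - 1)
}

func main() {
	for _, h := range []uint64{0b011, 0b1101, 0xdeadbeef} {
		fmt.Printf("hash %#x -> bucket %d (hash %% %d = %d)\n",
			h, bucketIndex(h), bucketCount, h%bucketCount)
	}
}
```

With eight buckets the mask is `0b111`, so only the lowest three bits of the hash decide which bucket an entry lands in.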
Once the number of entries across the buckets passes some percentage of their total capacity, known as the *load factor*, the map grows by doubling the number of buckets and redistributing the entries across them.

With this data structure in mind, suppose we have a map that stores project names and their number of GitHub stars. How would we go about inserting a value into the map?

![Insert-project-stars](https://raw.githubusercontent.com/studygolang/gctt-images/master/go-hashmap//screen-shot-2018-05-20-at-20.25.36-624x351.png)

We start with the key, feed it through the hash function, then mask off everything but the lowest few bits to get the correct offset into the bucket array. This is the bucket that holds all the entries whose hash ends in three (011 in binary). Finally we walk down the bucket's list of entries until we find a free slot, and insert our key and value there. If the key is already present, we simply overwrite the value.

![Map(moby/moby)](https://raw.githubusercontent.com/studygolang/gctt-images/master/go-hashmap//Screen-shot-2018-05-20-at-20.25.44-624x351.png)

Now let's use the same diagram to look up a value in the map. The process is similar. We hash the key as before. Because our bucket array contains eight elements, we take the lowest 3 bits, which gives bucket number five (101 in binary). If our hash function is correct, the string "moby/moby" will always hash to the same value, so we know the key cannot be in any other bucket. What remains is a linear search through the bucket's entries, comparing the key provided with the one stored in each entry.

#### Four properties of a hashmap

That was a very high-level explanation of the classical hashmap. We have seen that four properties are needed to implement a hashmap:

1. You need a hash function for the key.
2. You need an equality function to compare keys.
3. You need to know the size of the key.
4. You need to know the size of the value, because it also affects the size of the bucket structure. The compiler needs to know the size of the bucket structure, which determines the stride in memory when you walk or insert into that structure.

## Hashmaps in other languages

Before we talk about how Go implements a hashmap, I want to give a brief overview of how two other popular languages implement one. I've chosen these two because both offer a single map type that works across a variety of key and value types.

### C++

The first language we'll discuss is C++. The C++ Standard Template Library (STL) provides `std::unordered_map`, which is usually implemented as a hashmap.

This is the declaration of `std::unordered_map`. It is a template, so the actual values of the parameters depend on how the template is instantiated.

```c++
template<
    class Key,                             // the type of the key
    class T,                               // the type of the value
    class Hash = std::hash<Key>,           // the hash function
    class KeyEqual = std::equal_to<Key>,   // the key equality function
    class Allocator = std::allocator< std::pair<const Key, T> >
> class unordered_map;
```

There is a lot here, but the important things to take away are:

* The template takes the types of the key and the value as parameters, so it knows their sizes.
* The template takes a `std::hash` function specialised on the key type, so it knows how to hash a key passed to it.
* The template also takes a `std::equal_to` function specialised on the key type, so it knows how to compare two keys.

Now that we know how the four properties of a hashmap are communicated to the compiler in C++'s `std::unordered_map`, let's look at how it actually works.

![std::unordered_map](https://raw.githubusercontent.com/studygolang/gctt-images/master/go-hashmap//Gocon-2018-Maps.030-624x351.png)

First we pass the key to the `std::hash` function to obtain the hash value of the key.
Then we mask it to obtain an index into the bucket array, and walk the entries of that bucket, comparing keys with the `std::equal_to` function.

### Java

The second language we'll discuss is Java. As you might expect, the hashmap type in Java is called `java.util.HashMap`.

In Java, `java.util.HashMap` can only operate on objects, which works because almost everything in Java is a subclass of `java.lang.Object`. As every object in Java descends from `java.lang.Object`, it inherits, or overrides, the `hashCode` and `equals` methods.

However, you cannot directly store the eight primitive types, `boolean`, `int`, `short`, `long`, `byte`, `char`, `float`, and `double`, because they are not subclasses of `java.lang.Object`. You can use them neither as a key nor as a value. To work around this limitation, they are silently converted into objects representing their primitive values. This is known as *boxing*.

Putting this limitation to one side for the moment, let's look at how a lookup in Java's hashmap operates.

![Java_hashmap](https://raw.githubusercontent.com/studygolang/gctt-images/master/go-hashmap//Gocon-2018-Maps.034-624x351.png)

First we call the `hashCode` method on the key to obtain its hash value. We then mask it to find the position in the bucket array, which holds a pointer to an `Entry`. The `Entry` holds a key, a value, and a pointer to the next `Entry`, forming a linked list.

## Tradeoffs

Now that we know how C++ and Java implement a hashmap, let's compare their relative advantages and disadvantages.

### C++ templated `std::unordered_map`

#### Pros

* The sizes of the key and value types are known at compile time.
* The size of the data structure is always known, and no boxing is needed.
* Because the code is fully determined at compile time, other compile-time optimisations such as inlining, constant folding, and dead code elimination can come into play.

In summary, a map in C++ is as fast and efficient as one you would write by hand for each key/value type combination, because that is effectively what it is.

#### Cons

* Code bloat. Each different map is a different type. If you have N map types in your source, you will have N copies of the map code in your binary.
* Compile time bloat. Because of the way header files and templates work, the `std::unordered_map` code has to be generated, compiled, and optimised in every file that uses it.

### Java util HashMap

#### Pros

* One implementation of the map code serves any subclass of `java.lang.Object`. Only one copy of `java.util.HashMap` is compiled, and it is referenced from every class file.

#### Cons

* Everything must be an object, even things which are not. This means maps of primitive types must be converted to objects via boxing. Boxing adds pressure on the garbage collector, and the extra pointer indirection adds cache pressure (each object must be looked up via another pointer).
* Buckets are stored as linked lists rather than sequential arrays. This leads to a lot of pointer chasing while comparing objects.
* Hash and equals functions are left to the author of the code. Incorrect hash and equals functions can slow down maps that use those types, or worse, cause the map to behave incorrectly.

## Go's hashmap implementation

Now let's talk about how the map is implemented in Go. It retains many of the advantages of the implementations we have just discussed, without their drawbacks.

Just like C++ and Java, Go's hashmap is written *in Go*. But Go does not provide generic types, so how can we write a hashmap that works for (almost) any type?

### Does the Go runtime use interface{}?

No, the Go runtime does not use `interface{}` to implement its hashmap. Although packages such as `container/{list,heap}` do use `interface{}`, the runtime's map implementation does not.
### Does the compiler use code generation?

No, there is only one copy of the map implementation in a Go binary. And unlike Java, it does not box values in `interface{}`. So how does it work?

There are two parts to the answer, and both involve co-operation between the compiler and the runtime.

### Compile time rewriting

The first part of the answer is to understand how map lookups, insertions, and removals are implemented in the runtime package. During compilation, map operations are rewritten into calls to the runtime. For example:

```
v := m["key"]     → runtime.mapaccess1(m, "key", &v)
v, ok := m["key"] → runtime.mapaccess2(m, "key", &v, &ok)
m["key"] = 9001   → runtime.mapinsert(m, "key", 9001)
delete(m, "key")  → runtime.mapdelete(m, "key")
```

It's worth noting that the same thing happens with channels, but not with slices.

The reason is that channels are complicated data types. Send, receive, and `select` have complex interactions with the scheduler, so they are delegated to the runtime. By comparison, slices are much simpler. Operations such as slice access, `len`, and `cap` are handled by the compiler itself, while complicated cases like `copy` and `append` are delegated to the runtime.

### Map code explained

Now we know that the compiler rewrites map operations into calls to the runtime. We also know that inside the runtime there is a function called `mapaccess1`, a function called `mapaccess2`, and so on.

So, how does the compiler rewrite

```go
v := m["key"]
```

into

```go
runtime.mapaccess(m, "key", &v)
```

without using `interface{}`? The easiest way to explain how map types work in Go is to show you the actual signature of `runtime.mapaccess1`.

```go
func mapaccess1(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer
```

Let's walk through the parameters.

* `key` is a pointer to the value you provided as the key.
* `h` is a pointer to a `runtime.hmap` structure. `hmap` is the runtime's hashmap structure that holds the buckets and other housekeeping values.
* `t` is a pointer to a `maptype`.

Why do we need a `*maptype` if we already have a `*hmap`? `*maptype` is the special sauce that makes the generic `*hmap` work for (almost) any combination of key and value types. There is a `maptype` value for each unique map declaration in your program. For example, there is one `maptype` value describing maps from `string`s to `int`s, another describing maps from `string`s to `http.Header`s, and so on.

Rather than having a complete map *implementation* for each unique map declaration, as C++ does, the Go compiler creates a `maptype` during compilation and uses that value when calling into the runtime's map functions.

```go
type maptype struct {
	typ           _type
	key           *_type
	elem          *_type
	bucket        *_type // internal type representing a hash bucket
	hmap          *_type // internal type representing a hmap
	keysize       uint8  // size of key slot
	indirectkey   bool   // store ptr to key instead of key itself
	valuesize     uint8  // size of value slot
	indirectvalue bool   // store ptr to value instead of value itself
	bucketsize    uint16 // size of bucket
	reflexivekey  bool   // true if k==k for all keys
	needkeyupdate bool   // true if we need to update key on overwrite
}
```

Each `maptype` contains the details of the properties needed to map from key to elem for one kind of map. It has information about the key and about the elements. `maptype.key` holds information about the key we were passed a pointer to. We call these *type descriptors*.

```go
type _type struct {
	size       uintptr
	ptrdata    uintptr // size of memory prefix holding all pointers
	hash       uint32
	tflag      tflag
	align      uint8
	fieldalign uint8
	kind       uint8
	alg        *typeAlg
	// gcdata stores the GC type data for the garbage collector.
	// If the KindGCProg bit is set in kind, gcdata is a GC program.
	// Otherwise it is a ptrmask bitmap. See mbitmap.go for details.
	gcdata    *byte
	str       nameOff
	ptrToThis typeOff
}
```

Among other things, the `_type` type records the size of the type.
This is important because we only have a pointer to the key value, and we do not know how big it actually is or what type it is: an integer, a struct, and so on. We also need to know how to compare values of this type and how to hash them, and that is what the `_type.alg` field is for.

```go
type typeAlg struct {
	// function for hashing objects of this type
	// (ptr to object, seed) -> hash
	hash func(unsafe.Pointer, uintptr) uintptr
	// function for comparing objects of this type
	// (ptr to object A, ptr to object B) -> ==?
	equal func(unsafe.Pointer, unsafe.Pointer) bool
}
```

There is one `typeAlg` value for each *type* in your program.

Putting it all together, here is the `runtime.mapaccess1` function (slightly edited for clarity).

```go
// mapaccess1 returns a pointer to h[key].  Never returns nil, instead
// it will return a reference to the zero object for the value type if
// the key is not in the map.
func mapaccess1(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer {
	if h == nil || h.count == 0 {
		return unsafe.Pointer(&zeroVal[0])
	}
	alg := t.key.alg
	hash := alg.hash(key, uintptr(h.hash0))
	m := bucketMask(h.B)
	b := (*bmap)(add(h.buckets, (hash&m)*uintptr(t.bucketsize)))
```

The point to notice is the `h.hash0` parameter passed into `alg.hash`. `h.hash0` is a random seed generated when the map is created. It is how the Go runtime guards against hash collision attacks.

Anyone can read the Go source code, so they could come up with a set of values which, when hashed with the algorithm Go uses, all end up in the same bucket. The seed adds a degree of randomness to the hash function, providing some protection against collision attacks.

## Conclusion

I was delighted to give this talk at GoCon, because Go's map implementation is a compromise between those of C++ and Java, taking many of the advantages without many of the drawbacks.
Unlike Java, you can use scalar values such as characters and integers directly, without boxing. Unlike C++, instead of N `runtime.hashmap` implementations in the final binary, there are only N `runtime.maptype` values, a substantial saving in program size and compile time.

Now I want to be clear that I am not trying to tell you that Go should not have generics. My goal today was to describe the situation as it stands in Go 1 and how the map type works under that situation. The Go map implementation we have today is very fast, and provides most of the benefits of templated types without the downsides of code generation and compile time bloat.

I see it as a case study in design that deserves recognition.

1. You can read more about the `runtime.hmap` structure here: [https://dave.cheney.net/2017/04/30/if-a-map-isnt-a-reference-variable-what-is-it](https://dave.cheney.net/2017/04/30/if-a-map-isnt-a-reference-variable-what-is-it)

Via: https://dave.cheney.net/2018/05/29/how-the-go-runtime-implements-maps-efficiently-without-generics

Author: Dave Cheney; Translator: Alfred-zhong; Proofreader: polaris1119

This article was translated by GCTT and published by the Go Language Chinese Network.

This translation is published for learning and exchange purposes only, under the terms of the CC-BY-NC-SA license. If our work infringes on your interests, please contact us promptly.
Reprinting under CC-BY-NC-SA is welcome; please credit and keep the links to the original and the translation, along with the author and translator information.