Implementation of Go language map


The map in Go is implemented as a hash table under the hood; you can find its implementation in $GOROOT/src/pkg/runtime/hashmap.goc.

Data structure

Some key fields in the data structure of the hash table are as follows:

struct Hmap {
    uint8   B;            // can hold up to 2^B items
    uint16  bucketsize;   // size of each bucket
    byte    *buckets;     // array of 2^B buckets
    byte    *oldbuckets;  // previous bucket array; non-nil only while growing
};

The structure above shows only some of the fields of Hmap. Note that buckets is an array of buckets used directly, not an array of Bucket* pointers. This means the allocation of the first bucket differs from that of the buckets behind it on the overflow chain: the first buckets form one contiguous block of memory, while buckets on the overflow chain are allocated with mallocgc.

This hash structure uses an extendible hashing scheme: which bucket a value belongs to is determined by the hash value mod the current hash table size, and the table size is always a power of 2, the 2^B in the structure above. Each grow doubles the previous size. The buckets and oldbuckets fields together implement incremental growing: normally buckets is used directly and oldbuckets is nil; while the table is growing, oldbuckets is non-nil and buckets is twice the size of oldbuckets.

The concrete bucket structure is as follows:

struct Bucket {
    uint8  tophash[BUCKETSIZE]; // top 8 bits of each entry's hash; the low bits locate the bucket in the bucket array
    Bucket *overflow;           // overflow bucket chain, if any
    byte   data[1];             // BUCKETSIZE keys followed by BUCKETSIZE values
};

Here BUCKETSIZE is a macro defined as 8. Each bucket holds at most 8 key/value pairs; when more than 8 keys hash to the same bucket, a new bucket is allocated and chained to the previous one as an overflow bucket.

The hash of a key is computed with a hash algorithm chosen according to the key's type. The low-order bits of the hash are used as the index into the buckets array in the Hmap structure to find the bucket where the key lives, and the top 8 bits of the hash are stored in the bucket's tophash. Note that the top 8 bits are not used as a key/value offset inside the bucket; they serve as a cheap pre-check, matched against each slot of the tophash array in lookup order: compare the high bits of the sought hash with tophash[i], and only when they are equal compare the bucket's i-th key with the given key. If the keys are equal, the corresponding value is returned; otherwise the search continues through the overflow buckets in the same way.
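As a rough Go-level sketch of this pre-check (the real code is runtime C; the types and names below are invented for illustration):

const bucketSize = 8

// A toy stand-in for the runtime's Bucket: 8 tophash bytes, then 8 keys,
// then 8 values, plus a pointer to the overflow chain.
type slotBucket struct {
	tophash  [bucketSize]uint8
	keys     [bucketSize]string
	values   [bucketSize]int
	overflow *slotBucket
}

// lookup walks the bucket chain, comparing the cheap one-byte tophash first
// and doing the full key comparison only when the tophash matches.
func lookup(b *slotBucket, key string, hash uint64) (int, bool) {
	top := uint8(hash >> 56) // the top 8 bits of the hash
	for ; b != nil; b = b.overflow {
		for i := 0; i < bucketSize; i++ {
			if b.tophash[i] == top && b.keys[i] == key {
				return b.values[i], true
			}
		}
	}
	return 0, false
}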

The overall storage layout of the hash table is shown below (the figure is borrowed from another author and has some minor issues):

Figure 2.2 Storage structure of Hmap

Note one detail: the order in which keys and values are placed in a bucket. All the keys are stored together, followed by all the values. Why not store each key next to its corresponding value? That layout would be key1/value1/key2/value2/... Consider a map[int64]int8: with interleaved storage, byte alignment would force 7 padding bytes after every int8 value, wasting a great deal of space. This small detail shows the thought that went into Go's design.
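A quick way to see the waste, assuming a hypothetical interleaved layout (these types are mine, not the runtime's):

// Interleaved layout: every int8 value is followed by 7 bytes of padding
// so that the next int64 key stays 8-byte aligned.
type interleaved struct {
	key int64
	val int8
} // unsafe.Sizeof(interleaved{}) == 16, so 8 pairs cost 128 bytes

// Grouped layout, as the map buckets use: keys together, values together.
type grouped struct {
	keys [8]int64 // 64 bytes
	vals [8]int8  // 8 bytes
} // unsafe.Sizeof(grouped{}) == 72 for the same 8 pairs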

Incremental expansion

As we all know, a hash table trades space for time: access speed is directly tied to the fill (load) factor, so when the hash table gets too full, it needs to grow.

If the size of the hash table before growing is 2^B, the size after growing is 2^(B+1); each grow doubles the previous size. Because the table size is always a power of 2, (hash mod 2^B) is equivalent to (hash & (2^B - 1)). This simplifies the computation and avoids an expensive modulo operation.
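For example (a sketch of the index computation, not the runtime's code):

// bucketIndex picks the bucket for a hash when the table has 1<<B buckets.
// Because the size is a power of two, the modulo reduces to a bit mask.
func bucketIndex(hash uintptr, B uint8) uintptr {
	return hash & (uintptr(1)<<B - 1) // same result as hash % (1 << B)
}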

Suppose the capacity is X before growing and Y after. For a given hash value, (hash mod X) is in general not equal to (hash mod Y), so after growing, the position of every entry in the hash table must be recalculated. When the hash table grows, the old key/value pairs are rehashed onto the new table (this is called evacuate in the source code). Rather than doing all of this at once after the grow, the work is completed progressively: one or two pairs are moved on each insert and delete. This is the Go language's incremental growing.
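The following Go sketch shows the shape of that incremental move; all of the names are illustrative, since the real evacuate lives in the runtime's C code:

type entry struct{ key, val int }

type table struct {
	buckets    [][]entry // the new, doubled bucket array
	oldbuckets [][]entry // the previous array; non-nil only while growing
	nevacuate  int       // old buckets below this index are already moved
}

func hash(k int) int { return k * 0x9E3779B1 } // toy hash function

// evacuateOne moves a single old bucket into the new table. While growing,
// the runtime performs a step like this on every insert and delete.
func (t *table) evacuateOne() {
	if t.oldbuckets == nil {
		return
	}
	for _, e := range t.oldbuckets[t.nevacuate] {
		i := hash(e.key) & (len(t.buckets) - 1) // re-index into the doubled table
		t.buckets[i] = append(t.buckets[i], e)
	}
	t.oldbuckets[t.nevacuate] = nil // mark this old bucket as evacuated
	t.nevacuate++
	if t.nevacuate == len(t.oldbuckets) {
		t.oldbuckets = nil // everything has moved; the old table can be freed
	}
}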

Why grow incrementally? Mainly to bound the response time of map operations. Suppose a map directly backs a web application with strict real-time requirements: without incremental growing, once the map holds many elements, the program would stall during the grow and be unable to respond to requests for a long stretch. Incremental growing essentially amortizes the total cost of the grow across subsequent hash operations.

Growing allocates a new table twice the size of the old one and moves the old buckets into the new table. A moved bucket is not removed from oldbuckets; instead it is marked as evacuated.

Because this work completes gradually, part of the data lives in the old table and part in the new one, which affects the logic of the hash table's insert, delete, and lookup operations. oldbuckets is freed only after every bucket has been moved from the old table to the new one.

At what load factor should the table grow? If it grows too often, space utilization is very low; if it grows too rarely, many overflow buckets form and lookup efficiency declines. How is this balance chosen? In Go it is controlled by a macro (#define LOAD 6.5): when the number of elements in the table exceeds 6.5 times the number of buckets, a grow is triggered. So where did 6.5 come from? It was derived from a test program written by the author; unfortunately the test's source could not be found, but the author published its results:

        LOAD   %overflow   bytes/entry   hitprobe   missprobe
        4.00      2.13       20.77         3.00       4.00
        4.50      4.05       17.30         3.25       4.50
        5.00      6.85       14.77         3.50       5.00
        5.50     10.55       12.94         3.75       5.50
        6.00     15.27       11.67         4.00       6.00
        6.50     20.90       10.79         4.25       6.50
        7.00     27.14       10.15         4.50       7.00
        7.50     34.03        9.73         4.75       7.50
        8.00     41.10        9.40         5.00       8.00

        %overflow   = percentage of buckets which have an overflow bucket
        bytes/entry = overhead bytes used per key/value pair
        hitprobe    = # of entries to check when looking up a present key
        missprobe   = # of entries to check when looking up an absent key

As the table shows, the author chose a fairly moderate value.
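In sketch form, the trigger is simply this (the function is illustrative; only the 6.5 constant comes from the runtime):

// tooFull reports whether a table with `count` elements and 1<<B buckets
// has passed the load factor and should grow.
func tooFull(count int, B uint8) bool {
	return float64(count) > 6.5*float64(uint(1)<<B)
}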

Lookup process
    1. Compute the hash value from the key.
    2. If the old table exists (i.e., the map is growing), look in the old table first. If the bucket found there has already been evacuated, go to step 3; otherwise, return the corresponding value from it.
    3. Look up the corresponding value in the new table (these steps are sketched in code below).
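A minimal Go sketch of these steps, with invented types standing in for the runtime's C structures:

type bucketSketch struct {
	evacuated  bool
	keys, vals []int
}

func (b *bucketSketch) find(key int) (int, bool) {
	for i, k := range b.keys {
		if k == key {
			return b.vals[i], true
		}
	}
	return 0, false
}

type hmapSketch struct {
	count      int
	buckets    []*bucketSketch
	oldbuckets []*bucketSketch // non-nil only while growing
}

func hashKey(k int) int { return k * 0x9E3779B1 } // toy hash function

// lookup follows the three steps: compute the hash, consult the old table
// if it exists and its bucket is not yet evacuated, else use the new table.
func (h *hmapSketch) lookup(key int) (int, bool) {
	hv := hashKey(key)
	if h.oldbuckets != nil {
		b := h.oldbuckets[hv&(len(h.oldbuckets)-1)]
		if !b.evacuated {
			return b.find(key) // step 2: the old table still owns this bucket
		}
	}
	return h.buckets[hv&(len(h.buckets)-1)].find(key) // step 3
}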

One detail deserves attention here. A careless reading might suggest that, since the low bits of the hash locate the bucket within the array, the high 8 bits must serve as the key/value offset inside the bucket. In fact, the high 8 bits are not used as an offset at all; they are used to speed up key comparisons.

do { // for each bucket b in the chain
    // compare the tophash stored in each slot with the high bits of the sought hash
    for(i = 0, k = b->data, v = k + h->keysize * BUCKETSIZE;
        i < BUCKETSIZE;
        i++, k += h->keysize, v += h->valuesize) {
        if(b->tophash[i] == top) {
            k2 = IK(h, k);
            // only on a tophash match do the full key comparison
            t->key->alg->equal(&eq, t->key->size, key, k2);
            if(eq) {
                *keyp = k2;
                return IV(h, v);
            }
        }
    }
    b = b->overflow; // advance to the next bucket in the overflow chain
} while(b != nil);
Insert process
    1. Compute the hash value from the key and locate the corresponding bucket.
    2. If the bucket is still in the old table, evacuate it to the new table first.
    3. Scan the bucket for a free slot; if the key already exists, update its value.
    4. Based on the number of elements in the table, decide whether to grow it.
    5. If the target bucket is already full, allocate a new bucket and chain it as an overflow bucket.
    6. Insert the key/value pair into the bucket (see the sketch after this list).
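Continuing the sketch types from the lookup example above, the insert path looks roughly like this (a compressed, slice-based illustration; the real runtime scans fixed 8-slot buckets and chains overflow buckets instead):

// insert follows the numbered steps above in simplified form.
func (h *hmapSketch) insert(key, val int) {
	hv := hashKey(key)
	// step 2 would evacuate the old bucket here if h.oldbuckets != nil
	b := h.buckets[hv&(len(h.buckets)-1)]
	for i, k := range b.keys {
		if k == key {
			b.vals[i] = val // step 3: the key exists, update in place
			return
		}
	}
	if float64(h.count+1) > 6.5*float64(len(h.buckets)) {
		// step 4: past the load factor; the real map starts an incremental grow
	}
	b.keys = append(b.keys, key) // steps 5-6: extend the bucket and insert
	b.vals = append(b.vals, val)
	h.count++
}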

Here are a few details to note.

While the table is growing, oldbuckets is frozen: lookups still consult oldbuckets, but no data is ever inserted into it. If the key to insert is found in oldbuckets, it is handled by migrating its bucket to a new bucket and adding the evacuated mark, and one more pair is migrated as extra grow work.

The key/value pair is then inserted as soon as the first free slot in a bucket is found. This means an entry earlier in the bucket can shadow one further back (similar to a common technique in storage system design: append the new data directly and let the new version of the data shadow the old version). The traversal ends as soon as the same key or the first free slot is found. However, this also means that delete must traverse the bucket's entire overflow chain and remove every entry with the matching key. So the current map design is optimized for insertion, and deletion is less efficient than insertion.
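A sketch of why delete is the expensive direction (types invented for illustration):

type chainBucket struct {
	keys, vals [8]int
	used       [8]bool // which slots are occupied
	overflow   *chainBucket
}

// del must walk the entire overflow chain: because insert stops at the
// first free slot, a stale duplicate of the key could sit further back,
// so deletion cannot stop at the first match.
func del(b *chainBucket, key int) {
	for ; b != nil; b = b.overflow {
		for i := range b.keys {
			if b.used[i] && b.keys[i] == key {
				b.used[i] = false // clear the slot, but keep scanning
			}
		}
	}
}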

Performance optimization in map design

Reading the map source, I found that the author made many deliberate design choices. My own understanding is limited, so rather than judging their pros and cons, I simply share them with the reader here.

Hmap holds an array of buckets, not an array of bucket pointers. The upside is that a large block of memory can be allocated at once, reducing the number of allocations and avoiding repeated calls to mallocgc. The downsides: first, the classic extendible-hashing trick no longer applies, and growing must copy the values of the entire array (if the implementation used an array of bucket pointers, growing would only copy pointers, which is much cheaper); second, the first bucket is laid out differently from the ones behind it, which makes the deletion logic a little more complicated. For example, a bucket at the back of an overflow chain can be freed directly, while the first bucket must wait until evacuation completes and the entire oldbuckets array is freed.

There is no freelist to reuse deleted nodes. The author left a TODO note about this but thinks it would not help much: on one hand, bucket sizes vary, making reuse awkward; on the other hand, the underlying storage is already backed by a memory pool, so what is not reused here will be reused at the memory-allocation layer anyway.

Direct versus indirect key/value storage in buckets. This optimization is well done. Reading the code, you will find that if a key or value is smaller than 128 bytes, its value is stored directly in the bucket; otherwise, the bucket stores a pointer to the actual key/value data.
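Conceptually the choice is just a size test on the key and value types (an illustrative function, not the runtime's API):

// storeIndirect reports whether a type of the given size is stored in the
// bucket as a pointer to the data rather than as the data itself.
func storeIndirect(typeSize uintptr) bool {
	const maxDirectSize = 128 // bytes; larger keys/values go behind a pointer
	return typeSize > maxDirectSize
}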

Buckets hold 8 key/value pairs, compared sequentially during lookup. At first glance the high 8 bits of the hash look like they could serve as an offset, but they are used for faster comparison instead: after locating a bucket, lookup proceeds by sequential comparison. On reflection this is fine. With only 8 slots per bucket, a sequential scan is not expensive; lookup is still O(1), just with a slightly larger constant factor. It is equivalent to hashing into a small range and then searching that range in order.

Insert is optimized over delete. As mentioned earlier, insert stops at the first matching key or the first free slot; if the same key somehow appeared more than once in a bucket chain, the earlier entry would shadow the later one (in practice this should not happen). Delete, by contrast, must traverse all of the bucket's overflow chains. This map design is optimized for insertion, which seems reasonable for typical application workloads.

The author also lists two more TODOs: merge buckets that are almost empty, and consider shrinking the table when it holds few elements (as it stands, the implementation only ever grows).

Original: https://www.w3cschool.cn/go_internals/go_internals-xe3r282i.html
