Python source code analysis Reading Notes: Chapter 5-dict objects

Source: Internet
Author: User
Chapter 5-dict objects

Like map in the C++ STL, dict in Python is a mapping container (key -> value), but it is implemented differently. Because Python itself relies heavily on dict-like structures (for example, the intern mechanism for string objects) and needs them to be fast, it does not use a balanced binary tree as STL map does. It uses a hash table instead, so a lookup can complete in O(1) time in the best case.
A hash table has to resolve collisions, and dict does so with open addressing. My guess at the reason is that open addressing makes better use of the CPU cache than separate chaining (the "zipper" method), so the cache hit rate is higher.
The probe function is i = (i << 2) + i + perturb + 1; the slot actually examined is i & ma_mask, and perturb is shifted right by 5 bits (that is, divided by 2^5) after each probe.
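The probe recurrence can be sketched in Python to see which slots it visits. This is only an illustration of the formula quoted above, not the C code itself; the function name probe_sequence and the 64-bit unsigned treatment of the hash are assumptions made for the example.

```python
PERTURB_SHIFT = 5  # perturb is shifted right by 5 bits, i.e. divided by 2**5


def probe_sequence(hash_value, mask, count):
    """Return the first `count` slot indices probed for `hash_value`.

    `mask` is the number of slots minus 1 (the number of slots is a power of 2).
    """
    perturb = hash_value & 0xFFFFFFFFFFFFFFFF  # assumption: treat the hash as unsigned
    i = perturb & mask                         # the first probe is simply hash & mask
    slots = [i]
    while len(slots) < count:
        i = (i << 2) + i + perturb + 1         # same as 5*i + perturb + 1
        slots.append(i & mask)                 # only the low bits pick the slot
        perturb >>= PERTURB_SHIFT              # high hash bits fade out over time
    return slots


if __name__ == "__main__":
    # With 8 slots and hash 0, the sequence visits every slot exactly once.
    print(probe_sequence(0, 7, 8))
```

Once perturb has been shifted down to 0, the recurrence reduces to i = 5*i + 1 modulo the table size, which cycles through every slot, so a probe is guaranteed to reach a free slot eventually if one exists.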

Each slot in the dict hash table is a custom entry structure:
typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
The field names are self-explanatory: me_hash caches the hash of the key, and me_key and me_value hold the key and the value.

Each entry is in one of three states: unused, active, or dummy.
Unused: me_key == NULL and me_value == NULL; the entry is idle.
Active: me_key != NULL and me_value != NULL; the entry is in use.
Dummy: me_key == dummy and me_value == NULL; the entry held a key that has since been deleted.
A probe ends when it reaches an unused entry. Deletions are unavoidable, though, and if a deleted active entry were simply marked unused, every entry after it on the same probe chain would become unreachable. The dummy state exists to prevent this. When a probe meets a dummy entry, it knows the slot is free for reuse but that it must keep probing. This solves the problem of a probe chain being broken by a deletion.
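The role of the dummy state is easier to see in a small model. The sketch below is a toy fixed-size table written for illustration; the class TinyTable, the sentinels UNUSED and DUMMY, and the method names are all hypothetical, and it assumes the table is never filled because it does not resize.

```python
UNUSED = object()  # stands in for me_key == NULL
DUMMY = object()   # stands in for the shared dummy key


class TinyTable:
    """Fixed-size open-addressing table; it never resizes, so keep it sparse."""

    def __init__(self, size=8):
        self.mask = size - 1              # size must be a power of 2
        self.keys = [UNUSED] * size
        self.values = [None] * size

    def _probe(self, key):
        h = hash(key) & 0xFFFFFFFFFFFFFFFF  # assumption: treat the hash as unsigned
        i, perturb = h & self.mask, h
        while True:
            yield i & self.mask
            i = (i << 2) + i + perturb + 1
            perturb >>= 5

    def _lookup(self, key):
        """Return the slot holding key, else the slot an insert should use."""
        free = None
        for i in self._probe(key):
            k = self.keys[i]
            if k is UNUSED:               # end of the probe chain
                return i if free is None else free
            if k is DUMMY:                # deleted slot: remember it, keep probing
                if free is None:
                    free = i
            elif k == key:
                return i

    def _is_active(self, i):
        return self.keys[i] is not UNUSED and self.keys[i] is not DUMMY

    def set(self, key, value):
        i = self._lookup(key)
        self.keys[i], self.values[i] = key, value

    def get(self, key, default=None):
        i = self._lookup(key)
        return self.values[i] if self._is_active(i) else default

    def delete(self, key):
        i = self._lookup(key)
        if not self._is_active(i):
            raise KeyError(key)
        self.keys[i], self.values[i] = DUMMY, None  # active -> dummy
```

With 8 slots, the integer keys 0 and 8 both start probing at slot 0. After inserting both and deleting 0, a lookup of 8 still succeeds because the probe steps over the dummy in slot 0 instead of stopping there; a later insert of another colliding key reuses that dummy slot.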

The dict object is defined as:
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;  /* # active + # dummy */
    Py_ssize_t ma_used;  /* # active */
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};
ma_fill records the number of entries in the active or dummy state.
ma_used records the number of entries in the active state.
ma_mask equals the total number of slots minus 1. A key's hash value is likely to exceed the number of slots, so it has to be reduced to a valid index before it can be used. The number of slots is always a power of 2, for example binary 1000 (8 slots), so subtracting 1 gives the mask binary 0111. ANDing the hash with the mask limits the index to the range 0 to 7, that is, to the valid slots. Clever :)
ma_smalltable is the built-in table of PyDict_MINSIZE slots that every dict starts with.
ma_table initially points to ma_smalltable. If the dict later grows, it points to the newly allocated slot space.
ma_lookup is a pointer to the lookup function.
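The masking trick can be checked with a few lines of Python. This is a minimal sketch; the helper name slot_index is made up for the example, and the table size of 8 is just the example size used in the description of ma_mask.

```python
def slot_index(hash_value, mask):
    """Reduce a hash to a valid slot index; mask is (number of slots - 1)."""
    return hash_value & mask


if __name__ == "__main__":
    size = 8          # a power of 2: binary 1000
    mask = size - 1   # binary 0111
    for h in (5, 8, 13, 1000003, -1):
        # For a power-of-2 size, hash & mask gives the same result as hash % size.
        print(h, slot_index(h, mask), h % size)
```

The AND only works as a replacement for the modulo because the size is a power of 2; for any other size, size - 1 would have zero bits in it and some slots could never be selected.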

Creating a dict object is simple. First check whether the object pool (free list) holds an available object and, if so, use it directly; otherwise allocate one from the heap. The fill and used fields are then set to 0. Strings are by far the most common keys in Python, so the lookup function has a string-optimized version, lookdict_string. If the key turns out not to be a string object, the general lookdict function is called instead.

Insertion is done by the insertdict function. Inserting means: if the key does not exist yet, add the key-value pair; if it does, overwrite its value. So the entry for the key is first obtained through the function that ma_lookup points to. If the entry's me_value is not NULL, the key was found and its value is replaced. Otherwise, the new key-value pair is stored directly in the returned entry.
For expressions such as d[key] = value, Python calls PyDict_SetItem, a wrapper around insertdict. PyDict_SetItem computes the hash of the key and passes the required information to insertdict. Afterwards it decides whether to resize based on how much space is left in ma_table. Both experience and theory show that the probability of collisions rises sharply once the table is more than 2/3 full, so the table is resized when the fill level passes 2/3.
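The 2/3 rule can be written as an integer comparison that avoids division. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the use of >= (resize as soon as the fill level reaches 2/3) is how I recall the check in CPython 2.x rather than something stated in these notes.

```python
def needs_resize(fill, mask):
    """True once the filled slots (active + dummy) reach 2/3 of the table."""
    return fill * 3 >= (mask + 1) * 2


def first_fill_triggering_resize(mask):
    """Smallest fill count at which a table with this mask would be resized."""
    fill = 0
    while not needs_resize(fill, mask):
        fill += 1
    return fill


if __name__ == "__main__":
    for slots in (8, 32, 128):
        print(slots, first_fill_triggering_resize(slots - 1))
```

Note that the test uses the fill count (active plus dummy entries), not only the active ones, because dummy entries lengthen probe chains just as active entries do.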

Deleting an entry from a dict is simpler. Compute the hash, locate the entry, change it from active to dummy, and update the counts: ma_used decreases by one, while ma_fill is unchanged because a dummy entry still counts as filled.

Finally, the object pool. As with the list object pool described earlier, dealloc frees only the table memory and then puts the dict object itself into the pool, where a later new can reuse it. This reduces the number of allocations made from the heap.
