Python Source Code Analysis: implementation of Dict objects

Source: Internet
Author: User

The source code examined here is CPython, the most widely used Python implementation.

First, look at the basic building block of dict, the entry:

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

This structure is one key-value pair in a dict. me_hash caches the hash value of me_key so that it does not have to be recomputed, a trade of space for time. Both me_key and me_value are PyObject pointers, which explains why the keys and values in a dict can be Python objects of any type (keys must additionally be hashable).

struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

This structure is the dict object itself. One might expect dict to be a variable-length object, so why does it start with PyObject_HEAD rather than PyObject_VAR_HEAD? On closer inspection, the variable length of dict differs from that of string, list, and tuple, whose number of valid elements is recorded in the ob_size field of PyObject_VAR_HEAD. A single count is not enough for dict, so dict simply bypasses PyObject_VAR_HEAD: ma_used records the number of active elements, and ma_fill records the number of slots that are either active or were once occupied (active plus dummy), which is what the load factor is computed from.
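To make the two counters concrete, here is a minimal Python sketch (a model for illustration, not CPython's C code). It assumes that ma_fill counts slots that are active or dummy, that ma_used counts only active slots, that the table has ma_mask + 1 slots, and that growth is triggered once two-thirds of the slots are filled. The helper name `needs_resize` is hypothetical.

```python
def needs_resize(ma_fill, ma_mask):
    """Return True when at least 2/3 of the slots are non-empty.

    ma_mask is assumed to be (number of slots - 1). Integer arithmetic
    avoids floating-point rounding in the comparison.
    """
    return ma_fill * 3 >= (ma_mask + 1) * 2


# Hypothetical history of an 8-slot table: three inserts, then one delete.
ma_used = 3 - 1   # active entries only
ma_fill = 3       # the deleted slot still counts as filled
ma_mask = 7

print(ma_used, ma_fill)                # 2 3
print(needs_resize(ma_fill, ma_mask))  # False: 9 < 16
print(needs_resize(6, ma_mask))        # True: 18 >= 16
```

The point of keeping both counters is visible in the example: a deletion lowers ma_used but not ma_fill, so the load factor still accounts for slots that probing has to step over.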

ma_mask is the table size minus one; it is combined with the hash value to compute a slot index.
ma_smalltable is another trade of space for time: a small built-in table that serves most small dicts (those that fit within PyDict_MINSIZE slots) without a separate allocation.
ma_lookup points to the function that implements the first probe and the subsequent probes.
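A short Python sketch of how a mask can turn a hash value into a slot index. It assumes the table size is a power of two and that the mask is the size minus one; the value 8 for the small-table size and the helper name `slot_index` are assumptions made for illustration.

```python
PYDICT_MINSIZE = 8  # assumed small-table size; must be a power of two


def slot_index(hash_value, ma_mask):
    """Reduce a hash to a slot index by keeping only its low bits."""
    return hash_value & ma_mask


ma_mask = PYDICT_MINSIZE - 1    # 0b111

print(slot_index(13, ma_mask))  # 5, the same as 13 % 8
print(slot_index(-1, ma_mask))  # 7: masking also copes with negative hashes
```

Masking is cheaper than a modulo operation, which is one reason a power-of-two table size is convenient.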

Before going into the details of the dict implementation, first consider open addressing, the collision resolution method that dict uses. Hashing maps an infinite set onto a finite one. With an ideal hash function, the expected elements would be spread evenly over the finite set and any element could be found in O(1) time. But an ideal hash function does not exist, and because of the nature of the mapping (infinite to finite), several elements will inevitably compete for the same position. Such collisions must be resolved. The existing ways to resolve them are:

    1. Open addressing
    2. Chaining (separate chaining)
    3. Rehashing
    4. Common overflow area

The basic idea of the common overflow area method is this: assume the hash function takes values in [0, m-1], set up a vector HashTable[0..m-1] as the base table, and set up a second vector OverTable[0..v] to store the records that collide.
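The overflow-area idea can be sketched in a few lines of Python. This is a toy under stated assumptions: the class name `OverflowTable` is hypothetical, the overflow vector is an unbounded list searched linearly, and `hash(key) % m` stands in for the hash function.

```python
class OverflowTable:
    """Base table plus one shared overflow list for colliding records."""

    def __init__(self, m):
        self.base = [None] * m  # plays the role of HashTable[0..m-1]
        self.overflow = []      # plays the role of OverTable

    def put(self, key, value):
        i = hash(key) % len(self.base)
        if self.base[i] is None or self.base[i][0] == key:
            self.base[i] = (key, value)
            return
        # The home slot belongs to another key: use the overflow area.
        for j, (k, _) in enumerate(self.overflow):
            if k == key:
                self.overflow[j] = (key, value)
                return
        self.overflow.append((key, value))

    def get(self, key):
        i = hash(key) % len(self.base)
        if self.base[i] is not None and self.base[i][0] == key:
            return self.base[i][1]
        for k, v in self.overflow:
            if k == key:
                return v
        raise KeyError(key)


t = OverflowTable(4)
t.put(1, "a")
t.put(5, "b")   # 5 % 4 == 1, collides with key 1
t.put(9, "c")   # 9 % 4 == 1, collides as well
print(t.get(9), len(t.overflow))  # c 2
```

Lookups stay fast only while the overflow list is short, since every collision ends in a linear scan of it.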

The first two methods are the simplest and most efficient. The following reviews open addressing and chaining.

Open addressing: when an element is inserted into the hash table, it first probes the position it should occupy. If that position (call it a) is already taken by another element, it probes again starting from a (this probe uses a different function from the first one). If the new position is also taken, it keeps probing until it finds a usable position (which may never happen if the table is full). Open addressing has one crucial problem to solve: what state should a position be left in when its element is removed from the table? If it is reset to the original empty state, valid elements that were placed further along the probe sequence can no longer be found, because a search follows the same probing rules and stops at an empty position. The probe function therefore has to be told that this position holds no valid element now, but that valid elements may still appear later in the probe sequence. Open addressing is also prone to collisions, mainly because an element displaced by an earlier collision now occupies the position where some other element would have succeeded on its first probe, so the table has to be kept larger than the number of elements it holds.
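The deletion problem described above can be reproduced with a toy table. The sketch below is an assumption-laden illustration: it uses linear probing for simplicity (not the probe function dict uses), stores bare keys, and the names `ProbingTable`, `EMPTY`, and `DUMMY` are hypothetical.

```python
EMPTY = None
DUMMY = object()  # marker: "an element used to live here"


class ProbingTable:
    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        """Yield slot indices in probe order (linear probing)."""
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def insert(self, key):
        first_free = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                if first_free is None:
                    first_free = i
                break
            if slot is DUMMY:
                if first_free is None:
                    first_free = i   # remember it, but keep looking for key
            elif slot == key:
                return               # already present
        if first_free is None:
            raise RuntimeError("table is full")
        self.slots[first_free] = key

    def contains(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return False         # a never-used slot ends the probe chain
            if slot is not DUMMY and slot == key:
                return True
        return False

    def remove(self, key, use_marker=True):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return
            if slot is not DUMMY and slot == key:
                self.slots[i] = DUMMY if use_marker else EMPTY
                return


# Keys 1 and 9 share home slot 1 in an 8-slot table, so 9 lands in slot 2.
broken = ProbingTable()
broken.insert(1)
broken.insert(9)
broken.remove(1, use_marker=False)  # slot 1 reset to the original empty state
print(broken.contains(9))           # False: the search stops at slot 1

fixed = ProbingTable()
fixed.insert(1)
fixed.insert(9)
fixed.remove(1)                     # slot 1 left as a marker
print(fixed.contains(9))            # True: the search steps over the marker
```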

Chaining: the idea is very simple. Since several elements may map to the same position, hang a linked list at that position and store every element that hashes there in the list. It is simple, and it also saves memory. Nevertheless, the Python designers did not choose it.
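For comparison, a minimal chaining sketch in Python. It is only an illustration: a Python list stands in for each linked list, the class name `ChainedTable` is hypothetical, and the table never grows.

```python
class ChainedTable:
    """Each slot holds a bucket with every (key, value) that hashes there."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def remove(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]             # no marker needed, unlike open addressing
                return
        raise KeyError(key)


t = ChainedTable()
t.put(1, "a")
t.put(9, "b")    # 9 % 8 == 1, shares a bucket with key 1
t.remove(1)
print(t.get(9))  # b
```

Deletion is trivial here because removing a node never hides other elements, which is the main contrast with open addressing.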

So why did Python's designers choose open addressing over chaining? The Python source contains this passage:

Open addressing is preferred over chaining since the link overhead for chaining would be substantial (100% with typical malloc overhead).

Because chaining has to allocate a linked-list node dynamically (malloc) for each element, its time efficiency is not as good as that of open addressing. The price is space: the load factor of an open-addressing table is kept no higher than 2/3, so its space overhead relative to chaining is considerable. Evidently Python was not designed in an era when only 512 KB of memory was available, and strict memory economy gave way to speed. This also reflects the fact that Python, being dynamic, loses time efficiency elsewhere and has to win back as much as it can through its own design.

Having covered open addressing and why the Python designers chose it, let's look at how dict implements the algorithm. As seen above, each key-value pair is stored in an entry structure, and Python uses the entry's own fields to indicate the state of each position: the original empty state, the state in which an active element has left, and the state in which an active element occupies the position.

    • Original empty: me_key: NULL; me_value: NULL
    • Active element left: me_key: dummy; me_value: NULL
    • Active element occupies: me_key: not NULL and not dummy; me_value: not NULL
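The three states can be told apart from the key field alone, as this small Python sketch shows. It is a model only: `None` stands in for NULL, `DUMMY` is a hypothetical stand-in for the special dummy key, and the short state names returned by `slot_state` are labels chosen for this example.

```python
DUMMY = object()  # stand-in for the special dummy key


def slot_state(me_key, me_value):
    """Classify an entry using the rules in the list above."""
    if me_key is None:
        return "unused"   # original empty: never held an element
    if me_key is DUMMY:
        return "dummy"    # an active element has left
    return "active"       # an active element occupies the slot


print(slot_state(None, None))   # unused
print(slot_state(DUMMY, None))  # dummy
print(slot_state("spam", 42))   # active
```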

The hashing and collision resolution performed by dict work as follows:

lookdict(k)

  1. index <- HASH1(k), freeslot <- NULL; depending on me_key and me_value at index, go to step 2, 3, or 4.
  2. The position at index is in the 'active element occupies' state: if its key matches k (same address, or equal content), the lookup succeeds; otherwise go to step 5.
  3. The position at index is in the 'original empty' state: the lookup fails. Return index if freeslot == NULL; otherwise return freeslot.
  4. The position at index is in the 'active element left' state: if freeslot == NULL, freeslot <- index; go to step 5.
  5. index <- HASH2(index); select step 2, 3, or 4 again according to the state of the new position.
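The steps above can be sketched in Python as follows. This is a model, not the C function: it returns a slot index rather than an entry pointer, each slot is either `None` or a `[hash, key, value]` list, and the probe recurrence used for HASH2 (`5*i + perturb + 1` with `perturb` shifted right by 5 each round) is modeled on the one in CPython 2.x and should be read as an assumption. The helper `insert` is hypothetical.

```python
PERTURB_SHIFT = 5
DUMMY = object()  # stand-in for the special dummy key


def lookdict(table, key, hash_value):
    """Return the slot index holding key, or the slot where key should go.

    len(table) must be a power of two and the table must contain at least
    one unused (None) slot, otherwise a failed lookup would never end.
    """
    mask = len(table) - 1
    perturb = hash_value & 0xFFFFFFFFFFFFFFFF  # unsigned, so shifting reaches 0
    i = perturb & mask                         # step 1: first probe
    freeslot = None
    while True:
        entry = table[i]
        if entry is None:                      # step 3: lookup fails
            return i if freeslot is None else freeslot
        if entry[1] is DUMMY:                  # step 4: remember first dummy
            if freeslot is None:
                freeslot = i
        elif entry[1] is key or (entry[0] == hash_value and entry[1] == key):
            return i                           # step 2: found
        i = (5 * i + perturb + 1) & mask       # step 5: next probe
        perturb >>= PERTURB_SHIFT


def insert(table, key, value):
    h = hash(key)
    table[lookdict(table, key, h)] = [h, key, value]


table = [None] * 8
insert(table, 1, "a")            # home slot 1
insert(table, 9, "b")            # collides at slot 1, lands in slot 7
table[1] = [0, DUMMY, None]      # delete key 1 by leaving a dummy behind

print(lookdict(table, 9, hash(9)))    # 7: the probe steps over the dummy
print(lookdict(table, 17, hash(17)))  # 1: a miss hands back the dummy slot
```

Returning the first dummy slot on a miss is what lets a later insertion reuse it instead of consuming a fresh slot.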

The implementation of dict's lookdict reflects Python's attention to memory utilization and to trading space for time, in the following respects:

      1. Memory utilization: when a position in the original empty state is reached, lookdict returns a dummy-state entry instead if one was found earlier in the probe, so that position can be reused.
      2. Efficiency: ma_table always points to the start of the hash space currently in use. After a larger space is allocated, ma_smalltable is abandoned and ma_table points to the first position of the newly allocated space.
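The second point can be illustrated with a small Python model. Everything here is an assumption made for the sketch: the class name `DictModel`, the `(key, value)` tuple layout, the small-table size of 8, and the use of linear probing when re-inserting. It also assumes that only active entries are carried over when the table grows.

```python
DUMMY = object()  # stand-in for the special dummy key
MINSIZE = 8       # assumed size of the embedded small table


class DictModel:
    def __init__(self):
        self.ma_smalltable = [None] * MINSIZE
        self.ma_table = self.ma_smalltable  # small dicts use the embedded table

    def resize(self, new_size):
        """Move active entries into a fresh table; new_size is a power of two."""
        old = self.ma_table
        self.ma_table = [None] * new_size   # ma_table now points at the new space
        mask = new_size - 1
        for entry in old:
            if entry is None or entry[0] is DUMMY:
                continue                    # unused and dummy slots are dropped
            i = hash(entry[0]) & mask
            while self.ma_table[i] is not None:
                i = (i + 1) & mask          # linear probing, for brevity only
            self.ma_table[i] = entry


d = DictModel()
print(d.ma_table is d.ma_smalltable)  # True
d.ma_table[1] = (1, "a")
d.ma_table[2] = (DUMMY, None)
d.resize(16)
print(d.ma_table is d.ma_smalltable)  # False
print(d.ma_table[1])                  # (1, 'a')
```

Because callers always go through ma_table, the lookup code does not need to know whether it is working on the embedded table or on separately allocated space.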
