The operation principle of Python program and garbage collection


1. A simple example

Let's start with a simple example consisting of two files, foo.py and demo.py.

[foo.py]
def add(a, b):
    return a + b

[demo.py]
import foo

a = [1, 'python']
a = 'a string'

def func():
    a = 1
    b = 257
    print(a + b)

print(a)

if __name__ == '__main__':
    func()
    foo.add(1, 2)

Run the program:

python demo.py

Output:

a string
258

In addition, a new file, foo.pyc, appears in the directory.

2. The Magic Behind

Having seen the program's output, let's walk through the code line by line.

2.1 Module

Python treats each .py file as a module. One of these modules is the main module, which is the entry point of the program. In this example, the main module is demo.py.

2.2 Compiling

When python demo.py is run, the Python interpreter starts and compiles demo.py into a bytecode object, a PyCodeObject.

You may wonder why the result of compilation is an object rather than a pyc file, by analogy with Java's class files. A short explanation follows.

In the Python world, everything is an object: a function is an object, a type is an object, a class is an object (a class is a user-defined type; before Python 2.2, built-in types such as int and dict were distinct from classes, and they were later unified so that all of them inherit from object). Even compiled bytecode is an object, and a .pyc file is the on-disk representation of that bytecode object (PyCodeObject).

At run time, the result of compilation is a PyCodeObject that exists only in memory. For an imported module (see section 3.1), the compiled result is also saved to a pyc file, so that next time the module can be loaded straight into memory without recompiling. The pyc file is simply the on-disk representation of the PyCodeObject.

The PyCodeObject contains the strings and constant values from the Python source code, together with the bytecode instructions produced by compiling it. It also stores the mapping between bytecode instructions and source line numbers, so that when an exception occurs, Python can report which line of code raised it.
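The contents of a code object can be inspected from Python itself. The following is a minimal sketch, assuming CPython 3; the source snippet and the file name "<sketch>" are made up for illustration, while compile() and the co_* attributes are part of the standard code-object interface.

```python
# Compile a small source string into a code object (the in-memory PyCodeObject).
source = "a = 'a string'\nb = 257\n"
code = compile(source, "<sketch>", "exec")

constants = code.co_consts         # constant values found in the source
names = code.co_names              # symbol names used by the code
bytecode = code.co_code            # raw bytecode, stored as a bytes object
first_line = code.co_firstlineno   # starting point of the line-number mapping

print(constants, names, len(bytecode), first_line)
```

The exact bytes in co_code differ between Python versions, which is one reason the pyc format described next carries a version marker.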

2.3 The pyc File

A pyc file contains three pieces of information: Python's magic number, a timestamp recorded when the pyc file is created (CPython stores the modification time of the source file it was compiled from), and the PyCodeObject.

The magic number is an integer value defined by Python. Different versions of the Python implementation generally define different magic numbers, which are used to guarantee compatibility. For example, to prevent a pyc file compiled by an older version from being executed by a newer version of Python, it is enough to check the magic number. Because the bytecode instructions defined by different Python versions may differ, executing a pyc file without this check could fail.
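The layout described above can be checked on a freshly compiled file. This is a sketch under stated assumptions: it targets Python 3.7 or later, where the pyc header is 16 bytes long (magic number, flags, timestamp, source size) rather than the shorter header of the Python 2 era, and it writes a throwaway foo.py into a temporary directory.

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, "foo.py")
    pyc_path = os.path.join(tmp, "foo.pyc")
    with open(source_path, "w") as f:
        f.write("def add(a, b):\n    return a + b\n")
    # Compile the source file to an explicit pyc path.
    py_compile.compile(source_path, cfile=pyc_path, doraise=True)
    with open(pyc_path, "rb") as f:
        data = f.read()

# The first four bytes are the magic number of the running interpreter.
magic_matches = data[:4] == importlib.util.MAGIC_NUMBER

# Assumption: 16-byte header (Python 3.7+); the rest is the marshalled code object.
code = marshal.loads(data[16:])
print(magic_matches, code.co_names)
```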

The following script can be used to create a pyc file. Usage:

python generate_pyc.py module_name

For example:

python generate_pyc.py demo

[generate_pyc.py]
import imp   # Python 2 era module (removed in Python 3.12)
import sys

def generate_pyc(name):
    # Locate the module by name, then import it; importing compiles
    # the source and writes the pyc file as a side effect.
    fp, pathname, description = imp.find_module(name)
    try:
        imp.load_module(name, fp, pathname, description)
    finally:
        if fp:
            fp.close()

if __name__ == '__main__':
    generate_pyc(sys.argv[1])

2.4 Bytecode Instructions

Why is a pyc file also called a bytecode file? Because it stores binary byte data rather than human-readable text.

The Python standard library provides a tool, dis, for disassembling code into its bytecode. The dis module has a function that is also named dis; it takes a code object and prints the bytecode instructions it contains.

s = open('demo.py').read()
co = compile(s, 'demo.py', 'exec')
import dis
dis.dis(co)

Running the code above prints the bytecode instructions compiled from demo.py:

  1           0 LOAD_CONST               0 (-1)
              3 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (foo)
              9 STORE_NAME               0 (foo)

  3          12 LOAD_CONST               2 (1)
             15 LOAD_CONST               3 (u'python')
             18 BUILD_LIST               2
             21 STORE_NAME               1 (a)

  4          24 LOAD_CONST               4 (u'a string')
             27 STORE_NAME               1 (a)

  6          30 LOAD_CONST               5 (<code object func at 00D97650, file "demo.py", line 6>)
             33 MAKE_FUNCTION            0
             36 STORE_NAME               2 (func)

 11          39 LOAD_NAME                1 (a)
             42 PRINT_ITEM
             43 PRINT_NEWLINE

 13          44 LOAD_NAME                3 (__name__)
             47 LOAD_CONST               6 (u'__main__')
             50 COMPARE_OP               2 (==)
             53 POP_JUMP_IF_FALSE       82

 14          56 LOAD_NAME                2 (func)
             59 CALL_FUNCTION            0
             62 POP_TOP

 15          63 LOAD_NAME                0 (foo)
             66 LOAD_ATTR                4 (add)
             69 LOAD_CONST               2 (1)
             72 LOAD_CONST               7 (2)
             75 CALL_FUNCTION            2
             78 POP_TOP
             79 JUMP_FORWARD             0 (to 82)
        >>   82 LOAD_CONST               1 (None)
             85 RETURN_VALUE

2.5 Python Virtual Machine

After demo.py has been compiled, the Python virtual machine takes over and executes the bytecode instructions. The virtual machine reads the bytecode instructions one at a time, in order, from the compiled PyCodeObject and executes each in the current context. Our program runs by repeating this fetch-and-execute loop.
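The fetch-and-execute loop can be pictured with a toy stack machine. This is only an illustrative sketch: the run function and the tuple-based instruction format are hypothetical, the opcode names are borrowed from the listing in section 2.4, and the real virtual machine is a C loop that handles far more instructions.

```python
def run(instructions):
    """Execute (opname, argument) pairs on a value stack; return the namespace."""
    stack = []
    names = {}
    for opname, arg in instructions:        # fetch the next instruction
        if opname == "LOAD_CONST":          # push a constant
            stack.append(arg)
        elif opname == "LOAD_NAME":         # push the object bound to a name
            stack.append(names[arg])
        elif opname == "STORE_NAME":        # bind a name to the top of the stack
            names[arg] = stack.pop()
        elif opname == "BINARY_ADD":        # pop two operands, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        else:
            raise ValueError("unknown instruction: " + opname)
    return names

# Roughly: a = 1; b = 257; total = a + b
program = [
    ("LOAD_CONST", 1), ("STORE_NAME", "a"),
    ("LOAD_CONST", 257), ("STORE_NAME", "b"),
    ("LOAD_NAME", "a"), ("LOAD_NAME", "b"),
    ("BINARY_ADD", None), ("STORE_NAME", "total"),
]
result = run(program)
print(result["total"])
```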

2.6 The import Statement

The first line of demo.py is import foo. The import statement loads a module; the other way to load a module is from xx import yy. The advantage of the from form is that only the required names are copied into the current namespace (namespaces are covered later).

As mentioned earlier, when a pyc file already exists it can be loaded directly, skipping compilation. But source files get edited, so how does Python make sure that an updated file is recompiled instead of loading the stale pyc file? The answer is the timestamp stored in the pyc file. When an import statement is executed and a pyc file already exists, Python compares that timestamp with the modification time of the source file to decide whether to recompile or to load the pyc directly. If no pyc file exists, the py file is compiled first.

2.7 Absolute and Relative Imports

The line import foo was covered in the previous section. That leaves the questions of what foo is and how Python finds it. These belong to Python's module import rules, which are not covered here; see PEP 328.

2.8 Assignment Statements

Next, execution reaches a = [1, 'python'], an assignment statement that defines a variable a whose value is [1, 'python']. But what exactly is a variable?

According to Wikipedia,

A variable is a storage location paired with an associated symbolic name, which contains some known or unknown quantity of information.

A variable is really just a symbol, a string, that is associated with an object stored in memory. In Python, a dict (Python's dict object) stores the mapping from variable symbols (strings) to objects.

An assignment statement creates this association. In this example, the symbol a is mapped to the list object [1, 'python'].

The next line, a = 'a string', binds the symbol a to a different object, the string 'a string'. From then on, operations on the variable a act on that string object.
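The name-to-object mapping can be made visible by running the two assignments against an explicit dict. This is a small sketch assuming Python 3; the variable names namespace, first_object, and second_object are chosen for illustration.

```python
# Use a plain dict as the namespace in which the assignments are executed.
namespace = {}

exec("a = [1, 'python']", namespace)
first_object = namespace["a"]      # the name 'a' maps to a list object

exec("a = 'a string'", namespace)
second_object = namespace["a"]     # the same name now maps to a str object

print(type(first_object).__name__, type(second_object).__name__)
```

The list object itself is unchanged by the second assignment; only the entry for 'a' in the dict points somewhere else.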

2.9 The def Statement

Our Python code keeps running and reaches def func(), which shows up in the bytecode as the MAKE_FUNCTION instruction. That instruction creates a function. Python is a dynamic language, and def is an executable statement that creates a function (just as class is a statement that creates a class), not merely a syntactic keyword. Functions are not created ahead of time; they are created when the def statement executes.

def func() creates a function object and then binds the name func to it.

Python cannot implement overloading in the style of C++ and Java. Overloading requires several functions with the same name that differ in the type or number of their parameters, but Python simply binds a function to a variable symbol (such as func here). When a def statement creates another function with the same name, the name is rebound to the new function object.
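The rebinding behaviour is easy to observe. In this sketch (Python 3 assumed; the name first_func is introduced only to keep a reference to the earlier function object), the second def does not add an overload, it replaces the binding of func.

```python
def func():
    return "first"

first_func = func          # keep a second name bound to the first function object

def func():                # executing this def rebinds the name 'func'
    return "second"

print(func(), first_func())
```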

2.10 Dynamic Types

Next, look at the code inside func, which begins with another assignment, a = 1. The name a is now bound to an object of a third type, an integer. How does Python implement dynamic typing? The answer lies in the objects themselves. The variable a is only a symbol (in fact, a string object); the type information is stored on the object. In Python, the core of the object mechanism is type information and the reference count, the latter being part of garbage collection.

type(a) returns the type of a, which here is int.

We skip b = 257 for now and look at print(a + b). print simply produces output, so we skip it as well. What we want to explore is a + b.

Because a and b themselves carry no type information, the types of the objects must be checked when a + b is executed; for example, 1 + 2 and "1" + "2" give different results.

This shows how much tedious work the Python virtual machine has to do for a simple a + b. It first has to check the types of the objects that a and b refer to, verify that the types are compatible (1 + "2" raises an exception), and then call the right addition routine for those types (numeric addition or string concatenation). For the same statement, a CPU needs only a single ADD instruction (after the operands have been MOVed into registers).
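The type-dependent behaviour of + can be demonstrated directly. The helper below is a hypothetical illustration (Python 3 assumed): it reports the operand types that the virtual machine has to inspect and returns None as the result when the types cannot be added.

```python
def add(a, b):
    """Return (type of a, type of b, result); result is None on a type mismatch."""
    try:
        result = a + b             # dispatches on the types of the two objects
    except TypeError:
        result = None              # e.g. int + str is not defined
    return type(a).__name__, type(b).__name__, result

print(add(1, 2))
print(add("1", "2"))
print(add(1, "2"))
```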

2.11 Namespaces

The walkthrough above left out one key concept: namespaces. In Python, classes, functions, and modules each correspond to a separate namespace, and each separate namespace corresponds to a PyCodeObject. Compiling demo.py above therefore produces two PyCodeObjects; the PyCodeObject of the function func is nested inside the module-level PyCodeObject of demo.py.

The purpose of a namespace is to determine which object a variable symbol refers to. Namespaces can be nested inside one another to form a chain, and while the Python virtual machine is running, a large share of its time is spent looking up symbols along this chain.

In Python, a namespace is implemented as a dict object that holds (name, obj) associations.

One more point: the line import foo creates a new name, foo, in the namespace of the demo.py module. foo is bound to a module object, which is created by executing the PyCodeObject compiled from foo.py.

2.11.1 The dir Function

The built-in function dir lists all the names in a namespace. One use is to inspect all the attributes and methods of a namespace (here, a class, function, or module).

For example, dir() lists the current namespace, and dir(sys) lists the names in the sys module.

2.11.2 LEGB Rules

Python looks up the object for a symbol in LEGB order:

locals -> enclosing function -> globals -> builtins

locals: the current namespace (such as a function or module). A function's parameters are also variables in its local namespace.

enclosing: the namespace of an outer, enclosing function (common with closures).

def fun1(a):
    def fun2():
        # a lives in the namespace of the enclosing function
        print(a)

globals: global variables, that is, the namespace of the module in which the function is defined.

a = 1

def fun():
    # a global variable must be declared with the global statement
    global a
    # this modifies the global variable instead of creating a new local one
    a = 2

builtins: the namespace of the built-in module. At startup, Python automatically loads many built-in functions and classes, such as dict, list, type, and print, into the __builtins__ module; dir(__builtins__) lists them. That is why so many functions and features are available without importing any module.

With namespaces in mind, it becomes clear why the line print(a) prints a string: the a = 1 inside func is bound in the function's local namespace, so the module-level a is left untouched.
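The four lookup levels can be exercised in one short script. This is a sketch assuming Python 3; all function and variable names are made up for illustration.

```python
name = "global"                    # lives in the module (global) namespace

def outer():
    name = "enclosing"             # lives in outer's local namespace

    def inner_local():
        name = "local"             # found first: the local namespace
        return name

    def inner_enclosing():
        return name                # not local, so found in the enclosing function

    return inner_local(), inner_enclosing()

def read_global():
    return name                    # not local or enclosing, so found in globals

def read_builtin():
    return len("abc")              # 'len' is found last, in builtins

print(outer(), read_global(), read_builtin())
```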

2.12 The Built-in Attribute __name__

Now for the line if __name__ == '__main__'. When a Python program starts, Python automatically sets a __name__ attribute on every module. It is normally the module's name, that is, its file name; the one exception is the main module, whose __name__ is set to __main__. This can be put to good use: for example, a module can run its test cases when it is executed as the main module, and simply provide its functions when it is imported by another module.

2.13 Function Calls

The last two lines are function calls and are not discussed here.

3. Review

Finally, a few topics deserve review and some additional detail.

3.1 The pyc File

Python generates pyc files only for modules that are likely to be used and loaded again. It assumes that modules loaded with the import statement are of this kind and generates pyc files for them. Modules that are used only once get no pyc file, and Python treats the main module as one of these. This explains why running python demo.py produces only foo.pyc.

As for when the pyc file is generated: it happens when the import statement is executed. from xx import yy counts as an import statement as well.
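One difference worth knowing when trying this on a current interpreter: Python 3 no longer writes foo.pyc next to foo.py but into a __pycache__ directory, with the interpreter tag in the file name. A minimal sketch, assuming Python 3 and a default configuration:

```python
import importlib.util

# Where would the cached bytecode for foo.py be stored?
pyc_location = importlib.util.cache_from_source("foo.py")
print(pyc_location)
```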

3.2 Small integer object pool

The integer 257 was chosen deliberately in demo.py in order to introduce the small integer object pool. Integers are used everywhere in programs, so Python uses a pool of small integer objects to improve speed and avoid repeatedly allocating and freeing memory for them.

Python defines small integers as the range [-5, 257), that is, -5 through 256. These integer objects are created in advance and are never garbage collected. Every integer in this range refers to the same object throughout a Python program, as the following example shows.

>>> a = 1
>>> id(a)
40059744
>>> b = 1
>>> id(b)
40059744
>>> c = 257
>>> id(c)
41069072
>>> d = 257
>>> id(d)
41069096

The id function returns an object's unique identifier, which can be thought of as its memory address.
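The same experiment can be written as a script. This sketch assumes CPython 3 and relies on an implementation detail, so other interpreters may behave differently. The integers are built with int("...") at run time because literals that appear in the same compiled block may be merged into a single constant, which would hide the effect.

```python
# Values inside the small-integer range come from the pre-built pool.
small_a = int("256")
small_b = int("256")

# Values outside the range are created as separate objects.
large_a = int("257")
large_b = int("257")

same_small = small_a is small_b    # identity comparison, not equality
same_large = large_a is large_b

print(same_small, same_large, large_a == large_b)
```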

For large integers, Python uses a large integer object pool. This means the following:

Each time a large integer is created, a new object is built. When that object is no longer used, however, its memory is not released; later objects reuse the memory of objects that are no longer in use. ("No longer in use" means that the reference count is 0 and the object can be destroyed.)

Python implementations use this kind of object pooling to avoid frequent allocation and release of memory, which makes Python a little faster.

3.3 The import Statement

The import statement loads a module, compiling it first if necessary. The most important thing it does, however, is execute the code of the imported module. Python is interpreted, and functions are created only when their definitions are executed. If the module's code were not executed, the functions inside the module would never be created, let alone be callable.

Another important effect of executing the code is that the symbolic names (that is, variable names) of the functions and other objects defined in the module are created in the module's namespace and bound to their objects, so that other modules can refer to them by name.

The Python virtual machine also caches modules that have already been imported in a global collection, sys.modules. The benefit is that if the program imports the same module again somewhere else, the virtual machine only has to return the cached entry.

As you may have guessed, sys.modules is a dict object, which you can verify with type(sys.modules).
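The caching can be confirmed with any standard-library module. The sketch below assumes Python 3 and uses json purely as an example module.

```python
import importlib
import sys

import json                                  # first import: executed and cached
again = importlib.import_module("json")      # second import: served from the cache

cached = sys.modules["json"]
print(type(sys.modules).__name__, again is json, cached is json)
```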

3.4 Multithreading

The demo.py example does not use multiple threads, but the topic is worth mentioning.

Multithreading usually raises the question of how threads are synchronized and how they access shared resources. Python synchronizes threads with a global interpreter lock, the GIL. When a Python program is single-threaded, the GIL is not enabled; when the user creates a thread, signalling that multiple threads will be used, the Python interpreter automatically activates the GIL and creates the context and data structures it needs.

The Python bytecode interpreter executes instructions one after another, and Python maintains a number that serves as its internal clock: if the value is N, Python runs the thread scheduling mechanism after every N instructions. The value can be read with the following code.

import sys
sys.getcheckinterval()  # 100

The thread scheduler hands the GIL to one thread; the thread holding the GIL can run while the others must wait. Because of the GIL, multithreaded Python code cannot take advantage of multi-core CPUs for CPU-bound work and can even be slower than a single thread. To use multiple cores, one suggestion is to use multiple processes.
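sys.getcheckinterval belongs to the Python 2 interpreter described here and was removed in Python 3.9. Since Python 3.2 the interpreter asks the running thread to release the GIL after a time interval rather than after a number of instructions. A minimal sketch of the Python 3 counterpart:

```python
import sys

# Thread switch interval in seconds (a float), the time-based successor
# to the instruction-count check interval.
interval = sys.getswitchinterval()
print(interval)
```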

3.5 Garbage Collection

Garbage collection is often implemented with reference counting, one of the most intuitive and simplest garbage collection techniques. Python uses reference counting too, but reference counting has these drawbacks:

Frequent reference count updates reduce execution efficiency

Reference counting cannot handle circular references

On top of reference counting, Python therefore also uses two mainstream garbage collection techniques: mark-and-sweep and generational collection.

Python garbage collection mechanism

The Python GC relies primarily on reference counting to track and reclaim garbage. On top of reference counting, it uses mark-and-sweep to solve the circular references that container objects can produce, and generational collection, which trades space for time, to make garbage collection more efficient.

Reference Counting

Reference counting keeps, inside each object, a count of the references to it from other objects. When the count drops to 0, the object is no longer referenced and can be reclaimed.

In the source code, every Python object begins with the following structure, PyObject (in effect, every object inherits from PyObject):

[object.h]
typedef struct _object {
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

ob_refcnt is the reference count.

For example, the following is the definition of an int object:

[intobject.h]
typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;
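The value held in ob_refcnt can be observed from Python with sys.getrefcount. This is a sketch assuming CPython 3; the absolute numbers include the temporary reference created by the call itself, so the example compares differences only.

```python
import sys

data = []                           # a fresh list object
before = sys.getrefcount(data)

alias = data                        # a second reference to the same object
during = sys.getrefcount(data)

del alias                           # dropping the reference decrements the count
after = sys.getrefcount(data)

print(during - before, after - before)
```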

Reference counting has obvious advantages:

Efficient
No pauses while the program runs
Objects have a well-defined lifetime
Easy to implement

Plain reference counting also has obvious drawbacks:

The cost of maintaining reference counts is proportional to the number of reference operations performed, whereas for schemes such as mark-and-sweep the cost is roughly tied to the amount of memory reclaimed.
Circular references cannot be handled. If A and B refer to each other and nothing else refers to either of them, their reference counts are both 1, even though they clearly should be reclaimed.

To address these two serious weaknesses, Python introduced the following two GC mechanisms.
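The circular-reference problem can be reproduced in a few lines. In this sketch (CPython 3 assumed; the Node class and the partner attribute are made up for illustration), automatic collection is switched off so that the effect of reference counting alone is visible, and a weak reference is used to check whether the object still exists.

```python
import gc
import weakref

class Node:
    pass

gc.disable()                        # keep the cycle collector from running on its own

a = Node()
b = Node()
a.partner = b                       # a refers to b
b.partner = a                       # b refers to a: a reference cycle

probe = weakref.ref(a)              # observes a without keeping it alive
del a, b                            # no external references remain

alive_after_del = probe() is not None       # reference counting alone cannot free it
gc.collect()                                # run the cycle collector explicitly
alive_after_collect = probe() is not None

gc.enable()
print(alive_after_del, alive_after_collect)
```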

Mark-and-Sweep

Mark-and-sweep exists to solve the circular reference problem. Container objects, which can hold references to other objects (such as list, dict, set, and even class), can produce circular references. For this reason, a PyGC_Head is placed in front of every container object when its memory is allocated, in order to implement the mark-and-sweep mechanism.

[objimpl.h]
typedef union _gc_head {
    struct {
        union _gc_head *gc_next;
        union _gc_head *gc_prev;
        Py_ssize_t gc_refs;
    } gc;
    long double dummy;  /* force worst-case alignment */
} PyGC_Head;

Consequently, when memory is allocated for such an object, the amount actually requested is increased by the size of PyGC_Head.

[gcmodule.c]
PyObject *
_PyObject_GC_Malloc(size_t basicsize)
{
    PyObject *op;
    PyGC_Head *g = (PyGC_Head *)PyObject_MALLOC(
        sizeof(PyGC_Head) + basicsize);
    if (g == NULL)
        return PyErr_NoMemory();
    ......
    op = FROM_GC(g);
    return op;
}

For example, creating a list object involves the following main logic:

[listobject.c]
PyObject *
PyList_New(Py_ssize_t size)
{
    PyListObject *op;
    ......
    op = PyObject_GC_New(PyListObject, &PyList_Type);
    ......
    _PyObject_GC_TRACK(op);
    return (PyObject *) op;
}

_PyObject_GC_TRACK links the object into the list of generation 0 objects (generational collection is described in detail later).

When garbage is marked, the reference counts of the objects in the collection are first copied, so that the real reference counts are not corrupted during the operation:

[gcmodule.c]
static void
update_refs(PyGC_Head *containers)
{
    PyGC_Head *gc = containers->gc.gc_next;
    for (; gc != containers; gc = gc->gc.gc_next) {
        assert(gc->gc.gc_refs == GC_REACHABLE);
        gc->gc.gc_refs = FROM_GC(gc)->ob_refcnt;
        assert(gc->gc.gc_refs != 0);
    }
}

The copies are then manipulated: the collection is traversed, and for every object that a container references, the copied reference count of the referenced object is decremented by 1:

[gcmodule.c]
static void
subtract_refs(PyGC_Head *containers)
{
    traverseproc traverse;
    PyGC_Head *gc = containers->gc.gc_next;
    for (; gc != containers; gc = gc->gc.gc_next) {
        traverse = FROM_GC(gc)->ob_type->tp_traverse;
        (void) traverse(FROM_GC(gc),
                        (visitproc)visit_decref,
                        NULL);
    }
}

traverse is a function defined by the object's type. It walks the objects that the container references and applies the callback passed in, here visit_decref, which adjusts the copied reference counts.

For example, dict applies the callback to every key and value:

[dictobject.c]
static int
dict_traverse(PyObject *op, visitproc visit, void *arg)
{
    Py_ssize_t i = 0;
    PyObject *pk;
    PyObject *pv;

    while (PyDict_Next(op, &i, &pk, &pv)) {
        Py_VISIT(pk);
        Py_VISIT(pv);
    }
    return 0;
}

The objects in the collection are then split into two groups, reachable and unreachable, according to whether the copied reference count is 0; the unreachable objects are the ones that can be reclaimed:

[gcmodule.c]
static void
move_unreachable(PyGC_Head *young, PyGC_Head *unreachable)
{
    PyGC_Head *gc = young->gc.gc_next;
    while (gc != young) {
        PyGC_Head *next;
        if (gc->gc.gc_refs) {
            PyObject *op = FROM_GC(gc);
            traverseproc traverse = op->ob_type->tp_traverse;
            assert(gc->gc.gc_refs > 0);
            gc->gc.gc_refs = GC_REACHABLE;
            (void) traverse(op,
                            (visitproc)visit_reachable,
                            (void *)young);
            next = gc->gc.gc_next;
        }
        else {
            next = gc->gc.gc_next;
            gc_list_move(gc, unreachable);
            gc->gc.gc_refs = GC_TENTATIVELY_UNREACHABLE;
        }
        gc = next;
    }
}

After details such as weak references and finalizers have been handled (they are not covered in this article; interested readers can consult the Python source code), the unreachable objects can be reclaimed.
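The three steps (copy the counts, subtract internal references, separate reachable from unreachable) can be modelled in plain Python. This is a simplified sketch, not the CPython algorithm itself: objects are represented by names, the function find_unreachable and its arguments are hypothetical, and reachability is computed with an explicit worklist instead of moving list nodes as move_unreachable and visit_reachable do.

```python
def find_unreachable(objects, refcounts, references):
    """objects: tracked names; refcounts: real counts; references: name -> referenced names."""
    # Step 1 (update_refs): work on a copy of the reference counts.
    gc_refs = {name: refcounts[name] for name in objects}

    # Step 2 (subtract_refs): remove references that come from tracked objects.
    for name in objects:
        for target in references.get(name, ()):
            if target in gc_refs:
                gc_refs[target] -= 1

    # Step 3 (move_unreachable): a positive copy means an external reference;
    # everything reachable from such an object is reachable as well.
    reachable = set()
    worklist = [name for name in objects if gc_refs[name] > 0]
    while worklist:
        name = worklist.pop()
        if name in reachable:
            continue
        reachable.add(name)
        worklist.extend(t for t in references.get(name, ()) if t in gc_refs)

    return [name for name in objects if name not in reachable]

# a and b only reference each other; c is held by a variable and references d.
objects = ["a", "b", "c", "d"]
refcounts = {"a": 1, "b": 1, "c": 1, "d": 1}
references = {"a": ["b"], "b": ["a"], "c": ["d"], "d": []}
garbage = find_unreachable(objects, refcounts, references)
print(garbage)
```

Note that d ends with a copied count of 0 but is still kept, because it is reachable from c; this is why the last step has to traverse from the externally referenced objects instead of simply testing for 0.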

Generational Collection

The idea behind generational collection is to divide all memory blocks in the system into sets according to how long they have survived, each set being a "generation". The frequency of garbage collection decreases as the age of a generation increases, and age is usually measured by the number of garbage collections survived.

The struct used to represent a generation is gc_generation, which contains the list head of the generation, the threshold (maximum number of objects), and the current count:

[gcmodule.c]
struct gc_generation {
    PyGC_Head head;
    int threshold; /* collection threshold */
    int count; /* count of allocations or collections of younger
                  generations */
};

By default Python defines three generations of objects; the larger the index, the longer the objects in it have survived.

#define NUM_GENERATIONS 3
#define GEN_HEAD(n) (&generations[n].head)

/* linked lists of container objects */
static struct gc_generation generations[NUM_GENERATIONS] = {
    /* PyGC_Head,                       threshold,  count */
    {{{GEN_HEAD(0), GEN_HEAD(0), 0}},   700,        0},
    {{{GEN_HEAD(1), GEN_HEAD(1), 0}},   10,         0},
    {{{GEN_HEAD(2), GEN_HEAD(2), 0}},   10,         0},
};

Newly created objects are added to generation 0. The part omitted from _PyObject_GC_Malloc earlier is where the Python GC is triggered: each time a new object is allocated, Python checks whether generation 0 is full and, if so, starts a garbage collection:

g->gc.gc_refs = GC_UNTRACKED;
generations[0].count++; /* number of allocated GC objects */
if (generations[0].count > generations[0].threshold &&
    enabled &&
    generations[0].threshold &&
    !collecting &&
    !PyErr_Occurred()) {
    collecting = 1;
    collect_generations();
    collecting = 0;
}
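The thresholds and counters of the generations are exposed through the gc module. The sketch below assumes CPython 3; on interpreters that match the source quoted above the thresholds are 700, 10, and 10, but the values may differ in other versions, so the example only reads them.

```python
import gc

thresholds = gc.get_threshold()    # one threshold per generation
counts = gc.get_count()            # current counter of each generation

print(thresholds, counts)
```

gc.set_threshold can be used to tune these values, and gc.collect() forces a collection regardless of the counters.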

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
