Python Garbage Collection Mechanism -- A Perfect Explanation!

Let's start with an overview. The explanation in the second part is especially good.

Garbage collection (GC)
Modern high-level languages such as Java and C# have adopted garbage collection, unlike C and C++, where users manage and maintain memory themselves. Managing memory yourself gives you enormous freedom, since you can request memory whenever you like, but like a double-edged sword it also opens the door to a large number of memory leaks and dangling pointers.
In Python, a string, a list, a class, and even a number are all objects, and since the language is positioned as simple and easy to use, it naturally does not make users deal with how memory is allocated and reclaimed.
Like Java, Python manages memory for you with garbage collection, but the implementation differs: Python uses reference counting as its main strategy, with mark-and-sweep and generational collection as supplementary strategies.

Reference counting mechanism:
Everything in Python is an object, and at the core of every object is a C structure, PyObject:

typedef struct _object {
    int ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;
Every object carries a PyObject header, and ob_refcnt serves as its reference count. When a new reference to an object is created, its ob_refcnt increases; when a reference to it is deleted, its ob_refcnt decreases.

#define Py_INCREF(op)   ((op)->ob_refcnt++)        /* increment the count */
#define Py_DECREF(op)   /* decrement the count */  \
    if (--(op)->ob_refcnt != 0)                    \
        ;                                          \
    else                                           \
        _Py_Dealloc((PyObject *)(op))
 

When the reference count is 0, the object's life is over.
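To make the counting visible from Python itself, here is a small illustrative session (not from the original article) using sys.getrefcount; note that getrefcount temporarily adds one reference of its own to its argument:

import sys

value = []                     # a brand-new list object
print(sys.getrefcount(value))  # 2: the variable plus getrefcount's temporary argument

alias = value                  # a second reference to the same object
print(sys.getrefcount(value))  # 3

del alias                      # drop one reference again
print(sys.getrefcount(value))  # back to 2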

Advantages of the reference counting mechanism:

Simple
Real-time: as soon as an object has no references, its memory is released immediately, with no need to wait for a particular moment as other mechanisms do. This also means the cost of reclaiming memory is spread out over normal execution.
Disadvantages of the reference counting mechanism:

Maintaining reference counts consumes resources
Circular reference
list1 = []
list2 = []
list1.append(list2)
list2.append(list1)
 

list1 and list2 refer to each other. Even if no other objects refer to them and both names are deleted, their reference counts remain 1, so the memory they occupy can never be reclaimed by reference counting alone, which can be fatal.
For today's powerful hardware, disadvantage 1 is acceptable, but circular references cause memory leaks, which is why Python had to introduce additional collection mechanisms: mark-and-sweep and generational collection.
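To see the supplementary collector rescue exactly this kind of cycle, here is a small illustrative snippet (not part of the original article) that uses the standard gc module:

import gc

list1 = []
list2 = []
list1.append(list2)
list2.append(list1)

del list1
del list2            # both reference counts are still 1, so nothing is freed yet

print(gc.collect())  # the cycle detector finds and frees the two unreachable lists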
Reprinted address: http://my.oschina.net/hebianxizao/blog/57367?fromerr=KJozamtm

Visualizing Ruby and Python garbage collection
English original: Visualizing Garbage Collection in Ruby and Python
Chinese translation: Ruby and Python Garbage Collection

This article is based on a talk I gave at RuPy in Budapest. I figured writing it up while it is still fresh makes more sense than leaving it in the slides, and you can also watch the video of the talk. By the way, I will give a similar talk at RubyConf, but instead of Python I will compare the garbage collection mechanisms of MRI, JRuby, and Rubinius.
For a more detailed explanation of Ruby's garbage collection mechanism and its internal implementation, keep an eye out for the upcoming book Ruby Under a Microscope.


If you compare algorithms and business logic to the brain of an application, which organ does garbage collection correspond to?
Since this is the "Ruby Python" conference, I thought it would be fun to compare the garbage collection mechanisms of Ruby and Python. But first, why talk about garbage collection at all? It is not exactly the most glamorous, exciting topic, is it? How many of you are interested in garbage collection? (Some RuPy attendees raised their hands!)

A recent blog post from the Ruby community discussed how to speed up unit tests by changing Ruby's GC settings. I think it is an excellent article, helpful for anyone who wants unit tests to run faster and programs to pause less for GC, but the settings themselves never caught my interest. At first glance, GC seems like a sleepy, dry technical subject.

But garbage collection is actually a fascinating subject: GC algorithms are not only an important part of the history of computer science, they are also an area of cutting-edge research. For example, the mark-and-sweep algorithm used by MRI Ruby is more than fifty years old, while the GC algorithm used by Rubinius, an alternative Ruby implementation, was invented only in 2008.

However, the word "garbage collection" is actually a misnomer.

The beating heart of the application
GC systems do far more than "collect garbage." In fact, they are responsible for three important tasks. They:

Allocate memory for newly generated objects
Identify those junk objects, and
Reclaim memory from garbage objects.
If you compare the application to the human body, then all the elegant code, business logic, and algorithms you write would be the brain. By analogy, which organ would the garbage collection mechanism be? (I heard plenty of interesting answers from the RuPy audience: the kidneys, white blood cells :))

I think garbage collection is the beating heart of the application. Just as the heart supplies blood and nutrients to the rest of the body, the garbage collector supplies memory and objects to your application. If the heart stops, a person is done for within seconds. If the garbage collector stops, or runs slowly like a clogged artery, your application slows down and eventually dies.

A simple example
Examples are always helpful for understanding theory. Here is a simple class, written in both Python and Ruby, that we will use as today's example:
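The code itself is not reproduced here, but based on the description the Python version is roughly the following minimal sketch (the Ruby version is the same class written in Ruby syntax):

class Node:
    def __init__(self, val):
        self.value = val   # store a single attribute in an instance variable

n1 = Node("ABC")           # create two Node instances,
n2 = Node("DEF")           # as in the diagrams that follow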


By the way, it is remarkable how similar the code is in the two languages: Ruby and Python differ only slightly in how they express the same thing. But are the internal implementations of the two languages equally similar?

Free list
What does Ruby do when we execute Node.new(1) above? How does Ruby create a new object for us?
Surprisingly, it does very little. In fact, long before your code starts executing, Ruby creates thousands of objects in advance and strings them together on a linked list called the free list. A conceptual diagram of the free list:


Imagine that each white square is labeled "unused pre-created object." When we call Node.new, Ruby simply takes one of these pre-created objects and hands it to us:


The gray square on the left represents the object our code is currently using, while the white squares are unused objects. (Note: my diagram is of course a simplification. In reality, Ruby would use another object to hold the string "ABC," another to hold the Node class definition, and yet another to hold the abstract syntax tree produced when the code was parsed, and so on.)

If we call Node.new again, Ruby will pass us another object:


This simple algorithm, pre-allocating objects onto a linked list, was invented more than 50 years ago by the well-known computer scientist John McCarthy as part of the original implementation of Lisp. Lisp was not only the earliest functional programming language; it also introduced many innovations to computer science, one of which was the idea of using a garbage collector to automate a program's memory management.

The standard version of Ruby, also known as "Matz's Ruby Interpreter" (MRI), uses a GC algorithm similar to McCarthy's implementation in 1960. For better or worse, Ruby's garbage collection mechanism is 53 years old. Like Lisp, Ruby creates some objects in advance, and then makes them available to you when you allocate new objects or variables.

Object allocation in Python
We have seen that Ruby pre-creates objects and keeps them on the free list. What about Python?

Although Python also uses free lists for some purposes (for example, to recycle certain objects such as lists), it differs from Ruby in how it allocates memory for new objects and variables.

For example, suppose we use Python to create a Node object:


Unlike Ruby, Python immediately requests memory from the operating system when an object is created. (Python actually implements a memory allocation system of its own, providing an abstraction layer on top of the operating system heap. But I won't expand on that today.)

When we create the second object, we again request memory from the OS:


It seems simple enough: whenever we create an object, Python takes the time to find and allocate memory for us.

Ruby developers live in messy rooms
Ruby leaves useless objects in memory until the next GC execution

Back to Ruby. As we create more and more objects, Ruby keeps taking pre-created objects from the free list for us, and the free list gradually gets shorter:


... and then shorter:


Note that I keep assigning new values to the variable n1, and Ruby leaves the old values behind. The Node instances "ABC," "JKL," and "MNO" are still stuck in memory. Ruby does not immediately clean up old objects that the code no longer uses! Ruby developers live, so to speak, in a messy room with clothes on the floor and dirty dishes in the sink: as a Ruby programmer, unwanted garbage objects surround you everywhere.

Python developers live in a tidy household
Python cleans up garbage objects as soon as they are no longer used

Python and Ruby have quite different garbage collection mechanisms. Let's go back to the three Python Node objects mentioned earlier:


Internally, whenever an object is created, Python stores an integer in the object's C structure called the reference count. Initially, Python sets this value to 1:


A count of 1 means there is one pointer to, or reference to, each of these three objects. Now suppose we create a new Node instance, JKL:


As before, Python sets JKL's reference count to 1. However, note that because we changed n1 to point to JKL and away from ABC, Python decremented ABC's reference count to 0.
At this moment the Python garbage collector springs into action! Whenever an object's reference count drops to 0, Python frees it immediately and returns its memory to the operating system:


Here Python reclaims the memory used by the ABC Node instance. Remember, Ruby simply abandons old objects without releasing their memory.
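A small, hypothetical demonstration (not from the original article) that CPython really does reclaim an object the instant its count reaches zero; the __del__ hook below fires as soon as n1 is rebound:

class Node:
    def __init__(self, val):
        self.value = val

    def __del__(self):
        # runs the moment the reference count drops to zero
        print("reclaiming", self.value)

n1 = Node("ABC")
n1 = Node("JKL")        # prints "reclaiming ABC" immediately, before the next line
print("still running")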

This garbage collection algorithm in Python is called reference counting. It was invented by George Collins in 1960, coincidentally the same year John McCarthy invented the free list algorithm. As Mike Bernstein put it in his excellent talk on garbage collection at the Gotham Ruby Conference in June: "1960 was the golden age of garbage collectors ..."

Python developers live in a tidy home, sharing it with a roommate who has a mild case of OCD (obsessive-compulsive disorder) and constantly cleans up behind you. As soon as you put down a dirty plate or cup, someone is already standing by, ready to put it in the dishwasher!

Now look at a second example. Suppose we make n2 refer to the same object as n1:


Python has decremented DEF's reference count on the left, and the garbage collector immediately reclaims the DEF instance. Meanwhile, JKL's reference count has become 2, because both n1 and n2 now point to it.

Mark-and-sweep
Eventually the messy room fills up with garbage and life cannot go on undisturbed. After the Ruby program has run for a while, the free list is eventually exhausted:


At this point all of Ruby's pre-created objects have been handed out to the program (they are all gray) and the free list is empty (there are no white squares left).

At this point Ruby employs another algorithm invented by McCarthy, called mark-and-sweep. First, Ruby stops the program; Ruby uses "stop-the-world" garbage collection. Ruby then walks over all the pointers, variables, and other values in the code that reference objects, and also traverses the internal pointers used by its own virtual machine, marking each object these pointers reach. I indicate the marks with an M in the figure:


The three objects marked with M are the ones the program is still using. Internally, Ruby actually uses a sequence of bits, known as the free bitmap, to track which objects are marked. (Translator's note: recall the bitmap sort from Programming Pearls, which compresses a bounded, sparse set of integers very effectively to save machine resources.)


Ruby stores this free bitmap in a separate memory region to take full advantage of Unix's copy-on-write feature. For more on this, see my other blog post, "Why You Should Be Excited About Garbage Collection in Ruby 2.0."

If the marked objects are alive, the remaining unmarked objects can only be garbage, meaning our code no longer uses them. I will show the garbage objects as white squares below:


Ruby then sweeps these useless garbage objects back onto the free list:


Internally this happens very quickly, because Ruby does not actually copy objects around. Instead, the garbage objects are returned to the free list simply by adjusting internal pointers to link them into a new list.
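To summarize the algorithm just described, here is a hedged sketch written as Python-style pseudocode; MRI implements this in C with a mark bitmap, and the names used here (references, roots, heap) are invented for illustration:

def mark(obj, marked):
    # recursively mark every object reachable from obj
    if obj in marked:
        return
    marked.add(obj)
    for child in obj.references:      # hypothetical: the objects obj points to
        mark(child, marked)

def mark_and_sweep(roots, heap, free_list):
    # roots: globals, stack variables, and the VM's internal pointers
    marked = set()
    for root in roots:                # 1. stop the world and mark from the roots
        mark(root, marked)
    for obj in heap:                  # 2. sweep: anything unmarked is garbage
        if obj not in marked:
            free_list.append(obj)     # relinked onto the free list, never copied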

Now, the next time the program creates an object, Ruby can hand us one of these recycled objects. In Ruby, objects are reincarnated and enjoy many lives.

Mark-and-sweep vs. reference counting
At first glance, Python's GC algorithm seems far superior to Ruby's: why live in squalor when you can keep the place clean? Why would Ruby force a program to stop running periodically rather than use Python's algorithm?

However, reference counting is not as simple as it first appears. There are several reasons why most languages do not use a reference counting GC algorithm like Python's:

First, it is not easy to implement. Python has to reserve space inside every object to hold the reference count, which costs a little memory. Worse, every simple operation (such as modifying a variable or reference) becomes a more complicated one, because Python must increment one count, decrement another, and possibly free an object.

Second, it can be slow. Although Python performs GC steadily as the program runs (washing each dirty dish the moment it lands in the sink), this is not necessarily faster. Python constantly updates reference counts, and when you stop using a large data structure, such as a list with many elements, Python may have to free a great number of objects at once; decrementing all those counts becomes a long, recursive, cascading process.

Finally, it does not always work. In my next post, which contains the notes from the rest of my talk, we will see that reference counting cannot handle circular data structures, that is, data structures that contain circular references.

More next time
I will cover the rest of the talk next week: how Python handles cyclic data structures, and how GC works in the upcoming Ruby 2.1 release.

Comparing Ruby and Python garbage collection (2)
English original address: Generational GC in Python and Ruby
Article in Chinese: Comparing Ruby and Python Garbage Collection (2): Generational Garbage Collection

Last week I wrote the first half of this article, based on a talk I gave called "Visualizing Garbage Collection in Ruby and Python." In it I explained how standard Ruby (also known as Matz's Ruby Interpreter, or MRI) uses a garbage collection algorithm called mark-and-sweep, based on the algorithm originally developed for Lisp in 1960. I also described how Python uses a different, equally 53-year-old GC algorithm called reference counting.

It turns out that, in addition to reference counting, Python uses another algorithm called generational garbage collection. This means that Python's garbage collector treats newly created objects differently from older ones. The upcoming MRI Ruby 2.1 also introduces generational garbage collection for the first time (the other two Ruby implementations, JRuby and Rubinius, have used this mechanism for years; I will talk about how it works in those implementations at RubyConf next week).

Of course, the phrase "treat new objects differently from old ones" is a bit vague: what counts as a new or old object, and what exactly do Ruby and Python do differently? Today we will look at how the GC mechanisms of these two languages operate and answer those questions. But before we get to generational GC, let's take a moment to discuss a serious theoretical problem with Python's reference counting algorithm.

Cyclic data structures and reference counting in Python
From the previous article, we know that in Python, each object holds an integer value called a reference count to track how many references point to this object. Whenever a variable or other object in our program references the target object, Python will increase the count value, and when the program stops using this object, Python will decrease the count value. Once the count is reduced to zero, Python will release the object and reclaim the associated memory space.

Since the 1960s, computer scientists have been aware of a serious theoretical problem with reference counting: if a data structure refers to itself, that is, if it is a cyclic data structure, some reference counts will never reach zero. To understand the problem better, let's take an example. The following code shows the Node class we used last week:
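That code is not reproduced here, but from the description it is the same Node class as before, roughly:

class Node:
    def __init__(self, val):
        self.value = val

n1 = Node("ABC")   # the ABC node's reference count starts at 1
n2 = Node("DEF")   # the DEF node's reference count starts at 1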


We have a constructor (called __init__ in Python) that stores a single attribute in an instance variable. After the class definition we create two nodes, ABC and DEF, shown as the rectangles on the left in the figure. The reference count of each node is initialized to 1, because one reference points to each (n1 and n2, respectively).

Now, let's define two additional attributes in the node, next and prev:
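The missing snippet presumably looks like this reconstruction:

n1.next = n2   # ABC now points forward to DEF
n2.prev = n1   # DEF points back to ABC, completing the reference cycle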


Unlike Ruby, Python lets you dynamically define instance variables or object attributes while your code is running; it feels a bit like magic that Ruby is missing. (I am not really a Python programmer, so some of my terminology may be off.) We set n1.next to n2 and n2.prev to n1. The two nodes now form a doubly linked list using circular references. Also note that the reference counts of ABC and DEF have increased to 2: there are two pointers to each node, first n1 and n2, and now next and prev as well.

Now, assuming our program does not use these two nodes anymore, we set both n1 and n2 to null (None in Python).

Well, Python will reduce the reference count of each node to 1 as usual.

Generation Zero in Python
Note that in this example we have ended up with a somewhat unusual situation: an "island" of unused objects that point to each other but have no external references. In other words, our program no longer uses these node objects, so we would hope that Python's garbage collector is smart enough to free them and reclaim the memory they occupy. But that cannot happen, because both reference counts are 1, not 0. Python's reference counting algorithm cannot handle objects that refer to one another.

Of course, that was a contrived example, but your code can contain circular references without you realizing it. In practice, as your Python program runs it accumulates a certain amount of "floating garbage": unused objects that reference counting cannot free because their counts never reach zero.

That's why Python introduced its generational GC algorithm! Just as Ruby uses a free list to track unused, free objects, Python uses a different kind of list to track active objects. Rather than calling it an "active list," Python's internal C code calls it generation zero. Every time you create an object or other value, Python adds it to the generation zero list:


From the figure you can see that when we create the ABC node, Python adds it to the generation zero list. Note that this is not a list you can access directly from your code; it lives entirely inside the Python runtime.
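Although the list itself is internal, the standard gc module does expose its bookkeeping counters; a small illustrative snippet (the exact numbers vary from run to run):

import gc

class Node:
    def __init__(self, val):
        self.value = val

print(gc.get_count())                  # e.g. (53, 2, 1): counters for generations 0, 1, 2
nodes = [Node(i) for i in range(100)]  # new container objects land in generation zero
print(gc.get_count())                  # the first counter has grown by roughly 100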
Similarly, when we create a DEF node, Python adds it to the same linked list:


Generation zero now contains both node objects. (It also contains every other value the program has created, along with some values Python itself uses internally.)

Detect circular references
Python then loops over every object in the generation zero list and, for each object it references within the same list, decrements a working copy of that object's reference count. In this way Python accounts for the internal references one by one, without freeing anything prematurely.

For easy understanding, let's look at an example:


From the figure you can see that the ABC and DEF nodes have reference counts of 1. Three other objects are in the generation zero list at the same time; the blue arrows indicate that some of them are referenced from outside the generation zero list, from objects that, as we will see shortly, belong to Python's two other lists, generation one and generation two. These objects have higher reference counts because other pointers point to them.

Next you will see how Python's GC handles the generation zero list.


By accounting for these internal references, Python can reduce the reference counts of many generation zero objects. In the first row you can see that the counts for ABC and DEF have reached zero, which means the collector can free them and reclaim their memory. The remaining live objects are then moved to a new list: the generation one list.
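The whole procedure can be modeled with a short, self-contained toy (invented names and a simplified object-graph representation, not CPython's real data structures):

def collect_generation_zero(objects, external_refs, edges):
    # objects: ids in generation zero; external_refs: references coming from outside
    # the generation; edges: which other generation-zero objects each object references
    gc_refs = {o: external_refs[o] + sum(o in edges[src] for src in objects)
               for o in objects}                     # start from the full reference counts
    for src in objects:                              # subtract every internal reference
        for dst in edges[src]:
            gc_refs[dst] -= 1
    alive = {o for o in objects if gc_refs[o] > 0}   # still referenced from outside
    frontier = list(alive)
    while frontier:                                  # anything they reach survives too
        for dst in edges[frontier.pop()]:
            if dst not in alive:
                alive.add(dst)
                frontier.append(dst)
    return [o for o in objects if o not in alive]    # cyclic garbage; the survivors would
                                                     # be promoted to generation one

edges = {"ABC": ["DEF"], "DEF": ["ABC"], "XYZ": []}
print(collect_generation_zero(["ABC", "DEF", "XYZ"],
                              {"ABC": 0, "DEF": 0, "XYZ": 1}, edges))  # -> ['ABC', 'DEF']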

In a sense, Python's GC algorithm resembles Ruby's mark-and-sweep collector: it periodically traces references from one object to another to determine which objects are still live and in use by the program, much like Ruby's marking phase.

GC thresholds in Python
When does Python run this marking process? As your program executes, the Python interpreter keeps track of how many objects are newly created and how many are released because their reference counts reached zero. In theory these two numbers should balance out, since every object the program creates is eventually freed.

In practice that is not the case. Because of circular references, and because some objects live longer than others, the gap between the number of allocations and the number of deallocations slowly grows. Once this difference exceeds a certain threshold, Python's collector kicks in, running the generation zero algorithm described above, freeing the "floating garbage," and moving the surviving objects to the generation one list.

Over time, objects that the program keeps using migrate from the generation zero list to the generation one list. Python applies the same approach to the generation one list: once the relevant counter exceeds its threshold, Python collects that generation and moves the remaining live objects to the generation two list.

In this way, the long-lived objects that your code keeps accessing move from generation zero to generation one and then to generation two. With different threshold settings, Python processes each generation at a different frequency: generation zero most often, then generation one, then generation two.
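These thresholds are exposed through the standard gc module; a small hedged illustration (the defaults shown are those of recent CPython releases and may differ between versions):

import gc

print(gc.get_threshold())   # typically (700, 10, 10)
# 700: collect generation 0 once allocations minus deallocations exceed 700
#  10: collect generation 1 after every 10 generation-0 collections
#  10: collect generation 2 after every 10 generation-1 collections

gc.set_threshold(1000, 15, 15)   # tune how often each generation is processed
print(gc.get_count())            # the current counters for the three generations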

The weak generational hypothesis
Notice the core behavior of the generational garbage collection algorithm: the collector processes new objects more frequently. A new object is one your program has just created; an old object is one that has survived several collection cycles. An object is promoted as it moves from generation zero to generation one, or from generation one to generation two.

Why do this? The idea is rooted in the weak generational hypothesis, which rests on two observations: most newly created objects die young, while older objects are likely to stay around for a long time.

Suppose now that I create a new object in Python or Ruby:


According to the hypothesis, my code is likely to use ABC for only a short time. The object may be just an intermediate result inside a method and will become garbage as soon as the method returns. Most new objects become garbage this quickly. Occasionally, however, a program creates a few important, long-lived objects, such as session variables or configuration settings in a web application.

By frequently processing the new objects in generation zero, Python's garbage collector spends its time where it matters most: on the new objects that are likely to become garbage quickly. Only rarely, when the thresholds are reached, does the collector go back and process the older objects.

Back to Ruby's free list
The upcoming Ruby 2.1 will use a generational garbage collection algorithm for the first time! (Note that other Ruby implementations, such as JRuby and Rubinius, have used this kind of algorithm for years.) Let's go back to the free list diagram from the previous post and see how it works.

Remember that when the free list is exhausted, Ruby marks the objects your program is still using.


From this picture we can see that there are three live objects, because the pointers n1, n2, and n3 still point to them. The remaining objects, drawn as white rectangles, are garbage. (Of course, reality is much more complicated: the free list may contain thousands of objects with complex webs of references. The diagram is only meant to illustrate the basic principle behind Ruby's GC mechanism without getting into the details.)

As we said, Ruby moves the garbage objects back onto the free list, so they can be reused the next time the program allocates new objects.

Ruby 2.1 generation-based GC mechanism
Starting from version 2.1, Ruby's GC code adds some additional steps: it promotes the remaining active objects into the mature generation. (The word old is used instead of mature in the C source code of MRI.) The following figure shows a conceptual diagram of two Ruby 2.1 object generations:


On the left is a scene much like the free list diagrams above: the garbage objects are shown in white, and the rest are gray, live objects that have just been marked.

Once the "marker clearing" process is over, Ruby 2.1 moves the remaining marker objects to the maturity zone:


Unlike Python's three generations, Ruby 2.1 uses only two: on the left are the new, young objects, and on the right are the old objects of the mature generation. Once Ruby 2.1 has marked an object once, it is considered mature; Ruby bets that objects which survive a collection are unlikely to become garbage any time soon.

An important note: Ruby 2.1 does not actually copy objects in memory; the generations shown here are not separate physical memory regions. (Some GC implementations in other languages, and some other Ruby implementations, do copy objects when they are promoted.) Instead, Ruby 2.1 simply excludes previously marked objects from the mark-and-sweep process: once an object has been marked once, it is not included in subsequent mark-and-sweep passes.

Now suppose your Ruby program keeps running and creating more new, young objects. GC then runs over the new generation again:


Like Python, Ruby's garbage collector focuses most of its effort on the new generation. Only the new, young objects created since the last GC run are included in the next mark-and-sweep pass, because many new objects are likely to become garbage (shown in white) almost immediately. Ruby does not re-mark the mature objects on the right: having already survived one GC pass, they are unlikely to become garbage for quite a while. Because only new objects need to be marked, Ruby's GC runs faster; it skips mature objects entirely, reducing the time your code spends waiting for GC to finish.

Occasionally Ruby runs a "full collection," marking and sweeping again, this time including all mature objects. Ruby decides when to do this by monitoring the number of mature objects: when that number has doubled since the previous full collection, Ruby clears all the marks and treats every object as new again.

Write barriers
One important challenge of this algorithm is worth explaining in depth: suppose your code creates a new, young object and adds it as a child of an existing mature object, for example by appending a new value to an array that has been around for a long time:


Look at the diagram: the new objects are on the left and the mature objects are on the right. The marking pass on the left has identified five new objects that are still live (gray), while two others have become garbage (white). But what about the new object in the middle, the one from the scenario just described? Is it garbage or live?

It is of course live, because a mature object on the right holds a reference to it. But we said earlier that already-marked mature objects are not included in the mark-and-sweep pass (until the next full collection). That means a newly created object like this one would be mistakenly treated as garbage and freed, resulting in data loss.

Ruby 2.1 overcomes this problem by monitoring mature objects to see whether your code adds a reference from one of them to a newly created object. It uses a classic GC technique called a write barrier to monitor changes to mature objects: whenever you add a reference from one object to another (whether by creating or modifying an object), the write barrier is triggered. The barrier checks whether the source object is mature, and if so adds it to a special list. Later, Ruby 2.1 includes these flagged mature objects in the next mark-and-sweep pass, preventing newly created objects from being wrongly marked as garbage and swept away.
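Conceptually, the combination of a write barrier and a "remembered set" of flagged mature objects can be modeled with a small toy in Python (invented names and structure, not MRI's actual C implementation):

class ToyGenerationalHeap:
    def __init__(self):
        self.young = set()       # newly created objects (represented by name)
        self.mature = set()      # objects that survived an earlier mark pass
        self.edges = {}          # object -> set of objects it references
        self.remembered = set()  # mature objects that gained a pointer to a young object

    def new_object(self, name):
        self.young.add(name)
        self.edges[name] = set()
        return name

    def write(self, source, target):
        # every pointer store goes through the write barrier
        self.edges[source].add(target)
        if source in self.mature and target in self.young:
            self.remembered.add(source)   # re-check this mature object at the next minor GC

    def minor_gc(self, roots):
        # mark-and-sweep over the young generation only; mature objects are skipped,
        # but the remembered set supplies the roots that keep their young children alive
        stack = [r for r in roots if r in self.young]
        stack += [c for m in self.remembered for c in self.edges[m] if c in self.young]
        marked = set()
        while stack:
            obj = stack.pop()
            if obj in self.young and obj not in marked:
                marked.add(obj)
                stack.extend(self.edges[obj])
        garbage = self.young - marked     # unmarked young objects would be freed here
        self.mature |= marked             # survivors are promoted to the mature generation
        self.young = set()
        self.remembered = set()
        return garbage

heap = ToyGenerationalHeap()
heap.new_object("old-array")
heap.minor_gc(roots={"old-array"})    # "old-array" survives one pass and becomes mature
heap.new_object("new-value")
heap.write("old-array", "new-value")  # barrier fires: "old-array" joins the remembered set
print(heap.minor_gc(roots=set()))     # set(): "new-value" is kept alive, not swept by mistake

Without the write barrier, that second minor collection would see no roots pointing at "new-value" and would sweep it, even though a live mature object still references it.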

Implementing write barriers in Ruby 2.1 was quite complicated, mainly because existing C extensions do not contain them. Koichi Sasada and the Ruby core team used a clever solution to work around this problem; if you want to know more, see Koichi's fascinating presentation at EuRuKo 2013.

Standing on the shoulders of giants
At first glance the GC implementations of Ruby and Python seem very different: Ruby uses John McCarthy's original mark-and-sweep algorithm, while Python uses reference counting. Look closer, though, and you see that Python borrows the idea of mark-and-sweep to handle circular references, and that both languages use generational garbage collection in a similar way, Python with three generations and Ruby with two.

This similarity should not come as a surprise. Both languages draw on computer science research carried out decades ago, long before either language took shape. I find it striking that when you look beneath the surface of different programming languages, you always find similar fundamental ideas and algorithms. Modern languages owe a great deal to the groundbreaking research done by computer science pioneers like John McCarthy in the 1960s and 1970s.
