Exploring why sorting an array makes a Python loop faster

This morning I happened to see a blog post about two Python scripts, one of which was more efficient than the other. The post has since been deleted, so I can't link to it, but the scripts basically boil down to the following:
fast.py

import time

a = [i for i in range(1000000)]

sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2 - t1

slow.py

import time
from random import shuffle

a = [i for i in range(1000000)]
shuffle(a)

sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2 - t1

As you can see, the two scripts do almost exactly the same thing. Both produce a list containing the first million integers and print the time it takes to sum them. The only difference is that slow.py first shuffles the integers into a random order. Strange as it may seem, this randomization alone is enough to make the program noticeably slower. On my machine, running Python 2.7.3, fast.py is consistently about a tenth of a second faster than slow.py (fast.py takes roughly three-quarters of a second, so this is a non-trivial slowdown). Try it yourself. (I didn't test on Python 3, but the result shouldn't be very different.)

So why does randomizing the list's elements cause such a noticeable slowdown? The original author of the post attributed it to "branch prediction". If you are unfamiliar with the term, have a look at the StackOverflow question on it, which explains the concept well. (My guess is that the original author had run into this or a similar problem and applied the idea to a Python snippet where it doesn't really apply.)

I doubt, however, that branch prediction is the real cause of the problem. There is no top-level conditional branch in this Python code, and it is reasonable to expect both scripts to take exactly the same branches inside the loop body. No part of the program conditions on the integers themselves, and the elements of each list do not depend on the data in any way. I'm also not sure Python is "low-level" enough for CPU-level branch prediction to be a factor in the performance profile of a Python script. Python is, after all, a high-level language.

So if branch prediction isn't the cause, why is slow.py so slow? After a bit of research, and a couple of false starts, I think I've found the answer. The answer requires a little familiarity with the internals of the Python virtual machine.

False start: lists vs. generators

My first thought was that Python would handle the sorted list [i for i in range(1000000)] more efficiently than the shuffled one. In other words, the list could be replaced with the following generator:

def numbers():
    i = 0
    while i < 1000000:
        yield i
        i += 1

I thought this might be more efficient in terms of time. After all, if Python internally used a generator instead of a real list, it could avoid the trouble of keeping all the integers in memory at once, which would save a lot of overhead. The shuffled list in slow.py cannot be captured by a simple generator, so the VM would not be able to make such an optimization.
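One way to check this hypothesis yourself (my own sketch, not from the original post, assuming the same Python 2 setup as above) is to time the identical summation loop over the generator instead of the list:

import time

def numbers():
    i = 0
    while i < 1000000:
        yield i
        i += 1

t1 = time.time()
sum = 0
for i in numbers():
    sum = sum + i
t2 = time.time()
print t2 - t1

Whatever this prints, it can't explain the gap between fast.py and slow.py, as the next paragraph shows.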

However, this turned out to be a dead end. If you insert a.sort() between shuffle() and the loop in slow.py, the program becomes as fast as fast.py. Clearly, something about having the numbers sorted makes the program faster.
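For concreteness, here is the modification being described, a sketch of slow.py with a.sort() inserted between shuffle() and the timed loop:

import time
from random import shuffle

a = [i for i in range(1000000)]
shuffle(a)
a.sort()    # re-sorting before the loop restores fast.py's speed

sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2 - t1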

False start: lists vs. arrays

My second idea was that the data structure might be causing caching problems. a is a list, which naturally led me to believe it was implemented as a linked list. If the shuffle operation effectively randomized the nodes of that linked list, then fast.py might be able to allocate all of its linked-list nodes at adjacent addresses and enjoy good cache locality, while slow.py would suffer many cache misses, because each node would point to another node that is not on the same cache line.

Unfortunately, this isn't true either. Python's list objects are not linked lists but genuine arrays. In particular, a Python list object is defined by this C struct:

typedef struct {
    PyObject_VAR_HEAD
    PyObject **ob_item;
    Py_ssize_t allocated;
} PyListObject;

In other words, ob_item is a pointer to an array of PyObject pointers, and allocated is the number of slots we have allocated for that array. So this doesn't help solve the problem either (although it was somewhat reassuring, since I had been uneasy about the algorithmic complexity of Python's list operations: appending to a list is amortized O(1), accessing an arbitrary element is O(1), and so on). It mostly just made me wonder why Guido chose to call them "lists" rather than "arrays", when they are actually arrays.
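You can watch the allocated field at work from Python itself. The sketch below is mine, not the original author's, and the exact numbers vary by build; it relies on the fact that sys.getsizeof for a list counts the PyListObject header plus the ob_item pointer array, so it only jumps when the list over-allocates a larger buffer:

import sys

a = []
last = sys.getsizeof(a)
print 0, last
for i in range(20):
    a.append(i)
    size = sys.getsizeof(a)
    if size != last:    # the ob_item buffer was reallocated
        print len(a), size
        last = size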

The answer: integer objects

Array elements are contiguous in memory, so that data structure would not introduce caching problems. It turns out that cache locality is indeed the cause of slow.py's slowdown, but it comes from an unexpected place. In Python, integers are objects allocated on the heap, not simple values. In particular, inside the virtual machine an integer object looks like this:

typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;

The only "interesting" element in the structure above is ob_ival (similar to an integer in C). If you think that using a complete heap object to implement integers is wasteful, you might be right. Many languages are optimized to avoid this. For example, the Ruby interpreter for Matz typically stores objects as pointers, but exceptions are made to frequently used pointers. Simply put, the Ruby interpreter plugs the fixed-length number into the same space as the object and marks it with the least significant bit as an integer instead of a pointer (in all modern systems, malloc always returns a memory address aligned in multiples of 2). At that point, you just need to get the value of the integer by the appropriate displacement-no heap location or redirection is required. If CPython do similar optimizations, slow.py and fast.py will have the same speed (and they may all be faster).

So how does CPython handle integers? What does the interpreter do that produces this puzzling behavior? The Python interpreter allocates integers in blocks, roughly 40 at a time. When Python needs a new integer object, it takes the next free slot in the current integer "block" and stores the integer there. Our code allocates a million integers into the array, and most adjacent integers end up in adjacent memory. As a result, traversing the million numbers in sorted order exhibits good cache locality, while traversing the first million numbers in random order incurs frequent cache misses.
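A rough way to observe the block allocator (my own sketch; CPython-specific, since id() returning an object's memory address is an implementation detail) is to look at the addresses of freshly created integers outside the small-int cache. Consecutive allocations usually land a constant stride apart, except at block boundaries or when the free list has recycled older slots:

a = [i for i in range(1000, 1010)]    # values above the small-int cache
print [id(a[i + 1]) - id(a[i]) for i in range(len(a) - 1)]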

Therefore, the answer to "why does sorting the array make the code faster" is that it doesn't, not by itself. Traversing the unshuffled array is faster because we access the integer objects in the same order in which they were allocated (and they do have to be allocated in some order).
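If you want to convince yourself that allocation order, not value order, is what matters, here is one experiment (my own sketch, assuming Python 2; the i + 0 forces a fresh integer object for values outside the small-int cache). Allocate the objects in a random order, then compare iterating them in allocation order against iterating them in sorted order:

import time
from random import shuffle

order = [i for i in range(1000000)]
shuffle(order)
a = [i + 0 for i in order]    # fresh int objects, allocated in shuffled order

def time_sum(seq):
    t1 = time.time()
    total = 0
    for x in seq:
        total = total + x
    t2 = time.time()
    return t2 - t1

print "allocation order:", time_sum(a)
print "value order:     ", time_sum(sorted(a))

If the explanation above is right, the first loop should be the fast one even though its values are shuffled, and the second should be the slow one even though its values are sorted.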
