If the GIL is an inefficient design, what is the best alternative?

Source: Internet
Author: User

Reply content:

Early Linux also used a Big Kernel Lock: kernel mode was serialized under one lock while user mode could still run in parallel. The semantic benefit of the Big Kernel Lock is that there are no deadlocks, and existing non-thread-aware code can still get a little benefit from multiple cores.

"Better design" is not fun, only 1.1 points to a grain to replace a fine intensity of the lock, this job in the Linux community seems to have been doing for many years. But Linux fortunately does not make a commitment to the API, who modified the interface, who can modify all of their own extensions, not in the main line of expansion not care, hanging on the hanging, who let them not into the main line it.

But for Python/Ruby, with their dense culture of C-based third-party libraries and loosely organized communities, the cost of changing the interface is too high, and the benefit is not necessarily significant anyway. For example:
    • After switching to fine-grained locks, is the performance of existing single-threaded scripts really unaffected?
    • How do you keep third-party libraries from deadlocking against each other?
    • Network-IO workloads can just pile on more processes, so what do we need multithreading for? And who does compute-heavy work in Python anyway?
There is a page on the Python wiki that lists in more detail why it can't simply be removed: GlobalInterpreterLock

For a brand-new language, avoiding the GIL is relatively easy: restrict C extensions to an FFI-only interface and never expose the interpreter's internal implementation details, so lock granularity becomes purely an implementation-level concern. But that requires the language itself to be fast enough, and it is a tradeoff between performance and time to ship. The GIL is not an inefficient design; its drawback is simply that the lock granularity is too coarse. In my opinion, Lua's design here is very good.

Lua also has a structure similar to the GIL, but the scope of the lock is limited to one runtime environment (one lua_State). Within a single OS process, multiple Lua runtime environments can exist at the same time. Inside any one Lua environment only one statement executes at a time, but if you want parallelism within an OS process you can simply open many Lua environments at once, and they can communicate with each other.

So Lua's lock only covers a single Lua environment, and multiple Lua environments can execute in parallel within one OS process. The problem with Python's GIL is that it is global to the whole OS process, which prevents Python statements from executing in parallel at all.
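As a rough illustration of that per-environment scoping (a minimal sketch, assuming Lua 5.x with pthreads; not from the original answer): several independent lua_States in one OS process, each driven by its own thread, can run genuinely in parallel because nothing is shared between them.

    #include <pthread.h>
    #include <lua.h>
    #include <lauxlib.h>
    #include <lualib.h>

    /* Each thread gets its own, completely independent Lua runtime. */
    static void *worker(void *chunk) {
        lua_State *L = luaL_newstate();         /* one "VM" per thread        */
        luaL_openlibs(L);
        luaL_dostring(L, (const char *)chunk);  /* touches only this state    */
        lua_close(L);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        static char job[] = "local s = 0 for i = 1, 1e7 do s = s + i end";
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, job);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);           /* four cores busy, no shared lock */
        return 0;
    }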

Synchronizing with one lock per environment keeps the virtual machine design simple and efficient, which is understandable in itself. Java was the same in its early days, but later put a great deal of effort into finer-grained locking inside the VM, which lets Java run multiple threads in parallel within a single virtual machine. This made the JVM a lot more complex, and performance also suffers, because the implementation cannot predict well which variables and resources need to sit inside a lock-protected region and has to check them one by one; a global lock on the whole VM environment needs no such per-variable, per-resource checking.

So in my view the better design is to narrow the scope of the GIL from "global to the OS process" to "global to one virtual machine", and allow multiple virtual machines to exist in the same OS process. Each virtual machine can still only use one CPU core, because its GIL is still there, but multiple virtual machines let the OS process as a whole exploit multiple cores; and because they live in one OS process, data can be exchanged directly by reference, without the significant overhead of copying memory. Adding fine-grained locks would actually make CPython slower at running single-threaded programs; that is what the earlier answer meant by saying it is not as good as one big lock.

Most people think the GIL is bad because removing it would let multithreaded code saturate every core, but that line of thinking has a problem... Removing the GIL would significantly increase the complexity of the CPython implementation, and even setting that aside, the difficulty of writing shared-memory multithreaded programs would not drop at all: every pit you can fall into in C, you will fall into here too, not one fewer.

So just let it be; CPython is what it is, and it was never aimed at raw performance anyway. If you want Python to occupy many cores, that is simple enough: for scientific computing, optimize your numpy calculations with theano/numbapro/numexpr or hand-write multithreading in Cython; for highly concurrent servers, just have multiprocessing spawn sub-processes; and a framework like pulsar can do the fine-grained locks + STM + TLS for you:
    • Use fine-grained locks in place of the GIL; today's mutexes have a user-space futex-based implementation and are much cheaper than they used to be.
    • Use STM to reduce the overhead of those fine-grained locks on the hot paths.
    • Use TLS to keep globally shared functions such as time.time and random.randint from contending for the same resource (see the sketch just below).
It is only a question of how much work it takes.
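For the TLS point above, the idea is simply to give each thread its own copy of otherwise-global state so threads stop contending for it. A minimal sketch in C (hypothetical helper names, my own illustration, not pulsar's actual API):

    #include <stdlib.h>

    /* __thread gives every thread its own private copy of this variable,
       so no lock is needed when many threads generate random numbers. */
    static __thread unsigned int rng_state = 1;

    void rng_seed(unsigned int s) { rng_state = s; }

    int rng_next(void) {
        /* rand_r() works purely on the state passed in: no global, no lock. */
        return rand_r(&rng_state);
    }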
---------------
@fleuria The Linux kernel uses finer-grained locks; why can Linux do that without hurting performance, while fine-grained locking in Python would add extra overhead even when there is no contention?

Because the fine-grained locks the kernel uses are spin_locks, i.e. cmpxchg + while: contention overhead across cores is close to zero, and within a single core there is no spin_lock contention at all, because before taking a spin_lock the kernel usually disables that CPU core's IRQs first (see the implementation of Linux functions such as irq_disable()/local_irq_disable()), so a timer interrupt cannot come in and force a task switch while the lock is held.
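A rough sketch of that "cmpxchg + while" shape, using GCC atomic builtins (my own illustration, not the kernel's actual code; the kernel additionally disables local IRQs around it, which user space cannot do):

    static volatile int lock_word = 0;

    void my_spin_lock(void) {
        /* Keep retrying the compare-and-swap until we flip 0 -> 1. */
        while (__sync_val_compare_and_swap(&lock_word, 0, 1) != 0)
            ;   /* spin: only acceptable if the lock holder cannot be preempted */
    }

    void my_spin_unlock(void) {
        __sync_lock_release(&lock_word);   /* store 0 with release semantics */
    }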

Can user space use spin_lock? Almost never. On a single core, the task holding the spin_lock can be preempted by a timer interrupt before it unlocks; the newly scheduled task then enters the spin_lock's while (cmpxchg(...) == XXX) loop and has to keep spinning until, dozens of milliseconds later, its time slice ends, the first task gets scheduled back in and calls spin_unlock, and only then can the second task take the lock and leave the loop. Because user space cannot disable IRQs, it cannot stop the timer interrupt from forcibly switching tasks, so it falls straight into this single-core spin_lock busy-waiting state.

So user processes have to use a mutex instead, which is many times slower than the kernel's spin_lock + irq_disable/enable combination. Nowadays, though, the mutex has a user-space implementation based on the futex: on lock it probes first with a cmpxchg, and if there is no contention it never enters the kernel at all; only under real contention does it drop into kernel mode. That makes the cost of taking a mutex far lower than it used to be.

So with futexes, Python could plausibly use finer-grained locks in user space at a cost several orders of magnitude lower than before. A futex still does not quite match the kernel's spin_lock + irq_disable, so some small cost remains; combining it with STM on the hot paths would be a good fit.
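A condensed sketch of such a futex-backed mutex, following the scheme in Ulrich Drepper's "Futexes Are Tricky" (Linux + GCC builtins assumed, error handling omitted). The key property is the one described above: the uncontended path is a single cmpxchg in user space, and the kernel is entered only when there is real contention.

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* 0 = unlocked, 1 = locked, 2 = locked with waiters */
    static int m = 0;

    static long futex(int *uaddr, int op, int val) {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    void futex_mutex_lock(void) {
        int c = __sync_val_compare_and_swap(&m, 0, 1);  /* fast path: no syscall */
        if (c == 0)
            return;
        do {
            /* Mark the lock contended, then sleep in the kernel until woken. */
            if (c == 2 || __sync_val_compare_and_swap(&m, 1, 2) != 0)
                futex(&m, FUTEX_WAIT, 2);
        } while ((c = __sync_val_compare_and_swap(&m, 0, 2)) != 0);
    }

    void futex_mutex_unlock(void) {
        if (__sync_fetch_and_sub(&m, 1) != 1) {   /* were there waiters? */
            m = 0;
            futex(&m, FUTEX_WAKE, 1);             /* enter the kernel only then */
        }
    }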

It is true that exposing the C interface to applications causes problems and complicates any implementation of fine-grained locks or STM; that is the real reason the GIL exists.

--------2015-02-17 Update--------
The comment section is starting to get interesting, so here is a supplementary explanation.
About ref counts and locks: the example I cited earlier was not a good one.
To avoid getting bogged down in the implementation details of any particular language, let's discuss a simple, practical model.
The conclusion: reference counting does not need an OS-provided thread lock to implement GC.
GC boils down to discovering G(arbage), and then C(ollecting) it.

Adding a reference never exposes garbage: after a reference is added, the ref count must be >= 2 (the count is 1 when the thing is created).
Once the ref count reaches 0 it can never go back to 1. For a resource to gain a reference it must first exist and be reachable; assuming the compiler/implementation is correct, the count only drops to 0 once the resource is no longer visible anywhere in the program's semantics.

What is an interlocked operation? It is a small set of non-string instructions (add, subtract, bit operations) that assert the LOCK signal while they execute, guaranteeing exclusive access when multiple processors touch the same shared memory.
Such an instruction prevents the situation where the ref count is 2, two different processors each decrement it, and both observe the result 1 (or both observe 0), instead of one seeing 1 and the other 0.

How do I implement GC?

if (lockDecr(&sth.ref) == 0)
    free(sth);
/* In a real system, before free() we would recursively apply the same
   "discover, then collect" step to every other resource referenced by sth. */
/* i.e. if sth is discovered to be G(arbage), then C(sth). */
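For concreteness, lockDecr could be written with a GCC atomic builtin; this is my own guess at a plausible implementation, not anything from the original post: it atomically decrements the count and returns the new value.

    /* Atomically decrement *p and return the new value (LOCK-prefixed on x86). */
    static int lockDecr(int *p) {
        return __sync_sub_and_fetch(p, 1);
    }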
A better design than the GIL is the coroutine. Since only one thread can run at a time anyway, let the language's virtual machine manage its own coroutine stacks. The GIL is the lazy approach: it uses the C runtime stack as the language's stack, and then leans on the OS kernel's scheduler to simulate what coroutines would otherwise do by switching their own stacks.
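A minimal sketch of a coroutine with its own stack, using POSIX ucontext (my own illustration, in the spirit of the "coroutines in one page of C" idea mentioned below): the host code owns the coroutine stack and decides when to switch, with no kernel scheduling involved.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, co_ctx;
    static char co_stack[64 * 1024];          /* the coroutine's own stack */

    static void co_body(void) {
        for (int i = 0; i < 3; i++) {
            printf("coroutine step %d\n", i);
            swapcontext(&co_ctx, &main_ctx);  /* yield: no kernel scheduler involved */
        }
    }

    int main(void) {
        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp   = co_stack;
        co_ctx.uc_stack.ss_size = sizeof co_stack;
        co_ctx.uc_link          = &main_ctx;  /* where to return when co_body ends */
        makecontext(&co_ctx, co_body, 0);
        for (int i = 0; i < 3; i++) {
            swapcontext(&main_ctx, &co_ctx);  /* resume the coroutine */
            printf("back in the caller\n");
        }
        return 0;
    }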

I recommend an article: "working simultaneously vs waiting simultaneously".
Note the part where it says the GIL isn't great, comparing it with how Go and Erlang manage their execution contexts.
It also mentions "coroutines in one page of C", similar to how Lua implements its coroutines.

Since Ruby has continuations, coroutines can be built on top of them. However, a full continuation is far less efficient than a one-shot continuation (i.e. a coroutine); I don't know whether Ruby has implemented the latter.

If there really are no coroutines, you can fall back to callbacks the way node.js does (essentially CPS). But Python can't really pull that off with its closures.

So yes, the GIL is inefficient, but it is not terrible. What is bad is the combination: Python has a GIL but weak closures, and Ruby has both a GIL and continuations; those pairs don't fit together. The GIL is not a design, just a patch for a problem that wasn't considered early enough. Conversely, if multithreading is considered from the start, the GIL can be thrown out of the language entirely (this is an implementation-level matter: JRuby, for example, supports true multicore, but Ruby added a GIL because the standard library and many C extensions were never written with multithreading in mind). With features like Erlang's actors or Rust's tasks it becomes easy to write correct multithreaded, lock-sharing code, and finer-grained locks really should beat a single global one.