In the early days, Linux also had a big kernel lock (BKL): the whole kernel was locked whenever a CPU entered kernel mode, while user-mode code could still run in parallel. The semantic advantage of a big kernel lock is that it cannot deadlock, and code that was never written to be multi-threaded still gets a little benefit from multiple cores.
A "better design" is not the interesting part; the BKL could only be replaced by fine-grained locks bit by bit, a job that took the Linux community many years. But Linux promises no stable internal API: whoever changes an interface is expected to fix every in-tree user, and out-of-tree modules are on their own, since nobody told them not to get merged into the mainline.
Python and Ruby, however, have a loose culture of C-based third-party extensions. The cost of changing the interface is very high, and the benefit is not necessarily significant. For example:
- After fine-grained locks are introduced, is the performance of existing single-threaded scripts really unaffected?
- How do you guard against deadlocks introduced by third-party libraries?
- Network I/O workloads can be handled with coroutines, so what is multi-threading for? And who uses Python for compute-intensive work anyway?
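As a concrete illustration of the coroutine point above, here is a minimal `asyncio` sketch (the task names and delays are made up) where several simulated network waits overlap in a single thread, no extra threads needed:

```python
import asyncio
import time

async def fetch(name: str, delay: float) -> str:
    # Simulate a network round-trip with a non-blocking sleep.
    await asyncio.sleep(delay)
    return f"{name}: done"

async def main() -> list[str]:
    # Three "requests" wait concurrently in one thread, one event loop.
    return await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")  # roughly 0.1s total, not 0.3s
```

The waits are concurrent, so total time is close to the longest single wait, which is exactly why I/O-bound Python rarely needs threads.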
An article on the Python wiki, GlobalInterpreterLock, explains in detail why the GIL cannot simply be removed.
For a new language it is relatively easy to avoid a GIL: restrict the C extension interface to a plain FFI that never exposes the interpreter's internal implementation details, and lock granularity becomes purely an implementation matter. But that requires the language itself to be fast enough, so it turns into a trade-off between performance and time to release. The GIL is not an inefficient design; its drawback is simply that the lock granularity is too coarse. In my opinion, Lua got this right.
Lua also has a GIL-like structure, but the lock's scope is a single runtime state, and one OS process can host many Lua states at the same time. Within one Lua state only one statement executes at a time; if you want parallelism inside the process, you start several Lua states and have them communicate with each other.
So Lua's lock covers only one Lua state, and multiple Lua states can run in parallel inside one OS process. The problem with Python's GIL is that it is global to the OS process, so no two Python statements in that process can ever execute simultaneously.
Locking a whole state keeps the virtual machine design simple and efficient, which is understandable. Early Java did the same; later Java spent enormous effort on finer-grained locking inside the VM, which lets one JVM run threads in parallel, but it made the JVM much more complex and cost performance, because the VM cannot predict which variables and resources need lock protection and has to check all of them. With one global lock per VM state, no per-variable, per-resource checking is needed.
So I think that is the better design: shrink the GIL's scope from the whole OS process to a single virtual machine, so that one OS process can host several VMs at once. Each VM can still use only one CPU core because of its lock, but multiple VMs together let the process exploit multiple cores. And since they share one OS process, data can be handed over by reference, without the huge overhead of copying memory. Adding fine-grained locks, by contrast, actually slows down single-threaded programs on CPython; as the answer above says, better to keep one big lock.
Most people's complaint about the GIL amounts to "remove it and my threads will run on multiple cores". That is wishful thinking. Removing the GIL would significantly increase the implementation complexity of CPython, and even setting that aside, writing correct shared-memory multi-threaded programs would be no easier: the ways you fall on your face in C, you will fall on your face here too.
So let it go; CPython was never about raw performance anyway. It is easy enough for Python to occupy multiple cores. For scientific computing, use Theano/NumbaPro/numexpr to optimize NumPy code, or write bare multi-threaded code in Cython. For a high-concurrency server, multiprocessing spawns subprocesses directly, and frameworks like pulsar take care of it for you, using fine-grained locks + STM + TLS:
- Fine-grained locks to replace the GIL itself. Today's mutexes are built on futex, which stays in user mode on the uncontended path, so they are much faster than they used to be.
- STM to reduce the overhead of fine-grained locking on the hot paths.
- TLS to avoid contention when globals such as time.time and random.randint would otherwise touch the same shared resource.
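The TLS idea in the last bullet can be sketched at the Python level with `threading.local`: give each thread its own `random.Random` instance so threads never contend on one shared generator state (the worker function and thread count are just for illustration):

```python
import random
import threading

tls = threading.local()

def thread_rng() -> random.Random:
    # Each thread lazily creates its own generator, so no two threads
    # ever touch the same Random state.
    if not hasattr(tls, "rng"):
        tls.rng = random.Random()
    return tls.rng

results = {}

def worker(name: str) -> None:
    results[name] = thread_rng().randint(0, 9)

threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # one independent draw per thread
```

The same trick applied inside the interpreter is what would let `random.randint` run without a lock on its global state.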
It is just a matter of engineering effort.
---------------
@Fleuria asks: the Linux kernel moved to fine-grained locks without hurting its performance, so why would fine-grained locks in Python add overhead even when there is no contention?
Because the kernel's fine-grained lock is spin_lock, i.e. cmpxchg + a while loop. Under contention between cores the overhead is still nearly zero, and on the same core there is no spin_lock overhead at all, because before taking a spin_lock a core generally disables its local IRQs (for details, see the implementation of Linux irq_disable()/local_irq_disable() and related functions), so no timer interrupt can switch tasks until IRQs are re-enabled.
So why can't user mode use spin_lock? It almost can't. If a task takes the spinlock and is then preempted by a timer interrupt before unlocking, the newly scheduled task that wants the lock sits in the while (cmpxchg(...) != xxx); loop, and the loop cannot end until, tens of milliseconds later when its time slice expires, the scheduler switches back to the first task, which finally calls spin_unlock, letting the second task acquire the lock and leave the loop. User mode cannot disable IRQs, so it cannot prevent timer-driven preemption, and tasks fall into exactly this single-core spinning friction.
User processes therefore have to use a mutex, which is much slower than the kernel's spin_lock + irq_disable/enable combination. But a simple mutex can now be built on futex: on lock it first tries cmpxchg in user mode, and it only enters the kernel when there is actual contention, so the cost of an uncontended mutex is far lower than it used to be.
So with futex it is now feasible for Python to use finer-grained locks in user mode, orders of magnitude cheaper than before. A futex still cannot match kernel spin_lock + irq_disable, so some small overhead remains, but combining it with STM on the hot paths would be a good fit.
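The uncontended fast path is easy to see from Python itself: on Linux, `threading.Lock` ultimately sits on futex-backed primitives, and an uncontended acquire/release never enters the kernel. A rough micro-benchmark sketch (the iteration count is arbitrary, and absolute numbers vary by machine):

```python
import threading
import timeit

lock = threading.Lock()

def locked_noop() -> None:
    # Uncontended acquire + release: the fast path stays in user mode.
    with lock:
        pass

n = 100_000
per_op = timeit.timeit(locked_noop, number=n) / n
print(f"uncontended lock round-trip: {per_op * 1e9:.0f} ns")
```

On typical hardware this lands in the tens-to-hundreds of nanoseconds range, which is why "fine-grained locks are ruinously expensive" is much less true than it was before futex.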
Over-exposing the C interface to applications really does cause problems: it makes fine-grained locks and STM far more complicated to implement, and the workload enormous. That is the real reason the GIL still exists.

-------- 2015-02-17 update --------
The comment section has become interesting, so let me add an explanation.
About ref counts and locks: the example I gave earlier was not a good one.
To avoid drowning in implementation details, let's discuss a simplified practical model.
The conclusion: if GC is implemented with reference counting, the thread locks provided by the OS are not required.
GC: the point is first to find the Garbage, then to Collect it.
Adding a reference can never create garbage: after an increment the ref count is at least 2 (a ref count starts at 1 when the reference is created).
And once a ref count drops to 0 it can never go back to 1: for a resource to gain a new reference, the resource must already exist and be reachable, and a correct compiler/runtime only lets the count reach 0 when the resource is semantically invisible.
What is an interlocked operation? Essentially the processor asserts the lock signal for the duration of the instruction, so that certain non-string instructions with a memory operand (add, subtract, bit test, and so on) remain exclusive even when multiple processors access the same shared memory.
Such an instruction avoids the following situation: the ref count is 2, two different processors decrement it at the same time, and both end up seeing 1 (when the result should be 0).
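The lost-update problem above can be reproduced at the Python level: a plain `x -= 1` compiles to separate load/subtract/store steps that can interleave between threads, while wrapping the read-modify-write in a lock plays the role of the interlocked instruction. A sketch (the unsynchronized counter may or may not end at 0 on any given run, so no exact value is claimed for it):

```python
import threading

N = 100_000

# Unsynchronized decrement: load / subtract / store can interleave
# between threads, so updates may be lost -- the analogue of two CPUs
# both reading 2 and both writing back 1.
unsafe = 2 * N
def unsafe_dec() -> None:
    global unsafe
    for _ in range(N):
        unsafe -= 1

# Interlocked analogue: the lock makes the read-modify-write one unit.
safe = 2 * N
safe_lock = threading.Lock()
def safe_dec() -> None:
    global safe
    for _ in range(N):
        with safe_lock:
            safe -= 1

for fn in (unsafe_dec, safe_dec):
    ts = [threading.Thread(target=fn) for _ in range(2)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()

print("unsafe:", unsafe, "safe:", safe)  # safe is always 0
```

A real refcounting runtime uses the hardware interlocked decrement instead of a lock, but the correctness requirement is the same.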
So how is the collection itself implemented?
if (LockedDecrement(&something->ref) == 0) free(something);
/* In practice, before the free you must recursively "discover and collect" every other resource that something references. */
/* That is: if something is found to be Garbage, then Collect(something). */
A better design to replace the GIL is the coroutine. Since only one thread can run at a time anyway, it is better to let the language's virtual machine manage the coroutine stacks itself. The GIL is the lazy approach: it uses the C runtime stack as the language stack and then borrows the OS kernel scheduler to simulate what the VM's own coroutine stack-switching would do.
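"The VM manages the stacks itself" can be sketched with Python generators: each `yield` is an explicit switch point, and a toy round-robin scheduler replaces the kernel scheduler entirely (the task bodies here are invented for illustration):

```python
from collections import deque

def task(name: str, steps: int):
    # A coroutine as a generator: each yield hands control back to the
    # scheduler -- a stack switch managed entirely by the runtime.
    for i in range(steps):
        yield f"{name} step {i}"

def run(tasks) -> list:
    # Cooperative round-robin: no OS threads, no kernel involvement.
    queue = deque(tasks)
    log = []
    while queue:
        t = queue.popleft()
        try:
            log.append(next(t))
            queue.append(t)        # reschedule at the back of the queue
        except StopIteration:
            pass                   # task finished, drop it
    return log

log = run([task("a", 2), task("b", 2)])
print(log)  # ['a step 0', 'b step 0', 'a step 1', 'b step 1']
```

No lock is needed anywhere, because switches happen only at points the tasks themselves choose, which is exactly the property the GIL fakes with kernel scheduling.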
Recommended reading: "Working simultaneously vs waiting simultaneously". Note that its "GIL isn't great" part refers to the context management of Go and Erlang.
"Coroutines in one page of C" is also worth mentioning; it is similar to how Lua implements coroutines.
In a language with continuations, like Ruby, coroutines can be implemented on top of them. However, full continuations are much less efficient than one-shot continuations (i.e. coroutines); I don't know whether Ruby has implemented the latter.
Even without coroutines, you can fall back on callbacks, Node.js-style (essentially CPS). But without proper closures, Python cannot comfortably do even that.
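For contrast, here is a minimal callback/CPS sketch in Python: the nested closures capture the pending state that a coroutine would otherwise keep on its own stack (the `async_add` step is invented purely for illustration):

```python
def async_add(a: int, b: int, k) -> None:
    # CPS: instead of returning a value, pass it to the continuation k.
    k(a + b)

def main(done) -> None:
    # Each nested closure captures the intermediate state that a
    # coroutine would have kept on its stack between yields.
    def after_first(x: int) -> None:
        def after_second(y: int) -> None:
            done(y)
        async_add(x, 10, after_second)
    async_add(1, 2, after_first)

out = []
main(out.append)
print(out)  # [13]
```

The nesting grows with every asynchronous step, which is the "callback hell" that coroutines avoid by keeping that state on a real stack.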
So yes, the GIL is inefficient, but it is not terrible. The bad part is that Python has the GIL and lacks proper closures, whereas Ruby at least has the GIL plus continuations. The GIL is not a design; it is a fix for multithreading not having been considered in the early days; it would have been fine had multithreading been considered from the start. (This is about the implementation layer: taking Ruby as the example, JRuby can support real multi-core, but MRI added the GIL because the standard library and many C extensions never considered multithreading.) Setting the GIL aside and speaking at the language level: adding features like Erlang-style actors or Rust tasks makes it easy to write correct multi-threaded programs, and where memory really is shared, finer-grained locks should beat a global lock.