Multi-core programming and multi-core computing

Source: Internet
Author: User
Tags: thread, logic

Multi-core computing: some things worth knowing at the CPU and memory layers when programming for multiple cores, in an attempt to find the essence of multi-core coordination.

The discussion below mostly refers to the x86 architecture, simplified or adjusted as needed.

First, the caches.
To resolve the speed imbalance between memory access and CPU operations, so that memory access does not drag the CPU back, the principle of locality is exploited to layer the memory hierarchy and improve read/write performance; these layers are the caches. For now, set the caches aside and face a simpler picture: several cores sharing one memory. The whole storage system becomes a black box, and the cache uses its own coherence protocol to guarantee that stale data is never read and that writes take effect. (In real-world optimization the cache is the first place to look; treating storage as a black box here does not mean the cache is unimportant.)

When x86 reads or writes data of certain lengths at locations that satisfy certain alignment requirements, competition for the bus forces these operations to execute in some serial order, and each such operation is atomic. Why are other operations not atomic? Because they involve a read, a computation, and a write, i.e. multiple accesses to memory, and other operations may slip in between those accesses; such an operation is not atomic.

From the memory's perspective, what arrives is a serialized stream of reads and writes. As a memory, we must guarantee that after some location is written, a subsequent read of that location returns the last value written. Operations are serialized by the bus, but what if there are multiple buses? The memory must still guarantee that a read of a location returns the last value written there. So memory design must consider how to preserve this property, which leads to X-reader-Y-writer memory: with X readers and Y writers, operations on the same basic storage unit are mutually exclusive.

Consider such a memory. The abstract memory we obtain is actually the simplest concurrent object: each storage unit is a concurrent object called an atomic register. It may be read and written by multiple cores, but reads and writes are guaranteed to be mutually exclusive. One layer up, the memory that organizes these storage units is also a concurrent object. Does the correctness of each storage unit guarantee the correctness of the memory as a whole?


Here we introduce the concept of a concurrent object: it provides some operations and can be used by multiple cores simultaneously. A concurrent object is defined against its own sequential abstraction, and when multiple operations run at the same time it must guarantee some correctness condition. Such conditions include quiescent consistency, sequential consistency, and linearizability. Sequential consistency requires that the concurrent execution be equivalent to some sequential order that respects each core's program order, while linearizability additionally composes from parts to the whole: if each part is linearizable, so is the combination. A storage unit supporting X readers and Y writers is a concurrent object whose operations are storing and fetching data. The mutual exclusion of its operations gives sequential consistency, and it appears it can also be linearizable (as described in the book). Therefore, when we consider the entire memory, this concurrent object, which provides reads and writes of many storage units, is also correct.

Now compare the abstract concurrent memory with the real one. As mentioned above, operations on the bus are mutually exclusive, so we get a very powerful memory that supports concurrent reads and writes. But this benefit is provided by the bus, which lightens the design burden of the memory itself. (And what about the cache we set aside earlier?) Note that the memory's storage unit is the byte and data travels on the bus in blocks (apparently 64 bits), while the CPU may read and write 1, 2, 4, or 8 bytes. So not only can a single storage unit be read and written correctly, but multiple aligned units can be read and written correctly as a whole.

[What about more advanced concurrent objects?]

On the one hand, we have described how reads and writes of memory behave on x86; on the other, we have introduced the correctness conditions of concurrent objects and observed how the memory itself acts as a concurrent object.

One problem: suppose an instruction has been decoded on one core and dropped into the out-of-order engine for execution, while at the same time another core modifies the memory holding that instruction. Clearly this modification (assume it completes in an instant) will not cause the first core to re-read the instruction. But this framing is flawed: the notion of "at the same time" across two cores is meaningless. In shared-memory concurrent computing we assume that different execution units run at different, unpredictable speeds and may pause for arbitrary intervals at any time; we cannot speak of "at the same time". Why did we want to, then? Because the two cores share data: the instruction bytes of one core are also data another core wants to write. From the storage unit's perspective, all that matters is what value is returned when others read and what value is stored when they write. From the execution unit's perspective, all that matters is what value we read and what value we write.


If two execution units (here, core and thread are interchangeable) share nothing, there is nothing to consider at this layer; concurrency and parallelism can only be discussed at a higher level of abstraction. But that is unrealistic: multi-core computing is made complicated precisely by sharing. Communication and synchronization are implemented through shared items so that multiple execution units can coordinate their work. A shared item can be a high-level abstraction, but at the bottom it is always a storage unit. The simplest synchronization is to make two execution units mutually exclusive. The model so far already supports such mutual exclusion, and what we have is enough to realize the Peterson lock correctly. (In theory; in practice it is more complicated.)

CPU instructions exist at several levels. What we usually think of is an assembly instruction we can see, but instructions are also translated into micro-operations. A current Intel CPU has four decoding units: three simple decoders and one complex-instruction decoder. After micro-operations are dispatched to the out-of-order engine, each execution port performs still smaller steps to complete a micro-operation. This is somewhat different from the reads and writes discussed above. For this reason, the CPU provides some relatively high-level atomic operations (the lock-prefixed instructions), expressed as assembly instructions. While one runs, the bus is locked and memory is held exclusively across the read, the computation, and the write. Each such instruction executes on a single execution unit, only one executes at a time, and its result is observable afterwards. (Now we can implement the Peterson lock.)

Some problems remain: CAS, the store buffer, out-of-order execution, and memory barriers.

Atomic operations such as CAS are a bit special: there is a "branch" inside the instruction (it either succeeds or fails, and reports which). Their significance is that they provide an infinite consensus number. Without such logic, the earlier read/write atomic operations cannot reach consensus among N cores without on the order of N storage units and mutual exclusion; CAS avoids this problem.

All the writes mentioned so far were assumed to take effect immediately. The store buffer allows a write to be deferred: the core can move on before the write actually reaches memory.

Out-of-order execution on x86 allows a load from one location to be reordered ahead of an earlier store to a different location.

This is the memory ordering of processors with store buffers. What problems does it cause? From the store buffer's side, a write may not take effect immediately; from the reordering side, the order in which other cores observe this core's operations on the corresponding storage units is affected. Hence the memory barrier: operations before the barrier must take effect before operations after it; as seen from other cores, the barrier divides the operations into two parts. The essence of the problem is the visibility of a core's internal operation order to the outside. Multi-core coordination is the pursuit of causal order.

At the multiprocessor level, this is probably the minimum one needs to understand about multi-core computing.

One layer up, for example in the memory model of multi-threaded execution in C++11, a large number of orderings are discussed at length. The problem being solved is the same: the operation order inside a thread, as observable by other threads, coordinates a global order. When an operation order need not be observed, it can be optimized under single-thread logic; when it must be observed, threads coordinate their work by that order to keep the program correct. The so-called release-acquire semantics mean that when one thread's acquire sees the expected value, the other thread's release of that value, and every action before the release, has taken effect. Finer-grained still is consume; coarser-grained is the mutex. And between the C++ layer and the CPU layer there is still the compiler...

Lock-free or locked, none of this can be escaped in multi-core programming.

PS: this article is purely my own speculation; I take no responsibility for the consequences.

Reference: The Art of Multiprocessor Programming, Maurice Herlihy and Nir Shavit (Chinese translation by Jin Hai and Hu Kan).






Follow-up note:
The total order of lock-prefixed operations makes the changes to a set of variables (storage locations) observed by multiple cores consistent; multi-core collaboration is only possible on the basis of this consensus. A lock operation on a single core alone is useless: when lock operations on one core are performed in some order, other cores must observe the corresponding operations in the same order; the operation order is part of the consensus. But such a consensus requirement is rather strong, so a weaker one is the visibility of one core's operation order to other cores. This is the necessity of the memory barrier: it makes the internal operation order visible to the outside (a visible write meaning the write has taken effect). Beyond memory barriers and lock instructions, multi-core reads and writes of memory must also satisfy certain consistency requirements; for example, the Intel documentation states that stores are transitively visible, and stores are seen in a consistent order by other processors.

Overall:
Execution model: every core executes out of order, reads a memory shared with the others, and writes it through a store buffer. During execution, each core can observe certain changes, and the observations satisfy a certain degree of agreement, that is, a certain consensus. If that consensus is not strong enough, use a memory barrier; if that still does not suffice, there are atomic operations.

The essence of multi-core collaboration is consensus, which makes agreement on certain changes possible. The consensus at the three levels above runs from weak to strong. The most important thing about consensus is order: the strong form is a global order; the weak form is letting others know your own order.

Multi-core and distributed systems share the following two aspects:
Concurrent objects and their correctness.
Consensus must be reached for collaboration.


But does what matters most about consensus, the order, still hold in a distributed environment? Not directly: multiple cores have shared storage, which a distributed system lacks. However, we can introduce something similar in a distributed setting, such as a coordinator; in this way the distributed problem is transformed into the multi-core one.


Multi-core Programming Problems

As long as you write a multi-threaded program, the operating system automatically allocates CPU time slices to it.
Whether it actually exploits multiple cores depends on the operating system and on your program.

What is the difference between multi-core programming and single-core programming?

Exclusive resource locking. Resources are limited while concurrent programs are not, so conflicts arise, and we use locks to resolve them.
Single process/thread: not required.
Multiple processes/threads: required.
