Author: banyao2006
Compared with single-core processors, multi-core processors face great challenges in architecture, software, power consumption, and security design, but they also have great potential.
Like SMT, CMP is committed to exploiting coarse-grained parallelism in computation. CMP can be seen as a product of the development of large-scale integrated circuit technology: once chip capacity is large enough, the nodes of an SMP (symmetric multi-processor) or DSM (distributed shared memory) massively parallel architecture can be integrated onto a single chip, with each processor executing different threads or processes in parallel. In a single-chip multiprocessor based on the SMP structure, the processors communicate through a shared cache or off-chip shared memory. In one based on the DSM structure, the processors communicate through an on-chip high-speed crossbar network connected to distributed memories. Since SMP and DSM are already very mature technologies, the CMP structure is comparatively easy to design, although it places high demands on back-end design and the chip fabrication process. As a result, CMP became the first "future" high-performance processor structure to be applied in commercial CPUs.
Although the increasing integration of multiple cores brings many benefits and multiplies chip performance, it also clearly imports into the processor some problems that were originally system-level.
1. Core structure: homogeneous or heterogeneous
The cores of a CMP fall into two categories: homogeneous and heterogeneous. Homogeneous means that the internal cores have the same structure; heterogeneous means that their structures differ. Choosing core structures suited to different applications is therefore crucial to the performance of future microprocessors. The core structure bears on the area, power consumption, and performance of the entire chip, and how the results of traditional processor design are inherited and developed directly affects the performance and implementation cycle of a multi-core design. At the same time, according to Amdahl's law, a program's speedup is limited by the performance of its serial portion, so in theory a heterogeneous microprocessor structure should offer better performance.
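Amdahl's law can be made concrete with a small calculation; the serial fraction and core count below are illustrative values, not figures from any particular processor:

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Amdahl's law: only the parallel portion of a program
    benefits from adding cores, so the serial fraction caps
    the overall speedup."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# A program that is 10% serial gains less than 5x on 8 cores:
speedup = amdahl_speedup(0.1, 8)   # ~4.7, far from the ideal 8
```

Even with infinitely many cores, a 10% serial fraction limits the speedup to 10x, which is why devoting some chip area to one strong core for the serial portion (a heterogeneous design) can pay off.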
The instruction set used by the cores is also very important to the system implementation. Whether the cores share one instruction set or use different ones, and whether each can run an operating system, will also be subjects of study.
2. Program execution model
The primary issue in multi-core processor design is the choice of a program execution model. The applicability of the model determines whether the multi-core processor can deliver the highest performance at the lowest cost. The program execution model is the interface between the compiler designer and the system implementer: the compiler designer decides how to translate a high-level language program into a target machine language program based on the model, while the system implementer determines how to realize the model efficiently on the target machine. When the target machine is a multi-core architecture, several questions arise: how does the multi-core architecture support the important existing program execution models? Are there other models better suited to multi-core architectures? And how can those models meet application requirements and win acceptance from users?
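As a concrete example of one such model, the fork-join, shared-memory threading model splits work across cores and joins the partial results. The sketch below is an illustrative Python rendering of the idea, not a description of any specific compiler or runtime:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, n_workers=4):
    """Fork-join execution model: 'fork' the data into chunks
    handed to worker threads, then 'join' the partial sums.
    How workers map onto hardware cores is hidden from the
    programmer by the runtime."""
    chunk = max(1, len(data) // n_workers)
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(sum, pieces))
```

The compiler designer targets this abstract fork-join contract; the system implementer decides how cheaply the architecture can support the forks, joins, and shared data underneath it.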
3. Cache design: multi-level caches and consistency problems
The speed gap between the processor and main memory is a prominent contradiction for CMP, so multi-level caches must be used to alleviate it. Current designs include CMPs with a shared level-1 cache, CMPs with a shared level-2 cache, and CMPs that share only main memory. Usually a CMP adopts the shared level-2 cache structure: each processor core has a private level-1 cache, and all processor cores share the level-2 cache.
The architectural design of the cache also bears directly on overall system performance. In a CMP structure, shared caches and private caches each have advantages and disadvantages. Whether to build a multi-level cache on chip, and how many levels to build, must be carefully studied and discussed, because these choices strongly affect the area, power consumption, layout, performance, and operating efficiency of the entire chip.
On the other hand, multi-level caches lead to consistency problems. The cache consistency model and mechanism have an important impact on overall CMP performance. Consistency models widely used in traditional multiprocessor systems include the sequential consistency model, the weak consistency model, and the release consistency model. The main consistency mechanisms are bus-snooping protocols and directory-based protocols. Currently, most CMP systems use bus-based snooping protocols.
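The core idea of a bus-snooping protocol is that every cache watches every bus transaction and invalidates its own stale copies. The sketch below is a deliberately simplified, write-through invalidation protocol in Python (an MSI-style reduction with only Modified/Shared states, not any specific commercial implementation):

```python
class Bus:
    """Broadcast medium: every attached cache snoops every transaction."""
    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, addr, sender):
        for cache in self.caches:
            if cache is not sender:
                cache.snoop_invalidate(addr)

class SnoopingCache:
    def __init__(self, bus, memory):
        self.bus = bus
        self.memory = memory
        self.lines = {}            # addr -> (state, value); states: 'M', 'S'
        bus.caches.append(self)

    def read(self, addr):
        if addr in self.lines:     # hit
            return self.lines[addr][1]
        value = self.memory.get(addr, 0)   # miss: fetch a shared copy
        self.lines[addr] = ('S', value)
        return value

    def write(self, addr, value):
        # Invalidate every other core's copy before writing.
        self.bus.broadcast_invalidate(addr, self)
        self.lines[addr] = ('M', value)
        self.memory[addr] = value  # write-through keeps memory current

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None) # drop the now-stale copy
```

With two caches on one bus, a write by one core invalidates the other core's cached line, so the next read misses and fetches the fresh value — the consistency guarantee the text describes, bought at the cost of broadcasting on a shared bus.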
4. Core communication technology
Programs executed on the CPU cores of a CMP processor sometimes need to share and synchronize data, so the hardware structure must support inter-core communication. An efficient communication mechanism is an important guarantee of high CMP performance. Currently there are two mainstream on-chip communication mechanisms: one based on a bus-shared cache structure, the other based on an on-chip interconnection structure.
The bus-shared cache structure means the CPU cores share a level-2 or level-3 cache, used to store frequently used data, and communicate through the bus connecting the cores. Its advantages are a simple structure and high communication speed; its disadvantage is that bus-based architectures scale poorly.
The on-chip interconnection structure means each CPU core has an independent processing unit and cache, and the cores are connected through a crossbar switch or an on-chip network, communicating by messages. Its advantages are good scalability and guaranteed data bandwidth; its disadvantages are a more complex hardware structure and larger software changes.
The two are not necessarily competitors that will replace each other; they may instead cooperate, for example using a global network together with local buses to strike a balance between performance and complexity.
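The message-passing style of the on-chip interconnection structure can be sketched in miniature by modeling each core as a thread with a private inbox; the queue stands in for the on-chip network (an illustrative model, not hardware-accurate):

```python
import queue
import threading

class Core:
    """A 'core' with a private inbox; send() models a transfer
    over the on-chip network rather than through shared cache."""
    def __init__(self, core_id):
        self.core_id = core_id
        self.inbox = queue.Queue()

    def send(self, other, payload):
        other.inbox.put((self.core_id, payload))

    def receive(self):
        sender, payload = self.inbox.get()   # blocks until a message arrives
        return sender, payload

core0, core1 = Core(0), Core(1)

def worker():
    # core1 doubles whatever core0 sends and replies.
    sender, value = core1.receive()
    core1.send(core0, value * 2)

t = threading.Thread(target=worker)
t.start()
core0.send(core1, 21)
_, result = core0.receive()
t.join()
```

Because no state is shared between the two cores, no cache consistency traffic is needed; the cost moves into explicit messages, which is exactly the trade-off between the two mechanisms described above.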
5. Bus design
In traditional microprocessors, cache misses and memory-access events negatively affect CPU execution efficiency, and the efficiency of the bus interface unit (BIU) determines the degree of that impact. When multiple CPU cores simultaneously request access to memory, or cache misses occur at once in several cores' private caches, the efficiency of the BIU's arbitration mechanism for these concurrent requests, and of its mechanism for converting them into external storage accesses, determines the overall performance of the CMP system. Therefore we need to find an efficient multi-port BIU structure that converts single-word accesses from multiple cores to main memory into more efficient burst accesses. Finding the optimal number of words per burst for a CMP processor, and an efficient arbitration mechanism for multi-port BIU access, will be an important part of CMP processor research.
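The conversion of single-word accesses into bursts amounts to coalescing pending requests with consecutive addresses. A minimal sketch of that policy (the address and burst-length values are illustrative, not taken from any real BIU):

```python
def coalesce_into_bursts(word_addrs, max_burst=8):
    """Group pending single-word requests into (start_addr, length)
    bursts of consecutive word addresses, capped at max_burst words,
    so the memory interface issues fewer, longer transactions."""
    bursts = []
    for addr in sorted(set(word_addrs)):
        last = bursts[-1] if bursts else None
        if last and addr == last[0] + last[1] and last[1] < max_burst:
            last[1] += 1                  # extend the current burst
        else:
            bursts.append([addr, 1])      # start a new burst
    return [tuple(b) for b in bursts]
```

For example, requests for words 5, 1, 2, 3, 9, 4 collapse into one 5-word burst starting at 1 plus a single-word access at 9; varying `max_burst` is the "optimal number of burst access words" question the text raises.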
6. Operating system design: task scheduling, interrupt handling, synchronization and mutual exclusion
For multi-core CPUs, optimizing the operating system's task scheduling algorithm is key to ensuring efficiency. Task scheduling algorithms generally fall into global queue scheduling and local queue scheduling. In the former, the operating system maintains one global queue of waiting tasks; when any CPU core in the system becomes idle, the operating system selects a ready task from the global queue and starts executing it on that core. The advantage of this method is high CPU core utilization. In the latter, the operating system maintains a local waiting queue for each CPU core; when a core becomes idle, a suitable task is selected from that core's own queue. The advantage of this method is that tasks rarely need to migrate between cores, which helps improve each core's local cache hit rate. Currently, most multi-core CPU operating systems use global-queue-based task scheduling.
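Global queue scheduling can be simulated in a few lines: whichever core frees up first takes the next ready task from the shared queue. The task durations below are illustrative:

```python
import heapq

def global_queue_schedule(task_durations, n_cores):
    """Simulate global-queue scheduling: the first idle core takes
    the next ready task. Returns (core_id, start, finish) for each
    task in submission order."""
    cores = [(0.0, c) for c in range(n_cores)]   # (free_at, core_id)
    heapq.heapify(cores)
    schedule = []
    for duration in task_durations:
        free_at, core_id = heapq.heappop(cores)  # earliest-idle core
        finish = free_at + duration
        schedule.append((core_id, free_at, finish))
        heapq.heappush(cores, (finish, core_id))
    return schedule
```

Every core stays busy while tasks remain (the utilization advantage described above), but note that a task may land on a different core than its predecessor touched, which is exactly the cache-affinity cost that local queue scheduling avoids.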
Interrupt handling on multi-core processors differs greatly from the single-core case. The cores need to communicate with each other through interrupts, so local interrupt controllers for each core, and a global interrupt controller responsible for arbitrating interrupt distribution among the cores, must also be integrated on chip.
In addition, a multi-core CPU is a multi-tasking system. Because different tasks compete for shared resources, the system must provide synchronization and mutual exclusion mechanisms, and traditional single-core solutions cannot meet multi-core requirements. Correctness must instead be guaranteed by the hardware's atomic "read-modify-write" operations or by other synchronization and mutual exclusion mechanisms.
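One common hardware read-modify-write primitive is compare-and-swap. The sketch below models it in Python; in real hardware the atomicity comes from a single instruction, whereas here a lock stands in for it, so this illustrates the usage pattern rather than the mechanism:

```python
import threading

class AtomicCounter:
    """Models a hardware compare-and-swap (CAS) primitive.
    A lock substitutes for the atomicity that real hardware
    provides in a single instruction."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        with self._lock:                 # stands in for hardware atomicity
            if self.value == expected:
                self.value = new
                return True
            return False

    def increment(self):
        # Retry loop built on CAS: re-read and retry if another
        # core/thread won the race, so no update is ever lost.
        while True:
            old = self.value
            if self.compare_and_swap(old, old + 1):
                return old + 1

counter = AtomicCounter()
threads = [threading.Thread(
               target=lambda: [counter.increment() for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A plain unsynchronized `value += 1` can lose updates under contention; the CAS retry loop never does, which is why such primitives must be guaranteed by hardware on a multi-core CPU.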
7. Low power design
With the rapid development of semiconductor technology, microprocessor integration keeps increasing and processor surface temperature keeps rising, almost exponentially: processor power density can double roughly every three years. Low-power and thermally optimized design have therefore become core issues in microprocessor research, and the multi-core structure of a CMP makes these power issues all the more pressing.
Low-power design is a multi-level problem that must be studied simultaneously at the operating-system level, the algorithm level, the architecture level, the circuit level, and beyond. Low-power design methods at different levels have different effects: the higher the level of abstraction, the more pronounced the reduction in power consumption and temperature.
8. The memory wall
To keep the on-chip cores fully utilized, the minimum requirement is that the chip provide memory bandwidth matching its performance. On-chip cache capacity can solve part of the problem, but as performance improves further, other means of increasing memory interface bandwidth are needed, such as raising per-pin bandwidth (DDR, DDR2, QDR, XDR, and the like). The system must likewise provide high-bandwidth storage. As a result, packaging requirements keep rising; although the number of package pins increases by about 20% per year, this cannot completely solve the problem and also raises costs. How to provide a high-bandwidth, low-latency memory interface is therefore an important problem that must be solved.
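The bandwidth arithmetic behind this problem is simple: peak bandwidth is the transfer rate times the bus width. The sketch below uses a DDR2-800 module on a 64-bit bus as a worked example:

```python
def peak_bandwidth_gb_s(transfers_per_sec_millions, bus_width_bits):
    """Peak memory bandwidth = transfer rate x bus width.
    DDR-style interfaces transfer on both clock edges, which is
    already folded into the MT/s rating."""
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_sec_millions * 1e6 * bytes_per_transfer / 1e9

# DDR2-800 (800 MT/s) on a 64-bit bus:
bw = peak_bandwidth_gb_s(800, 64)   # 6.4 GB/s peak
```

Widening the bus costs pins, which is exactly the packaging constraint described above; raising the per-pin transfer rate (DDR2 to QDR to XDR) increases bandwidth without adding pins, which is why those interfaces matter for CMPs.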
9. Reliability and security design
With continuing technological innovation, processors have penetrated every aspect of modern society, but great security risks remain. On one hand, the reliability of the processor structure itself is falling: with ever-finer feature sizes, high-speed clock design, and low supply voltages, design safety margins are increasingly hard to guarantee and failure rates are gradually rising. On the other hand, malicious attacks by third parties are becoming more numerous and more sophisticated, which has become a widespread social problem. Improving reliability and security has therefore attracted much attention in computer architecture research.
In the future, concurrent execution of multiple processes on a CMP chip will become mainstream, while growing hardware complexity and design errors mean the chip will not necessarily be safe. Security and reliability design therefore still has a long way to go.