Operating System Technology Supporting Multi-Core CMP
2005-05-19
■ Dong Yuanlin, Hao Xiang, Department of Computer Science and Technology, Tsinghua University
■ Wang Dongsheng, Li Peng, Research Institute of Information Technology, Tsinghua University
The single-chip multiprocessor (CMP), and especially the single-chip symmetric multiprocessor (homogeneous CMP) that integrates multiple identical general-purpose processor cores, is the natural direction of development driven by advances in IC manufacturing technology and in microprocessor architecture since the 1990s.
Similar to the symmetric multiprocessor (SMP) architecture widely used in the server field, in a CMP system all processor cores on the same chip participate equally in task scheduling and interrupt handling, share memory and external devices, and may also share the on-chip cache (partially or entirely).
The CMP structure is relatively simple and can directly reuse existing processor cores, so the development cycle and cost are comparatively low. Another benefit of the simple structure is that it is easier to achieve a high clock speed. Because multiple processors are integrated on one chip and share the cache, communication latency between the processors is significantly reduced, which helps improve overall system performance. CMP therefore has good development prospects and a wide range of applications, and many well-known universities, research institutions, and commercial companies have carried out extensive and active research on it.
To truly leverage the advantages of CMP, support from software, especially from system software such as operating systems and compilation tools, is crucial. Without such software, CMP would merely be "idling." Each CMP system therefore requires system software tailored to it.
Challenges posed by CMP to the Operating System
System software is of great significance for the broad and deep application of CMP; here we focus on the operating system. The operating system is the basic system software of a computer system. It sits at the core of the computer system, is responsible for controlling and managing all of the computer's hardware and software resources, and is the only software that deals directly with the hardware. It is the foundation of the entire software stack and provides a convenient interface for computer users.
For users, the operating system is a resource manager: it provides system commands, interface operations, and other tools so that, in an easy-to-understand way, they can control the various hardware resources, organize their own data, get their work done, and share resources with others.
For programmers, the operating system provides an extended or virtual computing platform on top of the computer hardware. Besides system commands and interface operations, it offers system calls, which abstract away many hardware details so that programs can process data in a uniform manner. Programmers are thus shielded from many hardware specifics, which improves both development efficiency and program portability.
Parallel processing is one of the important branches of computer science and technology. Its core idea is to divide and allocate tasks sensibly so that multiple processors can execute one or more tasks at the same time, greatly improving the overall computing capability of the system. The significance of CMP is that it provides a new vehicle for parallel execution: it supports the division and allocation of tasks (that is, scheduling) among the multiple processor cores in a chip, and that task scheduling must be carried out by the operating system.
The development of CMP poses new challenges to the operating system. First, how should tasks be organized and scheduled so as to get the most out of the CMP structure? Second, how can the operating system keep a relatively stable external interface? What ordinary users hope for is a smooth transition: on the one hand, the interface should be the same as that of previous operating systems; on the other, applications that ran before should ideally run on a CMP machine directly, without modification. In other words, CMP should be transparent to users, which requires that the operating system remain unchanged in its user interface and programming interface.
How to better organize and schedule tasks so as to maximize the performance of the CMP structure is the core issue and the greatest expectation placed on CMP. Solving it requires hardware and software to cooperate, starting from task scheduling, interrupt distribution, and resource sharing. On the hardware side, the CMP system must provide new mechanisms for synchronization and mutual exclusion, interrupt distribution, and communication between CPU cores.
Key technologies supporting CMP Operating Systems
At present, research on CMP is still at an exploratory stage, and the related operating systems are likewise under active investigation. Studying, analyzing, and drawing on the key technologies that support a CMP operating system is of great value for understanding and designing one. Below we briefly introduce the technologies supporting the boot, initialization, scheduling, interrupt handling, and synchronization and mutual exclusion of CMP operating systems.
System boot and initialization
Operating system boot and initialization refers to the process from system power-on to the point where tasks are scheduled equally among the multiple processor cores. This process is the basis on which equal scheduling is established and is vital to the operation of the entire system. Although in a multiprocessor system every processor can work in parallel on an equal footing, this presupposes that the system has multiple tasks that can be executed in parallel. During boot and initialization, much of the work can only be performed serially, so at this stage the processor cores are not equal: there is a primary/secondary distinction. After the system is powered on, hardware control starts only one processor, called the main CPU or boot processor (BP), while the other processors, called application processors (APs), remain in a waiting state.
After power-on, the main CPU jumps to a specific memory address, usually mapped to read-only memory, where the boot program (bootloader) of the whole computer is stored. Its task is to perform a simple hardware test, initialize environment parameters, load the operating system kernel into memory, and jump to the kernel's start address to begin execution. This boot process is carried out entirely by the BP.
After entering the operating system kernel, the BP performs the initialization work and prepares the running environment: it sets various initial states, clears the basic read/write data segments, saves the environment parameters passed by the bootloader, sets up the memory stack, and initializes the stack pointer and global pointer. This early work is done by low-level assembly code. The BP then jumps to functions written in a high-level language and starts the second stage of CPU initialization.
During CPU initialization, the BP first runs a self-check to collect basic information about the CPU's instruction set, storage management, cache, and coprocessors. Next it prepares the running environment for the other CPUs and sets up a lock for the APs; it then wakes the APs, which jump to the address set by the main CPU, test the lock, and enter a waiting state. Once the APs have been awakened, the BP prints its own information and continues initializing resources such as memory.
The work that follows mainly involves the BP initializing the development board and external devices, and then preparing an idle process for every CPU. The idle process does not participate in scheduling; when a CPU has no task to execute, it switches to this process. After the idle processes are ready, the BP releases the lock for the APs. Each AP then starts in turn, initializes its own CPU, fills in its status in the appropriate data structures, and finally enters the idle state.
After all the APs have been initialized and entered the idle state, the BP completes the final stage of system initialization and executes the system's first process. The system then enters the symmetric multiprocessor environment, and all CPUs take part in normal, equal scheduling.
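To make the handshake concrete, here is a minimal sketch in C of the BP/AP boot sequence described above. It assumes a simple spin-wait on a shared flag; the names (boot_lock, per_cpu_init, cpu_idle) are illustrative hooks, not taken from any particular kernel.

    #include <stdbool.h>

    /* Hypothetical hooks: per-CPU setup and the non-scheduled idle loop. */
    extern void per_cpu_init(int cpu_id);
    extern void cpu_idle(void);

    /* The BP holds boot_lock until system data structures are ready;
     * each AP spins on it, then registers itself and enters idle. */
    static volatile bool boot_lock = true;   /* set by BP before APs wake */
    static volatile int  cpus_online = 1;    /* the BP counts as CPU 0 */

    void ap_entry(int cpu_id)
    {
        while (boot_lock)
            ;                                /* wait for BP to finish init */
        per_cpu_init(cpu_id);                /* fill in this CPU's status */
        __sync_fetch_and_add(&cpus_online, 1);
        cpu_idle();                          /* idle loop; never returns */
    }

    void bp_release_aps(int ncpus)
    {
        boot_lock = false;                   /* unlock: APs proceed in turn */
        while (cpus_online < ncpus)
            ;                                /* wait until all APs are up */
    }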
Scheduling
The scheduling system has a very important impact on the operating system as a whole and even on the entire computer system, and embedded, desktop, and high-end server systems place very different demands on the scheduler. In the CMP structure, the scheduling mechanism focuses on better supporting multiprocessor concurrency. Its core ideas are to reduce both the scheduling contention between CPUs and the overhead of selecting the next process to run, and to improve the system's overall load-balancing capability, thereby greatly improving the execution efficiency of the multiprocessor system. Its features are as follows:
1. Scheduling Algorithm
In a traditional single-CPU scheduling system, all ready processes (those in the TASK_RUNNING state) are organized into one doubly linked list, called the global task queue. During scheduling, every process in the list is traversed, a weight is computed for each, and the process with the largest weight is selected to run. Because the scheduler must traverse all ready processes, the time complexity of selecting the next process to run is O(n), where n is the number of ready processes. Moreover, because the ready queue is global, only one CPU may access it at a time. This costs nothing on a single-CPU system, but in a multiprocessor structure a global spin lock must be used to ensure that only one CPU accesses the queue at a time, forcing the other CPUs in the system to wait. If the number of ready processes is large, the ready queue becomes a significant bottleneck. A sketch of this scheme follows.
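The following sketch illustrates the traditional scheme: one global ready list, one global spin lock, and an O(n) scan. It is loosely modeled on schedulers of this generation; the goodness() weight and all names are simplified placeholders.

    /* Sketch of the traditional O(n) pick-next loop over one global
     * ready queue guarded by a single spin lock (names hypothetical). */
    extern void spin_lock(volatile int *lock);
    extern void spin_unlock(volatile int *lock);

    struct task {
        struct task *next;        /* global ready list (singly linked here) */
        int          counter;     /* remaining time slice */
        int          priority;
    };

    extern struct task *ready_list;   /* the global task queue */
    static volatile int runqueue_lock;

    static int goodness(const struct task *t)
    {
        return t->counter + t->priority;   /* simplified weight */
    }

    struct task *pick_next(void)
    {
        struct task *t, *best = NULL;
        int best_w = -1;

        spin_lock(&runqueue_lock);         /* every CPU contends here */
        for (t = ready_list; t; t = t->next) {   /* O(n) traversal */
            int w = goodness(t);
            if (w > best_w) { best_w = w; best = t; }
        }
        spin_unlock(&runqueue_lock);
        return best;
    }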
In an operating system that supports CMP, each CPU maintains its own ready-process queue, called a local task queue, which greatly reduces contention between CPUs. Ready processes are divided into two categories, active and expired, according to whether their time slice has been used up: active holds ready processes whose time slices remain and which can be scheduled, while expired holds ready processes whose time slices are exhausted. Within each category, processes are placed on different lists according to their priority.
During scheduling, the first entry of the highest-priority non-empty list in the active queue is taken as the candidate process, so that selecting the next process to run completes in a fixed amount of time. The kernel also maintains a bitmap with one bit per priority list; this flag array greatly reduces the time needed to find a non-empty list. When a process uses up its time slice, the kernel recalculates its priority and places it on the corresponding priority list of the expired queue. The priority calculation can also be spread across the process's own execution; this disperses what would otherwise be a burst of computation, bounds the scheduler's running time, and reduces unnecessary overhead, while keeping more information in memory speeds up locating candidate processes. When no process in the active queue is available for scheduling, the kernel simply swaps the active and expired queues, using the old expired queue as the new active queue, and begins a new round of scheduling.
The time complexity of this scheduling algorithm for selecting the next process to run is O(1); it is independent of the number of ready processes, which greatly improves scheduling efficiency. A sketch follows.
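Here is a minimal sketch of the per-CPU priority arrays and the bitmap lookup, following the design described above (the Linux 2.6 O(1) scheduler uses the same pattern). The number of priority levels and the 32-bit word size are illustrative assumptions.

    #include <strings.h>   /* ffs(): find first set bit, 1-based */

    #define NPRIO 140      /* number of priority levels (illustrative) */

    struct list;           /* opaque: a FIFO of ready processes */

    struct prio_array {
        unsigned int bitmap[(NPRIO + 31) / 32]; /* one bit per priority list,
                                                   assuming 32-bit words */
        struct list *queue[NPRIO];              /* one FIFO per priority */
    };

    struct runqueue {      /* one per CPU: no global lock required */
        struct prio_array *active, *expired;
    };

    /* Pick the first task of the highest-priority non-empty active list.
     * Cost is bounded by the bitmap size, not the number of ready tasks. */
    struct list *pick_next(struct runqueue *rq)
    {
        unsigned int w;
        for (w = 0; w < (NPRIO + 31) / 32; w++) {
            int bit = ffs((int)rq->active->bitmap[w]);
            if (bit)
                return rq->active->queue[w * 32 + bit - 1];
        }
        /* active is empty: swap active/expired and start a new round */
        struct prio_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
        return (struct list *)0;   /* caller retries after the swap */
    }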
2. System Load Balancing
The scheduling system of a CMP-capable operating system kernel must also address the "affinity" between processes and CPUs. If affinity is not taken into account, a process may be migrated frequently between CPUs, and interactive (or high-priority) processes may constantly "jump" from one CPU to another; each migration is followed by a burst of memory accesses as caches are refilled, reducing overall performance.
The scheduling system therefore tries to keep each process running on a fixed CPU, which improves the cache hit rate. However, if one CPU's ready queue grows too long, the constant process switching lowers its cache hit rate while the other CPUs sit underutilized. So whether the current CPU is busy or idle, a load-balancing pass is triggered periodically from the clock interrupt; and as soon as a CPU finds its ready queue empty, it also initiates load balancing on its own.
For effective load balancing, the operating system kernel introduces the concept of a scheduling domain (struct sched_domain), based on the characteristics of the system structure, to divide all the CPUs into regions layer by layer. The CPUs in each scheduling domain are partitioned into several CPU groups, and any CPU belongs to exactly one group within a domain. Each CPU belongs to a basic scheduling domain (which contains at least that CPU itself), but it also belongs to one or more larger scheduling domains. A CPU's scheduling domains form a singly linked list that must satisfy two requirements: first, a parent scheduling domain is a superset of its child; second, each CPU's top-level scheduling domain must include all the processors in the system. For example, in a CMP system that supports hyper-threading, the basic scheduling domain of each logical CPU contains all the logical CPUs on the same physical CPU, and each CPU group in that basic domain contains one logical CPU; the parent of the basic scheduling domain is the system's top-level scheduling domain, which contains all the logical CPUs in the system, and each CPU group in that domain contains all the logical CPUs on one physical CPU.
For CMP systems, the multiple cores on each chip naturally form a scheduling domain. The difference between the basic scheduling domain of a single-chip CMP system and that of a general SMP system is that a CPU group in the CMP basic scheduling domain contains one CPU core, whereas a CPU group in the SMP basic scheduling domain contains one traditional physical CPU. For more complex systems, such as hyper-threaded CMPs or multi-chip SMPs built from CMPs, a new chip-level scheduling domain must be added. A structural sketch follows.
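The following sketch shows the shape of the domain hierarchy with fields simplified; it mirrors the idea of struct sched_domain but is not the actual kernel definition.

    /* Simplified view of the scheduling-domain hierarchy (fields
     * illustrative). Each CPU has a chain from its basic domain up to a
     * system-wide domain; every parent is a superset of its child. */
    struct sched_group {
        unsigned long       cpu_mask;   /* CPUs in this group (bit per CPU) */
        struct sched_group *next;       /* circular list of groups */
    };

    struct sched_domain {
        struct sched_domain *parent;    /* NULL at the top-level domain */
        struct sched_group  *groups;    /* partition of this domain's CPUs */
        unsigned long        span;      /* all CPUs covered by this domain */
        unsigned int         interval;  /* balancing period, in ticks */
    };

    /* Example: on a dual-core CMP chip, core 0's basic domain spans
     * cores {0,1} with two one-core groups; its parent domain could
     * span the whole system, with one group per chip. */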
During load balancing, the scheduling domains are traversed starting from the current CPU's basic scheduling domain. If a domain has not been balanced for a certain period, the kernel first looks for the most heavily loaded CPU group in the domain (a group's load is the total load of all CPUs in the group), then finds the busiest CPU in that group. If that CPU's load exceeds the current CPU's load by at least 25%, the load records of both sides are updated, and the number of processes to migrate is set to half the difference between the source CPU's load and the current CPU's (updated) load. Migration then proceeds from the expired queue before the active queue, and from low-priority processes to high-priority ones; the number of processes actually migrated is usually smaller than planned. The arithmetic is sketched below.
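As a small sketch, the 25% threshold and the halved difference read as follows; load units and function names are illustrative.

    /* Number of tasks to plan for migration: half the load difference
     * between the busiest CPU and the current CPU, but only if the
     * busiest CPU is at least 25% busier (as described above). */
    static unsigned int tasks_to_pull(unsigned int busiest_load,
                                      unsigned int this_load)
    {
        if (busiest_load * 4 < this_load * 5)   /* below 1.25x: skip */
            return 0;
        return (busiest_load - this_load) / 2;
    }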
Interrupt handling
A traditional uniprocessor usually uses an external interrupt controller to handle communication between external devices and the CPU. In a multiprocessor system, the processors also need interrupts to communicate with one another. For CMP, the local interrupt controllers of the multiple processors must be packaged into the chip together with the processor cores; in addition, a global interrupt controller is required to distribute interrupts among the processor cores, and it too must be designed into the chip.
1. Interrupt allocation
The global interrupt controller is responsible for delivering and distributing interrupt requests from external devices to the CPUs in the chip. Each interrupt vector can be handled in either static or dynamic mode. If an interrupt vector is statically allocated, the controller delivers its requests to one or more preset CPUs. In dynamic mode, a request can be sent to all CPUs, or sent at random to one processor.
The local interrupt controller handles interrupt requests generated inside its own processor, requests from the external interrupt controller, and requests sent by other processors.
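A toy illustration of the two distribution modes is given below; the structure and function names are hypothetical, and a real global controller would implement this logic in hardware.

    /* Illustrative routing decision in a global interrupt controller
     * (all names hypothetical). */
    enum route_mode { ROUTE_STATIC, ROUTE_DYNAMIC };

    struct irq_route {
        enum route_mode mode;
        unsigned long   static_mask;   /* preset target CPUs (static mode) */
    };

    unsigned int pick_target_cpu(const struct irq_route *r,
                                 unsigned int ncpus, unsigned int seed)
    {
        if (r->mode == ROUTE_STATIC) {
            unsigned int cpu;
            for (cpu = 0; cpu < ncpus; cpu++)     /* first preset CPU */
                if (r->static_mask & (1UL << cpu))
                    return cpu;
        }
        return seed % ncpus;   /* dynamic: e.g. pseudo-random target */
    }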
2. Inter-processor interrupts
In a CMP system, a processor inside the chip often needs to send an interrupt request to other processors in the system; this is called an inter-processor interrupt (IPI). IPIs should include at least two types:
"Reschedule" is interrupted. The current CPU can send this interrupt to indicate that the target CPU may need a process scheduling. If the target CPU does not schedule the process after the interrupt is processed, it depends on whether the current process needs to be scheduled in advance or in the process of handling the interruption.
"Request execution" is interrupted. This interrupt is used to request the target CPU to execute a specified function. The reason why you need to use IPI to ask other CPUs for execution is that a function must be completed by the target CPU instead of another CPU. For example, if a processor changes the content of a page ing directory or page ing table in the memory, this may cause the TLB of other processors to be inconsistent with it, this interrupt is sent to the processor that is using this ing table in the system, asking them to execute the code themselves and discard their respective TLB contents.
3. Clock interrupts
Among all interrupt requests, the clock interrupt plays a specially important role. Each processor has its own clock generator. During system initialization, the system first sets up an external clock interrupt source shared by all CPUs, uses it to measure each CPU's speed, and calibrates each CPU's own clock-interrupt interval accordingly. Since the processors are identical in design, all CPUs have essentially the same clock pulse period; to prevent all processors from taking clock interrupts at the same moment, the operating system should stagger the phases of the CPUs' clock interrupts so that the interrupts are spread evenly across the clock-interrupt cycle, as in the sketch below.
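The staggering amounts to simple arithmetic. The sketch below assumes a free-running cycle counter and divides one tick period evenly among the CPUs; the names are illustrative.

    /* Spread n CPUs' timer interrupts evenly across one tick period so
     * they do not all fire in the same phase. */
    unsigned long stagger_offset(unsigned long cycles_per_tick,
                                 unsigned int cpu_id, unsigned int ncpus)
    {
        return (cycles_per_tick / ncpus) * cpu_id;
    }
    /* CPU i programs its first timer expiry at now + stagger_offset(...),
     * then every cycles_per_tick cycles thereafter. */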
Synchronization and mutual exclusion
In a multitasking system, different tasks may need to access shared variables and resources at the same time, creating competition. The system must therefore provide synchronization and mutual-exclusion mechanisms so that, at any given moment, such shared variables and resources can be accessed exclusively by only one task. A traditional uniprocessor system is parallel only at the macro level; at the micro level, because there is only one processor, only one task executes at a time, and the synchronization and mutual-exclusion problems are easy to solve. When a CMP operating system is running, however, multiple tasks really may execute simultaneously; traditional methods sometimes cannot handle this situation, and new mechanisms must be introduced.
Synchronization and mutual-exclusion mechanisms require the underlying hardware to provide atomic read-modify-write operations. Such an operation lets a CPU read a value from main memory, modify it, and write the modified value back to the same location, with the whole sequence forming one complete bus transaction that cannot be interleaved with memory accesses from other CPU cores. Atomic read-modify-write operations can be implemented in several ways, including test_and_set, swap, and load-linked/store-conditional.
The test_and_set instruction reads a value (usually a byte or a word) from main memory, compares it with 0, sets the condition code according to the result, and finally stores 1 unconditionally into the same memory location. Once a test_and_set instruction has started its bus cycle, no other CPU can access main memory.
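In C, the same semantics can be expressed with a GCC atomic builtin, which the compiler lowers to the target's atomic instruction:

    /* test_and_set semantics: atomically write 1 and return the previous
     * value, as one uninterruptible read-modify-write. */
    static inline int test_and_set(volatile int *word)
    {
        return __sync_lock_test_and_set(word, 1);   /* returns old value */
    }
    /* Old value 0 means the caller won (the word was free); old value 1
     * means some other CPU had already set it. */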
In Intel processors, the CPU chip has a LOCK pin. If an assembly-language program adds the prefix "lock" before an instruction, then after it is assembled into machine code, the CPU pulls the LOCK pin low while executing that instruction and locks the bus; other CPUs on the bus temporarily cannot perform bus operations, which guarantees the atomicity of the operation. The x86 series also provides the xchg instruction for atomic exchange; xchg is an atomic operation whether or not the lock prefix is added.
Some MIPS processors simplify atomic operations by providing a pair of instructions, load-linked/store-conditional, with which an atomic read-modify-write can be composed. The load-linked instruction performs the first half of the atomic read-modify-write: it reads a value from memory, sets a hardware flag indicating that an atomic operation is in progress, and records the address that was read. The store-conditional instruction then completes the modification, succeeding only if the flagged location has not been disturbed in the meantime; in that case the whole instruction sequence from load-linked to store-conditional executes as if atomic. The retry pattern is sketched below.
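In C, the ll/sc retry pattern corresponds to a compare-and-exchange loop; on MIPS, C11's atomic_compare_exchange_weak typically compiles to exactly such an ll/sc pair.

    #include <stdatomic.h>

    /* Atomic increment in the ll/sc style: read ("load-linked"), modify,
     * then conditionally store; retry if another CPU intervened. */
    void atomic_inc(atomic_int *p)
    {
        int old = atomic_load(p);                  /* "load-linked" half */
        while (!atomic_compare_exchange_weak(p, &old, old + 1))
            ;   /* "store-conditional" failed: old was reloaded, retry */
    }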
Based on the principles above, the operating system can build the spin lock. A spin lock is used to acquire access to a variable (called the lock): if the acquisition succeeds, the caller proceeds to the next step, for example accessing the shared variable or shared resource; if it fails, the caller keeps polling the lock until the acquisition succeeds. Spin locks are a fundamental operation for guaranteeing synchronization and mutual exclusion; a minimal version is shown below.
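A minimal spin lock built on the test_and_set operation sketched earlier:

    /* Spin until the old value is 0, i.e. until we acquire the lock. */
    typedef volatile int spinlock_t;

    static inline void spin_lock(spinlock_t *l)
    {
        while (__sync_lock_test_and_set(l, 1))
            ;                      /* lock was held: keep polling */
    }

    static inline void spin_unlock(spinlock_t *l)
    {
        __sync_lock_release(l);    /* atomically write 0, release order */
    }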
With the "Read-Modify-write" atom provided by the hardware, the operating system can complete various synchronization and mutex operations to correctly solve the problem of resource sharing.