Multi-core operation mode:
1. QNX: microkernel structure of the real-time operating system (PDF)
2. Symmetric multiprocessing (SMP)
SMP (symmetric multiprocessing) refers to a group of processors (multiple CPUs) combined in a single computer, with the memory subsystem and bus structure shared among the CPUs. It is a widely used parallel technology, defined in contrast to asymmetric multiprocessing. In this architecture a computer is no longer built around a single CPU: multiple processors run a single copy of the operating system and share the computer's memory and other resources. Although several CPUs are in use at the same time, from a management standpoint they behave like a single machine. The system distributes the task queue symmetrically across the CPUs, which greatly improves the data processing capability of the whole system. All processors have equal access to memory, I/O, and external interrupts. System resources are shared by all CPUs in the system, and the workload is distributed evenly across all available processors.
SMP Technology http://www.elecfans.com/baike/computer/fuwuqi/20091217137187.html
Conditions for building an SMP system
To build an SMP system, the first requirement is suitable CPUs. CPUs are usually used one at a time, so the differences between them are not obvious, but SMP support is not unconditional: taking a few arbitrary CPUs and expecting them to form a multiprocessing system is unrealistic. To implement SMP, the CPUs must meet the following requirements:
1. The CPU must have a built-in APIC (Advanced Programmable Interrupt Controller) unit. The core of the Intel MultiProcessor Specification is the use of Advanced Programmable Interrupt Controllers (APICs). CPUs communicate by sending interrupts to each other. By attaching actions to these interrupts, different CPUs can control each other to some degree. Each CPU has its own APIC (the local APIC of that CPU), and an I/O APIC handles interrupts raised by I/O devices. The I/O APIC sits on the motherboard, but the local APIC in each CPU is indispensable; without it, interrupts cannot be coordinated among multiple CPUs.
2. The same product model and the same type of CPU core. For example, although both Xeon and Opteron have built-in APIC units, they cannot work together in one SMP system. Even within the Xeon family or the Opteron family, CPUs with different cores cannot form an SMP system, because their instruction sets are not identical and their APIC interrupt coordination also differs considerably.
3. Exactly the same operating frequency. A dual-Xeon or dual-Opteron system must use two 2.8 GHz or two 3.0 GHz processors; mixing one 2.8 GHz part with one 3.0 GHz part will prevent the system from starting normally.
4. The same production batch (serial number range) where possible. Even processors with the same core and the same frequency can cause puzzling problems if they come from different production batches. When CPUs from two batches run as a dual-processor system, one CPU may be heavily loaded while the other is lightly loaded, so peak performance cannot be reached; in the worst case the system may crash. Processors from the same batch should therefore be chosen for an SMP system.
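The four conditions above can be read as a simple pairwise check. The following is a minimal sketch in C, not a real platform API: the `cpu_info` structure, its fields, and the `smp_check_pair` function are hypothetical names introduced only for illustration, and a batch mismatch is treated as a warning rather than a hard failure because the text says "where possible".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical description of one processor; all field names are illustrative. */
struct cpu_info {
    bool        has_local_apic; /* condition 1: built-in local APIC       */
    const char *model;          /* condition 2: product model / core type */
    unsigned    freq_mhz;       /* condition 3: operating frequency       */
    const char *batch;          /* condition 4: production batch          */
};

enum smp_verdict { SMP_OK, SMP_OK_BATCH_DIFFERS, SMP_INCOMPATIBLE };

/* Check whether two CPUs satisfy the conditions listed above. */
enum smp_verdict smp_check_pair(const struct cpu_info *a, const struct cpu_info *b)
{
    assert(a != NULL && b != NULL);

    if (!a->has_local_apic || !b->has_local_apic)
        return SMP_INCOMPATIBLE;            /* no way to coordinate interrupts */
    if (strcmp(a->model, b->model) != 0)
        return SMP_INCOMPATIBLE;            /* different model or core type    */
    if (a->freq_mhz != b->freq_mhz)
        return SMP_INCOMPATIBLE;            /* frequencies must match exactly  */
    if (strcmp(a->batch, b->batch) != 0)
        return SMP_OK_BATCH_DIFFERS;        /* allowed, but not recommended    */
    return SMP_OK;
}
```

With this sketch, two 2.8 GHz parts of the same model and batch yield `SMP_OK`, while pairing a 2.8 GHz part with a 3.0 GHz part yields `SMP_INCOMPATIBLE`.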
3. Asymmetric dual-core MCUs: basics and inter-core communication http://www.elecfans.com/emb/app/20120326265555.html
This article begins by comparing two discrete MCUs with a single-chip dual-core MCU (taking the LPC4350 as an example) and introduces the basics and important features of asymmetric dual-core MCUs. It then focuses on the concept of inter-core communication and several ways to implement it, especially control/state communication based on a message pool. After that, inter-core mutual exclusion, the initialization process, and some other important details are discussed. Finally, two application models for dividing tasks between the cores are proposed, each with an example.
Background and basic concepts
When a single MCU cannot meet the system requirements, a common practice in MCU application development is to use two or more MCUs and hand part of the miscellaneous work to a low-end "assistant" MCU. The shortcomings of using two MCUs are obvious, however: chip and PCB cost, system reliability, and power consumption all suffer inherently. If the MCUs have different architectures, they also require different development tools and developers. Consider a different approach: a single MCU contains two cores, one as the master and one as the auxiliary core, the master core is backward compatible with the auxiliary core, and the two communicate efficiently. In many cases this keeps the strengths of a multi-MCU system while avoiding its shortcomings.
In fact, this is the defining feature of the asymmetric multiprocessor (AMP) architecture. AMP is defined in contrast to symmetric multiprocessing (SMP), which has a uniform programming model and mainly aims at load balancing when assigning work. The advantage of AMP is a fine-grained division of tasks: it adapts flexibly to different scenarios and uses each core for what it does best, so that cost, performance, and power consumption can be balanced optimally. In addition, AMP is easier to program. As a result, AMP is more suitable for MCU applications than SMP.
The AMP architecture has many advantages over two standalone MCUs. Crucially, adding a core is much cheaper than adding a separate MCU, especially when the two core architectures are similar; the cost may be comparable to adding one or two UARTs to an existing die. Furthermore, the two cores can run at the same frequency and can access on-chip resources equally through the bus matrix. In a discrete dual-MCU design, the auxiliary MCU often runs at a much lower frequency than the master, and the two communicate over a low-speed serial link.
Next, we give a brief introduction to AMP MCUs using NXP Semiconductors' new LPC4300 series, in particular the LPC4350.
Features of asymmetric dual-core MCUs
AMP MCUs are typically used in relatively large systems with high requirements for functionality and performance. In terms of functionality, they need to support more peripherals. The LPC4350 provides 2 high-speed USB interfaces, 2 CAN interfaces, industrial Ethernet, a graphic LCD controller, and an SDHC interface, plus several unique configurable-logic peripherals and many traditional peripherals, making it suitable for product development in industrial control, energy, medical, audio, automotive, motor control, monitoring, and many other industries.
Performance improvement is the soul of the AMP MCU. The cores, memory, and bus architecture have a critical impact on performance. Figure 1 shows how the LPC4350 implements them.
Figure 1: LPC4350 core, memory, and bus connection diagram
The first is the choice of cores. The LPC4350 is based on the 32-bit ARM Cortex-M4 and Cortex-M0 cores (hereinafter M4 and M0), and both cores can execute code at up to 204 MHz. M4 adds signal processing and floating-point capability, which lets it handle many applications that previously required a DSP, and it inherits the control capability of Cortex-M3. M0, with its strong advantages in cost, energy efficiency, and processing power, is rapidly attracting developers away from 8-bit and 16-bit architectures. More importantly, M4 is fully backward compatible with M0, so both can be developed and debugged with the same set of development tools.
The second is the capacity and organization of memory. The LPC4350 has up to 264 KB of on-chip RAM, divided into 4 blocks, each connected to a separate bus. Without this partitioning, the two cores would compete for the same block of RAM, and performance could be worse than with a single core. The LPC4350 also has two bus connections to external parallel and serial memory, giving 6 separate memory address spaces (the LPC4350 itself has no on-chip flash). On models with on-chip flash, the flash is also divided into two blocks.
Finally, the bus architecture. The LPC4350 contains an eight-layer bus matrix. It works like a set of crossbar switches that connect the CPUs to a large number of slave devices, including the memories. By allocating bus connections sensibly, so that multiple bus masters (such as the CPUs and DMA) do not access the same memory or peripheral at the same time, data flows can run in parallel as much as possible and the performance advantage is maximized.
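The effect of partitioned memory and the bus matrix can be illustrated with a toy model: in one bus cycle, masters that target different slaves proceed in parallel, while masters that target the same slave must wait. The sketch below is a simplification for illustration only. The function name, the "lower master index wins" arbitration rule, and the slave count of 6 (matching the six memory address spaces mentioned above) are assumptions, not a description of the real LPC4350 arbiter.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLAVES 6    /* e.g. 4 on-chip RAM blocks + 2 external memory buses */
#define NO_ACCESS  (-1)

/*
 * requests[m] is the slave index that bus master m wants in this cycle,
 * or NO_ACCESS. On a conflict the lower master index wins (an assumed,
 * simplified arbitration rule). Returns how many masters have to wait.
 */
int count_stalled_masters(const int *requests, size_t num_masters)
{
    int busy[NUM_SLAVES] = {0};
    int stalled = 0;

    for (size_t m = 0; m < num_masters; m++) {
        int s = requests[m];
        if (s == NO_ACCESS)
            continue;
        assert(s >= 0 && s < NUM_SLAVES);
        if (busy[s])
            stalled++;      /* slave already granted to another master     */
        else
            busy[s] = 1;    /* grant: this path through the matrix is free */
    }
    return stalled;
}
```

If M4, M0, and DMA each work in a different RAM block, the function returns 0; if all three target the same block, two of them stall, which is the situation the partitioned memory layout is meant to avoid.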
Inter-core communication
Inter-core communication falls into two categories: control and state information, and data. The former generally carries no data but often has strict real-time requirements; the latter consists mainly of various data buffers, usually with looser real-time requirements but large data volumes. Control/state communication is more common and resembles synchronization between tasks. It is well suited to being implemented by system software behind a programming interface. Data communication is usually tied to a specific application (especially in its data structures) and needs to be tailored, so it is appropriate for the application software to define the data structures.
The cores communicate through shared RAM, and each core can trigger an interrupt on the other. Communication follows a "prepare the data, then trigger the interrupt" pattern, as shown in Figure 2. Alternatively, a core can poll the state of the shared RAM periodically.
Figure 2: Shared-memory communication between cores
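The "prepare the data, then trigger the interrupt" pattern can be sketched as a one-slot mailbox. This is a host-side illustration under stated assumptions: `shared_mailbox` stands in for a structure placed in shared RAM, `irq_pending` stands in for the cross-core interrupt line, and all names are hypothetical. On real hardware the flag would be replaced by the chip's actual interrupt trigger, and a memory barrier would be needed before raising it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAILBOX_PAYLOAD_MAX 16u

/* Layout agreed between both cores; the fields are an assumption. */
struct mailbox {
    volatile uint32_t length;
    volatile uint8_t  payload[MAILBOX_PAYLOAD_MAX];
};

static struct mailbox shared_mailbox;  /* stands in for shared RAM              */
static volatile int   irq_pending;     /* stands in for the cross-core interrupt */

/* Sender side: write the data first, raise the interrupt last. */
int mailbox_send(const uint8_t *data, uint32_t len)
{
    assert(data != NULL);
    if (len > MAILBOX_PAYLOAD_MAX || irq_pending)
        return -1;                      /* too large, or previous message unread */
    for (uint32_t i = 0; i < len; i++)
        shared_mailbox.payload[i] = data[i];
    shared_mailbox.length = len;
    /* A data memory barrier would go here on real hardware. */
    irq_pending = 1;                    /* hypothetical "interrupt the other core" */
    return 0;
}

/* Receiver side: would be called from the interrupt service routine. */
uint32_t mailbox_receive_isr(uint8_t *out, uint32_t out_size)
{
    assert(out != NULL);
    if (!irq_pending)
        return 0;
    uint32_t len = shared_mailbox.length;
    if (len > out_size)
        len = out_size;
    for (uint32_t i = 0; i < len; i++)
        out[i] = shared_mailbox.payload[i];
    irq_pending = 0;                    /* acknowledge: mailbox is free again */
    return len;
}
```

The polling variant mentioned above would call the same receive function periodically from a loop instead of from an interrupt handler.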
Next, we describe control/state communication schemes based on message queues and message pools.
Message queues: Set up two message queues, one for M4 to send messages to M0 and the other for M0 to send messages to M4. The addresses of the two queues must be agreed in advance. The queues are circular and can be implemented either as a simple array with read and write indices or as a linked list. The array form is simple and cheap, but messages must have a fixed length and cannot easily carry extra information, and the array must occupy a contiguous region of shared memory, so flexibility is low. The linked-list form uses pointers to the individual messages. Besides the common list-control fields, each message can carry additional parameters that depend on the message type, and message memory can be allocated flexibly by the system software's memory manager. The disadvantages are greater complexity and extra overhead. If dynamic memory management is involved, real-time performance will be far worse than with the array-based scheme.
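The array-based variant can be sketched as a single-producer, single-consumer circular queue. The code below is a minimal illustration, assuming fixed-length messages and one queue instance per direction (for example, one named `m4_to_m0` and one named `m0_to_m4`, both hypothetical names). The message layout and function names are assumptions, and memory barriers needed on real hardware are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SLOTS 8u  /* one slot stays empty to tell "full" from "empty" */

struct msg {            /* fixed-length message; fields are illustrative */
    uint32_t id;
    uint32_t arg;
};

struct msg_queue {
    volatile uint32_t write_idx;    /* written only by the sender   */
    volatile uint32_t read_idx;     /* written only by the receiver */
    struct msg        slots[QUEUE_SLOTS];
};

/* Sender: returns false if the queue is full. */
bool queue_put(struct msg_queue *q, struct msg m)
{
    assert(q != NULL);
    uint32_t w    = q->write_idx;
    uint32_t next = (w + 1u) % QUEUE_SLOTS;
    if (next == q->read_idx)
        return false;               /* full */
    q->slots[w]  = m;               /* write the data first ...       */
    q->write_idx = next;            /* ... then publish the new index */
    return true;
}

/* Receiver: returns false if the queue is empty. */
bool queue_get(struct msg_queue *q, struct msg *out)
{
    assert(q != NULL && out != NULL);
    uint32_t r = q->read_idx;
    if (r == q->write_idx)
        return false;               /* empty */
    *out        = q->slots[r];
    q->read_idx = (r + 1u) % QUEUE_SLOTS;
    return true;
}
```

Because each index has exactly one writer, the two cores never write the same location, which is what makes this layout usable without a lock. Messages come out strictly in the order they went in.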
One drawback of message queues is that messages are processed serially, with no concept of priority. But since a real-time operating system (RTOS) and nested interrupts are available, we should be able to process messages concurrently.
Message pool: In terms of storage, a message pool is a simplified array-based message queue with the read and write indices removed. Each element of the pool is a message, and one byte per element records its state: idle (handled), new, or half-processed. To write a message, the sender scans the array for an idle element; to read messages, the receiver scans the array for elements in the state it is looking for. The message pool therefore handles messages by priority: elements with smaller indices are processed first.
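A message pool along these lines can be sketched as follows. The structure layout, state names, and function names are assumptions made for illustration. In this sketch the sender simply takes the first idle slot; a design that wants explicit priorities could, for example, reserve index ranges per priority class, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 8

enum slot_state { SLOT_IDLE = 0, SLOT_NEW = 1, SLOT_HALF_DONE = 2 };

struct pool_msg {
    volatile uint8_t state;     /* one state byte per element    */
    uint8_t          type;      /* illustrative message contents */
    uint16_t         arg;
};

struct msg_pool {
    struct pool_msg slot[POOL_SIZE];
};

/* Sender: scan for an idle slot, fill it, then mark it new.
 * Returns the slot index, or -1 if the pool is full. */
int pool_post(struct msg_pool *p, uint8_t type, uint16_t arg)
{
    assert(p != NULL);
    for (int i = 0; i < POOL_SIZE; i++) {
        if (p->slot[i].state == SLOT_IDLE) {
            p->slot[i].type  = type;
            p->slot[i].arg   = arg;
            p->slot[i].state = SLOT_NEW;    /* state is written last */
            return i;
        }
    }
    return -1;
}

/* Receiver: find the lowest-index slot in the wanted state, or -1. */
int pool_find(const struct msg_pool *p, enum slot_state wanted)
{
    assert(p != NULL);
    for (int i = 0; i < POOL_SIZE; i++) {
        if (p->slot[i].state == (uint8_t)wanted)
            return i;
    }
    return -1;
}

/* Receiver: hand a fully processed slot back to the sender. */
void pool_release(struct msg_pool *p, int i)
{
    assert(p != NULL && i >= 0 && i < POOL_SIZE);
    p->slot[i].state = SLOT_IDLE;
}
```

Because both scans start at index 0, a message in a low-index slot is always found before one in a higher slot, which is the index-based priority described above.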
Scanning the message pool makes concurrent message handling possible, and a message can be processed in two stages, first in interrupt context and then in task context. The interrupt service routine that handles the message pool scans it and performs the first stage for each new message, executing the part of the message, if any, with the strictest real-time requirements. If the system does not use an RTOS, the background main loop can scan the message pool again and perform the second stage. In a system with an RTOS, tasks of different priorities can be created or activated according to the priority of each message, and the message is "handed over" to receive its second stage of processing in the context of those tasks.
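The two-stage processing for a system without an RTOS can be sketched as two scan functions, one called from the interrupt service routine and one from the main loop. Everything here is a hypothetical illustration: the `needs_background` flag, the counters that stand in for real work, and the function names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define POOL_SIZE 4

enum { SLOT_IDLE = 0, SLOT_NEW = 1, SLOT_HALF_DONE = 2 };

struct pool_msg {
    volatile uint8_t state;
    uint8_t          needs_background;  /* 1 if a second stage is required */
    uint16_t         arg;
};

static struct pool_msg pool[POOL_SIZE];     /* stands in for shared RAM          */
static unsigned urgent_done;                /* counts first-stage work performed  */
static unsigned background_done;            /* counts second-stage work performed */

/* Stage 1, interrupt context: do only the time-critical part of each new message. */
void pool_isr_pass(void)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (pool[i].state != SLOT_NEW)
            continue;
        urgent_done++;                      /* placeholder for urgent handling */
        pool[i].state = pool[i].needs_background ? SLOT_HALF_DONE : SLOT_IDLE;
    }
}

/* Stage 2, background main loop: finish half-processed messages. */
void pool_background_pass(void)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (pool[i].state != SLOT_HALF_DONE)
            continue;
        background_done++;                  /* placeholder for deferred handling */
        pool[i].state = SLOT_IDLE;          /* slot returns to the sender */
    }
}
```

In an RTOS-based system, the second function would typically be replaced by tasks that are woken by the interrupt routine and pick up the half-processed messages according to their priority.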
A major drawback of the message pool is that it is not suited to a large number of pending messages. If needed, a list-control field can be added to each message so that messages of the same priority are chained together, which removes this limitation.
Some important details
Inter-core mutual exclusion: Pseudo-parallel multitasking already requires mutually exclusive access to shared resources, and truly parallel cores need it even more. In particular, one core cannot disable the other core's interrupts, so a critical section protected by disabling interrupts does not work. The only guarantee is that two cores never access the same address at exactly the same time. In addition, because of architectural limitations, spin locks cannot be used for mutual exclusion. Instead, mutual exclusion can be achieved by following some programming rules. The simplest and most effective rule is to give each core read-only or write-only permission, or conditional read/write permission, for a given address. For example, for the read index of a message queue, only the receiver may write it, and the sender may only read it to determine whether the queue is empty or full. For a message pool, both sender and receiver may read and write the state of the elements, but the sender may only change an idle state to a non-idle one, and the receiver may only change non-idle states to idle. For a linked list, only the sender may update the pointers; the receiver signals when the sender should update them by changing the state of the list elements and triggering an interrupt.
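The conditional write permission for message pool states can be made explicit as a small rule check. This is a sketch of one possible interpretation, not the original implementation: the role and state names are hypothetical, and the receiver is assumed to be allowed to move a slot between non-idle states (for example, from new to half-processed) as well as back to idle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum core_role  { ROLE_SENDER, ROLE_RECEIVER };
enum slot_state { SLOT_IDLE = 0, SLOT_NEW = 1, SLOT_HALF_DONE = 2 };

/* Sender: idle -> non-idle only. Receiver: may only change a non-idle slot. */
bool transition_allowed(enum core_role who, enum slot_state from, enum slot_state to)
{
    if (from == to)
        return false;
    if (who == ROLE_SENDER)
        return from == SLOT_IDLE;   /* to != from, so the target is non-idle */
    return from != SLOT_IDLE;       /* receiver never claims an idle slot    */
}

/* Apply a state change only if the rule permits it for this core. */
bool slot_set_state(volatile uint8_t *state, enum core_role who, enum slot_state to)
{
    assert(state != NULL);
    enum slot_state from = (enum slot_state)*state;
    if (!transition_allowed(who, from, to))
        return false;
    *state = (uint8_t)to;
    return true;
}
```

The check-then-write sequence needs no lock under these rules: while a slot is idle only the sender may change it, and while it is non-idle only the receiver may change it, so at any moment exactly one core is entitled to write the state byte.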
Core identification: M4 is backward compatible with M0, which lets us reuse a lot of source code. Sometimes, however, the code needs to know which core it is running on. There are two ways to do this. To distinguish at compile time, predefine macros such as "CORE_M4" and "CORE_M0" in the compiler settings and use C/C++ conditional compilation. To distinguish at run time, read the register named "CPUID" and decide from its value whether the core is M4 or M0.
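Both approaches can be sketched in a few lines of C. On Cortex-M cores the CPUID register is located at address 0xE000ED00 in the System Control Block, and its PARTNO field (bits 15:4) is 0xC24 for Cortex-M4 and 0xC20 for Cortex-M0. The function names and the `core_id` enum below are hypothetical, and the decoder takes the register value as a parameter so that the sketch also runs on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Compile-time identification: CORE_M4 / CORE_M0 are expected to come
 * from the compiler settings of the respective project. */
#if defined(CORE_M4)
#define CORE_NAME "M4"
#elif defined(CORE_M0)
#define CORE_NAME "M0"
#else
#define CORE_NAME "host"    /* neither macro defined, e.g. a host build */
#endif

const char *core_name_compile_time(void)
{
    return CORE_NAME;
}

/* Run-time identification: decode the PARTNO field of a CPUID value. */
#define CPUID_PARTNO(cpuid)  (((cpuid) >> 4) & 0xFFFu)
#define PARTNO_CORTEX_M0     0xC20u
#define PARTNO_CORTEX_M4     0xC24u

enum core_id { CORE_ID_UNKNOWN, CORE_ID_M0, CORE_ID_M4 };

enum core_id core_from_cpuid(uint32_t cpuid)
{
    switch (CPUID_PARTNO(cpuid)) {
    case PARTNO_CORTEX_M0: return CORE_ID_M0;
    case PARTNO_CORTEX_M4: return CORE_ID_M4;
    default:               return CORE_ID_UNKNOWN;
    }
}

/* On the target, the value would be read from the register itself:
 *   uint32_t cpuid = *(volatile uint32_t *)0xE000ED00u;
 *   enum core_id id = core_from_cpuid(cpuid);
 */
```

Compile-time identification costs nothing at run time but requires a separate build per core; the run-time check allows a single shared function to behave differently on each core.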
Initialization and executable images: After a power-on reset of the LPC4350, M4 starts executing code while M0 is held in reset. We can therefore ignore M0 entirely and use the chip as a single-core MCU. To use M0, M4 must first prepare the whole environment M0 needs in order to run, including the register context and address space, and then release M0 from reset. While M0 is in reset, it can be detected over JTAG but cannot be controlled. So to debug an M0 program, an appropriate image must first be downloaded to M4 so that M4 can release M0; it is not possible to take a blank chip and start working on M0 directly.
Although M4 and M0 each have their own image, the M0 image can be included in the M4 image so that only one flash programming step is needed in production. To do this, toolchains typically provide a way to convert an image into a C array definition. We convert the M0 image into a C array and compile and link it with the M4 source files, which embeds the M0 image in the M4 image. During initialization, M4 copies the M0 image to the location from which M0 will execute. Because M0 always fetches its vector table from address zero, M4 must also set up M0's address mapping so that the start of the image appears at M0's address 0.
It is worth mentioning that this design philosophy, in which the master core drives the auxiliary core, is widely used in AMP systems.
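The boot sequence described above (embed the image as a C array, copy it, set the address mapping, release M0) can be sketched as follows. This is a host-side model, not LPC4350 driver code: `m0_image` contains placeholder bytes rather than a real binary, and `m0_code_ram`, `m0_memmap_reg`, and `m0_in_reset` are hypothetical stand-ins for the RAM region, the memory-mapping register, and the reset control that the real chip provides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder bytes; a real array would be generated from the M0 binary
 * by the toolchain's image-to-C-array conversion. */
static const uint8_t m0_image[] = { 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88 };

/* Host-side stand-ins for hardware resources (all hypothetical). */
static uint8_t   m0_code_ram[64];   /* RAM region M0 will execute from       */
static uintptr_t m0_memmap_reg;     /* address that M0 will see as address 0 */
static int       m0_in_reset = 1;   /* M0 is held in reset after power-on    */

/* Runs on M4 during initialization. Returns 0 on success, -1 on error. */
int m0_boot(const uint8_t *image, size_t size)
{
    assert(image != NULL);
    if (size > sizeof m0_code_ram)
        return -1;                              /* image does not fit      */

    m0_in_reset = 1;                            /* keep M0 halted          */
    memcpy(m0_code_ram, image, size);           /* 1. copy the image       */
    m0_memmap_reg = (uintptr_t)m0_code_ram;     /* 2. map its start to 0   */
    m0_in_reset = 0;                            /* 3. release M0 from reset */
    return 0;
}
```

A call such as `m0_boot(m0_image, sizeof m0_image)` would be placed in the M4 startup code after clocks and memory have been configured.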
Debugging details: When a debug probe connects to the MCU, it usually generates a reset signal, whose scope can be limited to one core or extended to the whole chip. When debugging M0, set the reset scope to include only M0 so that the running M4 is not affected. You also need to write a suitable debug initialization script to prepare the core's execution environment. This work is tedious but highly reusable, and existing scripts can serve as a starting point.
We can debug M4 and M0 at the same time by running two separate IDE processes and opening the corresponding project in each. In practice this works at least with MDK and ULINK.
Dividing tasks between the cores
M0 does not have M4's processing power, but as a CPU it has a complete interrupt system and basic arithmetic and data-transfer capability, and on the LPC4350 it can run at up to 204 MHz. Sensibly offloading some tasks to M0 brings out the advantages of the dual-core design. Next, we discuss two main models of task division.
Handling high-frequency interrupts, an intelligent "DMA": Responding to an interrupt has extra costs: the hardware overhead of the CPU's interrupt model itself, the software overhead of the operating system's interrupt management, and of course the cost of the interrupt service routine. When the interrupt frequency is high (for example, tens or even hundreds of kHz), this overhead takes a non-negligible share of CPU time. More importantly, interrupt response is handled by hardware and sits above task management, so it can affect the execution of any task regardless of priority. DMA improves this situation significantly. But when there are not enough DMA channels or bus bandwidth, or when a device is not supported by DMA, we can let M0 respond to these high-frequency interrupts and organize the data buffers sensibly, like an intelligent DMA.
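One way to organize the buffers for such an intelligent DMA is a ping-pong (double) buffer: M0 takes every high-frequency interrupt and fills one buffer, and M4 is notified only once per full buffer. The sketch below illustrates the idea under stated assumptions; the structure, the buffer size of 4 samples, and the function names are hypothetical, and the actual cross-core notification is reduced to a return value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_SAMPLES 4u  /* small for illustration; real buffers would be larger */

struct pingpong {
    uint16_t     buf[2][BUF_SAMPLES];
    uint32_t     fill_idx;  /* next write position in the buffer being filled */
    unsigned     filling;   /* 0 or 1: buffer M0 currently writes to          */
    volatile int ready;     /* -1: nothing for M4; else index of full buffer  */
    uint32_t     overruns;  /* full buffers that M4 did not take in time      */
};

void pingpong_init(struct pingpong *p)
{
    assert(p != NULL);
    p->fill_idx = 0;
    p->filling  = 0;
    p->ready    = -1;
    p->overruns = 0;
}

/* Runs on M0 for every high-frequency interrupt.
 * Returns 1 when a full buffer has been handed over (M4 would be
 * interrupted once at this point), otherwise 0. */
int m0_sample_isr(struct pingpong *p, uint16_t sample)
{
    p->buf[p->filling][p->fill_idx++] = sample;
    if (p->fill_idx < BUF_SAMPLES)
        return 0;

    p->fill_idx = 0;
    if (p->ready >= 0)
        p->overruns++;          /* previous buffer was never consumed */
    p->ready   = (int)p->filling;
    p->filling ^= 1u;           /* continue in the other buffer */
    return 1;
}

/* Runs on M4: returns the full buffer, or NULL if none is ready. */
const uint16_t *m4_take_buffer(struct pingpong *p)
{
    if (p->ready < 0)
        return NULL;
    const uint16_t *b = p->buf[p->ready];
    p->ready = -1;
    return b;
}
```

With `BUF_SAMPLES` samples per buffer, M4 sees one interrupt for every `BUF_SAMPLES` interrupts taken by M0, so the per-interrupt overhead on M4 drops by that factor.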
For example, a dimming device may need tens or even hundreds of A/D sampling channels to obtain the desired brightness of each lamp, and just as many LEDs to indicate the actual output brightness. The latter requires a very large number of PWM outputs, likely more than the number of hardware PWM channels. Implementing the A/D sampling and software PWM therefore requires fast multi-channel data-stream processing and high-frequency LED refresh to ensure PWM accuracy. Both easily generate interrupt requests at tens of kHz, which, together with the interrupt response overhead, can consume more than half of the CPU time. The traditional approach is to distribute the work over several MCUs that are polled by the master. On the LPC4350, these tasks can be handled by M0. The same applies to PLC applications, which need fast refresh of many control channels.
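The software PWM mentioned in this example can be sketched as a tick function that M0 would call from a periodic timer interrupt. The channel count, the period of 100 ticks, and all names are assumptions chosen for illustration; the function returns a bit mask instead of writing to GPIO so that it can run on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PWM_CHANNELS 4
#define PWM_PERIOD   100u   /* ticks per PWM period, i.e. 1% duty resolution */

struct soft_pwm {
    uint8_t duty[PWM_CHANNELS];     /* on-time per channel, 0..PWM_PERIOD */
    uint8_t counter;                /* position within the current period */
};

/* Called once per timer tick. Returns a bit mask of the channels whose
 * output should be high during this tick (bit n = channel n). */
uint32_t soft_pwm_tick(struct soft_pwm *pwm)
{
    assert(pwm != NULL);
    uint32_t mask = 0;

    for (int ch = 0; ch < PWM_CHANNELS; ch++) {
        if (pwm->counter < pwm->duty[ch])
            mask |= 1u << ch;
    }
    pwm->counter = (uint8_t)((pwm->counter + 1u) % PWM_PERIOD);
    return mask;
}
```

The tick rate determines the PWM frequency: with a period of 100 ticks, a 50 kHz timer interrupt gives a 500 Hz PWM signal. This shows why the interrupt rate climbs into the tens of kHz as soon as reasonable resolution and a flicker-free refresh rate are both required.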
Providing additional processing power for lightweight computation: The overall performance of M0 is about 72% of M4's, but M0 is at little disadvantage for lightweight operations such as addition, multiplication, logical operations, shifts, and simple data transfers. Such operations often account for more than half of a program, especially in drivers and some communication protocol stacks. Assigning part of this lightweight work to M0 can effectively raise overall processing capacity. The same job can then be done at a lower clock frequency and lower power consumption, or, conversely, more demanding jobs can be done at a given frequency.
For example, in high-precision industrial motion control, motor control often involves a large amount of algorithmic computation, while the system must also handle communication over CAN, industrial Ethernet, and various fieldbuses. We can let M4 run the motor control algorithms and let M0 handle the communication protocol stacks and drivers. The same applies to embedded audio: M4 runs the audio codec and sound processing algorithms, while M0 takes care of the audio bus, USB, and other transactions.
Summary
As the introduction above shows, an asymmetric dual-core MCU has clear advantages over the traditional multi-MCU approach in performance, cost, power consumption, production, and many other respects. Inter-core communication is somewhat more complex, but it can be implemented as infrastructure by the underlying system software. In actual development, tasks should be divided sensibly according to the problem at hand, and some operational details of initialization, core identification, and debugging need attention.