1: Fundamentals of heterogeneous multi-core communication
This paper is based mainly on the TMS320DM814x series DaVinci heterogeneous multi-core processor. The DM814x integrates an ARM Cortex-A8 core, a C674x DSP core, a high-definition video processing subsystem (HDVPSS), and a high-definition video/image coprocessor (HDVICP2). The HDVICP2, built around an ARM968 core, is managed by the Video M3 core and performs H.264, MPEG-4, and MJPEG encoding and decoding; the HDVPSS is managed by the VPSS M3 core and provides two high-definition video capture channels as well as display channels.
Heterogeneous multi-core processors mostly employ a master-slave architecture, in which the roles of master and slave are assigned according to the functions of the different cores. The master core generally has a more complex structure and richer functionality: it is responsible for managing and scheduling global resources and tasks and for boot-loading the slave cores. The slave cores accept the master core's management, execute the tasks the master core assigns, and have local task scheduling and management functions of their own. Depending on the structure of each core, the cores of a multi-core processor can run the same or different operating systems. In the DM814x, the ARM is the master core and the DSP and coprocessors are slave cores; the ARM core runs the open-source Linux system (it can also run TI's real-time operating system SYS/BIOS), while the DSP core and the M3 cores run the real-time operating system SYS/BIOS.
The interconnect structure between the cores of a master-slave heterogeneous multi-core processor is shown in Figure 1.
Figure 1: Master-slave heterogeneous multi-core interconnection architecture
As Figure 1 shows, to enable communication between heterogeneous cores, the chip integrates an inter-core interrupt controller and an inter-core interconnection bus. Inter-core interrupts are the bridge for multi-core task synchronization and communication: each flag bit in the inter-core interrupt registers is assigned to a different core, so one core can send an interrupt request to another, which then executes the corresponding interrupt service routine; alternatively, an address can be delivered through the interrupt registers and combined with shared memory to implement data transfer and sharing. Each core accesses peripherals via the configuration bus.
In summary, the following functions are required for effective management of and communication among the processor cores:
1) Management of the slave processors by the master processor;
2) Transfer and exchange of information between the processor cores.
The former is implemented through the on-chip interconnect; the latter through inter-core interrupts and memory sharing. The implementation of these features in the DM814x is explained in detail below.
1.1 Inter-core interrupts
To achieve efficient on-chip inter-core communication, the DM814x series DaVinci processor integrates hardware mailbox interrupts and spinlocks. The DM814x has 12 mailboxes; each mailbox can generate interrupts to 4 receivers from 4 interrupt sources and contains a FIFO 4 messages deep, each message 32 bits wide. Each mailbox can be read and written by any core; the interrupt sender and receiver are set through the corresponding registers, and the message itself is passed through the message register. The ARM, DSP, and the two M3 media controllers communicate through the system-level mailbox, while each HDVICP2 has its own separate mailbox and can send interrupts both to its own internal modules and to the other cores.
1.2 Shared Memory
Shared memory requires the system to map and manage memory properly. Each subsystem has its own memory and memory-mapped registers; to simplify software development, the DM814x uses a unified memory map, which gives every core a consistent view of the chip's resources.
The DM814x runs two operating systems, Linux and SYS/BIOS, built with the makefile mechanism and the XDC build system respectively. Linux configures the kernel-managed memory space at boot time through kernel boot parameters, while SYS/BIOS uses XDC configuration files to allocate memory areas such as data and code sections at build time. When the system is built and run, the memory used by each core and the shared memory must be partitioned. Because shared memory can be used by different processors, the chip integrates hardware spinlocks to implement mutually exclusive access to shared resources among the cores.
2: Implementation of heterogeneous multi-core inter-task communication
Efficient inter-core communication and collaboration in a heterogeneous multi-core processor requires not only support from on-chip hardware modules but also a software mechanism for inter-task communication. Both Linux and SYS/BIOS provide intra-OS task communication mechanisms: Linux offers pipes, message queues, semaphores, shared memory, and so on for synchronization and mutual exclusion between tasks, while SYS/BIOS offers semaphores, mailboxes, events, and the like. For communication between cores, TI provides a suite of heterogeneous multi-core communication components, SysLink, which gives users a variety of ways to implement inter-core communication. SysLink's features and related APIs are described below.
The SysLink component consists mainly of the following modules:
System management (System Manager)
Processor management (Processor Manager, PM)
Inter-processor communication (IPC)
Utility modules
2.1. System Management
The system management module (the IPC module) provides a simple and fast way to manage multiple cores, and also provides an interface for managing system resources (system memory).
The functions provided by the IPC module include:
SysLink system initialization (SysLink_setup(), SysLink_destroy()) and allocation of memory for the other SysLink components, including the IPC and PM modules (MemoryOS_setup(), Ipc_setup(&config));
System configuration: any system-level configuration information is managed by this module. A minimal start-up/shutdown sketch follows this list.
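As a minimal sketch, the start-up/shutdown lifecycle on the Linux (HLOS) side looks roughly as follows, assuming the SysLink 2.x user-space API (configuration specifics vary by release):

    /* Minimal SysLink start-up/shutdown sketch on the Linux host
     * (assumes the SysLink 2.x HLOS API; check your release headers). */
    #include <ti/syslink/Std.h>
    #include <ti/syslink/SysLink.h>

    int main(void)
    {
        /* Initialize SysLink: sets up MemoryOS, Ipc, and the other
         * modules configured into the build. */
        SysLink_setup();

        /* ... load slave processors, create IPC objects, run the app ... */

        /* Tear everything down in reverse order. */
        SysLink_destroy();
        return 0;
    }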
2.2. Processor Management
The ProcMgr module provides the following slave-processor services:
1) Boot-loading the slave processor;
2) Reading and writing slave-processor memory;
3) Slave-processor power management.
Accordingly, the module provides the following interfaces for these services:
Loader: the processor loader interface has multiple implementations; the file to be loaded may be a COFF, ELF, dynamic-loader, or custom-format file, etc.
PowerManager: given the versatility of the processor management module and the desire for a customizable power management module, power management in SysLink is a standalone module that can be embedded into processor management;
ProcessorManager: provides load, MMU management (supported on the A8), and slave-processor memory read/write interfaces, among others. In SysLink, each processor is numbered (the processor ID) for ease of management, as shown in the figure; a hardware abstraction layer masks the differences of the underlying hardware, with the benefit of providing a common software interface across different hardware. In SysLink, the slave-processor loader theoretically supports multiple file formats; the SysLink releases primarily support COFF and ELF. In TI's build system, COFF files and ELF files can be distinguished by the suffix of the executable file. The current ELF loader supports only sequential loading, that is, the next slave processor can be loaded only after one slave processor has been loaded and started; parallel loading is not supported.
Figure 2: Loader flowchart
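The load-and-start sequence in Figure 2 can be sketched as follows, modeled on SysLink's slaveLoader sample; the ProcMgr function names are from the SysLink 2.x API, but exact signatures differ across releases, so treat this as a schematic (error checking omitted):

    /* Schematic slave boot sequence (verify signatures against your
     * SysLink release; error checking omitted for brevity). */
    #include <ti/syslink/Std.h>
    #include <ti/syslink/ProcMgr.h>
    #include <ti/ipc/MultiProc.h>

    static void boot_slave(String procName, String imagePath)
    {
        ProcMgr_Handle      handle;
        ProcMgr_AttachParams attachParams;
        ProcMgr_StartParams  startParams;
        UInt32 fileId;
        UInt16 procId = MultiProc_getId(procName); /* e.g. "DSP" */

        ProcMgr_open(&handle, procId);

        ProcMgr_getAttachParams(NULL, &attachParams);
        ProcMgr_attach(handle, &attachParams);

        /* Load the COFF/ELF image; the loader is selected by file type. */
        ProcMgr_load(handle, imagePath, 0, NULL, NULL, &fileId);

        ProcMgr_getStartParams(handle, &startParams);
        ProcMgr_start(handle, &startParams);
    }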
2.3. Inter-processor communication protocols
SysLink defines the following communication mechanisms:
Notify, MessageQ, ListMP, GateMP, HeapBufMP, HeapMemMP, FrameQ (typically used for raw video data), and RingIO (typically used for audio data).
The interfaces of these communication mechanisms have a few things in common:
All IPC interfaces are named in a uniform, system-wide way;
On the HLOS side, every IPC module has <module>_setup() and <module>_destroy() APIs for initializing and destroying the module, and some modules additionally provide a configuration interface, <module>_config();
Instances are created with <module>_create() and deleted with <module>_delete(); when an IPC is used at a deeper level, a handle is obtained with <module>_open() and released with <module>_close() when use of the IPC ends;
IPC configuration is mostly done under SYS/BIOS, where static configuration through XDC is supported;
Each IPC module supports trace information for debugging, with different trace levels, and some IPCs provide specialized APIs for extracting analysis information.
2.3.1. Notify
The Notify component abstracts hardware interrupts into multiple logical events and is a simple, fast way to send messages of less than 32 bits. Because the Notify module occupies hardware interrupts, it cannot be scheduled too frequently. Events are prioritized by event ID: event 0 has the highest priority, priority decreases as the event ID increases, and when multiple events are triggered at once the highest-priority event is serviced first. Typical uses are signaling and buffer-pointer passing: a high-priority event such as event 0 is used for signaling, whereas passing a buffer pointer can use a low-priority event such as event 30. Since other modules also use the Notify mechanism, some event numbers are reserved in SysLink, and users need to choose these event numbers carefully (if the other components are not used, this part of the event-number space can be taken). Before registering an event, Notify_eventAvailable() can be used to check whether the event is available, that is, whether that event number on the interrupt line is still unregistered.
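A minimal sketch of registering and sending a Notify event with the Notify API shipped with SysLink (the line and event numbers here are arbitrary examples; check which event numbers are reserved in your configuration):

    #include <xdc/std.h>
    #include <ti/ipc/Notify.h>
    #include <ti/ipc/MultiProc.h>

    #define LINE_ID  0   /* interrupt line; typically line 0 */
    #define EVENT_ID 30  /* example low-priority event number */

    /* Runs when the remote core sends this event to us. */
    static Void notifyCb(UInt16 procId, UInt16 lineId, UInt32 eventId,
                         UArg arg, UInt32 payload)
    {
        /* payload carries the sub-32-bit message, e.g. a buffer offset */
    }

    Void notify_example(Void)
    {
        UInt16 remoteProc = MultiProc_getId("DSP"); /* name per MultiProc config */

        /* Check that no one else has registered this event number. */
        if (Notify_eventAvailable(remoteProc, LINE_ID, EVENT_ID)) {
            Notify_registerEvent(remoteProc, LINE_ID, EVENT_ID, notifyCb, 0);
        }

        /* Send a 32-bit payload; TRUE waits until the previous event is read. */
        Notify_sendEvent(remoteProc, LINE_ID, EVENT_ID, 0x1234, TRUE);
    }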
2.3.2. MessageQ
MessageQ provides queue-based message delivery, with the following characteristics:
It delivers variable-length messages between processors, and delivery is accomplished by operating on message queues. Each message queue can have multiple writers but only one reader, while each task can read and write multiple message queues. A host that is to receive messages must first create a message queue, and before sending a message the sender must open the intended receiving queue.
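A sketch of the single-reader/multiple-writer pattern described above, using the MessageQ API (the queue name, heap ID, and message layout are illustrative assumptions; a heap must have been registered with MessageQ beforehand):

    #include <xdc/std.h>
    #include <ti/ipc/MessageQ.h>

    #define HEAP_ID 0  /* registered earlier via MessageQ_registerHeap() */

    typedef struct {
        MessageQ_MsgHeader header;  /* required first field */
        UInt32 cmd;                 /* application payload */
    } AppMsg;

    /* Reader side: create the queue, then block for messages. */
    Void reader(Void)
    {
        MessageQ_Handle queue = MessageQ_create("VIDEO_Q", NULL);
        MessageQ_Msg msg;

        MessageQ_get(queue, &msg, MessageQ_FOREVER);
        /* ... act on ((AppMsg *)msg)->cmd ... */
        MessageQ_free(msg);
    }

    /* Writer side: open the queue by name, allocate, and send. */
    Void writer(Void)
    {
        MessageQ_QueueId queueId;
        AppMsg *msg;

        /* Retry until the reader has created the queue. */
        while (MessageQ_open("VIDEO_Q", &queueId) < 0);

        msg = (AppMsg *)MessageQ_alloc(HEAP_ID, sizeof(AppMsg));
        msg->cmd = 1;
        MessageQ_put(queueId, (MessageQ_Msg)msg);
    }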
2.3.3. ListMP
ListMP implements a multi-homed doubly linked circular list, that is, a doubly linked circular list owned jointly by multiple processors, which all of them can maintain and use. The implementation of ListMP differs from an ordinary doubly linked circular list: besides the usual properties of such a list, it adds other features, such as the following:
It implements a simple multi-homed protocol supporting multiple readers and multiple writers (multi-reader, multi-writer);
It uses a gate as an internal protection mechanism to prevent multiple host processors from accessing the linked list at the same time.
The ListMP implementation does not include a notification mechanism; one can be added in an external wrapper if required. Buffers managed by the ListMP mechanism must be allocated from the shared memory region, whether they come from heap memory or are dynamically allocated.
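A sketch of sharing list elements through ListMP, shown as it would look on the SYS/BIOS side (the region ID, list name, and element layout are illustrative; note that element memory comes from the shared region's heap, per the allocation rule above):

    #include <xdc/std.h>
    #include <xdc/runtime/Memory.h>
    #include <ti/ipc/ListMP.h>
    #include <ti/ipc/SharedRegion.h>

    typedef struct {
        ListMP_Elem elem;  /* required first field: the link node */
        UInt32 data;       /* application payload */
    } SharedNode;

    Void listmp_example(Void)
    {
        ListMP_Params params;
        ListMP_Handle list;
        SharedNode *node;

        ListMP_Params_init(&params);
        params.name     = "FRAME_LIST";  /* illustrative name */
        params.regionId = 0;             /* SharedRegion to allocate from */
        list = ListMP_create(&params);

        /* Elements must live in shared memory so every core can
         * dereference them; allocate from the shared region's heap. */
        node = Memory_alloc(SharedRegion_getHeap(0),
                            sizeof(SharedNode), 0, NULL);
        node->data = 42;
        ListMP_putTail(list, (ListMP_Elem *)node);

        /* A reader (possibly on another core) pops from the head. */
        node = (SharedNode *)ListMP_getHead(list);
    }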
2.3.4. GateMP
GateMP is a protection mechanism for resources shared by multiple processors; as its name suggests, if a shared resource is the house, GateMP is the door to the house. The GateMP component implements a gate that opens and closes so that a protected shared resource is read or written by only one processor at a time. The implementation of GateMP differs depending on the SoC's hardware resources: hardware with spinlock support can use GateHWSpinlock, based on the hardware spinlock, while systems without this hardware resource use a software method (the Peterson algorithm) to implement GatePeterson.
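A sketch of protecting a shared-memory update with GateMP (names are illustrative; on the DM814x the gate can map to the hardware spinlock via GateHWSpinlock, as described above):

    #include <xdc/std.h>
    #include <ti/ipc/GateMP.h>

    Void gatemp_example(volatile UInt32 *sharedCounter)
    {
        GateMP_Params params;
        GateMP_Handle gate;
        IArg key;

        GateMP_Params_init(&params);
        params.name = "COUNTER_GATE";  /* illustrative name */
        gate = GateMP_create(&params);

        /* Only one processor at a time gets past the gate. */
        key = GateMP_enter(gate);
        (*sharedCounter)++;            /* critical section */
        GateMP_leave(gate, key);
    }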
2.3.5. HeapMP
HeapMP mainly comprises HeapBufMP and HeapMemMP, which are used to configure and manage heap memory in the shared memory region.
HeapMP has several characteristics:
It supports multi-homing: both the host processor running the HLOS and the slave processors running SYS/BIOS can configure and manage heap memory; the shared memory region can be configured as buffer pools, and buffers can be allocated and freed from the shared memory region;
HeapBufMP provides the user with a fixed-size buffer pool management interface, and HeapMultiBufMP provides the user with a buffer pool management interface with configurable sizes.
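A sketch of creating a fixed-size buffer pool and drawing buffers from it with HeapBufMP (block size, block count, and names are illustrative assumptions):

    #include <xdc/std.h>
    #include <ti/ipc/HeapBufMP.h>

    Void heapbufmp_example(Void)
    {
        HeapBufMP_Params params;
        HeapBufMP_Handle heap;
        Ptr buf;

        HeapBufMP_Params_init(&params);
        params.name      = "FRAME_POOL";  /* illustrative name */
        params.regionId  = 0;             /* SharedRegion holding the pool */
        params.blockSize = 4096;          /* fixed size of each buffer */
        params.numBlocks = 16;
        heap = HeapBufMP_create(&params);

        /* Any core that opens this heap can allocate and free blocks. */
        buf = HeapBufMP_alloc(heap, 4096, 0);
        HeapBufMP_free(heap, buf, 4096);
    }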
2.3.6. FrameQ
FrameQ is a component designed specifically for transmitting video frames. Its basic data structure is a queue on which frames can be enqueued and dequeued; each frame encapsulates the video frame buffer pointer, frame data type, frame width, frame height, timestamp, and other information.
The FrameQ module has the following characteristics: it supports multiple readers but only a single writer; frames can be allocated and freed, and a new frame buffer can be allocated and initialized repeatedly from the same memory region; FrameQ allows multiple queues, so in multi-channel operation video frames are placed into the frame queue corresponding to their channel number.
2.3.7. RingIO
RingIO is a ring buffer based on a data stream, optimized for the characteristics of audio and video data.
RingIO has the following characteristics: only one reader and one writer are supported; reading and writing are relatively independent and can proceed concurrently in different processes or on different processors.
2.4. Common components (Utility Modules)
The utility modules include SharedRegion, List, Trace, MultiProc, NameServer, etc., which form the basis on which the upper-level components are implemented.
2.4.1. SharedRegion
SharedRegion can be configured in two ways: statically or dynamically. In the actual configuration, the virtual address and heap settings of the shared memory regions must be specified for each processor. The SharedRegion module itself occupies no shared memory space, since all of its state resides in each processor's local memory. All SharedRegion module APIs use a gate for mutually exclusive operation.
The SharedRegion module creates a shared memory region lookup table for each processor in the system. This lookup table contains the mapping of every processor to each shared memory region, together with the related settings. If a shared memory region is inaccessible from some processor, it is set to NULL in that processor's table.
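A sketch of adding one entry to this lookup table at run time with the SharedRegion API (base address and length are placeholders; on the SYS/BIOS side the table is more commonly populated statically in the XDC configuration):

    #include <xdc/std.h>
    #include <ti/ipc/SharedRegion.h>

    Void sharedregion_example(Void)
    {
        SharedRegion_Entry entry;

        SharedRegion_entryInit(&entry);
        entry.base        = (Ptr)0x9E000000;  /* placeholder base address */
        entry.len         = 0x01000000;       /* placeholder length (16 MB) */
        entry.ownerProcId = 0;                /* core that creates the heap */
        entry.isValid     = TRUE;
        entry.createHeap  = TRUE;             /* carve a heap from the region */
        entry.cacheEnable = TRUE;

        /* Region 1 becomes visible in this core's lookup table. */
        SharedRegion_setEntry(1, &entry);
    }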
2.4.2. MultiProc
The MultiProc module uniquely identifies the processors in a multi-core processor (multi-processor ID management; in a firmware-loading program, for example, MultiProc_getId supplies the procId parameter needed by ProcMgr_open). Before the module is used, the multiprocessor environment must be configured in the IPC environment through a *.cfg script.
3: Introduction to the software framework of Link-based MCFW
On top of the SysLink IPC communication modules, TI designed the MCFW (Multi-Channel Framework) for high-definition video capture, encoding, and transmission; it unifies the threads running on each core into a Link structure.
3.1 Link Introduction
Each Link is a set of threads with a particular function together with its associated data structures. Every Link has a unique ID, from which one can tell which core the Link runs on. Each Link can have multiple input queues and output queues, and the Notify mechanism announces when new data is ready.
When a Link is created, it is told which Link precedes it and which Link follows it, so that all the Links are connected together into a chain.
The threads within a Link synchronize and achieve mutual exclusion through thread communication mechanisms such as MessageQ and semaphores, while video data itself is passed through inter-core shared memory so that copying it between cores is avoided, achieving efficient inter-core data sharing.
3.2 Link's working mechanism
Each Link must implement certain functions and register these function pointers with the Link-management core module at initialization time, for acquiring frame data, releasing it, dumping related state, and so on.
For any Link to get frame data from its upstream Link, it calls the Link-management core function System_getLinksFullFrames(). This sends a message to the corresponding upstream Link and triggers the callback that Link registered with the management module, System_linkGetOutputFramesCb(), which passes the frame data to the requesting Link.
Similarly, when a Link wants to release consumed frame buffers back to its upstream Link, it calls the Link-management core function System_putLinksEmptyFrames(). This sends a message to the corresponding upstream Link and triggers its registered callback System_linkPutEmptyFramesCb(), which reclaims the frame buffers for subsequent data processing.
When the chain is established, every downstream Link registers a System_getLinkInfoCb() callback, and System_linkGetInfo() is called when the driver is created in a downstream Link's driver function to obtain the relevant parameters of the upstream Link.
With this approach, a Link need not care which specific Links it interacts with: everything is looked up from the LinkId, so the same Link implementation can interact with different Links without changing its implementation, as the sketch below illustrates.
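To make the exchange concrete, the following schematic shows a consumer Link's processing loop built around the calls named above. Only the System_* function names come from MCFW; the FrameList layout and the parameter lists are hypothetical placeholders, since the real MCFW structures are release-specific:

    /* Schematic consumer-Link loop; everything except the System_*
     * names quoted in the text is a hypothetical placeholder. */
    typedef struct {
        void *frames[16];      /* placeholder: frame buffer pointers */
        unsigned numFrames;
    } FrameList;

    extern int System_getLinksFullFrames(unsigned prevLinkId, unsigned queId,
                                         FrameList *frames);
    extern int System_putLinksEmptyFrames(unsigned prevLinkId, unsigned queId,
                                          FrameList *frames);

    void link_process(unsigned prevLinkId, unsigned queId)
    {
        FrameList frames;

        /* Ask the upstream Link for ready frames; internally this sends a
         * message that fires the upstream System_linkGetOutputFramesCb(). */
        System_getLinksFullFrames(prevLinkId, queId, &frames);

        /* ... process the frames (encode, scale, display, ...) ... */

        /* Hand the consumed buffers back; this fires the upstream
         * System_linkPutEmptyFramesCb() so the buffers can be refilled. */
        System_putLinksEmptyFrames(prevLinkId, queId, &frames);
    }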