DMA Descriptors and Mappings

1. DMA Channel

The DMA (Direct Memory Access) channel is built between the device and RAM; the DMAC (DMA controller) interacts with the device's I/O controller to carry out the data transfer.

In a PC, the DMA controller sits on the motherboard's South Bridge, which is responsible for managing the I/O bus. A typical PC data-channel diagram is shown below:


Once activated by the CPU, the DMAC (DMA controller) can transfer data on its own. During a DMA transfer the DMA controller takes direct charge of the bus, so bus mastership has to be handed over: before the transfer, the CPU grants bus control to the DMA controller; after the transfer, the DMAC raises an interrupt request and returns bus control to the CPU. Once the DMA controller owns the bus, the CPU either stalls or carries out only internal operations, while the DMA controller issues read/write commands and directly drives RAM and the I/O interface to perform the transfer.

It is worth being clear that an I/O device typically has an internal buffer, commonly referred to as device memory. Viewed from the source and destination of the transfer, the two ends of the DMA channel are RAM and device memory. Device memory is usually implemented with fast, low-power SRAM; for example, the PCU unit of the AR9331 switch chip has a 4 KB TX FIFO and a 2 KB RX FIFO, and its datasheet states: "The GE0 and GE1 support 2K transmit FIFO and 2K receive FIFO."

The heaviest users of the DMAC are disk drives and other slow devices that must transfer a large number of bytes at a time, as well as PCI network cards (NICs).

2. The DMA Layer in Linux

The core of DMA operation is the DMA memory mapping, which comes in three flavors: consistent (coherent) DMA mappings, streaming DMA mappings, and scatter/gather mappings. The following is the general framework of the Linux kernel's DMA layer:


As the diagram shows, the DMA layer in the Linux kernel provides device drivers with a standard DMA mapping interface, such as dma_alloc_coherent() for consistent mappings and dma_map_single() for streaming mappings. These interfaces hide the differences between platforms and give device drivers good portability.
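
As a rough illustration of these two interface families, the following C sketch shows a driver allocating a coherent buffer (as one would for a descriptor ring) and building a one-off streaming mapping over an existing packet buffer. The device pointer, sizes, and error handling are placeholders, not taken from any particular driver.

```c
#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/errno.h>

#define RING_BYTES 4096   /* hypothetical size of the descriptor ring */

/* Sketch only: 'dev' is the struct device of a hypothetical DMA-capable device. */
static int example_dma_mappings(struct device *dev, void *pkt, size_t pkt_len)
{
	dma_addr_t ring_bus, pkt_bus;
	void *ring_cpu;

	/* Consistent (coherent) mapping: long-lived, safe for the CPU and the
	 * device to share, typically used for descriptor rings. */
	ring_cpu = dma_alloc_coherent(dev, RING_BYTES, &ring_bus, GFP_KERNEL);
	if (!ring_cpu)
		return -ENOMEM;

	/* Streaming mapping: short-lived, wraps an existing buffer
	 * (e.g. skb->data) for a single transfer in one direction. */
	pkt_bus = dma_map_single(dev, pkt, pkt_len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, pkt_bus)) {
		dma_free_coherent(dev, RING_BYTES, ring_cpu, ring_bus);
		return -ENOMEM;
	}

	/* ... program descriptors with ring_bus / pkt_bus and start the DMA ... */

	dma_unmap_single(dev, pkt_bus, pkt_len, DMA_TO_DEVICE);
	dma_free_coherent(dev, RING_BYTES, ring_cpu, ring_bus);
	return 0;
}
```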

3. DMA Descriptor

The SoC datasheet typically provides DMA RX/TX descriptor address registers and trigger/control registers. In embedded software development, the DMA descriptor array is a very important concept.

The DMA descriptor array (also called a descriptor ring or chain) is an array of pointers of the form unsigned long *hw_desc[desc_num]; each pointer (hw_desc[i]) points to one descriptor. The descriptor format is defined by the hardware, and its data structure is normally given in the datasheet or SDK.

3.1 Hardware Descriptor (h/w descriptor)

Hardware descriptors typically contain the following five fields:

<1> Control bit (empty flag / own bit): indicates whether the descriptor is empty / owned by the DMA engine; this bit field describes whether the descriptor is currently valid for the CPU or for the DMA. The empty flag and own bit describe the same control state from different angles: empty means no data (packet) is attached to the descriptor. For RX, a descriptor that is empty (owned by DMA) is waiting for the DMA to deposit incoming data into the DMA buffer attached to it; for TX, a descriptor that is empty (not owned by DMA) is waiting for the CPU to attach an outgoing packet to it. The discussion below is mainly based on the DMA own bit; the reader can do the equivalent conversion for the empty flag.

<2> Packet address: a pointer to the source or destination memory region of the DMA transfer, sometimes called the DMA buffer. The DMA buffer is the final destination of a packet, i.e. the cluster data area (mBlk::mBlkHdr.mData in VxWorks, sk_buff.data in Linux).

<3> Packet length: the effective length of the RX/TX packet.

<4> Ring end flag (wrap bit): set on the last descriptor; it marks the end of the ring and is used, for example, to detect RX overflow. Viewed from the "ring" angle, it can also be understood as the wrap-around point of the array index.

<5> Ring link (next pointer): a pointer to the next descriptor. Although the allocated descriptor array is already contiguous, the hardware prefers to locate the next descriptor by address, whereas software usually walks the ring by array index, for example when reaping received packets in the RX ISR.

Some documents call the descriptor a BD (buffer descriptor). Because it is hardware-specific, developers generally do not need to modify the h/w descriptor data structure.
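
To make the five fields above concrete, here is a hypothetical hardware descriptor layout in C. The bit positions, field widths, and ring size are invented for illustration; in a real driver they must be taken from the SoC datasheet.

```c
#include <linux/types.h>
#include <linux/bits.h>

/* Hypothetical h/w descriptor; real layouts come from the datasheet. */
struct hw_desc {
	u32 ctrl;        /* own/empty bit, wrap bit, packet length (illustrative) */
	u32 pkt_addr;    /* bus address of the DMA buffer (packet data)           */
	u32 next_desc;   /* bus address of the next descriptor in the ring        */
	u32 reserved;    /* device-specific status/padding                        */
};

#define DESC_OWN   BIT(31)       /* 1: owned by DMA, 0: owned by CPU          */
#define DESC_WRAP  BIT(25)       /* marks the last descriptor of the ring     */
#define DESC_LEN_M 0x0000ffff    /* packet length mask                        */

#define DESC_NUM 128
static struct hw_desc *rx_ring;  /* will point into a coherent allocation     */
```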

The address of the DMA descriptor array hw_desc[] used by software is the virtual address of a DMA mapping; the descriptor base address must be configured in the SoC's DMA registers, such as the AR9331 datasheet's DMARXDESCR (pointer to RX descriptor) and DMATXDESCR_Q0 (descriptor address for queue 0 TX). Obviously, this array needs to be allocated in, or converted to, an uncached segment (such as KSEG1 on MIPS). However, the DMA buffer that each descriptor points to is normally allocated in a cached segment (such as KSEG0 on MIPS), mainly for memory performance reasons.
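
A minimal sketch of this setup, reusing the hypothetical struct hw_desc, rx_ring, and DESC_NUM from the sketch in 3.1: the ring is placed in a coherent (uncached) mapping and its bus address is written into a memory-mapped RX descriptor base register. The register offset is a placeholder, not the actual AR9331 definition, and 32-bit bus addresses are assumed.

```c
#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/errno.h>

#define DMA_RX_DESC_REG 0x000c   /* hypothetical offset of the RX descriptor pointer */

static int rx_ring_setup(struct device *dev, void __iomem *regs)
{
	dma_addr_t ring_bus;

	/* The descriptor ring itself lives in an uncached/coherent mapping. */
	rx_ring = dma_alloc_coherent(dev, DESC_NUM * sizeof(*rx_ring),
				     &ring_bus, GFP_KERNEL);
	if (!rx_ring)
		return -ENOMEM;

	/* Tell the DMAC where the ring starts (bus/physical address). */
	writel(ring_bus, regs + DMA_RX_DESC_REG);
	return 0;
}
```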


3.2 Software Descriptor (s/w descriptor)

The hardware descriptor (h/w descriptor) focuses on the transmission of a packet, such as an Ethernet frame, and cares little about how software organizes packets or data chains. The software layer needs to maintain complete chain information, including horizontal packet fragmentation (fragments) and vertical multi-packet chaining, as well as connection tracking (conntrack) while packets flow between the layers of the network protocol stack. For example, a NIC's ISR will first allocate a buffer (an mBlk/mBuf in VxWorks or an sk_buff in Linux), encapsulate the packet contents into that buffer, and then call the corresponding function (END_RCV_RTN_CALL(END_OBJ*, M_BLK_ID) in VxWorks, or netif_rx(sk_buff*) in Linux) to push the packet up into the network subsystem (TCP/IP protocol stack).

Packets flow through the stack as mBlk or sk_buff structures, so at the software level the h/w descriptor often has to be extended appropriately to keep this information organized. Usually, the s/w descriptor is a container for the h/w descriptor. Besides this organizational extension, additional descriptive and bookkeeping information can be added as needed, such as a timestamp to track TX timeouts.
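
A sketch of what such a container might look like in a Linux driver, reusing the hypothetical struct hw_desc and DESC_NUM from section 3.1; the field set (skb pointer, mapped bus address, timestamp) is illustrative, not a fixed format.

```c
#include <linux/skbuff.h>
#include <linux/dma-mapping.h>

/* Hypothetical s/w descriptor: wraps one h/w descriptor slot and keeps
 * the bookkeeping that the hardware does not care about. */
struct sw_desc {
	struct hw_desc *hw;          /* the h/w ring slot this entry shadows   */
	struct sk_buff *skb;         /* packet currently attached to the slot  */
	dma_addr_t      buf_dma;     /* streaming mapping of skb->data         */
	unsigned long   tx_jiffies;  /* e.g. to detect TX timeouts             */
};

static struct sw_desc tx_sw_ring[DESC_NUM];  /* parallels the h/w TX ring */
```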

4. DMA Process

4.1 DMA Ring Initialization and Startup

Note that reads and writes differ: RX traffic is irregular, so buffers must be attached to the RX descriptor ring at all times, on standby; TX is usually initiated by us, and the packet buffer is hung on the TX descriptor ring inside the driver's send call.

At initialization time, the device driver needs to set the control bit of every descriptor on the RX ring to 1 (owned by DMA, empty, not available for the CPU to handle), meaning the descriptor is currently available to the DMA engine: the CPU vacates it so that the DMA can transfer data in from the device. Conversely, the control bit of every descriptor on the TX ring is set to 0 (not owned by DMA, empty, available for the CPU to fill), meaning the descriptor is currently available to the CPU: the DMA vacates it so that the CPU can attach outgoing packets to the TX ring.
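
A minimal initialization sketch under the own-bit convention just described, continuing with the hypothetical names (rx_ring, DESC_NUM, DESC_OWN, DESC_WRAP) from the earlier sketches:

```c
/* RX ring: give every descriptor to the DMA engine so it can fill them. */
static void rx_ring_init(void)
{
	int i;

	for (i = 0; i < DESC_NUM; i++) {
		rx_ring[i].ctrl = DESC_OWN;            /* owned by DMA, empty      */
		if (i == DESC_NUM - 1)
			rx_ring[i].ctrl |= DESC_WRAP;  /* mark the end of the ring */
	}
}

/* TX ring: keep every descriptor with the CPU until there is something to send. */
static void tx_ring_init(struct hw_desc *tx_ring)
{
	int i;

	for (i = 0; i < DESC_NUM; i++) {
		tx_ring[i].ctrl = 0;                   /* owned by CPU, empty      */
		if (i == DESC_NUM - 1)
			tx_ring[i].ctrl |= DESC_WRAP;
	}
}
```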

In general, once the network device (MAC) driver has finished configuration, the DMA RX channel can be enabled by programming the DMA RX trigger/control register. For the TX path, the DMA TX trigger/control register is usually programmed to start DMA TX after a packet has been prepared and hooked onto the TX ring.

4.2 Size of the DMA Ring

Each descriptor on the RX ring points to a buffer, so the size of the ring determines, to some extent, the capacity for accepting packets. An RX ring that is too small causes frequent RX overflows, which inevitably hurts throughput. The size of the TX ring determines how many outgoing packets can be queued (packet buffers to be sent); a TX ring that is too small also hurts throughput. But a bigger RX/TX ring is not always better: the size must be weighed against the available memory and matched to the CPU's processing capacity to reach the best overall performance balance. The "receive buffers" and "transmit buffers" entries in a Windows PC's network adapter advanced options generally refer to the RX/TX ring sizes.

4.3 DMA Process Control

(1) For inbound transfers (RX), when the device receives a packet (or its RX FIFO fills up) it notifies the DMAC, and the DMA begins transferring data from device memory to the DMA buffers pre-attached to the RX ring. At that point, the DMAC automatically clears the control bit of the corresponding descriptor to 0 (not owned by DMA, not empty, available for the CPU to handle), meaning it is ready for the CPU to consume; at the same time, the DMAC raises an interrupt request to the CPU. In the RX ISR, the RX ring is scanned; for each descriptor whose control bit is 0, the packet in its DMA buffer is picked up and encapsulated into an mBlk/sk_buff, then pushed by the END_OBJ/net_device ISR into the TCP/IP protocol stack queue to await processing by the network subsystem.

Typically, in the RX ISR you also need to allocate a new mBlk/sk_buff from the netpool/skb pool and hook it up (via a streaming mapping, discussed below) to the descriptor whose buffer was just picked up (and manually set its control bit back to 1), so that the RX descriptor is always available. The life cycle of the picked-up mBlk/sk_buff is then controlled by the network subsystem; it is eventually released, or reused to construct a TX packet.
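
A simplified reap-and-refill loop for the RX ISR, following the flow just described. The buffer size, descriptor fields, and own-bit convention are the hypothetical ones used in the earlier sketches, not a real driver's definitions.

```c
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>
#include <linux/dma-mapping.h>

#define RX_BUF_SIZE 2048   /* hypothetical per-descriptor buffer size */

static void rx_reap(struct net_device *ndev, struct device *dev,
		    struct sw_desc *sw, struct hw_desc *hw, int num)
{
	int i;

	for (i = 0; i < num; i++) {
		struct sk_buff *skb, *new_skb;
		unsigned int len;

		if (hw[i].ctrl & DESC_OWN)      /* still owned by DMA: no packet yet */
			continue;

		/* Hand the filled buffer back to the CPU domain and push it up. */
		len = hw[i].ctrl & DESC_LEN_M;
		dma_unmap_single(dev, sw[i].buf_dma, RX_BUF_SIZE, DMA_FROM_DEVICE);
		skb = sw[i].skb;
		skb_put(skb, len);
		skb->protocol = eth_type_trans(skb, ndev);
		netif_rx(skb);

		/* Refill the slot with a fresh buffer and give it back to the DMA. */
		new_skb = netdev_alloc_skb(ndev, RX_BUF_SIZE);
		if (!new_skb)
			break;                  /* a real driver must handle this case */
		sw[i].skb = new_skb;
		sw[i].buf_dma = dma_map_single(dev, new_skb->data,
					       RX_BUF_SIZE, DMA_FROM_DEVICE);
		hw[i].pkt_addr = sw[i].buf_dma;
		hw[i].ctrl = (hw[i].ctrl & DESC_WRAP) | DESC_OWN;
	}
}
```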

When protocol stack processing (the bottom half of interrupt handling) takes too long, the netpool/skb pool becomes tight; on the other hand, if interrupts (the top half) are not serviced in time, RX overflow occurs. On some SoC chips, an RX overflow may suspend DMA for a period of time to avoid overly frequent interrupts. The following statement is excerpted from the AR9331 datasheet:

This bit (DMARXSTATUS::RxOverflow) is set when the DMA controller reads a set empty flag (not available for DMA) in the descriptor it is processing. The DMA controller clears this bit (DMARXCTRL::RxEnable) if it encounters an RX overflow or bus error state.

(2) When the local protocol stack responds (e.g. ACKs) to packets received on RX, or forwards them, the packet must be mounted on an idle descriptor of the TX ring, where "idle" means the control bit is 0 (not owned by DMA). After hanging the packet on the TX ring, we manually set the descriptor's control bit to 1 (owned by DMA, not empty, available for DMA to transmit), meaning that this TX descriptor now carries an outgoing packet. Shortly afterwards, the DMA TX trigger is kicked, and once the DMAC has acquired the bus it transfers (copies) the packets on descriptors whose control bit is 1 to device memory (the TX FIFO).
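
A corresponding TX sketch, continuing with the hypothetical names from the earlier sketches: map the packet, hang it on an idle descriptor, set the own bit, and kick a (hypothetical) TX trigger register.

```c
#include <linux/dma-mapping.h>
#include <linux/skbuff.h>
#include <linux/jiffies.h>
#include <linux/io.h>
#include <linux/errno.h>

#define DMA_TX_TRIGGER_REG 0x0010   /* hypothetical trigger/control register offset */

static int tx_xmit(struct device *dev, void __iomem *regs,
		   struct sw_desc *sw, struct hw_desc *hw, int idx,
		   struct sk_buff *skb)
{
	if (hw[idx].ctrl & DESC_OWN)        /* slot still owned by DMA: ring full */
		return -EBUSY;

	/* Streaming map of the packet data the stack handed us. */
	sw[idx].skb = skb;
	sw[idx].buf_dma = dma_map_single(dev, skb->data, skb->len, DMA_TO_DEVICE);
	sw[idx].tx_jiffies = jiffies;       /* remember when it was queued */

	hw[idx].pkt_addr = sw[idx].buf_dma;
	hw[idx].ctrl = (hw[idx].ctrl & DESC_WRAP) | DESC_OWN |
		       (skb->len & DESC_LEN_M);

	/* Kick the DMAC so it walks the TX ring. */
	writel(1, regs + DMA_TX_TRIGGER_REG);
	return 0;
}
```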


We generally assume that the DMAC and the device's I/O controller cooperate well, so TX overflow events are rarely mentioned. The DMAC automatically clears the control bit of each descriptor whose packet was successfully transferred to device memory back to 0 (not owned by DMA, empty, available for the CPU to reuse/refill) and raises an interrupt request to the CPU. Device drivers typically scan the TX ring, either in the TX ISR or on the next send, for descriptors whose control bit is 0, and release (reap) the packet buffers that were hung on them.

5. DMA and Cache Consistency

5.1 The Cache Consistency Problem

The following figure shows the data flow between the CPU, cache, RAM, and device.


In the middle path, the CPU interacts with memory directly through the KSEG1 (uncached) address space; on the upper and lower paths it goes through the cache, i.e. the KSEG0 address space (this refers to the case where the MMU is not used).

Because the data region is readable and writable by a process while instructions are read-only, a program's image is divided into "program text (instructions)" and "program data" sections, a division that also helps privilege management. When several process instances of the same program run in the system, only one copy of the program's instructions needs to be kept in memory, since the instructions are identical. The L1 cache design of modern processors generally adopts a Harvard structure that echoes this idea, separating instructions from data into an I-cache and a D-cache, each with its own ports (the I-cache is read-only and needs no write port). The Harvard structure helps exploit program locality, specifically improving the CPU's cache hit rate.

Suppose a data exchange between RAM and the device changes the contents of a DMA buffer in RAM, and the block of RAM backing that DMA buffer happens to be held in the cache. Without a mechanism to update (or invalidate) the cache with the new DMA buffer contents, the cache and its corresponding block of RAM become inconsistent. If the CPU then tries to read the data the device just transferred into RAM, it will get it straight from the cache, which is clearly not what is expected, because the corresponding RAM has already been updated behind the cache's back.

As far as cache consistency is concerned, different architectures take different approaches. Some guarantee it at the hardware level (such as the x86 platform); others have no hardware support and require software involvement (such as the MIPS and ARM platforms), in which case the device driver programmer must solve the cache/DMA inconsistency problem in software.

5.2 Consistent DMA mappings

As mentioned above, for memory performance the DMA buffers of packets flowing through the kernel protocol stack are best left cacheable, and are usually allocated in a cached segment (e.g. KSEG0 on MIPS).

On the x86 platform the hardware handles cache consistency, so a consistent DMA mapping only needs to secure a set of contiguous physical page frames of the desired size. On MIPS or ARM platforms, dma_alloc_coherent() first allocates a set of contiguous physical pages as the buffer for subsequent DMA operations, and then, at the software level, maps that physical address range to an uncached virtual address range, specifically by disabling caching in the page directory and page table entries for the mapped interval, so that the cache consistency problem disappears. Because caching is turned off and its benefits are lost, the consistent mapping pays a performance price.

The DMA descriptors themselves are ideal candidates for a consistent mapping and are typically mapped to an uncached segment (for example, KSEG1 on MIPS).

In a real driver, the buffer for the consistent mapping is allocated by the driver itself during initialization, and its lifetime can last until the driver module is removed from the kernel. In some cases, however, the consistent mapping runs into insurmountable difficulties, chiefly when the DMA buffers used by the driver are not allocated by the driver but come from other modules (typically the buffer that skb->data points to for packet transmission in a network device driver); then another type of DMA mapping is needed: the streaming DMA mapping.

5.3 Streaming DMA Mappings

With a streaming DMA mapping, the buffer used by the DMA channel is usually not allocated by the current driver, and a streaming mapping is typically established for each individual DMA transfer. In addition, because the driver cannot make assumptions about how an externally supplied DMA buffer was mapped, it must carefully handle the cache consistency issues that can occur.

(1) For inbound transfers (RX), the DMA device writes data into memory and the DMAC raises an interrupt request to the CPU. Before using that memory in the RX ISR, the driver must first invalidate the cache (sync_single_for_cpu) so that the cache lines are refilled; the CPU then sees the newest data through the cache.

(2) For outbound transfers (TX), one possible scenario is that the packet constructed by the local protocol stack is still sitting in the D-cache, so the send path must flush the D-cache (sync_single_for_device) to write the data back to memory, ensuring the DMA buffer holds the freshest data to be sent, before kicking the DMA TX trigger.

Note that on some platforms, such as ARM, the CPU uses different structures for reads and writes (reads go through the cache, writes may go through a write buffer), so when establishing a streaming DMA mapping the data direction in the DMA channel must be specified, so that the kernel knows whether to operate on the cache or the write buffer.
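
A sketch of the two sync calls with their directions, assuming a buffer that stays mapped across transfers (for a per-transfer mapping, dma_map_single()/dma_unmap_single() alone are enough):

```c
#include <linux/dma-mapping.h>

/* RX: before the CPU reads a buffer the device just wrote, invalidate/refresh
 * the CPU's view of it. */
static void rx_sync_example(struct device *dev, dma_addr_t buf_dma, size_t len)
{
	dma_sync_single_for_cpu(dev, buf_dma, len, DMA_FROM_DEVICE);
	/* ... the CPU may now parse the packet safely ... */
}

/* TX: after the CPU fills a buffer, flush it so the device sees the latest
 * data, then hand it back to the device and trigger the transfer. */
static void tx_sync_example(struct device *dev, dma_addr_t buf_dma, size_t len)
{
	/* ... the CPU writes the packet into the buffer ... */
	dma_sync_single_for_device(dev, buf_dma, len, DMA_TO_DEVICE);
}
```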

5.4 Scatter/Gather DMA Mappings

So far, the discussion of mapping buffers for DMA has been limited to a single buffer; we now turn to another type of DMA mapping: the scatter/gather mapping.

A scatter/gather mapping organizes several DMA buffers scattered across the virtual address space through an array or list of struct scatterlist, and then moves the data between RAM and the device in a single DMA transfer operation. It is analogous to the scatter/gather I/O features provided by the WSA family of APIs in Winsock.


The figure above shows a scatter/gather mapping used for a DMA transfer between three scattered physical pages in main memory and the device. Each single page-to-device transfer can be viewed as one streaming mapping, and each has a corresponding struct scatterlist entry in the kernel.

Viewed from the CPU's perspective, three pieces of data (stored in three discrete virtual address ranges) need to be exchanged (sent or received) with the device; by building an array or list of struct scatterlist, all of the data is moved in a single DMA transfer. This reduces the number of repeated DMA transfer requests and thus improves efficiency.

In summary, the essence of a scatter/gather mapping is to transfer scattered blocks of main memory to or from the device in a single DMA operation; for each data block the kernel establishes one corresponding streaming DMA mapping. On MIPS and ARM platforms, software is still needed to guarantee cache consistency.
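
A minimal scatter/gather sketch using the kernel's struct scatterlist interface; the three buffers, their sizes, and the transfer direction are placeholders.

```c
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int sg_map_example(struct device *dev,
			  void *b0, size_t l0, void *b1, size_t l1,
			  void *b2, size_t l2)
{
	struct scatterlist sgl[3], *sg;
	int nents, i;

	sg_init_table(sgl, 3);
	sg_set_buf(&sgl[0], b0, l0);
	sg_set_buf(&sgl[1], b1, l1);
	sg_set_buf(&sgl[2], b2, l2);

	/* One call maps all three chunks; the return value may be smaller than 3
	 * if an IOMMU merged adjacent regions. */
	nents = dma_map_sg(dev, sgl, 3, DMA_TO_DEVICE);
	if (!nents)
		return -ENOMEM;

	for_each_sg(sgl, sg, nents, i) {
		/* program one descriptor per mapped segment */
		dma_addr_t addr = sg_dma_address(sg);
		unsigned int len = sg_dma_len(sg);
		(void)addr; (void)len;
	}

	dma_unmap_sg(dev, sgl, 3, DMA_TO_DEVICE);
	return 0;
}
```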
