How the kernel ensures cache data consistency

Source: Internet
Author: User

In an embedded system, the cache sits between the CPU and DDR. It is SRAM, with read and write performance far higher than DDR; by operating on whole cache lines and prefetching, it bridges the performance gap between the CPU and DDR and improves overall system performance.

As far as I know, the three mainstream embedded architectures, ARM, PowerPC, and MIPS, all use software-managed caches and provide dedicated cache-maintenance instructions: PowerPC has iccci and icbi, and ARM exposes cache operations through the CP15 coprocessor.

There are two basic cache operations: write-back and invalidate. Write-back copies the data in the cache back to DDR; invalidate marks the cached data invalid, so the next read of that data must fetch it from DDR again. Both operations exist to keep the cache and memory consistent.
In the kernel, these cache operations are wrapped in per-platform assembly code. Taking the 3.4.55 kernel as an example, the cache operations for the ARMv7 processor are implemented in arch/arm/mm/cache-v7.S, where the v7_dma_flush_range flush function performs both operations, write-back and invalidate.

The cache operation functions are similar on the other platforms (e.g. MIPS).


My sensitivity to caches comes from having been burned by cache problems twice. Once, while porting U-Boot, I paid no attention to the cache and the NIC DMA simply would not work: on the surface the DMA descriptors were all configured correctly, yet the DMA never ran. Only after combing through the entire U-Boot boot code did I consider that the cache was the problem: the descriptor data I wrote had not actually reached DDR, only the cache. Since U-Boot has modest performance requirements, I simply turned the cache off, and the DMA worked. That problem cost me more than half a month. Cache problems are genuinely hard to debug, because reading back the data "in DDR" shows it as completely correct (the read actually returns the data in the cache), so you can only infer the cause by speculation. That is why I am now so alert to cache operations.

Since embedded processors use software-managed caches, our code must operate the cache explicitly. Yet in kernel development we rarely operate the cache directly, so when does the kernel perform cache operations?

First, we need to understand why cache operations are needed at all; one can only say that the cache is both an angel and a devil.

The cache improves system performance, but it can also make data inconsistent. Software cache management on embedded processors exists precisely to guarantee data consistency.

Where do we need to ensure data consistency?


For data operated on exclusively by the CPU, consistency is automatic. That is, if the data is read and written only by the CPU, with no operations opaque to it, the CPU simply reads and writes the data through the cache, and our code (code is executed by the processor, so we should look at the problem from the processor's perspective) need not worry about cache consistency.

Thinking it over, I believe there are two situations in the kernel that require explicit data-consistency handling:

(1) Register address space. Registers are the interface between the CPU and peripherals, and some status registers are changed by the peripheral according to its own state, which is opaque to the CPU. The CPU might read a status register now, and by the next read the register has already changed, yet the CPU would still be reading the stale value cached in the cache. Register operations in the kernel must be consistent, however, since they are the basis of peripheral control. IO space is mapped into kernel space via ioremap, and ioremap configures the page table entries for register addresses as uncached: the data bypasses the cache and is read directly from the address space, so data consistency is ensured.
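As a minimal sketch of this pattern (the base address, size, and register offset below are invented for illustration, not from any real device), a driver maps its register block with ioremap and then accesses it with readl/writel, each of which reaches the device directly because the mapping is uncached:

```c
#include <linux/io.h>
#include <linux/errno.h>

#define MY_DEV_BASE  0x10000000UL  /* hypothetical physical base address */
#define MY_DEV_SIZE  0x1000
#define STATUS_REG   0x04          /* hypothetical status-register offset */

static void __iomem *regs;

static int my_dev_init(void)
{
	/* ioremap sets up an uncached page-table mapping for the region. */
	regs = ioremap(MY_DEV_BASE, MY_DEV_SIZE);
	if (!regs)
		return -ENOMEM;

	/* Every readl() is a real bus access, never a stale cache hit. */
	u32 status = readl(regs + STATUS_REG);
	(void)status;
	return 0;
}
```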

(2) The address space of DMA buffers. DMA operations are also opaque to the CPU: the device can update memory in ways completely invisible to the CPU. And vice versa: when the CPU writes data to a DMA buffer, the writes may actually land only in the cache; if the DMA is then started, the data in DDR that the device operates on is not what the CPU actually intended.


How do DMA operations in the kernel guarantee cache consistency?

The "Memory Mapping and DMA" chapter of LDD3 describes the functions of the generic DMA layer in detail. Some of its statements left me quite confused at the time; re-reading them now from the cache's perspective, they are much clearer.

The generic DMA layer provides two kinds of DMA mappings:
(1) Coherent (consistent) mapping; representative functions:
void *dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp);
void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr_t handle);


(2) Streaming DMA mapping; representative functions:
dma_addr_t dma_map_single(struct device *dev, void *cpu_addr, size_t size, enum dma_data_direction dir);
void dma_unmap_single(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir);

void dma_sync_single_for_cpu(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir);
void dma_sync_single_for_device(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir);

int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir);
void dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir);

First, let us see how the coherent mapping guarantees cache data consistency, by looking directly at the dma_alloc_coherent implementation.

void *dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp)
{
	void *memory;

	if (dma_alloc_from_coherent(dev, size, handle, &memory))
		return memory;

	return __dma_alloc(dev, size, handle, gfp,
			   pgprot_dmacoherent(pgprot_kernel),
			   __builtin_return_address(0));
}

The key point is pgprot_dmacoherent, which is implemented as follows:

#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
#define pgprot_dmacoherent(prot) \
	__pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_BUFFERABLE | L_PTE_XN)
#define __HAVE_PHYS_MEM_ACCESS_PROT
struct file;
extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
				     unsigned long size, pgprot_t vma_prot);
#else
#define pgprot_dmacoherent(prot) \
	__pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_UNCACHED | L_PTE_XN)
#endif
In effect, this modifies the page attributes to uncached. It helps to understand that these page attributes are part of the kernel's page-table management: when a page fault fills the TLB, the attribute is written into the TLB entry's memory-type field. This guarantees that the address space mapped by dma_alloc_coherent is uncached.
__dma_alloc in turn calls __dma_alloc_buffer (around line 100 of arch/arm/mm/dma-mapping.c), which contains:

	/*
	 * Ensure that the allocated pages are zeroed, and that any data
	 * lurking in the kernel direct-mapped region is invalidated.
	 */
	ptr = page_address(page);
	memset(ptr, 0, size);
	dmac_flush_range(ptr, ptr + size);
	outer_flush_range(__pa(ptr), __pa(ptr) + size);

After the buffer is allocated, it is flushed from the cache, and possibly from the outer cache (the L2 cache) as well.

So dma_alloc_coherent first flushes the allocated buffer out of the cache, then modifies the buffer's page-table entries to uncached. This guarantees that the buffer's data stays consistent whether the DMA or the CPU operates on it.
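A minimal usage sketch (the ring size and variable names are hypothetical): the caller gets back both a kernel virtual address for the CPU and a bus address for the device, and because the mapping is uncached, both sides may touch the buffer at any time without explicit cache maintenance:

```c
#include <linux/dma-mapping.h>
#include <linux/errno.h>

#define RING_BYTES 4096  /* hypothetical descriptor-ring size */

static void *ring_cpu;       /* uncached virtual address for the CPU */
static dma_addr_t ring_dma;  /* bus address handed to the device */

static int alloc_ring(struct device *dev)
{
	ring_cpu = dma_alloc_coherent(dev, RING_BYTES, &ring_dma, GFP_KERNEL);
	if (!ring_cpu)
		return -ENOMEM;

	/* CPU writes land directly in DDR (the mapping is uncached),
	 * so the device sees them as soon as it is given ring_dma. */
	return 0;
}
```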

LDD3 says a coherent-mapped buffer can be accessed simultaneously by the CPU and the DMA; it is the uncached TLB mapping that makes this safe.

Now look at the streaming DMA mapping, taking dma_map_single as the example. The call chain runs deep, so I only list the call relationships; the cache-related calls are as follows:

dma_map_single ==> __dma_map_page ==> __dma_page_cpu_to_dev ==> ___dma_page_cpu_to_dev

___dma_page_cpu_to_dev is implemented as follows:

void ___dma_page_cpu_to_dev(struct page *page, unsigned long off,
	size_t size, enum dma_data_direction dir)
{
	unsigned long paddr;

	dma_cache_maint_page(page, off, size, dir, dmac_map_area);

	paddr = page_to_phys(page) + off;
	if (dir == DMA_FROM_DEVICE) {
		outer_inv_range(paddr, paddr + size);
	} else {
		outer_clean_range(paddr, paddr + size);
	}
	/* FIXME: non-speculating: flush on bidirectional mappings? */
}

dma_cache_maint_page calls dmac_map_area on the mapped address range, which eventually reaches the ARMv7 cache handler v7_dma_map_area in arch/arm/mm/cache-v7.S, as follows:

/*
 *	dma_map_area(start, size, dir)
 *	- start	- kernel virtual start address
 *	- size	- size of region
 *	- dir	- DMA direction
 */
ENTRY(v7_dma_map_area)
	add	r1, r1, r0
	teq	r2, #DMA_FROM_DEVICE
	beq	v7_dma_inv_range
	b	v7_dma_clean_range
ENDPROC(v7_dma_map_area)
If the specified direction is DMA_FROM_DEVICE, v7_dma_inv_range invalidates the cache for that address range; if the direction is DMA_TO_DEVICE, v7_dma_clean_range writes the cache for that range back to memory. Either way, cache data consistency is guaranteed.
The outer cache is likewise maintained in ___dma_page_cpu_to_dev, as shown above.

dma_map_single does not modify the storage attributes of the buffer mapping at cpu_addr; it stays cached. Instead, it cleans or invalidates the cache for the buffer according to the data direction, which also guarantees cache data consistency.

Now look at dma_unmap_single; the call relationships are as follows:

dma_unmap_single ==> __dma_unmap_page ==> __dma_page_dev_to_cpu ==> ___dma_page_dev_to_cpu ==> dmac_unmap_area ==> v7_dma_unmap_area

v7_dma_unmap_area is implemented as follows:

/*
 *	dma_unmap_area(start, size, dir)
 *	- start	- kernel virtual start address
 *	- size	- size of region
 *	- dir	- DMA direction
 */
ENTRY(v7_dma_unmap_area)
	add	r1, r1, r0
	teq	r2, #DMA_TO_DEVICE
	bne	v7_dma_inv_range
	mov	pc, lr
ENDPROC(v7_dma_unmap_area)

If the specified direction is DMA_TO_DEVICE, nothing is done; if the direction is DMA_FROM_DEVICE, v7_dma_unmap_area invalidates the cache for that address range.
My understanding: for data the CPU is going to read (DMA_FROM_DEVICE), the cache must be consistent when the buffer is released, because the CPU will read and process that data next. For data written by the CPU (DMA_TO_DEVICE), once the buffer is released the CPU no longer operates on it, so nothing needs to be done.

As can be seen here, and as LDD3 states, the streaming DMA mapping imposes strict rules on when the CPU may operate on the DMA buffer: the CPU may only touch the buffer after dma_unmap_single.
The reason is that the streaming DMA buffer is cached: the cache is maintained at map time, and again at unmap time after the device's DMA completes (written back or invalidated according to the data direction), which guarantees cache data consistency. CPU accesses to the buffer before unmap have no consistency guarantee, so the kernel must strictly enforce this operation ordering.
Of course, the kernel also provides the functions dma_sync_single_for_cpu and dma_sync_single_for_device, which allow the buffer to be accessed before it is released. Clearly, these two functions must maintain the cache again to ensure data consistency.
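A sketch of the streaming discipline described above (the buffer and length come from the caller; error handling is trimmed, and the device-programming step is only indicated by a comment): map before starting the DMA, do not touch the buffer while it is mapped, and only read it after unmap (or after dma_sync_single_for_cpu):

```c
#include <linux/dma-mapping.h>

static void rx_one_packet(struct device *dev, void *buf, size_t len)
{
	dma_addr_t handle;

	/* Map: cache lines covering buf are invalidated (DMA_FROM_DEVICE),
	 * so the CPU cannot later hit stale pre-DMA data. */
	handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, handle))
		return;

	/* ... program the device with 'handle' and wait for DMA completion;
	 * the CPU must not read buf in this window ... */

	/* Unmap: the cache is maintained again; only now may the CPU
	 * safely read the data the device wrote. */
	dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
	/* process(buf, len); */
}
```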

At this point both kinds of DMA mapping have been analyzed. Clearly, the core difference between a coherent mapping and a streaming DMA mapping is whether the buffer's page-table mapping is cached; it is the uncached page table that allows the CPU and the peripheral to access the buffer simultaneously.
But these are interfaces the kernel has already wrapped for driver developers; a driver developer need not care about the cache at all, only call these interfaces according to the rules in LDD3.
This is why cache operations are seldom seen in drivers: the kernel code makes cache maintenance transparent to the driver.
It reminds me of developing a NIC driver: the DMA descriptors are allocated with a coherent mapping, because the descriptors must be operated on by the CPU and the device concurrently, while the packet send and receive buffers use streaming mappings, mapped as they are used, with the CPU free to operate on the data after unmap.
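As a sketch of that division of labor (the structure and all field and function names are invented for illustration), a NIC driver typically keeps its descriptor ring coherent and maps each packet buffer streaming, per transfer:

```c
#include <linux/dma-mapping.h>
#include <linux/errno.h>

struct my_nic {                      /* hypothetical driver state */
	struct device *dev;
	void *desc_ring;             /* coherent: CPU and NIC poll it concurrently */
	dma_addr_t desc_ring_dma;
};

static int my_nic_setup(struct my_nic *nic, size_t ring_bytes)
{
	/* Descriptors: coherent mapping, no per-access cache maintenance. */
	nic->desc_ring = dma_alloc_coherent(nic->dev, ring_bytes,
					    &nic->desc_ring_dma, GFP_KERNEL);
	return nic->desc_ring ? 0 : -ENOMEM;
}

static dma_addr_t my_nic_tx_map(struct my_nic *nic, void *pkt, size_t len)
{
	/* Packet data: streaming mapping, cleaned out to DDR at map time
	 * so the device reads exactly what the CPU wrote. */
	return dma_map_single(nic->dev, pkt, len, DMA_TO_DEVICE);
}
```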

So far, among the kernel TLB mappings I have encountered, only two are uncached: register space (ioremap) and coherent DMA buffers (dma_alloc_coherent). All other address spaces are cached, to preserve system performance.

So, when does the kernel operate the cache? When mapping and unmapping DMA buffers; the driver code itself need not be concerned with the cache. That is what this article has analyzed.


Copyright notice: this is an original article by the author and may not be reproduced without permission.
