ARM Cache Coherency

There are two ways to improve the performance of a system:

1) Keep improving the performance of a single core by raising the clock frequency and lowering the threshold voltage (VT), which increases both dynamic and leakage power;

2) Increase the number of processors.

ARM's big.LITTLE processor clusters take the second approach, minimizing power consumption through power gating and DVFS.

But multiprocessing brings another problem: cache coherence.

Within a cluster, ARM uses MPCore multi-core coherency technology:

1) It implements a MESI-based cache coherency protocol with some added features: direct cache-to-cache copy of clean data and direct cache-to-cache move of dirty data within the cluster, so data does not have to be written back to main memory first.

2) It also includes an SCU (Snoop Control Unit), which duplicates all L1 data cache tags as a directory to reduce the bus bandwidth wasted by broadcast snoops.

3) MPCore supports an optional ACP (Accelerator Coherency Port), through which an accelerator can read and write the caches inside the processor cluster. The processors, however, cannot access the accelerator's own cache, and coherency with that cache is not guaranteed.

For multiple clusters, the AMBA4 ACE protocol (AXI Coherency Extensions) can be used:

1) ACE and ACE-Lite introduce system-level coherency, cache maintenance, DVM (Distributed Virtual Memory), and barrier transaction support.

2) ACE itself supports the five states of the MOESI cache coherency model; masters implementing MESI, MOESI, MEI, and so on are all compatible.

3) ACE must be paired with a suitable system interconnect that processes all shareable transactions. On receiving a transaction from a master, the interconnect may issue a speculative read or wait for the snoop responses; it may contain a directory or snoop filter, or simply broadcast snoops to all masters.

4) ACE supports system-level coherency, which covers all masters, including GPUs, DMA engines, and dissimilar CPUs.

AMBA development roadmap:

1) AXI4 supports long bursts but does not support write interleaving;

2) AXI4-Stream is designed for large-scale data streaming applications; it is a point-to-point protocol with no address channel, only a data channel;

3) AXI4-Lite is a simplified version of AXI4, mainly used as an upgrade for peripherals that would otherwise need APB.

  

Software-based coherency approach:

Cache coherency can also be maintained in software. Earlier single-processor systems had just a small L1 cache; today's SoCs, however, are multiprocessor, with L2, L3, and other caches, as well as additional caching masters such as GPUs. A pure software implementation is now barely feasible: it is too difficult and performs poorly. A sketch of what this explicit management demands is shown below.
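As a concrete illustration, here is a minimal sketch of explicit cache maintenance around a DMA buffer on AArch64, the kind of work software-managed coherency imposes on every shared buffer. The 64-byte line size is an assumption, `dc ivac` requires EL1 privilege, and the surrounding driver code is hypothetical; this is not any particular driver's implementation.

```c
/* Sketch: software-managed coherency around a DMA buffer on AArch64.
 * Assumes a 64-byte cache line and EL1 privilege for "dc ivac". */
#include <stdint.h>
#include <stddef.h>

#define LINE 64

/* Clean (write back) every line of the buffer before the device reads it. */
static void clean_range(void *buf, size_t len)
{
    for (uintptr_t p = (uintptr_t)buf & ~(uintptr_t)(LINE - 1);
         p < (uintptr_t)buf + len; p += LINE)
        __asm__ volatile("dc cvac, %0" :: "r"(p) : "memory");
    __asm__ volatile("dsb sy" ::: "memory");   /* order against the device */
}

/* Invalidate every line of the buffer before the CPU reads device data. */
static void invalidate_range(void *buf, size_t len)
{
    for (uintptr_t p = (uintptr_t)buf & ~(uintptr_t)(LINE - 1);
         p < (uintptr_t)buf + len; p += LINE)
        __asm__ volatile("dc ivac, %0" :: "r"(p) : "memory"); /* EL1 only */
    __asm__ volatile("dsb sy" ::: "memory");
}
```

Every buffer shared between the CPU and another master needs this clean/invalidate discipline, which is why the software approach stops scaling once many caches and masters are involved.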

Hardware-based coherency approaches:

1) Snooping cache coherency protocols: every master "listens in" on all shared-data transactions.

     Read: when a read address is issued, every processor checks its own cache for that address; if a copy exists, the data is returned directly and memory is not accessed;

     Write: when a write address is issued, every processor checks its own cache, and any copy of that address must be invalidated.

The coherency traffic is relatively large, on the order of N(N-1), because each request is broadcast to all processors; as the processor count grows, efficiency keeps dropping. A toy model of this broadcast cost is sketched below.
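A minimal, hypothetical model of the broadcast cost (not any real protocol): every write generates one snoop message per peer cache, so N masters each writing once to the same line cost N(N-1) messages.

```c
/* Toy model of snoop-on-write invalidation (illustrative only).
 * Each write is broadcast to the other N-1 caches. */
#include <stdio.h>
#include <stdbool.h>

#define NCACHES 4
#define NLINES  16

static bool valid[NCACHES][NLINES];   /* one bit per cached copy     */
static long snoop_msgs;               /* coherency traffic counter   */

static void write_line(int core, int line)
{
    valid[core][line] = true;
    for (int c = 0; c < NCACHES; c++) {      /* broadcast to every peer    */
        if (c == core) continue;
        snoop_msgs++;                        /* one snoop message per peer */
        valid[c][line] = false;              /* invalidate any stale copy  */
    }
}

int main(void)
{
    for (int c = 0; c < NCACHES; c++)
        write_line(c, 0);                    /* N writes to the same line  */
    printf("snoop messages: %ld (N*(N-1) = %d)\n",
           snoop_msgs, NCACHES * (NCACHES - 1));
    return 0;
}
```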

2) Directory-based cache coherency protocols: the system keeps a single directory that records the state of every cache line. A master issuing a transaction consults the directory first, which then directs the coherency traffic only to the relevant masters, reducing overall traffic. In the best case the traffic is 2N, and in the worst case N(N-1+1), because the directory must be checked first.

The directory requires a large block of on-chip RAM; placing it off-chip instead increases system latency.

In practice there is room for optimization: a snoop-based system, for example, can add snoop filters to reduce coherency traffic. A directory sketch in the same toy style follows.
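The same toy model with a directory: the directory tracks a sharer set per line and unicasts invalidations only to recorded sharers, so traffic approaches 2N when sharing is rare. The data structures are illustrative, not a real protocol.

```c
/* Toy directory-based model: a sharer bitmask per line lets the
 * directory unicast invalidations only to caches holding a copy. */
#include <stdio.h>

#define NCACHES 4
#define NLINES  16

static unsigned sharers[NLINES];  /* bitmask: which caches hold the line */
static long msgs;                 /* coherency traffic counter           */

static void read_line(int core, int line)
{
    msgs++;                            /* one request to the directory    */
    sharers[line] |= 1u << core;       /* directory records the sharer    */
}

static void write_line(int core, int line)
{
    msgs++;                            /* one request to the directory    */
    for (int c = 0; c < NCACHES; c++)
        if (c != core && (sharers[line] & (1u << c)))
            msgs++;                    /* invalidate only actual sharers  */
    sharers[line] = 1u << core;        /* single writer remains           */
}

int main(void)
{
    for (int c = 0; c < NCACHES; c++)
        write_line(c, 0);              /* same pattern as the snoop model */
    printf("messages with a directory: %ld\n", msgs);
    return 0;
}
```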

ACE supports both the snooping and directory-based approaches, and even hybrid protocols.

On top of AXI, ACE adds three channels for sending and receiving coherency transactions.

On the existing channels, new signals are added:

ARSNOOP and AWSNOOP, encoding the snoop type of shareable transactions;

ARBAR and AWBAR, denoting barrier transactions.

An illustrative view of these additions is sketched below.
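For illustration only, the added channels and signals can be pictured as C structs. The signal names (ACADDR, ACSNOOP, ACPROT, CRRESP, CDDATA, ARSNOOP, ARBAR, and so on) follow the AMBA 4 ACE specification, but the field widths here are simplified placeholders, not the spec's exact widths.

```c
/* Illustrative view of what ACE adds on top of AXI. Field widths are
 * simplified placeholders; names follow the AMBA 4 ACE signal names. */
#include <stdint.h>

struct ace_ac { uint64_t acaddr; uint8_t acsnoop; uint8_t acprot; }; /* snoop request channel  */
struct ace_cr { uint8_t crresp; };                                   /* snoop response channel */
struct ace_cd { uint64_t cddata; uint8_t cdlast; };                  /* snoop data channel     */

/* New signals added to the existing AXI address channels: */
struct ar_extras { uint8_t arsnoop; uint8_t arbar; }; /* read address channel  */
struct aw_extras { uint8_t awsnoop; uint8_t awbar; }; /* write address channel */
```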

    

ACE-Lite adds the new signals on top of AXI without adding the new channels.

  An ACE-Lite master can snoop other ACE-compliant masters, but cannot itself be snooped.

Take ARM's CCI-400 interconnect as an example: it supports two CPU clusters and three ACE-Lite I/O coherent masters.

    

ACE introduces many new transactions, which can be grouped according to memory attributes.

    

ACE-Lite provides I/O coherency: an ACE-Lite master can issue three groups of transactions, non-shared, non-cached, and cache maintenance. This lets uncached masters snoop ACE coherent masters;

    for example, a Gigabit Ethernet controller can directly read and write cached data shared with the CPU.

 

DVM (Distributed Virtual Memory) operations ensure consistency of the TLBs inside the MMUs; they support TLB invalidation, branch-predictor invalidation, and instruction-cache invalidation. A sketch of the kind of operation DVM distributes is shown below.
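As an illustration, here is a broadcast TLB invalidate on AArch64; the Inner Shareable ("is") variant is the one the interconnect propagates to other masters. The code assumes EL1 privilege and a 4 KB page size, and leaves the ASID field zero for simplicity.

```c
#include <stdint.h>

/* Broadcast TLB invalidate by virtual address (EL1, Inner Shareable).
 * Assumes 4 KB pages: TLBI VAE1IS takes VA[55:12] in the low bits of Xt
 * (ASID field left as zero here for simplicity). */
static inline void tlbi_va_broadcast(uint64_t va)
{
    uint64_t arg = va >> 12;
    __asm__ volatile("dsb ishst" ::: "memory");         /* drain prior PTE writes  */
    __asm__ volatile("tlbi vae1is, %0" :: "r"(arg) : "memory");
    __asm__ volatile("dsb ish" ::: "memory");           /* wait for broadcast done */
    __asm__ volatile("isb" ::: "memory");
}
```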

  

Cache coherence basics:

The main goal of cache coherence design is that in a multicore system, the multiple caches behave the same as in a single-core system.

Coherence can be defined over the multiple copies of a memory location as Single-Writer-Multiple-Reader (SWMR): at any logical time, at most one core may write address A, or multiple cores may read A. The granularity of coherence is usually the cache line size.

After a write, reads of the same address must be held off until all caches have sent a feedback signal (ACK) indicating that their copy has been invalidated or updated. A small checker for the SWMR invariant is sketched below.
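A minimal way to state the invariant in code: scan a trace of (epoch, core, address, is_write) records and flag any logical-time epoch in which a writer to an address coexists with any other reader or writer of that address. The trace format is invented for illustration.

```c
/* Toy checker for the SWMR invariant: within one logical-time epoch,
 * an address may have one writer or any number of readers, not both. */
#include <stdio.h>
#include <stdbool.h>

struct op { int epoch, core, addr; bool is_write; };

static bool swmr_ok(const struct op *t, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (i == j || t[i].epoch != t[j].epoch || t[i].addr != t[j].addr)
                continue;
            /* a writer may not coexist with any other core's access */
            if (t[i].is_write && t[i].core != t[j].core)
                return false;
        }
    return true;
}

int main(void)
{
    struct op trace[] = {
        {0, 0, 0x40, true },          /* epoch 0: core 0 writes 0x40      */
        {1, 1, 0x40, false},          /* epoch 1: cores 1 and 2 read 0x40 */
        {1, 2, 0x40, false},
    };
    printf("SWMR holds: %d\n", swmr_ok(trace, 3));
    return 0;
}
```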

In the memory system, the cache controller is responsible for issuing coherence requests and receiving coherence responses;

          the memory controller is responsible for receiving coherence requests and issuing coherence responses;

  the two are connected by the interconnect.

    

There are two classes of coherence protocol, snooping and directory; their transactions and actions differ, but the stable states are the same.

1) Stable states: many coherence protocols use a subset of the MOESI model.

M (Modified): the cache line is valid, exclusive, owned, and possibly dirty.

S (Shared): the cache line is valid, but not exclusive, not dirty, and not owned.

I (Invalid): the cache line is invalid, i.e., not readable or writable.

    MSI forms the most basic set of protocol states; there are two optional extension states, O and E:

O (Owned): the cache line is valid and owned, but not exclusive, and possibly dirty; the copy in main memory is probably stale.

E (Exclusive): the cache line is valid, exclusive, and clean. These properties are tabulated in the sketch below.
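The state properties just listed can be written down directly; a minimal C sketch mirroring the definitions above (how ownership of E is treated varies between protocols, so this simply follows the text):

```c
#include <stdbool.h>

/* The five MOESI stable states and the properties listed above
 * (valid / exclusive / owned / possibly dirty). */
enum moesi_state { MOESI_M, MOESI_O, MOESI_E, MOESI_S, MOESI_I };

struct line_props { bool valid, exclusive, owned, maybe_dirty; };

static const struct line_props props[] = {
    [MOESI_M] = { true,  true,  true,  true  },  /* Modified             */
    [MOESI_O] = { true,  false, true,  true  },  /* Owned; memory stale  */
    [MOESI_E] = { true,  true,  false, false },  /* Exclusive, clean     */
    [MOESI_S] = { true,  false, false, false },  /* Shared               */
    [MOESI_I] = { false, false, false, false },  /* Invalid              */
};
```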

      

2) Common transactions issued by the cache controller:

    

3) Common requests from the core to the cache controller:

    

4) Snooping protocols broadcast a request message to all coherence controllers; the order in which these requests reach each core is not fixed and depends on the concrete interconnect implementation.

Directory protocols unicast each request to a specific cache controller or memory controller.

The snooping structure is simple, but does not scale easily to large numbers of cores;

a directory can scale to a large core count, but adds latency to every coherence request.

5) When a core writes a cache line, the coherence protocol's action can be one of two kinds, invalidation or update, and this choice is independent of snooping versus directory.

    Invalidation: when a core writes a cache line, the copies in other caches are marked invalid.

    Update: when a core writes a cache line, the copies in other caches are updated to the latest value.

In practice, update is rarely used: update operations occupy relatively more bus bandwidth, and they complicate the memory consistency model, because with atomic operations the case where multiple caches appear to update their data concurrently becomes hard to reason about.

The relationship between the cache and the MMU:

By how they index and tag lines, caches can work in several ways: physically indexed physically tagged (PIPT), virtually indexed virtually tagged (VIVT), and virtually indexed physically tagged (VIPT), among others.

1) Physically indexed, physically tagged (PIPT): the cache operates purely on physical addresses, simple and direct, with no ambiguity.

Drawback: in a multi-process operating system, each process's instructions and data are addressed virtually, and the CPU issues memory accesses using virtual addresses,

so every access must first wait for the MMU to translate the virtual address into a physical address, which adds latency to every operation.

2) Virtually indexed, virtually tagged (VIVT): the cache is addressed purely with virtual addresses. Because multiple virtual addresses can map to one physical address, each line must carry a process identifier in addition to the original tag to distinguish processes,

and shared memory is awkward to handle, since the same shared region can appear at different virtual addresses even within one process.

The structure is too complex.

3) Virtually indexed, physically tagged (VIPT) is the mode most widely used today. Virtual index means that when the CPU issues an address request, the low bits of the address are matched against the cache index

(the low bits are usually the offset within the page, where the virtual and physical addresses are identical);

physical tag means the high bits of the virtual address are translated through the page tables in the MMU to obtain the physical page address, which is then matched against the cache tags.

This allows the virtual-index lookup to proceed in parallel with the MMU translation. A worked example of the address split follows.
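A worked example of the split, assuming a 32 KB, 4-way L1 with 64-byte lines (a common ARM geometry, but an assumption here): the cache has 32 KB / (64 B x 4) = 128 sets, so index + offset = 7 + 6 = 13 bits, one more than the 12-bit offset of a 4 KB page; that one overlapping bit is why the hardware must deal with potential aliases.

```c
/* VIPT address split for an assumed 32 KB, 4-way, 64-byte-line cache. */
#include <stdio.h>
#include <stdint.h>

#define LINE_BYTES   64
#define WAYS          4
#define CACHE_BYTES  (32 * 1024)
#define SETS         (CACHE_BYTES / (LINE_BYTES * WAYS))   /* 128 sets */

int main(void)
{
    uint64_t va = 0x0000ffff1234abcdull;          /* example virtual address */
    unsigned offset = va & (LINE_BYTES - 1);      /* VA[5:0]                 */
    unsigned index  = (va / LINE_BYTES) % SETS;   /* VA[12:6]                */
    printf("sets=%d offset=%u index=%u (tag comes from the physical address)\n",
           SETS, offset, index);
    return 0;
}
```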

ARM MPCore cache structure: the L1 cache generally sits inside the processor and is split into an L1 data cache and an L1 instruction cache (8 KB to 64 KB).

The L1 instruction cache not only caches instructions but also performs dynamic branch prediction.

Some instructions are not predicted: those that use the PC as the destination register, the BXJ instruction, and return-from-exception instructions.

    Most implementations are 2-way set associative with 64-byte cache lines.

The L1 data cache is a physically indexed, physically tagged cache.

It contains an internal exclusive monitor, which holds a list of the currently valid exclusive accesses and can return EXOKAY directly,

and it can generate ACE and CHI transactions.

    Most implementations are 4-way set associative with 64-byte cache lines. A sketch of how software exercises the exclusive monitor follows.
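A sketch of how software drives the exclusive monitor on AArch64: LDXR arms the monitor for the address, and STXR succeeds (status 0, corresponding to an EXOKAY response) only if no other master touched the line in between. The retry-loop increment is an illustrative use, not lifted from any particular codebase.

```c
#include <stdint.h>

/* Atomically increment *p using an exclusive load/store pair.
 * Compile for AArch64. */
static inline int atomic_inc_exclusive(uint64_t *p)
{
    uint64_t val;
    uint32_t fail;
    __asm__ volatile(
        "1: ldxr  %0, [%2]      \n"   /* load-exclusive: arm the monitor    */
        "   add   %0, %0, #1    \n"
        "   stxr  %w1, %0, [%2] \n"   /* store-exclusive: 0 means EXOKAY    */
        "   cbnz  %w1, 1b       \n"   /* retry if another master intervened */
        : "=&r"(val), "=&r"(fail)
        : "r"(p)
        : "memory");
    return (int)val;
}
```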

  The L2 cache block includes an integrated SCU (connecting up to 4 cores within a cluster) and the L2 cache itself (128 KB to 2 MB).

The SCU duplicates the L1 data cache tags to maintain coherency among the 4 cores.

Snooping beyond the cluster is not handled by hardware inside the L2 cache; instead, the connection to main memory can be configured as either ACE or CHI.

    The L2 is a physically indexed, physically tagged cache, 8-way to 16-way set associative.

The SCU supports direct cache-to-cache transfers, allowing dirty cache lines to be moved between cores, and has a built-in tag filter so that coherency requests are sent only to the cores that need them.
