Linux kernel synchronization mechanisms (II): per-CPU variables

Source: Internet
Author: User

First, the origin: why introduce per-CPU variables?

1. Performance problems caused by locking the bus

On the ARM platform, before ARMv6, the SWP and SWPB instructions were used to support access to shared memory:

SWP <Rt>, <Rt2>, [<Rn>]

Rn holds the memory address that the SWP instruction operates on. The instruction loads the data at the address in Rn into the Rt register and stores the value of the Rt2 register to that same address.

The read-modify-write problem described in the document on atomic operations is, in essence, the problem of keeping a read access and a write access to memory atomic; that is, the read and the write must not be interrupted. It can be solved in hardware, in software, or by a combination of the two (HW/SW). The solution offered by early ARM CPUs relies on hardware: the SWP instruction performs one memory read and one memory write, but from the programmer's point of view it is atomic, and no asynchronous event can intervene between the read and the write. How does the underlying hardware achieve this? The hardware provides a lock signal. It is asserted during the memory operation to tell the bus that the access must not be interrupted, and it is cleared only after the two memory accesses required by SWP have completed.
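The swap semantics described above can be modeled in user space with C11 atomics. This is only a sketch: `swp_model`, `swp_lock`, `swp_trylock`, and `swp_unlock` are hypothetical names, and `atomic_exchange` stands in for the SWP instruction; it is not the kernel's code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Model of "SWP Rt, Rt2, [Rn]": store new_value to *addr and return the
 * previous contents as one indivisible step. */
static unsigned int swp_model(atomic_uint *addr, unsigned int new_value)
{
    return atomic_exchange(addr, new_value);
}

/* Try once to take a test-and-set lock (0 = free, 1 = held).
 * Returns 1 if the lock was acquired, 0 if it was already held. */
static int swp_trylock(atomic_uint *lock)
{
    return swp_model(lock, 1u) == 0u;
}

/* Spin until the swap reads back 0, i.e. until we are the ones who
 * changed the lock word from free to held. */
static void swp_lock(atomic_uint *lock)
{
    while (!swp_trylock(lock))
        ; /* busy-wait */
}

static void swp_unlock(atomic_uint *lock)
{
    assert(atomic_load(lock) == 1u); /* must be held by the caller */
    atomic_store(lock, 0u);
}
```

Every CPU spinning in `swp_lock` keeps issuing swaps against the same lock word, which is the kind of contention on one shared location that the rest of this section tries to avoid.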

Locking the memory bus has a serious impact on the performance of multicore systems (while one processor holds the lock, the memory bus accesses of the other processors in the system are held up). How can this be solved? The best locking mechanism is to use no lock at all, so a drastic way out is to stop sharing the data between processors and give each CPU its own copy instead.

Of course, technology moves on: from ARMv6 onward the use of SWP is no longer recommended, and the LDREX and STREX instructions are provided instead. This approach solves the atomic operation problem with a combination of hardware and software. The code looks more complex, but system performance improves. From the hardware point of view, LDREX and STREX are in fact lock-free as well. Since the bus is no longer locked, the rationale for per-CPU variables might seem to be gone. Once cache behavior is taken into account, however, per-CPU variables still make sense.
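The load-exclusive/store-exclusive pattern is a retry loop: read the old value, compute the new one, and store it only if nobody modified the location in the meantime. A minimal user-space sketch with C11 atomics follows. The name `add_return_model` is hypothetical, and the compare-and-exchange call only plays the role of the LDREX/STREX pair; which instructions the compiler actually emits depends on the target.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Atomically add delta to *counter and return the new value, without
 * holding any lock across the read-modify-write sequence. */
static int add_return_model(atomic_int *counter, int delta)
{
    assert(counter != NULL);

    int old = atomic_load(counter);   /* plays the role of LDREX */
    int new_value;

    do {
        new_value = old + delta;
        /* Plays the role of STREX: the store succeeds only if *counter
         * still equals old. On failure, old is refreshed with the
         * current value and the loop tries again. */
    } while (!atomic_compare_exchange_weak(counter, &old, new_value));

    return new_value;
}
```

No other CPU is blocked while the loop runs; a CPU that loses the race simply retries, which is why this approach scales better than locking the bus.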

2. The impact of the cache

The memory hierarchy document already covered the basics of memory, so they are not repeated here. Let's assume that the caches in a multicore system are organized as follows:

Each CPU has its own L1 cache (both a data cache and an instruction cache), and all CPUs share one L2 cache. The access speeds of L1, L2, and main memory differ greatly. Performance is of course highest on an L1 cache hit, because there is no need to go to the next level of memory to load the cache line.

Let's first look at memory that is shared between multiple CPUs. In this case, any CPU that modifies the shared memory causes the corresponding cache line in the L1 cache of every other CPU to be invalidated (the hardware does this). This costs performance, but the system has to do it because the caches must be kept coherent. Turning a shared variable into per-CPU memory is essentially a way of spending more memory to gain performance. Once a variable shared between multiple CPUs becomes a private variable of each CPU, we no longer need to consider concurrency from other CPUs, only concurrency on the local CPU. Note, however, that the task must not be rescheduled while it is accessing a per-CPU variable; more precisely, it must not be migrated to another CPU. The current kernel approach is to disable preemption while a per-CPU variable is accessed. Locking is therefore not avoided entirely (disabling preemption is itself a kind of lock), but it is undoubtedly a comparatively cheap one.
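The memory-for-performance trade can be illustrated with a small user-space model: one counter slot per CPU, each padded to its own cache line, with readers summing all slots. This is a sketch under stated assumptions, not kernel code. The names, the CPU count of 4, and the 64-byte cache line are all hypothetical, and the caller passes the CPU number explicitly.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS   4    /* hypothetical CPU count for this sketch */
#define MODEL_CACHELINE 64   /* assumed L1 cache line size in bytes */

/* One slot per CPU. The alignment pads each slot to a full cache line,
 * so a write by one CPU never invalidates the line that holds another
 * CPU's slot. */
struct percpu_counter_slot {
    _Alignas(MODEL_CACHELINE) long count;
};

static struct percpu_counter_slot hit_count[MODEL_NR_CPUS];

/* Fast path: touch only the slot that belongs to the given CPU. */
static void hit_count_inc(unsigned int cpu)
{
    assert(cpu < MODEL_NR_CPUS);
    hit_count[cpu].count++;
}

/* Slow path: a reader that wants the global value sums every copy. */
static long hit_count_sum(void)
{
    long sum = 0;

    for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        sum += hit_count[cpu].count;
    return sum;
}
```

The array occupies 4 × 64 bytes to hold four `long` values: that is the extra memory paid so that the increment on the fast path stays local to one CPU's cache.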

Second, the interface

1. The API for statically declaring and defining per-CPU variables is shown in the following table:

API for declaring and defining per-CPU variables | Description
DECLARE_PER_CPU(type, name), DEFINE_PER_CPU(type, name) | The ordinary per-CPU variable interface with no special requirements. No alignment is required.
DECLARE_PER_CPU_FIRST(type, name), DEFINE_PER_CPU_FIRST(type, name) | The per-CPU variable is placed at the very front of the whole per-CPU section.
DECLARE_PER_CPU_SHARED_ALIGNED(type, name), DEFINE_PER_CPU_SHARED_ALIGNED(type, name) | The per-CPU variable is aligned to the L1 cache line on SMP and needs no cache line alignment on UP.
DECLARE_PER_CPU_ALIGNED(type, name), DEFINE_PER_CPU_ALIGNED(type, name) | The per-CPU variable is aligned to the L1 cache line on both SMP and UP.
DECLARE_PER_CPU_PAGE_ALIGNED(type, name), DEFINE_PER_CPU_PAGE_ALIGNED(type, name) | The per-CPU variable is page aligned.
DECLARE_PER_CPU_READ_MOSTLY(type, name), DEFINE_PER_CPU_READ_MOSTLY(type, name) | The per-CPU variable is read mostly.

Faced with this colorful collection of per-CPU variable APIs, are you dizzy yet? These definitions serve different situations, and the main factors that distinguish them are:

- the position of the variable within the section

- the alignment of the variable

- whether the variable is handled differently on SMP and UP

- the access pattern of the per-CPU variable

For example, if the per-CPU variable you are going to define must be page aligned, define it with DEFINE_PER_CPU_PAGE_ALIGNED (and declare it with DECLARE_PER_CPU_PAGE_ALIGNED). If it only needs to be aligned to the cache line on SMP, use DEFINE_PER_CPU_SHARED_ALIGNED.

2. API for accessing statically defined per-CPU variables

A statically defined per-CPU variable cannot be accessed like an ordinary variable; specific interfaces must be used, as follows:

get_cpu_var(var)

put_cpu_var(var)

These two interfaces already include the locking mechanism (preempt disable), so the caller can use them directly to access the current CPU's copy of the variable. If the caller knows that the execution context already has preemption disabled (or holds a stronger lock, for example with interrupts disabled on the CPU), the lock-free version of the per-CPU access API, __get_cpu_var, can be used instead.
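The pairing of the locking and lock-free accessors can be sketched with a user-space model in which "preemption" is just a counter. Everything here is hypothetical and for illustration only: the `model_` names, the `packets_seen` variable, and the explicit `model_current_cpu` are assumptions and do not reflect how the kernel tracks preemption.

```c
#include <assert.h>

#define MODEL_NR_CPUS 2                  /* hypothetical CPU count */

static int model_preempt_count;          /* > 0 means preemption is disabled */
static unsigned int model_current_cpu;   /* CPU the "task" is running on */
static long packets_seen[MODEL_NR_CPUS]; /* one copy of the variable per CPU */

static void model_preempt_disable(void)
{
    model_preempt_count++;
}

static void model_preempt_enable(void)
{
    assert(model_preempt_count > 0);     /* must pair with a disable */
    model_preempt_count--;
}

/* Lock-free variant: the caller must already have preemption disabled. */
static long *model_raw_get_cpu_var(void)
{
    assert(model_preempt_count > 0);
    return &packets_seen[model_current_cpu];
}

/* Locking variant: disables preemption, then returns this CPU's copy. */
static long *model_get_cpu_var(void)
{
    model_preempt_disable();
    return model_raw_get_cpu_var();
}

/* Ends the access that model_get_cpu_var() started. */
static void model_put_cpu_var(void)
{
    model_preempt_enable();
}

/* Typical caller: get, modify the local copy, put. */
static void count_packet(void)
{
    long *seen = model_get_cpu_var();

    (*seen)++;
    model_put_cpu_var();
}
```

The point of the model is the bracket: the pointer returned by the get call is only meaningful until the matching put, because after that the task could be moved to another CPU.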

3. The API for dynamically allocating per-CPU variables is shown in the following table:

API for dynamically allocating and freeing per-CPU variables | Description
alloc_percpu(type) | Allocates a per-CPU variable of the given type and returns the address of the per-CPU variable (note: not the address of any one CPU's copy).
void free_percpu(void __percpu *ptr) | Frees the per-CPU variable that ptr points to.

4. The API for accessing dynamically allocated per-CPU variables is shown in the following table:

API for accessing per-CPU variables | Description
get_cpu_ptr | Similar to the get_cpu_var interface for static per-CPU variables, except that this interface is for dynamically allocated per-CPU variables.
put_cpu_ptr | Same as above.
per_cpu_ptr(ptr, cpu) | Given the address of the per-CPU variable and a CPU number, returns the address of the copy of the variable on that CPU.
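The distinction between "the address of the per-CPU variable" and "the address of one CPU's copy" can be modeled in user space as follows. This is a sketch only: the `model_` functions are hypothetical, take a size instead of a type, and lay the copies out back to back with calloc, which is an assumption of the model and not how the kernel places them.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 4   /* hypothetical CPU count */

/* Handle returned by the allocator. It names the variable as a whole and
 * must be translated by model_per_cpu_ptr() before being dereferenced. */
struct model_percpu {
    size_t unit_size;     /* size of one CPU's copy */
    char *copies;         /* MODEL_NR_CPUS zero-filled copies */
};

static struct model_percpu *model_alloc_percpu(size_t size)
{
    struct model_percpu *p = malloc(sizeof(*p));

    if (!p)
        return NULL;
    p->unit_size = size;
    p->copies = calloc(MODEL_NR_CPUS, size);
    if (!p->copies) {
        free(p);
        return NULL;
    }
    return p;
}

/* Translate (variable, cpu) into the address of that CPU's copy. */
static void *model_per_cpu_ptr(struct model_percpu *p, unsigned int cpu)
{
    assert(p != NULL && cpu < MODEL_NR_CPUS);
    return p->copies + (size_t)cpu * p->unit_size;
}

static void model_free_percpu(struct model_percpu *p)
{
    if (!p)
        return;
    free(p->copies);
    free(p);
}
```

A caller would allocate once, for example `model_alloc_percpu(sizeof(long))`, and then obtain a separate `long *` for each CPU through `model_per_cpu_ptr`.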

Third, the implementation

1. Static per-CPU variable definition

We take DEFINE_PER_CPU as an example to describe how static per-CPU variable definitions are implemented in the Linux kernel. The code is as follows:

#define DEFINE_PER_CPU(type, name) \
    DEFINE_PER_CPU_SECTION(type, name, "")

Here type is the type of the variable, and name is the symbol of the per-CPU variable. The DEFINE_PER_CPU_SECTION macro places a per-CPU variable into the specified section; its code is as follows:

#define DEFINE_PER_CPU_SECTION(type, name, sec) \
    __PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES /* arrange the section */ \
    __typeof__(type) name                    /* define the variable */

The arch-specific percpu code (arch/arm/include/asm/percpu.h) may define PER_CPU_DEF_ATTRIBUTES to control the attributes of per-CPU variables. If the arch-specific code does not define it, the generic, architecture-independent code (include/asm-generic/percpu.h) defines it as empty. This is a good place to describe the software layering of the per-CPU mechanism:

(1) Arch-independent interface. The include/linux/percpu.h file defines the interface API and the related data structures that other kernel modules use to access the per-CPU mechanism. Other kernel modules include this header file when they use the per-CPU variable interface.

(2) Arch-general interface, in the include/asm-generic/percpu.h file. Definitions that are the same across architectures are extracted and placed in the asm-generic directory. The interfaces and data structures defined in this file are still hardware related; the software has simply abstracted what the arch-specific code has in common into an arch-general layer. Normally there is no need to include this header file directly, because include/linux/percpu.h includes it.

(3) Arch-specific interface. This is the hardware-related layer. arch/arm/include/asm/percpu.h defines the per-CPU interface code that is specific to the ARM platform.

Let's get back to the point and look at the definition of __PCPU_ATTRS:

#define __PCPU_ATTRS(sec) \
    __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \
    PER_CPU_ATTRIBUTES

PER_CPU_BASE_SECTION defines the name of the base section. It is defined as follows:

#ifndef PER_CPU_BASE_SECTION
#ifdef CONFIG_SMP
#define PER_CPU_BASE_SECTION ".data..percpu"
#else
#define PER_CPU_BASE_SECTION ".data"
#endif
#endif

There are many ways to define a static per-CPU variable, but they are all similar; they differ only in the section the variable is placed in and in its attributes. We will not look at the other implementations here and go straight to the section arrangement:

(1) Section arrangement for ordinary per-CPU variables

Location | SMP | UP
Built into the kernel | ".data..percpu" section | ".data" section
Defined in a module | ".data..percpu" section | ".data" section

(2) Section arrangement for first per-CPU variables

Location | SMP | UP
Built into the kernel | ".data..percpu..first" section | ".data" section
Defined in a module | ".data..percpu..first" section | ".data" section

(3) Section arrangement for SMP shared aligned per-CPU variables

Location | SMP | UP
Built into the kernel | ".data..percpu..shared_aligned" section | ".data" section
Defined in a module | ".data..percpu" section | ".data" section

(4) Section arrangement for aligned per-CPU variables

Location | SMP | UP
Built into the kernel | ".data..percpu..shared_aligned" section | ".data..shared_aligned" section
Defined in a module | ".data..percpu" section | ".data..shared_aligned" section

(5) Section arrangement for page aligned per-CPU variables

Location | SMP | UP
Built into the kernel | ".data..percpu..page_aligned" section | ".data..page_aligned" section
Defined in a module | ".data..percpu..page_aligned" section | ".data..page_aligned" section

(6) Section arrangement for read mostly per-CPU variables

Location | SMP | UP
Built into the kernel | ".data..percpu..readmostly" section | ".data..readmostly" section
Defined in a module | ".data..percpu..readmostly" section | ".data..readmostly" section

Now we understand how statically defined per-CPU variables are implemented, but why introduce so many sections? Ordinary kernel variables are placed in .data or .bss after compiling and linking, and everything is prepared for them during system initialization (for example, .bss is cleared). Because per-CPU variables are special, the kernel places them in other sections, located in the kernel address space between __per_cpu_start and __per_cpu_end. We call these the original variables of the per-CPU variables (I can't think of a better term).

The original variables alone are not enough: a copy must be created for each CPU. How? By statically defining an array of NR_CPUS elements? NR_CPUS is the maximum number of processors the system supports, not the number actually present, so that would waste memory. In addition, statically defined data is contiguous in memory. That is fine on a UMA system, but on a NUMA system each CPU's copy of a per-CPU variable should live in the memory that CPU can access fastest. The per-CPU copies may therefore be scattered across the memory address space, with holes in between. In essence, allocating memory for the per-CPU copies is the job of the memory management subsystem, so this article does not go into the details. The general idea is as follows:

The memory management subsystem allocates a chunk of memory for each CPU according to the current memory configuration. On UMA the chunk simply comes from main memory; on NUMA it may be allocated from the memory closest to the CPU (that is, the memory the CPU accesses fastest). Either way, that is the memory management subsystem's concern. The mechanism is the same for static and dynamic per-CPU variables, except that for static per-CPU variables a per-CPU chunk of the same size as the per-CPU section must be allocated in advance during system initialization. The arrangement of the per-CPU sections is defined as follows:

#define PERCPU_INPUT(cacheline) \
    VMLINUX_SYMBOL(__per_cpu_start) = .; \
    *(.data..percpu..first) \
    . = ALIGN(PAGE_SIZE); \
    *(.data..percpu..page_aligned) \
    . = ALIGN(cacheline); \
    *(.data..percpu..readmostly) \
    . = ALIGN(cacheline); \
    *(.data..percpu) \
    *(.data..percpu..shared_aligned) \
    VMLINUX_SYMBOL(__per_cpu_end) = .;

The per-CPU variables built into the kernel are all located in the per-CPU section between __per_cpu_start and __per_cpu_end. During system initialization (setup_per_cpu_areas), a per-CPU memory chunk is allocated for each CPU and the per-CPU section is copied into each chunk.
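The idea of "collect the original variables in one section, then copy that section once per CPU" can be tried in user space. The sketch below assumes GCC or Clang on an ELF target with a GNU-compatible linker, which generates `__start_<name>` and `__stop_<name>` symbols for sections whose names are valid C identifiers; here they stand in for __per_cpu_start and __per_cpu_end. All `model_` names are hypothetical, and malloc stands in for the kernel's chunk allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_NR_CPUS 2   /* hypothetical CPU count */

/* Place a variable in a dedicated section, in the spirit of DEFINE_PER_CPU. */
#define MODEL_DEFINE_PER_CPU(type, name) \
    __attribute__((section("model_percpu"), used)) type name

/* Section bounds provided by the linker (assumption: GNU-style linker). */
extern char __start_model_percpu[];
extern char __stop_model_percpu[];

/* The "original variables": they only serve as the initial image. */
MODEL_DEFINE_PER_CPU(int, model_budget) = 100;
MODEL_DEFINE_PER_CPU(long, model_ticks) = 7;

static char *model_chunk[MODEL_NR_CPUS];

/* Counterpart of setup_per_cpu_areas() in this model: allocate one chunk
 * per CPU and initialize it from the original section.
 * Returns 0 on success, -1 if an allocation fails. */
static int model_setup_per_cpu_areas(void)
{
    size_t size = (size_t)(__stop_model_percpu - __start_model_percpu);

    for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        model_chunk[cpu] = malloc(size);
        if (!model_chunk[cpu])
            return -1;
        memcpy(model_chunk[cpu], __start_model_percpu, size);
    }
    return 0;
}

/* Find a CPU's copy at the same position inside its chunk that the
 * original variable has inside the section. */
static void *model_copy_of(void *original, unsigned int cpu)
{
    assert(cpu < MODEL_NR_CPUS);
    return model_chunk[cpu] + ((char *)original - __start_model_percpu);
}
```

After setup, each CPU's copy starts out with the initializer of the original variable, and writes to one copy affect neither the other copies nor the original.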

2. Accessing a statically defined per-CPU variable

The code is as follows:

#define get_cpu_var(var) (*({ \
    preempt_disable(); \
    &__get_cpu_var(var); }))

Here again we meet the pair get_cpu_var and __get_cpu_var, which readers will by now recognize as a locking version and a lock-free version. To keep the current task from being moved to another CPU by preemption, a locking mechanism such as preempt_disable is needed when per-CPU memory is accessed. Let's look at __get_cpu_var:

#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

#define this_cpu_ptr(ptr) __this_cpu_ptr(ptr)

The ARM platform does not define __this_cpu_ptr, so the asm-generic version is used:

#define __this_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)

As the name SHIFT_PERCPU_PTR suggests, this macro starts from the address of the original per-CPU variable and, through a simple transformation (a shift), obtains the actual address of this CPU's copy. In fact, the per-CPU memory management module guarantees a simple linear relationship (a fixed offset) between the address of the original per-CPU variable and the address of its copy on each CPU. The __my_cpu_offset macro supplies that offset; if the arch-specific code does not define it, the asm-generic version is used, as follows:

#define __my_cpu_offset per_cpu_offset(raw_smp_processor_id())

raw_smp_processor_id gets the ID of the current CPU. If the arch-specific code does not define the __per_cpu_offset macro, the offsets are stored in the __per_cpu_offset array (what follows is only the array declaration; the definition is in the mm/percpu.c file), as follows:

#ifndef __per_cpu_offset
extern unsigned long __per_cpu_offset[NR_CPUS];

#define per_cpu_offset(x) (__per_cpu_offset[x])
#endif

On ARMv6K and ARMv7, the offset is stored in the TPIDRPRW register in order to improve system performance.
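The fixed-offset address translation can be sketched in user space as follows. This is an illustrative model only: the `model_`/`MODEL_` names are hypothetical, malloc stands in for the per-CPU chunk allocator, the CPU ID is an ordinary variable, and the integer-to-pointer round trip relies on the flat address space of common platforms (plus the `__typeof__` extension of GCC and Clang).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_NR_CPUS 2   /* hypothetical CPU count */

/* The "original" variables, standing in for the per-CPU section. */
static struct { int budget; long ticks; } model_original = { 100, 0 };

static uintptr_t model_per_cpu_offset[MODEL_NR_CPUS]; /* like __per_cpu_offset[] */
static unsigned int model_cpu_id;            /* like raw_smp_processor_id() */

/* Allocate one copy per CPU and record its distance from the original.
 * Returns 0 on success, -1 if an allocation fails. */
static int model_setup_offsets(void)
{
    for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        void *copy = malloc(sizeof(model_original));

        if (!copy)
            return -1;
        memcpy(copy, &model_original, sizeof(model_original));
        model_per_cpu_offset[cpu] =
            (uintptr_t)copy - (uintptr_t)&model_original;
    }
    return 0;
}

/* Like per_cpu_offset(x): look up the offset of a given CPU. */
static uintptr_t model_offset_of(unsigned int cpu)
{
    assert(cpu < MODEL_NR_CPUS);
    return model_per_cpu_offset[cpu];
}

/* Like SHIFT_PERCPU_PTR: address of the original plus a per-CPU constant. */
#define MODEL_SHIFT_PERCPU_PTR(ptr, off) \
    ((__typeof__(ptr))((uintptr_t)(ptr) + (off)))

#define model_per_cpu_ptr(ptr, cpu) \
    MODEL_SHIFT_PERCPU_PTR(ptr, model_offset_of(cpu))

#define model_this_cpu_ptr(ptr) model_per_cpu_ptr(ptr, model_cpu_id)
```

Because every variable in the original area has the same offset to its copy on a given CPU, a single number per CPU is enough to translate any of them; that is why keeping that number in a CPU-local register, as the text says ARMv6K and ARMv7 do, makes the lookup cheap.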

3. Dynamically allocated per-CPU variables

This part is left to the description of the memory management subsystem.

Original article; please credit the source when reposting. Snail Nest Technology

