This article analyzes whether huge pages bring performance gains under all conditions, with a particular focus on NUMA architectures.
This article may be reprinted; when reprinting, please keep this paragraph and place it at the top of the article. Author: Lu Junyi (Cenalulu). Original address: http://cenalulu.github.io/linux/huge-page-on-numa/
Prerequisite knowledge
Before reading this article, you should understand at least the following basics:
- The basic concepts of the CPU cache; see the earlier post on what programmers need to know about the CPU cache.
- The basic concepts of NUMA; see the earlier introductory post on this blog.
- The current Linux thread scheduling mechanism, which schedules based on how busy each core of a multi-core CPU is; see the paper Chip Multi Processing aware Linux Kernel Scheduler.
About Huge Pages
Before starting the analysis, let's first review the historical background and usage scenarios of huge pages.
Why do we need huge pages? If you have a rough picture of the CPU cache architecture, you have certainly heard of the TLB cache. On Linux, the memory addresses visible to a program are virtual addresses: every program's address space starts at 0, while actual data access goes through physical addresses. Therefore, on every memory operation the CPU must translate the virtual address into the corresponding physical address via the page table, and for memory-intensive programs the page table lookup can become a bottleneck. Modern CPUs therefore include a TLB (Translation Lookaside Buffer) cache that caches a small number of hot address mappings. However, because of manufacturing cost and process constraints, and because its response time must stay within CPU-cycle scale, the TLB can hold only on the order of dozens of entries, so when the hot data set is large, virtual-to-physical translation through the TLB becomes strained. Let's do the math: with the standard Linux page size of 4K, a TLB that caches 64 entries can cover only 4K*64 = 256K of hot memory, which is obviously far from ideal. This is why huge pages were introduced. Tip: do not confuse virtual addresses with Windows virtual memory; the latter is a technique for swapping memory contents out to other devices when physical memory is insufficient (similar to the Linux swap mechanism).
What is a huge page? Since we cannot change the capacity of the TLB cache, the only option at the system level is to increase the amount of physical memory that a single TLB entry covers, thereby increasing the amount of hot memory data the TLB can cover. Suppose we increase the Linux page size to 16M; then the same 64-entry TLB cache can cover 64*16M = 1G of hot memory data, which, compared with the 256K above, is far better suited to real applications. This technique of enlarging the page size is called Huge Page.
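As a concrete illustration (not part of the original article), the sketch below requests memory explicitly backed by huge pages on Linux through mmap with the MAP_HUGETLB flag. It assumes an x86_64 system where huge pages have already been reserved (for example via /proc/sys/vm/nr_hugepages); the size used is only illustrative.

```c
/* Minimal sketch (not from the original article): explicitly request
 * huge-page-backed memory on Linux with mmap(MAP_HUGETLB).
 * Assumes huge pages have been reserved beforehand, e.g.:
 *   echo 128 > /proc/sys/vm/nr_hugepages
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LENGTH (256UL * 1024 * 1024)  /* 256 MB; must be a multiple of the huge page size */

int main(void)
{
    /* MAP_HUGETLB asks the kernel to back the mapping with huge pages
     * (2 MB by default on x86_64), so one TLB entry covers 2 MB instead of 4 KB. */
    void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* typically fails when no huge pages are reserved */
        return 1;
    }

    memset(addr, 0, LENGTH);          /* touch the memory so the huge pages are actually faulted in */
    printf("mapped %lu MB backed by huge pages at %p\n", LENGTH >> 20, addr);

    munmap(addr, LENGTH);
    return 0;
}
```

Compared with a plain 4K mapping, each TLB entry covering this region now maps 512 times more memory (2M versus 4K).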
Are huge pages a panacea?
Having understood the origin and principle of huge pages, it is not hard to conclude that a program able to benefit from them must be one whose hot data is scattered over at least 64 pages of 4K each. Moreover, if the program's run time is not dominated by the page-table walks that follow TLB cache misses, then no amount of TLB coverage or page-size growth will help. This principle is mentioned in an LWN article, which also gives a more detailed estimation method. In short: use oprofile to capture the share of total run time spent on the page walks caused by TLB misses, and use that to estimate the performance improvement huge pages could bring. Simply put, if our program's hot data amounts to only 256K and sits on contiguous memory pages, then a 64-entry TLB cache is already enough. At this point you may ask: since it is hard to predict whether our program's access pattern will benefit from huge pages, and since huge pages seem to change nothing but the page size with no performance loss, why not simply enable them for every program? That idea is in fact completely wrong, and it is the main point this article wants to make: on today's common NUMA systems, huge pages are not a cure-all, and used improperly they can even degrade program or database performance by 10%. The analysis follows below.
Huge Page on NUMA
The authors of the paper Large Pages May Be Harmful on NUMA Systems ran experiments measuring the performance of huge pages in various NUMA scenarios. The results show that in a significant portion of application scenarios huge pages do not improve performance, and can even cause a performance loss of up to 10%.
There are two main reasons for the performance degradation:
Increased contention among CPUs for the same page
For write-intensive applications, huge pages greatly increase the probability of cache write conflicts. Because the independent caches of each CPU maintain write consistency through the MESI protocol, a write conflict means:
- Traffic between CPUs over the bus, making the bus busy
- Reduced CPU execution efficiency
- Frequent invalidation of the CPUs' local caches
A database analogy: it is as if a lock that used to protect 10 rows of data is now used to lock 1000 rows. Inevitably, the probability of contention between threads for that lock increases greatly.
Contiguous data has to be read across CPUs (false sharing)
As the following example illustrates, with the original 4K small pages data can be allocated contiguously, and thanks to the high hit rate, locality keeps it on the same CPU. With huge pages, however, part of the data is forced onto a second page in order to fill the space left over from the program's previous memory allocation. Because this smaller portion of data on the huge page carries little weight in the CPU-affinity calculation, it naturally ends up attached to another CPU. The result: data that should have stayed hot in CPU2's L1 or L2 cache now has to be fetched from the remote CPU over the CPU interconnect. Suppose, for example, that we declare two arrays back to back, Array A and Array B, each 1536K in size. Because the first 2M page is not filled by Array A alone, Array B is split into two parts on two different pages, and because of the memory affinity configuration, one part is allocated in NUMA zone 0 and the other in zone 1. When a thread then needs to access Array B, it has to fetch the other portion of the data over the costly inter-connect.
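The page-boundary arithmetic behind this example can be checked with a small sketch. The layout is hypothetical: it simply assumes Array A and Array B sit back to back starting at offset 0 of a 2 MB-aligned huge-page region.

```c
/* Hypothetical layout check for the Array A / Array B example above:
 * two 1536K arrays placed back to back inside 2 MB huge pages. */
#include <stdio.h>

#define KB 1024UL
#define MB (1024UL * KB)

int main(void)
{
    unsigned long a_start = 0;                  /* Array A at offset 0 of the allocation */
    unsigned long a_size  = 1536 * KB;
    unsigned long b_start = a_start + a_size;   /* Array B allocated right after A */
    unsigned long b_size  = 1536 * KB;
    unsigned long huge    = 2 * MB;             /* default x86_64 huge page size */

    unsigned long b_first_page = b_start / huge;
    unsigned long b_last_page  = (b_start + b_size - 1) / huge;

    printf("Array B occupies bytes [%lu K, %lu K)\n",
           b_start / KB, (b_start + b_size) / KB);
    printf("With 2 MB huge pages it spans pages %lu..%lu: "
           "%lu K on the first page, %lu K on the second.\n",
           b_first_page, b_last_page,
           (huge - b_start % huge) / KB,
           (b_start + b_size - b_last_page * huge) / KB);
    /* If those two huge pages land in different NUMA zones, every scan of
     * Array B crosses the interconnect for part of its data. */
    return 0;
}
```

Running it shows 512K of Array B on the first huge page and the remaining 1024K on the second, which is exactly the split that can end up in two different NUMA zones.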
Delays resulting from traversing a greater physical distance to reach a remote node are not the most important source of performance overhead. On the other hand, congestion on interconnect links and in memory controllers, which results from high volume of data flowing across the system, can dramatically hurt performance.
Under interleaving, the memory latency reduces by a factor of 2.48 for Streamcluster and 1.39 for PCA. This effect was entirely responsible for performance improvement under the better policy. The question is, what's responsible for memory latency improvements? It turns out that interleaving dramatically reduces memory controller and interconnect congestion by alleviating the load imbalance and mitigating traffic hotspots.
Countermeasures: the ideal
Let's talk about the ideal first. The main purpose of the paper mentioned above is in fact to propose an automatic memory-management strategy for huge pages on NUMA architectures. Simply put, this strategy is a huge-page-oriented variant of Carrefour. (Note: readers unfamiliar with Carrefour can refer to the earlier introductory post on this blog, or read the original paper.) Here is a brief summary of the relevant techniques:
- To reduce cross-NUMA-zone access to read-only hot data, pages with a very high read-to-write ratio are replicated, with a copy placed directly in each NUMA zone's local memory to reduce response time.
- To reduce false sharing, pages that cause a large number of cache misses are monitored, then split and reorganized so that data with the same CPU affinity ends up on the same page.
Countermeasures: the reality
Having described the dream, let's look at reality, which is often harsh: without hardware-level PMU (Performance Monitoring Unit) support, obtaining accurate page-access and cache-miss information is very expensive, so the ideal above remains at the experimental and paper stage. What can we do before that ideal is realized? The only answer is testing.
Real-world testing is the most convincing. It means putting the optimization target under a stress simulation of its real environment, and verifying whether huge pages bring a gain by comparing performance with huge pages enabled and disabled. Of course, most applications are hard to simulate realistically; in that case we can fall back on the following theoretical test.
A theoretical test can estimate the potential improvement from huge pages. The principle is to compute the fraction of total program execution time that the current application spends on page walks caused by TLB misses. This kind of test does not account for the two performance penalties discussed above, so it can only give an upper bound on the potential gain from huge pages; if the computed value is very low, we can conclude that using huge pages would instead introduce an extra performance penalty. The concrete method is described in the LWN article; the calculation takes roughly the following form.
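Reconstructed from the description above (the exact expression used in the LWN article may differ), the upper bound can be written as:

$$
\text{potential gain} \;\lesssim\; \frac{T_{\text{page walk}}}{T_{\text{total}}}
\;=\; \frac{N_{\text{TLB miss}} \times C_{\text{walk}}}{C_{\text{total}}}
$$

where $N_{\text{TLB miss}}$ is the number of TLB misses during the run, $C_{\text{walk}}$ is the average cost (in cycles) of one page-table walk, and $C_{\text{total}}$ is the total number of cycles consumed by the program.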
Without hardware PMU support, the inputs to this calculation have to be obtained with oprofile and calibrator.
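On Linux kernels that expose PMU counters through the perf_event_open(2) system call (an alternative route not mentioned in the original article), the TLB-miss count in the numerator can be collected directly. Below is a minimal, illustrative sketch: the event encoding assumes a data-TLB read-miss counter is available, and the stride walk is only a stand-in for the real application code.

```c
/* Illustrative sketch: count data-TLB load misses around a workload using
 * perf_event_open(2).  The stride walk below is only a placeholder for the
 * real application whose TLB behaviour you want to estimate. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    /* dTLB read misses: cache-id | (op << 8) | (result << 16) */
    attr.config = PERF_COUNT_HW_CACHE_DTLB |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    /* Placeholder workload: walk 256 MB with a 4K stride so nearly every
     * access lands on a different 4K page. */
    size_t n = 256UL * 1024 * 1024;
    char *buf = malloc(n);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 1, n);                 /* pre-fault the pages first */
    long long sum = 0;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (size_t i = 0; i < n; i += 4096)
        sum += buf[i];
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("dTLB load misses: %lld (checksum %lld)\n", misses, sum);

    free(buf);
    close(fd);
    return 0;
}
```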
Summary
Not every optimization comes with zero performance cost. Thorough testing and an understanding of the principles behind an optimization are prerequisites for a successful one.