Is Huge Page a panacea for saving performance?

This article analyzes whether huge page brings a performance gain under all conditions, especially on NUMA architectures.

Reprints are welcome, but please keep this paragraph and place it at the top of the article. Author: Lu Junyi (Cenalulu). Address of this article: http://cenalulu.github.io/linux/huge-page-on-numa/

Prerequisites

Before reading this article, you should understand at least the following basics:

    • The basic concepts of the CPU cache; see the earlier post on what programmers need to know about the CPU cache.
    • The basic concepts of NUMA; see the earlier introductory post on this blog.
    • The current Linux thread scheduling mechanism, which is based on how busy each core of a multi-core CPU is; see the paper Chip Multi Processing aware Linux Kernel Scheduler.

About huge page

Before starting the analysis proper, let's first introduce the historical background and usage scenarios of huge page.

Why do we need huge page

If you know roughly how CPU caches are organized, you have certainly heard of the TLB cache. On Linux, the memory addresses visible to a program are virtual addresses, and each program's address space starts at 0. Actual data access, however, goes through physical addresses. So for every memory operation the CPU has to consult the page table to translate the virtual address into the corresponding physical address, and for memory-intensive programs the page table becomes a bottleneck.

For this reason modern CPUs include a TLB (translation lookaside buffer) cache, which caches the mappings for a small number of hot memory addresses. However, because of manufacturing cost and process constraints, a cache whose response time must stay at the CPU-cycle level can only hold a few dozen entries. The TLB cache is therefore stretched thin when it has to translate a large number of hot virtual addresses. Let's do the math: with the standard Linux page size of 4K, a TLB cache with 64 entries can only cover 4K*64 = 256K of hot data, which is obviously far from ideal. This is why huge page was introduced.

Tip: do not confuse virtual addresses with virtual memory on Windows. The latter is the technique of swapping content out of memory to other devices (similar to the Linux swap mechanism) in order to cope with insufficient physical memory.

What is huge page

Since the capacity of the TLB cache cannot be changed, the only option is to increase, at the system level, the amount of physical memory that a single TLB cache entry maps, and thereby increase the amount of hot data the TLB cache can cover. Suppose we raise the Linux page size to 16M: the same 64-entry TLB cache can then cover 64*16M = 1G of hot data, which, compared with the 256K above, is much better suited to real applications. The technique of enlarging the page size in this way is huge page.
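The coverage arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation in the text; the function name `tlb_reach` is made up for illustration, and the 64-entry TLB and the 16M page size are the article's example figures, not properties of any particular CPU.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Return how many bytes a TLB with `entries` slots can map at once."""
    return entries * page_size


if __name__ == "__main__":
    small = tlb_reach(64, 4 * KIB)    # standard 4K pages
    huge = tlb_reach(64, 16 * MIB)    # the article's hypothetical 16M pages
    print(f"4K pages:  {small // KIB} KiB covered")
    print(f"16M pages: {huge // GIB} GiB covered")
```

The TLB itself is unchanged in both cases; only the amount of memory behind each entry grows.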

Is huge page a cure-all?

Having understood the origins and principles of huge page, it is not hard to conclude that a program can only benefit from huge page if its hot data is scattered and spans at least 64 pages of 4K. In addition, if the program's run time is not mainly spent on page table lookups after TLB cache misses, then enlarging the TLB or increasing the page size is futile no matter how far you go. This principle is mentioned in an introductory article on LWN, which also gives a more detailed estimation method. Simply put: use oprofile to capture the share of total program run time caused by TLB misses, and from that calculate the performance improvement huge page can be expected to bring. In other words, if our program's hot data is only 256K and sits on contiguous memory pages, a 64-entry TLB cache is enough to cope.

At this point you may have a question: it is hard to predict whether our program's access pattern will benefit from enabling huge page, and huge page seems to change nothing but the page size, so there should be no performance loss. Why not simply use huge page for every program? In fact, this idea is completely wrong! This is also the main point of this article: on today's common NUMA systems huge page is no master key either, and improper use can even make program or database performance drop by 10%. The analysis below focuses on this.

Huge Page on NUMA

The authors of the paper Large Pages May Be Harmful on NUMA Systems ran experiments to measure the performance difference huge page makes in various scenarios on NUMA systems. Their results show that in a significant share of the application scenarios huge page does not improve performance, and can even cause a performance loss of up to 10%.

There are two main reasons for the performance degradation:

Increased contention between CPUs for the same page

For write-intensive applications, huge page greatly increases the probability of cache write conflicts. Because the consistency of writes to each CPU's private cache is maintained by the MESI protocol, a write conflict means:

    • Traffic over the bus between the CPUs, which keeps the bus busy
    • Lower CPU execution efficiency
    • Frequent invalidation of the CPU's local cache

As a database analogy, it is as if a lock that used to protect 10 rows of data now protects 1000 rows. Inevitably, the probability of threads contending for that lock increases greatly.
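The lock analogy can be put into numbers with a toy model. This is a sketch under stated assumptions, not a model of the MESI protocol: it assumes two independent writes land uniformly at random in a region of memory and asks how likely they are to fall on the same page. The function name and the 1 GiB region are hypothetical, and 2M is used as a typical huge page size.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def same_page_probability(region_bytes: int, page_size: int) -> float:
    """Chance that two uniformly random writes in the region hit one page."""
    pages = region_bytes // page_size
    if pages <= 0:
        raise ValueError("region must contain at least one page")
    return 1 / pages


if __name__ == "__main__":
    p_small = same_page_probability(1 * GIB, 4 * KIB)
    p_huge = same_page_probability(1 * GIB, 2 * MIB)
    print(f"4K pages: {p_small:.8f}")
    print(f"2M pages: {p_huge:.8f}")
    print(f"ratio:    {p_huge / p_small:.0f}x")
```

In this model the same-page probability grows by exactly the ratio of the two page sizes, which is the "10 rows versus 1000 rows" effect described above.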

Contiguous data must be read across CPUs (False Sharing)

With the original 4K small pages, data can be allocated contiguously, and thanks to the higher hit rate the data stays local to a single CPU. With huge page, part of the data is forced to spread over two pages, because it first fills the space left over by the program's previous memory allocation. The smaller part of the data in a huge page carries little weight when CPU affinity is calculated, so it naturally ends up attached to another CPU. The result: data that should have been hot in CPU2's L1 or L2 cache has to be fetched from a remote CPU over the CPU interconnect.

Suppose we declare two arrays back to back, Array A and Array B, each 1536K in size. Because the first 2M page is not full after Array A is allocated, Array B is split into two parts that lie on two different pages. And because of the memory affinity configuration, one part is allocated in zone 0 and the other in zone 1. When a thread then needs to access Array B, it has to fetch part of the data over the costly interconnect.
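The array example can be traced with a small calculation. This is a minimal sketch: it assumes Array A starts at a page-aligned offset 0 and Array B follows immediately, which real allocators do not guarantee. The helper names `pages_spanned` and `shared_pages` are made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def pages_spanned(offset: int, size: int, page_size: int) -> list:
    """Return the indices of the pages touched by [offset, offset + size)."""
    first = offset // page_size
    last = (offset + size - 1) // page_size
    return list(range(first, last + 1))


def shared_pages(page_size: int, array_size: int = 1536 * KIB) -> set:
    """Pages that hold data from both Array A and Array B."""
    a = pages_spanned(0, array_size, page_size)
    b = pages_spanned(array_size, array_size, page_size)
    return set(a) & set(b)


if __name__ == "__main__":
    size = 1536 * KIB
    print("2M pages, A:", pages_spanned(0, size, 2 * MIB))
    print("2M pages, B:", pages_spanned(size, size, 2 * MIB))
    print("shared with 2M pages:", shared_pages(2 * MIB))
    print("shared with 4K pages:", shared_pages(4 * KIB))
```

With 2M pages, Array B touches page 0 (shared with Array A) and page 1. With 4K pages, each array occupies its own set of pages, so page placement can follow each array separately.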

Delays resulting from traversing a greater physical distance to reach a remote node are not the most important source of performance overhead. On the other hand, congestion on interconnect links and in memory controllers, which results from high volume of data flowing across the system, can dramatically hurt performance.

Under interleaving, the memory latency reduces by a factor of 2.48 for Streamcluster and 1.39 for PCA. This effect was entirely responsible for performance improvement under the better policy. The question is, what's responsible for memory latency improvements? It turns out that interleaving dramatically reduces memory controller and interconnect congestion by alleviating the load imbalance and mitigating traffic hotspots.

Countermeasures

The ideal

Let's talk about the ideal situation first. The main purpose of the paper mentioned above is in fact to discuss an automatic memory management strategy for huge page on NUMA architectures. Simply put, this strategy is a variant of Carrefour optimized for huge page. (Note: readers unfamiliar with Carrefour can refer to the earlier introductory post on this blog or read the original paper.) Here is a brief summary of the techniques involved:

    • To reduce cross-NUMA-zone access to read-only hot data, pages with a very high read-to-write ratio can be replicated, placing one copy in the local memory of each NUMA zone to reduce response time.
    • To reduce False Sharing, monitor the pages that cause a large number of cache misses, then split and regroup them so that data with the same CPU affinity is placed in the same page.

The reality

So much for the dream; now let's look at the reality. Reality is often harsh: without hardware-level PMU (Performance Monitor Unit) support, obtaining accurate page access and cache miss information is very expensive in terms of performance. So the ideal described above remains at the experimental and paper stage. What can we do in the meantime, before the ideal becomes reality? There is only one answer: test.

The results of an actual test are the most convincing. An actual test means putting the system being optimized under a load that simulates the real environment, and checking whether huge page brings a performance gain by comparing performance with huge page turned on and off. Most applications, of course, are very hard to put under a realistic simulated load. In that case a theoretical test can be used instead.
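Before comparing runs with huge page on and off, it helps to confirm which setting is actually active. The article does not say which huge page mechanism is meant, so the sketch below assumes Linux transparent huge pages (THP), whose mode is exposed in the sysfs file `/sys/kernel/mm/transparent_hugepage/enabled` with the active value in square brackets. The function names are made up for illustration.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: the system uses transparent huge pages (THP) on Linux.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> Optional[str]:
    """Extract the bracketed value from e.g. 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_mode(path: Path = THP_ENABLED) -> Optional[str]:
    """Return the active THP mode, or None if the file cannot be read."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("THP mode:", current_thp_mode())
```

Recording the mode next to each benchmark result makes it harder to mix up the "on" and "off" runs. Changing the mode normally requires root privileges.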

Theoretical test

A theoretical test can be used to estimate the potential improvement from huge page. The principle is to calculate what share of the program's total execution time is spent on page walks caused by TLB misses while the application is running. Of course, this kind of test does not take into account the two sources of performance loss described above, so it can only give an upper limit on the performance gain huge page might bring. If the calculated value is very low, you can assume that using huge page will result in a net performance loss. For the concrete method and the calculation formula, see the LWN article.

Without hardware PMU support, the calculation has to rely on oprofile and calibrator.
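The principle of the theoretical test can be sketched as a single ratio. This is not the LWN formula, which is not reproduced here. It is a minimal illustration under the assumption that you already have a TLB miss count, an average page walk cost in cycles, and the program's total cycle count, for instance from oprofile and calibrator. The function name and the sample numbers are hypothetical.

```python
def page_walk_fraction(tlb_misses: int, cycles_per_walk: float,
                       total_cycles: int) -> float:
    """Share of total cycles spent on page walks after TLB misses.

    This is an upper bound on what huge page could save: it ignores
    the NUMA-related losses discussed earlier in the article.
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    fraction = tlb_misses * cycles_per_walk / total_cycles
    return min(1.0, fraction)


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    bound = page_walk_fraction(tlb_misses=1_000_000,
                               cycles_per_walk=100,
                               total_cycles=1_000_000_000)
    print(f"upper bound on huge page gain: {bound:.1%}")
```

If the ratio comes out at a few percent or less, the potential gain is smaller than the up-to-10% loss reported for NUMA systems, which is the case where the article advises against enabling huge page.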

Summary

Not every optimization comes with zero performance cost. Adequate testing and an understanding of the principle behind an optimization are prerequisites for applying it successfully.
