Application of large-page memory (hugepages) in general program optimization


Today I will introduce a relatively novel program-optimization technique: large-page memory (hugepages). Simply put, it shrinks the page table by increasing the operating-system page size, thereby avoiding TLB (translation lookaside buffer) misses. Documentation on the topic is scarce, and most of what exists online concerns its use in Oracle databases, which gives the illusion that the technique applies only there. In fact, large-page memory is a very general optimization with wide applicability; for suitable applications it can bring up to a 50% performance improvement, which is a very noticeable effect. In this post I will walk through a concrete example of how to use it.

Before diving in, I must emphasize that large-page memory has a limited scope of application: if a program consumes very little memory, or if its memory accesses have good locality, large pages will bring little or no improvement. So if your optimization problem has either of these two characteristics, do not bother with large pages. The sections below explain in detail why large pages do not help in those two cases.

1. Background

I have recently been working on the company's music-recognition ("listen and identify the song") project, which is based on audio-fingerprint retrieval and is now live on the Sogou Voice Cloud open platform. During development we hit a serious performance problem: single-threaded tests met the requirements, but under multi-threaded stress testing the most time-consuming part of the algorithm suddenly became several times slower! After careful debugging we found that the biggest culprit was the -pg compilation option; removing it helped a great deal, but the code was still twice as slow as in the single-threaded case, pushing the system's real-time factor above 1.0 and seriously degrading its responsiveness.

Closer analysis showed that the most time-consuming part of the system was access to the fingerprint database, and that this part simply had no room for algorithmic optimization; the only conventional remedy was a machine with higher memory bandwidth. Switching to such a machine did bring a sizable gain, but still not enough to meet the requirement. While stuck, I happened to see Dr. Hong Chuntao of MSRA mention on Weibo that his team had used large-page memory to optimize random accesses into a large array, with good results. I turned to him for help, and large-page memory ultimately improved the system further, bringing the real-time factor down to about 0.4. Goal achieved!

2. Introduction to fingerprint-based music retrieval

The retrieval process is the same as in a search engine: a music fingerprint plays the role of a keyword, and the fingerprint database plays the role of the search engine's document collection. Like a web index, the fingerprint database is built as an inverted index, as shown below:

Figure 1 Fingerprint-based inverted index table

The difference is that a fingerprint is just an int (the figure uses only 24 of its bits) and carries too little information, so many fingerprints must be extracted to complete one match, roughly thousands per second of audio. Each fingerprint requires a lookup in the fingerprint database to fetch its inverted (posting) list; a forward table keyed by music ID is then built from those lists to determine which track matches, as shown below:

Fig. 2 Measuring similarity by counting matches

The final result is the track that ranks highest.

At present the fingerprint database is about 60 GB, built from fingerprints extracted from 250,000 songs. The inverted list for each fingerprint varies in length, with an upper bound of 7,500 entries. The forward table likewise covers 250,000 tracks, and each track has at most 8,192 distinct time offsets. A single retrieval generates roughly 1,000 fingerprints (sometimes more).

From this description we can see that fingerprint-based music retrieval consists of three parts: 1. extracting fingerprints; 2. accessing the fingerprint database; 3. sorting by time offset. Under multi-threading, these three parts take roughly 1%, 80%, and 19% of the time respectively; that is, most of the time is spent in fingerprint-database lookups. Worse, those lookups are entirely random accesses with no locality, so the cache misses constantly, ordinary optimization methods do not apply, and the only conventional fix is a server with higher memory bandwidth.

But precisely because of these characteristics, large memory consumption (around 100 GB), random access patterns, and memory access being the bottleneck, large-page memory is particularly well suited to the performance problem we encountered.

3. Principle

The principle behind large-page memory involves how the operating system translates virtual addresses into physical addresses. To run multiple processes concurrently, the OS gives each process its own virtual address space: 4 GB on a 32-bit system, and up to 2^64 bytes on a 64-bit system (in practice the usable range is smaller). For a long time this puzzled me: would multiple processes not conflict if, say, two of them both accessed address 0x00000010? In fact each process's address space is virtual and distinct from physical addresses; the two processes access the same virtual address, but it translates to different physical addresses. The translation is performed through page tables, and the relevant background is the operating system's paged memory management.

Paged memory management divides a process's virtual address space into pages and numbers them; physical memory is correspondingly divided into frames (blocks) of the same size, also numbered. Assuming a page size of 4 KB, a 32-bit virtual address splits into a 20-bit page number and a 12-bit in-page offset.

To let a process find the physical frame that holds each virtual page, the OS maintains a mapping table for each process: the page table. The page table records the physical frame number corresponding to each virtual page (Figure 3). Once the page table is set up and the process executes, every address translation consults the table to find the frame for the page in question.

The hardware provides a page-table register, which holds the base address of the page table in memory and the table's length. When a process is not executing, these two values are kept in the process's PCB; when the scheduler dispatches the process, they are loaded into the page-table register.

When a process accesses data at a virtual address, the paging hardware automatically splits the effective (relative) address into two parts, a page number and an in-page offset, and then uses the page number as an index into the page table; the lookup is performed by hardware. If the page number does not exceed the page-table length, the entry's location is computed as the page-table base plus the product of the page number and the entry size; the entry yields the page's physical frame number, which is loaded into the physical-address register. The in-page offset from the effective-address register is copied unchanged into the offset field of the physical-address register. This completes the translation from virtual address to physical address.

Figure 3 The role of the page table
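The split described above can be sketched with a little shell arithmetic. This is an illustration only; the address value below is arbitrary, and 4 KB pages (12 offset bits) are assumed as in the text:

```shell
# Split a 32-bit virtual address into page number and in-page offset,
# assuming 4 KB pages: the low 12 bits are the offset, the rest the page number.
addr=$(( 0x00403A1F ))        # arbitrary example address
page=$(( addr >> 12 ))        # virtual page number (upper 20 bits)
offset=$(( addr & 0xFFF ))    # in-page offset (lower 12 bits)
echo "page=$page offset=$offset"   # prints: page=1027 offset=2591
```

The physical address is then formed by replacing the page number with the frame number found in the page table and keeping the offset unchanged.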

Because the page table itself lives in memory, every data access requires two memory accesses. The first reads the page table to find the page's physical frame number, which is concatenated with the in-page offset to form the physical address; the second fetches the actual data from that address. This scheme can therefore cut the machine's effective memory-access speed nearly in half.

To speed up translation, a small special-purpose cache with parallel lookup capability, the translation lookaside buffer (TLB, sometimes called the "fast table"), is added to the translation hardware to hold the page-table entries currently in use. The translation mechanism with a TLB is shown in Figure 4. Because of cost, the TLB cannot be made very large; it typically holds only 16 to 512 entries.

This mechanism works very well for small and medium-sized programs: the TLB hit rate is high, so little performance is lost. But when a program consumes a great deal of memory and the TLB hit rate drops, the trouble begins.

Figure 4 Address translation with a TLB

4. The problem with small pages

Modern computer systems support very large virtual address spaces (2^32 to 2^64). In such an environment the page table becomes enormous: with 4 KB pages, a program that consumes 40 GB of memory needs about 10 million page-table entries, and the table must occupy contiguous space. Two- or three-level page tables solve the contiguity problem, but they hurt performance even more, because on a TLB miss the number of memory accesses per translation grows from two to three or four. And since such a program touches a huge range of memory, if its locality is poor the TLB will miss constantly, seriously degrading performance.

Moreover, with around 10 million page-table entries and a TLB that caches only a few hundred, even a program with good locality will miss frequently at this footprint. So is there a good way to avoid TLB misses? Large-page memory! Suppose we raise the page size to 1 GB: for 40 GB of memory, the page table has only 40 entries, and the TLB essentially never misses. Even when it does, the table is small enough for a single-level page table, so a miss costs only two memory accesses. This is the root reason large pages can speed up a program: the TLB almost never misses!
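The arithmetic behind those numbers is easy to check; this is just the calculation from the text, for a 40 GB footprint:

```shell
# Page-table entries needed to map a 40 GB footprint,
# with 4 KB pages versus 1 GB pages.
mem=$(( 40 * 1024 * 1024 * 1024 ))     # 40 GB
small=$(( mem / 4096 ))                # 4 KB pages -> ~10 million entries
big=$(( mem / (1024 * 1024 * 1024) ))  # 1 GB pages -> 40 entries
echo "4K pages: $small entries; 1G pages: $big entries"
# prints: 4K pages: 10485760 entries; 1G pages: 40 entries
```

Forty entries fit comfortably in even the smallest TLB, while ten million entries dwarf the few hundred a TLB can hold.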

Earlier I mentioned that if the program to be optimized consumes little memory, or if its accesses have good locality, large pages will bring little benefit; now we can see why. If the program uses only a few megabytes, its page-table entries are few and are likely all cached in the TLB, and even a miss can be resolved through a single-level table. If the program's accesses have good locality, then over any short period it touches neighboring memory, so the probability of a TLB miss is also small. In both cases the TLB rarely misses, and large pages have no advantage to show.

5. Configuration and use of large-page memory

As noted above, much of the material online presents large-page memory only alongside its use in Oracle databases, which can create the illusion that it applies nowhere else. From the analysis so far we know it is in fact a very general optimization, and its effect comes from avoiding TLB misses. So how is it applied? The steps are detailed below.

1. Installing the libhugetlbfs library

The libhugetlbfs library provides user-space access to large-page memory. It can be installed via apt-get or yum; if your system lacks those commands, it can be downloaded from the project's website.

2. Configuring the GRUB boot file

This step is critical: it determines the size of each large page and how many pages are allocated. The specific action is to edit the /etc/grub.conf file, as shown in Figure 5.

Figure 5 grub.conf startup script
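Since the figure may not render in this copy, here is a sketch of what the modified kernel line in /etc/grub.conf ends up looking like; the kernel image path and root device are placeholders, and only the appended hugepage parameters (explained below) come from the text:

```shell
# /etc/grub.conf -- kernel line with hugepage parameters appended
# (kernel image path and root device below are placeholders)
kernel /vmlinuz-2.6.32 ro root=/dev/sda1 \
    transparent_hugepage=never default_hugepagesz=1G \
    hugepagesz=1G hugepages=123
```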

Specifically, append several parameters to the end of the kernel line: transparent_hugepage=never default_hugepagesz=1G hugepagesz=1G hugepages=123. Of these four parameters, the last two matter most. hugepagesz sets the size of each large page; we set it to 1 GB, and the other options are 4 KB and 2 MB (2 MB being the default). If the operating system version is too old, setting 1 GB pages may fail, so check your OS version if it does. hugepages sets how many large pages to reserve; our machine has 128 GB of RAM, and we dedicate 123 GB of it to large pages. It is important to note that reserved large pages are invisible to ordinary programs: our system still has 5 GB of normal memory, so starting a program that consumes 10 GB in the usual way will fail. After modifying grub.conf, reboot the system, then run cat /proc/meminfo | grep -i huge to see whether the large-page setting took effect. If it did, you will see something like:

Figure 6 Current Large page consumption

Four values deserve attention. HugePages_Total is the total number of large pages. HugePages_Free is how many large pages remain while the program is running. HugePages_Rsvd is the number of large pages the system currently holds in reserve; more precisely, pages the program has already requested from the system but which have not actually been assigned yet, because the program has not performed any real reads or writes on them. Hugepagesize is the size of each large page, 1 GB in our case.

In our experiments we found that the meanings of Free and Rsvd are not quite what the names suggest. If the large pages requested at startup are insufficient for the program, the system reports an error like the following:

libhugetlbfs: WARNING: New heap segment map at 0x40000000 failed: Cannot allocate memory

Looking at the four values again at that point, you will find HugePages_Free equal to some value a, and HugePages_Rsvd equal to the same a. This is very strange: there appear to be large pages remaining, yet the system reports an allocation failure. After many attempts, we concluded that Free includes the Rsvd pages, so when Free equals Rsvd there are actually no large pages left to hand out; Free minus Rsvd is the number of pages that can genuinely still be allocated. In Figure 6, for example, 16 large pages remain available.
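That accounting can be sketched as follows; the Free and Rsvd values below are made-up stand-ins rather than the figure's actual numbers:

```shell
# Truly allocatable large pages = HugePages_Free - HugePages_Rsvd.
# On a live system you would read the two values with:
#   grep -E 'HugePages_(Free|Rsvd)' /proc/meminfo
free=10    # hypothetical HugePages_Free
rsvd=3     # hypothetical HugePages_Rsvd
available=$(( free - rsvd ))
echo "large pages still allocatable: $available"   # prints: 7
```

When Free equals Rsvd, this difference is zero, which explains the seemingly contradictory allocation failure.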

How many large pages to allocate takes some trial and error. One lesson we learned: having each worker thread allocate its own large pages is wasteful. It is best to allocate all the space in the main thread and then hand it out to the worker threads, which noticeably reduces large-page waste.

3. Run the application

To enable large pages, you cannot start the application in the usual way; you need to start it in the following format:
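The original command line is missing from this copy of the article. Based on the libhugetlbfs documentation, the typical invocation looks like the following sketch, where ./your_program is a placeholder for the actual binary:

```shell
# Preload libhugetlbfs so its malloc interposes on the C library's,
# and ask it to back heap growth (morecore) with large pages.
HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so ./your_program
```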

This loads the libhugetlbfs library in place of the standard allocator; concretely, it replaces the standard malloc with a large-page malloc, so that the memory the program requests is served from large-page memory.

With these three steps, large-page memory is enabled; as you can see, it is quite easy.

6. Large-page memory optimization effect

If your application's memory-access bottleneck is severe, large pages can bring a substantial gain. Our music-recognition system is exactly such an application, so the optimization effect is obvious. Below is the program's performance with and without large pages, on the 250,000-song database.

As the results show, with large-page memory enabled the program's memory-access time drops markedly, an improvement of nearly 50%, which meets our performance requirement.

7. Large-page Memory usage scenarios

Every optimization method has its scope of application, and large-page memory is no exception. As emphasized throughout, large pages bring a clear benefit only to programs that consume a large amount of memory, access it randomly, and are bottlenecked on those accesses. Our music-recognition system consumes close to 100 GB and accesses it in a disorderly fashion, hence the obvious improvement. It is no accident that online examples keep using Oracle: the database consumes an enormous amount of memory, and its insert, delete, update, and query operations lack locality, since they mostly manipulate B-trees, and tree operations generally have poor locality.

What kinds of programs have poor locality? In my experience, programs built on hashing or tree structures often have poor memory locality; if such a program performs badly, large pages are worth a try. Conversely, simple array traversals or breadth-first traversals of graphs have good memory locality and are unlikely to gain from large pages. I once tried enabling large pages for the Sogou speech-recognition decoder, hoping for a performance gain, but the result was disappointing: no improvement, and in fact a slight degradation. The decoder is essentially a breadth-first search over a graph, has good memory locality, and is not bottlenecked on memory access, so large pages only added overhead and hurt performance.

8. Summary

This post used our music-recognition system as an example to describe large-page memory and how to use it in detail. With the rise of big data, applications are processing ever larger volumes of data with ever less regular access patterns, exactly the conditions under which large-page memory applies. So if your program runs slowly and fits the conditions above, give large pages a try: the change is simple, costs nothing, and might bring a pleasant surprise.
