memcached Hash Performance Optimization (8) -- Summary Report


Transferred from: http://m.blog.csdn.net/blog/hzwfz1989/39120005

One: memcached Analysis

The past two months were split between optimizing memcached and looking for a job, and looking back over the optimization code makes this a summer that is hard to forget. The project yielded a great deal: my knowledge of Linux and my understanding of memcached are both a step beyond where they were, and besides modifying the source to add a partitioned (block) hash, replace the LRU algorithm, and change the hash algorithm, I accidentally broke the original code during testing, so my ability to debug with GDB improved as well. It felt like a project worth doing. A brief introduction to memcached is still in order, since everything that follows builds on it.

1. Memory management

"Memory" here refers to slab memory management, a major feature of memcached: the memory for every item is allocated from the slab-managed allocator. The scheme boils down to the following points:
    1. Unlike general-purpose memory management, memcached obtains large blocks of memory from the operating system up front and divides them into chunks of various sizes. Chunk sizes increase gradually from one slab class to the next, by a growth factor the user can specify.
    2. Each slab class contains multiple slabs. A slab is a 1 MB block of memory, and each slab is carved into chunks of its class's fixed chunk size.
    3. To allocate an item from the slab manager, its total size is computed first, then the slab class whose chunk size is the smallest one greater than or equal to that size is chosen, and the item is allocated from it.
    4. The allocation logic is simple: if the class has a free or recycled item, allocate directly from one of its chunks; otherwise try to grab a new slab for the class; failing that, return NULL and let LRU eviction make room. (The class-selection logic is sketched just after this list.)
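Below is a minimal sketch of that class-selection logic in C. The names (slabclass_t fields, slabs_init, slabs_clsid) mirror memcached's but the layout is simplified, so treat it as an illustration of the idea rather than the actual source:

```c
#include <stddef.h>

#define MAX_SLAB_CLASSES 64
#define SLAB_PAGE_SIZE   (1024 * 1024)   /* each slab is 1 MB */

typedef struct {
    size_t   chunk_size;  /* fixed chunk size for this class           */
    unsigned perslab;     /* how many chunks fit in one 1 MB slab      */
} slabclass_t;

static slabclass_t slabclass[MAX_SLAB_CLASSES];
static int num_classes;

/* Build classes with sizes growing by `factor` (user-tunable via
 * memcached's -f option, default 1.25; factor must be > 1.0). */
static void slabs_init(size_t min_chunk, double factor) {
    size_t size = min_chunk;
    num_classes = 0;
    while (num_classes < MAX_SLAB_CLASSES && size <= SLAB_PAGE_SIZE) {
        slabclass[num_classes].chunk_size = size;
        slabclass[num_classes].perslab = SLAB_PAGE_SIZE / size;
        num_classes++;
        size = (size_t)(size * factor);
    }
}

/* Return the first class whose chunk can hold `item_size`, i.e. the
 * smallest chunk >= item_size; -1 means the item is too big. */
static int slabs_clsid(size_t item_size) {
    for (int i = 0; i < num_classes; i++)
        if (slabclass[i].chunk_size >= item_size)
            return i;
    return -1;
}
```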
This kind of allocation has quite a few drawbacks; two examples:
    1. Internal fragmentation: storing a 48-byte item in a 64-byte chunk wastes 16 bytes of the chunk.
    2. Because the slab allocator is a single shared resource, slab_lock must be taken to keep each allocation atomic. Even in a multi-threaded allocation scenario with no cache_lock contention at all, requests to the slab allocator are still serialized one at a time, so this is where insertion ultimately bottlenecks. Splitting the allocator itself into partitions would make all the downstream logic much harder to handle, so that was not attempted here.
2. Hash and LRU

For the hash table and the LRU, memcached takes the simplest possible design: the hash table is implemented as a chained hash, and the LRU as a doubly linked list (the standard diagram of this structure, e.g. the one on Baidu Encyclopedia, shows exactly this layout). The model is a classic, both the chained hash and the doubly linked LRU list, but it has several problems:
    1. Chained hashing is classic, but because entries hang off linked lists, data locality is poor and memory-access efficiency suffers.
    2. With a doubly linked LRU list, both inserts and reads must adjust an item's position in the list, and both operations clearly need the lock. A read by itself would not need locking, but repositioning the item in the LRU does, which is why both do_item_link and do_item_unlink take the cache lock.
    3. If open addressing were used instead, hash lookups might still not be efficient, so the choice of hashing scheme is absolutely critical. (The two intrusive lists are sketched below.)
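For concreteness, here is a simplified sketch of the item layout (the real struct in memcached.h carries more bookkeeping fields); the point is that every item sits on two intrusive lists at once:

```c
#include <stdint.h>
#include <time.h>

/* Simplified sketch of memcached's item: one hash-bucket chain and
 * one doubly linked LRU list share the same node. */
typedef struct _stritem {
    struct _stritem *next;    /* LRU list: toward the tail (older)   */
    struct _stritem *prev;    /* LRU list: toward the head (newer)   */
    struct _stritem *h_next;  /* hash chain: next item in the bucket */
    time_t time;              /* last access time, used by the LRU   */
    uint8_t nkey;             /* key length                          */
    char data[];              /* key followed by the value           */
} item;

/* Because next/prev/h_next all point at scattered heap objects,
 * walking either list chases pointers across unrelated cache lines;
 * this is the locality problem point 1 above describes. */
```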
3. Threading model

The optimization work did not touch this part, so I will only enumerate the setup sequence:
    1. Initialize the main thread's event_base and the worker threads.
    2. Create a notification pipe for each worker thread.
    3. Register a libevent event on each worker thread's pipe.
    4. Initialize the CQ (connection queue) of each worker.
    5. Start the worker threads; the main thread listens for new connections on the listening socket, while the worker threads handle read/write events on accepted sockets. (The pipe-notification mechanism is sketched right after this list.)
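A condensed sketch of the dispatcher-to-worker handoff, assuming the libevent 1.x API that memcached uses; the structure mirrors thread.c but is abbreviated, and the CQ pop is elided:

```c
#include <unistd.h>
#include <pthread.h>
#include <event.h>   /* libevent 1.x API, as used by memcached */

/* Each worker owns a pipe: the dispatcher queues a new connection on
 * the worker's CQ, then writes one byte to wake the worker's loop. */
typedef struct {
    pthread_t thread_id;
    struct event_base *base;
    struct event notify_event;
    int notify_receive_fd;   /* read end of the pipe  */
    int notify_send_fd;      /* write end of the pipe */
} LIBEVENT_THREAD;

/* Worker side: fires when the dispatcher writes to the pipe. */
static void thread_libevent_process(int fd, short which, void *arg) {
    LIBEVENT_THREAD *me = arg;
    char buf[1];
    (void)which;
    if (read(fd, buf, 1) != 1)
        return;
    /* ...pop a queued connection from this worker's CQ and register
     * the socket's read/write events on me->base... */
}

/* Steps 1-3 of the list above, for one worker. */
static void setup_thread(LIBEVENT_THREAD *me) {
    me->base = event_init();
    event_set(&me->notify_event, me->notify_receive_fd,
              EV_READ | EV_PERSIST, thread_libevent_process, me);
    event_base_set(me->base, &me->notify_event);
    event_add(&me->notify_event, 0);
}
```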
For the connection-establishment and request-processing flow, see the diagram at http://bachmozart.iteye.com/blog/344172; I will not repeat the code analysis here, since earlier posts on this blog already cover it.

Two: The Optimizations

1. Block hash optimization

The goal here is to make reads and lookups run well across multiple threads. The main ideas are borrowed from the paper "CPHash: A Cache-Partitioned Hash Table" and boil down to two points:
    1. Partition the hash table into blocks, each handled by a different thread; optionally each thread is pinned to a fixed core, which is the binding approach MemC3 takes.
    2. Each block has its own LRU list and hash table, so every operation on a partition is handled by that partition's own LRU and hash table.
With this design, reads can certainly exploit multiple cores, so read throughput goes up; insert throughput, however, cannot improve, and the main reason is again slab_lock. Slab allocation clearly cannot run in parallel because it is serialized by slab_lock, so however many worker threads there are, they all share a single slab allocator: the insertion bottleneck cannot be removed by multithreading, but reads can be parallelized, so the block hash here mainly improves read performance. Because of the partitioned hash, some of the item-handling logic differs from the original; the code is on the clockdev branch, which combines clock replacement with this multi-hash scheme. A minimal sketch of the partitioning follows.
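The sketch below illustrates the CPHash-style partitioning with invented names (NPARTITIONS, partition_t, key_to_partition); the only point is that each partition owns a private hash table and LRU list, so reads in different partitions never contend:

```c
#include <stdint.h>
#include <pthread.h>

typedef struct _stritem item;      /* from the item sketch above   */

#define NPARTITIONS 8              /* e.g. one per core (assumed)  */
#define HASHPOWER   16             /* buckets per partition        */

typedef struct partition {
    pthread_mutex_t lock;          /* protects only this partition */
    item *buckets[1 << HASHPOWER]; /* private chained hash table   */
    item *lru_head;                /* private LRU list             */
    item *lru_tail;
} partition_t;

static partition_t partitions[NPARTITIONS];

/* The key's hash picks the partition; inside the partition the same
 * hash (low bits) picks the bucket, exactly as in a single table. */
static partition_t *key_to_partition(uint32_t hv) {
    return &partitions[hv % NPARTITIONS];
}

static item **bucket_for(partition_t *p, uint32_t hv) {
    return &p->buckets[hv & ((1 << HASHPOWER) - 1)];
}
```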

2. Clock algorithm optimization

The goal is simple: sacrifice some hit rate and eviction accuracy to improve get/set performance, especially on multiple cores. The classic clock replacement algorithm classifies each page by its access bit A and modified bit M:

Class 1 (A=0, M=0): not recently accessed and not modified; the best page to evict.
Class 2 (A=0, M=1): not recently accessed but modified; not as good a victim.
Class 3 (A=1, M=0): recently accessed but not modified; likely to be accessed again.
Class 4 (A=1, M=1): recently accessed and modified; likely to be accessed again.

The scan proceeds in three steps:
(1) Starting at the current hand position, scan the circular queue for the first Class 1 page (A=0, M=0) and take the first one found as the victim. Access bits are left unchanged during this first pass.
(2) If a full revolution finds no Class 1 page, start a second pass looking for a Class 2 page (A=0, M=1), again taking the first one found. During this second pass, the access bit of every page scanned is cleared to 0.
(3) If the second pass also fails, return the hand to the starting position (all access bits are now 0) and repeat step (1), and if necessary step (2); a victim is now guaranteed to be found.

The concrete implementation uses the following strategies:
    1. Only a timestamp is kept: get and update record the latest access time, and at replacement time the candidate's timestamp is checked; if it has expired the item is replaced, and if not it is skipped.
    2. To support the circular scan, the LRU is changed into a doubly linked circular list and a hand pointer is added that points at the next replacement candidate: if the candidate has expired it is replaced, otherwise the next one is tried, and if a full scan fails the current element is replaced anyway.
    3. During an update the element is not moved to the head; only its access time is refreshed, so the list is never relinked and item_unlink/item_link are not needed. (This scan is sketched below.)
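A sketch of that timestamp-based clock scan, with invented names (clock_item, LEASE_SECONDS, clock_evict); it illustrates the strategy rather than reproducing the project's code:

```c
#include <stddef.h>
#include <time.h>

/* The list is circular and doubly linked; get/update only refresh
 * `atime` instead of relinking the item. */
typedef struct clock_item {
    struct clock_item *next;
    struct clock_item *prev;
    time_t atime;                 /* last access time */
} clock_item;

#define LEASE_SECONDS 60          /* "not expired" window (assumed) */

static clock_item *hand;          /* next replacement candidate */

/* Pick a victim: advance the hand until an expired item is found;
 * after one full revolution, evict whatever the hand points at. */
static clock_item *clock_evict(time_t now) {
    clock_item *start = hand;
    do {
        if (now - hand->atime > LEASE_SECONDS) {
            clock_item *victim = hand;
            hand = hand->next;    /* hand moves past the victim */
            return victim;
        }
        hand = hand->next;
    } while (hand != start);
    /* Nothing expired in a full loop: fall back to the current item. */
    clock_item *victim = hand;
    hand = hand->next;
    return victim;
}
```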
With this change, get performance does go up, though the expiry check does cost some hit rate.

3. Hash algorithm optimization

This uses the hopscotch hashing algorithm; see http://en.wikipedia.org/wiki/Hopscotch_hashing. It is a variant of linear probing whose main purpose is to speed up lookups: plain linear probing needs too many probes when keys collide heavily, and hopscotch hashing bounds the number of probes. Its insertion procedure has three main steps:
    1. Probe the bucket the key maps to; if it is unoccupied, use it directly.
    2. If it is occupied, find an empty position pos by linear probing.
    3. If pos is farther from the home bucket than a given threshold H, displace entries so that an empty slot appears within H-1 of the bucket; if no such displacement is possible, resize the hash table and retry. (The bounded lookup this buys is sketched below.)
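What the insert-side displacement buys is a lookup that never probes more than H slots. Below is a minimal sketch of that bounded lookup, with invented names (hop_entry, H, TABLE_MASK); real hopscotch implementations typically also keep a per-bucket hop bitmap so the scan can skip non-neighborhood slots:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define H 32                      /* neighborhood size (typical)   */

typedef struct {
    bool     used;
    uint32_t hv;                  /* full hash of the stored key   */
    char     key[32];
} hop_entry;

static hop_entry table[1 << 16];
#define TABLE_MASK ((1 << 16) - 1)

/* Hopscotch invariant: every key lives within H-1 slots of its home
 * bucket, so a lookup probes at most H consecutive slots. */
static hop_entry *hop_lookup(const char *key, uint32_t hv) {
    uint32_t home = hv & TABLE_MASK;
    for (uint32_t i = 0; i < H; i++) {       /* bounded probe: <= H */
        hop_entry *e = &table[(home + i) & TABLE_MASK];
        if (e->used && e->hv == hv && strcmp(e->key, key) == 0)
            return e;
    }
    return NULL;                  /* invariant says: not present   */
}
```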
This scheme went into the latest master, but performance turned out disappointing: expanding the hash table becomes very expensive, the timing of expansion becomes harder to predict, and the expansion must hold the lock, during which almost no other requests can be served. When no bucket expansion is needed, performance is better. In this the process resembles MemC3's cuckoo hashing, and MemC3 did in fact discard the background hash-expansion thread; it seems genuinely hard to make expansion efficient for this kind of linear-probing variant.

4. Tag query optimization

This mainly follows the paper "MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing". The project is on GitHub, although I could not get it to compile and run (inserting a key appeared to error out), but the idea itself is sound:
    1. A hash function first computes a 1-byte tag, which is stored directly in the corresponding hash-table entry.
    2. On lookup, the tag is compared first; only if it matches is the full key compared. (A sketch follows.)
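A sketch of tag-first comparison in the MemC3 style, with invented names (item_stub, tag_entry, tag_of); the point is that the 1-byte tag lives in the hash-table entry itself, so most mismatches are rejected without ever dereferencing the item pointer:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

typedef struct {
    uint8_t nkey;
    char    key[255];             /* stands in for the real item   */
} item_stub;

typedef struct {
    uint8_t    tag;               /* 1-byte fingerprint of the key */
    item_stub *it;                /* pointer to the stored item    */
} tag_entry;

static uint8_t tag_of(uint32_t hv) {
    uint8_t t = (uint8_t)(hv >> 24);
    return t ? t : 1;             /* reserve 0 for "empty slot"    */
}

/* Only when the cheap tag check passes do we touch the item's
 * memory (a likely cache miss) to compare the actual key. */
static bool entry_matches(const tag_entry *e, const char *key,
                          uint8_t nkey, uint32_t hv) {
    if (e->tag != tag_of(hv))     /* fast path: no pointer chase   */
        return false;
    return e->it->nkey == nkey &&
           memcmp(e->it->key, key, nkey) == 0;
}
```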
This avoids unnecessary pointer dereferences: when keys are long, comparing full keys is genuinely expensive, and in most cases the tag comparison already fails, so the key's contents never need to be loaded at all. If the key is stored in a separate memory block, dereferencing it is likely to cause a cache miss and hurt performance. This also addresses a weakness of the earlier hash schemes: when colliding keys are chained into one linked list, traversing the whole list in the worst case touches nodes scattered across memory, and performance is even worse.

Three: Results

The main conclusions follow; earlier posts have already presented some of the test results, so here they are described together. Test environment: a 4-core PC with 4 GB of memory as the server, and two 24-core workstations as clients, each client using 48 threads to read and write.

1. Test: PC, single worker thread

This test mainly probes the ceiling. Because of the slab_lock limit, inserts run at about 100,000 ops/sec and reads at about 110,000 ops/sec:
[OVERALL], RunTime(ms), 45906.0
[OVERALL], Throughput(ops/sec), 108917.52712063782
[INSERT], Operations, 4999968
[INSERT], AverageLatency(us), 428.84410700228483
[INSERT], MinLatency(us), 123
[INSERT], MaxLatency(us), 59953
[INSERT], 95thPercentileLatency(ms), 0
[INSERT], 99thPercentileLatency(ms), 1
[INSERT], Return=0, 4999968
2. Test: PC, 8 worker threads

With clock + multihash, inserts average about 90,000 ops/sec:
[OVERALL], RunTime(ms), 53169.0
[OVERALL], Throughput(ops/sec), 94039.15815606838
[INSERT], Operations, 4999968
[INSERT], AverageLatency(us), 497.43873040787463
[INSERT], MinLatency(us), 120
[INSERT], MaxLatency(us), 156936
[INSERT], 95thPercentileLatency(ms), 0
[INSERT], 99thPercentileLatency(ms), 6
[INSERT], Return=0, 4999968
That is roughly the upper limit. With the original memcached, inserts average about 60,000 ops/sec:
[OVERALL], RunTime(ms), 77420.0
[OVERALL], Throughput(ops/sec), 64582.381813484884
[INSERT], Operations, 4999968
[INSERT], AverageLatency(us), 722.7500804005145
[INSERT], MinLatency(us), 120
[INSERT], MaxLatency(us), 118476
[INSERT], 95thPercentileLatency(ms), 1
[INSERT], 99thPercentileLatency(ms), 12

With 2 clients, clock + multihash reads reach about 200,000 ops/sec, basically two single clients stacked; with the original memcached, 2-client reads are also about 200,000 ops/sec, again basically two single clients stacked, a single client testing at about 110,000 ops/sec. The following run is clock + multihash:
[OVERALL], RunTime(ms), 167550.0
[OVERALL], Throughput(ops/sec), 119367.16204118174
[READ], Operations, 19959611
[READ], AverageLatency(us), 389.58720097300494
[READ], MinLatency(us), 94
[READ], MaxLatency(us), 58364
[READ], 95thPercentileLatency(ms), 0
[READ], 99thPercentileLatency(ms), 3
The original comes in slightly lower, but on the same order of magnitude:
[OVERALL], RunTime(ms), 168112.0
[OVERALL], Throughput(ops/sec), 118968.11649376607
[READ], Operations, 19960241
[READ], AverageLatency(us), 390.19616010648366
[READ], MinLatency(us), 97
[READ], MaxLatency(us), 166104
[READ], 95thPercentileLatency(ms), 0
[READ], 99thPercentileLatency(ms), 3
During gets, the memcached process on the PC cannot be driven to 100% CPU because the clients' CPUs are already saturated; total throughput still grows as clients are added, so two clients during gets do not actually reach the PC's processing ceiling, and the true gap in get performance could not be measured. During inserts, by contrast, htop shows CPU utilization does reach 100%. From the CPU utilization visible in htop, clock + multihash looks slightly better on gets, but the test environment is too limited to measure the actual difference between the two. One single-machine trend is clear, however: as the number of threads grows, get performance under the multihash approach barely degrades, while the original's falls off much more noticeably.

Four: Summary

The memcached open-source summer camp has been quite a journey. Along the way I consulted a great deal of material and read many papers and projects, from the early bag-LRU work to MemC3 later on; from studying optimistic locking and multi-version concurrency control to studying the differences between hash and LRU algorithms; from skip-list optimizations to structural modifications. I tried many approaches and took some detours, but learned a lot from the exploration. The process also exposed many of my own shortcomings; I hope to build on this foundation, improve further, and keep researching and contributing back to open-source projects, which I have found to be genuinely worthwhile.
