A shameful failure was counted as a "Ten-success" by Qianlong"

Source: Internet
Author: User

RAMCloud source code analysis (III)
  
1. Overview
  
This part analyzes the server in the RAMCloud system, focusing on memory management. It corresponds to the paper "Log-structured Memory for DRAM-based Storage" from FAST '14. A log-based memory manager is a copying allocator, and it is designed to meet two requirements: first, fragmentation can be eliminated by copying objects; second, garbage collection can be done without a global scan. The first requirement is easy to meet, since copying alone achieves it. The second requires an index structure that can quickly locate the segment (RAMCloud has a natural index for this, its hash table). The following sections therefore describe the memory management mechanism starting from the hash table and the segment.
  
2. The mmap system call
  
Before discussing the hash table and the segment, it is worth looking at the mmap function, because every memory allocation in RAMCloud goes through it.
  
2.1 mmap principle
  
First, to use the machine's physical memory effectively, the memory is divided into several functional areas during system initialization, as shown in Figure 1. The Linux kernel occupies the start of physical memory. Next comes the buffer cache used for hard disks, floppy disks, and other block devices (minus the 640 KB to 1 MB address range occupied by display adapter memory and the ROM BIOS). When a process reads from a block device, the system first reads the data into the buffer cache; when data is written to a block device, the system first places it in the buffer cache, and the block device driver then writes it to the device. The last part of memory is the main memory area, which all programs can request and use at any time. Kernel code must also request main memory from the kernel's memory management module before using it. On systems with a RAM disk, the beginning of the main memory area is set aside for the RAM disk's data.
  
[Figure 1: division of physical memory into functional areas]
  
Secondly, mmap is a memory-mapped file mechanism: it maps a file or other object into the address space of a process, establishing a one-to-one mapping between the file's disk addresses and a range of virtual addresses in the process. Once the mapping is set up, the process can read and write the region through ordinary pointers. Conversely, changes made to the region from kernel space are directly visible in user space, so files can be shared between processes. As shown in Figure 2, the virtual address space of a process consists of multiple virtual memory areas. The text segment (code segment), initialized data segment, BSS segment, heap, stack, and memory mappings shown in the figure are all independent virtual memory areas. The address space used for memory mappings lies in the free region between the heap and the stack. As shown in Figure 3, the Linux kernel uses the vm_area_struct structure to represent one independent virtual memory area. Because virtual memory areas differ in function and internal mechanism, a process uses multiple vm_area_struct structures to represent its different areas. These structures are linked together in a linked list or a tree so that they can be found quickly. A vm_area_struct contains the start and end addresses of the region and other related information. It also contains a vm_ops pointer, which leads to the operations available for that region. In this way, everything needed for an operation on a virtual memory area can be obtained from its vm_area_struct. The mmap function creates a new vm_area_struct and connects it to the physical disk addresses of the file.
  
[Figure 2: virtual address space of a process and its virtual memory areas]
  
[Figure 3: the vm_area_struct structure]
  
Function prototype: void *mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset);
  
Finally, if the file descriptor fd is set to -1 (together with the MAP_ANONYMOUS flag), the mapping is anonymous: it is not backed by any file, and it can also be used to share memory between related processes. RAMCloud uses this method to request large blocks of memory, which it then uses for segment and hash table storage.
  
2.2 mmap usage
  
In RAMCloud, mmap is wrapped in a thin layer. For details, see src/LargeBlockOfMemory.h in the source tree.
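
To make the idea of a thin wrapper concrete, here is a minimal sketch of an anonymous mapping owned by a small RAII class. The class name AnonymousBlock and its methods are assumptions made for illustration; this is not RAMCloud's actual class, only the standard mmap/munmap calls are real.

```cpp
#include <sys/mman.h>   // mmap, munmap, MAP_ANONYMOUS
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical RAII wrapper (not RAMCloud's class): owns one anonymous,
// private mapping and unmaps it on destruction.
class AnonymousBlock {
  public:
    explicit AnonymousBlock(std::size_t bytes)
        : base(nullptr), length(bytes)
    {
        assert(bytes > 0);
        // fd = -1 plus MAP_ANONYMOUS: the region is backed by zero-filled
        // memory rather than by a file.
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            throw std::runtime_error("mmap failed");
        base = static_cast<std::uint8_t*>(p);
    }

    ~AnonymousBlock() { munmap(base, length); }

    // The mapping has a single owner, so copying is disabled.
    AnonymousBlock(const AnonymousBlock&) = delete;
    AnonymousBlock& operator=(const AnonymousBlock&) = delete;

    std::uint8_t* get() const { return base; }
    std::size_t size() const { return length; }

  private:
    std::uint8_t* base;
    std::size_t length;
};
```

A caller would construct one block of the required size at startup, for example AnonymousBlock block(1 << 20), and then carve seglets or hash table cache lines out of block.get().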
  
3. Hash table
  
The hash table maps keys to objects so that an object's location can be found quickly. A large hash table is divided into many buckets, and each bucket consists of one or more cache lines (64 bytes each). Within a cache line, each entry occupies 8 bytes: the first 16 bits store a secondary hash, 1 bit indicates whether the last entry is a pointer to the next cache line, and the remaining 47 bits store the pointer. The 64-byte size is chosen to match a single L2 cache line in the CPU.
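
The 16 + 1 + 47 bit layout described above can be sketched as a packed 64-bit word. The exact bit positions and all names here (PackedEntry, pack, getSecondaryHash, and so on) are assumptions for illustration and are not taken from the RAMCloud source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical packing of one 8-byte entry, following the layout in the text:
// bits [63..48] secondary hash | bit [47] chain flag | bits [46..0] pointer.
struct PackedEntry {
    std::uint64_t bits;
};

constexpr std::uint64_t kPointerMask = (std::uint64_t{1} << 47) - 1;
constexpr std::uint64_t kChainBit = std::uint64_t{1} << 47;

// A 64-byte cache line holds eight 8-byte entries.
constexpr std::size_t kEntriesPerCacheLine = 64 / sizeof(PackedEntry);
static_assert(kEntriesPerCacheLine == 8, "8 entries per 64-byte cache line");

inline PackedEntry pack(std::uint16_t hash16, bool chain, std::uint64_t ptr)
{
    assert((ptr & ~kPointerMask) == 0);  // pointer must fit in 47 bits
    return PackedEntry{(std::uint64_t{hash16} << 48) |
                       (chain ? kChainBit : std::uint64_t{0}) |
                       ptr};
}

inline std::uint16_t getSecondaryHash(PackedEntry e)
{
    return static_cast<std::uint16_t>(e.bits >> 48);
}

// True if this entry points to the next cache line of the bucket.
inline bool isChainLink(PackedEntry e)
{
    return (e.bits & kChainBit) != 0;
}

inline std::uint64_t getPointer(PackedEntry e)
{
    return e.bits & kPointerMask;
}
```

Comparing the 16-bit secondary hash first lets a lookup skip most non-matching entries without following the pointer to the object.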
  
The cache lines of the hash table are also allocated through the mmap call described above. The server first requests a number of cache lines, determined by its parameters, to form a cache line pool. When a bucket needs one, the server hands it a cache line from the pool; when a cache line is no longer needed, it is returned to the pool. The pool is released when the server shuts down or the process exits.
  
4. Segment
  
RAMCloud stores objects in a log, and the log is divided into fixed-size segments (the size is not strictly fixed: compaction during cleaning changes a segment's size). To support two-level cleaning, each segment is further divided into fixed-size seglets (64 KB by default). A seglet holds two types of entries: ordinary objects and tombstones (which mark updated or deleted objects as invalid). For the exact storage format, see the source code (src/Object.h).
  
The seglets of a segment are likewise allocated through the mmap call described above. The server first requests a number of seglets, determined by command-line parameters, to form a seglet pool. To prevent deadlocks, the seglet pool is divided into three parts, the emergency head pool, the cleaner pool, and the default pool, each used in a different scenario (see the paper or the source code for details). Each segment requests its seglets from the default pool, and when a segment is cleaned, its seglets are released back to the default pool.
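
The three-way split can be sketched as a set of independent counters. This is a simplified model under stated assumptions: the SegletPool class and the Pool enum are hypothetical names, and seglets are only counted rather than handed out as real memory.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical pool names mirroring the text; not RAMCloud's allocator API.
enum class Pool { EmergencyHead = 0, Cleaner = 1, Default = 2 };

class SegletPool {
  public:
    SegletPool(std::size_t emergency, std::size_t cleaner, std::size_t normal)
        : capacity{emergency, cleaner, normal},
          freeCount{emergency, cleaner, normal} {}

    // Take one seglet from the given pool; returns false if it is empty.
    // Pools never borrow from each other, so ordinary writes cannot use up
    // the seglets reserved for the cleaner or the emergency head.
    bool allocate(Pool p) {
        std::size_t& n = freeCount[index(p)];
        if (n == 0)
            return false;
        --n;
        return true;
    }

    // Return one seglet to the pool it was taken from.
    void release(Pool p) {
        assert(freeCount[index(p)] < capacity[index(p)]);
        ++freeCount[index(p)];
    }

    std::size_t available(Pool p) const { return freeCount[index(p)]; }

  private:
    static std::size_t index(Pool p) { return static_cast<std::size_t>(p); }

    std::size_t capacity[3];
    std::size_t freeCount[3];
};
```

Keeping the reserves separate is what makes the deadlock argument work: even when the default pool is empty, the cleaner can still obtain the seglets it needs to free space.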
  
5. Memory Management
  
5.1 log metadata
  
RAMCloud's memory is organized like a log-structured file system, but it uses a hash table to locate data in memory quickly. The log metadata falls into three types:
  
Object metadata
  
Contains the table ID, key, version, and value of an object, the basic unit of storage. During recovery, the hash table is rebuilt from the latest version of each object.
  
Log Digest
  
Each new segment contains a log digest, which lists all the segments that belong to the log. During recovery, the most recent log digest is used to find and load all of the log's data.
  
Tombstone
  
This entry is special: it represents a deleted object. Once an entry has been written to the log it cannot be modified, so how is an object deleted? By appending a tombstone record to the log. The record contains the table ID, key, and version of the deleted object. Tombstones are not used during normal operation, but during recovery they ensure that deleted objects are not rebuilt.
  
The tombstone mechanism is simple, but it has its problems. One of them is garbage collection of the tombstones themselves, since they too must eventually be removed from the log. A tombstone can be removed only after the object it refers to has itself been removed from the log; otherwise it would defeat the purpose of tombstones, which is to prevent deleted objects from being rebuilt. When the cleaner processes a tombstone, it checks whether the segment the tombstone refers to is still in the log. If not, the object is gone and the tombstone can be dropped. Otherwise, the segment may still hold the object and the tombstone must be kept.
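
The cleaner's check can be sketched as a set lookup. The Tombstone struct below follows the fields named in the text plus the segment reference implied by the check; the field names, the segment ID type, and the function tombstoneIsLive are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>

// Hypothetical tombstone record: identifies the deleted object and the
// segment that held it.
struct Tombstone {
    std::uint64_t tableId;
    std::string key;
    std::uint64_t objectVersion;
    std::uint64_t segmentId;  // segment that contained the deleted object
};

// A tombstone must be kept while the segment it refers to is still part of
// the log, because the deleted object could otherwise be rebuilt during
// recovery. Once that segment is gone, the tombstone can be dropped.
inline bool tombstoneIsLive(const Tombstone& t,
                            const std::unordered_set<std::uint64_t>& segmentsInLog)
{
    return segmentsInLog.count(t.segmentId) != 0;
}
```

The cleaner would copy a tombstone forward only when tombstoneIsLive returns true and otherwise skip it, which is how obsolete tombstones disappear from the log.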
  
5.2 two-level cleaning
  
In log-based memory allocation, the main bottleneck is the log cleaner. Defragmentation, that is, reclaiming memory, is expensive and calls for careful design. The cost of reclamation grows as memory utilization increases.
  
For memory, bandwidth is not the problem; the concern is space utilization, since DRAM is relatively expensive. For disks, bandwidth is the expensive resource. Many system bottlenecks are caused by disk I/O speed, which has given rise to many memory-based solutions, and it is a problem RAMCloud must solve as well. RAMCloud therefore uses two different reclamation policies for memory and disk.
  
Two-level cleaning ensures that cleaning memory does not affect the backups. In general, memory utilization is higher than disk utilization: memory utilization can reach 90%, while disk utilization is much lower. This is expected, because the disk is cleaned less often.
  
The first level of cleaning, called segment compaction, operates only on in-memory segments and consumes no network or disk I/O. Each time it compacts a segment, it copies the segment's live data into a smaller segment so that the freed space can be used for other purposes. Segment compaction keeps the logical log identical in memory and on disk; the in-memory copy simply occupies less space, because deleted objects and obsolete tombstones have been removed from it.
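
A rough sketch of what compaction gains, assuming the default 64 KB seglet size mentioned earlier. The Entry struct and the functions compact, totalBytes, and segletsFor are hypothetical, and entries are modeled only by size and liveness rather than as real log records.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kSegletBytes = 64 * 1024;  // default seglet size from the text

// Hypothetical log entry: only its size and liveness matter here.
struct Entry {
    std::size_t bytes;
    bool live;
};

inline std::size_t totalBytes(const std::vector<Entry>& segment) {
    std::size_t sum = 0;
    for (const Entry& e : segment)
        sum += e.bytes;
    return sum;
}

// Number of whole seglets needed to hold the given number of bytes.
inline std::size_t segletsFor(std::size_t bytes) {
    return (bytes + kSegletBytes - 1) / kSegletBytes;
}

// Copy only the live entries into a new, smaller segment. Dead objects and
// obsolete tombstones are simply not copied.
inline std::vector<Entry> compact(const std::vector<Entry>& segment) {
    std::vector<Entry> survivor;
    for (const Entry& e : segment) {
        if (e.live)
            survivor.push_back(e);
    }
    assert(totalBytes(survivor) <= totalBytes(segment));
    return survivor;
}
```

For example, a segment holding 40 KB and 30 KB of live data next to 150 KB of dead data occupies 4 seglets before compaction and 2 afterwards, so 2 seglets return to the pool without any disk or network traffic.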
  
The second level is called combined cleaning. As the name suggests, it cleans both the disk and memory. Because segment compaction has already run, cleaning at this level is efficient. And because this level is deferred, more deletions accumulate and can be processed together.
  
5.3 parallel cleaning
  
RAMCloud also takes advantage of multiple CPU cores by running the cleaner in several threads. The log structure and the simple metadata make it easy to run cleaners in parallel. Because log entries are immutable, the cleaner does not have to worry about objects being modified while it copies and moves them. And because objects are reached indirectly through the hash table, updating an object's reference in the hash table is straightforward. The basic cleaning mechanism is therefore simple: the cleaner copies live data to new segments, updates the object references in the hash table, and finally frees the cleaned segments.
  
When cleaner threads run alongside service threads that process read and write requests, three kinds of conflict can arise:
  
A. Both need to append data to the log head.
  
B. Conflicts may occur when updating the hash table.
  
C. The cleaner cannot free a segment while a service thread is still using it.
  
Concurrent log updates
  
The simplest way to clean would be to append the data being moved to the log head, but this would conflict with the service threads' write requests. To avoid this, RAMCloud moves the data into new segments that form a side log. Each cleaner allocates a group of segments to hold the data it moves; synchronization is needed only when these segments are allocated. After that, each cleaner works on its own segments without further synchronization. When the move is finished, the cleaner adds these segments to the next log digest, and the reclaimed segments are removed from the log digest. These segments are also replicated to backups different from the ones that store the head segment, so they can be written to different disks, which increases the master's throughput.
  
Hash table conflicts
  
Because the hash table is used by the cleaner and the service threads at the same time, conflicts can occur. The hash table records which objects are live and where they are, so the cleaner uses it to decide whether an object is alive (by checking whether the hash table entry still points to that object). Meanwhile, a service thread may be using the hash table to read or delete an object. RAMCloud synchronizes these accesses with fine-grained locks on individual hash buckets.
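
The liveness check and the per-bucket locking can be sketched together. This is a simplified model, not RAMCloud's hash table: the ObjectIndex class, its methods, and the use of std::unordered_map inside each bucket are assumptions chosen to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical index with one lock per bucket, so the cleaner and service
// threads only contend when they touch the same bucket.
class ObjectIndex {
  public:
    explicit ObjectIndex(std::size_t bucketCount) : buckets(bucketCount) {
        assert(bucketCount > 0);
    }

    // Service path: record the current location of an object.
    void put(const std::string& key, const void* location) {
        Bucket& b = bucketFor(key);
        std::lock_guard<std::mutex> guard(b.lock);
        b.entries[key] = location;
    }

    // Service path: find an object, or nullptr if it does not exist.
    const void* lookup(const std::string& key) {
        Bucket& b = bucketFor(key);
        std::lock_guard<std::mutex> guard(b.lock);
        auto it = b.entries.find(key);
        return it == b.entries.end() ? nullptr : it->second;
    }

    // Cleaner path: the copy at oldLocation is live only if the index still
    // points at it. If so, redirect the entry to newLocation.
    bool relocateIfLive(const std::string& key, const void* oldLocation,
                        const void* newLocation) {
        Bucket& b = bucketFor(key);
        std::lock_guard<std::mutex> guard(b.lock);
        auto it = b.entries.find(key);
        if (it == b.entries.end() || it->second != oldLocation)
            return false;  // overwritten or deleted: the old copy is dead
        it->second = newLocation;
        return true;
    }

  private:
    struct Bucket {
        std::mutex lock;
        std::unordered_map<std::string, const void*> entries;
    };

    Bucket& bucketFor(const std::string& key) {
        return buckets[std::hash<std::string>{}(key) % buckets.size()];
    }

    std::vector<Bucket> buckets;
};
```

Doing the comparison and the update under the same bucket lock means a concurrent overwrite or delete by a service thread cannot be undone by the cleaner.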
  
Time to release memory space
  
Once the cleaner has finished cleaning a segment, the segment's space can be freed or reused. At that point no new service thread will see the segment's data, because no hash table entry points into it. However, a service thread that started earlier may still be using the segment, so freeing it immediately could cause serious problems.
  
RAMCloud handles this with a very simple mechanism: the system does not free the segment until all service thread requests that are currently in progress have completed. This design is simple and avoids taking locks on the normal read and write paths.
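
One way to implement "wait until all in-flight requests have finished" is an epoch counter. The sketch below is an assumption about how such a scheme could look, not a description of RAMCloud's code; the DeferredReleaser class and its methods are hypothetical, and the model is single-threaded for brevity (a real version would need atomics or locking).

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

class DeferredReleaser {
  public:
    // A service request records the epoch at which it started.
    std::uint64_t beginRequest() {
        active.insert(currentEpoch);
        return currentEpoch;
    }

    void endRequest(std::uint64_t epoch) {
        auto it = active.find(epoch);
        assert(it != active.end());
        active.erase(it);  // erase a single instance, not all equal values
    }

    // The cleaner retires a segment: tag it with the current epoch, then
    // advance the epoch so that later requests are distinguishable.
    void retire(int segmentId) {
        retired.emplace_back(currentEpoch, segmentId);
        ++currentEpoch;
    }

    // Return the segments that no in-flight request can still reference:
    // those whose tag is older than the oldest active request.
    std::vector<int> collectReleasable() {
        std::vector<int> releasable;
        std::vector<std::pair<std::uint64_t, int>> stillRetired;
        for (const auto& r : retired) {
            bool inUse = !active.empty() && *active.begin() <= r.first;
            if (inUse)
                stillRetired.push_back(r);
            else
                releasable.push_back(r.second);
        }
        retired.swap(stillRetired);
        return releasable;
    }

  private:
    std::uint64_t currentEpoch = 0;
    std::multiset<std::uint64_t> active;                  // epochs of in-flight requests
    std::vector<std::pair<std::uint64_t, int>> retired;   // (epoch tag, segment id)
};
```

Requests only record a number when they start and finish, so the read and write paths never take a lock on behalf of the cleaner, which matches the motivation given above.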
  
Release disk space
  
When a segment has been cleaned, its replicas on the backups must also be deleted. This can happen only after the segments holding the relocated data have been correctly written to the on-disk log, which takes two steps. First, the merged segments must be fully replicated on the backups; because replication is asynchronous, the cleaner has to wait until it has received all the responses. Second, a new log digest must be written that includes the merged segments and omits the cleaned ones. Only after this data is persistent can the old replicas be safely deleted.
  
5.4 avoiding cleaner deadlocks
  
Because cleaning itself requires extra memory, when memory utilization is high the cleaner may consume the last free memory and run out, producing a deadlock. To avoid this, RAMCloud first reserves a pool of seglets for the cleaner (the cleaner pool mentioned earlier), which prevents deadlocks caused by seglet allocation. In addition, when cleaning a segment, the cleaner calculates whether the pass will actually improve space utilization. In the example of Figure 4, cleaning produces more fragmentation and therefore wastes space (see the paper for the solution). After cleaning, and before the cleaned segments are released, RAMCloud may need to write a new log digest that adds the side log's segments and removes the old ones. Running out of resources at this point could also cause a deadlock, so RAMCloud sets aside a dedicated emergency head pool. With these techniques, RAMCloud can reach 98% space utilization without deadlock.
  
[Figure 4: example in which cleaning increases fragmentation]
  
6. Summary
  
Through the mechanisms above, RAMCloud eliminates memory fragmentation and performs garbage collection without a global scan. It makes full use of memory bandwidth and disk capacity and improves memory utilization.
