Pangu is a distributed file system and the earliest-developed service in Alibaba's cloud computing platform, Apsara ("Feitian"). It is named after Pangu, the world-creating giant of ancient Chinese mythology, in the hope of creating a new "cloud world." Within the Apsara platform it is the cornerstone data storage system, hosting a series of cloud services (shown in Figure 1). Pangu's design goal is to aggregate the storage resources of a large number of commodity machines into a storage service with large capacity, high availability, high throughput, and good scalability. Some of Pangu's upper-layer services demand high throughput and expect I/O capability to grow linearly with cluster size, while others, such as elastic computation, demand low latency. As the core module of the underlying platform, Pangu must balance the two, delivering both high throughput and low latency.
Internally, Pangu uses a Master/ChunkServer architecture (shown in Figure 2). The Masters manage metadata; multiple Masters run in a primary-secondaries mode, using the Paxos protocol to ensure high availability of the service. ChunkServers are responsible for the actual data reads and writes, providing data safety through redundant replicas. The Client provides a POSIX-like proprietary API and a rich set of file formats to meet the high-throughput requirements of offline scenarios, the low-latency requirements of online scenarios, and the random-access requirements of special scenarios such as virtual machines.
Since the 5K project, the cluster has expanded rapidly to 5,000 nodes, and scale-related problems have been mounting. The first is Pangu Master IOPS: a larger cluster means more files and more accesses, and the IOPS that upper-layer applications demand from a cluster storing a billion files differs markedly from the demand on a cluster storing a hundred million. At the same time, a larger cluster lets fast-growing upper-layer applications see more possibilities, bringing more businesses onto the cloud and more data into storage, which indirectly raises the demand for IOPS still further. The Master's limited IOPS had previously constrained the rapid expansion of upper-layer businesses, triggering alarms at business peaks, so an upgrade was urgent. Another scale-related problem is the Master's cold-start speed: more files and more chunks mean a longer cold start, which affects cluster availability.
To solve these problems, performance optimization was imperative. But for a complex, large-scale system like Pangu, everything cannot be accomplished in one stroke; different performance bottlenecks must be solved at different stages. Optimization accompanies the whole lifecycle of the system, so the optimization work itself should accumulate experience and tools that ease continuous optimization. Accordingly, when tackling the scale problem, we first built our own lock-profiling tool, which helped us solve performance problems caused by several locks. After resolving the main lock bottlenecks, we optimized the architecture, including pipelining and group commit, with good results. Finally, by digging ever deeper into details and trying repeatedly, we solved the overlong cold-start time caused by scale.
In a complex, large-scale system like Pangu, locks are a problem the optimization process frequently runs into. The first task of optimization is to identify the specific bottleneck, rather than guessing and blindly modifying code. We first applied load to the Pangu Master with a stress-testing tool and observed, via system tools such as top and vmstat, that CPU load was below half the physical cores while context switches ran into the hundreds of thousands; our initial suspicion was lock contention. Colleagues also felt that the locks on some paths had room for optimization, but which locks were actually contended? Which locks were held for a long time? Which were waited on for a long time? Which specific operations acquired them? Lacking reliable data, we could not blindly start; insisting that the data speak is a quality every engineer owes.
The Pangu Master exposes a large number of read and write interfaces, and its internal modules use many locks. We needed to know exactly which type of operation caused serious contention on which lock; the profiling tools we had used before could not easily meet this need, so we built our own "scalpel", a lock-analysis tool, to facilitate the follow-up work. First, to distinguish locks, we named every lock in the code and recorded the name in the lock implementation; to distinguish operations, we gave every operation type a unique number. When a worker thread reads a request from RPC, it writes the operation's type number into the thread's private data; then, throughout the processing of that RPC request, each lock attributes its acquisitions to that operation type.
Inside each lock we maintain a vector of LockPerfRecord entries that record the lock's profile information.
Each operation corresponds to one element of the vector, with the operation number recorded in the thread's private data serving as the vector index. Taken together, the records form the sparse two-dimensional array shown in Table 1, where an empty cell indicates that the operation does not use the corresponding lock.
In the concrete implementation, acquire counts use atomic variables and time measurement uses RDTSC. Many people wince at the mention of RDTSC, citing problems such as CPU frequency scaling and inconsistency across CPUs, but against a backdrop of long time granularity (minutes) and a huge number of calls (billions), these possible effects do not disturb the final result, and many experiments confirmed this. Among other implementation details, a periodic trigger (every n minutes or every x operations) initiates a perf dump without affecting the main flow. The entire tool is around a hundred lines of code and intrudes little on the platform: just add an initialization name for each lock and set the operation number at the entry of each RPC handler.
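The bookkeeping described above can be sketched as follows. This is a minimal Python illustration of the idea, not the actual tool (which is C++); the class and field names are hypothetical, and `time.perf_counter_ns` stands in for RDTSC, while the real implementation uses atomic counters.

```python
import threading
import time

NUM_OP_TYPES = 8          # hypothetical number of RPC operation types
_tls = threading.local()  # thread-private data holding the current op type

def set_op_type(op: int) -> None:
    """Called at the entry of an RPC handler to tag the worker thread."""
    _tls.op = op

class LockPerfRecord:
    """Per-(lock, operation) profile cell: acquire count, wait and hold time."""
    __slots__ = ("acquires", "wait_ns", "hold_ns")
    def __init__(self):
        self.acquires = 0
        self.wait_ns = 0
        self.hold_ns = 0

class ProfiledLock:
    """A named lock whose acquisitions are attributed to the caller's op type."""
    def __init__(self, name: str):
        self.name = name
        self._lock = threading.Lock()
        # one record per operation type: one row of the sparse table
        self.records = [LockPerfRecord() for _ in range(NUM_OP_TYPES)]

    def __enter__(self):
        t0 = time.perf_counter_ns()            # stand-in for RDTSC
        self._lock.acquire()
        rec = self.records[getattr(_tls, "op", 0)]
        rec.acquires += 1                      # atomic variable in the C++ tool
        rec.wait_ns += time.perf_counter_ns() - t0
        self._held_since = time.perf_counter_ns()
        return self

    def __exit__(self, *exc):
        rec = self.records[getattr(_tls, "op", 0)]
        rec.hold_ns += time.perf_counter_ns() - self._held_since
        self._lock.release()
        return False
```

A worker thread would call `set_op_type` once per request and then use `with some_profiled_lock:` everywhere it previously locked directly; a dump of `records` yields the per-lock, per-operation table.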
With this sharp tool in hand, we made various optimizations to the locks that showed performance problems: replacing a single global lock with multiple locks to reduce the probability of collision, shrinking lock scopes, using lock-free data structures, or switching to more lightweight locks. The whole process was very interesting: often, solving one lock bottleneck revealed that the bottleneck had shifted to another lock, which the tool's profile results displayed very intuitively. We optimized the use of locks in the client-session, placement, and other modules with significant results: the CPU could basically run at full load, context switches dropped greatly, and the whole process was thoroughly satisfying.
Once lock contention had been resolved to a certain extent, further lock optimization yielded little additional IOPS. Combining this observation with the business logic, we found architectural changes that could improve overall IOPS.
Read/write separation (separating fast from slow)
The Pangu Master has many external interfaces, which fall into two broad categories, reads and writes, according to whether the operation log must be synced between primary and secondaries. Read operations need no operation-log sync; write operations must sync it to keep data consistent. Because primary and secondaries can switch roles at any time, for every write request from a client the primary must successfully sync the operation log before replying to the client; writes are therefore obviously much slower than reads. What happens if the same thread pool serves both reads and writes at once? A vivid metaphor is a multi-lane highway: if the lanes are not divided by speed and cars doing 20 km/h share a lane with cars doing 120 km/h, overall throughput cannot possibly be high. Borrowing the highway's design, we separated slow from fast: the fast read operations get one thread pool and the time-consuming write operations another, so the two no longer interfere. With this simple split, read IOPS improved significantly while writes were unaffected.
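The split can be illustrated with a minimal Python sketch (Pangu itself is C++; the pool sizes, operation names, and `dispatch` helper here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sizes: reads are plentiful and fast, writes fewer and slow.
read_pool = ThreadPoolExecutor(max_workers=16, thread_name_prefix="read")
write_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="write")

# Illustrative operation names; reads need no operation-log sync.
READ_OPS = {"stat", "list", "get_chunk_locations"}

def dispatch(op: str, handler, *args):
    """Route a request to the pool for its category, so slow writes
    (which must sync the operation log) never block fast reads."""
    pool = read_pool if op in READ_OPS else write_pool
    return pool.submit(handler, *args)
```

The design point is that the two categories share nothing at the threading level: a burst of slow writes can exhaust `write_pool` without delaying a single read.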
After separating reads and writes, read IOPS improved greatly, but write operations still needed work. The basic flow of a write operation is shown in Figure 3.
On the master side, the largest time cost of a write operation is syncing the operation log to the secondaries. The key flaw is that while the operation log is being synced, the worker thread can only wait passively for a considerable time: the sync involves sending the data to the secondaries and the secondaries writing it synchronously to physical disk (rather than merely to the page cache), which takes on the order of milliseconds. The IOPS of write operations therefore could not be high. Having located the problem, we adjusted the structure slightly, as shown in Figure 4.
Similar to an interrupt handler, the whole RPC handling flow is split into a top half and a bottom half. In the top half, the worker thread processes the request and submits the operation log to the oplog server, then moves on to the next request instead of blocking until the oplog sync succeeds. The bottom half, which fills in the response and writes it to the RPC server, is carried out by another thread pool and triggered by the oplog-sync-success message. In this way worker threads keep processing new requests, and write IOPS rises markedly. Data consistency is the same as in the previous implementation, because the response is returned only after the oplog sync fully succeeds.
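A conceptual Python sketch of the top-half/bottom-half split follows (all names are hypothetical stand-ins; the real oplog server syncs to secondaries and writes to disk, which the placeholder comment elides):

```python
import queue
import threading

class OplogServer:
    """Stand-in for the oplog sync path: invokes a callback once the
    log entry has (notionally) been synced to all secondaries."""
    def __init__(self, on_synced):
        self._q = queue.Queue()
        self._on_synced = on_synced
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, entry):
        """Top half: hand over the oplog entry and return immediately,
        freeing the worker thread for the next request."""
        self._q.put(entry)

    def _run(self):
        while True:
            entry = self._q.get()
            # ... sync to secondaries and write to disk would happen here ...
            self._on_synced(entry)   # success message triggers the bottom half

responses = queue.Queue()

def bottom_half(entry):
    """Bottom half, run by a separate pool/thread: fill in the response
    and hand it to the RPC server only after the sync succeeded, so the
    client never sees an acknowledgment for an unsynced write."""
    responses.put(("OK", entry))

oplog = OplogServer(on_synced=bottom_half)
```

A worker thread's handler body shrinks to `oplog.submit(log_entry)` followed by fetching the next request; consistency is preserved because `responses` is fed only from the sync-success path.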
After the pipeline optimization, write performance rose significantly, but we were still not satisfied and wanted to push IOPS further to give upper-layer customers a better result. Continued profiling showed that the throughput of the new oplog-sync implementation was low, mainly because every write request triggers a primary-to-secondaries oplog sync, and the oplog must be written synchronously to disk on both primary and secondaries. Under high pressure this produces a large number of synchronous RPCs and small synchronous disk writes, yielding low throughput. Distributed systems typically use group commit to optimize this problem: oplogs are formed into a group and the whole group is committed at once, which improves throughput markedly. But traditional group commit brings a notable rise in latency, forcing a trade-off between throughput and latency. How to have both the fish and the bear's paw? We refined the group-commit process and solved the problem rather well.
We separate forming groups from syncing groups: a serialize thread forms groups and a sync thread syncs them. The serialize thread acts as producer and the sync thread as consumer, sharing data through a queue. When the serialize thread finds more than M groups waiting to be synced in the queue, it pauses forming new groups; when the sync thread finds fewer than N groups waiting, it wakes the serialize thread to form new groups. Having waited a while, the serialize thread has accumulated a batch of data and can form a larger group, without raising latency. When system load is low the queue is empty, the serialize thread need not wait, and latency is very low; when load is high the queue backs up, and pausing serialization adds no extra latency while allowing larger groups and the throughput benefit they bring. Suitable values of M and N can be determined by stress-testing the serialize and sync operations.
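The watermark scheme might be sketched like this in Python (the class name and the M and N defaults are placeholders; as the text notes, real values come from stress testing):

```python
import threading
from collections import deque

class GroupCommitQueue:
    """Queue between the serialize (producer) and sync (consumer) threads.
    The producer pauses above the high-water mark M and is woken once the
    consumer drains below the low-water mark N. Pausing lets requests pile
    up so the next group is larger, raising throughput, and it only happens
    when the queue is already backlogged, so it adds no latency."""
    def __init__(self, m=8, n=2):          # M > N; tuned by stress tests
        self.m, self.n = m, n
        self._groups = deque()
        self._cv = threading.Condition()

    def put_group(self, group):
        """Called by the serialize thread after forming a group."""
        with self._cv:
            while len(self._groups) > self.m:
                self._cv.wait()            # pause forming new groups
            self._groups.append(group)
            self._cv.notify_all()

    def take_group(self):
        """Called by the sync thread; commits one whole group per sync."""
        with self._cv:
            while not self._groups:
                self._cv.wait()
            group = self._groups.popleft()
            if len(self._groups) < self.n:
                self._cv.notify_all()      # wake the serialize thread
            return group
```

Under light load the queue stays near empty and each group is tiny, so latency matches the non-grouped path; under heavy load groups grow while the producer is paused, trading nothing for the throughput gain.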
Focusing on detail and digging deep is especially important for a large, complex system. Throughout the optimization we encountered many interesting details that, after deep excavation, yielded good results. A representative example is the deep optimization of sniff. During the 5K project the Pangu Master's cold start took hours, and after 5K, with the cluster expanding and the chunk count growing, cold start would take even longer. On a cold start the master must perform sniff operations, collecting the chunk information held on every chunkserver (the number of chunks is on the order of a billion) into several map structures; in a multithreaded environment, inserts and updates to these maps must be protected by locks. The lock-profiling tool confirmed that the bottleneck was the lock-protected maps. Shortening startup time thus reduced to a very concrete detail: how to optimize the read/write performance of a lock-protected map. To reduce lock contention, we adjusted the structure as shown in Figure 5.
We introduced a lock-free queue: all worker threads push their updates into it, and a single thread drains it and applies bulk updates to the maps. This adjustment cut the sniff period from about 400,000 to about 40,000 and reduced the time to half an hour; the effect was significant. But we still felt it was too long. Digging further, we found that CAS operations were unusually frequent and the single update thread had saturated a core; straightforward optimization seemed exhausted, so we examined the business characteristics to see whether some constraints could be relaxed. In the end we found a very interesting detail: during the sniff phase these maps are essentially never read, and only after sniff completes does the first write request need to read them. That means the maps only have to be correct once sniff finishes; staleness in the maps during the sniff process is acceptable. Based on this detail, we had each worker thread cache a map of the same type in TSD (thread-private data), forming the structure in Figure 6.
Each worker thread updates sniff data directly into its thread-private map, which needs no lock protection. Only when the data accumulated in TSD exceeds a certain size, or has sat for a certain time, is it committed to the lock-free queue, so that instead of one queue entry per datum, a whole batch enters the queue at once, improving efficiency. Of course, a new problem appeared: pushing maps into the lock-free queue is driven by incoming sniff data, so once sniff traffic stops there is no data to drive it, and some "settled" data may be stranded in TSD, never submitted to the queue, affecting the correctness of the final result. We introduced a timeout thread to solve this. After the final optimization, sniff completes in a few minutes.
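The thread-private batching can be illustrated in Python (the threshold and function names are hypothetical; `threading.local` plays the role of TSD, and a plain `queue.Queue` stands in for the lock-free queue):

```python
import queue
import threading

FLUSH_THRESHOLD = 1000       # hypothetical batch size before flushing
_tls = threading.local()     # TSD: each worker's private chunk map
batch_queue = queue.Queue()  # stand-in for the lock-free queue

def record_chunk(chunk_id, info):
    """Worker-thread side: update the thread-private map with no locking,
    flushing a whole batch to the queue once it grows large enough."""
    cache = getattr(_tls, "cache", None)
    if cache is None:
        cache = _tls.cache = {}
    cache[chunk_id] = info
    if len(cache) >= FLUSH_THRESHOLD:
        flush()

def flush():
    """Also invoked by a timeout thread after sniff traffic stops, so no
    'settled' data is stranded in TSD."""
    cache = getattr(_tls, "cache", None)
    if cache:
        batch_queue.put(cache)
        _tls.cache = {}

def apply_batches(global_map):
    """Single updater thread: drain whole batches and merge them into the
    global map; staleness during sniff is acceptable by design."""
    while not batch_queue.empty():
        global_map.update(batch_queue.get())
```

Batching amortizes each queue operation over up to `FLUSH_THRESHOLD` entries, which is what cuts the per-entry CAS traffic on the shared queue.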
Summary and Outlook
After this round of optimization, Master IOPS improved severalfold and cold-start time dropped significantly, a good result. In the process we formed some common understandings, which we hope will guide the work ahead.
Insist on identifying bottlenecks with data. Whether a given point is a bottleneck, and whether it needs optimizing, is a fundamental question of direction. Bold conjecture is fine, but the specific bottleneck must be confirmed with data; otherwise the direction is wrong, and resources are subsequently wasted on non-critical paths.
Combine closely with the business logic. System optimization is often not a pure computer-science problem but is highly tied to the business logic; optimizing around the characteristics of the business logic often yields twice the result for half the effort.
Pursue the ultimate. After meeting your expectations, ask yourself whether the theoretical limit has been reached and whether there is more to dig for; sometimes performance can take a qualitative leap.
Performance optimization is not just a piece of work but an attitude of perseverance and excellence. Optimization has no end, and we are still on the road!