Tachyon is a fast-growing young project in the Spark ecosystem. In essence, Tachyon is a distributed in-memory file system that reduces Spark's memory pressure while giving Spark the ability to read and write large amounts of data quickly. Tachyon separates memory storage from Spark so that Spark can focus on the computation itself, achieving higher execution efficiency through a finer division of labor. This article first introduces how Tachyon is used in the Spark ecosystem, then shares the performance improvements Baidu has achieved with Tachyon on its big data platform, along with some problems encountered in practice and their solutions. Finally, it introduces several new features of Tachyon.
Tachyon Introduction
Spark, which achieves high performance through distributed in-memory computing, has recently attracted wide attention in industry, and its open source community is very active. Baidu, for example, runs Spark clusters of thousands of nodes on its internal computing platform, and also offers Spark as a service through its BMR open cloud platform. However, distributed in-memory computing is a double-edged sword: while it improves performance, it must also face the problems that come with distributed data storage. The main problems are as follows:
When two Spark jobs need to share data, they must write it to disk. For example, job 1 first writes its output to HDFS, and job 2 then reads it back from HDFS. These disk reads and writes can become a performance bottleneck.
Since Spark caches data inside its own JVM, the cached data is lost when a Spark program crashes and its JVM process exits, so the data must be read again from HDFS when the job restarts.
When two Spark jobs need to operate on the same data, each job's JVM must cache its own copy. This not only wastes resources but also easily triggers frequent garbage collection, degrading performance.
Careful analysis of these issues shows that the root cause is data storage: because the computing platform tries to manage storage itself, Spark cannot focus on the computation, and overall execution efficiency drops. Tachyon was designed to solve exactly these problems. In essence, Tachyon is a distributed in-memory file system that reduces Spark's memory pressure while giving Spark the ability to read and write large amounts of data quickly. Tachyon separates storage and data access from Spark, making Spark more focused on computing itself and achieving higher execution efficiency through a finer division of labor.
Figure 1: Tachyon deployment
Figure 1 shows Tachyon's deployment structure. Tachyon sits below the computing platforms (Spark, MapReduce) and above the storage platforms (HDFS, S3). By cleanly isolating the computing platform from the storage platform, Tachyon can effectively address the issues listed above:
When two Spark jobs need to share data, they no longer need to write it to disk; instead they read and write it through Tachyon's memory, which increases computational efficiency (a minimal sketch follows this list).
Once data is cached in Tachyon, it is not lost even if the Spark program crashes and its JVM process exits, so when the Spark job restarts, the data can be read directly from Tachyon's memory.
When two Spark jobs need to operate on the same data, they can both obtain it directly from Tachyon without each caching a copy, reducing JVM memory pressure and the frequency of garbage collection.
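To make this concrete, here is a minimal sketch of two Spark jobs sharing data through Tachyon instead of HDFS. It assumes a Tachyon master reachable at master:19998 (19998 is Tachyon's default master port) and that the Tachyon client jar is on Spark's classpath; the hostnames and paths are placeholders for your own deployment.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShareViaTachyon {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("share-via-tachyon"))

    // Job 1: write intermediate results to Tachyon (memory) rather than HDFS.
    val errors = sc.textFile("hdfs://namenode:9000/input/logs")
                   .filter(_.contains("ERROR"))
    errors.saveAsTextFile("tachyon://master:19998/tmp/errors")

    // Job 2 (possibly a separate application): read the shared data back from
    // Tachyon memory, avoiding the disk round trip through HDFS.
    val shared = sc.textFile("tachyon://master:19998/tmp/errors")
    println(s"shared error lines: ${shared.count()}")

    sc.stop()
  }
}
```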
Tachyon System Architecture
The previous section introduced Tachyon's design; this section briefly looks at its system architecture and implementation. Figure 2 shows how Tachyon is deployed on the Spark platform. Overall, Tachyon has three main components: the master, the client, and the workers. A Tachyon worker is deployed on each Spark worker node, and Spark workers read and write data by accessing Tachyon through the Tachyon client. All Tachyon workers are managed by the Tachyon master, which uses the workers' timed heartbeats to determine whether a worker has crashed and how much memory each worker has left.
Figure 2: Tachyon deployment on the Spark platform
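The heartbeat bookkeeping can be pictured with a small illustrative sketch (not Tachyon's actual code): the master keeps a per-worker record and flags any worker whose heartbeat has been silent past a timeout.

```scala
import scala.collection.mutable

// Per-worker record kept by the master: last heartbeat time plus used and
// total storage, from which the remaining memory is derived.
case class WorkerInfo(var lastHeartbeatMs: Long,
                      var usedBytes: Long,
                      capacityBytes: Long) {
  def freeBytes: Long = capacityBytes - usedBytes
}

class MasterWorkerRegistry(timeoutMs: Long = 10000L) {
  private val workers = mutable.Map.empty[Long, WorkerInfo]

  // Called for each timed heartbeat a worker sends.
  def onHeartbeat(workerId: Long, usedBytes: Long, capacityBytes: Long): Unit =
    synchronized {
      val info = workers.getOrElseUpdate(
        workerId, WorkerInfo(0L, usedBytes, capacityBytes))
      info.lastHeartbeatMs = System.currentTimeMillis()
      info.usedBytes = usedBytes
    }

  // Checked periodically: a worker silent for longer than the timeout is
  // considered crashed.
  def lostWorkers(): Seq[Long] = synchronized {
    val now = System.currentTimeMillis()
    workers.collect { case (id, w) if now - w.lastHeartbeatMs > timeoutMs => id }.toSeq
  }
}
```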
Figure 3 shows the structure of the Tachyon master, which has three main functions. First, the Tachyon master is a manager that handles requests from the various clients; this work is done by the service handler. Such requests include obtaining worker information, reading a file's block information, creating a file, and so on. Second, the Tachyon master is a name node that holds the metadata for all files; the information for each file is encapsulated in an inode, and each inode records all of the blocks belonging to that file. In Tachyon, the block is the smallest unit of file system storage: assuming each block is 256 MB, a 1 GB file is split into 4 blocks. Each block may have multiple replicas stored on multiple Tachyon workers, so the master must also record the address of each worker storing the block. Third, the Tachyon master manages all of the workers. Each worker periodically sends a heartbeat to the master reporting its liveness and remaining storage space, and the master's worker info records each worker's last heartbeat time, memory used, and total storage space.
Figure 3: Tachyon master design
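For illustration, a simplified version of this metadata might look as follows; the block ID scheme and the 256 MB block size are just the example values used above, not fixed constants of Tachyon.

```scala
// One block of a file and the set of workers holding a replica of it.
case class BlockMeta(blockId: Long, var workers: Set[String] = Set.empty)

// A file's inode: the master derives the block list from the file length.
case class Inode(fileId: Long, path: String, lengthBytes: Long,
                 blockSizeBytes: Long = 256L * 1024 * 1024) {
  val blocks: Vector[BlockMeta] = {
    val n = math.ceil(lengthBytes.toDouble / blockSizeBytes).toInt
    // Hypothetical block ID scheme, purely for the sketch.
    Vector.tabulate(n)(i => BlockMeta(fileId * 1000 + i))
  }
}

object InodeDemo extends App {
  val inode = Inode(fileId = 1, path = "/data/part-0",
                    lengthBytes = 1024L * 1024 * 1024)
  println(inode.blocks.size) // a 1 GB file yields 4 blocks of 256 MB
}
```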
Figure 4 shows the structure of the Tachyon worker, which is primarily responsible for storage management. First, the Tachyon worker's service handler processes requests from clients, including reading a block's information, caching a block, locking a block, requesting space in the local memory store, and so on. Second, the worker's main component is the worker storage, which manages the local data (the local in-memory file system) and the under file system (the disk file system below Tachyon, such as HDFS). Third, the Tachyon worker also has a data server to handle read and write requests initiated by other clients. When a request arrives, Tachyon first looks for the data in the local memory store; if it is not found there, it tries the memory stores of the other Tachyon workers. If the data is not in Tachyon at all, it is read from the disk file system (HDFS) through the under file system interface.
Figure 4: Tachyon worker design
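The three-step lookup order can be sketched as follows; the BlockSource trait and its implementations are illustrative placeholders for the local memory store, the other workers' memory stores, and the under file system.

```scala
// Anything that can try to produce a block's bytes.
trait BlockSource { def read(blockId: Long): Option[Array[Byte]] }

class ReadPath(local: BlockSource,
               remoteWorkers: Seq[BlockSource],
               underFs: BlockSource) {
  // Lookup order described above: local memory, then other workers'
  // memory, then the disk file system (e.g. HDFS) as a last resort.
  def readBlock(blockId: Long): Option[Array[Byte]] =
    local.read(blockId)
      .orElse(remoteWorkers.view.flatMap(_.read(blockId)).headOption)
      .orElse(underFs.read(blockId))
}
```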
Figure 5 shows the structure of the Tachyon client, whose main function is to present an abstract file system interface to the user, hiding the underlying implementation details. First, the Tachyon client interacts with the Tachyon master through its master client component, for example to query the master for the location of a file's block. The client also interacts with the Tachyon workers through its worker client component, for example to request storage space from a worker. The most important part of the client implementation is the Tachyon file. Under the Tachyon file, the block out stream is implemented mainly for writing local memory files, while the block in stream is responsible for reading memory files. The block in stream has two implementations: the local block in stream reads local memory files, while the remote block in stream reads non-local files. Note that "non-local" can mean a file in another Tachyon worker's memory or a file in the under file system.
Figure 5: Tachyon client design
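A schematic of the client's choice between the two in-stream implementations might look like this; the class names mirror the article's description rather than Tachyon's exact API, and the ramdisk path is a made-up placeholder.

```scala
sealed trait BlockInStream
case class LocalBlockInStream(memoryFilePath: String)         extends BlockInStream
case class RemoteBlockInStream(source: String, blockId: Long) extends BlockInStream

object TachyonFileSketch {
  // The master client has already resolved blockId -> worker host
  // (None means the block only exists in the under file system).
  def openBlock(blockId: Long,
                localHost: String,
                location: Option[String]): BlockInStream =
    location match {
      case Some(host) if host == localHost =>
        LocalBlockInStream(s"/mnt/ramdisk/$blockId") // local memory file
      case Some(host) =>
        RemoteBlockInStream(host, blockId)           // another worker's memory
      case None =>
        RemoteBlockInStream("under-fs", blockId)     // fall back to HDFS/S3
    }
}
```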
Now let us string all the parts together with a simple scenario. Suppose a Spark job initiates a read request. It first queries the Tachyon master, through the Tachyon client, for the location of the needed block. If the block is not on the local Tachyon worker, the client sends a read request to another Tachyon worker through the remote block in stream; while the block is being read in, the client also writes it to the local memory store through the block out stream, so that the next request for it can be served locally.
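The "cache while reading" step can be sketched as a simple tee over the two streams; here the plain input stream stands in for the remote block in stream and the output stream for the local block out stream.

```scala
import java.io.{ByteArrayOutputStream, InputStream, OutputStream}

object ReadAndCache {
  def readAndCache(remote: InputStream, localOut: OutputStream): Array[Byte] = {
    val result = new ByteArrayOutputStream()
    val chunk  = new Array[Byte](64 * 1024)
    var n = remote.read(chunk)
    while (n != -1) {
      result.write(chunk, 0, n)   // hand the bytes to the computation
      localOut.write(chunk, 0, n) // simultaneously cache the block locally
      n = remote.read(chunk)
    }
    localOut.close()
    result.toByteArray
  }
}
```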
Tachyon in Use Inside Baidu
Inside Baidu, we use Spark SQL for big data analysis. Since Spark is a memory-based computing platform, we expected most queries to complete within seconds or tens of seconds, enabling interactive querying. In operating the Spark platform, however, we found that queries took hundreds of seconds to complete, for the following reason: our computing resources (data center 1) and the data warehouse (data center 2) may not be in the same data center, in which case every query may need to read data from the remote data center. Because of the limited network bandwidth and high latency between the data centers, each such query took a long time (over 100 seconds) to complete. Worse still, many queries are highly repetitive, and the same data is likely to be queried many times; reading it from the remote data center every time inevitably wastes resources.
To solve this problem, we used Tachyon to cache data locally and avoid data transfer across data centers wherever possible. With Tachyon deployed in the data center where Spark resides, a cold query still pulls data from the remote data warehouse, but when the same data is queried again, Spark reads it from Tachyon within the same data center, greatly improving query performance. Our experiments showed that reading data from a non-local Tachyon worker cut query time to 10-15 seconds, 10 times faster than before; in the best case, reading from the local machine's Tachyon, a query took only 5 seconds, 30 times faster than before. The effect was quite dramatic.
With this optimization, hot-query performance met the requirements of interactive querying, but the user experience of cold queries was still poor. Analyzing user behavior, we found that query patterns are quite fixed: for example, many users run the same query every day, with only the date filter changing. Exploiting this, we can run the queries users will need offline, ahead of time, pre-loading the required data into Tachyon and thereby sparing users from cold queries (a hypothetical sketch of such a job appears after Figure 6).
Figure 6: Tachyon deployment on Baidu's big data platform
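A hypothetical version of such a pre-query job, written against the modern SparkSession API for brevity; the table and column names (logs, dt, channel) are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import java.time.LocalDate

object PrefetchJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tachyon-prefetch").getOrCreate()

    // The recurring user queries differ only in the date filter.
    val today = LocalDate.now().toString

    // Running the query once, before users arrive, forces the scanned
    // blocks into Tachyon so the users' own queries hit warm memory.
    spark.sql(
      s"SELECT channel, COUNT(*) FROM logs WHERE dt = '$today' GROUP BY channel"
    ).count() // materialize the result to trigger the reads

    spark.stop()
  }
}
```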
We also ran into problems while using Tachyon. When we first deployed it, we found that data was not being cached at all: the first query and all subsequent queries took the same amount of time. Reading the source code (Figure 7) revealed why: a block is cached only if it is read in its entirety; otherwise the caching operation is aborted. For example, if a block is 256 MB and a query reads only 255 MB of it, the block will not be cached. Inside Baidu, much of our data is stored in columnar formats such as ORC and Parquet files, and each query reads only some of the columns, so it never reads a full block and block caching always failed. To solve this, we modified Tachyon: on a cold query, if the block is not too large, we read the entire block even if the user requested only a few columns, guaranteeing that the whole block is cached so that subsequent queries can read directly from Tachyon. With this modified version, Tachyon achieved what we expected, and most queries completed within 10 seconds.
Figure 7: Tachyon's data caching logic
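Schematically, the stock rule and our modification can be expressed as follows; the "not too large" threshold is an assumed value for the sketch, not the one we actually used.

```scala
object BlockCachePolicy {
  val BlockSize: Long          = 256L * 1024 * 1024
  val ForceFullReadLimit: Long = 512L * 1024 * 1024 // assumed threshold

  // Stock behavior: a block is cached only if it was read in full.
  def shouldCache(bytesRead: Long, blockSize: Long): Boolean =
    bytesRead == blockSize

  // Our modification: on a cold query against a reasonably small block,
  // read the whole block even if only a few columns were requested, so
  // shouldCache() succeeds and later queries hit Tachyon.
  def bytesToRead(requested: Long, blockSize: Long, coldQuery: Boolean): Long =
    if (coldQuery && blockSize <= ForceFullReadLimit) blockSize else requested
}
```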
Some New Features of Tachyon
We use Tachyon as a cache, but each machine has limited memory, and memory fills up quickly. If we have 50 machines, each contributing 20 GB of memory to Tachyon, the total cache is only 1 TB, far from meeting our needs. The latest version of Tachyon adds a new feature, hierarchical storage, which caches data in tiers across different storage media. As Figure 8 shows, the design is analogous to a CPU cache: memory has the fastest reads and writes, so it serves as the level 0 cache; SSDs serve as the level 1 cache; and local disk serves as the bottom tier. This design gives us much more cache space: with the same 50 machines, each can now contribute 20 TB, for a total of 1 PB, which basically meets our storage needs. As with a CPU cache, if Tachyon's block replacement policy is well designed, 99% of requests can be satisfied by the level 0 cache (memory), keeping response times within seconds most of the time.
Figure 8: Tachyon hierarchical storage
When Tachyon receives a read request, it first checks whether the data is in level 0. If it hits, the data is returned directly; otherwise the next tier is queried, and so on until the requested data is found. The data is then returned to the user and at the same time promoted to the level 0 cache, while the block it displaces from level 0 moves down to the next tier under the LRU algorithm. If the user requests the same data again, it is served quickly from level 0, fully exploiting the locality of the cache.
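A self-contained sketch of this tiered read path, using an insertion-ordered map as a simple LRU; this illustrates the policy described above, not Tachyon's implementation.

```scala
import scala.collection.mutable

// One tier of the hierarchy (index 0 = memory, 1 = SSD, 2 = disk).
// Insertion order doubles as LRU order: touch() re-inserts on every hit,
// so the head of the map is always the least recently used block.
class Tier(val capacityBlocks: Int) {
  val blocks = mutable.LinkedHashMap[Long, Array[Byte]]()
  def touch(id: Long): Unit = blocks.remove(id).foreach(d => blocks.put(id, d))
  def isFull: Boolean = blocks.size >= capacityBlocks
  def evictLru(): Option[(Long, Array[Byte])] =
    blocks.headOption.map { case (id, d) => blocks.remove(id); (id, d) }
}

class TieredStore(tiers: Vector[Tier]) {
  // Read: probe level 0 downwards; on a hit in a lower tier, promote the
  // block to level 0 and let level 0's LRU victim sink one level down.
  def read(id: Long): Option[Array[Byte]] = {
    val hit = tiers.indexWhere(_.blocks.contains(id))
    if (hit < 0) None
    else {
      val data = tiers(hit).blocks(id)
      if (hit > 0) { tiers(hit).blocks.remove(id); insertAt(0, id, data) }
      else tiers(0).touch(id)
      Some(data)
    }
  }

  // Insert into a tier, displacing LRU victims one level down as needed; a
  // victim pushed past the bottom tier is simply dropped (its contents
  // still live in the under file system).
  private def insertAt(level: Int, id: Long, data: Array[Byte]): Unit = {
    if (level >= tiers.size) return
    if (tiers(level).isFull)
      tiers(level).evictLru().foreach { case (vid, vd) => insertAt(level + 1, vid, vd) }
    tiers(level).blocks.put(id, data)
  }
}
```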
When Tachyon receives a write request, it first checks whether there is enough space in level 0. If so, it writes the data and returns. Otherwise, it queries the tiers below until it finds one with enough space, then pushes a block down from the tier above it using the LRU algorithm, and so on, until level 0 has enough room for the new data, at which point the write completes and returns. The goal is to guarantee that data is written to level 0, so that a read immediately following the write is fast. The write itself, however, may perform poorly: for example, if the top two tiers are full, a block must first be displaced from level 1 to level 2, then another from level 0 to level 1, before the data can finally be written to level 0 and control returned to the user.
We therefore made an optimization: instead of displacing blocks tier by tier to make room, our algorithm writes the data directly to the first cache tier with sufficient space and returns to the user immediately. If every tier is full, a block in the bottom tier is replaced and the data is written to the bottom tier. Our experiments show that this optimization reduces write latency by about 50%, greatly improving write efficiency. Read performance does not suffer either: in Tachyon, writes go through memory-mapped files, so data is first written in memory and only later flushed to disk, and a read immediately following a write is actually served from the operating system's buffer cache, that is, from memory.
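Continuing the sketch above (reusing its Tier class), the stock write path and the optimized write path differ as follows.

```scala
object WritePaths {
  // Stock write: always land in level 0, first cascading LRU victims down
  // until level 0 has room. With two full upper tiers this does two
  // displacements before the write completes, which is where the latency goes.
  def writeStock(tiers: Vector[Tier], id: Long, data: Array[Byte]): Unit = {
    def makeRoom(level: Int): Unit =
      if (level < tiers.size && tiers(level).isFull) {
        makeRoom(level + 1)
        tiers(level).evictLru().foreach { case (vid, vd) =>
          if (level + 1 < tiers.size) tiers(level + 1).blocks.put(vid, vd)
        }
      }
    makeRoom(0)
    tiers(0).blocks.put(id, data)
  }

  // Optimized write: put the data in the first tier that already has free
  // space and return to the user at once; if everything is full, replace a
  // block in the bottom tier.
  def writeOptimized(tiers: Vector[Tier], id: Long, data: Array[Byte]): Unit =
    tiers.find(!_.isFull) match {
      case Some(tier) => tier.blocks.put(id, data)
      case None =>
        tiers.last.evictLru()
        tiers.last.blocks.put(id, data)
    }
}
```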
Hierarchical storage nicely solves our problem of insufficient cache space, and we will continue to optimize it. For example, it currently has only one replacement algorithm, LRU, which does not suit all application scenarios; we plan to design more efficient replacement algorithms for different scenarios to maximize the cache hit rate.
Conclusion
I personally believe that a finer division of labor achieves higher efficiency. Spark, as an in-memory computing platform, can suffer frequent garbage collection, system instability, and performance degradation if too many resources are devoted to caching data. When we first started using Spark, system instability was our biggest challenge, and frequent garbage collection was its biggest cause: when a garbage collection pause ran too long, a Spark worker would become unresponsive and could easily be mistaken for having crashed, causing tasks to be re-executed. Tachyon solves this problem by separating memory storage from Spark and letting Spark focus on the computation itself. As memory becomes cheaper, we can expect the memory available on our servers to keep growing, and Tachyon will play an increasingly important role in big data platforms. Tachyon's development is still in its early days; at the time of writing, Tachyon is only preparing to release version 0.6, and many features remain to be perfected. This is also a great opportunity: interested readers are encouraged to follow Tachyon and join the community for technical discussion and feature development.
Liu Shaoshan
Senior architect at the Silicon Valley research and development center in the U.S., focusing on distributed systems and big data computing and storage platforms.
Tachyon: A Distributed Memory File System in the Spark Ecosystem