Shifei: Hello, my name is Shi Fei, and I'm from Intel. Today I'd like to introduce Tachyon to you. Before I start, I'd like to know: have you heard of Tachyon, or do you have some understanding of it already? What about Spark?
First of all, I'm from Intel's big data team. Our team focuses on software development for big data and on promoting and applying that software in industry, and my group is primarily responsible for the development and promotion of Spark and its software stack. We were the first team in China to participate in Spark development and promotion, joining the Spark community in 2012. We have invested a significant amount of manpower in Spark and related projects; over the long term we have had more than 10 active contributors to Spark and related projects, and our team's code contribution ranks in the top 3 of the Spark project.
Here is an overview of what I am going to talk about today. First I will give you an introduction to Tachyon, including why Tachyon exists, its basic architecture, how it integrates with existing systems, and some of its basic working principles. Next I will introduce our experience using Tachyon and some application examples, and finally I will cover the current development of Tachyon and Intel's work on it.
What is the background behind Tachyon's appearance? "Memory is king" has been a popular saying for the past two years: the pursuit of speed in big data processing is endless, and memory speed and disk speed are not on the same order of magnitude. On the other hand, memory prices keep falling while memory capacity keeps growing, which makes keeping data in memory feasible. Following this trend, a number of memory-based computing frameworks have appeared; Spark and SAP HANA, for example, are both excellent memory-based computing frameworks. But existing frameworks still face some challenges, and Tachyon emerged to solve them. So what problems do current memory-based big data computing frameworks run into? Let me take Spark as an example and analyze them for you.
The first problem is data sharing. A cluster may run multiple computing frameworks and multiple applications; for example, Spark and Hadoop may run on the same cluster, and today data sharing between the two goes through HDFS. In other words, if the output of a Spark application is the input of another MapReduce job, the intermediate result must be written to and read from HDFS. HDFS reads and writes involve disk I/O in the first place, and on top of that its replication strategy keeps three copies by default, which introduces network I/O as well, so this is a very inefficient process. The second problem is loss of cached data. In a framework like Spark, the memory management module and the compute executor live inside the same JVM; if an executor hits an exception that causes the JVM to exit, the data cached in the JVM heap is lost along with it. The third problem is GC overhead. Most big data computing frameworks today run on the JVM, so the cost of GC is unavoidable, and for a memory-based framework like Spark the GC problem is especially prominent: it caches a large amount of data in the JVM heap, that data is needed by the computation so the GC cannot reclaim it, yet every full GC does a global scan over it. This is time-consuming, and as running time goes on and the data in the heap grows, the cost of GC becomes larger and larger.
What is the solution? Let us first analyze the root cause of these problems: existing memory-based computing frameworks lack a memory management module that is decoupled from the JVM. The solution is Tachyon, a memory-based distributed storage system that grew out of the Spark ecosystem. Tachyon's design rests on two main ideas. The first is memory-based, off-heap distributed storage: data is kept outside the JVM heap, so GC over it is avoided. The second is to implement fault tolerance through lineage at the storage layer, an idea introduced in Spark: lineage records which source data a piece of data came from and what computation produced it, and Tachyon moves this lineage information from the compute layer down to the storage layer. Tachyon keeps only one copy of the data in memory, because memory is a precious resource, whereas HDFS keeps three copies on disk by default for fault tolerance; so if a Tachyon node does not hold a given piece of data, it reads it over the network. Since the data also sits in memory on the remote node, the read involves no disk I/O at either end, only network overhead, so it is still very efficient. When data is lost, Tachyon recovers it based on its lineage, a process somewhat like recomputation in Spark, but it goes further than Spark. Spark recomputes data when it finds a node has failed while the program is still running; the problem is that once the whole job has finished, there is no way to recover data lost afterwards. Tachyon can solve this, because it stores the entire data dependency graph at the storage layer, including which framework generated each piece of data, and when data is lost Tachyon restarts the corresponding applications to regenerate it.
This is the goal of the Tachyon design: Tachyon's place in the whole big data processing software stack. The lowest layer is the storage layer, such as HDFS and S3; on top run Spark, H2O, and so on. Tachyon acts as a cache layer between the storage layer and the compute layer; it does not replace any storage system, and its role is to speed up the compute layer's access to the storage layer. This is Tachyon's basic architecture, and you can see that it looks a lot like HDFS: there is a master and there are workers. The master manages the metadata of all the data in the entire cluster, including the size and location of each piece of data. A worker manages the in-memory data on its node; all in-memory data is stored on a RAMDisk, which maps a region of memory into a block device, so Tachyon can read and write files at memory speed. Workers communicate with the master periodically, reporting the data they hold, and the master sends commands back to the workers based on the information they report. On the far left of the diagram is ZooKeeper, which elects one available master as the leader. There is also one module not in this picture: the client. It is the programming interface provided to applications, which read and write data in Tachyon through the client.
There are two kinds of fault tolerance in Tachyon: fault tolerance of metadata, that is, of the data on the master node, and fault tolerance of in-memory data, on the workers. Metadata fault tolerance is very similar to HDFS and is implemented through logging: an image stores the metadata, and an edit log records recent changes to it. In-memory data fault tolerance, on the other hand, is unique to Tachyon. For example: fileset A generates fileset B through a Spark job, fileset C generates fileset D through another Spark job, and fileset B and fileset D together generate fileset E through a MapReduce job. This data generation process is saved in Tachyon. If fileset E is lost while filesets B and D are still present, Tachyon will restart the MapReduce job and regenerate fileset E from B and D; if neither fileset B nor fileset D exists, Tachyon will rerun the Spark jobs to regenerate B and D first, and finally the MapReduce job regenerates fileset E from them.
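To make the recovery process just described concrete, here is a minimal sketch in Python. The lineage table and the names (`LINEAGE`, `recover`, the job names) are illustrative stand-ins, not Tachyon's actual code or API; the point is only the recursive structure of recovery.

```python
# Each lineage entry records which job produced a fileset from which inputs.
LINEAGE = {
    "B": (["A"], "spark_job_1"),
    "D": (["C"], "spark_job_2"),
    "E": (["B", "D"], "mapreduce_job"),
}

def recover(fileset, available, log):
    """Recursively regenerate a lost fileset from its lineage."""
    if fileset in available:
        return
    inputs, job = LINEAGE[fileset]
    for dep in inputs:          # first make sure all inputs exist
        recover(dep, available, log)
    log.append(job)             # re-run the job that produced this fileset
    available.add(fileset)

# If E is lost but B and D survive, only the MapReduce job reruns:
log = []
recover("E", {"A", "C", "B", "D"}, log)

# If B and D are also lost, the two Spark jobs rerun first:
log2 = []
recover("E", {"A", "C"}, log2)
```

Note how recovery restarts only the jobs that are actually needed, walking the dependency graph from the lost fileset back toward what is still available.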
Now look back at the three problems I described with existing memory-based computing frameworks: how are they resolved with Tachyon? For data sharing, Spark and Hadoop can both store intermediate results in Tachyon, and if MapReduce needs the output of Spark, it can read it directly through Tachyon without accessing HDFS. For cache data loss, Spark can cache its RDDs in Tachyon, so the cached data is not lost even when the Spark application crashes. The third is GC overhead, and the effect is obvious: the GC no longer manages the data that lives in Tachyon.
Next let me introduce how Tachyon integrates with existing big data processing frameworks. First, MapReduce: MapReduce has no built-in integration with Tachyon, so to use Tachyon from MapReduce you need to bring Tachyon in as an external package or library. There are three ways: put the Tachyon jar on Hadoop's classpath, put it in Hadoop's lib directory, or distribute it as part of the application. You also need to configure the Tachyon file system in Hadoop, after which MapReduce can load and write data directly through Tachyon, using it the same way it uses HDFS. Spark has already integrated Tachyon: to use Tachyon inside Spark you only need some simple configuration, setting the URI of the Tachyon master in SparkConf. Spark can then cache RDD data in Tachyon; by setting an RDD's StorageLevel to OFF_HEAP, Spark automatically places that RDD in Tachyon. If Spark is to load and write data through Tachyon, it needs the Tachyon file system configured just as in MapReduce, and then Spark can read and write data in Tachyon just like reading and writing HDFS.
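For reference, the configuration just described looked roughly like the following in the Tachyon 0.5 era. I am writing the property names and the default master port from memory, so treat them as illustrative rather than authoritative, and check them against the version you deploy.

```xml
<!-- Hadoop core-site.xml: register the tachyon:// file system scheme -->
<property>
  <name>fs.tachyon.impl</name>
  <value>tachyon.hadoop.TFS</value>
</property>
```

```
# Spark configuration: point the off-heap (Tachyon) store at the master
spark.tachyonStore.url  tachyon://<master-host>:19998
```

With these in place, both frameworks can address files with `tachyon://` URIs the same way they address `hdfs://` paths.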
Let me tell you about Tachyon's basic working principles, starting with its communication mechanism. Tachyon uses Thrift for communication, and the interfaces between master, client, and worker can be generated automatically from the interface definitions. There is also heartbeat communication, to keep the connections between Tachyon's components alive. The master and workers exchange information through heartbeats: a worker reports recently added or changed data on its node to the master, the master updates its metadata based on the file information the worker provides, and the master also returns information to the worker. If the master has no metadata for a file the worker reports, the master tells the worker to delete that file. If a worker has not communicated with the master for some period, the master assumes the worker has disconnected; the next time that worker communicates, the master tells it to re-register, that is, to resend all the file information on its node. The worker also has a self-check: if it detects that communication with the master has timed out, it re-registers with the master on its own. There are also connections between the client and the master and workers. The heartbeat that the client sends to the master is currently not acted upon; the heartbeat between client and worker maintains their relationship, and if a worker sees that a client's connection has timed out, it frees the resources allocated to that client.
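Here is a small sketch of the master-side heartbeat bookkeeping I just described: a worker that stays silent past the timeout is marked lost, and on its next contact it is told to re-register. The class, method names, and timeout value are all illustrative, not Tachyon's implementation.

```python
TIMEOUT = 10  # seconds; the real value would be configurable

class Master:
    def __init__(self):
        self.last_seen = {}     # worker_id -> time of last heartbeat
        self.lost = set()

    def heartbeat(self, worker_id, now):
        """Return the command the master sends back to the worker."""
        if worker_id in self.lost or worker_id not in self.last_seen:
            self.lost.discard(worker_id)
            self.last_seen[worker_id] = now
            return "REGISTER"   # resend all file info on the node
        self.last_seen[worker_id] = now
        return "NOP"

    def check_timeouts(self, now):
        """Mark workers that have been silent for too long as lost."""
        for w, t in self.last_seen.items():
            if now - t > TIMEOUT:
                self.lost.add(w)

m = Master()
cmd1 = m.heartbeat("worker-A", now=0)    # first contact -> register
cmd2 = m.heartbeat("worker-A", now=5)    # normal heartbeat
m.check_timeouts(now=20)                 # worker-A has gone silent
cmd3 = m.heartbeat("worker-A", now=21)   # told to re-register
```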
Next, how Tachyon organizes files, starting with the worker. A worker deals with two file systems: one is the RAMDisk, the in-memory file system, and the other is the underlying file system, most commonly HDFS. In the in-memory file system, files are stored as blocks, while on the underlying file system a file is stored whole. On the in-memory file system the file name is the block ID; on the underlying file system the file name is the file ID. As for Tachyon's metadata organization, it is a tree structure in which each node is an inode recording a file's information; all files descend from the root node and can be found step by step by following the path name. If an inode represents a directory, it records all the subdirectories and files under that directory; if it represents a file, it records all the blocks of the file, whether or not the file has a backup on the underlying file system, and the path of the backed-up file.
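A minimal sketch of the inode tree and the step-by-step path lookup just described. The class and field names here are illustrative, not Tachyon's actual internals: directories record their children by name, and files record their block IDs and an optional backup path in the underlying file system.

```python
class Inode:
    def __init__(self, name, is_dir, blocks=None, ufs_path=None):
        self.name = name
        self.is_dir = is_dir
        self.children = {} if is_dir else None  # name -> Inode
        self.blocks = blocks or []              # block IDs, for files
        self.ufs_path = ufs_path                # backup path, if any

def lookup(root, path):
    """Walk the tree from the root, one path component at a time."""
    node = root
    for part in path.strip("/").split("/"):
        if part:
            node = node.children[part]
    return node

# Build a tiny tree: /data/log1 with two blocks and an HDFS backup
root = Inode("/", is_dir=True)
data = Inode("data", is_dir=True)
root.children["data"] = data
data.children["log1"] = Inode("log1", is_dir=False,
                              blocks=[101, 102],
                              ufs_path="hdfs://nn/tachyon/data/log1")

f = lookup(root, "/data/log1")
```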
When an application reads data through the Tachyon client, the client sends a request to the master and fetches the block's information from it, including the block's ID and location. On receiving this, the client first asks the worker to lock the block, indicating that the block is being accessed; it reads the file after acquiring the lock, then asks the worker to unlock the file, and finally asks the worker to update the block's access time. The access time matters because when data is written and space runs low, the worker performs LRU-based file eviction based on those access times. If the file is not on the local worker, the client goes to a remote worker to read it, and the remote worker sends the data back to the client over the network after receiving the request.
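The read path above can be condensed into a short sketch: ask the master for the block's location, lock the block on the worker, read, unlock, and update the access time. All classes and names here are stand-ins I made up for illustration, not Tachyon's client API.

```python
class MasterStub:
    def __init__(self):
        self.blocks = {"f1": ("blk-1", "worker-A")}   # file -> (id, location)
    def get_block_info(self, path):
        return self.blocks[path]

class WorkerStub:
    def __init__(self):
        self.locks = set()
        self.access_time = {}
        self.store = {"blk-1": b"data"}
    def lock(self, blk): self.locks.add(blk)
    def unlock(self, blk): self.locks.discard(blk)
    def touch(self, blk, t): self.access_time[blk] = t  # feeds LRU eviction
    def read(self, blk): return self.store[blk]

def client_read(master, worker, path, now):
    blk, _loc = master.get_block_info(path)
    worker.lock(blk)              # mark the block as in use
    try:
        data = worker.read(blk)
    finally:
        worker.unlock(blk)
        worker.touch(blk, now)    # access time drives LRU decisions later
    return data

m, w = MasterStub(), WorkerStub()
data = client_read(m, w, "f1", now=1)
```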
Tachyon has two ways of reading data. The first is the cache mode: if the data is local, it is read directly; if not, it is read from a remote worker, and after the read a cached copy is created locally. The point of this policy is that the user believes the data will be used again, so rather than repeatedly reading it from the remote node, a local copy is created directly to save that overhead. What if the data is read from the underlying file system? The cache policy likewise creates a copy in local memory. The no-cache policy, in contrast, is for a single read, where the user believes the file will not be accessed again.
When writing a file, the client applies to the worker for memory space. The worker first checks whether its own memory is sufficient; if not, it frees space according to a specific algorithm, currently LRU, directly deleting block files that have not been accessed recently. Once memory is allocated, the worker tells the client the allocation succeeded, and the client writes its data into the local RAMDisk. After the client finishes writing, it notifies the worker to cache the file, which is the process of moving the data from the user directory to the data directory; after caching is done, the worker reports the new block file to the master. There are also several write policies. The first is MUST_CACHE: the client requires the file to be written into memory, and if memory cannot hold it, Tachyon raises an error. TRY_CACHE means write the data into memory if at all possible. THROUGH writes the file directly into the underlying file system without writing memory at all. CACHE_THROUGH keeps both copies, and ASYNC_THROUGH writes the file into local memory, returns immediately, and lets Tachyon write the file from memory back to disk asynchronously. The first two policies are optimized for reading: if the file is just a temporary file that needs no permanent storage and may be read repeatedly after being written, it is placed in memory, and because it is temporary it does not need to be persisted in the underlying file system. THROUGH, by contrast, writes the file only into the underlying file system: if the file is an application's output and will not be accessed again soon after being written, it is written directly into the underlying file system space managed by Tachyon; if the result is needed in the future, the file can be brought into Tachyon memory for fast access as the user requires.
CACHE_THROUGH combines the two behaviors above, while ASYNC_THROUGH does not guarantee that the data is stored to the underlying file system immediately; it exists to improve response time and reduce latency, but its final effect is the same as CACHE_THROUGH.
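Here is a minimal sketch of the LRU-based eviction described above, assuming a simplified worker that only tracks block sizes and access order. The names (`Worker`, `allocate`, `touch`) are illustrative, not Tachyon's API.

```python
from collections import OrderedDict

class Worker:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block_id -> size, oldest first

    def touch(self, block_id):
        """Record an access: move the block to the most-recent end."""
        self.blocks.move_to_end(block_id)

    def allocate(self, block_id, size):
        """Evict least-recently-used blocks until the new block fits."""
        evicted = []
        used = sum(self.blocks.values())
        while used + size > self.capacity and self.blocks:
            old_id, old_size = self.blocks.popitem(last=False)
            evicted.append(old_id)    # delete the coldest block first
            used -= old_size
        self.blocks[block_id] = size
        return evicted

w = Worker(capacity=100)
w.allocate("b1", 40)
w.allocate("b2", 40)
w.touch("b1")                    # b1 is now more recently used than b2
evicted = w.allocate("b3", 40)   # needs room: b2 is evicted, not b1
```

This is also why the read path updates access times: without `touch`, the worker could not tell hot blocks from cold ones when choosing what to evict.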
On to Tachyon's user interfaces. Tachyon provides two. The first is the command line, similar to the command-line interface of HDFS, which provides basic file system commands such as cat, ls, mkdir, rm, and so on, making it easy for users to perform basic operations on the files in Tachyon memory. The second is the programming interface. Tachyon has two main interfaces for user programs: TachyonFS is the most basic programming interface, covering all the functionality Tachyon provides to user programs, including delete, mkdir, rename and so on, from whose basic functions you can build up file system operations; TachyonFile provides some higher-level interfaces, for example obtaining an InStream or OutStream for a file.
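From memory, the 0.5-era command line was wrapped by the `tachyon` script and looked something like the session below; the exact invocation and the paths shown are illustrative and may differ slightly between versions.

```
$ bin/tachyon tfs mkdir /logs          # create a directory in Tachyon
$ bin/tachyon tfs ls /                 # list files, like hadoop fs -ls
$ bin/tachyon tfs cat /logs/part-0     # print a file's contents
$ bin/tachyon tfs rm /logs/part-0      # remove a file
```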
Our first experience with Tachyon is a prototype system our team developed for log data processing. The project can be found on GitHub under the name Thunderain. Streaming data is first placed into Kafka, which is a message queue; the data in Kafka is processed by Spark Streaming and then written into in-memory tables in Tachyon. Because this data is stored in in-memory tables, access to it is very fast, so it can support online analysis and interactive queries running in the background, which are sensitive to response latency. On the lower path of the processing flow, the data in Kafka can also be written to HDFS through ETL processing and kept as historical data, and the historical data can be queried together with the in-memory data in Tachyon. Take a video website as an example: users' video-click logs can flow through the pipeline above, and the application staff in the background can very quickly query which videos were played the most times and which were the most popular in the recent period. The role of the historical data is that the online data can be compared against it, for instance which videos were hottest over the last day versus the last hour.
Another Tachyon application example is off-heap storage. We did a case for a domestic video website whose purpose was video content recommendation, which is an N-degree cascade problem, a graph algorithm. The N-degree cascade computes the correlation between two nodes in a graph that are N hops apart: being N hops away means having a correlation at distance N. The algorithm is roughly this: suppose there are M paths of length N between nodes X and Y, and weight_k(X, Y) is the weight of the k-th path; the N-degree cascade sums the weights of all M paths, where the weight of each path equals the product of the weights of all the edges on that path. Since every edge weight is a number between 0 and 1, each path weight is also a floating-point number between 0 and 1, and the correlation decreases as N grows. In practice this corresponds to computing relevance between different users in a social network, product recommendation on e-commerce sites, and so on. We implemented the solution with two graph frameworks, Bagel and GraphX. First, the Bagel implementation: when Spark runs Bagel, the new data of each node generated in each superstep, and the messages the node sends to the nodes of the next superstep, are all placed in Spark RDDs. We found that as the number of iterations increased, the GC overhead became quite large, because the data volume was large and each iteration cached more data. Our solution was to use Tachyon to cache this data, effectively letting Bagel run on top of Tachyon, which resolved the bulk of the GC problem.
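The cascade computation itself can be sketched in a few lines: the relevance of X and Y at distance N is the sum, over all N-hop paths from X to Y, of the product of the edge weights along each path. The graph here and the function name are illustrative, not the production implementation.

```python
# adjacency: node -> {neighbor: edge weight in (0, 1)}
graph = {
    "X": {"A": 0.5, "B": 0.4},
    "A": {"Y": 0.8},
    "B": {"Y": 0.5},
    "Y": {},
}

def cascade(graph, src, dst, n):
    """Sum the weight products of all paths of exactly n hops."""
    if n == 0:
        return 1.0 if src == dst else 0.0
    return sum(w * cascade(graph, nxt, dst, n - 1)
               for nxt, w in graph[src].items())

# Two 2-hop paths, X->A->Y and X->B->Y: 0.5*0.8 + 0.4*0.5 = 0.6
relevance = cascade(graph, "X", "Y", 2)
```

Because every edge weight is below 1, each extra hop multiplies the path weight down, which is exactly why the correlation shrinks as N grows.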
The second implementation is GraphX. Its computation proceeds like this: first, in the collection phase, it gathers the data of every node and edge and generates the messages to be sent to each node; then, in the computation phase, each node generates new node data based on the messages it received, which effectively expands a new graph out of the old one. At this stage Spark caches this part of the data, and we likewise cached it on Tachyon to mitigate the GC problem.
The third user case is remote data access. The user has multiple clusters, one of which is dedicated to storage services, plus several compute clusters on which Spark or MapReduce may run. Their requirement was this: on a compute cluster, applications need to access the data on the storage-service cluster many times, which is a huge overhead, since each access reads data across two clusters. The solution we offered was to use Tachyon: the application reads the data remotely from the storage cluster only once, caches it in the local Tachyon, and then repeatedly accesses and computes on the data in Tachyon, which saves a great deal of data-reading overhead.
As for the scenarios where Tachyon is applicable: the first is when intermediate results need to be shared between different applications and computing frameworks, meaning the intermediate results may be used by different background applications. The second is the need for quick response where latency matters, for example when users in the background are doing online or interactive queries; using Tachyon can noticeably improve response and reduce latency. The third is a large volume of in-memory data combined with long-running, iterative computation; in a user case we did earlier, using Tachyon improved performance by more than 30%. The fourth scenario is the need to access a large amount of remote data; Tachyon's role is to place the remote data locally for repeated access, reducing the cost of remote reads. But Tachyon also has limitations. The first is increased CPU load: data in Tachyon is stored as files, so there is serialization and deserialization overhead, which is very CPU-intensive work. The second is that Tachyon can for now only use memory as its storage space; this limitation will disappear in the next version, because we are extending Tachyon's storage space with other fast storage such as SSDs.
On Tachyon's current development status: Tachyon is a very young project, started in the summer of 2012 with five founders. The leftmost is Haoyuan Li, the author of Tachyon, who is Chinese and is now a PhD student at UC Berkeley. Tachyon's current release is 0.5.0 and the main branch is at 0.6-SNAPSHOT. Few companies in China use Tachyon yet, while abroad more than 50 companies are already trying it out. Existing Spark and MapReduce programs can run directly on Tachyon without modification, and Spark will take Tachyon as its default off-heap storage system. These are some of the organizations that have contributed to Tachyon; in China, two universities are doing Tachyon work, Nanjing University and Tsinghua University.
Finally, let me speak about Intel's contribution. We have three contributors and more than 100 commits in total. Among these commits, we built important functional components, did work improving stability and usability, and fixed a number of bugs. Let me introduce one functional component: multi-tier local storage. It solves the problem I just mentioned, that Tachyon can currently only use memory as its storage space. We use SSDs and HDDs to extend Tachyon's storage space; the whole storage structure is a pyramid, where the top tier is small but fast and the lower tiers are slower but very large. We designed policies that keep the hottest data on the top tier for the fastest access, while less hot data is placed on SSD or HDD and can be moved back into memory when it becomes hot again in the future. The benefit is that when memory runs out, data can be placed on SSD or HDD instead of being lost, with little impact on performance.
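The pyramid policy just described can be sketched as follows, assuming simplified tiers with capacities counted in blocks: when a tier is full, its least-recently-used block is demoted to the next tier down, and a block accessed on a lower tier is promoted back to the top. The class and names are illustrative, not Tachyon's implementation.

```python
class TieredStore:
    def __init__(self, capacities):     # e.g. {"MEM": 2, "SSD": 4}
        self.tiers = list(capacities)   # ordered fast -> slow
        self.caps = dict(capacities)
        self.data = {t: [] for t in self.tiers}   # LRU order: oldest first

    def _insert(self, tier_idx, block):
        tier = self.tiers[tier_idx]
        while len(self.data[tier]) >= self.caps[tier]:
            victim = self.data[tier].pop(0)       # demote the coldest block
            if tier_idx + 1 < len(self.tiers):
                self._insert(tier_idx + 1, victim)
            # on the last tier the victim would be evicted entirely
        self.data[tier].append(block)

    def write(self, block):
        self._insert(0, block)          # new data lands on the top tier

    def read(self, block):
        for tier in self.tiers:
            if block in self.data[tier]:
                self.data[tier].remove(block)
                self._insert(0, block)  # promote hot data back to memory
                return tier             # where the block was found
        raise KeyError(block)

store = TieredStore({"MEM": 2, "SSD": 2})
for b in ["b1", "b2", "b3"]:
    store.write(b)                      # b1 overflows from MEM to SSD
found_on = store.read("b1")             # found on SSD, promoted back to MEM
```

The key property is that overflowing memory demotes data down the pyramid instead of discarding it, which is exactly the benefit mentioned above.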
Finally, I hope you will all try Tachyon, because as far as I know very few companies in China are using it yet. I also hope you can join the Tachyon community; the project is released on GitHub, and if you find any problems while using it, please give us your suggestions.
"Spark/tachyon: Memory-based distributed storage System"-Shifei (engineer, Big Data Software Division, Intel Asia Pacific Research and Development Co., Ltd.)