UC Berkeley developed Tachyon (named after the faster-than-light particle, pronounced [ˈtækiˌɒn]; quite a flashy name): a memory-centric data management platform shared by various cluster computing frameworks, or, put another way, a memory-based file system. It sits in the layer between existing storage systems such as HDFS and the computing frameworks above them, such as Spark, MapReduce, and Impala.
Why do we need such a framework? Leave MapReduce aside: why would an in-memory computing framework like Spark need yet another layer of memory-managed file system? Because a framework like Spark really only provides powerful in-memory computation; it does not provide storage. Isn't it enough for Spark to manage data in memory by itself, as it does by default? Let's look at the problems with the status quo.
Problem 1: Slow data exchange between different jobs or frameworks
Data sharing between different jobs or different computing frameworks is unavoidable: for example, two Spark tasks belonging to different stages, or Spark exchanging data with the MapReduce framework. In such cases the exchange generally has to go through disk, which is inefficient.
Once the Tachyon layer is introduced, the data exchange actually happens in memory.
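As a rough sketch of what this looks like from the Spark side (the host name, port, and paths below are placeholders of my own, and it assumes the Tachyon client jar is on Spark's classpath; 19998 was Tachyon's default master port): one job publishes its result under a tachyon:// path, and a later job, even from a different application or a different framework that speaks the Hadoop FileSystem API, reads it straight out of memory.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("producer"))

    // Job A writes its result into Tachyon instead of HDFS;
    // "master" and the paths are placeholder names.
    sc.textFile("hdfs://namenode:9000/logs/input")
      .filter(_.contains("ERROR"))
      .saveAsTextFile("tachyon://master:19998/shared/errors")

    // Job B (possibly a different application entirely) reads the same
    // path back, and the bytes come from Tachyon's memory, not disk.
    val errors = sc.textFile("tachyon://master:19998/shared/errors")
    println(errors.count())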
Problem 2: The execution engine and the storage engine live in the same process
This is the problem, mentioned above, with letting Spark manage memory by itself. By default, Spark's task execution and the cached data live in the same executor process. When something goes wrong, the whole process crashes, and all the data held in that process is lost with it.
Introducing the Tachyon layer effectively pulls the storage engine out of Spark, so that each task process is responsible only for execution. A process crash no longer loses data, because the data lives inside Tachyon.
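In the Spark 1.x era this separation was exposed as the OFF_HEAP storage level, which kept cached RDD blocks in Tachyon instead of the executor heap. A minimal sketch, assuming a Spark 1.x build with Tachyon support (the URL is a placeholder, and these property names changed in later releases):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Point Spark's Tachyon-backed block store at the Tachyon master.
    val conf = new SparkConf()
      .setAppName("off-heap-cache")
      .set("spark.tachyonStore.url", "tachyon://master:19998")

    val sc = new SparkContext(conf)

    // OFF_HEAP stores the cached blocks in Tachyon, outside the executor
    // process, so an executor crash does not wipe out the cached data.
    val data = sc.textFile("hdfs://namenode:9000/data/events")
    data.persist(StorageLevel.OFF_HEAP)
    println(data.count())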
Problem 3: Data is loaded repeatedly, and GC takes its toll
Different Spark jobs may access the same data; for example, two jobs may both read some of the same blocks in HDFS, say Block 1 and Block 3. There is no way around it: each job has to go to disk and load the data into its own memory. With Tachyon, not only is a single copy of the data kept and shared, but that copy sits in off-heap memory, avoiding GC overhead.
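A sketch of the single-copy behavior (again with placeholder names, and assuming the HDFS files have been made visible in Tachyon's namespace, e.g. with the loadufs utility that the 0.x line shipped): the first read pulls the blocks from the under file system and caches one copy in Tachyon, and every later read, from this job or any other, is served from that same copy.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("shared-read"))

    // First access: Tachyon fetches Block1 from the underlying HDFS and
    // keeps one copy in its own off-heap memory.
    val block1 = sc.textFile("tachyon://master:19998/data/Block1")
    println(block1.count())

    // Later accesses, even from a different application, hit the same
    // in-memory copy instead of reloading from disk; and since the bytes
    // live outside the JVM heap, they add no GC pressure.
    val again = sc.textFile("tachyon://master:19998/data/Block1")
    println(again.count())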
How does Tachyon achieve fault tolerance?
We have seen how Tachyon further improves Spark's performance: data no longer has to land on disk, it can be shared between jobs, and off-heap memory sidesteps GC. But how does Tachyon itself tolerate faults? Without persisting to a DFS, wouldn't data be lost just the same? After all, Tachyon keeps only one copy of the data, in memory. There is a vivid way of putting it: Tachyon moves lineage down from Spark into itself. Once it holds the lineage, there is a way out. Much like Spark, it uses lineage information (lineage-based recovery) together with asynchronous checkpointing to recover lost data. (As with Spark, this relies on immutability and coarse-grained operations, as with RDDs; the difference is that Tachyon can manage lineage across frameworks, and is not limited to RDDs and Spark transformations.) That is why Tachyon can confidently and aggressively use memory.
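To make the idea concrete, here is a toy model of lineage-based recovery (entirely my own sketch, not Tachyon's actual API): each stored file remembers the coarse-grained computation and inputs that produced it, so a lost in-memory copy can simply be recomputed on demand instead of having to be replicated up front.

    // Toy model of lineage-based recovery; not Tachyon's real interfaces.
    case class Lineage(inputs: Seq[String],
                       recompute: Seq[String] => Array[Byte])

    class LineageStore {
      private val data     = scala.collection.mutable.Map.empty[String, Array[Byte]]
      private val lineages = scala.collection.mutable.Map.empty[String, Lineage]

      def write(path: String, bytes: Array[Byte], lineage: Lineage): Unit = {
        data(path)     = bytes   // the single in-memory copy
        lineages(path) = lineage // kept so the copy can be rebuilt
      }

      // On a miss (say, after a worker crash dropped the copy), re-run the
      // recorded coarse-grained computation instead of reading a replica.
      def read(path: String): Array[Byte] = data.getOrElseUpdate(path, {
        val l = lineages(path)
        l.recompute(l.inputs)
      })
    }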
Second, Tachyon's own master is managed by a ZooKeeper cluster: when the master goes down, a new leader is automatically elected, and the workers automatically connect to the new leader.
Tachyon is currently only at version 0.5, and documentation is still scarce; I have not yet found material explaining the graph algorithm behind its asynchronous checkpointing either. But it looks quite interesting, and it is worth keeping an eye on.
References
1. Tachyon: A Reliable Memory-Centric Storage for Big Data Analytics
2. Tachyon: Reliable File Sharing at Memory Speed Across Cluster Frameworks
3. Distributed memory file system Tachyon