Tachyon, a Distributed In-Memory File System

Last Update:2014-11-27 Source: Internet

Author: User


Tachyon, developed at UC Berkeley, is a platform that provides in-memory data management for a variety of cluster computing frameworks. (The name refers to a hypothetical faster-than-light particle, pronounced [ˈtækiɒn]; a rather bold choice of name.) It can also be described as a memory-based file system. In the stack, it sits above existing storage systems such as HDFS and below computing frameworks such as Spark, MapReduce, and Impala.

Why is such a layer needed? Leaving MapReduce aside, why would an in-memory computing framework like Spark need a memory-managed file system underneath it? Because a framework like Spark provides powerful in-memory computation but no storage capability of its own. Then is it not enough for Spark to manage data directly in memory, as it does by default? Let's look at some of the problems with that approach.

Problem 1: Slow data exchange between different tasks or frameworks

Sharing data between different tasks or different computing frameworks is unavoidable, for example between two Spark tasks that belong to different stages, or between Spark and MapReduce. In these cases the data exchange generally has to go through disk, which is usually inefficient.

Once the Tachyon layer is introduced, this data exchange happens in memory.

Problem 2: The execution engine and the storage engine share one process

This is the issue mentioned earlier of letting Spark manage memory on its own. By default, Spark's task execution and the data itself live in the same process. If something goes wrong and that process crashes, all the data held in it is lost.

Introducing the Tachyon layer amounts to pulling the storage engine out of Spark, so that each task process is responsible only for execution. A process crash no longer loses data, because the data lives in Tachyon.
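The separation described above can be illustrated with a toy model. This is only a sketch: `StorageLayer` and `Executor` are hypothetical names invented for illustration, not Tachyon or Spark APIs, and the "crash" is simulated by discarding an object rather than killing a real process.

```python
class StorageLayer:
    """Stand-in for an external in-memory store (hypothetical, not the Tachyon API)."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = bytes(data)

    def read(self, path):
        return self._files[path]


class Executor:
    """Toy task process that only executes; durable data lives elsewhere."""

    def __init__(self, storage):
        self.storage = storage
        self.local_cache = {}  # in-process state, lost on a crash

    def run(self, path, data):
        self.local_cache[path] = data   # copy inside the executor
        self.storage.write(path, data)  # copy held outside the executor


storage = StorageLayer()
executor = Executor(storage)
executor.run("/job/part-0", b"intermediate result")

# Simulate a crash: the executor and its in-process state are gone.
del executor

# A replacement executor starts empty but still finds the data in the store.
replacement = Executor(storage)
recovered = replacement.storage.read("/job/part-0")
print(recovered)
```

The point of the design is that the executor's own state can be thrown away at any time, since nothing that must survive is kept only inside it.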

Problem 3: Data is loaded repeatedly, and GC overhead

Different Spark tasks may access the same data. For example, two tasks may both need some of the blocks in HDFS, such as blocks 1 and 3. Without a shared layer there is no way around it: every task has to load the data from disk into its own memory. Tachyon keeps only one copy of the data, and it also uses off-heap memory to avoid GC overhead.
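The "load once, share between tasks" idea can be sketched as follows. `SharedBlockCache` and `load_from_disk` are hypothetical names used only for illustration; the sketch counts disk reads and does not model off-heap memory, which is a JVM concern.

```python
class SharedBlockCache:
    """Toy shared cache: each block is loaded from disk at most once."""

    def __init__(self, loader):
        self._loader = loader   # function that reads a block from slow storage
        self._blocks = {}       # the single in-memory copy of each block
        self.disk_reads = 0     # how many times slow storage was touched

    def get(self, block_id):
        if block_id not in self._blocks:
            self._blocks[block_id] = self._loader(block_id)
            self.disk_reads += 1
        return self._blocks[block_id]


def load_from_disk(block_id):
    # Placeholder for an expensive read from HDFS or local disk.
    return f"contents of block {block_id}".encode()


cache = SharedBlockCache(load_from_disk)

# Two tasks that both need blocks 1 and 3.
task_a_input = [cache.get(b) for b in (1, 3)]
task_b_input = [cache.get(b) for b in (1, 3)]

print(cache.disk_reads)  # 2: one read per block, not one per task per block
```

Both tasks receive the same in-memory object for each block, so the second task pays no load cost.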

How does Tachyon achieve fault tolerance?

We have seen how Tachyon can further improve Spark's performance: it avoids writing data to disk, shares data between tasks, and uses off-heap memory to reduce GC. But how is Tachyon itself fault tolerant? If the data is not written to a DFS, will it not be lost all the same? After all, Tachyon keeps only one copy of the data in memory. One vivid way to put it: Tachyon moves lineage down from Spark into itself. With lineage in hand, there is a way out. Like Spark, it recovers data using lineage information (lineage-based recovery) together with asynchronously written checkpoints. (As in Spark, this relies on immutable data and coarse-grained operations. One difference may be that Tachyon can manage lineage across frameworks, rather than being limited to RDDs and Spark transformations?) This is why Tachyon can use memory confidently and aggressively.
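Lineage-based recovery can be sketched in a few lines. This is a simplified illustration, not Tachyon's implementation: `LineageStore` and its methods are hypothetical, checkpoints are written synchronously here (Tachyon is described as writing them asynchronously), and source data is assumed to be durable.

```python
class LineageStore:
    """Toy store that can recompute lost data from recorded lineage."""

    def __init__(self):
        self._data = {}        # in-memory copies, may be lost
        self._lineage = {}     # name -> (function, names of inputs)
        self._checkpoint = {}  # stand-in for durable storage

    def put_source(self, name, value):
        self._data[name] = value
        self._checkpoint[name] = value  # assumption: sources are durable

    def derive(self, name, func, inputs):
        # Record how the dataset was produced before materializing it.
        self._lineage[name] = (func, tuple(inputs))
        self._data[name] = func(*[self.get(i) for i in inputs])

    def checkpoint(self, name):
        self._checkpoint[name] = self.get(name)

    def lose(self, name):
        self._data.pop(name, None)  # simulate losing the in-memory copy

    def get(self, name):
        if name in self._data:
            return self._data[name]
        if name in self._checkpoint:
            value = self._checkpoint[name]
        else:
            # Not in memory and not checkpointed: recompute from lineage.
            func, inputs = self._lineage[name]
            value = func(*[self.get(i) for i in inputs])
        self._data[name] = value
        return value


store = LineageStore()
store.put_source("raw", [3, 1, 2])
store.derive("sorted", sorted, ["raw"])
store.derive("doubled", lambda xs: [2 * x for x in xs], ["sorted"])

# Lose both derived datasets, then ask for the last one again.
store.lose("sorted")
store.lose("doubled")
result = store.get("doubled")
print(result)  # [2, 4, 6]
```

Recomputation only works because the inputs are immutable and each derivation is a deterministic, coarse-grained function of named inputs. A checkpoint simply bounds how far back the recomputation has to go.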

Second, Tachyon's master is itself managed through a ZooKeeper cluster. When the master goes down, a new leader is elected automatically, and the workers connect to the new leader automatically.

Tachyon is currently only at version 0.5, and material about it is still fairly scarce. I have not yet found a clear explanation of the graph algorithm behind its asynchronous checkpointing. Still, it looks quite interesting and is worth keeping an eye on.

