Tachyon architecture analysis and discussion of existing issues

Last Update:2018-07-25 Source: Internet

Author: Liu Xuhui (Raymond). Please credit the source when reprinting.

Email: colorant at 163.com

Blog: http://blog.csdn.net/colorant/

Tachyon is a memory-based distributed file system developed by Li Haoyuan of the AMPLab. It was originally conceived as a component of BDAS, the AMPLab's data analytics stack.

Overall design ideas

Tachyon's design goal is to provide a memory-based distributed file-sharing framework that is fault tolerant while still delivering the performance advantages of memory.

Tachyon organizes the cluster in the common master/worker fashion: the master node manages and maintains the file system metadata, while the file data is kept in the memory of the worker nodes.

In terms of fault tolerance, the main technical points include:

The underlying layer supports pluggable file systems such as HDFS, used to persist user-specified files

A journal mechanism persists the file system metadata

ZooKeeper is used to build master HA

In-memory data is not replicated; instead, a lineage idea similar to Spark RDD is used for disaster recovery (discussed later)

In addition, an HDFS-compatible API is provided for compatibility with Hadoop applications.

Concrete Implementation Analysis

Initialization process

Initializing the Tachyon file system essentially means creating and emptying the working directories that the master and the workers need.

For the master node, these are the data, worker and journal directories on the underlying persistent file system. The worker directory is actually used by the worker nodes (to hold some temporary persisted files, data blocks whose metadata has been lost, and so on), but it is created on the master side, essentially to simplify the creation logic (because it lives on HDFS, it only needs to be created once).

The directory required by a worker node is its local RAMDisk directory.

In addition, an empty file with a specific prefix is created in the master's journal folder to mark that formatting of the file system is complete.
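The formatting steps described above can be sketched roughly as follows. This is a minimal Python illustration, not Tachyon's actual (Java) code: a local directory stands in for the underlying persistent file system, and the prefix `_format_` and the function names are hypothetical.

```python
import os
import shutil
import time

# Hypothetical marker prefix; the real value used by Tachyon is not given in this article.
FORMAT_FLAG_PREFIX = "_format_"


def format_master(under_fs_root: str) -> str:
    """Create and empty the master-side working directories, then write a marker file."""
    for name in ("data", "worker", "journal"):
        path = os.path.join(under_fs_root, name)
        shutil.rmtree(path, ignore_errors=True)  # drop any previous contents
        os.makedirs(path)
    marker = os.path.join(
        under_fs_root, "journal", f"{FORMAT_FLAG_PREFIX}{int(time.time() * 1000)}"
    )
    open(marker, "w").close()  # empty file: only its existence matters
    return marker


def is_formatted(under_fs_root: str) -> bool:
    """Treat the file system as formatted if a marker file exists in the journal folder."""
    journal = os.path.join(under_fs_root, "journal")
    return os.path.isdir(journal) and any(
        name.startswith(FORMAT_FLAG_PREFIX) for name in os.listdir(journal)
    )
```

A check like `is_formatted` is the kind of test a master could run at startup before it tries to load the journal.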

Tachyon Master startup process

The Tachyon Master startup process begins, naturally, by reading the master-related configuration parameters. These are currently passed to Java through -D options; ideally they would come from a configuration file. At present some of these parameters are set in the env file and then passed on through -D options, some are hard-coded directly in the -D options, and some are not configured in the startup script at all, in which case the default values in the MasterConf code are used.

The master then determines whether the file system has been formatted by looking for the specific format marker file.

The next step is to rebuild the file system information in memory.

Tachyon's file system information relies on the journal, which consists of two parts: a snapshot image of the metadata at a point in time, and an incremental edit log. On startup, Tachyon Master first reads the file system metadata from the snapshot image file, including information about the various kinds of data nodes (file, directory, raw table, checkpoint, dependency and so on), and then reads the incremental operation records from the edit logs that follow it (there may be several). The contents of the edit log essentially correspond to the operations of the Tachyon file system client, such as adding, deleting and renaming files and adding data blocks.

Note that these log records contain only metadata, not the actual file contents. If the cached contents of a file are lost, and the file has neither been persisted nor bound to lineage information, then the contents of that file are lost for good.

After the file system information has been restored, and before it officially starts serving, Tachyon Master writes the current metadata to a new snapshot image.
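The recovery sequence above (load the image, replay the edit logs, write a fresh image) can be illustrated with a small sketch. The JSON record format, the operation names and the `path -> block ids` metadata model are all assumptions made for this example; Tachyon's real journal format is different.

```python
import json


def apply_op(meta: dict, op: dict) -> None:
    """Apply one edit-log record to the in-memory metadata (path -> list of block ids)."""
    kind = op["op"]
    if kind == "create":
        meta[op["path"]] = []
    elif kind == "delete":
        meta.pop(op["path"], None)
    elif kind == "rename":
        meta[op["dst"]] = meta.pop(op["src"])
    elif kind == "add_block":
        meta[op["path"]].append(op["block_id"])
    else:
        raise ValueError(f"unknown op: {kind}")


def recover(image_json: str, edit_logs: list) -> dict:
    """Load the snapshot image, then replay every edit log in order."""
    meta = json.loads(image_json)
    for log in edit_logs:  # there may be several edit logs after one image
        for line in log.splitlines():
            if line.strip():
                apply_op(meta, json.loads(line))
    return meta


def checkpoint(meta: dict) -> str:
    """Serialize the recovered metadata as a new snapshot image."""
    return json.dumps(meta, sort_keys=True)
```

Because the records carry only metadata, replaying them restores the block ids of a file but never the bytes of the blocks themselves.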

With ZooKeeper enabled, the standby master periodically merges the edit logs and creates its own image; if there is no standby master, the edit logs are merged into a new image only during startup. Here several masters operate on the image and the edit logs concurrently without any lock or mutex mechanism, and it is unclear whether this can lead to race conditions, stale data or missing data.

Storage of files

Tachyon files stored on the RAMDisk are divided into blocks (1 GB by default). The master assigns a block ID to each block, and the worker stores the block's data directly on the RAMDisk, using the block ID as the actual file name.
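As a small illustration of this layout, the sketch below maps a file offset to a block index and builds the on-disk name of a block. The helper names are hypothetical, and treating the default "1G" as 2^30 bytes is an assumption.

```python
import os

BLOCK_SIZE = 1 << 30  # assumed meaning of the 1 GB default block size


def block_index(offset: int, block_size: int = BLOCK_SIZE) -> int:
    """Index (within the file) of the block that contains the given byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return offset // block_size


def num_blocks(file_length: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold a file of the given length."""
    return (file_length + block_size - 1) // block_size


def block_path(ramdisk_dir: str, block_id: int) -> str:
    """A worker stores each block as a file named after its block id."""
    return os.path.join(ramdisk_dir, str(block_id))
```

Note that the block index is only a position inside one file; the globally unique block ID is assigned by the master.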

Reading and writing data

Wherever possible, Tachyon reads and writes files by mapping them directly into memory through the Java NIO API and treating them as data streams. This avoids holding large amounts of data in the Java heap, which reduces GC overhead and improves response time.
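The idea of reading through a memory mapping rather than copying a whole file into heap buffers can be shown with Python's standard `mmap` module, used here as an analogue of the Java NIO mapping that Tachyon relies on. The function name is hypothetical and the sketch only covers reads.

```python
import mmap


def read_mapped(path: str, offset: int, length: int) -> bytes:
    """Read a byte range from a file through a read-only memory mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Only the requested slice is copied out of the mapping.
            return mapped[offset:offset + length]
```

Only the slice that is actually requested is materialized as a Python object; the rest of the file stays in the operating system's page cache.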

During reads and writes, everything that involves metadata has to go through the Tachyon Master, by calling the service API that the master exposes via Thrift.

Tachyon file reads support both local and remote modes, which are transparent to the user from the point of view of the client API. To read a file, the client first obtains the block ID that corresponds to the requested file offset.

It then connects to the local worker to obtain the file name that corresponds to that block ID. If the file exists, the client code tells the worker to lock the block and then maps the file itself, reading it directly as a RandomAccessFile; the actual data does not pass through the worker as a proxy.

If there is no local worker, or the file does not exist on the local worker, the client code asks the master API which worker holds the block and then reads the block's contents through the DataServer interface exposed by that worker. Inside the DataServer the same steps are followed: lock the block, map the file, read it, and return the data to the client.

In addition, when data is read through the TachyonFile API using the file stream interface and the remote worker does not have the block, RemoteBlockInStream also tries to read the data from the underlying persistent file system (if the file exists there), whereas the readByteBuffer interface has no such fallback (personally, I feel the two paths should behave consistently).
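The lookup order described above (local worker, then a remote worker located through the master, then the underlying file system) can be summarized in a short sketch. Workers and the under file system are modeled as plain dictionaries, and all names are hypothetical; the fallback to the underlying file system mirrors the stream interface, not readByteBuffer.

```python
from typing import Dict, Optional, Tuple


def read_block(block_id: int,
               local_worker: Optional[Dict[int, bytes]],
               remote_workers: Dict[str, Dict[int, bytes]],
               block_locations: Dict[int, str],
               under_fs: Dict[int, bytes]) -> Tuple[str, bytes]:
    """Return (source, data) for a block, trying the cheapest source first."""
    # 1. Local worker: the client reads the block file directly.
    if local_worker is not None and block_id in local_worker:
        return ("local", local_worker[block_id])
    # 2. Remote worker: ask the master where the block lives, then fetch it.
    host = block_locations.get(block_id)
    if host is not None and block_id in remote_workers.get(host, {}):
        return ("remote", remote_workers[host][block_id])
    # 3. Underlying persistent file system, if the file was persisted there.
    if block_id in under_fs:
        return ("under_fs", under_fs[block_id])
    raise IOError(f"block {block_id} is not available anywhere")
```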

Tachyon currently supports only local writes. Write operations can be classified by where the data is written:

Cache: write to the Tachyon in-memory file system

Through: write to the underlying persistent file system

The concrete write type is a legal combination of the cases above, such as cache only, cache + through, and so on.

There is also an async mode, which writes to the underlying persistent file system asynchronously, presumably to satisfy the need for data persistence while still meeting latency and other performance requirements.
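One way to picture write types as combinations of destinations is a flag enumeration. The sketch below is an illustration only: the class and function names are hypothetical, storage is modeled with dictionaries, and the async mode is left out.

```python
from enum import Flag, auto


class WriteType(Flag):
    """Write destinations that can be combined with the | operator."""
    CACHE = auto()    # Tachyon in-memory file system
    THROUGH = auto()  # underlying persistent file system


def write(name: str, data: bytes, write_type: WriteType,
          memory: dict, under_fs: dict) -> None:
    """Store the data in every destination selected by the write type."""
    if WriteType.CACHE in write_type:
        memory[name] = data
    if WriteType.THROUGH in write_type:
        under_fs[name] = data
```

For example, `write("f", b"x", WriteType.CACHE | WriteType.THROUGH, mem, ufs)` stores the data in both places, which corresponds to the cache + through case.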

Existing concurrency-related problems in read and write operations

As mentioned earlier, the client tells the worker to lock the block when it reads data. It is important to note that this lock is not a mutex. It is only a flag indicating that a user is currently using the file and its data, so that when the worker has to evict old data to free memory, files that are in use are not deleted.
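The difference between such a "lock" and a mutex can be made concrete with a small sketch: the lock is just a per-block usage count that the eviction logic consults. The class below is hypothetical and evicts the oldest unused blocks first, which is an assumed policy rather than Tachyon's documented one.

```python
from collections import OrderedDict


class BlockStore:
    """Toy worker store: 'locking' a block only pins it against eviction."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> size, oldest first
        self.users = {}              # block_id -> number of clients using it

    def lock(self, block_id: int) -> None:
        # Any number of clients may hold the "lock" at once: no mutual exclusion.
        self.users[block_id] = self.users.get(block_id, 0) + 1

    def unlock(self, block_id: int) -> None:
        remaining = self.users.get(block_id, 0) - 1
        if remaining > 0:
            self.users[block_id] = remaining
        else:
            self.users.pop(block_id, None)

    def add(self, block_id: int, size: int) -> bool:
        """Add a block, evicting unused blocks if needed; False if it cannot fit."""
        free = self.capacity - sum(self.blocks.values())
        victims = []
        for candidate, candidate_size in self.blocks.items():
            if free >= size:
                break
            if self.users.get(candidate, 0) == 0:  # blocks in use are skipped
                victims.append(candidate)
                free += candidate_size
        if free < size:
            return False
        for victim in victims:
            del self.blocks[victim]
        self.blocks[block_id] = size
        return True
```

Two readers can lock the same block at the same time, and nothing here stops a writer from touching the block concurrently; the count only protects the block from eviction.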

As for the write path, the current implementation seems to give essentially no consideration to concurrency.

For example Read

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
