Author: Liu Xuhui (Raymond). Please credit the source when reprinting.
Email: colorant at 163.com
Blog: http://blog.csdn.net/colorant/
Tachyon is a memory-based distributed file system developed by Li Haoyuan of AMPLab. It was conceived as an integral part of AMPLab's BDAS (Berkeley Data Analytics Stack).
Overall design ideas
Tachyon's design goal is to provide a memory-based distributed file-sharing framework that is fault tolerant while still delivering the performance advantages of memory.
Tachyon organizes the cluster in the usual master/worker fashion: the master node manages and maintains the file system metadata, while file data is kept in the memory of the worker nodes.
In terms of fault tolerance, the main technical points include:
Pluggable underlying file systems, such as HDFS, are supported for persisting the files that the user asks to persist.
A journal mechanism is used to persist the file system metadata.
ZooKeeper is used to build master HA.
In-memory data is not replicated through replicas. Instead, a lineage approach similar to Spark RDD is used for disaster recovery (discussed later).
In addition, an HDFS-compatible API is provided for compatibility with Hadoop applications.
Concrete Implementation Analysis
Initialization process
Initializing the Tachyon file system essentially means creating and emptying the working directories needed by the master and the workers.
For the master node, these are the data, worker, and journal directories on the underlying persistent file system. The worker directory is actually used by the worker nodes (to hold some temporary persisted files, data blocks whose meta information has been lost, and so on), but it is created by the master node, essentially to simplify the creation logic (since it lives on HDFS, it only needs to be created once).
The only directory a worker node needs is its local RAMDisk directory.
In addition, an empty file with a specific prefix is created in the master's journal folder to flag that the file system has been formatted.
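The formatting step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses a local directory in place of the underlying file system, and the flag-file prefix `_format_` and the function names are hypothetical, not taken from the Tachyon source.

```python
import shutil
from pathlib import Path

# Hypothetical prefix; the real flag-file name in Tachyon may differ.
FORMAT_FLAG_PREFIX = "_format_"


def recreate_empty(directory: Path) -> None:
    """Delete the directory if it exists, then create it again, empty."""
    if directory.exists():
        shutil.rmtree(directory)
    directory.mkdir(parents=True)


def format_master(underfs_root: Path, timestamp_ms: int) -> Path:
    """Create empty data/worker/journal directories and drop the flag file."""
    for name in ("data", "worker", "journal"):
        recreate_empty(underfs_root / name)
    flag = underfs_root / "journal" / f"{FORMAT_FLAG_PREFIX}{timestamp_ms}"
    flag.touch()  # empty file whose only purpose is to mark "formatted"
    return flag


def is_formatted(underfs_root: Path) -> bool:
    """The file system counts as formatted if a flag file is present."""
    journal = underfs_root / "journal"
    return journal.is_dir() and any(journal.glob(FORMAT_FLAG_PREFIX + "*"))
```

The master can then call something like is_formatted at startup and refuse to start if the flag file is missing.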
Tachyon Master startup process
The Tachyon Master startup process begins, naturally, with reading the master-related configuration parameters. These are currently passed to Java through -D parameters; ideally they would come from a configuration file. At present some of the parameters are set in the env file and then passed on through -D, some are hard-coded in the -D parameters, and some are not configured in the startup script at all, in which case the default values in the MasterConf code are used.
Next, the master determines whether the file system has been formatted by checking for the specific format flag file.
The next step is to rebuild the file system information in memory.
Tachyon's file system information relies on the journal, which has two parts: a snapshot image of the meta information at a given point in time, and an incremental edit log. On startup, Tachyon Master first reads the file system meta information from the snapshot image file, including information about the various kinds of data nodes (file, directory, raw table, checkpoint, dependency, and so on). It then reads the incremental operation records from the subsequent edit logs (there may be several). The contents of an edit log correspond, roughly, to the operations of the Tachyon file system client: adding, deleting, and renaming files, adding data blocks, and so on.
Note that these log records contain only meta information, not the actual file contents. So if the cached contents of a file are lost, and the file was neither persisted nor bound to any lineage information, the contents of that file are gone.
Once the file system information has been restored, and before it officially starts serving, Tachyon Master writes the current metadata to a new snapshot image.
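The recovery sequence (load the image, replay the edit logs in order, then write a fresh image) can be sketched as follows. The record layout, the operation names, and the function names are all assumptions made for illustration; Tachyon's real journal format is different. Only metadata is modeled, matching the point that file contents are not in the journal.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    # path -> list of block ids; file contents are NOT part of the journal
    files: dict = field(default_factory=dict)


def load_image(image: dict) -> Metadata:
    """Start from the snapshot image (a plain dict in this sketch)."""
    return Metadata(files={path: list(blocks) for path, blocks in image.items()})


def apply_edit(meta: Metadata, entry: dict) -> None:
    """Apply one incremental edit-log record to the in-memory metadata."""
    op = entry["op"]
    if op == "create":
        meta.files[entry["path"]] = []
    elif op == "delete":
        meta.files.pop(entry["path"], None)
    elif op == "rename":
        meta.files[entry["dst"]] = meta.files.pop(entry["src"])
    elif op == "add_block":
        meta.files[entry["path"]].append(entry["block_id"])
    else:
        raise ValueError(f"unknown edit log operation: {op}")


def recover(image: dict, edit_logs: list) -> Metadata:
    """Load the image, then replay every edit log in order."""
    meta = load_image(image)
    for log in edit_logs:
        for entry in log:
            apply_edit(meta, entry)
    return meta


def write_image(meta: Metadata) -> dict:
    """Produce a new snapshot image from the recovered metadata."""
    return {path: list(blocks) for path, blocks in meta.files.items()}
```

After recover returns, the result of write_image becomes the new starting point, so the replayed edit logs are no longer needed.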
With ZooKeeper enabled, the standby master periodically merges the edit log and creates a new image. If there is no standby master, the edit log is merged into a new image only during startup. Note that multiple masters operate on the image and edit log concurrently here, with no lock or mutex mechanism. I do not know whether this can lead to race conditions, stale data, or lost data.
Storage of files
Files that Tachyon stores on the RAMDisk are split into blocks (1 GB by default). The master assigns a block ID to each block, and the worker stores the block's data directly on the RAMDisk, using the block ID as the actual file name.
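As a small worked example of the block layout, the sketch below maps a file offset to a block index and builds the RAMDisk path of a block. The function names and the RAMDisk mount point are hypothetical, and the way the master actually assigns block IDs is not modeled here.

```python
import posixpath

DEFAULT_BLOCK_SIZE = 1 << 30  # 1 GB, the default mentioned in the text


def block_index(offset: int, block_size: int = DEFAULT_BLOCK_SIZE) -> int:
    """Index of the block that contains the given file offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return offset // block_size


def block_count(file_length: int, block_size: int = DEFAULT_BLOCK_SIZE) -> int:
    """Number of blocks needed for a file of the given length (ceiling division)."""
    return (file_length + block_size - 1) // block_size


def block_path(ramdisk_dir: str, block_id: int) -> str:
    """The block is stored on the RAMDisk under its block ID as the file name."""
    return posixpath.join(ramdisk_dir, str(block_id))
```

For instance, a file of 1 GB plus one byte needs two blocks, and any offset below 1 GB falls into block index 0.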
Read and write Data
Wherever possible, Tachyon reads and writes files by mapping them directly into memory through the Java NIO API and operating on them as data streams. This avoids holding large amounts of data in the Java heap, which reduces GC overhead and improves response time.
During reads and writes, all metadata-related operations are performed by calling the service API that Tachyon Master exposes via Thrift.
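Tachyon itself does this with Java NIO. To illustrate the same idea of reading a block file through a memory mapping instead of copying it into a buffer first, here is a minimal sketch using Python's standard mmap module. The function name read_mapped is made up for this example.

```python
import mmap
import os


def read_mapped(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` from a file through a read-only memory map."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return b""  # an empty file cannot be memory-mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Slicing copies only the requested range out of the mapping.
            return mapped[offset:offset + length]
```

Only the requested slice is copied into the process heap; the rest of the file stays in the page cache (or, for a RAMDisk file, in the RAMDisk itself).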
Tachyon file reads support both local and remote modes, which are transparent to the user from the client API's point of view. To read a file, the client first obtains the block ID that corresponds to the requested file offset.
It then connects to the local worker to obtain the file name that corresponds to that block ID. If the file exists, the client code asks the worker to lock the block, then maps the file directly via RandomAccessFile and reads it itself, without going through the worker as a proxy for the actual data.
If there is no local worker, or the file does not exist on the local worker, the client code asks the master API which worker holds the block and then reads the block's contents through the DataServer interface exposed by that worker. Inside the DataServer the process is the same: lock the block, map the file, read it, and return the data to the client.
In addition, the behavior depends on which TachyonFile API is used to read the data. With the file stream interface, when the remote worker does not have the block either, RemoteBlockInStream also tries to read the data from the underlying persistent file system (if the file exists there), whereas the readByteBuffer interface has no such fallback (personally, I think the two should behave consistently).
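The lookup order described above (local worker, then remote worker, then optionally the underlying file system) can be summarized in a small sketch. Workers and the underlying file system are modeled as plain dictionaries, and read_block, BlockNotFound, and the allow_underfs switch are hypothetical names chosen for this example, not Tachyon APIs.

```python
class BlockNotFound(Exception):
    """Raised when no source can provide the requested block."""


def read_block(block_id, local_worker, remote_workers, underfs, allow_underfs=True):
    """Return (source, data) for a block, trying sources in the order described.

    local_worker: dict of block_id -> bytes, or None if no worker runs locally.
    remote_workers: list of such dicts, standing in for remote DataServers.
    underfs: dict standing in for the underlying persistent file system.
    allow_underfs: True models the stream interface, False the byte-buffer one.
    """
    if local_worker is not None and block_id in local_worker:
        return "local", local_worker[block_id]
    for worker in remote_workers:
        if block_id in worker:
            return "remote", worker[block_id]
    if allow_underfs and block_id in underfs:
        return "underfs", underfs[block_id]
    raise BlockNotFound(block_id)
```

With allow_underfs set to False, a block that exists only in the underlying file system is reported as not found, which mirrors the inconsistency between the two interfaces noted above.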
Tachyon currently supports only local writes. Write operations can be classified by where the data is written:
Cache: Write to Tachyon Memory file system
Through: Write to the underlying persistent file system
The concrete write type is a legal combination of the above, such as cache only, cache + through, and so on.
There is also an async mode that writes to the underlying persistent file system asynchronously, presumably to satisfy the need for data persistence while still meeting latency and other performance requirements.
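One way to picture "legal combinations" of write locations is as a set of flags. The sketch below uses Python's enum.Flag for that; WriteTarget and write are names invented for this example, the two stores are plain dictionaries, and the async mode is deliberately left out.

```python
from enum import Flag, auto


class WriteTarget(Flag):
    CACHE = auto()    # write to the Tachyon memory file system
    THROUGH = auto()  # write to the underlying persistent file system


def write(path: str, data: bytes, write_type: WriteTarget, memory: dict, underfs: dict) -> None:
    """Store data in every location selected by write_type."""
    if not write_type:
        raise ValueError("write type must select at least one target")
    if WriteTarget.CACHE in write_type:
        memory[path] = data
    if WriteTarget.THROUGH in write_type:
        underfs[path] = data
```

A cache + through write is then expressed as WriteTarget.CACHE | WriteTarget.THROUGH, while the empty combination is rejected as illegal.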
Existing concurrency problems in read and write operations
As mentioned earlier, the client asks the worker to lock the block when reading data. It is important to note that this lock is not a mutex. It is only a marker indicating that a user is currently using the file and its data, so that when the worker needs to evict old data to free memory, files that are currently in use are not deleted.
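To make the difference between this kind of lock and a mutex concrete, here is a minimal sketch of a worker store in which a lock merely registers a user of a block and eviction skips blocks that have users. The class name, the method names, and the oldest-first eviction order are assumptions for illustration and are not taken from the Tachyon source.

```python
from collections import OrderedDict


class WorkerStore:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> size, oldest first
        self.users = {}              # block_id -> set of user ids using it

    def lock(self, block_id, user_id) -> None:
        """Mark the block as in use; several users may hold it at once."""
        self.users.setdefault(block_id, set()).add(user_id)

    def unlock(self, block_id, user_id) -> None:
        holders = self.users.get(block_id)
        if holders is None:
            return
        holders.discard(user_id)
        if not holders:
            del self.users[block_id]

    def used(self) -> int:
        return sum(self.blocks.values())

    def add_block(self, block_id, size: int) -> bool:
        """Evict old, unused blocks until the new block fits; False if it cannot."""
        for victim in list(self.blocks):
            if self.used() + size <= self.capacity:
                break
            if victim in self.users:
                continue  # in use by some reader: must not be evicted
            del self.blocks[victim]
        if self.used() + size > self.capacity:
            return False
        self.blocks[block_id] = size
        return True
```

Two readers can lock the same block at the same time without blocking each other; the only effect of the lock is that add_block will not evict that block.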
As for writes, the current implementation seems to give essentially no consideration to concurrency.
For example Read