Design Principles of the Distributed File System FastDFS

Address: http://blog.chinaunix.net/uid-20196318-id-4058561.html

FastDFS is an open-source, lightweight distributed file system consisting of three parts: the tracker server, the storage server, and the client. It mainly solves the problem of storing massive amounts of data and is especially suitable for online services backed by small and medium-sized files (recommended range: 4KB < file_size < 500MB).

 

Storage Server

Storage servers (hereinafter "storage") are organized into groups (also called volumes). A group contains multiple storage machines whose data are backups of each other, and the usable capacity of the group is determined by the storage server with the least space in it. Therefore, it is best to configure the storage servers within a group identically, to avoid wasting storage space.

Organizing storage in groups facilitates application isolation, load balancing, and customization of the replica count (the number of storage servers in a group equals the number of replicas of that group's data). For example, you can isolate applications by saving different applications' data to different groups, or assign applications to groups according to their access patterns to balance load. The disadvantages are that a group's capacity is limited by the capacity of a single machine, and that when a machine in a group fails, its data can only be recovered from the other machines within the group, so recovery takes a long time.

Each storage server in a group relies on the local file system to store files, and can be configured with multiple data storage directories. For example, if 10 disks are mounted at /data/disk1 through /data/disk10, all 10 directories can be configured as data storage directories.

When receiving a file write request, storage selects one of its storage directories for the file according to configured rules (described later). To avoid having too many files in a single directory, when storage starts for the first time it creates two levels of subdirectories under each data storage directory, with 256 subdirectories at each level, 65,536 in total. Newly written files are routed by hash to one of these subdirectories and stored there directly as local files.
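A minimal sketch of this first-start layout (not FastDFS source code; the hex directory naming mirrors the convention seen in FastDFS paths):

```python
import os
import tempfile

# Sketch: on first start, create the 256 * 256 two-level subdirectory
# tree under one data storage directory.
def init_store_path(base: str) -> None:
    for i in range(256):
        for j in range(256):
            os.makedirs(os.path.join(base, f"{i:02X}", f"{j:02X}"), exist_ok=True)

base = tempfile.mkdtemp()
init_store_path(base)
print(len(os.listdir(base)))  # → 256 first-level directories
```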

Tracker server

The tracker is the coordinator of FastDFS, responsible for managing all storage servers and groups. After each storage server starts, it connects to the tracker, reports its group and other information, and maintains a periodic heartbeat. From the heartbeat information, the tracker builds a mapping table of group => [storage server list].

The metadata the tracker manages is very small and is kept entirely in memory. Moreover, the tracker's metadata is generated from the information reported by storage servers, so nothing needs to be persisted, which makes the tracker very easy to scale out: a new tracker machine can simply be added to the tracker cluster and begin serving. All trackers in the cluster are completely equal; every tracker accepts heartbeat information from the storage servers and generates metadata to provide read/write services.

Upload File

FastDFS provides users with basic file access interfaces, such as upload, download, append, and delete, exposed as client libraries.

Select Tracker server

When there is more than one tracker server in the cluster, the client can select any tracker when uploading a file, because all trackers are completely equivalent.

Select the storage group

When the tracker receives an upload request, it allocates a group to store the file. The following group-selection rules are supported:

1. Round robin: rotate among all groups.
2. Specified group: always use a designated group.
3. Load balance: prefer the group with the most remaining storage space.
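The three policies can be sketched as follows (the group table and field names are hypothetical, not FastDFS's actual structures):

```python
import itertools

# Illustrative group table: each group reports its remaining space.
groups = {"group1": {"free_mb": 120_000}, "group2": {"free_mb": 340_000}}
_rr = itertools.cycle(sorted(groups))

def pick_round_robin():     # 1. round robin over all groups
    return next(_rr)

def pick_specified(name):   # 2. a fixed, caller-specified group
    return name

def pick_load_balance():    # 3. most remaining space first
    return max(groups, key=lambda g: groups[g]["free_mb"])

print(pick_round_robin(), pick_specified("group2"), pick_load_balance())
# → group1 group2 group2
```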

Select Storage Server

After the group is selected, the tracker selects a storage server in the group and returns it to the client. The following storage-selection rules are supported:

1. Round robin: rotate among all storage servers in the group.
2. First server ordered by IP: sort by IP address and pick the first.
3. First server ordered by priority: sort by priority (configured on the storage server) and pick the first.
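Rules 2 and 3 can be sketched as below; the server records are illustrative, and "smaller priority value wins" is an assumption about how the configured priority is ordered:

```python
# Hypothetical in-group server table.
servers = [
    {"ip": "10.0.0.3", "priority": 1},
    {"ip": "10.0.0.1", "priority": 9},
    {"ip": "10.0.0.2", "priority": 9},
]

def first_by_ip(servers):        # 2. first server ordered by IP
    return min(servers, key=lambda s: s["ip"])["ip"]

def first_by_priority(servers):  # 3. first server ordered by priority
    return min(servers, key=lambda s: s["priority"])["ip"]

print(first_by_ip(servers), first_by_priority(servers))  # → 10.0.0.1 10.0.0.3
```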

Select storage path

After a storage server is allocated, the client sends the file write request to it, and the storage server allocates one of its data storage directories to the file. The following rules are supported:

1. Round robin: rotate among the storage directories.
2. Most free space: prefer the directory with the most remaining space.

Generate fileid

After the storage directory is selected, storage generates a fileid for the file, formed by concatenating the storage server's IP address, the file creation time, the file size, the file's CRC32 checksum, and a random number. The resulting binary string is base64-encoded into a printable string.
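A sketch of this scheme follows; the exact field order and widths are assumptions, not FastDFS's real layout:

```python
import base64
import binascii
import os
import socket
import struct
import time

# Pack server IP, creation time, size, CRC32 and a random number,
# then base64-encode the binary string into a printable fileid.
def make_fileid(server_ip: str, data: bytes) -> str:
    packed = struct.pack(
        "!IIIII",
        struct.unpack("!I", socket.inet_aton(server_ip))[0],  # server IP
        int(time.time()),                                     # creation time
        len(data),                                            # file size
        binascii.crc32(data) & 0xFFFFFFFF,                    # CRC32
        int.from_bytes(os.urandom(4), "big"),                 # random number
    )
    return base64.urlsafe_b64encode(packed).rstrip(b"=").decode("ascii")

print(make_fileid("192.168.1.10", b"hello fastdfs"))
```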

Select two levels of directories

After the storage directory is selected, storage allocates a fileid to the file. Each storage directory contains two levels of 256*256 subdirectories; storage performs two hash operations on the fileid (presumably), routes the file to one of the subdirectories, and stores it there with the fileid as the file name.
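The routing idea can be sketched like this; the real hash FastDFS uses may differ:

```python
import zlib

# Two hash values over the fileid pick the first- and second-level
# subdirectory (256 possibilities each).
def pick_subdirs(fileid: str):
    data = fileid.encode("ascii")
    h1 = zlib.crc32(data) & 0xFF         # first-level directory
    h2 = (zlib.crc32(data) >> 8) & 0xFF  # second-level directory
    return f"{h1:02X}", f"{h2:02X}"

print(pick_subdirs("EXAMPLEFILEID"))
```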

Generate file name

Once the file is stored in a subdirectory, it is considered successfully stored, and a file name is then generated for it. The file name is formed by concatenating the group, the storage directory, the two levels of subdirectories, the fileid, and the file suffix (specified by the client, mainly used to distinguish file types).
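Putting the pieces together, the final name looks like this (M00 denoting the first configured store directory is a published FastDFS convention; the fileid here is a placeholder):

```python
# Assemble group, virtual store path, two subdirectory levels,
# fileid, and the client-supplied suffix into the access path.
def make_filename(group, store_index, d1, d2, fileid, suffix):
    return f"{group}/M{store_index:02d}/{d1}/{d2}/{fileid}.{suffix}"

print(make_filename("group1", 0, "00", "1F", "EXAMPLEFILEID", "jpg"))
# → group1/M00/00/1F/EXAMPLEFILEID.jpg
```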

File synchronization

When writing a file, the client writes it to one storage server in the group, and the write is then considered successful. After the storage server finishes writing the file, a background thread synchronizes it to the other storage servers in the same group.

After each storage server writes a file, it also writes a binlog entry. The binlog does not contain file data, only metadata such as the file name, and is used for background synchronization. Each storage records its synchronization progress toward the other storage servers in the group, so that synchronization can resume from where it left off after a restart. Progress is recorded as a timestamp, so it is best to keep the clocks of all servers in the cluster synchronized.

Each storage server's synchronization progress is reported to the tracker as part of the metadata, and the tracker uses it as a reference when selecting a storage server for reads.

For example, a group has three storage servers A, B, and C. A's synchronization progress toward C is T1 (all files written to A before T1 have been synchronized to C), and B's synchronization progress toward C is T2 (T2 > T1). The tracker sorts these progress reports and takes the smallest as C's synchronization timestamp, here T1 (that is, all data written before T1 has been synchronized to C). Following the same rule, the tracker generates synchronization timestamps for A and B.
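The rule in the example above reduces to taking a minimum (names here are illustrative):

```python
# A server's synchronization timestamp is the minimum of the progress
# timestamps its peers report for syncing to it.
def sync_timestamp_for(target, progress):
    """progress maps (source, destination) -> last synced timestamp."""
    return min(ts for (src, dst), ts in progress.items() if dst == target)

progress = {("A", "C"): 100, ("B", "C"): 200}  # T1 = 100, T2 = 200
print(sync_timestamp_for("C", progress))       # → 100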

Download file

After the client uploads a file successfully, it receives a file name generated by storage and can then access the file by that name.

As with uploading, the client can select any tracker server when downloading a file.

The client sends a download request to a tracker, carrying the file name. The tracker parses the group, size, creation time, and other information from the file name, then selects a storage server to serve the read request. Because files within a group are synchronized asynchronously in the background, a file may not yet have reached every storage server at read time. To avoid reading from such a server, the tracker selects a readable storage server in the group according to the following rules.

1. The storage server to which the file was uploaded: the source storage can definitely serve the file (the source address is encoded in the file name).
2. (File creation timestamp == storage's synchronization timestamp) and (current time - file creation timestamp) > maximum file synchronization time (for example, 5 minutes): once the maximum synchronization time has elapsed after creation, the file is assumed to have been synchronized to the other storage servers.
3. File creation timestamp < storage's synchronization timestamp: files created before the synchronization timestamp have already been synchronized.
4. (Current time - file creation timestamp) > synchronization delay threshold (for example, one day): after the delay threshold, the file is assumed to have been synchronized.
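The four checks can be sketched as a single predicate; the thresholds are the examples from the text, and all names are hypothetical:

```python
import time

MAX_SYNC_TIME = 5 * 60            # maximum file synchronization time
SYNC_DELAY_THRESHOLD = 24 * 3600  # synchronization delay threshold

def is_readable(storage, file_create_ts, source_ip, now=None):
    now = time.time() if now is None else now
    return (
        storage["ip"] == source_ip                       # 1. source storage
        or (file_create_ts == storage["sync_ts"]         # 2. past max sync time
            and now - file_create_ts > MAX_SYNC_TIME)
        or file_create_ts < storage["sync_ts"]           # 3. already synced
        or now - file_create_ts > SYNC_DELAY_THRESHOLD   # 4. past delay threshold
    )
```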

Merge storage of small files

The merge storage of small files mainly solves the following problems:

1. The number of inodes in the local file system is limited, so the number of small files that can be stored is limited.
2. Multi-level directories plus many files per directory lead to high file-access overhead (possibly many I/O operations).
3. Backing up and recovering storage file by file is inefficient for small files.

FastDFS introduced a small-file merge storage mechanism in v3.0, which stores multiple small files in one large file (a trunk file). To support this mechanism, the fileid generated by FastDFS needs 16 additional bytes:

1. The trunk file ID.
2. The file's offset within the trunk file.
3. The storage space occupied by the file (because of byte alignment and reuse of deleted space, the occupied space >= file size).
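An illustrative packing of these three fields; the widths (4 + 8 + 4 bytes) are an assumption chosen to total the 16 extra bytes the text mentions:

```python
import struct

# Pack trunk file id, offset within the trunk file, and allocated size.
def pack_trunk_info(trunk_id: int, offset: int, alloc_size: int) -> bytes:
    return struct.pack("!IQI", trunk_id, offset, alloc_size)

info = pack_trunk_info(7, 4096, 1024)
print(len(info))  # → 16
```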

Each trunk file is uniquely identified by an ID. Trunk files are created by the group's trunk server (selected by the tracker) and synchronized to the other storage servers in the group. After a small file is merged into a trunk file, it can be read from the trunk file according to its offset.

Because a file's offset within the trunk file is encoded into its file name, its position in the trunk file cannot change, and the space left by deleted files cannot be reclaimed by compaction. However, when a trunk file contains deleted files, that space can be reused: after a file is deleted, a new file no larger than the deleted one can be stored in the freed slot.

HTTP access support

Both the tracker and storage servers of FastDFS have built-in HTTP support, so clients can download files over HTTP. When a tracker receives a request, it redirects it via the HTTP redirect mechanism to the storage server holding the file. Besides the built-in HTTP support, FastDFS also provides Apache and nginx extension modules for downloading files.

Other features

FastDFS provides interfaces for setting and getting a file's extended attributes (setmeta/getmeta). The extended attributes are stored as key-value pairs in a file with the same name (plus a special prefix or suffix) on the storage server. For example, if /group/m00/00/01/some_file is the original file, its extended attributes are stored in /group/m00/00/01/some_file.meta (not necessarily exactly this, but a similar mechanism), so the file holding the extended attributes can be located from the file name.

The author does not recommend using these two interfaces: the extra meta files further amplify the problem of storing massive numbers of small files, and because meta files are tiny, storage space utilization is low; even a meta file of only a few bytes still occupies 4K (block_size) of storage.

FastDFS also supports appender files, stored through the upload_appender_file interface; an appender file can be appended to after creation. Appender files are actually stored the same way as normal files; the difference is that they cannot be merged into trunk files.

Problem Discussion

Overall, the design of FastDFS follows the principle of simplicity. For example, backing up data machine by machine simplifies tracker management; storing files as-is on the local file system simplifies storage management; and declaring a write successful once a single copy lands on one storage server, with synchronization done in the background, simplifies the write path. However, what a simple scheme can solve is usually limited, and FastDFS still has the following problems (discussion welcome).

Data security

  • Writing a single copy counts as success: between the write to the source storage and the synchronization to the other storage servers in the group there is a time window; if the source storage fails in that window, user data may be lost, and data loss is generally unacceptable for a storage system.
  • Lack of an automatic recovery mechanism: when a storage disk fails, it can only be replaced and the data restored manually. Because backup is machine by machine, automatic recovery seems impossible unless a hot spare disk is prepared in advance, and the lack of such a mechanism increases the operational burden.
  • Low data recovery efficiency: during recovery, data can only be read from the other storage servers in the group, and because small files are accessed inefficiently, file-by-file recovery is very slow. Low recovery efficiency means data stays in an unsafe state longer.
  • Lack of multi-data-center disaster tolerance: multi-data-center disaster tolerance is a common requirement today, but there is no built-in mechanism; data can only be synchronized to a backup cluster with additional tools.

Storage space utilization

  • The number of files that can be stored on a single machine is limited by the number of inodes.
  • Each stored file corresponds to one file in the local file system, wasting block_size/2 of storage space per file on average.
  • Merged file storage effectively addresses the two problems above, but because merged storage has no space reclamation mechanism, the space of deleted files cannot always be reused, so some space is still wasted.

Load balancing

  • The group mechanism itself can be used for load balancing, but only statically: it requires knowing the applications' access characteristics in advance. The group mechanism also makes it impossible to migrate data between groups for dynamic load balancing.

Remarks

  • Most of the above is my personal understanding and does not necessarily reflect the real behavior of FastDFS. If you find any problems, please point them out.
  • The pictures in this article come from the Internet, the chinaunix FastDFS discussion forum, and the UC technology blog. If copyright is involved, please contact me to delete them.

Reprinted from Yun Notes. Original article: Design Principles of the Distributed File System FastDFS.
