GFS: The Google File System (translated summary)


Google File System
GFS is a scalable distributed file system for large, data-intensive, distributed applications. It runs on inexpensive commodity hardware yet provides fault tolerance, and it delivers high aggregate performance to a large number of clients.
1. Design Overview
(1) Design decisions
GFS shares many goals with earlier distributed file systems, but its design is driven by the current and anticipated application workloads and technical environment, which differ markedly from the assumptions behind earlier file systems. This called for re-examining traditional choices and exploring radically different points in the design space.
The ways in which GFS departs from earlier file systems are as follows:
1. Component failures are treated as the norm rather than the exception. The file system consists of hundreds of storage machines built from inexpensive commodity parts and accessed by a large number of clients. The quantity and quality of the components virtually guarantee that some are not functional at any given time and that some will not recover from their failures. Constant monitoring, error detection, fault tolerance, and automatic recovery are therefore essential to the system.
2. Files are huge by traditional standards. Files of several gigabytes are common, and each file typically contains many application objects. When routinely working with fast-growing data sets of many terabytes comprising huge numbers of objects, it is unwieldy to manage vast numbers of KB-sized files, even when the underlying file system could support them. Design assumptions and parameters such as I/O operation and block sizes therefore have to be reconsidered. Management of large files must be efficient; small files must be supported, but need not be optimized for.
3. Most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically nonexistent. Once written, files are only read, and a great deal of data shares these characteristics: some files form large repositories that data-analysis programs scan through; some are streams of data produced continuously by running applications; some are archival data; some are intermediate results produced on one machine and processed on another. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, and caching data blocks in the client loses its appeal.
4. The workload consists mainly of two kinds of reads: large streaming reads and small random reads. In a large streaming read, individual operations typically read hundreds of KBs, commonly 1 MB or more, and successive operations from the same client usually read through a contiguous region of a file. A small random read typically reads a few KB at some arbitrary offset. Performance-conscious applications often batch and sort their small reads so that they advance steadily through a file rather than going back and forth.
5. The workload also has many large, sequential writes that append data to files. Typical write sizes are similar to those of reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported, but need not be efficient.
6. The system must efficiently implement well-defined semantics for multiple clients concurrently appending to the same file.
(2) System interface
GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by path names.
(3) Architecture:
A GFS cluster consists of a single master and a large number of chunkservers, and is accessed by many clients. As shown in Figure 1, the master and chunkservers are typically Linux machines running a user-level server process. A chunkserver and a client can run on the same machine, as long as machine resources and reliability permit.
Files are divided into fixed-size chunks. Each chunk is identified by an immutable, globally unique 64-bit chunk handle assigned by the master when the chunk is created. Chunkservers store chunks as Linux files on local disks and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. Three replicas are stored by default, though the user can specify a different replication level.
The master maintains the file system's metadata, including the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in heartbeat messages, giving it instructions and collecting its state.
GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master only for metadata operations; all data-bearing communication goes directly to the chunkservers.
Neither clients nor chunkservers cache file data. Client caches offer little benefit because the data is too large, or the working set too big, to cache. Not caching data simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data either, because chunks are stored as local files.
(4) Single master
Having a single master greatly simplifies the design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, the master's involvement in reads and writes must be minimized so that it does not become a bottleneck. Clients never read or write file data through the master. Instead, a client asks the master which chunkservers it should contact, caches that information for a limited time, and interacts with the chunkservers directly for subsequent operations.
The interactions for a simple read are illustrated in Figure 1.
1. Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
2. The client sends the master a request containing the file name and chunk index.
3. The master replies with the corresponding chunk handle and the locations of the replicas.
4. The client caches this information (handle and replica locations), using the file name and chunk index as the key.
5. The client sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle (chunkservers identify chunks by their handles) and a byte range within that chunk.
6. Further reads of the same chunk require no more client-master interaction until the cached information expires (the cache has a limited lifetime) or the file is reopened.
Typically, the client asks for multiple chunk locations in a single request, and the master can respond to such requests very quickly.
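The steps above can be sketched in a few lines of Python. This is only an illustration: the file names, chunk handles, and chunkserver IDs below are invented, and the master is modeled as a plain dictionary rather than an RPC service.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # the fixed 64 MB GFS chunk size

def locate(filename, byte_offset):
    """Step 1: translate (file name, byte offset) into (file name, chunk index)."""
    return filename, byte_offset // CHUNK_SIZE

# Hypothetical master state: (file name, chunk index) -> (handle, replica locations)
master_table = {
    ("/logs/web.0", 0): (0xAB12, ["cs1", "cs7", "cs9"]),
    ("/logs/web.0", 1): (0xAB13, ["cs2", "cs5", "cs8"]),
}

client_cache = {}

def read(filename, byte_offset):
    key = locate(filename, byte_offset)          # steps 1-2: build the lookup key
    if key not in client_cache:                  # step 6: a cache hit skips the master
        client_cache[key] = master_table[key]    # step 3: master replies
    handle, replicas = client_cache[key]         # step 4: cached by (name, chunk index)
    within_chunk = byte_offset % CHUNK_SIZE      # step 5: byte range inside the chunk
    return handle, replicas[0], within_chunk     # contact the first (closest) replica
```

Note how the master hands out only metadata; the bytes themselves would flow between the client and the chosen chunkserver.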
(5) Chunk size:
Chunk size is one of the key design parameters. We chose 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file and extended only as needed.
A large chunk size offers several advantages:
1. It reduces client-master interaction, because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for workloads that read and write large files. Even for small random reads, the client can comfortably cache all the chunk location information for a working set of several terabytes.
2. Since a client is likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time.
3. It reduces the size of the metadata stored on the master, allowing the metadata to be kept in memory, which in turn brings other advantages.
On the negative side:
A small file may consist of just one chunk, and if many clients access the same file, the chunkservers storing that chunk can become hot spots. In practice this has not been a major issue, because applications mostly read large multi-chunk files sequentially.
(6) Metadata:
The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in memory. The first two types are also kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on several remote machines. Using a log lets us update the master's state simply and reliably, without risking inconsistency if the master crashes. The master does not store chunk locations persistently; instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
A. In-memory data structures:
Because metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space.
One potential concern with this approach is that the number of chunks, and hence the capacity of the whole system, is limited by how much memory the master has. In practice this is not a serious limitation. The master maintains less than 64 bytes of metadata for each 64 MB chunk, and all chunks of a file are full except possibly the last one. Similarly, the namespace data for each file requires less than 64 bytes, because file names are stored compactly using prefix compression. If a larger file system must be supported, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, high performance, and flexibility gained by keeping the metadata in memory.
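A quick back-of-the-envelope check of that claim: at under 64 bytes of metadata per 64 MB chunk, the per-chunk metadata is about a millionth of the data it describes.

```python
CHUNK_SIZE = 64 * 1024**2   # 64 MB per chunk
META_PER_CHUNK = 64         # upper bound: < 64 bytes of metadata per chunk

def master_memory_bytes(total_data_bytes):
    """Upper bound on master memory needed for chunk metadata."""
    chunks = -(-total_data_bytes // CHUNK_SIZE)   # ceiling division
    return chunks * META_PER_CHUNK

# A petabyte of file data needs about 1 GiB of chunk metadata:
print(master_memory_bytes(1024**5))   # 1073741824 bytes = 1 GiB
```

So even a multi-petabyte file system fits comfortably in a few gigabytes of master memory, ignoring the (similarly small) namespace metadata.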
B. Chunk locations:
The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls the chunkservers for that information at startup. The master can keep itself up to date thereafter because it controls all chunk placement and monitors chunkserver status with regular heartbeat messages.
One benefit of this approach: because chunkservers may join or leave the cluster, change names, fail, and restart, and because these events happen all too often in a cluster with hundreds of servers, it eliminates the problem of keeping the master and the chunkservers synchronized.
Another reason is that only the chunkserver has the final word over which chunks it actually holds. Errors on a chunkserver may cause chunks to vanish spontaneously, so there is no point in maintaining a persistent record of this state on the master.
C. Operation log:
The operation log contains a historical record of the changes made to the metadata. It also defines a logical timeline for the order of concurrent operations: files, chunks, and their version numbers are all uniquely and permanently identified by the logical times at which they were created.
Since the operation log is critical, it must be stored reliably, and changes must not be made visible to clients until the metadata changes are made persistent. Therefore the log is replicated on several remote machines, and the master responds to a client request only after the corresponding log record has been written to disk both locally and remotely.
The master can recover its file system state by replaying the operation log. To minimize startup time, the log must be kept small. The master checkpoints its state whenever the log grows beyond a certain size, so that it can recover by loading the latest checkpoint from local disk and replaying only the log records after that point.
Building a checkpoint can take a while, so the master's internal state is structured so that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and creates the new checkpoint in a separate thread. The new checkpoint includes all mutations before the switch. For a cluster with a few hundred thousand files, it can be created in a minute or so; once complete, it is written to disk both locally and remotely.
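The checkpoint-plus-log recovery scheme above can be sketched as follows. This is a toy in-memory model, not the real master: the "state" is just a file-to-chunk-handle map, and the "disks" are ordinary process memory.

```python
class ToyMaster:
    def __init__(self):
        self.state = {}              # file name -> list of chunk handles
        self.log = []                # operation log (really on local + remote disks)
        self.checkpoint = ({}, 0)    # (snapshot of state, log position it covers)

    def _apply(self, op):
        name, handle = op
        self.state.setdefault(name, []).append(handle)

    def mutate(self, op):
        self.log.append(op)          # the record is logged before the change is visible
        self._apply(op)

    def take_checkpoint(self):
        # Snapshot the state and remember how much of the log it already covers.
        self.checkpoint = ({k: list(v) for k, v in self.state.items()},
                           len(self.log))

    def recover(self):
        snapshot, pos = self.checkpoint
        self.state = {k: list(v) for k, v in snapshot.items()}
        for op in self.log[pos:]:    # replay only the records after the checkpoint
            self._apply(op)
```

After a crash, only the tail of the log is replayed, which is why bounding the log's length with periodic checkpoints also bounds startup time.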
(7) Consistency model
Namespace mutations must be atomic, and they are handled exclusively by the master: namespace locking guarantees atomicity and correctness, and the master's operation log defines a global order for these operations.
The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. A file region is consistent if all clients see the same data, regardless of which replicas they read from. A region is defined after a mutation if it is consistent and clients see everything the mutation wrote. When a mutation succeeds without interference from concurrent writers, the affected region is defined, and all clients see what was written. Concurrent successful mutations leave the region undefined but consistent. A failed mutation leaves the region in an inconsistent state.
A write causes data to be written at an application-specified offset. A record append causes data (the record) to be appended atomically at least once, even in the presence of concurrent mutations, at an offset of GFS's choosing, which is returned to the client.
After a sequence of successful mutations, the mutated file region is guaranteed to be defined. GFS achieves this by applying mutations to all of a chunk's replicas in the same order and by using chunk version numbers to detect stale replicas (those that missed mutations while their chunkserver was down).
Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry's timeout and by the next open of the file.
Even after a mutation succeeds, component failures can still corrupt the data. GFS uses checksums to detect corruption and regular handshakes between the master and chunkservers to identify failed servers. Once a problem is detected, the data is restored from a valid replica as soon as possible. A chunk is lost only if all its replicas are lost before GFS can react.
2. System interaction
(1) Leases and mutation order:
(2) Data flow
Our goal is to fully utilize each machine's network bandwidth while avoiding network bottlenecks and high-latency links.
To use the network efficiently, we decouple the data flow from the control flow. Data is pushed linearly along a carefully chosen chain of chunkservers in a pipelined fashion, so each machine's full outbound bandwidth is used to transfer the data. To avoid bottlenecks, each machine forwards the data to the "closest" machine in the chain as soon as it starts receiving it.
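To see why pipelining matters, the GFS paper estimates the ideal time to push B bytes to R replicas in a pipeline as roughly B/T + R*L, where T is per-link throughput and L is per-hop latency, versus about R*B/T if the client pushed the full payload to every replica itself over a single NIC. A quick numeric sketch with made-up link parameters:

```python
def pipeline_time(b, t, r, l):
    """Chain forwarding: each hop adds only its latency, not a full retransmission."""
    return b / t + r * l

def star_time(b, t, r, l):
    """Client pushes the full payload separately to each of r replicas over one link."""
    return r * b / t + l

ONE_MB = 1_000_000
LINK = 100e6 / 8          # a 100 Mbps link, in bytes per second
print(pipeline_time(ONE_MB, LINK, 3, 0.001))   # ~0.083 s
print(star_time(ONE_MB, LINK, 3, 0.001))       # ~0.241 s
```

With three replicas, the pipeline delivers the data in roughly a third of the time, and the gap grows with the replication factor.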
(3) Atomic record appends:
GFS provides an atomic append operation called record append. In a traditional write, the client specifies the offset at which data is to be written, and concurrent writes to the same region are not serializable: the region may end up containing data fragments from multiple clients. In a record append, the client specifies only the data. GFS appends it to the file atomically at least once at an offset of GFS's choosing, and returns that offset to the client.
In distributed applications, many clients on different machines may append to the same file concurrently, and appends are used heavily. With traditional writes, the clients would need additional, complicated, and expensive synchronization, for example through a distributed lock manager. In our workloads, such files often serve as multiple-producer/single-consumer queues or contain merged results from many different clients.
Record append follows the control flow of the write operation described earlier, with just a little extra logic at the primary. The client first pushes the data to all replicas of the last chunk of the file, then sends its request to the primary. The primary checks whether appending the record would cause the chunk to exceed the maximum size (64 MB). If so, it pads the chunk to the maximum size, tells the other replicas to do the same, and replies to the client indicating that the operation should be retried on the next chunk. If the record fits within the maximum size, the primary appends the data to its replica, tells the other replicas to write the data at the exact same offset, and finally reports success to the client. If the append fails at any replica, the client retries the operation. At that point, replicas of the same chunk may contain different data: some may hold all of the record and some only part of it. GFS does not guarantee that all replicas are bytewise identical; it guarantees only that the data is written at least once as an atomic unit. The consequence is that if the operation succeeds, the data must have been written at the same offset on all replicas. Furthermore, from then on all replicas are at least as long as the end of the record, so any subsequent record will be assigned a higher offset or a different chunk, even if a different replica later becomes the primary. In terms of the consistency guarantees, the regions written by successful record appends are defined, while intervening regions are inconsistent.
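The primary's decision logic in the paragraph above can be sketched as follows. This is a simplified model: a real chunk holds bytes and the padding is real data, whereas here we track only the used length of the last chunk.

```python
CHUNK_MAX = 64 * 1024 * 1024   # 64 MB maximum chunk size

class PrimaryChunk:
    """The primary's view of the last chunk of a file being appended to."""
    def __init__(self, used=0):
        self.used = used

    def try_append(self, record_len):
        if self.used + record_len > CHUNK_MAX:
            self.used = CHUNK_MAX     # pad this chunk; secondaries do the same
            return None               # tell the client: retry on the next chunk
        offset = self.used            # the offset applied on *every* replica
        self.used += record_len
        return offset
```

A client retry after a partial failure simply appends the record again at a fresh offset, which is exactly how duplicate records can arise on some replicas.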
(4) Snapshot
The snapshot operation makes a copy of a file or a directory tree almost instantaneously, while minimizing any interruption of ongoing mutations.
We use copy-on-write techniques to implement snapshots. When the master receives a snapshot request, it first revokes any outstanding leases on the chunks of the files it is about to snapshot. This ensures that any subsequent write to these chunks will require an interaction with the master to find the lease holder, giving the master an opportunity to create a new copy of the chunk first.
After the leases have been revoked or have expired, the master logs the snapshot operation to disk, then applies it to its in-memory state by duplicating the metadata for the source file or directory tree. The newly created snapshot files point to the same chunks as the source files.
The first time a client wants to write to a chunk C after the snapshot, it sends the master a request to find the current lease holder. The master notices that the reference count for chunk C is greater than one. It defers replying to the client, picks a new chunk handle C', and asks each chunkserver that has a replica of C to create a new chunk C'. Creating C' locally on the same chunkservers that hold C avoids copying the data over the network. From this point on, request handling is no different from that for any other chunk.
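The reference-count bookkeeping behind copy-on-write can be sketched like this. Handles here are plain integers (real ones are 64-bit), and chunk data, which lives on the chunkservers, is not modeled at all.

```python
class ChunkTable:
    """Toy master-side chunk metadata: handle -> reference count."""
    def __init__(self):
        self.refs = {}
        self.next_handle = 0

    def create(self):
        h = self.next_handle
        self.next_handle += 1
        self.refs[h] = 1
        return h

    def snapshot(self, handles):
        # Metadata copy only: source and snapshot files share the same chunks.
        for h in handles:
            self.refs[h] += 1

    def write(self, h):
        if self.refs[h] > 1:      # shared after a snapshot: copy before writing
            self.refs[h] -= 1
            return self.create()  # new handle C'; chunkservers copy C locally
        return h                  # unshared: write in place
```

Only the first write after a snapshot pays the copy cost; later writes to C' proceed normally.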
3. Master Operation
The master executes all namespace operations. In addition, it manages chunk replicas throughout the system: it makes placement decisions, creates new chunks and their replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, and to reclaim unused storage.
3.1 Namespace management and locking
Unlike traditional file systems, GFS has no per-directory data structure that lists all the files in that directory, nor does it support aliases for files or directories (i.e., hard or symbolic links in Unix terms). The GFS namespace is logically a lookup table mapping full path names to metadata.
The master acquires a set of locks before running an operation. For example, to operate on /d1/d2/.../dn/leaf, it acquires read locks on /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read lock or a write lock on /d1/d2/.../dn/leaf (where leaf may be a file or a directory, depending on the operation). Serialization of master operations and consistency of the namespace are achieved through these locks.
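A sketch of the lock set acquired for a given path (the function name is ours, and the real master uses interned path names and reader-writer locks rather than returned tuples):

```python
def locks_for(path):
    """Locks the master would take for an operation on `path`: read locks on
    every ancestor directory, then a read or write lock on the leaf itself."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf = "/" + "/".join(parts)
    return [(p, "read") for p in ancestors] + [(leaf, "read-or-write")]

print(locks_for("/d1/d2/leaf"))
# [('/d1', 'read'), ('/d1/d2', 'read'), ('/d1/d2/leaf', 'read-or-write')]
```

Because ancestors take only read locks, operations in different parts of the tree, and even concurrent mutations in the same directory, can proceed in parallel.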
3.2 Replica placement policy
A GFS cluster is highly distributed, at more levels than one. Typically, hundreds of chunkservers are spread across many racks, and they are accessed by clients on the same or different racks. Communication between two machines on different racks may therefore cross one or more switches. The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization. For both, it is not enough to spread replicas across machines; replicas must also be spread across racks. This ensures that some replicas of a chunk survive and remain available even if an entire rack is damaged or taken offline. It also means that traffic, especially reads, for a chunk can exploit the aggregate bandwidth of multiple racks, at the cost of write traffic having to flow through multiple racks.
3.3 Creation, re-replication, and rebalancing of chunks
When the master creates a new chunk, it considers several factors: (1) place new replicas on chunkservers with below-average disk utilization, which over time equalizes disk utilization across chunkservers; (2) limit the number of recent creations on each chunkserver; (3) for the reasons discussed in the previous section, spread replicas of a chunk across racks.
The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal. This can happen for several reasons: a chunkserver becomes unavailable, its data is corrupted, a disk fails, or the replication goal is increased. Each chunk that needs re-replication is prioritized by how far it is from its replication goal, and chunks that are blocking client progress are boosted in priority. Finally, the master clones the highest-priority chunks following the same placement principles used for creation, including spreading replicas across racks.
The master also rebalances replicas periodically: it examines the current replica distribution and moves replicas for better disk space and load balancing. Through this process the master also gradually fills up a new chunkserver, rather than instantly swamping it with new chunks and the heavy write traffic that comes with them. The placement criteria for a moved replica are the same as above. In addition, the master must choose which existing replica to remove; in general it prefers to remove replicas from chunkservers with below-average free space, so as to equalize disk space usage.
3.4 Garbage Collection
After a file is deleted, GFS does not immediately reclaim the physical storage. It does so only lazily, during regular garbage collection at both the file and chunk levels.
When a file is deleted by the application, the master logs the deletion immediately, but the file's resources are not reclaimed right away. Instead, the file is renamed to a hidden name that includes the deletion timestamp. During the master's regular scan of the namespace, it removes any such hidden file that has existed for more than three days (the interval is configurable). Until then, the file can still be read under its new, special name and can be undeleted by restoring its previous name. When the hidden file is removed from the namespace, its in-memory metadata is erased, which effectively severs its links to all of its chunks.
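The hidden-rename scheme can be sketched as follows. The `.deleted.` prefix is our invention for illustration; the document says only that the file gets a hidden name carrying the deletion timestamp.

```python
HIDDEN_TTL = 3 * 24 * 3600   # three days (configurable) before real removal

class Namespace:
    def __init__(self):
        self.files = {}       # visible name -> metadata
        self.hidden = {}      # hidden name -> (metadata, deletion timestamp)

    def delete(self, name, now):
        meta = self.files.pop(name)
        self.hidden[".deleted." + name] = (meta, now)   # hypothetical hidden prefix

    def undelete(self, hidden_name, name):
        meta, _ = self.hidden.pop(hidden_name)
        self.files[name] = meta                         # restore the old name

    def scan(self, now):
        # The master's periodic namespace scan: erase expired hidden files.
        for h in list(self.hidden):
            _, ts = self.hidden[h]
            if now - ts > HIDDEN_TTL:
                del self.hidden[h]   # metadata erased; its chunks become orphans
```

Once the metadata is gone, the chunks are unreachable and will be collected by the chunk-level scan described next.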
In a similar regular scan of the chunk namespace, the master identifies orphaned chunks (those not reachable from any file) and erases their metadata. In heartbeat messages exchanged with the master, each chunkserver reports the set of chunks it holds, and the master replies with the identities of any chunks no longer present in its metadata. The chunkserver is then free to delete its replicas of those chunks.
3.5 Stale replica detection
If a chunkserver is down while a chunk is being mutated, its replica of the chunk becomes stale. For each chunk, the master maintains a version number to distinguish up-to-date replicas from stale ones.
Whenever the master grants a new lease on a chunk, it increases the chunk's version number and informs the up-to-date replicas. The master and these replicas all record the new version number. If a replica is unavailable at the time, its version number will not be advanced, and the master will detect the stale replica when the chunkserver restarts and reports its set of chunks and their version numbers.
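The version-number mechanism can be sketched in a few lines (a toy master-side view; persistence of the version numbers on master and chunkservers is not modeled):

```python
class ChunkVersions:
    """Master-side view: one authoritative version number per chunk handle."""
    def __init__(self):
        self.current = {}

    def grant_lease(self, handle):
        # Bump the version; up-to-date replicas record the new number.
        self.current[handle] = self.current.get(handle, 0) + 1
        return self.current[handle]

    def is_stale(self, handle, reported):
        # A replica that was down during a mutation reports an old version.
        return reported < self.current[handle]
```

A chunkserver that slept through one lease grant will report a version one behind the master's and be flagged as stale on restart.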
The master removes stale replicas in its regular garbage collection. Before then, it effectively treats a stale replica as simply not existing when it replies to client requests for chunk information. As another safeguard, the master includes the chunk version number when it informs clients which chunkserver holds a lease, or when it instructs a chunkserver to read a chunk from another chunkserver during cloning. The client or the chunkserver verifies the version number when it performs the operation, so that it always accesses up-to-date data.
4. Fault tolerance and diagnosis
4.1 High reliability
4.1.1 Fast Recovery
No matter how they were terminated, both the master and the chunkservers are designed to restore their state and start within seconds. In fact, we do not distinguish between normal and abnormal termination; servers are routinely shut down just by killing the process. Clients and other servers experience a minor hiccup as they time out on outstanding requests, reconnect to the restarted server, and retry.
4.1.2 Data Block Backup
As discussed earlier, each chunk is replicated on multiple chunkservers on different racks. Users can specify different replication levels for different parts of the file namespace. The master clones existing replicas as needed to keep each chunk fully replicated when chunkservers go offline or data is detected to be corrupted.
4.1.3 Master Backup
For reliability, the master's state, including its operation log and checkpoints, is replicated on multiple machines. A mutation to the state is considered committed only after its log record has been flushed to disk locally and on the master replicas. If the master machine or its disk fails, monitoring infrastructure detects the failure and starts a new master process on one of the replica machines. Clients use only a canonical name for the master, which can be redirected (by changing the DNS alias) to the new master, so they do not notice the change.
4.2 Data integrity
Each chunkserver uses checksumming to detect corruption of its stored data. The reason: every server can fail at any time, comparing chunks between two servers to detect corruption is impractical, and copying data between servers cannot by itself guarantee consistency.
Each chunk is broken up into 64 KB blocks, each with a corresponding 32-bit checksum. Checksums are stored persistently with logging, separate from user data.
For reads, the chunkserver verifies the checksums of the blocks that overlap the read range before returning any data, so chunkservers will not propagate corrupted data. If a block does not match its checksum, the chunkserver returns an error to the requestor and reports the mismatch to the master. The requestor then reads from another replica, while the master clones the chunk from a valid replica. After the new replica is in place, the master instructs the chunkserver that reported the error to delete its corrupted replica.
Checksum computation is optimized for writes that append to the end of a chunk, as these dominate our workload. We incrementally update the checksum for the appended data. Even if the last partial block is already corrupted and we fail to detect it now, the new checksum will not match the stored data, and the corruption will be detected the next time the block is read.
In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range before performing the write, and then compute and record the new checksums. Otherwise, the new checksums could hide corruption in the portions of those blocks that are not being overwritten.
During idle periods, chunkservers scan and verify the checksums of inactive chunks, which detects corruption in chunks that are rarely read. Once corruption is found, the master creates a new, correct replica to replace the corrupted one.
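A sketch of per-block checksum verification on the read path, using CRC-32 as a stand-in for whichever 32-bit checksum the chunkservers actually use:

```python
import zlib

BLOCK = 64 * 1024   # 64 KB blocks, each covered by its own 32-bit checksum

def checksums(chunk_data):
    """Compute one CRC-32 per 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK])
            for i in range(0, len(chunk_data), BLOCK)]

def verified_read(chunk_data, sums, offset, length):
    """Verify only the blocks the read range overlaps before returning bytes."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_data[b * BLOCK:(b + 1) * BLOCK]) != sums[b]:
            # In GFS: return an error to the requestor and report to the master.
            raise IOError("checksum mismatch in block %d" % b)
    return chunk_data[offset:offset + length]
```

Because checksums are per 64 KB block, a small read verifies only the one or two blocks it touches, keeping the verification cost proportional to the read size.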
4.3 Diagnostic tools
At a trivial cost, extensive and detailed diagnostic logging has helped immeasurably in problem isolation, diagnosis, and performance analysis. GFS servers log significant events (such as servers going down and starting up) and all remote requests and replies. By matching requests with replies and collating the records on different machines, we can reconstruct the entire interaction history and use it to diagnose a problem.
6. Measurements
6.1 Test environment
The cluster consists of one master, two master replicas, 16 chunkservers, and 16 clients.
Each machine has two 1.4 GHz PIII processors, 2 GB of RAM, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex NIC.
All 19 servers are connected to one HP 2524 switch, and the 16 clients to another; the two switches are linked by a 1 Gbps line.
