Google File System (GFS) is a large-scale distributed file system built on inexpensive commodity servers. It treats server failure as the norm and handles faults automatically in software, which greatly reduces system cost while preserving reliability and availability.
GFS is the cornerstone of Google's cloud storage. Other storage systems, such as Google Bigtable, Google Megastore, and Google Percolator, are built directly or indirectly on top of GFS. In addition, Google's large-scale batch-processing system, MapReduce, uses GFS for the input and output of massive data sets.
System architecture
GFS divides the nodes of the system into three roles: GFS Master (master server), GFS ChunkServer (data block server, abbreviated CS), and GFS Client (client).
A GFS file is divided into fixed-size chunks. When a chunk is created, Master assigns it a 64-bit globally unique chunk handle. CS stores each chunk on disk as an ordinary Linux file. To ensure reliability, each chunk is replicated on several machines, three by default.
Master maintains the system's metadata, including the file and chunk namespaces, the mapping from GFS files to chunks, and chunk location information. It is also responsible for global control of the whole system, such as chunk lease management, garbage collection of unused chunks, and chunk replication. Master periodically exchanges information with each CS through heartbeat messages.
The client is the access interface that GFS provides to applications. It is a dedicated set of interfaces that does not follow the POSIX specification and is delivered as a library. When a client accesses GFS, it first contacts the Master node to find out which CS it needs to talk to, and then accesses those CS directly to transfer the data.
Note that the GFS client does not cache file data; it caches only the metadata obtained from Master. This follows from how GFS is used. GFS has two main applications: MapReduce and Bigtable. MapReduce reads and writes through the GFS client sequentially, so caching file data brings no benefit, while Bigtable, as a cloud table system, implements its own caching internally. In addition, keeping a client-side cache consistent with the actual data is an extremely complex problem.
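The client-side lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names FakeMaster and Client, the lookup method, and the shape of the returned tuple are all hypothetical and are not the real GFS client library. The sketch shows the two points made in the text: a byte offset maps to a chunk index, and only metadata (never file data) is cached on the client.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size used in this article


class FakeMaster:
    """Hypothetical stand-in for Master: maps (path, chunk index) to metadata."""

    def __init__(self, table):
        self.table = table      # (path, chunk_index) -> (handle, [chunk servers])
        self.lookups = 0        # counts how often the client had to ask Master

    def lookup(self, path, chunk_index):
        self.lookups += 1
        return self.table[(path, chunk_index)]


class Client:
    """Hypothetical client that caches chunk metadata but no file data."""

    def __init__(self, master):
        self.master = master
        self.cache = {}         # (path, chunk_index) -> (handle, [chunk servers])

    def locate(self, path, offset):
        chunk_index = offset // CHUNK_SIZE
        key = (path, chunk_index)
        if key not in self.cache:               # ask Master only on a cache miss
            self.cache[key] = self.master.lookup(path, chunk_index)
        handle, servers = self.cache[key]
        # The client would now contact one of `servers` directly for the data.
        return handle, servers, offset % CHUNK_SIZE
```

Repeated calls to locate for offsets inside the same chunk reach Master only once, which is why Master stays off the data path.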
Several key issues in the GFS architecture are discussed below.
Lease mechanism
GFS appends data in units of records, each ranging from tens of KB to a few MB. If every record append had to go through Master, Master would clearly become the performance bottleneck of the system, so GFS delegates chunk write authority to a chunk server through the lease mechanism. The chunk server that holds the lease is called the primary chunk server, and the chunk servers holding the other replicas are called secondary chunk servers. A lease is granted per chunk. While the lease is valid, the primary chunk server is responsible for all writes to that chunk, which reduces the load on Master. A lease is generally valid for a fairly long time, such as 60 seconds. As long as nothing goes wrong, the primary chunk server can keep asking Master to extend the lease until the chunk is full.
Suppose chunk A has three replicas in GFS, A1, A2, and A3, with A1 as the primary. If the chunk server holding A2 goes offline and later comes back, and A1 and A3 were updated while A2 was offline, then A2 is stale and Master must garbage-collect it. GFS solves this by keeping a version number for each chunk. Master increments the chunk's version number every time it grants a lease on the chunk or the primary chunk server renews the lease. If A1 and A3 were updated while A2 was offline, the primary chunk server must have requested a lease from Master again, which raised the version number of A1 and A3. When A2 comes back online, Master sees that its version number is too low and marks A2 as a deletable chunk. Master's garbage collection task runs periodically and tells the chunk server to reclaim A2.
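The version-number check for the A1/A2/A3 example can be sketched as below. This is a minimal in-memory model, not the real Master: the class name VersionMaster, its methods, and the use of plain dictionaries to represent replicas are assumptions made for illustration.

```python
class VersionMaster:
    """Hypothetical model of Master's per-chunk version bookkeeping."""

    def __init__(self):
        self.version = {}  # chunk handle -> latest version known to Master

    def grant_lease(self, handle, live_replicas):
        # Bump the version on every lease grant and push the new version
        # to the replicas that are reachable right now.
        self.version[handle] = self.version.get(handle, 0) + 1
        for replica in live_replicas:
            replica[handle] = self.version[handle]
        return self.version[handle]

    def is_stale(self, handle, reported_version):
        # A replica that missed a lease grant reports an older version.
        return reported_version < self.version[handle]


# Each replica is modelled as a dict: chunk handle -> version it has seen.
a1, a2, a3 = {}, {}, {}
master = VersionMaster()
master.grant_lease("A", [a1, a2, a3])   # all three replicas are online
master.grant_lease("A", [a1, a3])       # A2's chunk server is offline
stale = master.is_stale("A", a2["A"])   # True: A2 can be garbage-collected
```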
Consistency model
GFS is designed primarily for appending rather than overwriting. On the one hand, overwrites are rarely needed, or can be implemented through appends; for example, the distributed table system Bigtable can be built using only the append capability of GFS. On the other hand, the consistency model for appends is simpler and more efficient than the one for overwrites. Consider the three replicas A1, A2, and A3 of chunk A. If an overwrite modifies A1 and A2 but not A3, a read served by A3 may return incorrect data. By contrast, if an append writes a record to A1 and A2 but fails on A3, a read served by A3 returns only stale data, not incorrect data.
We discuss only the consistency of appends here. If no failure occurs, a successfully appended record is defined and consistent across all replicas. If a failure occurs, the append may succeed on some replicas and fail on others, and the replicas where it failed may contain recognizable padding records. The GFS client retries a failed append, and once success is returned to the user, the record has been appended successfully at least once on every replica. As a result, some chunk replicas may contain the same record more than once (duplicate records), and may also contain recognizable padding records; the application layer must be able to handle both.
In addition, GFS lets multiple clients append concurrently, so the order of records from different clients is not guaranteed, and records appended back to back by one client may be interleaved with others. For example, suppose a client appends records R1 and R2 in succession. Because appending the two records is not a single atomic operation, other clients may append in between, so R1 and R2 may not be contiguous in the GFS chunk and may be separated by data from other clients.
GFS adopted this consistency model for the sake of performance, and it makes application development harder. A MapReduce application, being batch-oriented, can append its data to a temporary file, keep an index of the offset of each appended record within that file, and rename the temporary file to its final name once the file is closed. For Bigtable, which sits on top of GFS, there are two ways to deal with the model, which will be introduced later.
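One way an application might cope with padding and duplicate records is sketched below. The article does not say how applications do this, so the approach here is an assumption: each record is assumed to carry a writer-chosen unique ID, and padding is represented by a sentinel value. The names PADDING and clean_records are hypothetical.

```python
PADDING = None  # hypothetical marker for a recognizable padding record


def clean_records(raw_records):
    """Drop padding and keep only the first copy of each record ID.

    Each real record is assumed to be a (record_id, payload) tuple.
    """
    seen = set()
    result = []
    for record in raw_records:
        if record is PADDING:
            continue                    # skip padding left by a failed append
        record_id, payload = record
        if record_id in seen:
            continue                    # skip a duplicate left by a client retry
        seen.add(record_id)
        result.append(payload)
    return result


chunk_contents = [(1, "a"), PADDING, (2, "b"), (2, "b"), (3, "c")]
payloads = clean_records(chunk_contents)  # ["a", "b", "c"]
```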
Append process
The append process is the most complex part of GFS, and efficient record append is critical for building Bigtable on GFS. The process is roughly as follows:
- The client asks Master for the chunk servers that hold each replica of the chunk; the primary chunk server is the one holding the lease for modifications. If no chunk server holds a lease, the chunk has not been written recently, and Master grants the chunk lease to one of the chunk servers according to some policy.
- Master returns the locations of the primary and the other replicas to the client, which caches them for later use. Unless a failure occurs, the client does not need to contact Master again for subsequent reads and writes of this chunk.
- The client sends the record to be appended to every replica. Each chunk server caches the data in an internal LRU structure. GFS separates the data flow from the control flow so that the data flow can be scheduled according to the network topology.
- Once all replicas have acknowledged receipt of the data, the client sends the write request (a control command) to the primary. Because the primary may receive concurrent appends to the same chunk from multiple clients, it decides the order of these operations and applies them locally;
- The primary forwards the write request to all secondary replicas. Each secondary applies the writes in the order chosen by the primary;
- Each secondary replies to the primary after completing the write successfully;
- The primary replies to the client. If any replica reported an error, for example the write succeeded on the primary but failed on some secondaries, the client retries.
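The steps above can be sketched as a small in-memory simulation. It only illustrates the split between pushing data (data flow) and committing it (control flow), and the fact that the primary fixes the order for all replicas. The classes Replica and Primary and their methods are hypothetical, and failure handling is left out entirely.

```python
class Replica:
    """Hypothetical chunk replica: buffers pushed data, then applies it."""

    def __init__(self):
        self.buffer = {}    # data id -> bytes pushed by clients (data flow)
        self.records = []   # records applied to the chunk, in order

    def push_data(self, data_id, data):
        self.buffer[data_id] = data

    def apply(self, data_id):
        self.records.append(self.buffer.pop(data_id))


class Primary(Replica):
    """The replica holding the lease: it decides the order of appends."""

    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries

    def append(self, data_id):
        self.apply(data_id)                 # order = order of arrival at primary
        for secondary in self.secondaries:
            secondary.apply(data_id)        # secondaries follow the same order
        return len(self.records) - 1        # record index within the chunk


s1, s2 = Replica(), Replica()
primary = Primary([s1, s2])
for replica in (primary, s1, s2):           # data flow: push to every replica
    replica.push_data("x", b"R1")
    replica.push_data("y", b"R2")
first = primary.append("y")                 # control flow: commit via primary
second = primary.append("x")
```

Even though "x" was pushed before "y", every replica ends up with R2 before R1, because the commit order is whatever the primary chose.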
The GFS append process has two notable features: pipelining, and the separation of data flow from control flow. Pipelining is used to reduce latency. As soon as a chunk server receives some data, it starts forwarding it. On a full-duplex network, sending data immediately does not reduce the rate at which data is received. Ignoring network congestion, the ideal time to transfer B bytes to R replicas is B/T + RL, where T is the network throughput and L is the latency between two machines. With 100 Mbps links, the figure used in the original GFS paper, and L well below 1 ms, 1 MB can ideally be distributed to multiple replicas in about 80 ms; on a gigabit network the same transfer ideally takes under 10 ms.
The separation of data flow and control flow mainly serves to optimize data transfer: each machine forwards the data to the closest machine in the network topology that has not yet received it. For example, suppose there are three chunk servers S1, S2, and S3, with S1 and S3 in the same rack and S2 in another rack, and the client runs on machine S1. If the data is forwarded from S1 to S2 and then from S2 to S3, two cross-rack transfers are needed. Under the GFS policy, the data goes first to S1, then from S1 to S3, and finally to S2, so only one cross-rack transfer is needed.
Separating the data flow from the control flow pays off only when each append is fairly large, as in the MapReduce batch system, and it makes the append process more complex. With the traditional primary/backup replication method, the append process becomes somewhat simpler:
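As a quick check of the two claims above, the sketch below evaluates the ideal transfer time B/T + RL and counts cross-rack hops for a forwarding chain. The function names and the sample values (three replicas, 0.5 ms latency, a 100 Mbps link) are illustrative assumptions, not measurements.

```python
def ideal_transfer_seconds(size_bytes, throughput_bps, replicas, latency_s):
    """Ideal pipelined transfer time: B/T + R*L, ignoring congestion."""
    return size_bytes * 8 / throughput_bps + replicas * latency_s


def cross_rack_hops(chain, rack_of):
    """Count hops in a forwarding chain whose two ends sit in different racks."""
    return sum(rack_of[a] != rack_of[b] for a, b in zip(chain, chain[1:]))


# 1 MB to 3 replicas over a 100 Mbps link with 0.5 ms latency per hop.
t_100m = ideal_transfer_seconds(1_000_000, 100e6, 3, 0.0005)   # about 0.08 s
t_1g = ideal_transfer_seconds(1_000_000, 1e9, 3, 0.0005)       # about 0.01 s

# S1 and S3 share a rack; S2 is in another rack; the client runs on S1.
rack_of = {"S1": "rack1", "S3": "rack1", "S2": "rack2"}
naive = cross_rack_hops(["S1", "S2", "S3"], rack_of)            # 2
topology_aware = cross_rack_hops(["S1", "S3", "S2"], rack_of)   # 1
```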
- Same as the first step of the GFS append process;
- Same as the second step of the GFS append process;
- The client sends the data to be appended to the primary chunk server. The primary may receive concurrent append requests from multiple clients, so it decides the order of the operations and writes the data locally;
- The primary forwards the data to all secondaries in a pipelined fashion;
- Each secondary chunk server receives the record and writes it locally. A replica replies to the replica before it in the pipeline only after its own local write has succeeded and it has received a success reply from the replica after it. For example, A may reply to the primary only after B has replied successfully and A's own local write has succeeded;
- The primary replies to the client. If the client does not receive the primary's reply within the timeout, an error has occurred and the client must retry.
Of course, the real append process is far from this simple. During an append, the primary's lease may expire so that it loses the authority to modify the chunk, or the primary or a secondary machine may fail, and so on. For reasons of space, exception handling in the append process is left for the reader to think through.
Fault tolerance mechanism
Master fault tolerance follows the traditional approach of an operation log plus checkpoints, together with a real-time hot standby called the "Shadow Master".
Master stores three kinds of metadata:
(1) The namespace, that is, the directory structure of the whole file system and basic chunk information;
(2) The mapping from files to chunks;
(3) The locations of chunk replicas; each chunk usually has three replicas.
Every change on the GFS Master is written to the operation log before memory is modified. When Master fails and restarts, it rebuilds its in-memory data structures from the operation log on disk. To shorten recovery time, Master periodically dumps its in-memory data to disk as a checkpoint file, which reduces the amount of log that has to be replayed. To further improve the reliability and availability of Master, GFS also keeps a real-time hot standby: a metadata modification is considered successful only after it has been sent to the hot standby. The remote hot standby receives the operation log from Master in real time and replays the metadata operations in memory. If Master goes down, service can switch to the hot standby within seconds. To guarantee that there is only one Master at any time, GFS relies on Google's internal Chubby service for master election.
Master needs to persist only the first two kinds of metadata, the namespace and the file-to-chunk mapping. The third kind, the locations of chunk replicas, need not be persisted, because the chunk servers maintain this information themselves. Even after a Master failure, it can be recovered from the reports that chunk servers send when Master restarts.
GFS makes chunk servers fault tolerant through replication: each chunk has multiple replicas stored on different chunk servers. A write to a chunk counts as successful only when it has been written to all replicas. If a replica is lost or cannot be recovered, Master automatically copies the chunk to another chunk server so that the number of replicas stays at the required level.
In addition, each chunk server maintains checksums over the data it stores. GFS splits files into 64 MB chunks, and each chunk is further divided into 64 KB blocks, each with a 32-bit checksum. When a chunk replica is read, the chunk server compares the data read against the checksum. If they do not match, it returns an error, and the client chooses a replica on another chunk server.
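The operation-log-plus-checkpoint recovery described above can be sketched as follows. This is a toy model: Python lists and dicts stand in for the on-disk log and checkpoint file, and the class MetadataStore and its operation format are hypothetical.

```python
class MetadataStore:
    """Hypothetical model of Master metadata with a log and a checkpoint."""

    def __init__(self):
        self.state = {}        # in-memory metadata: path -> chunk handles
        self.log = []          # stands in for the on-disk operation log
        self.checkpoint = {}   # stands in for the on-disk checkpoint file

    @staticmethod
    def _mutate(state, op):
        kind, path, value = op
        if kind == "set":
            state[path] = value
        elif kind == "delete":
            state.pop(path, None)

    def apply(self, op):
        self.log.append(op)            # write the log record first ...
        self._mutate(self.state, op)   # ... then modify memory

    def take_checkpoint(self):
        self.checkpoint = dict(self.state)
        self.log.clear()               # older log entries are no longer needed

    def recover(self):
        # Load the checkpoint, then replay only the log written after it.
        state = dict(self.checkpoint)
        for op in self.log:
            self._mutate(state, op)
        return state


store = MetadataStore()
store.apply(("set", "/a", [1]))
store.take_checkpoint()
store.apply(("set", "/b", [2]))
store.apply(("delete", "/a", None))
store.state = {}                       # simulate losing memory in a crash
recovered = store.recover()            # {"/b": [2]}
```

Only two log entries are replayed after the checkpoint, which is the reduction in replay work that checkpoints are meant to provide.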
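A minimal sketch of per-block checksumming is shown below. The article only says that each 64 KB block has a 32-bit checksum; using CRC-32 from Python's zlib module is an assumption made here for illustration, and the function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as described in the text


def block_checksums(chunk_data):
    """Compute one 32-bit checksum (CRC-32, an assumption) per 64 KB block."""
    return [
        zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
        for i in range(0, len(chunk_data), BLOCK_SIZE)
    ]


def verified_read(chunk_data, checksums, block_index):
    """Return one block, or raise if it no longer matches its checksum."""
    start = block_index * BLOCK_SIZE
    block = chunk_data[start:start + BLOCK_SIZE]
    if zlib.crc32(block) != checksums[block_index]:
        # The client would react by reading from another replica.
        raise IOError(f"checksum mismatch in block {block_index}")
    return block


data = bytes(range(256)) * 600          # 153,600 bytes, i.e. three blocks
sums = block_checksums(data)
good_block = verified_read(data, sums, 1)

corrupted = bytearray(data)
corrupted[70_000] ^= 0xFF               # flip bits inside block 1
```

Calling verified_read(bytes(corrupted), sums, 1) raises IOError, while block 0 of the corrupted data still verifies, so only the damaged block is rejected.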
Master memory consumption
Master maintains the system's metadata, including the file and chunk namespaces, the file-to-chunk mapping, and the locations of chunk replicas. The first two kinds must be persisted to disk; the replica locations need not be persisted, because chunk servers can report them.
Memory is a scarce resource for Master, so its memory usage is estimated below. The metadata for a chunk includes its globally unique ID, its version number, the chunk servers holding each replica, a reference count, and so on. Each GFS chunk is 64 MB, stored as 3 replicas by default, with no more than 64 bytes of metadata per chunk. The chunk metadata for 1 PB of data therefore does not exceed 1 PB × 3 / 64 MB × 64 B = 3 GB. In addition, Master compresses the namespace. For example, if two files foo1 and foo2 are both stored in the directory /home/very_long_directory_name/, the directory name needs to be stored in memory only once. After compression, the metadata for each file in the namespace is no more than 64 bytes, and because GFS files are generally large, the file namespace takes little memory. Master's memory capacity is therefore not a bottleneck for GFS.
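The 3 GB estimate can be reproduced with a few lines of arithmetic. The sketch below simply restates the figures from the text (64 MB chunks, 3 replicas, 64 bytes of metadata) using binary units; the function name is hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3
PB = 1024 ** 5


def chunk_metadata_bytes(data_bytes, replicas=3, chunk_size=64 * MB, meta_per_chunk=64):
    """Upper bound on chunk metadata, following the article's formula."""
    return data_bytes * replicas // chunk_size * meta_per_chunk


estimate = chunk_metadata_bytes(1 * PB)
estimate_gb = estimate / GB   # 3.0
```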
Load balancing
Replica placement in GFS takes many factors into account, such as network topology, rack distribution, and disk utilization. To improve availability, GFS avoids placing all replicas of a chunk in the same rack.
There are three situations in which chunk replicas are created: chunk creation, chunk re-replication, and rebalancing.
When Master creates a chunk, it chooses the initial locations of the replicas according to the following factors: (1) the disk utilization of the chunk server that will hold the new replica should be below average; (2) the number of "recent" creations on each chunk server should be limited; (3) the replicas of a chunk must not all be in the same rack. The second factor is easy to overlook but important, because data is usually written right after a chunk is created. Without a limit on "recent" creations, a newly added empty chunk server, with its low disk utilization, could have a large number of chunks placed on it at once and be overwhelmed.
When the number of replicas of a chunk falls below the required number, Master tries to re-replicate the chunk. Possible causes include a chunk server going down, a chunk server reporting that its replica is corrupted, one of its disks failing, or a user raising the replica count of the chunk. Each re-replication task has a priority, and tasks are queued at Master and executed from highest to lowest priority. For example, a chunk with only one replica left is replicated first. Likewise, chunks of live files have higher priority than chunks of recently deleted files. Finally, GFS raises the priority of any re-replication task that is blocking a client operation. For example, if a client is appending to a chunk that has only one replica, and an append needs at least two replicas to succeed, the re-replication task blocks the client's write and its priority must be raised.
Finally, Master periodically scans the current replica distribution and rebalances it if disk usage or machine load is uneven.
Whether for chunk creation, re-replication, or rebalancing, the policy for choosing replica locations is the same. The copy rate of re-replication and rebalancing tasks must also be limited; otherwise they may interfere with normal read and write service.
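The three placement factors can be combined in many ways; the sketch below shows one possible combination, not the policy GFS actually uses. The function name choose_servers, the dictionary fields, and the max_recent threshold are all assumptions. The function may return fewer servers than requested if the constraints cannot be met.

```python
def choose_servers(servers, n=3, max_recent=2):
    """Pick up to n chunk servers for a new chunk (illustrative policy only)."""
    average = sum(s["disk_util"] for s in servers) / len(servers)
    candidates = [
        s for s in servers
        if s["disk_util"] <= average             # factor 1: not above average
        and s["recent_creations"] < max_recent   # factor 2: cap recent creations
    ]
    candidates.sort(key=lambda s: s["disk_util"])
    chosen = []
    for server in candidates:
        if len(chosen) == n:
            break
        # Factor 3: keep the last slot for a different rack if every server
        # chosen so far is in the same rack as this candidate.
        if chosen and len(chosen) == n - 1 and all(
            c["rack"] == server["rack"] for c in chosen
        ):
            continue
        chosen.append(server)
    return [s["name"] for s in chosen]


servers = [
    {"name": "cs1", "rack": "A", "disk_util": 0.20, "recent_creations": 0},
    {"name": "cs2", "rack": "A", "disk_util": 0.30, "recent_creations": 0},
    {"name": "cs3", "rack": "A", "disk_util": 0.35, "recent_creations": 0},
    {"name": "cs4", "rack": "B", "disk_util": 0.36, "recent_creations": 0},
    {"name": "cs5", "rack": "B", "disk_util": 0.10, "recent_creations": 5},
    {"name": "cs6", "rack": "B", "disk_util": 0.90, "recent_creations": 0},
]
placement = choose_servers(servers)
```

In this sample, cs5 is nearly empty but is skipped because it already received many recent creations, and cs3 is skipped in favor of cs4 so that the replicas span two racks.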
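A priority queue for re-replication tasks might look like the sketch below. The text lists the factors but not how they are weighed against each other, so the ordering used here (blocked client first, then live files before deleted files, then the largest replica shortfall) is an assumption, as are the function names and the task tuple layout.

```python
import heapq


def priority(live_replicas, goal, file_deleted, blocks_client):
    """Smaller tuples are served first (the ordering is an assumption)."""
    return (
        0 if blocks_client else 1,   # tasks blocking a client go first
        1 if file_deleted else 0,    # live files before recently deleted ones
        live_replicas - goal,        # larger shortfall is more urgent
    )


def schedule(tasks):
    """Return chunk names in the order their re-replication would run."""
    heap = []
    for chunk, live, goal, deleted, blocking in tasks:
        heapq.heappush(heap, (priority(live, goal, deleted, blocking), chunk))
    order = []
    while heap:
        order.append(heapq.heappop(heap)[1])
    return order


tasks = [
    # (chunk, live replicas, goal, file deleted?, blocks a client?)
    ("c1", 2, 3, False, False),
    ("c2", 1, 3, False, False),
    ("c3", 2, 3, True, False),
    ("c4", 2, 3, False, True),
]
order = schedule(tasks)   # ["c4", "c2", "c1", "c3"]
```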
Garbage collection
GFS uses lazy deletion. When a file is deleted, GFS does not reclaim the physical storage immediately; instead, it renames the file in the metadata to a hidden name that includes the deletion timestamp. Master checks periodically, and once a file has been deleted for longer than a certain period (3 days by default, configurable), Master removes it from the in-memory metadata. In later heartbeat messages between chunk servers and Master, each chunk server reports the set of chunks it holds, Master replies with the chunks that no longer exist in its metadata, and the chunk server then frees those replicas. To reduce load on the system, garbage collection is generally run during off-peak hours, for example starting at 1:00 every night.
In addition, a chunk replica may become stale if its chunk server was down and missed modifications to the chunk. The system keeps a version number for each chunk, so stale replicas can be detected by comparing version numbers. Master removes stale replicas through the same garbage collection mechanism.
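The lazy deletion flow can be sketched as below. The class GcMaster, the hidden-name format ".deleted.<name>.<timestamp>", and the integer timestamps are hypothetical choices made for this illustration; only the 3-day default comes from the text.

```python
GRACE_SECONDS = 3 * 24 * 3600  # the 3-day default mentioned in the text


class GcMaster:
    """Hypothetical model of Master's lazy deletion bookkeeping."""

    def __init__(self, files):
        self.files = dict(files)   # file name -> list of chunk handles

    def delete(self, name, now):
        # Deletion only renames the file to a hidden name with a timestamp.
        self.files[f".deleted.{name}.{now}"] = self.files.pop(name)

    def scan(self, now):
        # Periodic check: drop hidden files older than the grace period.
        for hidden in [n for n in self.files if n.startswith(".deleted.")]:
            deleted_at = int(hidden.rsplit(".", 1)[1])
            if now - deleted_at > GRACE_SECONDS:
                del self.files[hidden]

    def heartbeat(self, reported_chunks):
        # Reply with the chunks the chunk server holds that Master no
        # longer knows about; the chunk server may free them.
        live = {h for handles in self.files.values() for h in handles}
        return set(reported_chunks) - live


master = GcMaster({"/a": [1, 2], "/b": [3]})
master.delete("/a", now=1000)
before = master.heartbeat([1, 2, 3])     # set(): still within grace period
master.scan(now=1000 + GRACE_SECONDS + 1)
after = master.heartbeat([1, 2, 3])      # {1, 2}: safe to free
```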
Snapshot
A snapshot operation captures the state of a source file or directory at a given instant and stores it in a destination file or directory. GFS generates snapshots with the standard copy-on-write mechanism: taking a snapshot merely increments the reference count of each chunk in GFS, indicating that the chunk is also referenced by the snapshot file. Only when a client later modifies such a chunk is the chunk data copied on the chunk server to create a new chunk, and subsequent modifications then go to the new chunk.
To snapshot a file, writes to the file must first be stopped, and then the reference count of every chunk of the file is incremented; any later modification of these chunks creates new chunks. The approximate steps for snapshotting a file are as follows:
1. Revoke the write permission on every chunk of the file through the lease mechanism, stopping writes to the file;
2. Master copies the file name and other metadata to create a new snapshot file;
3. Increment the reference count of every chunk of the file being snapshotted.
For example, suppose a snapshot of file foo creates foo_backup, and foo has three chunks in GFS: C1, C2, and C3. Master first revokes the write leases on C1, C2, and C3 so that foo is in a consistent state. It then copies foo's metadata to create foo_backup, which also points to C1, C2, and C3. Before the snapshot, C1, C2, and C3 were referenced only by foo, so each had a reference count of 1; after the snapshot, each has a reference count of 2. When a client later appends data to C3, Master sees that the reference count of C3 is greater than 1 and tells the chunk servers holding C3 to copy it to create C3', and the client's append is redirected to C3'.
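The foo/foo_backup example can be modelled with reference counts as below. This is an in-memory sketch only: the class Namespace, its methods, and the use of a dict to stand in for chunk data on chunk servers are assumptions, and lease revocation is omitted.

```python
class Namespace:
    """Hypothetical model of copy-on-write snapshots with reference counts."""

    def __init__(self):
        self.files = {}      # file name -> list of chunk handles
        self.refcount = {}   # chunk handle -> number of files referencing it
        self.data = {}       # chunk handle -> bytes (stands in for chunk servers)

    def create(self, name, chunks):
        self.files[name] = list(chunks)
        for handle, payload in chunks.items():
            self.refcount[handle] = 1
            self.data[handle] = payload

    def snapshot(self, src, dst):
        # Copy only the metadata and bump the reference counts.
        self.files[dst] = list(self.files[src])
        for handle in self.files[src]:
            self.refcount[handle] += 1

    def append(self, name, index, payload):
        handle = self.files[name][index]
        if self.refcount[handle] > 1:
            # Shared chunk: copy it first, then redirect this file to the copy.
            new_handle = handle + "'"
            self.data[new_handle] = self.data[handle]
            self.refcount[handle] -= 1
            self.refcount[new_handle] = 1
            self.files[name][index] = new_handle
            handle = new_handle
        self.data[handle] += payload


ns = Namespace()
ns.create("foo", {"C1": b"a", "C2": b"b", "C3": b"c"})
ns.snapshot("foo", "foo_backup")
ns.append("foo", 2, b"!")   # triggers the copy of C3 into C3'
```

After the append, foo points to C1, C2, and C3', while foo_backup still points to the untouched C1, C2, and C3.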
Chunkserver
A chunk server manages 64 MB chunks. When storing them, it should spread chunks as evenly as possible across its disks, taking into account factors such as disk space and the number of recently created chunks. In addition, deleting a 64 MB file on a Linux file system takes a long time and is unnecessary; to delete a chunk, the chunk server can simply move the chunk file to a recycle bin on the same disk, where it can be reused later when a new chunk is created.
A chunk server is a disk- and network-I/O-intensive application. To get the most out of the machine, it must perform disk and network operations asynchronously, which makes the code harder to implement.
This article is reposted from Rizhao Blog: http://www.nosqlnotes.net/archives/237