Cloud Computing Key Technology Series II: GFS

Last Update:2015-03-17 Source: Internet

Author: User

Keywords Cloud


Because search engines have to deal with massive amounts of data, Google's two founders, Larry Page and Sergey Brin, designed a file system called "BigFiles" in the company's early days. GFS (the "Google File System"), a distributed file system, is the successor to "BigFiles".

Technology Overview

First, a GFS deployment consists of two main kinds of nodes. The master node stores the metadata for data files rather than the chunks (data blocks) themselves. This metadata includes a table that maps 64-bit labels to the locations of the chunks and the files they make up, where each chunk's replicas are stored, and which processes are reading or writing a particular chunk. The master node also periodically receives updates ("heartbeats") from each chunk node to keep the metadata up to date. The chunk nodes are used to store the data itself. On each chunk node, data files are stored as chunks with a default size of 64 MB; each chunk has a unique 64-bit label and is replicated several times across the distributed system, three times by default. The following figure is the GFS architecture diagram:

GFS Architecture
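The division of labor described above can be made concrete with a small sketch. The code below is not GFS code; it is a toy in-memory model, and every name in it (`Master`, `add_chunk`, `heartbeat`, `locate`, the file path and the server names) is hypothetical. It only shows how a master that holds nothing but metadata could turn a file name and byte offset into a chunk label and a list of chunk nodes, assuming the 64 MB default chunk size.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB default chunk size, as stated in the article


class Master:
    """Toy metadata store (hypothetical); it never holds file data itself."""

    def __init__(self):
        self.file_chunks = {}      # file name -> ordered list of chunk labels
        self.chunk_locations = {}  # chunk label -> set of chunk node names
        self._next_label = 0       # a real system would keep labels within 64 bits

    def add_chunk(self, filename):
        """Allocate a label for the next chunk of a file."""
        label = self._next_label
        self._next_label += 1
        self.file_chunks.setdefault(filename, []).append(label)
        self.chunk_locations[label] = set()
        return label

    def heartbeat(self, node, labels):
        """Record which known chunks a chunk node reports holding."""
        for label in labels:
            if label in self.chunk_locations:
                self.chunk_locations[label].add(node)

    def locate(self, filename, offset):
        """Map a byte offset to (chunk label, offset inside chunk, replicas)."""
        index, within = divmod(offset, CHUNK_SIZE)
        label = self.file_chunks[filename][index]
        return label, within, sorted(self.chunk_locations[label])


master = Master()
first = master.add_chunk("/logs/crawl")
second = master.add_chunk("/logs/crawl")
for node in ("cs1", "cs2", "cs3"):
    master.heartbeat(node, [first, second])

# Byte 70 MB falls 6 MB into the second chunk.
result = master.locate("/logs/crawl", 70 * 1024 * 1024)
print(result)  # (1, 6291456, ['cs1', 'cs2', 'cs3'])
```

In this model a client would ask the master only for the lookup and then read the bytes from one of the returned chunk nodes, which is why the master's tables can stay small.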

Next, in terms of design, GFS has eight main features:

1. Large files and large data blocks: data files are typically gigabytes in size and the default chunk size is 64 MB. The benefit is that the metadata stays small, so the master node can easily keep it in memory, which improves access efficiency.

2. Append-dominated operations: files are rarely truncated or overwritten; the usual operations are appends and reads. The design therefore takes full account of the fact that hard disks offer high sequential throughput but slow random reads and writes.

3. Fault tolerance: first, although a single-master scheme was adopted to keep the design simple, the system ensures that the master node has a standby counterpart (a "shadow") that can take over when the master runs into a problem. Second, at the chunk layer, GFS is designed to treat node failure as the norm, so it handles chunk node failures very well.

4. High throughput: although the performance of a single node is ordinary in both throughput and latency, the aggregate data throughput is very high because the system supports thousands of nodes.

5. Data protection: files are divided into fixed-size chunks for easy storage, and each chunk is kept in at least three copies.

6. Scalability: because the metadata is small, a single master node can control and manage thousands of chunk nodes.

7. Compression support: older files can be compressed to save disk space, and the compression ratio is striking, sometimes approaching 90%.

8. User-space implementation: GFS runs mainly in user space. Although user space is slightly less efficient than kernel space, it is easier to develop and test in, and it makes better use of some of the POSIX APIs that Linux provides.
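To illustrate features 3 and 5, here is a minimal sketch of what "treating chunk node failure as normal" could look like on the master side. This is an assumption-laden illustration, not the actual GFS algorithm: the function `handle_failure`, the least-loaded placement rule, and the node names are all hypothetical. The idea shown is simply that when a node stops sending heartbeats, every chunk it held is copied from a surviving replica until the target of three copies is met again.

```python
REPLICATION = 3  # default number of copies per chunk, as stated in the article


def handle_failure(chunk_locations, live_nodes, failed):
    """Forget a failed node and plan copies to restore the replica count.

    chunk_locations: dict mapping chunk label -> set of node names (mutated)
    live_nodes: nodes still sending heartbeats (must not include `failed`)
    Returns a list of (label, source node, target node) copy tasks.
    """

    def load(node):
        # Number of chunks currently recorded on a node.
        return sum(node in nodes for nodes in chunk_locations.values())

    tasks = []
    for label in sorted(chunk_locations):
        replicas = chunk_locations[label]
        replicas.discard(failed)
        while len(replicas) < REPLICATION:
            candidates = [n for n in live_nodes if n not in replicas]
            if not replicas or not candidates:
                break  # nothing left to copy from, or nowhere to copy to
            # Hypothetical placement rule: least-loaded node, ties by name.
            target = min(candidates, key=lambda n: (load(n), n))
            source = min(replicas)
            tasks.append((label, source, target))
            replicas.add(target)
    return tasks


locations = {0: {"cs1", "cs2", "cs3"}, 1: {"cs2", "cs3", "cs4"}}
tasks = handle_failure(locations, ["cs1", "cs3", "cs4", "cs5"], failed="cs2")
print(tasks)
```

A real system would also weigh disk usage, rack placement, and network load when choosing targets; the sketch keeps only the bookkeeping needed to show the principle.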

Strengths and Weaknesses

Because GFS was designed primarily to store massive amounts of search data, it performs very well in throughput and scalability and is a leader in the industry. However, because data is stored mainly in 64 MB chunks, random access is slow. This is its weak point, but it is a deliberate trade-off made in favor of throughput and scalability.

Related Products

As with MapReduce, GFS has counterparts in the open source world, the best known of which is HDFS (the Hadoop Distributed File System). In function and design, HDFS derives from GFS, and because it is part of the Hadoop family, it includes many optimizations for the MapReduce framework so that it serves Hadoop better.

Actual Use Cases

Google now runs at least 200 GFS clusters. The largest clusters have thousands of servers and hold petabytes of data, and they serve multiple Google services, including Google Search and Google Earth. In recent years, however, because of the high latency mentioned above, GFS has proven poorly suited to some newer Google products that place a strong emphasis on real-time behavior, such as YouTube, Gmail, and the Caffeine search engine. Google is therefore already developing the next generation of GFS, code-named "Colossus". Its design differs in many ways: for example, it supports distributed master nodes to improve availability and to handle more files and chunk nodes, and it supports 1 MB chunks to meet the needs of low-latency applications. Hopefully, once Colossus matures, Google will share the details and experience of its design, as it did with GFS.
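A little arithmetic shows why smaller chunks and a distributed master go together. The sketch below counts the chunks needed for a given amount of data at 64 MB and at 1 MB per chunk, and estimates the metadata the master would have to track. The 1 PB data size and the figure of 64 bytes of metadata per chunk are assumptions chosen for illustration, not numbers from this article.

```python
MB = 1024 ** 2
GB = 1024 ** 3
PB = 1024 ** 5
METADATA_BYTES_PER_CHUNK = 64  # assumed figure, for illustration only


def chunk_count(total_bytes, chunk_bytes):
    """Number of chunks needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // chunk_bytes)


def metadata_bytes(total_bytes, chunk_bytes):
    """Estimated master metadata for the data, under the assumption above."""
    return chunk_count(total_bytes, chunk_bytes) * METADATA_BYTES_PER_CHUNK


for chunk_bytes in (64 * MB, 1 * MB):
    chunks = chunk_count(PB, chunk_bytes)
    meta_gb = metadata_bytes(PB, chunk_bytes) / GB
    print(f"{chunk_bytes // MB:>2} MB chunks: {chunks:,} chunks, "
          f"about {meta_gb:.0f} GB of metadata")
```

Under these assumptions, 1 PB needs about 16.8 million chunks and roughly 1 GB of metadata at 64 MB per chunk, but about 1.07 billion chunks and roughly 64 GB of metadata at 1 MB per chunk. A 64-fold increase of this kind is hard to keep in the memory of a single master, which is consistent with the move to distributed master nodes.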

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
