Google's Three Core Technologies (I): Google File System Introduction

Source: Internet
Author: User
To meet Google's rapidly growing data processing needs, we designed and implemented the Google File System (GFS). GFS shares many design goals with traditional distributed file systems, such as performance, scalability, reliability, and availability. However, its design is also driven by our observations of our applications' workloads and technical environment, both current and anticipated, and these differ significantly from the assumptions behind earlier file systems. We therefore re-examined the traditional design trade-offs and arrived at a markedly different set of design choices.

First, component failure is treated as the norm rather than the exception. A GFS deployment consists of hundreds or even thousands of storage machines built from ordinary, inexpensive parts, and it is accessed by a considerable number of clients. Given the quantity and quality of these components, some of them are not working at any given time, and some will never recover from their current failure. We have seen problems caused by application bugs, operating system bugs, human error, and failures of disks, memory, connectors, networks, and power supplies. Continuous monitoring, error detection, fault tolerance, and automatic recovery must therefore be built into GFS.

Second, our files are very large by traditional standards. Files of several gigabytes are common. Each file typically contains many application objects, such as Web documents. When we regularly work with fast-growing, multi-terabyte data sets made up of hundreds of millions of objects, managing them as hundreds of millions of kilobyte-sized files is unwieldy, even where a file system can support it. Design assumptions and parameters, such as I/O operation sizes and block sizes, therefore need to be reconsidered.
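
To make the second point concrete, the short Python sketch below counts how many files a data set needs at two different file sizes. The 1 TB data set, the 10 KB object size, and the 1 GB packed-file size are illustrative assumptions chosen to match the orders of magnitude mentioned above; they are not figures from the original text.

```python
# Illustrative arithmetic only: all sizes below are assumptions.
KB = 1024
GB = 1024 ** 3
TB = 1024 ** 4


def files_needed(dataset_bytes: int, file_bytes: int) -> int:
    """Return how many files of file_bytes are needed to hold dataset_bytes (rounded up)."""
    return -(-dataset_bytes // file_bytes)


dataset = 1 * TB  # assumed data set size

# One small object per file (assumed 10 KB objects).
small_file_count = files_needed(dataset, 10 * KB)

# Many objects packed into each large file (assumed 1 GB files).
large_file_count = files_needed(dataset, 1 * GB)

print(f"10 KB files: {small_file_count:,}")
print(f"1 GB files:  {large_file_count:,}")
```

Under these assumptions, the small-file layout needs about 107 million files while the packed layout needs 1,024, which is why per-file bookkeeping costs push the design toward large files and large blocks.
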
Third, most files are modified by appending data to the end rather than by overwriting existing data. Random writes within a file are virtually nonexistent in practice. Once written, a file is only read, and usually read sequentially. Many kinds of data share these characteristics: very large data sets scanned by data analysis programs, continuous data streams generated by running applications, archived data, and intermediate data produced on one machine and processed on another, either at the same time or later. For this access pattern on massive files, caching data blocks on the client offers little benefit, and the append operation becomes the main focus of performance optimization and atomicity guarantees.

Finally, designing the applications and the file system API together improves the flexibility of the whole system. For example, we relaxed the GFS consistency model, which greatly simplifies the file system without placing onerous demands on applications. We also introduced an atomic record append operation so that multiple clients can append to a file concurrently without extra synchronization among them to keep the data consistent. These issues are discussed in detail later in this article.

Google has deployed multiple GFS clusters for different applications. The largest cluster has more than 1000 storage nodes and more than 300 TB of disk space, and is continuously accessed by hundreds of clients on different machines.
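
The idea behind atomic record append, as described above, can be sketched in a few lines of Python. This is a single-process toy, not the GFS implementation or its API: the `AppendOnlyFile` class, the `record_append` method, and the record format are all hypothetical names invented for this illustration. The point it shows is that the file, not the client, chooses the offset, so concurrent writers need no coordination among themselves.

```python
import threading


class AppendOnlyFile:
    """Toy in-memory stand-in for a file that supports record append (hypothetical)."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._lock = threading.Lock()

    def record_append(self, record: bytes) -> int:
        """Append the whole record as one unit and return the offset the file chose."""
        with self._lock:
            offset = len(self._data)
            self._data += record
            return offset

    def read(self, offset: int, length: int) -> bytes:
        """Read length bytes starting at offset."""
        with self._lock:
            return bytes(self._data[offset:offset + length])


def writer(f: AppendOnlyFile, client_id: int, count: int, results: list) -> None:
    """Simulated client: appends records without coordinating with other clients."""
    for i in range(count):
        record = f"client{client_id}-rec{i:03d};".encode()
        results.append((f.record_append(record), record))


f = AppendOnlyFile()
results: list = []
threads = [
    threading.Thread(target=writer, args=(f, client_id, 100, results))
    for client_id in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every record should be readable, intact, at the offset the file returned.
intact = all(f.read(offset, len(record)) == record for offset, record in results)
print(f"{len(results)} records appended, all intact: {intact}")
```

The clients never exchange locks or offsets with each other; the only serialization happens inside the file object. A distributed system has to achieve the same effect across machines and in the presence of failures, which this sketch does not attempt to model.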
