Google Core Technologies: Distributed Infrastructure


Editor's note: This is a guest blog series contributed by Wu Zhuhua, who was engaged in cloud computing research at the IBM China Research Institute and now works on cloud computing technology.

This series of articles explores in depth the implementation mechanisms of Google App Engine based on publicly available information. Before diving into Google App Engine itself, we first analyze Google's core technologies and overall architecture to help you better understand how Google App Engine is implemented.

This article mainly introduces Google's ten core technologies, which can be divided into four categories:

  • Distributed infrastructure: GFS, Chubby, and Protocol Buffers.
  • Distributed large-scale data processing: MapReduce and Sawzall.
  • Distributed database technology: Bigtable and database sharding.
  • Data center optimization technology: high-temperature data centers, 12 V batteries, and server integration.
Distributed infrastructure

GFS

Because a search engine must process massive amounts of data, Google's two founders, Larry Page and Sergey Brin, designed a file system called "BigFiles" in the company's early days. GFS (short for "Google File System") is the continuation of BigFiles.

First, let's look at its architecture. GFS consists of two types of nodes:

  • Master node: stores the metadata about the data files rather than the chunks (data blocks) themselves. The metadata includes a table that maps each 64-bit chunk handle to the chunk's location and the file it belongs to, the locations of the chunk's replicas, and the processes that are currently reading or writing specific chunks. The master also periodically receives heartbeat updates from each chunk node to keep its metadata current.
  • Chunk node: as the name suggests, it stores the chunks. Data files are split into chunks of 64 MB by default, each chunk carries a globally unique 64-bit handle, and each chunk is replicated multiple times across the distributed system, three copies by default.

The following figure shows the GFS architecture:

Figure 1. GFS architecture ([15])
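
To make the master's bookkeeping more concrete, here is a minimal Python sketch of the kind of metadata table described above. Everything in it (class names, the random placement policy, the server names) is an illustrative assumption for this article, not Google's actual code; the real master keeps comparable state in memory and refreshes chunk locations from chunk-node heartbeats.

```python
import random
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB default chunk size
REPLICATION_FACTOR = 3                  # each chunk is replicated 3 times

@dataclass
class ChunkInfo:
    handle: int                                    # globally unique 64-bit chunk handle
    replicas: list = field(default_factory=list)   # chunk-node addresses holding a copy

class MasterMetadata:
    """Illustrative in-memory metadata table of a GFS-style master node."""
    def __init__(self, chunk_servers):
        self.chunk_servers = chunk_servers
        self.files = {}                 # file name -> ordered list of ChunkInfo

    def create_file(self, name, size_bytes):
        chunks = []
        for _ in range((size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE):
            handle = random.getrandbits(64)
            replicas = random.sample(self.chunk_servers, REPLICATION_FACTOR)
            chunks.append(ChunkInfo(handle, replicas))
        self.files[name] = chunks

    def locate(self, name, offset):
        """Map a byte offset in a file to (chunk handle, replica locations)."""
        chunk = self.files[name][offset // CHUNK_SIZE]
        return chunk.handle, chunk.replicas

master = MasterMetadata(["chunk-node-%d" % i for i in range(10)])
master.create_file("/crawl/pages-0001", 5 * 10**9)   # a ~5 GB data file
print(master.locate("/crawl/pages-0001", 3 * CHUNK_SIZE + 42))
```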

In terms of design, GFS has eight main features:

  • Large files and large data blocks: data files are generally GB-sized, and each data block (chunk) is 64 MB by default. This keeps the metadata small, so the master node can hold all of it in memory, improving access efficiency.
  • Append-dominated operations: files are rarely deleted or overwritten; they are usually appended to or read. This design exploits the high sequential throughput of hard disks and works around their slow random read/write performance.
  • Fault tolerance: first, although a single-master design was chosen for simplicity, the system ensures that the master has standby counterparts, so operation can fail over to another master when a problem occurs; second, at the chunk level, GFS treats node failure as the norm, so it handles the failure of chunk nodes gracefully.
  • High throughput: although a single node is unremarkable in throughput and latency, the aggregate data throughput is impressive because the system scales to thousands of nodes.
  • Data protection: files are split into fixed-size chunks for easy storage, and each chunk is replicated three times by the system.
  • Strong scalability: because the metadata is small, a single master node can control thousands of chunk nodes that store the data.
  • Compression support: older files can be compressed to save disk space, and the compression ratio is impressive, sometimes close to 90%.
  • User space: although running in user space is slightly less efficient, it is easier to develop and test and makes better use of the POSIX APIs provided by Linux.

Currently, Google runs at least 200 GFS clusters internally. The largest have thousands of servers and serve multiple Google services, such as Google Search. However, because GFS was designed mainly for search, it is not well suited to some newer Google products, such as YouTube, Gmail, and the Caffeine search engine, which emphasize large-scale indexing and real-time performance. Google is therefore developing the next-generation GFS, code-named "Colossus", which differs from GFS in many design respects: for example, it supports distributed master nodes to improve availability and to support more files, and its chunk nodes support 1 MB chunks to meet the needs of low-latency applications.

Chubby

In short, Chubby is a distributed lock service. With Chubby, thousands of clients in a distributed system can "lock" or "unlock" a shared resource; it is often used for coordination in Bigtable. In terms of implementation, locking is done by creating files, and the service is built on the Paxos algorithm of the renowned scientist Leslie Lamport.
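
To illustrate the idea of locking by file creation, here is a minimal, hypothetical Python sketch. It only mimics the concept on a local file system, with atomic exclusive file creation standing in for Chubby's lock files; the real Chubby service replicates this state across a Paxos-based cell, and every name below is invented for this example.

```python
import errno
import os

LOCK_DIR = "/tmp/chubby-demo"   # hypothetical stand-in for a Chubby cell

def acquire(lock_name: str) -> bool:
    """Try to take the lock by creating its file atomically (O_EXCL)."""
    os.makedirs(LOCK_DIR, exist_ok=True)
    path = os.path.join(LOCK_DIR, lock_name)
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())   # record the owner
        os.close(fd)
        return True
    except OSError as e:
        if e.errno == errno.EEXIST:               # someone else holds the lock
            return False
        raise

def release(lock_name: str) -> None:
    """Drop the lock by deleting its file."""
    os.remove(os.path.join(LOCK_DIR, lock_name))

if __name__ == "__main__":
    if acquire("bigtable-master"):
        print("became master")
        release("bigtable-master")
    else:
        print("another node is master")
```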

Protocol Buffer

Protocol Buffers (protobuf) is Google's internal language-neutral, platform-neutral, and extensible mechanism for serializing structured data, with implementations for Java, C++, and Python; each implementation includes a compiler and library files for the corresponding language. Because it is a binary format, it is roughly 10 times faster than exchanging the same data as XML. It is mainly used in two areas: first, RPC communication, both between distributed applications and in heterogeneous environments; second, data storage, since it is self-describing and compresses well, it can be used to persist data, for example storing log information that can later be processed by MapReduce programs. Facebook's Thrift is similar to Protocol Buffers, and Facebook claims that Thrift has a certain speed advantage.
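
As a hedged illustration of what this looks like in Python, suppose we define a small, made-up message and compile it with protoc; the message name, fields, and file names below are assumptions for this example, while SerializeToString and ParseFromString are the standard protobuf Python calls.

```python
# logrecord.proto (hypothetical):
#   syntax = "proto3";
#   message LogRecord {
#     string url     = 1;
#     int64  user_id = 2;
#     string action  = 3;
#   }
#
# After `protoc --python_out=. logrecord.proto`, a logrecord_pb2 module exists.
from logrecord_pb2 import LogRecord

record = LogRecord(url="http://www.cnn.com", user_id=42, action="search")

# Serialize to a compact binary string (much smaller than the equivalent XML) ...
data = record.SerializeToString()

# ... ship it over an RPC channel or persist it, then parse it back.
restored = LogRecord()
restored.ParseFromString(data)
assert restored.url == record.url
```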

 

 

Distributed large-scale data processing

MapReduce

In Google's data centers there is a huge amount of data to process, such as the vast number of web pages fetched by the crawler. Because much of this data is at the petabyte level, the processing has to be parallelized as much as possible. To solve this problem, Google introduced the MapReduce programming model. MapReduce is derived from functional languages and processes large-scale datasets in parallel mainly through two steps, "map" and "reduce". Map first applies a specified operation to each element of a logical list of independent items; the original list is not changed, and new lists are created to hold the map results, which means map operations are highly parallelizable. After the map step completes, the system shuffles and sorts the new lists and then performs the reduce operation on them, merging the elements of each list appropriately according to their key values.
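
The flow described above can be sketched in a few lines of single-process Python. This is the classic word-count illustration of the programming model, not Google's distributed implementation, and the function names are our own.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) for every word; in real MapReduce this runs in parallel."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle/sort: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: merge the values of each key, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```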

 

Figure: the MapReduce running mechanism.

Next, let's walk through a MapReduce example: suppose the search spider has fetched a large number of web pages into a local GFS cluster. The indexing system then runs map processing in parallel over the chunks in the GFS cluster and generates key-value pairs whose key is the URL and whose value is the HTML page. The system then shuffles these key-value pairs and, through the reduce operation, merges the pairs that share the same key (that is, the same URL).

Finally, a simple programming model like MapReduce can be used not only to process large-scale data but also to hide messy details such as automatic parallelization, load balancing, and handling machine failures, which greatly simplifies developers' work. MapReduce has been used for distributed grep, distributed sorting, web access log analysis, inverted index construction, document clustering, machine learning, statistics-based machine translation, generating Google's entire search index, and other large-scale data processing. Yahoo has also released Hadoop, an open-source implementation of MapReduce, which has been widely adopted in the industry.

 

Sawzall

Sawzall can be thought of as a DSL (domain-specific language) built on top of MapReduce, with a syntax similar to Java; it can also be thought of as a distributed AWK. It is mainly used for filtering, aggregation, and other higher-level data processing operations on large-scale distributed data. In terms of implementation, an interpreter converts Sawzall programs into the corresponding MapReduce jobs. Besides Google's Sawzall, Yahoo has introduced the similar Pig language, whose syntax is closer to SQL.
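
To give a feel for the kind of job Sawzall is used for, here is a rough Python equivalent of a filter-and-aggregate query (count successful requests per URL from log records). This is only a conceptual sketch, not Sawzall syntax, and the record format is invented for the example.

```python
from collections import Counter

# Each log record is assumed to be "url status bytes", e.g. from web-server logs.
log_lines = [
    "http://www.cnn.com 200 5120",
    "http://www.cnn.com 404 512",
    "http://my.look.ca 200 2048",
]

hits = Counter()
for line in log_lines:
    url, status, _bytes = line.split()
    if status == "200":            # filter step
        hits[url] += 1             # aggregate step (what a Sawzall "table" collects)

print(hits.most_common())
```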

 

 

 

Distributed Database Technology

Bigtable

Because Google's data centers store petabytes of non-relational data, such as web pages and geographic data, Google developed a database system called "Bigtable" to store and use this data better. Bigtable is not a relational database; it does not support joins or other advanced SQL operations. Instead, it uses a multi-level mapping data structure. It is a self-managing system designed for large-scale processing and fault tolerance, with terabytes of memory and petabytes of storage capacity; it stores data in structured files and can handle millions of read/write operations per second.

What is a multi-level mapping data structure? It is a sparse, multidimensional, sorted map. Each cell is addressed by three dimensions, row key, column key, and timestamp, and the cell content is an uninterpreted string. For example, the following table stores the text of each website's content and the anchor text of links pointing to it from other websites. The reversed URL com.cnn.www is the row key. The contents column stores the content of the web page, and each version of the content has a timestamp. Because there are two inbound links, the anchor column family has two columns: anchor:cnnsi.com and anchor:my.look.ca. The column-family concept makes it easy to scale the table horizontally. The figure below shows the data model:

Figure: the Bigtable data model for the row com.cnn.www.

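The sparse, multidimensional, sorted map can be pictured as a dictionary keyed by (row key, column, timestamp). The following Python sketch reproduces the com.cnn.www example from above; the timestamps and values are made up for illustration.

```python
# (row key, column, timestamp) -> uninterpreted string
webtable = {
    ("com.cnn.www", "contents:",         6): "<html>... newest page ...</html>",
    ("com.cnn.www", "contents:",         5): "<html>... older page ...</html>",
    ("com.cnn.www", "anchor:cnnsi.com",  9): "CNN",
    ("com.cnn.www", "anchor:my.look.ca", 8): "CNN.com",
}

def read(row, column):
    """Return the most recent cell value for a (row, column) pair."""
    versions = {ts: val for (r, c, ts), val in webtable.items()
                if r == row and c == column}
    return versions[max(versions)] if versions else None

print(read("com.cnn.www", "contents:"))         # newest contents cell
print(read("com.cnn.www", "anchor:cnnsi.com"))  # anchor text from cnnsi.com
```
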
In terms of structure, Bigtable is built on top of the GFS distributed file system and the Chubby distributed lock service. Bigtable itself has two kinds of nodes: the master node, which handles metadata-related operations and load balancing, and the tablet nodes, which store the database's shards (tablets) and serve the corresponding data access. The tablets are stored in a format called SSTable, which has good support for compression.


Bigtable already provides a storage platform for structured data for more than 60 Google products and projects, including Google Print, Orkut, Google Maps, Google Earth, and Blogger, and Google runs at least 500 Bigtable clusters.

As demand from Google's internal services keeps growing and technology keeps evolving, the original Bigtable can no longer meet users' needs, so Google is developing its next-generation Bigtable, named "Spanner", which offers the following features that Bigtable cannot support:

    1. Supports multiple data structures, such as tables, families, groups, and coprocessors.
    2. Fine-grained replication and permission management based on hierarchical directories and rows.
    3. Supports both strong and weak consistency control across data centers.
    4. Strongly consistent replica synchronization based on the Paxos algorithm, with support for distributed transactions.
    5. Provides many automated operations.
    6. Strong scalability, supporting clusters of millions of servers.
    7. Allows users to tune important parameters such as latency and the number of replicas to meet different needs.

 

 

Database sharding

Sharding means partitioning. Although non-relational databases such as Bigtable play a very important role at Google, for traditional OLTP applications such as the advertising system, Google still uses traditional relational database technology, namely MySQL. Because Google has to handle enormous traffic, it applies sharding at the database layer. Sharding is an improvement over the traditional vertical-scaling (scale-up) partitioning approach: a large database is split into multiple shards by time, by range, or by service, and these shards can be scaled out horizontally across multiple databases and servers.
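
As an illustration of what routing queries to shards can look like, here is a minimal Python sketch that splits one logical table across several MySQL-style databases by customer-ID range; the shard boundaries, connection strings, and table names are assumptions for this example, not Google's actual scheme.

```python
import bisect

# Hypothetical range-based shard map: each shard owns a contiguous range of customer IDs.
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]
SHARD_DSNS = [
    "mysql://ads-db-0.example.internal/ads",
    "mysql://ads-db-1.example.internal/ads",
    "mysql://ads-db-2.example.internal/ads",
    "mysql://ads-db-3.example.internal/ads",
]

def shard_for(customer_id: int) -> str:
    """Route a query to the shard whose ID range contains this customer."""
    index = bisect.bisect_left(SHARD_UPPER_BOUNDS, customer_id)
    if index >= len(SHARD_DSNS):
        raise ValueError("customer_id outside all shard ranges")
    return SHARD_DSNS[index]

def fetch_campaigns(customer_id: int) -> str:
    """Show which shard the query would run on (a real version would connect)."""
    dsn = shard_for(customer_id)
    return f"SELECT * FROM campaigns WHERE customer_id = {customer_id};  -- routed to {dsn}"

print(fetch_campaigns(1_234_567))   # lands on the second shard
```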

Google's database sharding technology has the following advantages:

    1. Strong scalability: Google's production environment already runs MySQL sharding clusters spanning thousands of servers.
    2. Impressive throughput: these massive MySQL sharding clusters can satisfy enormous numbers of query requests.
    3. Global backup: Google backs up the MySQL shard data not only within a single data center but globally, which both protects the data and makes expansion easier.

 

The implementation can be divided into two parts: first, database sharding technology added on top of MySQL InnoDB; second, sharding support added at the ORM layer on top of Hibernate, including support for virtual shards to simplify development and management. Google has contributed the code for both parts to the relevant open-source organizations.

 

Data center Optimization Technology

 

High-temperature data center

The power usage effectiveness (PUE) of large and medium-sized data centers is usually around 2, meaning that for every unit of power consumed by computing equipment such as servers, another unit is consumed by auxiliary equipment such as air conditioning. Very good data centers can reach about 1.7, but Google, through several effective design choices, has brought some of its data centers to an industry-leading 1.2. Among these designs, the most distinctive is the high-temperature data center, which lets the computing equipment run at higher temperatures. Erik Teetzel, Google's energy director, said: "an ordinary data center operates at 70 degrees Fahrenheit (21 degrees Celsius), while we recommend 80 degrees Fahrenheit (27 degrees Celsius)." There are, however, two common constraints on raising a data center's temperature: the failure point of the server equipment and precise temperature control. If both are handled well, the data center can operate at high temperature, because if the administrator can regulate the temperature to within plus or minus half a degree, the servers can run within 5 degrees of their failure point rather than within 20 degrees, which is both economical and safe. In addition, it is rumored that Intel provides Google with custom chips designed for high temperatures, but James Hamilton, a leading expert in cloud computing, thinks this unlikely: although processors are also sensitive to heat, they tolerate it much better than memory and hard disks, so the processor is not the critical factor in designing for high-temperature operation. At the same time, he strongly supports the idea of running data centers hot and hopes that data centers will eventually run at 40 degrees Celsius, which would not only save on air-conditioning costs but also benefit the environment.
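
For reference, power usage effectiveness is simply total facility power divided by the power that reaches the IT equipment. The tiny calculation below, with made-up wattage figures, shows why a PUE of 1.2 is so much better than 2.0.

```python
def pue(it_power_kw: float, overhead_power_kw: float) -> float:
    """PUE = total facility power / IT equipment power."""
    return (it_power_kw + overhead_power_kw) / it_power_kw

print(pue(1000, 1000))   # typical data center: 2.0 (one unit of overhead per unit of IT load)
print(pue(1000, 200))    # Google's best facilities: about 1.2
```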

  

12 V battery

Because a traditional UPS wastes a lot of energy, Google takes a different approach here: it equips each server with a dedicated 12 V battery to replace the commonly used UPS. If the primary power system fails, the battery takes over powering the server. Although a large UPS can achieve 92% to 95% efficiency, that is still limited compared with the built-in battery's 99.99%, and because of the conservation of energy, the power the UPS fails to deliver is converted into heat, which in turn raises the energy spent on air conditioning, creating a vicious circle. There is a similar stroke of genius in the power supply: an ordinary server power supply provides both 5 V and 12 V DC, but the power supply Google designed outputs only 12 V DC, with the necessary conversions performed on the motherboard. Although this design adds 1 to 2 US dollars to the cost of the motherboard, it not only lets the power supply run close to its peak capacity but also carries the current over the copper wiring more efficiently.

 

Server Integration

When people talk about the killer application of virtualization, the first thing that comes to mind is server integration (consolidation), whose higher consolidation ratio lowers costs across the board. Interestingly, Google has also applied the idea of server integration to its hardware: it places two servers in the space of a single chassis slot. This has several advantages: first, it reduces the floor space used; second, because the two servers can share components such as the power supply, it reduces investment in both equipment and energy.

 

 

 
