Exploring the Internal Operations of Google's Data Centers


CNET News, June 2 international report: Google has lifted the veil on some of the mystery surrounding its internal operations.

The search giant rarely exposes its data centers, but last week Google researcher Jeff Dean revealed some of their inner workings at the Google I/O conference.

Google builds its infrastructure from large numbers of ordinary commodity servers, assembling them into clusters of about 1,800 machines. These clusters handle Google's day-to-day search processing, and a typical query draws on roughly 700 to 1,000 of those servers.

Google does not disclose how many servers it owns, but the estimates run into the hundreds of thousands. According to Dean, Google puts 40 servers in each rack, and the company operates 36 data centers worldwide. At 150 racks per data center, that works out to more than 200,000 servers, and the true number is surely far higher and growing every day.
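
The arithmetic behind that estimate is easy to check. The snippet below is our own illustration in Python, using only the figures quoted above:

```python
# Redoing the estimate with the article's figures: 36 data centers,
# 150 racks per data center, 40 servers per rack.
data_centers = 36
racks_per_data_center = 150
servers_per_rack = 40

total = data_centers * racks_per_data_center * servers_per_rack
print(f"Estimated servers: {total:,}")  # Estimated servers: 216,000
```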

However many servers there are, Google's success is impressive. Operations such as the New York Stock Exchange and airline ticketing systems rely on large numbers of heavy-duty servers and commercial software from major vendors, while Google runs mainly on technology it built itself.

To be sure, many server vendors will be sour about this, but Google clearly believes it is safest to keep its technological destiny in its own hands. Marissa Mayer, Google's vice president of search products and user experience, has said that co-founder Larry Page encourages a culture of "healthy disregard for the impossible" inside the company.

To handle search at Google's scale, you need to squeeze the most out of every machine. And while server vendors like to tout the fault tolerance of their high-end models, Google would rather spend its money on fault-tolerant software.

Dean said: "Our view is that the number of unreliable hardware is best twice times that of reliable models." You need to put reliability on the software level. If you run 10,000 machines, there are some crashes every day. ”

Dean said that in a cluster's first year of operation, roughly 1,000 individual machines typically fail and thousands of hard drives go bad. One power distribution unit (PDU) will fail, taking down 500 to 1,000 machines for about six hours. Twenty racks will fail, each knocking 40 to 80 machines off the network. Five racks will go flaky, with half the packets on the rack going missing. The cluster will need to be rewired once, affecting about 5 percent of the machines at any given moment over a two-day span. And the cluster has about a 50 percent chance of overheating; an overheating event takes most of its servers down within 5 minutes and costs 1 to 2 days of recovery.
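
To connect those figures with Dean's earlier remark about a 10,000-machine fleet, here is a rough calculation of our own. It assumes failures scale linearly with fleet size, which the article does not state:

```python
# Rough scaling (our assumption: failures grow linearly with fleet size).
# Inputs are the article's figures: ~1,000 machine failures per year in
# a cluster of 1,800 machines.
machines_per_cluster = 1800
failures_per_cluster_year = 1000

failures_per_machine_year = failures_per_cluster_year / machines_per_cluster
fleet = 10_000  # the fleet size from Dean's quote above
print(f"~{fleet * failures_per_machine_year / 365:.0f} machine failures/day")
# ~15 machine failures/day -- consistent with "some crash every day"
```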

Although Google builds from commodity hardware components, it does not use conventional packaging. Google has Intel produce custom circuit boards for it, and, according to Dean, Google now puts 40 servers into each rack rather than giving each server its own enclosure, as is usual.

For the servers themselves, Google prefers multi-core chips. While many software companies are still struggling to adapt to the multi-core era, Google handles these chips with ease: it long ago had to adapt its technology to an architecture of machines with limited individual resources, so it entered the era of parallel processing early.

Dean said: "We do like multi-core machines." For us, multi-core machines achieve good connection performance with a small number of machines. For us, they are easier to use. ”

Google's search and other services need to respond quickly, and parallel processing accomplishes that even when no single thread is particularly fast.

"The performance of a single thread does not really matter to us," Dean said. We will focus mainly on the problem of parallel processing. ”

How does Google make all this commodity hardware do its job? With software.

Dean described the three core elements of Google's software: the Google File System (GFS); BigTable (Google's interface and service for distributed storage of, and access to, semi-structured data); and MapReduce (a C++ programming framework developed by Google for parallel operations on large data sets, often larger than 1TB). Although Google relied on many open-source projects to take off, it remains secretive about these three core elements.

The Google File System sits at the bottom of the three. It handles data storage across many servers, and many GFS file systems are unusually large, holding several petabytes each (one petabyte is about one million gigabytes). More than 200 server clusters run GFS, many of them comprising thousands of machines.

GFS stores data in large chunks (typically 64MB) on at least three "chunkservers"; if a chunkserver fails, a master server restores the data to a new chunk. Dean said: "At least at the storage level, machine failures are handled by the Google File System."
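
Based only on the behavior described above (64MB chunks, at least three replicas, master-driven recovery), one can sketch a toy model of the scheme. The class and method names below are hypothetical, and the real GFS interfaces surely differ:

```python
# Toy master for chunk placement and recovery. REPLICATION mirrors the
# "at least three chunkservers" rule above; everything else (names,
# random placement) is our own invention.
import random

REPLICATION = 3

class ToyMaster:
    def __init__(self, chunkservers):
        self.live = set(chunkservers)
        self.locations = {}  # chunk id -> set of servers holding a replica

    def place_chunk(self, chunk_id):
        """Put a new chunk on REPLICATION distinct chunkservers."""
        self.locations[chunk_id] = set(random.sample(sorted(self.live), REPLICATION))

    def handle_failure(self, dead):
        """Re-replicate every chunk the failed server held."""
        self.live.discard(dead)
        for servers in self.locations.values():
            if dead in servers:
                servers.discard(dead)
                servers.add(random.choice(sorted(self.live - servers)))

master = ToyMaster([f"cs{i}" for i in range(6)])
master.place_chunk("chunk-0001")
master.handle_failure("cs0")
assert all(len(s) == REPLICATION for s in master.locations.values())
```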

To give all this data some structure, Google uses BigTable. Commercial databases from the likes of Oracle and IBM do not do the job here, Dean said, because they do not meet Google's needs; and even if they did, a commercial database at this scale would be prohibitively expensive.

Google began designing BigTable in 2004, and it is now used in more than 70 Google projects, including Google Maps, Google Earth, Blogger, Google Print, and the core search index. Dean said the largest BigTable instance manages about 6 petabytes of data spread across thousands of machines.
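
BigTable's published data model is a sparse map from (row key, column key, timestamp) to a value. The toy class below, our own simplification with made-up row and column names, shows only the shape of that interface; the real system shards the map across thousands of tablet servers:

```python
# The (row, column, timestamp) -> value map, shrunk to one in-memory dict.
import itertools

class ToyBigTable:
    def __init__(self):
        self._version = itertools.count()   # stands in for real timestamps
        self.cells = {}                     # (row, column, version) -> value

    def put(self, row, column, value):
        self.cells[(row, column, next(self._version))] = value

    def get_latest(self, row, column):
        """Return the most recently written value for (row, column)."""
        versions = [(ver, val) for (r, c, ver), val in self.cells.items()
                    if (r, c) == (row, column)]
        return max(versions)[1] if versions else None

table = ToyBigTable()
table.put("com.example/index.html", "contents:", "<html>v1</html>")
table.put("com.example/index.html", "contents:", "<html>v2</html>")
print(table.get_latest("com.example/index.html", "contents:"))  # <html>v2</html>
```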

Google wrote the first version of MapReduce in 2003; the algorithm gives Google a way to put its data to work. For example, MapReduce can count how many times a word appears in Google's search index, tally the frequency of particular words across a set of pages, or count the number of sites linking to a particular site.

With MapReduce, Google can build an index that quickly shows which pages are associated with a particular word. Dean said: "You need to be able to run the work across thousands of machines in order for it to complete in an acceptable amount of time."
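
The word count cited above is the canonical MapReduce example, and its data flow is easy to show in miniature. The single-process sketch below is our own illustration of the map, shuffle, and reduce phases, not Google's C++ framework:

```python
# Word count in MapReduce style, run in one process so the three phases
# are visible: map emits (word, 1), shuffle groups by word, reduce sums.
from collections import defaultdict

def map_phase(text):
    for word in text.split():
        yield word, 1

def reduce_phase(word, counts):
    return word, sum(counts)

documents = {"d1": "the cat sat", "d2": "the dog sat down"}

groups = defaultdict(list)          # shuffle: word -> list of emitted counts
for text in documents.values():
    for word, count in map_phase(text):
        groups[word].append(count)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1, 'down': 1}
```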

Google's use of MapReduce keeps rising. In 2004, MapReduce ran 29,000 jobs; by 2007 it was completing 2.2 million. Over the same period, the average time to complete a MapReduce job dropped from 634 seconds to 395 seconds, while the output of MapReduce tasks climbed from 193 terabytes to 14,018 terabytes.

On any given day, Dean said, Google runs about 100,000 MapReduce jobs, each of which occupies about 400 servers and takes about 5 to 10 minutes to finish.

This makes for some interesting arithmetic. Assume the servers do nothing but MapReduce, each handling one task at a time around the clock. If each task takes 5 minutes, those 100,000 daily jobs would consume about 139,000 servers; at 7.5 minutes per task, the requirement rises to about 208,000 servers; at 10 minutes, to about 278,000 servers.
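
That calculation can be reproduced directly; the snippet below, our own, just redoes the arithmetic with the article's figures:

```python
# The article's back-of-envelope, redone: 100,000 jobs/day x 400 servers,
# at each of the quoted average job durations.
jobs_per_day = 100_000
servers_per_job = 400
minutes_per_day = 24 * 60

for minutes_per_job in (5, 7.5, 10):
    servers_needed = jobs_per_day * servers_per_job * minutes_per_job / minutes_per_day
    print(f"{minutes_per_job} min/job -> ~{servers_needed:,.0f} servers")
# 5 min/job -> ~138,889; 7.5 min/job -> ~208,333; 10 min/job -> ~277,778
```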

Like the Google File System, MapReduce was designed with machine failures in mind.

Dean said: "When a machine is a machine, the primary server understands what the task is assigned to the machine, and then directly assigns other machines to accomplish this task." ”

MapReduce's reliability withstood a rigorous test during maintenance on one 1,800-server cluster. Staff took 80 of the machines offline while the other 1,720 picked up their work. Dean said: "It ran a little slowly, but everything completed."

And in 2004, Dean said, a cluster of 1,800 machines once lost 1,600 of them, yet the system as a whole stood the test.

Despite the great success of its data centers, Google is not satisfied; the company has long-term plans for improvement.

A typical enterprise only needs to think about moving a task from one server to another. Google, facing workloads of an entirely different magnitude, wants to be able to shift work automatically from one data center to another.

Dean said: "We want our next generation architecture to be a system that transcends a single machine." ”

Given the scale of Google's business, this is a daunting challenge, and no doubt one that many smaller companies eye with envy.
