Three core technologies of Google (3): Bigtable, Chinese version

Source: Internet
Author: User


Bigtable: A Distributed Storage System for Structured Data

Translator: Alex

Abstract

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: typically petabytes of data across thousands of commodity servers. Many Google projects store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in data size (from URLs to web pages to satellite imagery) and in latency requirements (from back-end batch processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. This paper describes the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and then describes the design and implementation of Bigtable.

1 Introduction

Over the last two and a half years, we have designed, implemented, and deployed a distributed storage system for managing structured data at Google, which we call Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. It is used by more than 60 Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, ranging from throughput-oriented batch processing to latency-sensitive serving of data to end users. The Bigtable clusters they use also span a wide range of configurations, from a handful of servers to thousands of servers storing up to several hundred terabytes of data.

In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Parallel databases [14] and main-memory databases [13] have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format (Alex's note: Bigtable imposes no format on the data; in database terminology, the data has no schema, and the user defines the schema), and allows clients to reason about the locality properties of the data in the underlying storage (Alex's note: locality can be understood as in a tree structure, where data with the same prefix is stored close together, so that such data can be read in a single pass). Data is indexed using row and column names that can be arbitrary strings. Bigtable treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.

Section 2 describes the data model in more detail, and Section 3 gives an overview of the client API. Section 4 briefly describes the underlying Google infrastructure on which Bigtable depends. Section 5 describes the fundamentals of the Bigtable implementation, and Section 6 describes some of the refinements we made to improve Bigtable's performance. Section 7 provides performance measurements. Section 8 describes several examples of how Bigtable is used at Google, and Section 9 discusses some lessons we learned in designing and supporting Bigtable. Finally, Section 10 describes related work, and Section 11 presents our conclusions.

2 Data Model

A Bigtable is a sparse, distributed, persistent multidimensional sorted map (Alex's note: for programmers, "map" is best left untranslated; a map is made up of keys and values, and we use "key" and "value" directly rather than translating them). The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

(row:string, column:string, time:int64) -> string

We settled on this data model after examining a variety of potential uses of a Bigtable-like system. As one concrete example that drove some of our design decisions, suppose we want to keep a copy of a large collection of web pages and related information that could be used by many different projects; let us call this particular table the Webtable. In Webtable, we use URLs as row keys and various aspects of web pages as column names, and we store the contents of the web pages in the "contents:" column under the timestamps at which they were fetched (Alex's note: that is, multiple versions of a page are stored, one per fetch time), as shown in Figure 1.

Figure 1: A slice of an example table that stores web pages. The row name is a reversed URL. The contents column family holds the page contents, and the anchor column family holds the text of any anchors that reference the page (Alex's note: if you are not familiar with HTML anchors, please look them up). CNN's home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named "anchor:cnnsi.com" and "anchor:my.look.ca". Each anchor cell has one version (Alex's note: the timestamp identifies the version of a column; t9 and t8 identify the versions of the two anchor cells), while the contents column has three versions, at timestamps t3, t5, and t6.
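The map described above can be sketched in a few lines of Python. This is only an illustrative in-memory model under stated assumptions, not Bigtable's client API: the class `TinyTable`, its methods, the row key "com.cnn.www", and all stored byte strings are hypothetical placeholders chosen to mirror the cells named in the Figure 1 caption.

```python
from typing import Dict, List, Optional, Tuple

# (row key, column key, timestamp) -> uninterpreted bytes
CellKey = Tuple[str, str, int]


class TinyTable:
    """Hypothetical in-memory stand-in for the map; not Bigtable's real API."""

    def __init__(self) -> None:
        self._cells: Dict[CellKey, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        """Store one version of one cell."""
        self._cells[(row, column, timestamp)] = value

    def versions(self, row: str, column: str) -> List[Tuple[int, bytes]]:
        """Return every version of one cell, newest timestamp first."""
        found = [(ts, value) for (r, c, ts), value in self._cells.items()
                 if r == row and c == column]
        return sorted(found, key=lambda pair: pair[0], reverse=True)

    def get(self, row: str, column: str) -> Optional[bytes]:
        """Return the most recent version of a cell, or None if it is absent."""
        found = self.versions(row, column)
        return found[0][1] if found else None


table = TinyTable()
row = "com.cnn.www"  # assumed reversed form of the CNN home page URL

# Three fetches of the page; the byte strings are placeholders.
table.put(row, "contents:", 3, b"<html>fetched at t3</html>")
table.put(row, "contents:", 5, b"<html>fetched at t5</html>")
table.put(row, "contents:", 6, b"<html>fetched at t6</html>")

# One version per anchor cell; the values are placeholders.
table.put(row, "anchor:cnnsi.com", 9, b"placeholder anchor text")
table.put(row, "anchor:my.look.ca", 8, b"placeholder anchor text")

print([ts for ts, _ in table.versions(row, "contents:")])  # [6, 5, 3]
print(table.get(row, "contents:"))
```

The table is sparse in the sense that only cells that were written occupy an entry; a row that lacks a column simply has no key for it.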

Rows

The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most users). Every read or write of data under a single row key is atomic, regardless of the number of different columns being read or written in the row. This design decision makes it easier for clients to reason about the system's behavior in the presence of concurrent updates to the same row.
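To make the per-row atomicity guarantee concrete, here is a minimal single-process sketch that assumes one lock per row key. It is not how Bigtable implements the guarantee (the text here does not say); `RowAtomicStore`, `mutate_row`, and `read_row` are hypothetical names, and the 64KB check simply mirrors the limit quoted above.

```python
import threading
from typing import Dict

MAX_ROW_KEY_BYTES = 64 * 1024  # row-key size limit quoted in the text


class RowAtomicStore:
    """Hypothetical store in which each row is guarded by its own lock."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, bytes]] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the two dicts above

    def _lock_for(self, row: str) -> threading.Lock:
        if len(row.encode("utf-8")) > MAX_ROW_KEY_BYTES:
            raise ValueError("row key exceeds 64KB")
        with self._guard:
            return self._locks.setdefault(row, threading.Lock())

    def mutate_row(self, row: str, updates: Dict[str, bytes]) -> None:
        """Apply writes to any number of columns of one row as a single unit."""
        with self._lock_for(row):
            self._rows.setdefault(row, {}).update(updates)

    def read_row(self, row: str) -> Dict[str, bytes]:
        """Return a snapshot of the row; never a half-applied mutation."""
        with self._lock_for(row):
            return dict(self._rows.get(row, {}))


store = RowAtomicStore()


def writer(n: int) -> None:
    value = str(n).encode()
    # Both columns change together or not at all.
    store.mutate_row("com.cnn.www", {"contents:": value, "anchor:cnnsi.com": value})


threads = [threading.Thread(target=writer, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

snapshot = store.read_row("com.cnn.www")
print(snapshot["contents:"] == snapshot["anchor:cnnsi.com"])  # True
```

Because every mutation of a row holds that row's lock for its whole duration, a reader sees either all of a writer's column updates or none of them, which is the property that makes concurrent updates to one row easy to reason about.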

Bigtable maintains data in lexicographic order by row key. The row range of a table is dynamically partitioned. Each row range is called a "tablet", which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by choosing their row keys so that they get good locality for their data accesses. For example, in Webtable, pages in the same domain are grouped into contiguous rows by reversing the hostname components of the URLs. We would store the data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.
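The row-key choice described above can be sketched as follows. The helper names `row_key` and `tablet_for`, the sample URLs other than maps.google.com/index.html, and the single split key are all assumptions made for illustration; real tablet boundaries are chosen dynamically by the system.

```python
import bisect
from typing import List


def row_key(url: str) -> str:
    """Reverse the hostname components, keeping the path unchanged."""
    host, sep, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + sep + path


urls = [
    "maps.google.com/index.html",
    "www.cnn.com/",
    "mail.google.com/inbox",
    "www.google.com/",
]

# Lexicographic order places all google.com pages in adjacent rows.
keys: List[str] = sorted(row_key(u) for u in urls)
for key in keys:
    print(key)

# Hypothetical tablet boundary: rows below "com.google" go to tablet 0,
# rows at or above it go to tablet 1.
split_keys = ["com.google"]


def tablet_for(key: str) -> int:
    """Locate the tablet whose row range contains the given key."""
    return bisect.bisect_right(split_keys, key)


print([tablet_for(k) for k in keys])  # [0, 1, 1, 1]
```

With reversed hostnames, a scan over the prefix "com.google." touches one contiguous row range and, in this toy layout, a single tablet; with unreversed URLs the same pages would be scattered across the key space.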
