Bigtable: A Distributed Storage System for Structured Data

Abstract

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. This paper describes the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and it describes the design and implementation of Bigtable.

1 Introduction

Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful of machines to thousands of servers that store up to several hundred terabytes of data.

In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Parallel databases [14] and main-memory databases [13] have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. In database terminology, the data has no fixed schema: clients define the schema themselves. The model also allows clients to reason about the locality properties of the data as represented in the underlying storage; for example, rows whose keys share a prefix are stored near one another, so related data can be read together efficiently. Data is indexed using row and column names that can be arbitrary strings. Although client programs often serialize various forms of structured and semi-structured data into these strings, Bigtable treats the data itself as uninterpreted strings. By carefully choosing their schemas, clients can control the locality of their data. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.

Section 2 describes the data model in more detail, and Section 3 provides an overview of the client API. Section 4 briefly describes the underlying Google infrastructure on which Bigtable depends. Section 5 describes the fundamentals of the Bigtable implementation, and Section 6 describes some of the refinements that we made to improve Bigtable's performance. Section 7 provides measurements of Bigtable's performance. We describe several examples of how Bigtable is used at Google in Section 8, and discuss the lessons we learned in designing and supporting Bigtable in Section 9. Finally, Section 10 describes related work, and Section 11 presents our conclusions.

2 Data Model

Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

(row:string, column:string, time:int64) -> string
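To make the shape of this map concrete, here is a minimal C++ sketch of a logical table as a sorted in-memory map. This is an illustration only, not how Bigtable actually stores data; the CellKey type and its ordering are assumptions made for the example, and the descending timestamp order anticipates the versioning rules described under Timestamps below.

// Minimal sketch of the logical data model: a sorted map from
// (row, column, timestamp) to an uninterpreted byte string.
// Illustrative only; this is not Bigtable's implementation.
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

struct CellKey {
  std::string row;     // e.g. "com.cnn.www"
  std::string column;  // e.g. "anchor:cnnsi.com"
  int64_t timestamp;   // 64-bit integer version stamp

  bool operator<(const CellKey& o) const {
    // Rows and columns sort lexicographically; timestamps sort in
    // decreasing order so the most recent version comes first.
    return std::tie(row, column, o.timestamp) <
           std::tie(o.row, o.column, timestamp);
  }
};

using LogicalTable = std::map<CellKey, std::string>;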

We settled on this data model after examining a variety of potential uses of a Bigtable-like system. As one concrete example that drove many of our design decisions, suppose we want to keep a copy of a large collection of web pages and related information that could be used by many different projects; let us call this particular table the Webtable. In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps at which they were fetched, as illustrated in Figure 1.

Figure 1: A slice of an example table that stores web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN's home page is referenced by both the Sports Illustrated and the my.look.ca home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version (note that the timestamp identifies a cell's version: t9 and t8 identify the two anchor versions); the contents column has three versions, at timestamps t3, t5, and t6.

Rows

The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system's behavior in the presence of concurrent updates to the same row.

Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses. For example, we store the data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient. An illustrative key-construction helper appears below.
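Here is a small self-contained helper (hypothetical; not part of any Bigtable API) that builds such a locality-friendly row key by reversing a URL's hostname:

// Hypothetical helper: reverse the hostname of a URL so that pages
// from the same domain sort next to each other in the row space.
// "maps.google.com/index.html" -> "com.google.maps/index.html"
#include <sstream>
#include <string>
#include <vector>

std::string RowKeyForUrl(const std::string& url) {
  std::string::size_type slash = url.find('/');
  std::string host = url.substr(0, slash);
  std::string path = (slash == std::string::npos) ? "" : url.substr(slash);

  // Split the hostname on '.' and reassemble it in reverse order.
  std::vector<std::string> parts;
  std::istringstream in(host);
  for (std::string part; std::getline(in, part, '.');) parts.push_back(part);

  std::string key;
  for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
    if (!key.empty()) key += '.';
    key += *it;
  }
  return key + path;
}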

Column Families

Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation. In contrast, a table may have an unbounded number of columns.

A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. An example column family for the Webtable is language, which stores the language in which a web page was written. We use only one column key in the language family, and it stores each web page's language ID. Another useful column family for this table is anchor; each column key in this family represents a single anchor link, as shown in Figure 1. The qualifier is the name of the referring site, and the cell contents are the link text.
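Since qualifiers may contain arbitrary characters, a column key is split at the first colon only. A trivial sketch of that parsing rule (a hypothetical helper, not from the paper):

// Split "family:qualifier" at the first ':' only; the qualifier may
// itself contain colons or any other characters.
#include <string>
#include <utility>

std::pair<std::string, std::string> SplitColumnKey(const std::string& key) {
  std::string::size_type colon = key.find(':');
  if (colon == std::string::npos) return {key, ""};  // bare family name
  return {key.substr(0, colon), key.substr(colon + 1)};
}
// SplitColumnKey("anchor:cnnsi.com") yields {"anchor", "cnnsi.com"}.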

Access control and both disk and memory accounting are performed at the column-family level. In our Webtable example, these controls allow us to manage several different types of applications: some that add new base data, some that read the base data and create derived column families, and some that are only allowed to view existing data (and possibly not even all of the existing families, for privacy reasons).

Timestamps

Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent real time in microseconds, or be explicitly assigned by client applications. Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent version can be read first.

To make the management of versioned data less onerous, we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (for example, only keep values that were written in the last seven days).
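The following sketch shows the effect of these settings on one cell's version list. Combining both limits in a single policy is an assumption made for the example; the paper presents them as two alternative per-family settings.

// Sketch of version garbage collection for a single cell. The
// timestamps vector is sorted newest-first, matching storage order.
#include <cstddef>
#include <cstdint>
#include <vector>

struct GcPolicy {
  std::size_t max_versions;  // e.g. 3: keep only the last three versions
  int64_t max_age_us;        // e.g. seven days, in microseconds
};

// Returns how many of the newest versions survive garbage collection.
std::size_t SurvivingVersions(const std::vector<int64_t>& timestamps,
                              const GcPolicy& policy, int64_t now_us) {
  std::size_t kept = 0;
  for (int64_t ts : timestamps) {
    bool new_enough = (now_us - ts) <= policy.max_age_us;
    if (kept < policy.max_versions && new_enough) {
      ++kept;
    } else {
      break;  // later entries are older still, so they are collected too
    }
  }
  return kept;
}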

In our Webtable example, we set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled. The garbage-collection mechanism described above lets us keep only the most recent three versions of every page.

3 API

The Bigtable API provides functions for creating and deleting tables and column families. It also provides functions for changing cluster, table, and column-family metadata, such as access control rights.

// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);

Client applications can write or delete values in Bigtable, look up values from individual rows, or iterate over a subset of the data in a table. The C++ code in Figure 2 uses a RowMutation abstraction to perform a series of updates. (Irrelevant details were elided to keep the example short.) The call to Apply performs an atomic mutation to the Webtable: it adds one anchor to www.cnn.com and deletes a different anchor.

Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}

The C++ code in Figure 3 uses a Scanner abstraction to iterate over all anchors in a particular row. Clients can iterate over multiple column families, and there are several mechanisms for limiting the rows, columns, and timestamps produced by a scan. For example, we could restrict the scan above to only produce anchors whose columns match the regular expression anchor:*.cnn.com, or to only produce anchors whose timestamps fall within ten days of the current time.

Bigtable supports several other features that allow the user to manipulate data in more complex ways. First, Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key. Bigtable provides an interface for batching writes across row keys at the clients, but it does not currently support general transactions across row keys. Second, Bigtable allows cells to be used as integer counters. Finally, Bigtable supports the execution of client-supplied scripts in the address spaces of the servers. The scripts are written in Sawzall [28], a language developed at Google for processing data. At the moment, our Sawzall-based API does not allow client scripts to write back into Bigtable, but it does allow various forms of data transformation, filtering based on arbitrary expressions, and summarization via a variety of operators.
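To illustrate why single-row atomicity matters, here is a self-contained toy (not Bigtable code; the per-row mutex merely stands in for Bigtable's single-row transactions) showing a read-modify-write on one row that cannot lose updates:

// Toy model of an atomic read-modify-write under a single row key.
// Concurrent increments are serialized per row, so no update is lost.
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

class ToyRow {
 public:
  // Atomically read, update, and write back one cell of this row.
  void Increment(const std::string& column) {
    std::lock_guard<std::mutex> guard(mu_);  // serialize updates to this row
    int64_t value = 0;
    auto it = cells_.find(column);
    if (it != cells_.end()) value = std::stoll(it->second);
    cells_[column] = std::to_string(value + 1);
  }

 private:
  std::mutex mu_;                             // one lock per row, not per table
  std::map<std::string, std::string> cells_;  // column key -> value
};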

Bigtable can be used with MapReduce [12], a framework for running large-scale parallel computations developed at Google. We have written a set of wrappers that allow Bigtable to be used both as an input source and as an output target for MapReduce jobs.

4 Building Blocks

Bigtable is built on several other pieces of Google infrastructure. Bigtable uses the distributed Google File System (GFS) [17] to store log and data files. A Bigtable cluster typically operates in a shared pool of machines that run a wide variety of other distributed applications, and Bigtable processes often share the same machines with processes from other applications. Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.

SSTable File Format

The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then read the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching the disk. A sketch of the index-then-block lookup path appears below.
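The following sketch assumes a plausible index layout (Google's actual SSTable code is not given in the paper): binary-search the in-memory block index for the first block whose largest key is at least the target key, then read only that block from disk, giving one seek per lookup.

// Sketch of an SSTable lookup via the in-memory block index.
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct BlockIndexEntry {
  std::string last_key;  // largest key stored in this block
  int64_t offset;        // byte offset of the block within the file
  int64_t length;        // block size, typically 64KB
};

// `index` is sorted by last_key, mirroring the sorted SSTable.
// Returns the offset of the candidate block, or -1 if the key is
// larger than every key in the file.
int64_t FindBlockOffset(const std::vector<BlockIndexEntry>& index,
                        const std::string& key) {
  auto it = std::lower_bound(
      index.begin(), index.end(), key,
      [](const BlockIndexEntry& e, const std::string& k) {
        return e.last_key < k;
      });
  if (it == index.end()) return -1;
  return it->offset;  // caller reads [offset, offset + length) and scans it
}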

Chubby Distributed Lock Service

Bigtable relies on a highly available and persistent distributed lock service called Chubby [8]. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serve requests. The service is live when a majority of the replicas are running and can communicate with each other. Chubby uses the Paxos algorithm [9,23] to keep its replicas consistent in the face of failure. Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. The Chubby client library provides consistent caching of Chubby files. Each Chubby client maintains a session with a Chubby service. A client's session expires if it is unable to renew its session lease within the lease expiration time. When a client's session expires, it loses any locks and open handles. Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration.

Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data (see Section 5.1); to discover tablet servers and finalize tablet server deaths (see Section 5.2); to store Bigtable schema information (the column-family information for each table); and to store access control lists. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable. We recently measured this effect in 14 Bigtable clusters spanning 11 Chubby instances. The average percentage of Bigtable server hours during which some data stored in Bigtable was unavailable because Chubby was unreachable (due to Chubby outages or network issues) was 0.0047%. The percentage for the single cluster that was most affected by Chubby unavailability was 0.0326%.

5 Implementation

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be dynamically added (or removed) from a cluster to accommodate changes in workloads.

The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS. In addition, it handles schema changes such as table and column-family creations.

Each tablet server manages a set of tablets (typically somewhere between ten and a thousand tablets per tablet server). The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large.

As with many single-master distributed storage systems [17,21], client data does not move through the master: clients communicate directly with tablet servers for reads and writes. Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master at all. As a result, the master is lightly loaded in practice.

A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all the data associated with a row range. Initially, each table consists of just one tablet. As the table grows, it is automatically split into multiple tablets, each approximately 100-200MB in size by default.

5.1 Tablet Location

We use a three-level hierarchy analogous to that of a B+-tree [10] to store tablet location information (Figure 4).

The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but it is treated specially: it is never split, to ensure that the tablet location hierarchy has no more than three levels.

The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row. Each METADATA row stores approximately 1KB of data in memory. With a modest limit of 128MB METADATA tablets, our three-level location scheme is sufficient to address 2^34 tablets (or 2^61 bytes in 128MB tablets). The arithmetic is spelled out below.
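Using only the figures in this paragraph:

\frac{128\,\mathrm{MB}}{1\,\mathrm{KB}\ \text{per row}} = \frac{2^{27}\,\mathrm{B}}{2^{10}\,\mathrm{B}} = 2^{17}\ \text{tablet locations per METADATA tablet}

2^{17}\ \text{METADATA tablets (listed in the root)} \times 2^{17}\ \text{user tablets each} = 2^{34}\ \text{tablets}

2^{34}\ \text{tablets} \times 2^{27}\,\mathrm{B}\ (128\,\mathrm{MB})\ \text{per tablet} = 2^{61}\,\mathrm{B}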

The client library caches tablet locations. If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, it moves up the tablet location hierarchy recursively. If the client's cache is empty, the location algorithm requires three network round-trips, including one read from Chubby. If the client's cache is stale, the location algorithm could take up to six round-trips, because stale cache entries are only discovered upon misses (assuming that METADATA tablets do not move very frequently). Although tablet locations are stored in memory, so no GFS accesses are required, we further reduce this cost in the common case by having the client library prefetch tablet locations: it reads the metadata for more than one tablet whenever it reads the METADATA table.

The METADATA table also stores secondary information, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging and for performance analysis.

5.2 Tablet Assignment

Each tablet is assigned to one tablet server at a time. The master keeps track of the set of live tablet servers and the current assignment of tablets to tablet servers, including which tablets are unassigned. When a tablet is unassigned and a tablet server with sufficient room for the tablet is available, the master assigns the tablet by sending a tablet load request to that tablet server.

Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory. The master monitors this directory (the servers directory) to discover tablet servers. A tablet server stops serving its tablets if it loses its exclusive lock, for example, due to a network partition that caused the server to lose its Chubby session. (Chubby provides an efficient mechanism that allows a tablet server to check whether it still holds its lock without incurring network traffic.) A tablet server attempts to reacquire an exclusive lock on its file as long as the file still exists; if the file no longer exists, the tablet server will never be able to serve again, so it kills itself. Whenever a tablet server terminates (for example, because the cluster management system is removing the tablet server's machine from the cluster), it attempts to release its lock so that the master will reassign its tablets more quickly. A sketch of this lock lifecycle appears below.
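Here is a compiling toy of that control flow. The Chubby calls are stubbed stand-ins, and the loop structure is an assumed reading of the prose above, not Google's code:

// Toy of the tablet server's lock lifecycle: serve only while the
// exclusive Chubby lock is held; retry while the server file exists;
// kill the process once the file is gone.
#include <cstdlib>

// Stubbed stand-ins for Chubby client calls (hypothetical).
bool StillHoldsLock() { return true; }     // cheap check, no network traffic
bool ServerFileExists() { return true; }   // does our server file still exist?
bool TryReacquireLock() { return false; }  // attempt to retake the lock
void ServeLoadedTablets() {}               // handle reads/writes for a while

void TabletServerLoop() {
  for (;;) {
    if (StillHoldsLock()) {
      ServeLoadedTablets();
    } else if (ServerFileExists()) {
      TryReacquireLock();  // session lost but file remains: keep retrying
    } else {
      std::exit(1);        // file deleted (e.g. by the master): never serve again
    }
  }
}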

The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets as soon as possible. The master detects this by periodically asking each tablet server for the status of its lock. If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on that server's file. If the master is able to acquire the lock, then Chubby is live and the tablet server is either dead or having trouble reaching Chubby, so the master deletes the server's file to ensure that the tablet server can never serve again. Once a server's file has been deleted, the master moves all the tablets that were previously assigned to that server into the set of unassigned tablets.
