BigTable, One of Google's "Three Treasures"

Source: Internet
Author: User

OSDI 2006 featured two Google papers: Bigtable and Chubby. Chubby is a distributed lock service based on the Paxos algorithm; BigTable is a distributed storage system for managing structured data, built on Google technologies such as GFS, Chubby, and SSTable. Quite a few Google applications use BigTable, such as Google Earth and Google Analytics, so together with GFS and MapReduce it is known as one of Google's "three treasures" of technology.

Compared with the GFS and MapReduce papers, I found the BigTable paper harder to understand: on the one hand because I do not know databases well, and on the other hand because my understanding of databases is limited to relational databases. Trying to understand BigTable through a relational data model is a recipe for confusion. I recommend an article here (access from mainland China may require a proxy): Understanding HBase and BigTable. I believe it is very helpful for understanding the BigTable/HBase data model.

1 What is BigTable

BigTable is a distributed storage system designed to manage large-scale structured data; it can scale to petabytes of data and thousands of servers. Many Google projects use BigTable to store data, and these projects place very different demands on BigTable, for example in data size and latency. BigTable meets these varying requirements and successfully provides a flexible, high-performance storage solution for all of these products.

BigTable looks like a database and adopts many database implementation strategies. However, BigTable does not support a full relational data model; instead, it offers clients a simple data model that lets them dynamically control the layout and format of their data and exploit the locality of the underlying storage. BigTable treats data as uninterpreted byte strings; clients need to serialize their structured and unstructured data into these strings before storing them in BigTable.

The following sections introduce BigTable's data model and basic working principles; the various refinements (such as compression, Bloom filters, and so on) are not covered.

2 BigTable Data Model

BigTable is not a relational database, but it borrows much terminology from relational databases, such as table, row, and column. This easily misleads readers into mapping these terms onto relational-database concepts, which makes the paper harder to understand. Understanding HBase and BigTable is an excellent article that helps readers break out of the relational-model mindset.

Essentially, BigTable is a key-value map. According to the paper's authors, BigTable is a sparse, distributed, persistent, multidimensional sorted map.

Let us look at multidimensional, sorted, and map first. A BigTable key has three dimensions: row key, column key, and timestamp. Row keys and column keys are byte strings, the timestamp is a 64-bit integer, and the value is a byte string. A key-value record can therefore be written as (row:string, column:string, time:int64) → string.
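A minimal way to picture this model (my own sketch, not from the paper) is an ordinary Python dictionary keyed by (row, column, timestamp) tuples, using the CNN example from the paper's Figure 1:

    # Toy in-memory picture of BigTable's map; the real system is sparse,
    # distributed, and persistent, but the key structure is the same.
    bigtable = {
        # (row key, column key, timestamp) -> value; keys and values are byte strings
        (b"com.cnn.www", b"contents:", 5): b"<html>...</html>",
        (b"com.cnn.www", b"anchor:cnnsi.com", 9): b"CNN",
    }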

A row key can be any byte string, typically 10-100 bytes. Reads and writes of a single row are atomic. BigTable stores data in lexicographic order of the row keys. A BigTable table is automatically partitioned into tablets based on row keys, and tablets are the unit of load balancing. A new table starts with a single tablet, but as the table grows it is automatically split into more tablets, each kept at roughly 100-200 MB. The row is the first-level index of the table; if we treat a row's columns, timestamps, and values as a single whole, the table simplifies to a one-dimensional key-value map, similar to the following:

    table{
        "1"     : {sth.},  // a row
        "aaaaa" : {sth.},
        "aaaab" : {sth.},
        "xyz"   : {sth.},
        "zzzzz" : {sth.}
    }

Columns form the second-level index. Each row may have an unrestricted number of columns, and columns can be added at any time. For ease of administration, columns are grouped into column families, which are the unit of access control; the columns in a family generally store data of the same type. A table's set of column families rarely changes, but the columns within a family can be added and removed arbitrarily. A column key is named in the family:qualifier format. This time let us pull the columns out and treat the timestamps and values as a whole, simplifying the table to a two-dimensional key-value map, similar to the following:

    table{
        // ...
        "aaaaa" : {             // a row
            "A:foo" : {sth.},   // a column
            "A:bar" : {sth.},   // a column
            "B:"    : {sth.}    // a column; family name is B, qualifier is the empty string
        },
        "aaaab" : {             // a row
            "A:foo" : {sth.},
            "B:"    : {sth.}
        },
        // ...
    }

Alternatively, the column family can be viewed as another level of index, similar to the following:

    table{
        // ...
        "aaaaa" : {                 // a row
            "A" : {                 // column family A
                "foo" : {sth.},     // a column
                "bar" : {sth.}
            },
            "B" : {                 // column family B
                ""  : {sth.}
            }
        },
        "aaaab" : {                 // a row
            "A" : {
                "foo" : {sth.}
            },
            "B" : {
                ""  : "ocean"
            }
        },
        // ...
    }

The timestamp is the third-level index. BigTable allows multiple versions of a datum to be kept, distinguished by timestamp. The timestamp can be assigned by BigTable, in which case it represents the exact time the data entered BigTable, or it can be assigned by the client. Different versions of a datum are stored in decreasing timestamp order, so the newest version is read first. When we add the timestamp, we get BigTable's full data model, similar to the following:

    table{
        // ...
        "aaaaa" : {            // a row
            "A:foo" : {        // a column
                15 : "y",      // a version
                 4 : "m"
            },
            "A:bar" : {        // a column
                15 : "d"
            },
            "B:" : {           // a column
                 6 : "w",
                 3 : "o",
                 1 : "w"
            }
        },
        // ...
    }
When querying, if only the row and column are given, the latest version of the data is returned; if the row, column, and a timestamp are given, the newest version whose timestamp is less than or equal to the given one is returned. For example, querying "aaaaa"/"A:foo" returns "y", querying "aaaaa"/"A:foo"/10 returns "m", and querying "aaaaa"/"A:foo"/2 returns nothing.
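This lookup rule can be sketched as follows (my own illustration, keeping one column's versions in a dict from timestamp to value):

    def read(versions, ts=None):
        # For one (row, column): return the newest value, or, if ts is given,
        # the newest value whose timestamp is <= ts.
        candidates = versions if ts is None else {t: v for t, v in versions.items() if t <= ts}
        return candidates[max(candidates)] if candidates else None

    a_foo = {15: "y", 4: "m"}
    print(read(a_foo))      # 'y'  (latest version)
    print(read(a_foo, 10))  # 'm'  (newest version at or before 10)
    print(read(a_foo, 2))   # None (no version old enough)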


Figure 1 is an example given in the BigTable paper: a table called webtable stores a large number of web pages and related information. In webtable, each row stores one web page, with the reversed URL as the row key; for example, the data for maps.google.com/index.html is stored under the row key com.google.maps/index.html. URLs are reversed so that pages from subdomains of the same domain are clustered together. The column family "anchor" in Figure 1 holds the sites that link to the page (for example, a site linking to the CNN home page); the qualifier is the name of the referring site and the value is the link text. The column family "contents" holds the page content and has only one column, with an empty qualifier: "contents:". Figure 1 shows three versions of the page saved under the "contents:" column, so we can use ("com.cnn.www", "contents:", t5) to retrieve the content of the CNN home page at time t5.
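The reversed row key can be sketched like this (my own illustration; the paper does not spell out the exact encoding):

    def reversed_row_key(url):
        # Reverse the hostname so that pages of the same domain sort next to each other.
        host, _, path = url.partition("/")
        return ".".join(reversed(host.split("."))) + ("/" + path if path else "")

    print(reversed_row_key("maps.google.com/index.html"))  # com.google.maps/index.html
    print(reversed_row_key("www.cnn.com"))                 # com.cnn.www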

Now let us look at the other properties the authors mention: sparse, distributed, persistent. Persistent is simple: BigTable data is ultimately stored as files in GFS. Being built on GFS already implies that BigTable is distributed, although of course the distribution goes beyond that. Sparse means that different rows of a table may have completely different sets of columns.

3 Supporting Technologies

BigTable relies on several other Google technologies: it uses GFS to store logs and data files, stores data in the SSTable file format, and uses Chubby to manage metadata.

For GFS, see the companion article on the Google File System, another of Google's "three treasures". BigTable's data and logs are written to GFS.

SSTable stands for Sorted Strings Table: an immutable, sorted key-value map that supports lookup and iteration. Each SSTable consists of a series of blocks; BigTable uses a block size of 64 KB by default. A block index is stored at the end of the SSTable, and the whole index is read into memory when the SSTable is opened. The BigTable paper does not describe the concrete structure of an SSTable, but the article "leveldb, part four: the sstable file" describes LevelDB's SSTable format; because LevelDB's author Jeffrey Dean was also one of BigTable's designers, it is of great reference value. Each tablet is stored in GFS in SSTable format, and one tablet may correspond to multiple SSTables.
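As a rough sketch of how such a block index might be used (my own illustration; the real SSTable layout differs in detail), the index kept in memory only needs to say which 64 KB block could contain a key:

    import bisect

    # Hypothetical in-memory block index: the last key of each block plus the
    # block's byte offset in the SSTable file.
    block_index = [("apple", 0), ("mango", 65536), ("zebra", 131072)]
    last_keys = [k for k, _ in block_index]

    def block_offset(key):
        # Return the offset of the block that may contain `key`,
        # or None if the key is beyond the last block.
        i = bisect.bisect_left(last_keys, key)
        return block_index[i][1] if i < len(block_index) else None

    print(block_offset("banana"))  # 65536 -> read that single block from GFS and scan it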

Chubby is a highly available distributed lock service. A Chubby cell has five active replicas, of which only one serves as the master at any time; the replicas maintain consistency with the Paxos algorithm. Chubby provides a namespace of directories and files, and each directory or file can be used as a lock. Chubby clients must maintain a session with Chubby; if the session expires, the client loses all of its locks. For more about Chubby, see the other Google paper: The Chubby Lock Service for Loosely-Coupled Distributed Systems. In BigTable, Chubby is used for tasks such as tablet location, tablet-server liveness monitoring, and storing access control lists.

4 The BigTable Cluster

A BigTable cluster consists of three main parts: a library linked into every client, one master server, and many tablet servers.

As described in the data model section, BigTable partitions tables into tablets whose size is kept in the 100-200 MB range; a tablet that falls outside this range is split into smaller tablets or merged into a larger one. Each tablet server is responsible for a certain number of tablets: it handles read and write requests for its tablets and splits or merges tablets as needed. Tablet servers can be added and removed at any time according to the load. Tablet servers do not actually store the data; they act as proxies connecting BigTable clients to GFS, and the client's data operations reach GFS indirectly through the tablet server.

The master server is responsible for assigning tablets to tablet servers, detecting the addition and removal of tablet servers, balancing the load across tablet servers, and handling the creation of tables and column families. Note that the master server stores no tablets, provides no data service, and does not even provide tablet location information.

When a client needs to read or write data, it contacts the tablet servers directly. Because clients do not need to obtain tablet location information from the master server, most clients never talk to the master at all, and the master's load is generally light.

5 Tablet Location

As mentioned above, the master server does not provide tablet location information, so how does a client find its tablets? According to the paper, BigTable uses a three-level structure, similar to a B+ tree, to store location information.


The first level is a Chubby file. This Chubby file holds the location of the root tablet and is part of the Chubby service; once Chubby becomes unavailable, the location of the root tablet is lost and the entire BigTable becomes unavailable.

The second level is the root tablet. The root tablet is actually the first tablet of the metadata table (METADATA table), and it holds the locations of all the other METADATA tablets. The root tablet is special: to keep the depth of the tree fixed, the root tablet is never split.

The third level consists of the other METADATA tablets, which together with the root tablet form the complete metadata table. Each METADATA tablet contains the locations of many user tablets.

So the whole location system really has just two parts: a Chubby file and the metadata table. Note that although the metadata table is special, it still follows the data model described earlier, and each of its tablets is likewise served by an ordinary tablet server; this is why the master server is not needed for location lookups. Clients cache tablet locations; if a tablet's location is not in the cache, the client has to walk the three-level structure, which involves one access to the Chubby service and accesses to two tablet servers.
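Under the (hypothetical) assumption that each level is a sorted list mapping a tablet's end row key to the server holding it, the three-level lookup can be sketched like this (toy data; the key encoding is my own guess, discussed in section 8):

    import bisect

    def find_in_tablet(rows, key):
        # rows: sorted (end_row_key, location) pairs; return the location of the
        # first tablet whose end key is >= key, i.e. the tablet covering `key`.
        ends = [end for end, _ in rows]
        return rows[bisect.bisect_left(ends, key)][1]

    # Level 1: a well-known Chubby file names the server holding the root tablet.
    chubby_root_location = "tabletserver-1"

    # Level 2: the root tablet maps METADATA row keys to the servers of the other
    # METADATA tablets.
    root_tablet = [("METADATA.webtable.com.baidu.www", "tabletserver-2"),
                   ("METADATA.webtable.com.douban.www", "tabletserver-3")]

    # Level 3: one METADATA tablet maps user-table row keys to tablet servers.
    metadata_tablets = {"tabletserver-2": [("webtable.com.baidu.www", "tabletserver-4"),
                                           ("webtable.com.douban.www", "tabletserver-5")]}

    meta_server = find_in_tablet(root_tablet, "METADATA.webtable.com.baidu.tieba")
    user_server = find_in_tablet(metadata_tablets[meta_server], "webtable.com.baidu.tieba")
    print(user_server)  # tabletserver-4: the server to contact for the actual read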

6 Tablet Storage and Access

A tablet's data is ultimately written to GFS; the physical form of a tablet in GFS is a set of SSTable files. Figure 5 shows the basics of read and write operations.

When a tablet server receives a write request, it first checks that the request is well-formed and authorized. If it is, the mutation is first appended to the commit log, and then the data is written to the in-memory memtable. The memtable serves as an in-memory buffer in front of the SSTables: when it grows to a certain size it is frozen, BigTable creates a new memtable, and the frozen memtable is converted to SSTable format and written to GFS. This operation is called a minor compaction.
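A toy sketch of this write path (my own illustration, collapsing the commit log, memtable, and GFS into plain Python objects):

    class TabletServer:
        MEMTABLE_LIMIT = 4                       # freeze threshold (tiny, for illustration only)

        def __init__(self):
            self.log = []                        # stands in for the commit log kept in GFS
            self.memtable = {}                   # recent writes, held in memory
            self.sstables = []                   # immutable sorted runs already "flushed to GFS"

        def write(self, key, value):
            self.log.append((key, value))        # 1. append the mutation to the commit log first
            self.memtable[key] = value           # 2. then apply it to the memtable
            if len(self.memtable) >= self.MEMTABLE_LIMIT:
                self.minor_compaction()          # 3. flush once the memtable grows too large

        def minor_compaction(self):
            frozen = sorted(self.memtable.items())   # freeze and sort the current memtable
            self.sstables.append(frozen)             # in reality: becomes an SSTable file in GFS
            self.memtable = {}                       # start a fresh memtable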

When a tablet server receives a read request, it likewise checks that the request is legitimate. If it is, the read is executed over a merged view of all the tablet's SSTables and the memtable; because the SSTables and the memtable are themselves sorted, forming the merged view is fairly fast.
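Reading from such a merged view can be sketched with a standard-library merge of sorted runs (again my own illustration):

    import heapq

    memtable  = [("aaaaa", "new value"), ("xyz", "x1")]    # newest data, in memory, sorted
    sstable_1 = [("aaaaa", "old value"), ("aaaab", "b1")]  # older sorted runs in GFS
    sstable_2 = [("zzzzz", "z1")]

    def read(key, *runs):
        # Scan a merged view of sorted runs; runs are passed newest first, so for
        # a duplicated key the newest value is encountered first and wins.
        for k, v in heapq.merge(*runs, key=lambda kv: kv[0]):
            if k == key:
                return v
        return None

    print(read("aaaaa", memtable, sstable_1, sstable_2))  # new value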

Each minor compaction produces a new SSTable file, and too many SSTable files would reduce read efficiency, so BigTable periodically performs a merging compaction, which merges a few SSTables and the memtable into one new SSTable. BigTable also has a heavier operation called a major compaction, which merges all of a tablet's SSTables into a single new SSTable.

Unfortunately, the BigTable authors do not describe the detailed data structures of the memtable and SSTable.

7 The Relationship Between BigTable and GFS

A BigTable cluster includes the master server and the tablet servers: the master assigns tablets to tablet servers, and the actual data service is handled solely by the tablet servers. But do not mistakenly assume that a tablet server really stores the data (apart from the in-memory memtable); the true location of the data is known only to GFS. When the master assigns a tablet to a tablet server, what the tablet server obtains should be the names of all the SSTable files belonging to that tablet. Through some indexing mechanism the tablet server can determine which SSTable file contains the requested data, and it then reads that SSTable from GFS; the SSTable file itself may be spread across several chunkservers.

8 The Structure of the Metadata Table

The metadata table (METADATA table) is a special table used to locate tablets, plus some secondary metadata services that are less important here. However, the BigTable paper gives only a few clues about it and does not explain the table's concrete structure. Here I try to guess the structure of the table from those clues. First, the clues from the paper:

    1. The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row.
    2. Each METADATA row stores approximately 1 KB of data in memory. (Because of the heavy traffic, the METADATA table is kept in memory; this optimization is described in the paper's section on locality groups.) This feature of placing a locality group in memory is useful for small pieces of data that are accessed frequently: it is used internally for the location column family in the METADATA table.
    3. The METADATA table also stores secondary information, including a log of all events pertaining to each tablet (such as when a server begins serving it).

The first clue says that the row key of the METADATA table encodes the identifier of the tablet's table and the tablet's end row. So each tablet occupies one record (one row) in the metadata table, and its row key contains both the table it belongs to and the range of rows it covers. For example, with the simplest encoding, the METADATA row key would be strcat(table name, last row key of the tablet).
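A naive version of that encoding (my own guess, only making the strcat idea concrete) could be:

    def metadata_row_key(table_name, tablet_end_row):
        # Encode (table, end row) as a METADATA row key; tablets of the same table
        # then sort together, ordered by their end rows.
        return table_name + "." + tablet_end_row

    print(metadata_row_key("webtable", "com.baidu.www"))  # webtable.com.baidu.www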

The second clue, besides telling us that the location part of the METADATA table is kept in memory, also reveals that the METADATA table has a column family called location. We already know that each METADATA row represents one tablet, so why is a whole column family needed to store its location? Because each tablet may consist of multiple SSTable files, a column family can store the locations of any number of SSTable files. A reasonable guess is that each SSTable file's location occupies one column, with a column key like location:filename. Of course, the full file name does not have to be in the column key; it is more likely that the SSTable file name is stored in the value. With the file name, data can then be requested from GFS.
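Putting these guesses together, one METADATA row might look roughly like this (pure speculation on my part, only to make the reasoning above concrete):

    # Hypothetical layout of one METADATA row describing one user tablet.
    metadata_row = {
        "webtable.com.baidu.www": {                       # row key: table name + tablet's end row
            "location:1": "/gfs/webtable/sstable-00017",  # one column per SSTable file of the tablet
            "location:2": "/gfs/webtable/sstable-00042",
        }
    }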

The third clue tells us that the metadata table stores more than location information; that is, there are column families other than location, but that data is not our concern for the moment.

With the above information, I drew a simplified diagram of BigTable's structure:

The diagram takes the webtable table as an example; the table stores a few pages from NetEase, Baidu, and Douban. When we want to find the content of a Baidu Tieba page as of yesterday, we can send BigTable the query webtable(com.baidu.tieba, contents:, yesterday).

Assuming the client has nothing cached, BigTable first accesses the root tablet's tablet server to find out which METADATA tablet holds the location of the tablet this page belongs to. Looking up the row key METADATA.webtable.com.baidu.tieba in the root tablet, we locate the first key not smaller than it, METADATA.webtable.com.baidu.www, so the tablet we need is METADATA tablet A. We then access tablet A's tablet server and look up webtable.com.baidu.tieba, locating webtable.com.baidu.www as the first key not smaller than it, so the tablet we need is tablet B of webtable. Finally we access tablet B's tablet server and get the data.

Note that each tablet is actually composed of several SSTable files and a memtable, all of which are sorted. Consequently, after tablet B is found, all of its SSTables and the memtable may still need to be searched. Also, the client should not obtain SSTable file names directly from the METADATA table; it only learns which tablet server serves the tablet and accesses the SSTables through that tablet server acting as a proxy.

References

[1] Bigtable: A Distributed Storage System for Structured Data. In Proceedings of OSDI '06.

[2] Understanding HBase and BigTable.
