At OSDI 2006, Google published two papers: Bigtable and Chubby. Chubby is a distributed lock service based on the Paxos algorithm; Bigtable is a distributed storage system for managing structured data, built on Google technologies such as GFS, Chubby, and SSTable. A considerable number of Google applications use Bigtable, such as Google Earth and Google Analytics, and together with GFS and MapReduce it is known as one of the "three treasures" of Google technology. Compared with the GFS and MapReduce papers, I found the Bigtable paper harder to understand, partly because I do not know much about databases, and partly because my understanding of databases is limited to relational databases. If you try to understand Bigtable through the relational data model, it is easy to go astray. I recommend the article "Understanding HBase and BigTable" (accessing it may require a proxy); I believe it is a great help in understanding the Bigtable/HBase data model.

1. What is Bigtable?

Bigtable is a distributed storage system designed to manage large-scale structured data. It scales to petabytes of data and thousands of servers. Many Google projects store their data in Bigtable, and these applications place very different demands on it, in data size as well as in latency. Bigtable meets these varying requirements and successfully provides a flexible, high-performance storage solution for these products.

Bigtable looks like a database and borrows many database implementation strategies, but it does not support the full relational data model. Instead, it offers a simple data model that lets clients dynamically control the layout and format of their data, and it exploits the locality of the underlying storage. Bigtable treats all data as uninterpreted byte strings; clients serialize their structured and semi-structured data themselves before storing it in Bigtable. The following sections describe Bigtable's data model and basic working principles; the various optimization techniques (such as compression and Bloom filters) are not covered.

2. Bigtable's data model

Bigtable is not a relational database, but it uses many relational database terms such as table, row, and column. This easily leads readers astray, mapping these terms onto relational concepts and making the paper hard to understand. "Understanding HBase and BigTable" is an excellent article that helps readers get out of the mindset of the relational data model.

Essentially, Bigtable is a key-value map. In the authors' words, Bigtable is a sparse, distributed, persistent, multidimensional sorted map. Let us first look at multidimensional, sorted, and map.

A Bigtable key has three dimensions: row key, column key, and timestamp. The row key and column key are byte strings, the timestamp is a 64-bit integer, and the value is a byte string. A key-value record can be represented as (row:string, column:string, time:int64) → string.

The row key can be an arbitrary byte string, typically 10-100 bytes. Reads and writes of a single row are atomic. Bigtable stores data in lexicographic order of the row key. A Bigtable table is automatically partitioned into tablets by row key, and the tablet is the unit of load balancing. At first a table has only one tablet, but as the table grows, tablets are split automatically, each tablet being kept to roughly 100-200 MB. The row is the first-level index of a table.
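To make the row-level picture concrete, here is a minimal JavaScript sketch of a table kept in lexicographic row-key order and split into tablets once a size threshold is exceeded. This is only an illustration, not Bigtable's actual implementation: the function names are invented, and the tiny byte limit stands in for the real 100-200 MB threshold.

// Rows are kept in lexicographic order; a "tablet" is a contiguous range of rows
// that is split once it grows past a size threshold (illustrative numbers only).
const TABLET_SPLIT_BYTES = 1024;           // stand-in for ~100-200 MB in real Bigtable

function makeTablet(rows) {
  return { rows };                         // rows: array of [rowKey, value], sorted by rowKey
}

function tabletSize(tablet) {
  return tablet.rows.reduce((n, [k, v]) => n + k.length + v.length, 0);
}

// Insert a row into the tablet that owns it, keeping lexicographic order,
// then split that tablet in half if it has grown too large.
function insertRow(tablets, rowKey, value) {
  let i = tablets.findIndex(t => t.rows.length === 0 ||
                                 t.rows[t.rows.length - 1][0] >= rowKey);
  if (i === -1) i = tablets.length - 1;    // key is beyond every tablet: use the last one
  const rows = tablets[i].rows;
  const pos = rows.findIndex(([k]) => k >= rowKey);
  if (pos === -1) rows.push([rowKey, value]);
  else if (rows[pos][0] === rowKey) rows[pos][1] = value;   // overwrite existing row
  else rows.splice(pos, 0, [rowKey, value]);

  if (tabletSize(tablets[i]) > TABLET_SPLIT_BYTES) {
    const mid = Math.ceil(rows.length / 2);
    tablets.splice(i, 1, makeTablet(rows.slice(0, mid)), makeTablet(rows.slice(mid)));
  }
}

// Usage: reversed URLs sort so that pages of the same domain end up adjacent.
const tablets = [makeTablet([])];
["com.cnn.www", "com.cnn.money", "com.example.www"].forEach(r =>
  insertRow(tablets, r, "page data..."));
console.log(tablets.map(t => t.rows.map(([k]) => k)));

Because rows stay sorted, each tablet owns a contiguous row range, which is what makes the range-based tablet location scheme described later possible.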
We can view each row's columns, timestamps, and values as a whole and simplify the model into a one-dimensional key-value map, something like:

table {
  "1"     : {something},   // a row
  "aaaaa" : {something},
  "aaaab" : {something},
  "xyz"   : {something},
  "zzzzz" : {something}
}

The column is the second-level index. The columns of each row are unrestricted and can be added or removed at any time. For ease of management, columns are grouped into column families, which are also the unit of access control. The columns in one column family usually store the same type of data. The column families of a row rarely change, but the columns within a column family can be added or removed at will. Column keys are named in the format family:qualifier.

This time we pull the column out and treat the timestamp and value as a whole, simplifying the model into a two-dimensional key-value map, something like:

table {
  // ...
  "aaaaa" : {               // a row
    "A:foo" : {something},  // a column
    "A:bar" : {something},  // a column
    "B:"    : {something}   // a column whose family is B and whose qualifier is the empty string
  },
  "aaaab" : {               // a row
    "A:foo" : {something},
    "B:"    : {something}
  },
  // ...
}

Alternatively, the column family can be treated as an index of its own, something like:

table {
  // ...
  "aaaaa" : {               // a row
    "A" : {                 // column family A
      "foo" : {something},  // a column
      "bar" : {something}
    },
    "B" : {                 // column family B
      ""  : {something}
    }
  },
  "aaaab" : {               // a row
    "A" : {
      "foo" : {something}
    },
    "B" : {
      ""  : "ocean"
    }
  },
  // ...
}

The timestamp is the third-level index. Bigtable can keep multiple versions of the same data, and the timestamp is what distinguishes the versions. The timestamp can be assigned by Bigtable, in which case it represents the exact time the data entered Bigtable, or it can be assigned by the client. Different versions of a datum are stored in decreasing timestamp order, so the newest version is read first. Adding the timestamp gives the complete Bigtable data model, something like:

table {
  // ...
  "aaaaa" : {          // a row
    "A:foo" : {        // a column
      15 : "y",        // a version
      4  : "m"
    },
    "A:bar" : {        // a column
      15 : "d"
    },
    "B:" : {           // a column
      6 : "w",
      3 : "o",
      1 : "w"
    }
  },
  // ...
}

When querying, if only a row and column are given, the newest version of the data is returned; if a row, column, and timestamp are given, the newest version whose time is no later than the timestamp is returned. For example, querying "aaaaa"/"A:foo" returns "y"; querying "aaaaa"/"A:foo"/10 returns "m"; querying "aaaaa"/"A:foo"/2 returns null.
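These version semantics can be sketched in a few lines of JavaScript. This is only an illustration of the model, not a real Bigtable API; the lookup function and the nested in-memory representation are invented for the example, reusing the sample data above.

// Versions of a cell are kept in decreasing timestamp order.
// lookup(): with no timestamp, return the newest version; with a timestamp,
// return the newest version whose time is <= the given timestamp.
const table = {
  "aaaaa": {
    "A:foo": [[15, "y"], [4, "m"]],          // [timestamp, value], newest first
    "A:bar": [[15, "d"]],
    "B:":    [[6, "w"], [3, "o"], [1, "w"]]
  }
};

function lookup(table, row, column, timestamp) {
  const versions = (table[row] || {})[column] || [];
  for (const [ts, value] of versions) {      // already sorted newest-first
    if (timestamp === undefined || ts <= timestamp) return value;
  }
  return null;                               // no version old enough
}

console.log(lookup(table, "aaaaa", "A:foo"));      // "y"
console.log(lookup(table, "aaaaa", "A:foo", 10));  // "m"
console.log(lookup(table, "aaaaa", "A:foo", 2));   // null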
Figure 1 in the Bigtable paper gives an example: a table called webtable stores a large number of web pages and related information. Each row of webtable stores one web page, with its reversed URL as the row key; for example, the page www.cnn.com is stored under the row key com.cnn.www. The column family "anchor" in Figure 1 stores the pages that link to this page (for example, the sites that link to the CNN homepage); the qualifier is the name of the referring site and the value is the link text. The column family "contents" stores the content of the web page, and it contains only one column with an empty qualifier, "contents:". In Figure 1, three versions of the page are kept in the "contents:" column; a specific one can be fetched with the key ("com.cnn.www", "contents:", t5).

Now let us look at the other properties mentioned by the authors: sparse, distributed, and persistent. Persistent means that Bigtable data is ultimately stored as files in GFS. Being built on GFS also means that Bigtable is distributed, although of course the meaning of "distributed" is not limited to that. Sparse means that different rows of a table may have completely different columns.

3. Supporting technologies

Bigtable relies on several other Google technologies: it uses GFS to store logs and data files, stores data in the SSTable file format, and uses Chubby to manage metadata. For details about GFS, see "The Google File System", another of Google's "three treasures"; Bigtable's data and logs are all written to GFS.

SSTable stands for Sorted Strings Table. It is an immutable, sorted key-value map that supports lookup and iteration. Each SSTable consists of a sequence of blocks; Bigtable sets the block size to 64 KB by default. A block index is stored at the end of the SSTable, and when an SSTable is accessed, the whole index is read into memory. The Bigtable paper does not describe the concrete structure of an SSTable; an article on the SSTable format of LevelDB is a useful reference here, since LevelDB comes from the designers of Bigtable. Each tablet is stored in GFS in the SSTable format, and one tablet may correspond to multiple SSTables.
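As a rough picture of this layout (sorted entries grouped into blocks, with a block index at the end that is read into memory), here is a toy JavaScript sketch. It is not the real SSTable format of Bigtable or LevelDB; the block size, field names, and in-memory representation are all made up for the illustration.

// Toy "SSTable": sorted key-value entries grouped into fixed-size blocks,
// plus an index (last key and block number of each block) kept at the end.
// A real SSTable stores serialized bytes in a file; here everything is in memory.
const BLOCK_ENTRIES = 2;                  // stand-in for Bigtable's default 64 KB block size

function buildSSTable(sortedEntries) {
  const blocks = [];
  const index = [];                       // [{ lastKey, blockNo }]
  for (let i = 0; i < sortedEntries.length; i += BLOCK_ENTRIES) {
    const block = sortedEntries.slice(i, i + BLOCK_ENTRIES);
    index.push({ lastKey: block[block.length - 1][0], blockNo: blocks.length });
    blocks.push(block);
  }
  return { blocks, index };               // immutable once built
}

// Lookup: the in-memory index tells us which single block may hold the key,
// so only that block needs to be read (from GFS, in the real system).
function sstableGet(sst, key) {
  const entry = sst.index.find(e => e.lastKey >= key);
  if (!entry) return null;
  const hit = sst.blocks[entry.blockNo].find(([k]) => k === key);
  return hit ? hit[1] : null;
}

const sst = buildSSTable([
  ["com.cnn.www/anchor:cnnsi.com", "CNN"],
  ["com.cnn.www/contents:", "<html>..."],
  ["com.example.www/contents:", "<html>..."]
]);
console.log(sstableGet(sst, "com.cnn.www/contents:"));    // "<html>..."
console.log(sstableGet(sst, "com.nosuch.www/contents:")); // null

The point of the block index is that a lookup touches at most one block, so only that block has to be fetched from the underlying file.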
Chubby is a highly available distributed lock service. A Chubby cell has five active replicas, only one of which is elected master and serves requests; the Paxos algorithm keeps the replicas consistent. Chubby provides a namespace consisting of directories and files, and each directory or file can be used as a lock. A Chubby client must maintain a session with the service; if the client's session expires, it loses all of its locks. For more about Chubby, see another Google paper: "The Chubby lock service for loosely-coupled distributed systems". Bigtable uses Chubby for tablet location, monitoring of tablet server status, storing access control lists, and other tasks.

4. A Bigtable cluster

A Bigtable cluster has three main parts: a library linked into every client, one master server, and many tablet servers.

As described in the data model section, Bigtable partitions tables into tablets, and each tablet is kept to roughly 100-200 MB. Once a tablet goes outside this range it is split into smaller tablets, or merged with others into a larger tablet. Each tablet server is responsible for a certain number of tablets: it handles read and write requests for its tablets and splits or merges tablets as needed. Tablet servers can be added or removed at any time according to load. A tablet server does not really store the data itself; it is essentially a proxy between Bigtable clients and GFS, and client data operations reach GFS indirectly through this tablet server proxy.

The master server is responsible for assigning tablets to tablet servers, monitoring the addition and removal of tablet servers, balancing the load across tablet servers, and handling schema changes such as creating tables and column families. Note that the master stores no tablets, provides no data service, and does not even provide tablet location information. When a client needs to read or write data it contacts the tablet servers directly; because clients do not need to ask the master for tablet locations, most clients never talk to the master at all, and the master is usually very lightly loaded.

5. Tablet location

As just mentioned, the master does not provide tablet location information, so how does the client find the tablets? Let us look at the location scheme given in the paper: Bigtable uses a three-level, B+-tree-like structure to store tablet location information.

The first level is a Chubby file. This level is a single file in Chubby that stores the location of the root tablet. The file is part of the Chubby service; if it becomes unavailable, the location of the root tablet is lost and the whole of Bigtable is unavailable.

The second level is the root tablet. The root tablet is in fact the first tablet of the METADATA table, and it stores the locations of the other tablets of the METADATA table. The root tablet is special: to keep the depth of the tree constant, it is never split.

The third level consists of the other METADATA tablets, which together with the root tablet make up the complete METADATA table. Each METADATA tablet contains the locations of many user tablets.

So the whole location system really has only two parts: a Chubby file and the METADATA table. Note that although the METADATA table is special, it still follows the data model described earlier, and each of its tablets is likewise served by a dedicated tablet server. This is why the master does not need to provide location information.

The client caches tablet locations. If the location of a tablet is not in the cache, the client has to walk the three-level structure, which takes one access to the Chubby service and two accesses to tablet servers.

6. Tablet storage and access

Tablet data is ultimately written to GFS; the physical form of a tablet in GFS is a set of SSTable files. Figure 5 of the paper shows the basics of read and write operations.

When a tablet server receives a write request, it first checks that the request is well-formed and authorized. If the request is valid, the write is appended to the commit log, and the data is then inserted into the memtable, a sorted in-memory buffer (roughly an in-memory counterpart of an SSTable). When the memtable grows to a certain size it is frozen, Bigtable creates a new memtable, and the frozen memtable is converted to SSTable format and written to GFS; this operation is called a minor compaction.

When a tablet server receives a read request, it likewise checks that the request is valid. If so, the read is executed on a merged view of all the SSTable files and the memtable; because the SSTables and the memtable are both sorted, the merged view can be formed quickly.

Every minor compaction produces a new SSTable file, and reading across too many SSTables hurts efficiency, so Bigtable periodically performs a merging compaction, which merges a few SSTables and the memtable into a single new SSTable. A merging compaction that rewrites all SSTables into exactly one new SSTable is called a major compaction. Unfortunately, the Bigtable authors do not describe the detailed data structures of the memtable and SSTable.
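The write and read path just described can be sketched as follows. This is an illustration only, not real tablet server code: the commit log is omitted, the names and the memtable size threshold are invented, and plain Maps stand in for SSTable files in GFS.

// Illustrative tablet-server write/read path: writes go into an in-memory memtable;
// a full memtable is frozen into an immutable "SSTable" (minor compaction);
// reads consult the memtable first, then SSTables from newest to oldest.
const MEMTABLE_LIMIT = 2;        // stand-in for the real memtable size threshold

const tablet = { memtable: new Map(), sstables: [] };   // sstables: newest last

function write(tablet, key, value) {
  // (In the real system the mutation is first appended to a commit log in GFS.)
  tablet.memtable.set(key, value);
  if (tablet.memtable.size >= MEMTABLE_LIMIT) minorCompaction(tablet);
}

// Minor compaction: freeze the memtable as a sorted immutable map and start a new one.
function minorCompaction(tablet) {
  const frozen = new Map([...tablet.memtable].sort(([a], [b]) => (a < b ? -1 : 1)));
  tablet.sstables.push(frozen);
  tablet.memtable = new Map();
}

// Read: merged view -- the memtable shadows SSTables, newer SSTables shadow older ones.
function read(tablet, key) {
  if (tablet.memtable.has(key)) return tablet.memtable.get(key);
  for (let i = tablet.sstables.length - 1; i >= 0; i--) {
    if (tablet.sstables[i].has(key)) return tablet.sstables[i].get(key);
  }
  return null;
}

write(tablet, "com.cnn.www/contents:", "v1");
write(tablet, "com.cnn.www/anchor:cnnsi.com", "CNN");      // triggers a minor compaction
write(tablet, "com.cnn.www/contents:", "v2");
console.log(read(tablet, "com.cnn.www/contents:"));        // "v2"  (from the memtable)
console.log(read(tablet, "com.cnn.www/anchor:cnnsi.com")); // "CNN" (from an SSTable)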
7. The relationship between Bigtable and GFS

A cluster contains the master server and the tablet servers: the master assigns tablets to tablet servers, while the actual data service is handled entirely by the tablet servers. But do not make the mistake of thinking that a tablet server really stores the data (other than the memtable in memory): the real location of the data is known only to GFS. When the master assigns a tablet to a tablet server, what actually happens is that the tablet server obtains the names of all the SSTable files belonging to that tablet. The tablet server can then use some indexing mechanism to determine which SSTable file holds the data it needs, and read that SSTable's data from GFS; the SSTable file itself may be spread across several chunkservers.

The METADATA table is a special table used for locating data, i.e. for the metadata service. The Bigtable paper, however, gives only a few clues about it and does not describe its concrete structure. Here I try to guess the structure of the table from those clues. First, the clues from the paper:

1. "The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row."
2. "Each METADATA row stores approximately 1KB of data in memory." (Because it is accessed so heavily, the METADATA table is kept in memory; this optimization is mentioned in the paper's section on locality groups: "This feature is useful for small pieces of data that are accessed frequently: we use it internally for the location column family in the METADATA table.")
3. "We also store secondary information in the METADATA table, including a log of all events pertaining to each tablet (such as when a server begins serving it)."

The first clue says that the METADATA row key is an encoding of a tablet's table identifier and its end row. So each tablet occupies one record (one row) in the METADATA table, and the row key carries both the table the tablet belongs to and the range of rows it covers. Using the simplest possible encoding, the METADATA row key would be strcat(table name, row key of the tablet's last row).

The second clue tells us, besides the fact that the METADATA table is resident in memory, that it has a column family named "location". We already know that each row of the METADATA table represents one tablet, so why is a whole column family needed to store a location? Because each tablet may consist of several SSTable files, and a column family can hold the locations of any number of SSTable files. A reasonable guess is that each SSTable file's location occupies one column, with a column name like location:filename. Of course, there is no need to encode the complete file name in the column key; more likely, the SSTable file name is stored in the value. With the file names in hand, the data can be requested from GFS.

The third clue tells us that the METADATA table stores more than location information; in other words, it has column families other than "location". That data does not concern us here.
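A small JavaScript sketch of this guess: the METADATA row key is the table name concatenated with a tablet's end row, and the tablet owning a given row is found by taking the first METADATA row key that is greater than or equal to the search key. The encoding, the separator, and the values are my assumption, not something the paper specifies; the sample rows anticipate the webtable walkthrough below.

// Guessed METADATA encoding: one row per tablet, keyed by table name + end row.
// (A real encoding would need a proper separator/escaping; plain concatenation
// is enough for this illustration.)
function metadataRowKey(tableName, endRow) {
  return tableName + "." + endRow;
}

// METADATA rows sorted by row key; each value names the responsible tablet server
// (and, per the guess above, could also list the tablet's SSTable files).
const metadata = [
  [metadataRowKey("webtable", "com.baidu.www"),   { tabletServer: "ts1" }],
  [metadataRowKey("webtable", "com.douban.www"),  { tabletServer: "ts2" }],
  [metadataRowKey("webtable", "com.netease.www"), { tabletServer: "ts3" }]
];

// The tablet containing `row` is the one whose end row is the first end row >= row.
function locateTablet(metadata, tableName, row) {
  const searchKey = metadataRowKey(tableName, row);
  const hit = metadata.find(([key]) => key >= searchKey);
  return hit ? hit[1] : null;
}

console.log(locateTablet(metadata, "webtable", "com.baidu.tieba")); // { tabletServer: "ts1" }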
Based on the above, I drew a simplified diagram of the Bigtable structure. The diagram uses a webtable table as an example; the table stores pages from NetEase, Baidu, and Douban. Suppose we want to find the content of the Baidu Tieba page as of yesterday; the client sends Bigtable a query on webtable with ("com.baidu.tieba", "contents:", yesterday).

If the client does not have the tablet's location cached, Bigtable first goes to the tablet server holding the root tablet, to learn which METADATA tablet holds the location of the tablet the page belongs to. Searching the root tablet with METADATA.webtable.com.baidu.tieba as the row key, the first row key greater than or equal to it is METADATA.webtable.com.baidu.www, so the METADATA tablet we need is tablet A. The client then goes to the tablet server responsible for tablet A and searches it for webtable.com.baidu.tieba; again the first row key greater than or equal to it is webtable.com.baidu.www, so the page belongs to tablet B of the webtable table. Finally, the client accesses the tablet server responsible for tablet B and obtains the data.

Note that each tablet actually consists of several SSTable files plus a memtable, all of them sorted, so looking up data in tablet B may require searching all of its SSTables and the memtable. Also, the client presumably does not obtain SSTable file names directly from the METADATA table; it only learns which tablet server is responsible for the tablet, and accesses the SSTables through that tablet server acting as a proxy.

References
[1] Bigtable: A Distributed Storage System for Structured Data. In Proceedings of OSDI '06.
[2] Understanding HBase and BigTable.