GFS meets the needs of distributed file storage in some business scenarios, but naturally some services are awkward to build on a file system alone; they need a distributed database system. BigTable was created at Google to meet its internal needs for large-scale structured data processing. The key terms in the paper's abstract are:
1. Structured data
2. Large data volume
3. Typical applications: web indexing, Google Earth, and Google Finance
4. Batch processing and real-time requirements
5. Data model
First of all, note that the "structured data" here is not quite the same as structured data in a DBMS. The latter usually means well-typed values such as numbers and strings of modest length, organized under what is for the most part a relational model. Second, the "large data volume" is not the order of magnitude that DBMS people like to call a massive database: "massive" there means TB-scale, while big data here means PB and above (roughly the magnitude the OLAP literature talks about). At this scale, the typical applications are clearly beyond the range of traditional relational databases.
For batch-processing services, "batch" can be understood as a processing time far longer than a seconds-level response time. "Real-time" here, on the other hand, does not mean a real-time operating system; it should mean millisecond-level, or at most second-level, response time. Even this simple analysis shows that the exact meaning of many terms can only be pinned down in context; otherwise they are easy to misunderstand.
The data model is relatively easy to understand. Since BigTable claims to be a database, the core logical concept is the data model it supports. Section 2 states that a BigTable is a sparse, distributed, persistent, multidimensional sorted map. It is neither the relational model, nor an object model, nor any other traditional data model. The definition is a bit abstract, but it accurately describes the characteristics of BigTable.
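The paper pins the map down precisely: it is indexed as (row:string, column:string, time:int64) -> string. As a toy, single-machine stand-in for that shape (dropping "distributed" and "persistent", which the real system gets from tablets and GFS), something like the following sketch works; the row, columns, and values are invented for illustration:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// A toy, single-machine stand-in for the BigTable data model:
//   (row:string, column:string, time:int64) -> string
// std::map keeps keys sorted, mirroring BigTable's ordering by row key;
// a cell that was never written has no entry, which is what "sparse" means.
using CellKey  = std::tuple<std::string, std::string, std::int64_t>;
using ToyTable = std::map<CellKey, std::string>;

int main() {
  ToyTable webtable;  // after the paper's Webtable example; contents invented
  // Row "com.cnn.www": two timestamped versions of the page contents...
  webtable[{"com.cnn.www", "contents:", 1}] = "<html>v1</html>";
  webtable[{"com.cnn.www", "contents:", 2}] = "<html>v2</html>";
  // ...and one anchor cell in a different column family.
  webtable[{"com.cnn.www", "anchor:cnnsi.com", 9}] = "CNN";
  return 0;
}
```

Column families, tablets, and locality groups are then layered on top of this one sorted map.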
For any large piece of data management software, we have some universal concerns, for example:
1. Data model
2. Programming interface
3. Dependent infrastructure and components
4. Optimizations in the implementation
5. Performance data and typical scenarios
This is also the content of the subsequent chapters of the paper.
When studying BigTable's data model and implementation, it helps to draw analogies with the relational model and its implementations. Consider:
1. How it differs from the relational model
2. Does it support ACID?
3. Is the data organized like heap files or clustered B+ trees?
4. Is it indexed?
5. Schema definition
6. Data partitioning (vertical and horizontal)
7. Permission control
8. Multi-versioning of rows, etc. (a toy sketch follows this list)
More importantly, what is its concurrency control mechanism? The more it differs from a traditional database on these basic issues, the less it looks like a database :-).
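On point 8, the paper's answer is concrete: each cell can hold multiple versions of the same data indexed by timestamp, stored newest-first, and old versions are garbage-collected per column family (keep only the last n versions, or only versions newer than some age). A minimal sketch of that per-cell behavior, with the helper name invented:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <iterator>
#include <map>
#include <string>

// Toy model of one cell's version history: timestamp -> value,
// ordered newest-first, so the first entry is the latest version.
using Versions = std::map<std::int64_t, std::string, std::greater<std::int64_t>>;

// Hypothetical GC policy from the paper: keep only the newest n versions.
void GarbageCollect(Versions& v, std::size_t keep_last_n) {
  while (v.size() > keep_last_n)
    v.erase(std::prev(v.end()));  // newest-first order: last element is oldest
}

int main() {
  Versions cell;
  cell[100] = "v1";
  cell[200] = "v2";
  cell[300] = "v3";
  std::cout << "latest: " << cell.begin()->second << "\n";  // prints "v3"
  GarbageCollect(cell, 2);  // only timestamps 300 and 200 survive
  return 0;
}
```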
For a data management system, the supported operations/APIs should at least include:
1. Schema definition (table creation, modification, and deletion)
2. Data manipulation (insert, delete, update, and query)
3. Permission control (granting and revoking permissions)
How do users actually use it? Among these APIs, the most complex and interesting part is querying data (a toy illustration follows the list below):
1. Full table scan
2. Point query
3. Range query
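To make the three query shapes concrete, here is a toy illustration against a sorted map like the earlier data-model sketch; the rows and values are invented, and a real client would of course go through BigTable's RPC API (the paper's own figures show C++ client code built around RowMutation for writes and a Scanner for reads):

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>

using CellKey  = std::tuple<std::string, std::string, std::int64_t>;
using ToyTable = std::map<CellKey, std::string>;

int main() {
  ToyTable t;
  t[{"com.aaa", "contents:", 1}] = "a";
  t[{"com.cnn.www", "anchor:cnnsi.com", 1}] = "CNN";
  t[{"com.cnn.www", "contents:", 2}] = "<html>...</html>";
  t[{"org.zzz", "contents:", 1}] = "z";

  // Point query: one exact (row, column, timestamp) key.
  auto it = t.find({"com.cnn.www", "contents:", 2});
  if (it != t.end()) std::cout << "point: " << it->second << "\n";

  // Range query over row keys in ["com.b", "org"): two binary searches
  // plus a sequential walk, cheap because the keys are sorted.
  auto lo = t.lower_bound({"com.b", "", 0});
  auto hi = t.lower_bound({"org", "", 0});
  for (auto i = lo; i != hi; ++i)
    std::cout << "range: " << std::get<0>(i->first) << "\n";

  // Full table scan: walk everything in key order.
  for (const auto& [key, value] : t)
    std::cout << "scan: " << std::get<0>(key) << " = " << value << "\n";
  return 0;
}
```

The value of the sorted order shows up in the range query: it is just two searches plus a sequential walk, which is also why BigTable keeps rows sorted by key.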
When reading Section 3, we can carry these questions along: where does BigTable's API intersect with what a traditional DBMS provides, and what does it leave out?
The vast majority of real systems do not start from scratch; they stand on the shoulders of giants. Many distributed file systems are built on local file systems, and many databases store their data in files. BigTable is no exception. However, it depends on more infrastructure and components than I expected, and each of them matters: GFS stores the data files and log files; a cluster management system schedules jobs, manages resources, and handles faults; the SSTable (Sorted String Table) format defines the data files; and Chubby provides a distributed lock service. Chubby is so important, and complex enough, that it has a separate paper of its own.
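On SSTable specifically: the paper describes it as a persistent, ordered, immutable map from keys to values, stored as a sequence of blocks with a block index at the end of the file; the index is loaded into memory when the SSTable is opened, so a lookup costs at most one disk seek. A much-simplified, in-memory sketch of the lookup idea (real SSTables live in GFS, and the class name here is invented):

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Much-simplified SSTable: an immutable, sorted run of (key, value)
// pairs. A real SSTable is a block-structured file in GFS whose block
// index is loaded into memory on open; here one sorted vector stands
// in for both the in-memory index and the on-disk blocks.
class ToySSTable {
 public:
  // Entries must arrive pre-sorted by key; immutability means a table
  // is built exactly once and never modified afterwards.
  explicit ToySSTable(std::vector<std::pair<std::string, std::string>> entries)
      : entries_(std::move(entries)) {}

  // Binary search, mirroring "binary search the in-memory block index,
  // then read one block from disk".
  std::optional<std::string> Lookup(const std::string& key) const {
    auto it = std::lower_bound(
        entries_.begin(), entries_.end(), key,
        [](const auto& entry, const std::string& k) { return entry.first < k; });
    if (it != entries_.end() && it->first == key) return it->second;
    return std::nullopt;
  }

 private:
  const std::vector<std::pair<std::string, std::string>> entries_;
};
```

Because SSTables are immutable, updates accumulate in fresh tables and a memtable, and the compactions discussed later merge them back down.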
With the data model and programming interface defined and the infrastructure in place, the remaining heavy lifting is system implementation and optimization. BigTable consists of three groups of modules: the client library, one Master server, and many tablet servers. For details, see Section 5, and note in particular:
1. The Master's responsibilities
2. The tablet servers' responsibilities
3. Tablet location management (why three levels; lookup efficiency; see the sketch after this list)
4. How the Master tracks the life and death of each tablet server
5. Special handling of the METADATA table
6. Tablet creation, deletion, and merging
7. Commit logs
8. The three kinds of compaction (minor, merging, and major)
9. Recovery, and so on.
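On point 3: the hierarchy has a file in Chubby pointing at the root tablet, the root tablet indexing all METADATA tablets, and METADATA tablets indexing the user tablets; the paper notes this is enough to address 2^34 tablets with 128 MB METADATA tablets. A sketch of the client-side resolution, with every name invented and stubs standing in for the RPCs:

```cpp
#include <iostream>
#include <string>

// Hypothetical location record: which tablet server currently serves
// the tablet covering a given key. All names here are invented.
struct TabletLocation { std::string server_address; };

// Stubs standing in for the real RPCs; each level answers
// "which tablet covers this key, and where does it live?".
TabletLocation ReadRootLocationFromChubby() {            // level 0: Chubby file
  return {"root-tablet-server:9000"};
}
TabletLocation LookupInRootTablet(const TabletLocation&,
                                  const std::string&) {  // level 1: root tablet
  return {"metadata-tablet-server:9000"};
}
TabletLocation LookupInMetadataTablet(const TabletLocation&,
                                      const std::string&) {  // level 2: METADATA
  return {"user-tablet-server:9000"};
}

// Three network round trips on a cold cache.
TabletLocation LocateUserTablet(const std::string& table,
                                const std::string& row_key) {
  const std::string meta_key = table + ":" + row_key;  // paper: (table id, end row)
  TabletLocation root = ReadRootLocationFromChubby();
  TabletLocation meta = LookupInRootTablet(root, meta_key);
  return LookupInMetadataTablet(meta, meta_key);
}

int main() {
  std::cout << LocateUserTablet("webtable", "com.cnn.www").server_address << "\n";
}
```

Clients cache tablet locations aggressively, so the full walk only happens on a cold or stale cache; that caching is what makes the three levels cheap in practice.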
There are many details here, and they need to be digested slowly. Once performance optimization comes in, the paper branches out into general-purpose algorithms and techniques such as compression and Bloom filters (a toy Bloom filter closes these notes, below). Some parts I have not yet worked through, and my understanding is still superficial; I will continue to learn later...
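On Bloom filters: the paper uses them so that a read can ask whether an SSTable might contain data for a given row/column pair; most lookups for absent rows then cost zero disk reads, at the price of a small false-positive rate. A minimal sketch, with the bit-array size, probe count, and hash mixing all chosen arbitrarily:

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Minimal Bloom filter over strings: k hash probes into an m-bit array.
// "Maybe present" can be a false positive; "absent" is always correct,
// which is exactly the property that lets a read skip an SSTable.
class ToyBloomFilter {
 public:
  void Add(const std::string& key) {
    for (std::size_t i = 0; i < kProbes; ++i) bits_.set(Probe(key, i));
  }
  bool MaybeContains(const std::string& key) const {
    for (std::size_t i = 0; i < kProbes; ++i)
      if (!bits_.test(Probe(key, i))) return false;  // definitely absent
    return true;  // possibly present
  }

 private:
  static constexpr std::size_t kBits = 1 << 16;
  static constexpr std::size_t kProbes = 4;
  // Derive k probe positions by salting one base hash; good enough for
  // a sketch, though real filters use better-mixed hash families.
  static std::size_t Probe(const std::string& key, std::size_t i) {
    const std::size_t h = std::hash<std::string>{}(key);
    return (h ^ (0x9e3779b97f4a7c15ULL * (i + 1))) % kBits;
  }
  std::bitset<kBits> bits_;
};
```

A tablet server would keep one such filter per SSTable (per locality group, in the paper's setup) and consult it before touching disk.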