Spanner's design reflects the accumulation and precipitation of Google's experience in the field of distributed storage systems over the years, using the Megastore data model, chubby data replication and consistency algorithms, and the use of bigtable technologies in the scalability of the data. The novelty is that it uses a local clock with high precision and observable error to determine the sequence of events in a distributed system. Spanner represents a new trend in the field of distributed database--newsql.
Spanner is Google's recently released new generation of distributed database, which has both the scalability of the NoSQL system and the function of the relational database. For example, it supports SQL-like query languages, support table joins, and support transactions (including distributed transactions). Spanner can replicate a single piece of data to multiple data centers worldwide and ensure data consistency. A set of spanner clusters can scale up to hundreds of data centers, millions of servers, and up to T database records. At present, the Google advertising business of the background (F1) has migrated from the MySQL Sub-Library table scheme to the spanner.
Data model
Traditional RDBMS (such as MySQL) adopt a relational model and have rich functionality to support SQL query statements. But NoSQL database is to add the limited function on the Key-value storage, such as column index, range query, but have good expansibility. Spanner inherits the Megastore design, the data model is between RDBMS and NoSQL, provides a tree-like, hierarchical database schema, supports the SQL query language, provides the characteristics of relational databases such as table joins, and functions like RDBMS On the other hand, all records in the entire database are stored in the same key-value large table, which is similar to BigTable, and has the scalability of NoSQL system.
In spanner, applications can create multiple tables in a single database, and you need to specify the hierarchical relationships between those tables. For example, the two tables created in Figure 1-the user table (users) and the album Table (albums), and the specified user table are the parent of the album table. There is a one-to-many relationship between the parent node and the child node, and a record in the user table (a user) corresponds to multiple records (multiple albums) in the album table. In addition, the primary key of a child node must be prefixed with the primary key of the parent node. For example, the primary key (user ID) of a user table is the prefix of the album table primary key (user id+ album ID).
Figure 1 Schema example, hierarchical relationships between tables, records sorted by interleaved storage
It is obvious that all table primary keys prefix The primary key of the root node, spanner a record in the root node table and a collection of all records in other tables prefixed with its primary key as a Directory. For example, a user's record and the records of all of the users ' albums make up a directory. Directory is the basic unit of partition, replication, and migration of data in spanner, which allows you to specify how many copies of a directory to store in each room, such as storing the user's directory in several rooms near the user's area.
Such a data model has the following benefits.
The primary key for all records in a directory has the same prefix. When stored to the underlying key-value large table, it is assigned to the adjacent location. If the volume of data is not very large, it is located on the same node, which not only improves the locality of data access, but also ensures that transactions occurring in one directory are stand-alone.
Directory also implements partitioning of data from fine-grained granularity. The entire database is divided into millions or more directory, and each directory can define its own replication strategy. This directory-based data partitioning method is finer than the table-based granularity of the MySQL partitioned table, and is thicker than the row-based in the Yahoo! Pnuts system.
Directory provides an efficient way of table join operations. In a directory, records on multiple tables are sorted by primary key, interleaved (interleaved), so that the table join operation without sorting can be merged directly between the tables.
Replication and consistency
Spanner uses the Paxos protocol to synchronize redo logs between multiple replicas, ensuring that the data is consistent across multiple replicas. Google engineers love the Paxos protocol, Chubby, Megastore and spanner a range of products are based on the Paxos agreement to achieve consistency.
Paxos's basic protocol is simple. There are three roles in the protocol: proposer, acceptor and learner,learner and proposer are both reader and writer, acceptor the equivalent of storage nodes. As the entire protocol describes, when there are multiple proposer and acceptor in the system, each proposer writes a variable that initiates a round of resolution (Paxos instance), as shown in Figure 2. The resolution process ensures that even if multiple proposer are written at the same time, the result will not be inconsistent on the acceptor node. To be exact, once a proposer commits a value that is accepted by most acceptor, the value is selected, and the variable is no longer modified to another value during the entire round of resolution. If another proposer is to write another value, the next round of resolution must be started, and the resolution process is serial (serializable).
Figure 2 Paxos protocol Normal execution process