A tablet is a horizontal partition of a Kudu table, similar to a tablet in Google BigTable or a region in HBase. Each tablet stores the rows for one contiguous range of keys, and the key ranges of any two tablets do not overlap. Together, the tablets of a table cover the table's entire key space.
A tablet is made up of rowsets, and each rowset holds a set of rows. Rowsets are disjoint: no row appears in more than one rowset, so any given row lives in exactly one rowset. Although the rowsets themselves are disjoint, the key ranges of two rowsets may overlap.
Handling Insertions
One rowset is kept in memory. It is called the MemRowSet, and each tablet has exactly one. The MemRowSet is an in-memory B-tree sorted by the table's primary key. All inserts are written directly into the MemRowSet. Thanks to MVCC (multi-version concurrency control, described below), data becomes visible to subsequent readers as soon as it is written to the MemRowSet.
Note: Unlike BigTable, Kudu sends only inserts, and mutations of rows that have not yet been flushed, to the MemRowSet. Mutations such as updates and deletes of rows already on disk are described below.
Each row exists in exactly one MemRowSet entry, which consists of a special header followed by the actual row data. Because the MemRowSet lives only in memory, it eventually fills up and is flushed to disk as one or more DiskRowSets (details are provided below).
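The insert path described above can be sketched in a few lines of Python. This is only an illustration: a sorted list stands in for the B-tree, and the class and method names are hypothetical, not Kudu's API.

```python
from bisect import insort

class MemRowSet:
    """Sorted in-memory store; a sorted key list stands in for the B-tree."""

    def __init__(self):
        self._keys = []   # primary keys, kept in sorted order
        self._rows = {}   # key -> (insert timestamp, row data)

    def insert(self, key, row, ts):
        if key in self._rows:
            raise KeyError(f"duplicate key: {key!r}")
        insort(self._keys, key)
        # The timestamp plays the role of the entry header.
        self._rows[key] = (ts, row)

    def scan(self):
        """Yield (key, row) pairs in primary key order."""
        for k in self._keys:
            yield k, self._rows[k][1]

mrs = MemRowSet()
for ts, key in enumerate(["m", "a", "z"], start=1):
    mrs.insert(key, {"val": ts}, ts)
keys_in_order = [k for k, _ in mrs.scan()]
```

Whatever order the rows arrive in, a scan returns them sorted by primary key, which is what later allows a flush to write them out in key order.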
MVCC Overview
Kudu uses multi-version concurrency control (MVCC) to provide several useful features:
- Snapshot scanners: when a scanner is created, it operates on a point-in-time snapshot of the tablet. Any updates made to the tablet while the scan is running are ignored. In addition, a point-in-time snapshot can be stored and reused by other scans, for example by an application that runs several analytic queries over one consistent view of the data.
- Time-travel scanners: like the snapshot scanner above, except that the user specifies a point in time in the past, and MVCC guarantees a consistent snapshot as of that time. This can be used for point-in-time consistent backups.
- Change-history queries: given two MVCC snapshots, the user can query the set of changes made between them. This can be used for incremental backups, cross-cluster synchronization, or offline audit analysis.
- Multi-row atomic updates within a tablet: a single operation (mutation) can modify multiple rows in a tablet, and the changes become visible in one atomic step.
To provide MVCC, every operation (mutation) carries a timestamp. Timestamps are generated by a clock instance shared across the tablet server (TS-wide), and the tablet's MvccManager ensures that timestamps within the tablet are unique. The MvccManager determines the timestamp at which data is committed, so that scans started after that point can see the newly committed data. When a scanner is created, it takes a snapshot of the MvccManager's state (an MvccSnapshot), and all data visible to that scanner is compared against this MvccSnapshot to determine which inserts, updates, and deletes should be visible.
Timestamps within each tablet increase monotonically. Kudu uses the HybridTime technique to create timestamps that are consistent across nodes.
To support snapshot and time-travel scans, multiple versions of the data must be stored. To keep space usage from growing without bound, users can configure a retention period, and records older than that are garbage-collected (this also keeps a query from having to start from the very oldest version).
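The relationship between timestamps, commits, and scanner snapshots can be illustrated with a toy model. The class below borrows the name MvccManager from the text, but its methods and its use of a simple counter as the clock are assumptions made for this sketch, not Kudu's implementation.

```python
import itertools

class MvccManager:
    """Toy timestamp tracker; behaviour is assumed, not taken from Kudu."""

    def __init__(self):
        self._clock = itertools.count(1)   # stand-in for the TS-wide clock
        self._committed = set()

    def start_op(self):
        """Hand out a unique, monotonically increasing timestamp."""
        return next(self._clock)

    def commit(self, ts):
        self._committed.add(ts)

    def snapshot(self):
        """Immutable view of what is committed right now."""
        return frozenset(self._committed)

mvcc = MvccManager()
t1, t2 = mvcc.start_op(), mvcc.start_op()
mvcc.commit(t1)
snap = mvcc.snapshot()    # taken while t2 is still in flight
mvcc.commit(t2)           # commits after the snapshot do not affect it
visible = [ts for ts in (t1, t2) if ts in snap]
```

A scanner holding `snap` keeps seeing only the first operation, no matter what is committed afterwards.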
MVCC Mutations in MemRowSet
To support MVCC in the MemRowSet, each inserted row is tagged with the timestamp of its insertion. Each row also holds a pointer to a singly linked list of the mutations applied to it afterwards, and each mutation carries its own timestamp.
In traditional relational database terminology, this ordered list of mutations can be thought of as a "REDO log".
A reader scanning a row in the MemRowSet must apply these mutations to reconstruct the correct snapshot. The logic is as follows:
- If the row's insertion timestamp is not in the scanner's MVCC snapshot (that is, the snapshot's timestamp is earlier than the insertion, so the row did not yet exist), skip the row.
- Otherwise, copy the row into the output buffer.
- Walk the list of mutations:
  - If the mutation's timestamp is in the MVCC snapshot, apply the change to the row in the output buffer. Otherwise, skip the mutation.
  - If the mutation is a DELETE, mark the row as deleted in the output buffer and clear the data previously loaded into the buffer.
Note that a mutation can be any of the following:
- Update: changes the value of one or more columns in a row
- Delete: removes a row
- Reinsert: inserts a row again (this can happen only if the row was previously removed by a DELETE mutation and is still in the MemRowSet)
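As a concrete example, consider the following sequence of operations on a table t with the schema (key STRING, val UINT32):
- INSERT INTO t VALUES ("row", 1); [timestamp 1]
- UPDATE t SET val = 2 WHERE key = "row"; [timestamp 2]
- DELETE FROM t WHERE key = "row"; [timestamp 3]
- INSERT INTO t VALUES ("row", 3); [timestamp 4]
In the MemRowSet, this row is stored as a single entry (the original insert at timestamp 1) followed by a chain of three mutations: the update at timestamp 2, the delete at timestamp 3, and the reinsert at timestamp 4.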
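The row from this example, and the reader logic that walks its mutation chain, can be sketched as follows. The sketch makes two simplifying assumptions: a Python list stands in for the linked list, and the scanner's snapshot is modelled as a single timestamp (everything at or before it counts as committed). The type and function names are illustrative only.

```python
from dataclasses import dataclass, field

@dataclass
class Mutation:
    ts: int
    kind: str                                   # "UPDATE", "DELETE" or "REINSERT"
    values: dict = field(default_factory=dict)

@dataclass
class MemRow:
    insert_ts: int
    data: dict
    mutations: list = field(default_factory=list)   # kept in timestamp order

def read_row(row, snapshot_ts):
    """Return the row as of snapshot_ts, or None if it does not exist then."""
    if row.insert_ts > snapshot_ts:
        return None                      # row was not yet inserted
    out = dict(row.data)                 # output buffer
    for m in row.mutations:
        if m.ts > snapshot_ts:
            continue                     # mutation is not in the snapshot
        if m.kind == "DELETE":
            out = None                   # clear the buffer
        elif m.kind == "REINSERT":
            out = dict(m.values)
        else:                            # UPDATE carries only changed columns
            out.update(m.values)
    return out

row = MemRow(1, {"key": "row", "val": 1}, [
    Mutation(2, "UPDATE", {"val": 2}),
    Mutation(3, "DELETE"),
    Mutation(4, "REINSERT", {"key": "row", "val": 3}),
])
results = [read_row(row, ts) for ts in range(5)]
```

Reading at timestamps 0 through 4 walks through the row's whole life: absent, val 1, val 2, deleted, val 3.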
Note that this structure has the following drawbacks when updates are very frequent:
- Readers must chase pointers through the linked list, causing many CPU cache misses.
- Each update must be appended to the end of the linked list, so an update costs O(n) time.
Despite these inefficiencies, we consider the design acceptable under the following assumptions:
- Kudu targets workloads with a relatively low update rate, so we assume that rows are not updated very frequently.
- Only a small fraction of the data resides in the MemRowSet: once the MemRowSet reaches a size threshold, it is flushed to disk. So even if scanning MemRowSet mutations is slow, it accounts for only a small fraction of overall query time.
If the inefficiencies mentioned above turn out to affect real applications, there are many optimizations that could reduce the cost.
MemRowSet Flushes
When the MemRowSet fills up, a flush is triggered, which writes its data to disk.
The data is flushed to disk as a set of CFiles (see src/kudu/cfile/README). Each row is identified by an ordinal rowid that is dense, immutable, and unique within its DiskRowSet. For example, if a DiskRowSet contains 5 rows, they are assigned rowids 0 through 4 in ascending key order. Different DiskRowSets contain different rows, but they may reuse the same rowids.
On reads, the system uses an index structure to map the user-visible primary key to the internal rowid. When the primary key is a simple (single-column) key, as in the example above, this index is embedded in the CFile of the primary key column. Otherwise, a separate index CFile stores the encoded compound key and provides the same lookup.
Note: The rowid is not stored explicitly with each row. It is an implicit identifier derived from the row's ordinal position in the CFile. Parts of the source code refer to rowids as "row indexes" or "ordinal indexes".
Note: Other systems, such as C-Store, call the MemRowSet the "write-optimized store" (WOS) and the DiskRowSet the "read-optimized store" (ROS).
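To make the idea of implicit rowids concrete, here is a minimal sketch of a flush and a key-to-rowid lookup. It assumes a single-column key and uses plain Python lists as columns; the function names are hypothetical and this is not the CFile format.

```python
from bisect import bisect_left

def flush(mem_rows):
    """Turn {key: value} into column lists; list position is the rowid."""
    keys = sorted(mem_rows)
    return {"key": keys, "val": [mem_rows[k] for k in keys]}

def lookup_rowid(disk_rowset, key):
    """Binary-search the key column; return the rowid, or None if absent."""
    keys = disk_rowset["key"]
    i = bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else None

drs = flush({"d": 40, "a": 10, "c": 30, "b": 20, "e": 50})
rowid = lookup_rowid(drs, "c")
value = drs["val"][rowid]
```

The five rows receive rowids 0 through 4 in ascending key order, and the rowid is never stored: it is simply the position in each column.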
Historical MVCC in DiskRowSets
To give on-disk data MVCC semantics, each on-disk rowset contains not only the current version of each row but also UNDO records, from which earlier versions of the row can be reconstructed.
When a user reads the most recent version of the data just after a flush, only the base data needs to be read. Because the base data is stored in columnar format, this kind of query is very fast. If instead the user runs a time-travel query, the reader must roll the row back to its state at the requested point in time, which requires the UNDO records.
When a scanner reads a row, it processes the MVCC information as follows:
- Read the base data.
- Loop over each UNDO record: if the associated operation's timestamp is not committed in the scanner's snapshot, apply the rollback. That is, when the snapshot timestamp is earlier than the mutation's timestamp, the mutation has not yet happened from the scanner's point of view.
For example, recall the four operations (insert, update, delete, reinsert) from the example in the earlier section on MVCC mutations in the MemRowSet.
When this row is flushed to disk, the base data holds its latest version, ("row", 3), and each of the four operations is recorded as an UNDO record.
Each UNDO record is the inverse of the operation that was originally performed. For example, the first INSERT becomes a DELETE in its UNDO record. The UNDO record preserves the timestamp of the insert or update: if a scanner's MVCC snapshot is earlier than Tx 1, then Tx 1 is not yet committed from its point of view, so the DELETE is applied and the row does not exist.
The same procedure works for scanners at other points in time: each one applies exactly the UNDO records needed to produce the row as of its snapshot.
The most common case is querying the latest data. We optimize for this case so that it does not have to process any UNDO records. To do so, file-level metadata records the range of transactions covered by the UNDO records. If every one of those transactions is already committed in the scanner's MVCC snapshot (that is, the query reads the latest data), the set of deltas is short-circuited (no UNDO record is processed), and the query incurs no MVCC overhead.
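The rollback procedure can be sketched for the example row. The UNDO records below are derived by inverting the four operations from the earlier example; the tuple layout is an assumption made for illustration and is not Kudu's on-disk format. The snapshot is again modelled as a single timestamp.

```python
# Base data after the flush, plus UNDO records ordered newest first.
# Each record says how to step back to the state before that transaction.
BASE = {"key": "row", "val": 3}
UNDOS = [
    (4, "DELETE", None),                        # before tx 4 the row was deleted
    (3, "REINSERT", {"key": "row", "val": 2}),  # before tx 3 it held val = 2
    (2, "UPDATE", {"val": 1}),                  # before tx 2 val was 1
    (1, "DELETE", None),                        # before tx 1 the row did not exist
]

def read_as_of(snapshot_ts, base=BASE, undos=UNDOS):
    """Roll the base row back to its state at snapshot_ts."""
    row = dict(base)
    for ts, kind, values in undos:
        if ts <= snapshot_ts:
            break                 # this and all older transactions are committed
        if kind == "DELETE":
            row = None
        elif kind == "REINSERT":
            row = dict(values)
        else:
            row.update(values)
    return row

history = [read_as_of(ts) for ts in range(5)]
```

A scanner at timestamp 4 stops at the first record and returns the base data untouched, which mirrors the short-circuit for reads of the latest data.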
Handling mutations against on-disk files
Updates and deletes of rows that have already been flushed to disk do not go through the MemRowSet. Instead, the system first has to determine which rowset contains the key being updated or deleted, so it consults all rowsets. It first uses an interval tree to locate the set of rowsets whose key range could contain the key. It then checks the Bloom filter of each candidate rowset to determine whether the rowset may contain the key. For each rowset that passes both checks, the system looks up the rowid corresponding to the primary key.
Once the rowset containing the row has been identified, the mutation is given the rowid corresponding to the primary key and is written to an in-memory structure called the DeltaMemStore.
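The lookup sequence (key range check, then Bloom filter, then rowid lookup) can be sketched as below. This is a simplified model: a linear scan with a range comparison replaces the interval tree, the Bloom filter is a small hand-rolled one built on SHA-256, and all class and function names are hypothetical.

```python
import hashlib
from bisect import bisect_left

class BloomFilter:
    def __init__(self, nbits=256, nhashes=3):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def may_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))

class DiskRowSet:
    def __init__(self, keys):
        self.keys = sorted(keys)              # list position is the rowid
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def find_rowid(self, key):
        if not (self.keys[0] <= key <= self.keys[-1]):
            return None                       # outside this rowset's key range
        if not self.bloom.may_contain(key):
            return None                       # definitely absent
        i = bisect_left(self.keys, key)       # the actual lookup
        return i if i < len(self.keys) and self.keys[i] == key else None

def locate(rowsets, key):
    """Return (rowset index, rowid) for key, or None if no rowset holds it."""
    for n, rs in enumerate(rowsets):
        rowid = rs.find_rowid(key)
        if rowid is not None:
            return n, rowid
    return None

rowsets = [DiskRowSet(["a", "c", "e"]), DiskRowSet(["b", "d", "f"])]
hit = locate(rowsets, "d")
miss = locate(rowsets, "x")
```

Note that the key ranges of the two rowsets overlap, so the range check alone cannot decide which one holds "d". The Bloom filter can only rule a rowset out; a positive answer still needs the real lookup.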
Each DiskRowSet has one DeltaMemStore. The DeltaMemStore is a concurrent B-tree whose key is the combination of rowid and mutation timestamp. On reads, the mutations are applied to produce the row as of the scanner's snapshot timestamp, in the same way as the mutations of newly inserted data in the MemRowSet.
When the DeltaMemStore grows too large, it is flushed to disk as well, as a DeltaFile.
A DeltaFile holds the same kind of information as the DeltaMemStore, but compacted and serialized into a dense on-disk format. Because these mutations must be applied to bring rows from the base data up to the latest version, these DeltaFiles are called REDO files, and the mutations in them are called REDO records. Like the mutations stored in the MemRowSet, they must be applied when reading a version of the data newer than the base data.
The deltas for a single row may be spread across multiple DeltaFiles. In that case the DeltaFiles are ordered, and later changes take precedence over earlier ones.
Note that the mutation structure does not necessarily contain the whole row. If only one column of a row is updated, the mutation contains only the new value for that column. This makes updates fast and efficient, because unrelated columns are neither read nor rewritten.
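A DeltaMemStore keyed by (rowid, timestamp) can be modelled with a dictionary that is sorted at read time. This is a sketch under stated assumptions: the dictionary stands in for the concurrent B-tree, a value of None represents a DELETE, the snapshot is a single timestamp, and the method names are invented for the example.

```python
class DeltaMemStore:
    """Mutations keyed by (rowid, timestamp); a dict stands in for the B-tree."""

    def __init__(self):
        self._deltas = {}    # (rowid, ts) -> changed columns, or None for DELETE

    def mutate(self, rowid, ts, changes):
        self._deltas[(rowid, ts)] = changes

    def apply(self, rowid, base_row, snapshot_ts):
        """Return base_row with all REDO deltas up to snapshot_ts applied."""
        row = dict(base_row)
        for (rid, ts), changes in sorted(self._deltas.items()):
            if rid != rowid or ts > snapshot_ts:
                continue
            if changes is None:
                return None          # row deleted as of this snapshot
            row.update(changes)      # only the changed columns are touched
        return row

base = [{"key": "a", "val": 1, "note": "x"},
        {"key": "b", "val": 2, "note": "y"}]
dms = DeltaMemStore()
dms.mutate(1, 10, {"val": 20})       # update one column of rowid 1
dms.mutate(1, 12, {"note": "z"})     # later update of a different column
dms.mutate(0, 11, None)              # delete rowid 0

at_10 = dms.apply(1, base[1], 10)
at_12 = dms.apply(1, base[1], 12)
```

Each mutation stores only the column it changed, so the update to `val` never reads or rewrites `note`.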
Summary of delta file processing
To summarize, each DiskRowSet is logically divided into three parts:
- Base data: the latest version of the data at the time the MemRowSet was flushed to the DiskRowSet, stored in columnar format.
- UNDO records: historical data used to roll rows back to versions older than the base data.
- REDO records: updates made after the base data was written, used to obtain the latest version of the data.
UNDO records and REDO records are stored in the same format, called a DeltaFile.
Delta compactions
As mutations accumulate in DeltaFiles, reading a rowset becomes less and less efficient. In the worst case, reading the latest version of a row requires traversing all of its REDO records and merging them with the base data. In other words, the more often a row has been updated, the more mutations must be applied to obtain its latest version.
To improve read performance, Kudu rewrites inefficient physical layouts into more efficient ones in the background, without changing the logical contents. This transformation is called delta compaction. Its goals are:
- Reduce the number of delta files. The more delta files a rowset has flushed, the more separate files must be read to produce the latest version of a row. When the working set does not fit in memory (RAM), each read incurs a disk seek per delta file, which hurts performance.
- Migrate REDO records to UNDO records. As described above, a rowset consists of base data stored by column, with UNDO records "behind" it and REDO records "ahead" of it. Most queries read the latest version of the data, so we want to minimize the number of REDO records.
- Garbage-collect old UNDO records. UNDO records need to be retained only back to the earliest point in time the user has configured; UNDO records older than that can be removed from disk.
Note: In BigTable, timestamps are attached to the data itself, and no record of the change (insert, update, delete) is kept. In Kudu, timestamps are attached to the changes rather than to the data. Once the historical UNDO records have been removed, there is no longer any way to tell when a row or a column value was inserted or updated. Users who need this information should store the insert or update timestamp in a column of their own, just as they would in a traditional relational database.
Types of Delta compaction
Delta compaction comes in two kinds: minor and major.
Minor delta compaction:
A minor compaction compacts several delta files together without involving the base data. The result of the compaction is itself a delta file.
Major delta compaction:
A major compaction compacts the base data together with any number of delta files.
A major compaction is more expensive than a minor compaction because it must read and rewrite the base data, which is much larger than the delta data (the base data holds entire rows, whereas the delta data holds only mutations to some columns; note also that the base data is stored in columnar format and the delta data is not).
A major compaction can be run on any subset of the columns in a DiskRowSet. If only one column has received a significant number of updates, the compaction can read and rewrite just that column. This is common in enterprise applications, for example when updating the status of an order or a user's visit count.
Both types of compaction preserve the rowids within the rowset. They therefore run entirely in the background without taking locks. The compacted files are swapped into the rowset atomically, and once the swap completes, the pre-compaction files are deleted.
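The difference between the two kinds of compaction can be sketched with REDO deltas represented as (rowid, timestamp, column, value) tuples. This representation and both function names are assumptions made for the example. The sketch also omits one step: a real major compaction would keep the overwritten values as UNDO records, which is left out here for brevity.

```python
def minor_compact(delta_files):
    """Merge several REDO delta files into one, ordered by (rowid, ts)."""
    merged = [d for f in delta_files for d in f]
    return sorted(merged, key=lambda d: (d[0], d[1]))

def major_compact(base, deltas, column):
    """Fold every delta touching `column` into the base data.

    Returns the rewritten base plus the deltas that remain.
    Rowids (list positions) are unchanged.
    """
    new_col = list(base[column])
    remaining = []
    for rowid, ts, col, value in sorted(deltas, key=lambda d: (d[0], d[1])):
        if col == column:
            new_col[rowid] = value       # later timestamps overwrite earlier ones
        else:
            remaining.append((rowid, ts, col, value))
    return {**base, column: new_col}, remaining

base = {"key": ["a", "b", "c"],
        "status": ["new", "new", "new"],
        "hits": [0, 0, 0]}
file1 = [(1, 10, "status", "paid"), (2, 11, "hits", 5)]
file2 = [(1, 12, "status", "shipped"), (0, 13, "status", "paid")]

one_file = minor_compact([file1, file2])
new_base, leftover = major_compact(base, one_file, "status")
```

Only the `status` column is rewritten; the `key` and `hits` columns are carried over untouched, and the delta on `hits` stays in the delta file.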
Merging Compactions
As more data is written to a tablet, more and more DiskRowSets accumulate. This degrades Kudu's performance in several ways:
- Random access (getting or updating a single row by primary key): every rowset whose key range includes the key must be consulted. Bloom filters can avoid some of the physical seeks, but accessing large Bloom filters costs CPU and also increases memory consumption.
- Range scans (for example, scanning the rows whose primary key lies between A and B): every rowset whose key range overlaps the requested range must be seeked separately, and Bloom filters do not help. A specialized index structure could help, but it would also consume memory.
- Sorted scans: if the user requires the results in primary key order, the result set must pass through a merge. The cost of a merge is typically logarithmic in the number of inputs, so the merge becomes more expensive as the amount of data grows.
For these reasons, we should merge rowsets to reduce their number.
Note that, unlike the delta compactions described above, a merging compaction does not preserve rowids. This makes concurrent mutations more complicated to handle. The process is described in more detail in the compaction.txt file.
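A merging compaction is essentially a k-way merge of sorted inputs. The sketch below uses `heapq.merge` from the Python standard library on rowsets modelled as sorted lists of (key, value) pairs; the modelling and the function name are illustrative assumptions.

```python
import heapq

def merge_rowsets(rowsets):
    """Merge sorted (key, value) lists into one sorted list.

    The position of a row in the result is its new rowid.
    """
    return list(heapq.merge(*rowsets))   # each input must already be sorted

rs1 = [("a", 1), ("d", 4), ("f", 6)]
rs2 = [("b", 2), ("c", 3), ("e", 5)]

old_rowid_of_d = [k for k, _ in rs1].index("d")
merged = merge_rowsets([rs1, rs2])
new_rowid_of_d = [k for k, _ in merged].index("d")
```

The row with key "d" moves to a different position in the merged rowset, which is why a mutation addressed by rowid cannot simply be carried over while a merge is in progress.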
Overall picture
Comparison to BigTable approach
Kudu's design differs from BigTable's in the following ways:
- In Kudu, a given key exists in only one rowset of a tablet.
  In BigTable, a key can exist in several different SSTables. An entire BigTable tablet is analogous to a single Kudu rowset: reading a row requires merging (by key) the data found in all SSTables, just as reading a row in Kudu requires merging the base data with all of its DeltaFiles.
  The advantage of Kudu is that no merge is required to read a single row or to run a query that does not need sorted results. For example, an aggregate over a range of keys can scan each rowset independently (even in parallel) and then sum the results, because the order of the keys does not matter. This is clearly more efficient.
  The disadvantage of Kudu is that, unlike in BigTable, inserts and mutations are different operations: an insert writes to the MemRowSet, whereas a mutation (delete or update) writes to the DeltaMemStore of the rowset that holds the row. This affects performance in the following ways:
  a) An insert must first verify that the row is really new. This means querying the Bloom filter of every rowset. If a Bloom filter reports a possible match (that is, the key might be in that rowset), a seek must be performed to determine whether the operation is an insert or an update.
  We assume that, as long as rowsets are small enough, the Bloom filters are accurate enough that most inserts will not require a physical disk seek. In addition, if keys are inserted in order, for example timeseries + "_" + xxx, the block containing the key is likely to be in the block cache because it is accessed frequently.
  b) An update must determine which rowset contains the key. As above, this requires checking the Bloom filters.
  This is similar to a relational database (RDBMS), where inserting a row whose primary key already exists is an error rather than an update of the existing row. Likewise, updating a row that does not exist is an error. BigTable's semantics are different.
- In Kudu, mutations are applied to on-disk data by rowid, not by key.
  In BigTable, data for the same primary key can exist in more than one SSTable, so to combine mutations with the keys stored on disk, BigTable must perform a merge based on the row key. Row keys can be strings of arbitrary length, so comparing them is expensive. In addition, even when a query does not use the key columns (for example, an aggregate computation), they must still be read, which causes extra I/O. Composite primary keys are common in BigTable applications, and the key may be an order of magnitude larger than the columns of interest, especially when the queried columns are compressed.
  By contrast, Kudu's mutations are bound to rowids, so the merge is much more efficient and can be done by maintaining counters: given the next mutation to apply, a simple subtraction tells us how many rows of unmutated base data can be passed through unchanged. Alternatively, direct addressing can be used to efficiently obtain the latest version of the data.
  In addition, if the query does not reference the key, the execution plan never needs to read the key at all, except to determine the boundaries of the key range.
Example:
For a table whose primary key is (host, unix_time), the execution in Kudu can be expressed in pseudocode as follows:

    sum = 0
    foreach rowset:
        start_rowid = rowset.lookup_key(1349658729)
        end_rowid = rowset.lookup_key(1352250720)
        iter = rowset.new_iterator("cpu_usage")
        iter.seek(start_rowid)
        remaining = end_rowid - start_rowid
        while remaining > 0:
            block = iter.fetch_upto(remaining)
            sum += sum(block)
            remaining -= len(block)   # advance past the rows just read

Fetching blocks is also very efficient, because any mutations can be applied by indexing directly into the block.
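A runnable version of this pseudocode, under a few assumptions, might look as follows. Rowsets are modelled as in-memory column lists, keys are (host, unix_time) tuples, the time range is treated as inclusive with integer timestamps, and the class, its methods, and the sample data are all invented for the illustration.

```python
from bisect import bisect_left

class RowSet:
    def __init__(self, rows):
        rows = sorted(rows)                          # sort by (host, unix_time)
        self.keys = [(h, t) for h, t, _ in rows]
        self.columns = {"cpu_usage": [c for _, _, c in rows]}

    def lookup_key(self, key):
        """Rowid of the first row whose key is >= key."""
        return bisect_left(self.keys, key)

    def fetch(self, column, start, count, block_size=2):
        """Yield blocks of one column; the key column is never touched."""
        col = self.columns[column]
        end = start + count
        while start < end:
            block = col[start:min(start + block_size, end)]
            yield block
            start += len(block)

def sum_cpu(rowsets, host, t_lo, t_hi):
    total = 0
    for rs in rowsets:                               # each rowset independently
        start = rs.lookup_key((host, t_lo))
        end = rs.lookup_key((host, t_hi + 1))        # inclusive upper bound
        for block in rs.fetch("cpu_usage", start, end - start):
            total += sum(block)
    return total

rowsets = [
    RowSet([("h1", 100, 10), ("h1", 200, 20), ("h2", 150, 99)]),
    RowSet([("h1", 150, 5), ("h1", 300, 40)]),
]
total = sum_cpu(rowsets, "h1", 100, 200)
```

The key is consulted only twice per rowset, to find the boundary rowids. After that, the scan reads the `cpu_usage` column by position and never merges rows across rowsets.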
- Timestamps are not part of the data model.
  In BigTable-like systems, the timestamp of each cell is exposed to the user and essentially forms part of the cell's primary key. This makes it efficient to directly access a particular version of a cell and to store the entire time series of a cell as its versions. Kudu is less efficient at this (it must apply multiple mutations), because its timestamps are an implementation detail of MVCC and not a separate component of the primary key. Instead, Kudu can serve time-series use cases with a native composite primary key, such as (host, unix_time).
Reference
Source Document: Kudu table Design
Kudu tablet Design