Overview of the internal mechanisms of SQL Server In-Memory OLTP (Part 2)


--------------------------------------------------------------------------

This article is translated from the Microsoft white paper SQL Server In-Memory OLTP Internals Overview: http://technet.microsoft.com/en-us/library/dn720242.aspx

The translator's ability is limited; if anything is translated improperly, please point it out.

--------------------------------------------------------------------------

Row and Index Storage

In-Memory OLTP memory-optimized tables, and the indexes on them, are stored very differently from disk-based tables. Memory-optimized tables are not stored in data pages or in space allocated from extents the way disk-based tables are, because they are built on design principles optimized for byte-addressable memory rather than block-addressable disks.

Data rows

Data rows are allocated from structures called heaps, which differ from the types of heaps SQL Server supports for disk-based tables. The data rows of a single table do not necessarily reside near other rows of the same table, and the only way SQL Server knows which rows belong to the same table is that they are all linked together by the table's indexes. This is why memory-optimized tables have the requirement that at least one index be created on them: the index is what gives the table its structure.

The structure of the data rows themselves also differs greatly from the row structure used by disk-based tables. Each row consists of a header and a payload containing the row's attributes. Figure 2 shows this structure, with the contents of the header area expanded.

Figure 2 Structure of a data row in a memory-optimized table

Data row Header

The header contains two 8-byte fields holding In-Memory OLTP timestamps: a begin timestamp and an end timestamp. Each database that supports memory-optimized tables manages two internal counters used to generate these timestamps.

    • The transaction ID counter is a global, unique value that is reset when the SQL Server instance restarts. It is incremented every time a new transaction starts.
    • The global transaction timestamp is also global and unique, but it is not reset on a restart. This value is incremented each time a transaction ends and begins its validation process, and the new value is then the timestamp for that transaction. The global transaction timestamp is initialized during recovery with the highest transaction timestamp found among the recovered records. (We'll see more about recovery later in this white paper.)

The value of the begin timestamp is the timestamp of the transaction that inserted the row, and the value of the end timestamp is the timestamp of the transaction that deleted the row. A special value (referred to as "infinity") is used as the end timestamp for rows that have not been deleted. However, when a row is first inserted, the transaction's timestamp is not known until the inserting transaction completes, so before the transaction commits the begin timestamp field holds the transaction's ID. Similarly, for a delete operation the transaction's timestamp is unknown, so the end timestamp of the deleted row initially holds the transaction's ID, which is replaced once the transaction's actual timestamp is determined. As we'll see when discussing data operations, the begin and end timestamps determine which other transactions are able to see the row.

The header also contains a four-byte statement ID value. Every statement in a transaction has a unique StmtId, and when a row is created, the StmtId of the statement that created it is stored in the row. If the same statement accesses the same row again, the row can then be skipped.

Finally, the header contains a two-byte value (IdxLinkCount) indicating how many indexes reference this row. Following the IdxLinkCount value is a set of index pointers, described in the next section; the number of pointers equals the number of indexes. A row's reference count needs to start at 1 so that the row can be tracked by the garbage collection (GC) mechanism even after it is no longer connected to any index. Garbage collection is considered the "owner" of this initial reference.

As mentioned earlier, there is a pointer for each index on the table, and it is these pointers, together with the index data structures, that link the data rows together. Apart from the index pointers connecting rows, there is no other structure that combines data rows into a table, which is why every memory-optimized table must have at least one index. Also, because the number of pointers is part of the row structure and rows are never modified once created, all indexes must be defined at the time the memory-optimized table is created.

Payload data

The payload is the row data itself, containing the index key columns plus all the other columns in the row. (This means that all indexes on a memory-optimized table are effectively covering indexes.) The format of the payload varies from table to table. As mentioned in the earlier section on creating tables, the In-Memory OLTP compiler generates DLLs for table operations; because those DLLs know the payload format used when rows are inserted into the table, they can generate the appropriate code for all row operations.

Indexes in memory-optimized tables

Every memory-optimized table must have at least one index, because it is the indexes that link all the rows together. As mentioned earlier, data rows are not stored on data pages, so there is no collection of pages or extents, and no partitions or allocation units that can be traversed to find all of a table's data. One class of index does use the concept of index pages, but these pages are stored differently from the indexes of disk-based tables.

Indexes on memory-optimized tables, including any changes made to them during data processing, are never written to disk. Only the data rows, and changes to the data, are written to the transaction log. All indexes on memory-optimized tables are rebuilt from their definitions during database recovery. We will discuss the details in the later section on checkpoints and recovery.

Hash index

A hash index consists of an array of pointers, where each element of the array is called a hash bucket. A hash function is applied to the index key columns of each row, and the result of the function determines which bucket the row belongs to. All key values that hash to the same value (that is, produce the same result from the hash function) are reached through the same pointer in the hash index and are linked together in a chain. When a row is added to the table, the hash function is applied to its index key value. Because duplicate key values always produce the same hash result, duplicates will always be found in the same chain.

Figure 3 shows one data row in a hash index on a Name column. For this example, assume a very simple hash function that produces a value equal to the length of the string in the index key column. The first value, "Jane", thus has a hash value of 4, and the row is placed in the bucket for 4 in the hash index. (Note that a real hash function is much more random and unpredictable; this length-based example is used only to make the explanation easier.) From the entry for the value 4 in the hash array, you can see a pointer to the row containing Jane. That row points to no other rows, so its index pointer is NULL.

Figure 3 A hash index with a single row of data

In Figure 4, a row with a Name value of Greg has been added to the table. Since we assume the hash value for Greg also maps to 4, Greg is in the same bucket as Jane, and the row is linked into the same chain as Jane's row. Greg's row contains a pointer to the row containing Jane.

Figure 4 a hash index with two rows of data

A second hash index included in the table definition, on the City column, creates a second pointer field. Every row in the table now has two pointers pointing to it, and two pointers it can point onward with, one for each index: the first pointer in each row points to the next value in the chain for the Name index, and the second points to the next value in the chain for the City index. Figure 5 shows the same hash index on Name, which now has three rows with a hash value of 4 and two rows with a hash value of 5; the hash value 5 uses the second bucket of the Name index. The second index, on City, uses three buckets: the bucket for hash value 6 has three values in its chain, the bucket for 7 has one value, and the bucket for 8 also has one value.

Figure 5 Two hash indexes on the same table
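The chaining shown in Figures 3 through 5 can be modeled in a few lines. This is a toy sketch only: the length-based hash mirrors the example in the text, and the class and field names are invented for illustration, not anything from the SQL Server engine.

```python
# Toy model of the hash chains in Figures 3-5, using string length as the
# hash function exactly as the text's example does. A real hash function
# is far less predictable; all names here are invented for this sketch.
class Row:
    def __init__(self, name, city):
        self.name = name
        self.city = city
        self.next_by_name = None   # chain pointer owned by the Name index

class NameHashIndex:
    def __init__(self, bucket_count):
        self.buckets = [None] * bucket_count

    def insert(self, row):
        b = len(row.name) % len(self.buckets)   # toy hash: key length
        row.next_by_name = self.buckets[b]      # link in front of the chain
        self.buckets[b] = row

    def chain(self, key):
        names, r = [], self.buckets[len(key) % len(self.buckets)]
        while r is not None:
            names.append(r.name)
            r = r.next_by_name
        return names

idx = NameHashIndex(8)
idx.insert(Row("Jane", "Helsinki"))
idx.insert(Row("Greg", "Lisbon"))   # len("Greg") == 4 -> same bucket as "Jane"
print(idx.chain("Jane"))            # ['Greg', 'Jane']
```

Note how the newly inserted row is linked in at the head of the chain, which is why Greg's row points to Jane's and not the other way around.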

As shown in the earlier CREATE TABLE example, you must specify the number of hash buckets when creating a hash index. It is recommended that you choose a number of buckets equal to or greater than the expected cardinality of the index key column (that is, the number of unique values), so that each bucket is more likely to have only rows with a single value in its chain. Be careful not to choose a number that is too large, however, because each bucket consumes memory. The number you supply is rounded up to the next power of 2, so a value of 50,000 will be rounded up to 65,536. Having extra buckets does not improve performance, but it clearly wastes memory and can reduce the performance of scans, since a scan must examine every bucket for data rows.
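The rounding rule just described can be expressed directly. A minimal sketch; the function name is invented for illustration:

```python
def round_up_pow2(n):
    """Round a requested BUCKET_COUNT up to the next power of 2,
    mirroring the rounding behavior described in the text."""
    p = 1
    while p < n:
        p *= 2
    return p

print(round_up_pow2(50000))   # 65536
```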

When deciding to build a hash index, keep in mind that the hash function actually used is based on all of the key columns. This means that if you have a hash index on the LastName and FirstName columns of an Employees table, a row with the values "Harrison" and "Josh" will most likely hash to a different bucket than a row with the values "Harrison" and "John". A query that supplies only a LastName value, or an imprecise FirstName value (for example, "Jo%"), will not be able to use the index at all.
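The point about composite keys can be seen in a tiny sketch. This is illustrative only: Python's built-in hash() stands in for the engine's internal hash function, and the names are invented.

```python
# The hash covers ALL key columns, so a probe needs the full composite key.
# hash() here is a stand-in for the engine's internal hash function.
def bucket_for(last_name, first_name, bucket_count=1024):
    return hash((last_name, first_name)) % bucket_count

b1 = bucket_for("Harrison", "Josh")
b2 = bucket_for("Harrison", "John")
# b1 and b2 will usually differ, so a query supplying only
# LastName = 'Harrison' has no single bucket it can probe.
```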

Memory-Optimized nonclustered indexes

If you do not know how many buckets a column will need, or if you know you will be searching for data based on a range of values, you should consider creating a memory-optimized nonclustered index instead of a hash index. These indexes are implemented using a new data structure called a Bw-tree, originally conceived and described by Microsoft Research in 2011. A memory-optimized nonclustered index is a lock-free and latch-free variant of a B-tree.

Apart from the fact that the index pages are not a fixed size, the general structure of these nonclustered indexes is similar to SQL Server's regular B-trees, except that once an index page is built it is immutable. Like a regular B-tree page, each index page contains a sorted set of key values with a corresponding pointer for each value. At the upper levels of the index, on the so-called internal pages, the pointers point to index pages lower in the tree; at the leaf level, the pointers point to data rows. Just as with In-Memory OLTP's hash indexes, multiple data rows can be chained together: for nonclustered indexes, rows that share the same index key value are linked.

One big difference between memory-optimized nonclustered indexes and SQL Server B-trees is that a page pointer is a logical page ID (PID) rather than a physical page number. The PID identifies a slot in a mapping table, which maps each PID to a physical memory address. Index pages are never updated in place; instead they are replaced with new pages, and the mapping table is updated so that the same PID points to the new physical memory address.
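The PID indirection can be sketched as a tiny two-level lookup. This is a hedged illustration, with invented dict-based structures, of how replacing an immutable page only requires swapping one mapping-table entry:

```python
# Hedged sketch of PID indirection: pages are immutable, so "updating"
# the page behind PID 7 means building a replacement page and swapping a
# single mapping-table entry. Structures and names invented for this sketch.
page_storage = {}      # stands in for physical memory: address -> page
mapping_table = {}     # logical page ID (PID) -> physical address
next_addr = [0]

def alloc(page):
    addr = next_addr[0]
    next_addr[0] += 1
    page_storage[addr] = page
    return addr

mapping_table[7] = alloc({"keys": [10, 20, 30]})   # original page for PID 7

rebuilt = alloc({"keys": [10, 20, 25, 30]})        # new page with key 25 added
mapping_table[7] = rebuilt   # one swap: PID 7 now resolves to the new page

print(page_storage[mapping_table[7]]["keys"])      # [10, 20, 25, 30]
```

Every pointer elsewhere in the index still says "PID 7", so no other page needs to change when the page is rebuilt.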

Figure 6 shows the approximate structure of a memory-optimized nonclustered index, as well as the page mapping table.

Figure 6 The approximate structure of a memory-optimized nonclustered index

In Figure 6, not all the PID values are marked, and the mapping table does not show all of the PIDs in use. The index pages show the key values that the index references. Each index row on an internal page contains a key value and the PID of a page at the next level down. The key value is the highest value possible on the page it references. (Note that this differs from a regular B-tree index, where the index row stores the lowest value on the next-level page.)

Leaf-level index pages also contain key values, but instead of a PID they contain the actual memory address of a data row, which may be the first row in a chain of rows sharing the same key value.

Another big difference between memory-optimized nonclustered indexes and SQL Server B-trees is that at the leaf level, changes to the data are tracked using a set of delta values. The leaf page itself is not replaced for every change. Each update to a page, which can be an insert or a delete of a key value on that page, produces a page containing a delta record describing the change. An update is represented by two new delta records: one for the delete of the original value and one for the insert of the new value. As each delta record is added, the mapping table is updated with the physical address of the page containing the new delta record. Figure 7 illustrates this behavior. The mapping table shows only a single page, with logical address P. The physical address in the mapping table for page P was originally the memory address of the corresponding leaf-level index page. After a new row with index key value 50 (which we assume did not already occur in the table's data) is added to the table, In-Memory OLTP adds a delta record to page P indicating the insert of the new value, and the physical address for page P is updated to the address of the first delta record page. Next, assume the only row with index key value 48 is deleted from the table. In-Memory OLTP must then remove the index row with key 48, so another delta record is created, and the physical address for page P is updated once again.

Figure 7 Delta records linked to a leaf-level index page

Index page structure

Although the maximum index page size is still 8KB, unlike indexes on disk-based tables, non-clustered index pages for in-memory OLTP do not have a fixed size.

All memory-optimized nonclustered index pages have a header area containing the following information:

    • PID - the pointer into the mapping table
    • Page type - leaf, internal, delta, or special
    • Right-page PID - the PID of the page to the right of the current page
    • Height - the vertical distance from the current page to the leaf level
    • Page statistics - the count of delta records plus the count of records on the page
    • Max key - the upper limit of the key values on the page

In addition, leaf and internal pages contain two or three fixed-length arrays:

    • Values - this is actually an array of pointers. Each entry in the array is 8 bytes long. For an internal page, each entry contains the PID of a page at the next level; for a leaf page, each entry contains the memory address of the first row in a chain of rows with the same key value. (Note that technically a PID could be stored in 4 bytes, but to allow the same values structure to be used for all index pages, the array allows 8 bytes per entry.)
    • Offsets - this array exists only in pages of indexes with variable-length keys. Each entry is 2 bytes and contains the offset at which the corresponding key starts in the page's key array.
    • Keys - this is the array of key values. If the current page is an internal page, each key represents the first value on the page referenced by the corresponding PID. If the current page is a leaf page, each key is the value in the corresponding chain of data rows.

The smallest pages are typically the delta pages, which have a header containing most of the same information as an internal or leaf page. However, delta page headers do not have the array information described for leaf and internal pages. A delta page contains only an operation code (insert or delete) and a single value, which is the memory address of the first row in a chain of data rows. Finally, the delta page also contains the key value for the operation. In effect, you can think of a delta page as a mini index page holding a single element, whereas a regular index page stores an array of n elements.

Internal maintenance operations for nonclustered indexes

Three different operations are used to manage the structure of this index: consolidation, split, and merge. In all of these operations, existing index pages are never changed; instead, the mapping table is updated so that a PID value maps to a new physical address. If an index page needs a new row added (or a row removed), a whole new page is created and the PID's entry in the mapping table is updated.

Consolidation of delta records

A long chain of delta records eventually degrades search performance, because when SQL Server searches through the index it must consider the changes in the delta records along with the contents of the index page. If In-Memory OLTP attempts to add a new delta record to a chain that already has 16 elements, the changes described by the delta records are consolidated into the referenced index page: the page is rebuilt, including the change indicated by the new delta record that triggered the consolidation. The rebuilt page has the same PID as the original but a new memory address. The old pages (the index page plus the delta pages) are marked for garbage collection.
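The delta chain plus the 16-record consolidation threshold can be modeled in a few lines. A hedged sketch with invented names; Python sets stand in for real page layouts:

```python
# Hedged sketch: delta records accumulate in front of an immutable base
# page; adding a 17th record triggers a rebuild that folds in the chain
# plus the triggering record. All names are invented for illustration.
CONSOLIDATE_AT = 16

class LeafPage:
    def __init__(self, keys):
        self.keys = sorted(keys)   # immutable once built

def consolidate(base, deltas):
    keys = set(base.keys)
    for op, key in reversed(deltas):   # apply the oldest delta first
        if op == "insert":
            keys.add(key)
        else:
            keys.discard(key)
    return LeafPage(keys)

mapping = {"P": ([], LeafPage([]))}    # PID -> (delta chain, base page)

def add_delta(pid, op, key):
    deltas, base = mapping[pid]
    if len(deltas) >= CONSOLIDATE_AT:
        # rebuild under the same PID; old page and deltas go to GC
        mapping[pid] = ([], consolidate(base, [(op, key)] + deltas))
    else:
        mapping[pid] = ([(op, key)] + deltas, base)  # prepend new delta

for k in range(17):                    # the 17th insert triggers a rebuild
    add_delta("P", "insert", k)
deltas, base = mapping["P"]
print(len(deltas), len(base.keys))     # prints "0 17"
```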

Splitting of a full index page

An index page grows on demand, starting from storing a single row up to a maximum of 8 KB. Once the page has grown to 8 KB, a newly inserted row will cause the page to split. For an internal page, this means there is no more room to add another key value and pointer; for a leaf page, it means that once all the delta records were incorporated, the rows would be too large to fit on the page. The statistics in a leaf page's header continuously track how much space would be needed to consolidate the delta records, and this is adjusted as each new delta record is added. A split is performed in the two atomic steps described below. Assume page PS is being split into pages P1 and P2, and that page PP is the parent page with a row pointing to PS.

    • Step 1: Allocate two new pages, P1 and P2, and split the rows from page PS onto them, including the newly inserted row. A new slot in the page mapping table is used to store the physical address of page P2. These two pages, P1 and P2, are not yet accessible to any concurrent operations. A "logical" pointer from P1 to P2 is also set. Then, in one atomic operation, the mapping table is updated to change the pointer from PS to P1. After this operation, there is no pointer to page PS.
    • Step 2: After step 1, the parent page PP points to P1, but there is no direct pointer from the parent page to page P2; P2 is reachable only through P1. To create a pointer from the parent page to P2, a new parent page PNP is allocated, all the rows from page PP are copied into it, and a new row pointing to page P2 is added. Then, in one atomic operation, the mapping table is updated to change the pointer from PP to PNP.
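The two steps above can be sketched as mapping-table swaps. A hedged model: page contents are plain dicts, the PID names (PS, P1, P2, PP, PNP) follow the text, and everything else is invented for illustration.

```python
# Hedged sketch of a page split as two atomic mapping-table swaps.
mapping = {
    "PS": {"keys": [1, 2, 3, 4], "right": None},  # full leaf page
    "PP": {"children": [("PS", 4)]},              # parent: (PID, max key)
}

# Step 1: build P1 and P2 off to the side, link P1 -> P2, then atomically
# repoint the existing PID so it resolves to P1's content.
mapping["P2"] = {"keys": [3, 4], "right": None}
p1 = {"keys": [1, 2], "right": "P2"}    # logical pointer from P1 to P2
mapping["PS"] = p1                       # atomic swap; old PS unreachable

# Step 2: build a new parent PNP holding rows for both halves, then
# atomically repoint the parent's PID.
pnp = {"children": [("PS", 2), ("P2", 4)]}
mapping["PP"] = pnp                      # atomic swap; old PP unreachable
```

Between the two steps the index is still usable: a search that reaches the left half can follow the logical right pointer to find keys on P2.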
Merging of adjacent index pages

When a delete operation leaves an index page P with less than 10% of the maximum page size (currently 8 KB), or with only a single row on it, page P is merged with its neighboring page. Like a split, this is a multi-step operation. For this example, assume we are merging a page with its left neighbor, that is, the page with smaller values. When a row is deleted from page P, a delta record for the delete is added as usual. In addition, a check is made to see whether page P qualifies for a merge (that is, whether the space remaining after deleting the row will be less than 10% of the maximum page size). If it does, the merge is performed in the three atomic steps described below. For this example, assume page PP is the parent page with a row pointing to page P, PLN is the left neighbor page, and the largest key value on PLN is 5. This means the row in parent page PP that points to PLN contains the value 5. We will delete the row with key value 10 on page P, after which only the row with key value 9 remains on page P.

    • Step 1: A delta page DP10 representing key value 10 is created, with its pointer set to point to page P. Additionally, a special "merge delta page" DPM is created, pointing to DP10. At this stage, neither DP10 nor DPM is visible to any concurrent transaction. In the same atomic step, the pointer to page P in the mapping table is updated to point to DPM. After this step, the entry for key value 10 in parent page PP now points to DPM.
    • Step 2: In this step, the row representing key value 5 in page PP is removed, and the entry for key value 10 is updated to point to page PLN. To do this, a new non-leaf page PP2 is allocated, all the rows of PP except the row for key value 5 are copied into it, and the row for key value 10 is updated to point to page PLN. When this is done, in one atomic operation, the mapping table entry pointing to page PP is updated to point to page PP2. Page PP is no longer reachable.
    • Step 3: In this step, leaf pages P and PLN are merged and the delta pages are removed. To do this, a new page Pnew is allocated, the rows of P and PLN are merged into it, and the changes in the delta pages are included in the new page Pnew. Then, in one atomic operation, the mapping table entry pointing to page PLN is updated to point to page Pnew.
Data manipulation

SQL Server's In-Memory OLTP determines which row versions are visible to which transactions by maintaining an internal transaction ID that serves the purpose of a timestamp; it will be referred to as a timestamp in this section. Timestamps are generated by a monotonically increasing counter that grows every time a transaction commits. A transaction's start time is the highest timestamp in the database at the moment the transaction starts, and when the transaction commits it generates a new timestamp that uniquely identifies the transaction. Timestamps are used to specify the following:

    • Commit/end time: every transaction that modifies data commits at a distinct point in time called the commit or end timestamp of the transaction. The commit time effectively identifies the transaction's position in the serialization history.
    • Valid time for a record version: all records in the database contain two timestamps - a begin timestamp (Begin-Ts) and an end timestamp (End-Ts). The begin timestamp is the commit time of the transaction that created the version, and the end timestamp is the commit timestamp of the transaction that deleted the version (perhaps by replacing it with a new version). The valid time of a record version is the range of timestamps during which the version is visible to other transactions. For example, in Figure 5, Susan's record was updated from Vienna to Bogota at time "90".
    • Logical read time: the read time can be any value between the transaction's start time and the current time. Only versions whose valid time overlaps the logical read time are visible to the read. For all isolation levels other than Read Committed, a transaction's logical read time corresponds to the start of the transaction; for Read Committed, it corresponds to the start of the statement within the transaction.

The notion of version visibility is the foundation of correct concurrency control in In-Memory OLTP. A transaction running at logical read time RT must see only those versions whose begin timestamp is less than RT and whose end timestamp is greater than RT.
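The visibility rule just stated can be written down directly. A hedged sketch: the begin timestamp of 50 for the Vienna version is assumed for illustration (only the update at "90" appears in the text), and INFINITY stands in for the special end timestamp of undeleted versions.

```python
# Sketch of the visibility rule: a version is visible at logical read
# time RT iff begin_ts < RT < end_ts.
INFINITY = float("inf")

def visible(begin_ts, end_ts, read_time):
    return begin_ts < read_time < end_ts

# Susan's record from the example: <Susan, Vienna> valid from 50 (assumed)
# to 90, superseded by <Susan, Bogota> valid from 90 onward.
print(visible(50, 90, 75))          # True  - Vienna version visible at 75
print(visible(50, 90, 243))         # False - already replaced by then
print(visible(90, INFINITY, 243))   # True  - Bogota version visible at 243
```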

Allowable isolation levels for memory-optimized tables

Data operations on memory-optimized tables always use optimistic multi-version concurrency control (MVCC). Optimistic data access does not use locks or latches to provide transaction isolation. We will describe how this lock-free and latch-free behavior is managed, along with the reasons for the allowed transaction isolation levels, in later sections. In this section, we discuss only as much detail about transaction isolation levels as is needed to understand the fundamentals of data access and modification operations.

Transactions that access memory-optimized tables support the following isolation levels.

    • SNAPSHOT
    • REPEATABLE READ
    • SERIALIZABLE

The transaction isolation level can be specified as part of the ATOMIC block of a natively compiled stored procedure. Alternatively, when accessing memory-optimized tables through interpreted Transact-SQL, the isolation level can be specified using table hints or with a new database option named MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT. This option transparently maps the lower isolation levels (Read Uncommitted and Read Committed) to the snapshot isolation level, reducing the application changes needed to migrate some applications to memory-optimized tables. For more details, see http://msdn.microsoft.com/en-us/library/dn133175(v=sql.120).aspx.

The READ COMMITTED isolation level is supported for memory-optimized tables in autocommit (single-statement) transactions; it is not supported in explicit or implicit user transactions. (An implicit transaction is one initiated under the IMPLICIT_TRANSACTIONS session option. In this mode, behavior is the same as with an explicit transaction, but no BEGIN TRANSACTION statement is required: any DML statement starts a transaction, and the transaction must be explicitly committed or rolled back. Only the BEGIN TRANSACTION is implicit.) The READ_COMMITTED_SNAPSHOT isolation level is supported for memory-optimized tables in autocommit transactions only, and only if the query accesses no disk-based tables. In addition, transactions started through interpreted Transact-SQL under the SNAPSHOT isolation level cannot access memory-optimized tables. Transactions using interpreted Transact-SQL at the REPEATABLE READ or SERIALIZABLE isolation level must access memory-optimized tables using SNAPSHOT isolation.

Given the in-memory row structure described earlier, let's look at an example of how DML operations are performed. We will represent rows by listing their contents in angle brackets. Assume we have a transaction TX1, with transaction ID 100, running at the SERIALIZABLE isolation level; it starts at timestamp 240 and performs two operations:

    • Delete the row <Greg, Lisbon>
    • Update <Jane, Helsinki> to <Jane, Perth>

At the same time, two other transactions will read the rows. TX2 is a single SELECT statement running as an autocommit transaction at timestamp 243. TX3 is an explicit transaction that reads one row and then updates another row based on the value it read in its SELECT; TX3 has timestamp 246.

First, let's look at the data-modifying transaction. When the transaction begins, it acquires a begin timestamp indicating where it starts relative to the serialization order of the database. In this example, that timestamp is 240.

While it runs, transaction TX1 can only access records whose begin timestamp is at most 240 and whose end timestamp is greater than 240.


Delete

Transaction TX1 first locates <Greg, Lisbon> via one of the indexes. To delete the row, it sets the row's end timestamp to 100, with an extra flag bit indicating that the value 100 is a transaction ID rather than a timestamp. Any other transaction that now attempts to access the row finds that the end timestamp contains a transaction ID (100), which indicates the row may have been deleted. It then locates TX1 in the transaction map and checks whether TX1 is still active, in order to determine whether the deletion of <Greg, Lisbon> has completed.

Update and insert

Next, the update of <Jane, Helsinki> is performed by breaking the operation into two separate operations: deleting the entire original row and inserting a complete new row. Construction of the new row <Jane, Perth> begins with a begin timestamp of 100, a flag bit indicating that 100 is a transaction ID, and an end timestamp of ∞ (infinity). Any other transaction attempting to access this row will need to determine whether transaction TX1 is still active to decide whether it can see <Jane, Perth>. The row is then inserted by linking <Jane, Perth> into both indexes. Next, <Jane, Helsinki> is deleted just as described in the previous paragraph. Any other transaction attempting to update or delete <Jane, Helsinki> will notice that the end timestamp contains a transaction ID rather than infinity, conclude that there is a write-write conflict, and immediately abort.
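The delete-plus-insert mechanics above, together with the later replacement of transaction IDs by the commit timestamp, can be sketched as follows. A simplified model with invented names: the tuple ("txn", id) stands in for the flag bit described in the text, and the begin timestamp of 200 for the old version is assumed for illustration.

```python
# Hedged sketch of the TX1 update: an update is a delete of the old
# version plus an insert of a new one, with transaction ID 100 standing
# in for the unknown timestamp until commit (timestamp 250).
INFINITY = float("inf")

rows = [{"name": "Jane", "city": "Helsinki", "begin": 200, "end": INFINITY}]

def update(rows, name, new_city, txn_id):
    old = next(r for r in rows if r["name"] == name and r["end"] == INFINITY)
    old["end"] = ("txn", txn_id)               # "delete" the old version
    rows.append({"name": name, "city": new_city,
                 "begin": ("txn", txn_id), "end": INFINITY})  # insert new one

def commit(rows, txn_id, commit_ts):
    # postprocessing: swap transaction IDs for the real commit timestamp
    for r in rows:
        if r["begin"] == ("txn", txn_id):
            r["begin"] = commit_ts
        if r["end"] == ("txn", txn_id):
            r["end"] = commit_ts

update(rows, "Jane", "Perth", txn_id=100)
commit(rows, txn_id=100, commit_ts=250)
print(rows[0]["end"], rows[1]["begin"])   # prints "250 250"
```

Between update() and commit(), any concurrent reader that finds ("txn", 100) in a timestamp slot would have to check whether TX1 is still active, exactly as the text describes.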

At this point, transaction TX1 has completed its operations but has not yet committed. Commit processing begins by obtaining an end timestamp for the transaction. In this example, assume that timestamp is 250; it identifies the point in the database's serialization order at which the transaction's updates are logically complete. After obtaining the end timestamp, the transaction enters a state called validation, in which checks are performed to ensure the transaction does not violate its isolation level. If validation fails, the transaction is aborted. More details about validation are described later. At the end of the validation phase, SQL Server also writes the transaction to the log.

A transaction tracks all of its changes in a write set, which is essentially a list of delete/insert operations together with pointers to the version associated with each operation. The write set for this transaction, and the changed rows, are shown in the green box in Figure 8. This write set forms the content of the transaction's log record. A transaction normally generates only a single log record, containing its ID and commit timestamp along with the versions of all the records it deleted or inserted; there is not a separate log record per affected row as there is for disk-based tables. However, there is an upper limit on log record size, and if a transaction on memory-optimized tables exceeds it, multiple log records are generated. Once the log record has been hardened to storage, the transaction's state is changed to committed and postprocessing begins.

Postprocessing involves traversing the write set and handling each entry as follows:

    • For a delete operation, the end timestamp of the row is set to the end timestamp of the transaction (250 in this example) and the transaction-ID flag on the row's end timestamp field is cleared.
    • For an insert operation, the begin timestamp of the affected row is set to the end timestamp of the transaction (250 in this example) and the transaction-ID flag on the row's begin timestamp field is cleared.
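The two postprocessing steps above can be sketched in Python. This is a toy model, not SQL Server code: rows are plain dicts, and the `*_is_tx_id` keys stand in for the flag bits that mark a timestamp field as holding a transaction ID.

```python
INFINITY = float("inf")
TX_ID = 100        # TX1's transaction ID in the running example
COMMIT_TS = 250    # TX1's end (commit) timestamp

# Before postprocessing, both rows still carry the transaction ID.
deleted = {"payload": ("Jane", "Helsinki"), "end_ts": TX_ID, "end_is_tx_id": True}
inserted = {"payload": ("Jane", "Perth"), "begin_ts": TX_ID, "begin_is_tx_id": True,
            "end_ts": INFINITY}

write_set = [("delete", deleted), ("insert", inserted)]

def postprocess(write_set, commit_ts):
    """Replace transaction IDs with the real commit timestamp and clear the flags."""
    for op, row in write_set:
        if op == "delete":
            row["end_ts"] = commit_ts
            row["end_is_tx_id"] = False
        else:  # "insert"
            row["begin_ts"] = commit_ts
            row["begin_is_tx_id"] = False

postprocess(write_set, COMMIT_TS)
```

After this pass, readers no longer need to consult the transaction's state: the timestamps are final.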

The actual unlinking and removal of old row versions is handled by the garbage collection system, which is described later.

Figure 8 Modification of transactions on a single table


Now let's look at the two read transactions, TX2 and TX3, which run concurrently with TX1. Remember that TX1 is deleting the row <Greg, Lisbon> and updating <Jane, Helsinki> to <Jane, Perth>.

TX2 is an auto-commit transaction that reads an entire table:

SELECT Name, City
FROM T1

TX2's session runs under the default isolation level, READ COMMITTED, but as described above, because no hints are specified and T1 is a memory-optimized table, the data is accessed using snapshot isolation. Because TX2 runs at timestamp 243, it can read the rows that were valid at that time. The row <Greg, Beijing> is no longer valid at timestamp 243, so TX2 cannot read it. The row <Greg, Lisbon> will be deleted at timestamp 250, but it is valid between timestamps 200 and 250, so transaction TX2 can read it. TX2 also reads the rows <Susan, Bogota> and <Jane, Helsinki>.
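The visibility rule TX2 applies can be captured in a few lines of Python: a version is visible at logical read time `ts` exactly when `begin_ts <= ts < end_ts`. This is a simplified sketch using the example's final committed timestamps (at time 243 the delete by TX1 was actually still pending, with a transaction ID in the end timestamp; that extra check is omitted here).

```python
INFINITY = float("inf")

def visible(begin_ts, end_ts, read_ts):
    """Snapshot visibility: the version existed at the reader's logical time."""
    return begin_ts <= read_ts < end_ts

# (begin_ts, end_ts) pairs from the running example, after TX1 commits at 250.
rows = {
    ("Greg", "Beijing"):  (100, 200),
    ("Greg", "Lisbon"):   (200, 250),
    ("Susan", "Bogota"):  (100, INFINITY),
    ("Jane", "Helsinki"): (100, 250),
}

read_ts = 243   # TX2's read timestamp
snapshot = [row for row, (b, e) in rows.items() if visible(b, e, read_ts)]
```

Running this yields exactly the three rows the text says TX2 reads, and excludes <Greg, Beijing>.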

TX3 is an explicit transaction that begins at timestamp 246. It reads one row and updates another row based on the value it read:

DECLARE @City nvarchar(32);
BEGIN TRAN TX3
    SELECT @City = City FROM T1 WITH (REPEATABLEREAD) WHERE Name = 'Jane';
    UPDATE T1 WITH (REPEATABLEREAD) SET City = @City WHERE Name = 'Susan';
COMMIT TRAN -- commits at timestamp 255

In transaction TX3, the SELECT statement reads the row <Jane, Helsinki>, because that row is still accessible at timestamp 246. It then updates the row <Susan, Bogota> to <Susan, Helsinki>. However, when transaction TX3 tries to commit after TX1 has committed, SQL Server detects that the row <Jane, Helsinki> has been updated by another transaction. This violates the requested REPEATABLE READ isolation, so the commit fails and transaction TX3 is rolled back. The next section covers validation in more detail.


Validation

Before finally committing a transaction that involves memory-optimized tables, SQL Server performs a validation step. Because no locks are acquired during data modification, data changes could result in invalid data depending on the requested isolation level. This phase of the commit process ensures that no invalid data is produced.

Below is a list of the violations that can be encountered under each of the possible isolation levels. Additional possible violations, as well as commit dependencies, are described in more detail in the next section on isolation levels and concurrency control.

If you access a memory-optimized table under the Snapshot isolation level, you may have the following validation errors when you try to perform a commit:

    • If the current transaction inserts a row with the same primary key value as a row inserted by another transaction that committed before the current transaction, error 41325 will be generated ("The current transaction failed to commit due to a serializable validation failure.") and the transaction will be aborted.

If you access a memory-optimized table under the REPEATABLE READ isolation level, you may have the following validation errors when you try to perform a commit:

    • If the current transaction has read any row that was then updated by another transaction that committed before the current transaction, error 41305 will be generated ("The current transaction failed to commit due to a repeatable read validation failure.") and the transaction will be aborted.

If you access a memory-optimized table under the Serializable isolation level, you may have the following validation errors when you try to perform a commit:

    • If the current transaction fails to read a valid row that meets the specified filter conditions, or encounters a phantom row inserted by another transaction that matches the specified filter conditions, the commit will fail. The transaction must be executed as if there were no concurrent transactions: all actions logically happen at a single serialization point in time. If either of these guarantees is violated, error 41325 will be generated and the transaction will be aborted.
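The serializable check above can be sketched as a set comparison. This is a conceptual model, not SQL Server's implementation: for each scan the engine can re-evaluate the filter at commit time, and a row that disappeared or a phantom that appeared both make the sets differ.

```python
def validate_serializable(keys_seen_at_read, keys_matching_at_commit):
    """Commit-time check for SERIALIZABLE: the set of rows matching a scan's
    filter must be unchanged. A vanished row or a phantom both fail validation
    (SQL Server would raise error 41325)."""
    return keys_seen_at_read == keys_matching_at_commit

# Hypothetical example: at read time a scan WHERE City = 'Helsinki'
# returned only Jane's row.
seen = {"Jane"}
unchanged = validate_serializable(seen, {"Jane"})          # commit may proceed
with_phantom = validate_serializable(seen, {"Jane", "Bob"})  # phantom appeared
```

REPEATABLE READ is the weaker variant of the same idea: it only re-checks the rows actually read, not the full filter result.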
T-SQL support

Memory-optimized tables can be accessed in two different ways: either through interop, using interpreted Transact-SQL, or through natively compiled stored procedures.

Interpreted Transact-SQL

When using the interop capability, virtually the full Transact-SQL surface area is available when working with memory-optimized tables, but you should not expect the same performance as when accessing memory-optimized tables through natively compiled stored procedures. Interop is the appropriate choice when running ad hoc queries, or as an intermediate step when migrating an application to In-Memory OLTP, before migrating the stored procedures with the greatest performance impact. Interpreted Transact-SQL should also be used when you need to access both memory-optimized tables and disk-based tables.

When using interop to access memory-optimized tables, the following Transact-SQL features are not supported:

    • MERGE (when the target is a memory-optimized table)
    • Dynamic and keyset cursors (these are automatically degraded to static cursors)
    • Cross-database queries
    • Cross-database transactions
    • Linked server
    • Locking hints: TABLOCK, XLOCK, PAGLOCK (NOLOCK is supported, but it is simply ignored)
    • Isolation level hints READUNCOMMITTED, READCOMMITTED, and READCOMMITTEDLOCK
Natively compiled stored procedures

Natively compiled stored procedures allow you to execute Transact-SQL in the fastest way possible, including accessing data in memory-optimized tables. However, these procedures have many more limitations than interpreted Transact-SQL. There are also restrictions on the data types and collations that can be accessed and processed in natively compiled stored procedures. See the documentation for a complete list of the supported Transact-SQL statements, data types, and allowed operators. In addition, disk-based tables cannot be accessed at all from a natively compiled stored procedure.

These limitations stem from the fact that, inside the engine, a separate function must be created for each operation on each table. The supported surface area will be expanded in subsequent releases.

Garbage collection of data rows in memory

Because In-Memory OLTP is a multi-versioning system, delete and update operations (as well as aborted insert operations) generate row versions that eventually become stale, meaning they are no longer visible to any transaction. These unneeded versions slow down scans of the index structures and waste memory that needs to be reclaimed.

The garbage collection process for stale versions in memory-optimized tables is analogous to the version-store cleanup that SQL Server performs for disk-based tables under one of the snapshot-based isolation levels. A big difference, however, is that the cleanup does not take place in tempdb, but in the in-memory table structures themselves.

To determine which rows can be safely deleted, the system keeps track of the timestamp of the oldest active transaction running in the system, and uses this value to decide which rows might still be needed. Any row that is no longer valid as of this point in time (that is, its end timestamp is earlier than this time) is considered stale. Stale rows are removed and their memory is released back to the system.
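The staleness test reduces to one comparison, sketched below in Python. This is a toy model with invented field names: a version whose end timestamp precedes the begin timestamp of the oldest active transaction can no longer be seen by anyone and is collectable.

```python
INFINITY = float("inf")

def collectable(versions, oldest_active_ts):
    """Return the versions that no active or future transaction can ever see."""
    return [v for v in versions if v["end_ts"] < oldest_active_ts]

# Versions from the running example after TX1 committed at timestamp 250.
versions = [
    {"payload": ("Greg", "Lisbon"),   "end_ts": 250},
    {"payload": ("Jane", "Helsinki"), "end_ts": 250},
    {"payload": ("Jane", "Perth"),    "end_ts": INFINITY},   # still current
]

# Hypothetical: the oldest transaction still active started at timestamp 260.
stale = collectable(versions, oldest_active_ts=260)
```

With an oldest active timestamp of 260, the two versions ended at 250 become stale, while the current <Jane, Perth> version is kept.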

The garbage collection system is designed to be non-blocking, cooperative, efficient, responsive, and scalable. The "cooperative" attribute is of particular interest. Although there is a system thread dedicated to the garbage collection process, user threads actually do most of the work. If a user thread is scanning an index (all index access on memory-optimized tables is considered index scanning) and encounters a stale row version, it unlinks that version from the current chain and adjusts the pointers. It also decrements the reference count in the row's header area. In addition, when a user thread completes a transaction, it adds information about the transaction to a queue of transactions to be processed by the garbage collection process. Finally, user threads pick up one or more work items from a queue built by the garbage collection thread, and free the memory used by the rows that make up those work items.

The garbage collection thread examines the queue of completed transactions about once a minute, but the system can adjust the frequency internally based on the number of completed transactions waiting to be processed. For each transaction, it determines which rows are stale and builds work items, each consisting of a set of rows ready for removal. In CTP2 the number of rows in a set is 16, but this number may change in future releases. These work items are distributed across multiple queues, one for each CPU used by SQL Server. Normally the actual work of removing rows from memory is left to the user threads that process the work items in these queues, but when user activity is low, the garbage collection thread itself removes rows to reclaim system memory.
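The chunking and distribution described above can be sketched as follows. The group size of 16 comes from the CTP2 behavior mentioned in the text; the round-robin assignment to per-CPU queues is an illustrative assumption about how the spread might work, not a statement about SQL Server's exact policy.

```python
def build_work_items(stale_rows, group_size=16):
    """Chunk stale row versions into work items of at most group_size rows."""
    return [stale_rows[i:i + group_size]
            for i in range(0, len(stale_rows), group_size)]

def distribute(work_items, cpu_count):
    """Spread work items over one queue per CPU (round-robin, for illustration)."""
    queues = [[] for _ in range(cpu_count)]
    for i, item in enumerate(work_items):
        queues[i % cpu_count].append(item)
    return queues

items = build_work_items(list(range(40)))   # 40 stale rows -> groups of 16, 16, 8
queues = distribute(items, cpu_count=2)     # hypothetical 2-CPU machine
```

User threads would then drain these queues as a side effect of their own work, with the dedicated garbage collection thread stepping in only when the system is idle.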

The dynamic management view sys.dm_db_xtp_index_stats has one row for each index on each memory-optimized table, and its rows_expired column indicates how many rows were detected as stale during scans of that index. There is also a column called rows_expired_removed that indicates how many rows have been unlinked from that index. As mentioned above, a row can only be removed by the garbage collection thread once it has been unlinked from all indexes on the table. Therefore, rows_expired_removed does not increase until the rows_expired counter has increased for every index on the memory-optimized table.

These values can be observed with the following query, which joins the sys.dm_db_xtp_index_stats DMV with the sys.indexes catalog view to return the index names.

SELECT name AS 'index_name',
       s.index_id,
       scans_started,
       rows_returned,
       rows_expired,
       rows_expired_removed
FROM sys.dm_db_xtp_index_stats s
JOIN sys.indexes i
    ON s.object_id = i.object_id
    AND s.index_id = i.index_id
WHERE object_id('<memory-optimized Table Name>') = s.object_id;
GO

---------------------------to Be Continued-------------------------------

Overview of the internal mechanisms of SQL Server in-memory OLTP (i)

Overview of the internal mechanisms of SQL Server in-memory OLTP (ii)

Overview of the internal mechanisms of SQL Server in-memory OLTP (iii)

Overview of the internal mechanisms of SQL Server in-memory OLTP (iv)

Overview of the internal mechanisms of SQL Server in-memory OLTP (v)
