Aerospike Architecture Series: Secondary Index
Secondary Index
A secondary index is built on a non-primary key, which makes it possible to model one-to-many relationships. Indexes are specified bin by bin (bins are similar to columns in an RDBMS). This allows efficient updates and reduces the resources needed to store the index.
A DDL determines which bins and which data types are indexed. Indexes can be created or removed dynamically using tools or the API. The DDL resembles an RDBMS schema, but it is not used for data validation, even if a bin is defined as indexed by the DDL. When an indexed bin of a record is updated, the index is updated with it.
Index entries are type checked: an index can only be created on strings or integers. Consider this situation. A bin stores the user's age; one application stores it as a string and another stores it as a positive integer. An integer index does not include the records that hold a string value in that bin, and a string index does not include the records that hold an integer value.
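The type check described above can be sketched in a few lines. This is a minimal illustration, not Aerospike code: `build_index`, the record layout, and the sample keys are all hypothetical.

```python
def build_index(records, bin_name, index_type):
    """Map each indexed value to the primary keys of the records holding it."""
    index = {}
    for key, bins in records.items():
        value = bins.get(bin_name)
        # Skip records where the bin is absent or holds a different type.
        # An exact type match also keeps bool values out of an int index.
        if type(value) is not index_type:
            continue
        index.setdefault(value, set()).add(key)
    return index


records = {
    "u1": {"age": 30},         # written by an application that stores integers
    "u2": {"age": "30"},       # written by an application that stores strings
    "u3": {"name": "no age"},  # bin missing entirely
}
int_index = build_index(records, "age", int)
str_index = build_index(records, "age", str)
```

Here `int_index` only references `u1` and `str_index` only references `u2`; `u3` appears in neither.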
Secondary indexes have the following properties:
- They are stored in memory for fast lookup.
- They are built on every node of the cluster. Each index entry contains only references to records located on the current node.
- They contain pointers to both master records and replica records in the cluster.
Data Structure
Figure 1: B-tree structure of the secondary index
Like the primary index, a secondary index is a combination of a hash table and B-trees: structurally, it is a hash of B-trees. Each logical index has 32 physical trees. When a key is identified for an index operation, a hash function determines which physical tree holds the index entry, and that B-tree is then updated. Note that a single secondary index key can reference multiple primary records. The lowest level of the structure is a full B-tree whose size depends on the number of records.
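The hash-of-trees layout can be sketched as follows. This is a simplified model under stated assumptions: a plain `dict` stands in for each B-tree, and the SHA-1 based tree selection is a hypothetical choice, since the text does not name the hash function.

```python
import hashlib

NUM_PHYSICAL_TREES = 32  # from the text: 32 physical trees per logical index


class SecondaryIndex:
    def __init__(self):
        # A dict stands in for each B-tree; a real tree keeps its keys ordered.
        self.trees = [dict() for _ in range(NUM_PHYSICAL_TREES)]

    def _tree_for(self, sindex_key):
        # Hypothetical hash choice, used only to pick one of the 32 trees.
        digest = hashlib.sha1(repr(sindex_key).encode()).digest()
        return self.trees[digest[0] % NUM_PHYSICAL_TREES]

    def insert(self, sindex_key, primary_ref):
        # One secondary key may reference many primary records.
        self._tree_for(sindex_key).setdefault(sindex_key, set()).add(primary_ref)

    def lookup(self, sindex_key):
        return set(self._tree_for(sindex_key).get(sindex_key, ()))


idx = SecondaryIndex()
idx.insert(30, "u1")
idx.insert(30, "u4")
idx.insert(41, "u2")
```

Looking up the key `30` returns both `u1` and `u4`, which is the one-to-many property the paragraph describes.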
Index Management
Index Metadata
Aerospike keeps the information about which indexes have been created in a special, globally maintained data structure: the system metadata (SMD). The SMD module sits between the secondary index modules of the multiple nodes. Every change to a secondary index is ultimately triggered through SMD.
Figure 2: Index changes triggered through system metadata
- A client request triggers a create, delete, or update of secondary index metadata. The request reaches SMD through the secondary index module.
- SMD sends the request to the Paxos master.
- The Paxos master requests the metadata from all nodes in the cluster. When all the data has been returned, it calls the secondary index merge callback function, which determines the winning metadata version.
- Once the winning version has been determined, a request is sent to all connected nodes to accept the new metadata.
- Each node executes the index create/delete DDL function, then triggers a scan and returns to the client.
Index Creation
Aerospike supports dynamic index creation. Tools such as aql can list the currently available indexes and create or drop indexes.
To create a secondary index, you must specify the namespace, set, bin, and index type (such as integer or string).
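As a rough sketch of what specifying these four values looks like from application code, the function below calls `index_integer_create` and `index_string_create`, which are assumed here to be available on a connected Aerospike Python client with the argument order (namespace, set, bin, index name); check the client documentation for your version. The namespace, set, bin, and index names are made up, and `RecordingClient` is a hypothetical stand-in so that the sketch runs without a server.

```python
def create_age_indexes(client, namespace="test", set_name="users"):
    """Create one integer and one string index on the 'age' bin.

    `client` is expected to behave like a connected Aerospike Python client.
    All names used here are illustrative only.
    """
    client.index_integer_create(namespace, set_name, "age", "users_age_int")
    client.index_string_create(namespace, set_name, "age", "users_age_str")


class RecordingClient:
    """Stand-in that records calls instead of contacting a cluster."""

    def __init__(self):
        self.calls = []

    def index_integer_create(self, ns, set_name, bin_name, index_name):
        self.calls.append(("integer", ns, set_name, bin_name, index_name))

    def index_string_create(self, ns, set_name, bin_name, index_name):
        self.calls.append(("string", ns, set_name, bin_name, index_name))


fake = RecordingClient()
create_age_indexes(fake)
```

Against a real cluster, `client` would be a connected client object rather than the recording stand-in.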
After receiving confirmation from SMD, each node creates the secondary index in write-active mode and starts a background scan job. The job scans all the data and inserts entries into the secondary index.
An index entry is created only for records that meet all the index conditions.
The scan job that populates the secondary index interacts with read/write transactions in the same way a normal scan does. Unlike a normal scan, the index creation scan has no network component. During index creation, all new writes that affect the indexed bin also update the index.
Once the index creation scan has finished and all index entries have been created, the index is immediately ready for queries and is marked as read-active.
After the index has been created on all nodes, the secondary index is available for all normal queries.
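The write-active to read-active lifecycle can be modeled with a small state machine. This is a single-threaded sketch under simplifying assumptions: the class and method names are hypothetical, and the removal of stale entries on update is left out.

```python
class IndexBuild:
    def __init__(self, bin_name, index_type):
        self.bin_name = bin_name
        self.index_type = index_type
        self.entries = {}
        self.state = "write-active"  # accepts writes, not yet queryable

    def _add(self, key, bins):
        value = bins.get(self.bin_name)
        if type(value) is self.index_type:  # only matching records get entries
            self.entries.setdefault(value, set()).add(key)

    def on_write(self, key, bins):
        # New writes update the index even while the scan is still running.
        self._add(key, bins)

    def background_scan(self, stored_records):
        for key, bins in stored_records.items():
            self._add(key, bins)
        self.state = "read-active"  # scan finished: ready for queries

    def query(self, value):
        if self.state != "read-active":
            raise RuntimeError("index is still being built")
        return set(self.entries.get(value, ()))


stored = {"u1": {"age": 30}, "u2": {"age": "30"}}
build = IndexBuild("age", int)
build.on_write("u3", {"age": 30})  # write arriving during index creation
try:
    build.query(30)
    queried_early = True
except RuntimeError:
    queried_early = False
build.background_scan(stored)
```

The early query is rejected, and after the scan completes a query for `30` sees both the scanned record `u1` and the concurrently written record `u3`.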
Recommendations
- Avoid index DDL (create/drop index) while the cluster is not yet formed or is undergoing fault detection. Index creation is an I/O-intensive operation and should be performed at low load.
- If a node joins the cluster with data but is missing an index definition, the missing index is created and populated on that node when it joins. Query access is not allowed while the index is being populated. To avoid this situation, clear the indexes before adding the node.
Index Creation Priority
An index creation scan only reads records that have been committed by transactions (no dirty reads). As long as no record update blocks it, the scan runs at full speed. It is therefore important to set the index build priority at the right level, so that the index scan does not affect the latency of ongoing read and write transactions. The job priority settings in the Aerospike real-time engine can be used to control the resource utilization of index creation scans effectively. The default settings should suit most situations, because they are based on years of deployment experience in balancing long-running tasks, such as rebalancing and backup, against low-latency read/write transactions.
Writing Data with Indexes
When data is written, the current index definitions in the system metadata (SMD) are checked. For every bin that has an index, a secondary index update, insert, or delete is performed. Note that Aerospike is a flex-schema system. If the corresponding bin has no value, or its value is not of the indexed data type, the corresponding index operation is not performed.
All these index changes are made in sync with the record change. Because the index is not persisted, the difficult problem of committing the index and the data together is eliminated, which improves speed.
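A minimal sketch of this write path is shown below. It assumes that a single lock covers both the record and its index entries, which is one way to keep them in sync; `IndexedStore` and its fields are hypothetical names.

```python
import threading


class IndexedStore:
    def __init__(self, indexed_bins):
        self.indexed_bins = indexed_bins  # e.g. {"age": int}
        self.records = {}
        self.indexes = {b: {} for b in indexed_bins}
        self.lock = threading.Lock()  # assumption: one lock for record and index

    def _indexable(self, bins, bin_name):
        value = bins.get(bin_name)
        return type(value) is self.indexed_bins[bin_name], value

    def write(self, key, bins):
        with self.lock:
            old = self.records.get(key, {})
            for bin_name, index in self.indexes.items():
                had, old_value = self._indexable(old, bin_name)
                has, new_value = self._indexable(bins, bin_name)
                if had:  # drop the entry for the previous value
                    index[old_value].discard(key)
                    if not index[old_value]:
                        del index[old_value]
                if has:  # missing or mistyped bins are simply skipped
                    index.setdefault(new_value, set()).add(key)
            self.records[key] = dict(bins)


store = IndexedStore({"age": int})
store.write("u1", {"age": 30})
store.write("u1", {"age": 31})    # update moves the index entry
store.write("u2", {"age": "31"})  # string value: no index operation
```

After these three writes the `age` index holds a single entry, `31` pointing at `u1`.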
Garbage Collection
When primary data is deleted (e.g. by delete, expiry, eviction, or migration), the data is not read from disk in order to delete its secondary index entries. This avoids unnecessary I/O overhead. The stale secondary index entries are instead cleared by a background thread that runs at regular intervals. The garbage collector is designed to be non-intrusive: it builds a list of entries to delete in small batches and then slowly removes them from the index. When the system does a large amount of purging, extra memory is needed to accommodate garbage collection.
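The batch-wise cleanup can be sketched as follows. The function names are hypothetical, and checking liveness against an in-memory set of keys is an assumption made for illustration; the point is that stale entries are found and removed a few at a time without reading record data.

```python
def collect_batch(index, live_keys, batch_size):
    """Gather up to batch_size stale (value, key) references."""
    batch = []
    for value, keys in index.items():
        for key in keys:
            if key not in live_keys:
                batch.append((value, key))
                if len(batch) == batch_size:
                    return batch
    return batch


def garbage_collect(index, live_keys, batch_size=2):
    """Remove stale references in small batches; return how many were removed."""
    removed = 0
    while True:
        batch = collect_batch(index, live_keys, batch_size)
        if not batch:
            return removed
        for value, key in batch:
            index[value].discard(key)
            if not index[value]:
                del index[value]
        removed += len(batch)
        # A real collector would pause here between batches (assumption).


index = {30: {"u1", "u2"}, 41: {"u3"}}
live = {"u1"}  # u2 and u3 were deleted, expired, or evicted
removed = garbage_collect(index, live)
```

Only the reference to the live record `u1` remains in the index afterwards.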
Distributed Query
A query request that uses a secondary index is sent to every node in the cluster. Figure B describes the basic architecture as follows:
- Requests are distributed to all nodes.
- In-memory indexes map secondary keys to primary keys quickly.
- Indexes are co-located with the SSD data on each node, to ensure ACID properties and manage migration.
- Records are read from all SSDs and DRAM in parallel.
- Results are aggregated on each node.
- Result sets from all nodes are merged and sent to the client.
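The scatter-gather flow in the list above can be illustrated with a toy model. Everything here is hypothetical: `Node` holds an in-memory index over an `age` bin next to its own records, and a thread pool stands in for the network fan-out.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    def __init__(self, records):
        self.records = records  # primary key -> bins, local to this node
        self.index = {}         # age value -> set of local primary keys
        for key, bins in records.items():
            if type(bins.get("age")) is int:
                self.index.setdefault(bins["age"], set()).add(key)

    def query(self, low, high):
        # The index lookup yields local primary keys; records are read locally.
        keys = [k for v, ks in self.index.items() if low <= v <= high for k in ks]
        return [(k, self.records[k]) for k in keys]


def distributed_query(nodes, low, high):
    # Send the same request to every node, then merge the partial results.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = pool.map(lambda n: n.query(low, high), nodes)
        rows = [row for part in partials for row in part]
    return sorted(rows, key=lambda row: row[0])


nodes = [
    Node({"u1": {"age": 30}, "u2": {"age": 55}}),
    Node({"u3": {"age": 33}}),
    Node({"u4": {"age": "30"}}),
]
result = distributed_query(nodes, 30, 40)
```

The merged result contains `u1` and `u3`; `u2` is out of range and `u4` holds a string, so it was never indexed.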
A secondary index lookup can produce a very long list of primary key records. For this reason, secondary index lookups are performed in small batches. Responses to the client are also batched: when memory use reaches a threshold, the response is immediately flushed to the network. This behaves much like the return values of an Aerospike batch request. The general idea is to keep the memory used by an individual secondary index query constant, regardless of the selectivity of the request.
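The constant-memory idea can be sketched with a generator. The function name is hypothetical, and a fixed record count per batch is used here as a simple proxy for the memory threshold mentioned in the text.

```python
def query_in_batches(primary_keys, read_record, batch_size=3):
    """Yield query results in fixed-size batches.

    Only one batch is held at a time, however many keys match.
    """
    batch = []
    for key in primary_keys:
        batch.append(read_record(key))
        if len(batch) == batch_size:
            yield batch  # flush to the caller (the network, in the text)
            batch = []
    if batch:
        yield batch      # final partial batch


records = {f"u{i}": {"age": 30} for i in range(7)}
batches = list(query_in_batches(iter(records), records.__getitem__))
sizes = [len(b) for b in batches]
```

Seven matching records are returned as batches of 3, 3, and 1. A consumer that processes each batch as it arrives never holds more than `batch_size` records.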
Query Result
The query process ensures that the result is in sync with the actual data at the time the records are scanned during query execution. Uncommitted data is never read as part of a query result. However, data that is deleted during the query may still be returned.
In the Presence of Cluster State Changes
The following table describes the secondary index reconstruction and query result consistency scenarios.
Scenario | Persistent Namespace Boot Type | Data-in-memory | Secondary Index Population | Node Boot Time | Query Consistency During Migrations
Node joining | With data; without fast restart | False | After data load from disk; parallel data partition scan | Higher data load time than with no secondary index | Best effort *
Node joining | With data; with fast restart (primary key index available in shared memory) | False | After fast restart; parallel data partition scan ** | Higher data load time than with no secondary index | Best effort *
Node joining | With data; without fast restart | True | During the data load from disk | No significant difference with or without a secondary index | Best effort *
Node joining | With no data; always without fast restart | True/False | N/A | N/A | Consistent copy
Node leaving | N/A | True/False | N/A | N/A | Consistent copy
* Best effort: a transactionally consistent copy of the data that is not necessarily the latest (when there are duplicate copies of a record, they are not merged before the secondary index query is served). After the merge ends, the copies are eventually consistent.
** Fast restart is not supported for secondary indexes.
Under normal operating conditions, nodes remain fully available for queries while nodes are added or removed. While a migration is in progress, the consistency of query results depends on the scenario (see the result consistency table for details).
Query Node
Getting exact query results during data migration is complex. When a node is added to or removed from a cluster, the data migration module is called to move the data to and from nodes as required by the new configuration. During migration, a partition may exist in different versions on many nodes. To find the location of a partition that holds the queried data, Aerospike query processing uses additional partition state that is shared among the nodes in the cluster, and selects a query node for each partition on which the query can be executed. The query node for a partition is chosen based on several factors (such as the number of records in the partition and the number of copies of the partition in the cluster). The system is designed to return the most accurate query result possible.
Aggregations
Query records can be fed into the aggregation framework for filtering and further processing. On each node, the query result is sent to the UDF subsystem for record stream processing. The stream UDF specified by the user is called, and the user-defined sequence of operations is applied to the query results. The results from each node are collected by the client, which may then perform further operations on the data.
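The two-level flow (a pipeline on each node, then a final merge on the client) can be sketched as below. This is an illustration in Python only; Aerospike stream UDFs are written in Lua, and the function names, record fields, and sample data here are hypothetical.

```python
from functools import reduce


def node_aggregate(records, keep, to_value, combine, initial):
    """Run a filter -> map -> reduce pipeline over one node's query results."""
    stream = (to_value(r) for r in records if keep(r))
    return reduce(combine, stream, initial)


def client_aggregate(node_results, combine, initial):
    """Merge the partial results returned by each node."""
    return reduce(combine, node_results, initial)


node_a = [{"age": 30, "spend": 10}, {"age": 35, "spend": 5}]
node_b = [{"age": 31, "spend": 7}, {"age": 70, "spend": 100}]


def keep(record):
    return record["age"] < 40


def to_value(record):
    return record["spend"]


def add(a, b):
    return a + b


partials = [node_aggregate(n, keep, to_value, add, 0) for n in (node_a, node_b)]
total = client_aggregate(partials, add, 0)
```

Each node reduces its own stream to a single partial sum, so only two small values cross the network before the client combines them.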
Performance
To ensure that aggregation does not affect the overall performance of the database, Aerospike uses several techniques:
Global queues manage the records as they pass through the various processing stages, and thread pools make effective use of CPU parallelism. The query state is shared across the thread pool so that the system can manage the stream UDF pipeline correctly. Apart from the initial data fetch, every stage of an aggregation is a CPU-bound operation, so it is important that each stage completes quickly and efficiently. To achieve this, techniques such as record batching and UDF state caching are used to reduce the system overhead of processing large numbers of records in real time.
In addition, for operations on namespaces whose data is stored in memory (no storage fetch), stream processing is implemented in a single thread context. Even in this case, because Aerospike natively divides data into a fixed number of partitions, the system can still parallelize operations across data partitions.
<Http://www.aerospike.com/docs/architecture/secondary-index.html>
Translated by: Beijing IT masters