Hybrid storage for the Aerospike-architecture series

Source: Internet
Author: User
Tags aerospike

Hybrid Storage

The hybrid storage system holds the indexes and data stored on each node and handles all interaction with the physical storage. It also contains modules that automatically remove old data and defragment the physical storage.

Aerospike can store data in DRAM, on traditional rotational disks, and on SSDs, and each namespace can be configured separately. This flexibility lets application developers keep a small, frequently accessed namespace in DRAM and put a large namespace on relatively inexpensive SSD storage.

Significant work has been done to optimize data storage on SSDs, including bypassing the file system to take advantage of the low-level read and write patterns of SSDs.

Philosophy

Unlike Large Data Types, all the data of a single record is stored together. The storage limit for each record (row) is 1 MB by default.

Storage is copy-on-write; the freed space is reclaimed by the defragmentation process.

Each namespace is configured with a fixed amount of storage. Every node must have the same namespaces, and a namespace must have the same amount of storage on every node.

Storage can be configured as pure DRAM without persistence, DRAM with persistence, or flash (SSD).

Persistent storage (disk) can be a flash device, a high-performance block storage device (for example, in the cloud), or a file on any storage device.

Data in DRAM

Keeping data purely in memory, without persistence, has the advantage of high throughput. Even high-performance modern flash storage is still slower than DRAM, and DRAM prices are falling fast.

Data is allocated through the jemalloc allocator. jemalloc allows allocations to be assigned to different pools, so long-term allocations, such as those made for the storage layer, can be allocated separately. We found that jemalloc shows exceptionally low fragmentation.

High reliability can be achieved by keeping multiple copies of the data in DRAM. A high level of "k-safety" is obtained because Aerospike automatically redistributes and re-replicates data when a cluster node fails or a node joins. A node's data is recovered automatically from a replica.

Because Aerospike distributes data randomly, the risk of data becoming unavailable is quite small even when several nodes fail. For example, in a 10-node cluster with two copies of the data, if two nodes fail, the amount of unavailable data is approximately 2%, or about 1/50 of the data.
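The 2% figure can be checked with a short calculation. The following is a minimal sketch, assuming that every set of distinct nodes is equally likely to hold the copies of a given piece of data (a simplification of the real partition distribution). The function name is hypothetical and not part of Aerospike.

```python
from math import comb

def fraction_unavailable(nodes: int, copies: int, failed: int) -> float:
    """Fraction of data whose copies all sit on failed nodes.

    Assumes every set of `copies` distinct nodes is equally likely
    to hold a given piece of data (a simplifying assumption).
    """
    if failed < copies:
        return 0.0  # at least one copy always survives
    return comb(failed, copies) / comb(nodes, copies)

# 10 nodes, 2 copies, 2 failed nodes: 1 of the 45 possible node pairs is lost
print(f"{fraction_unavailable(10, 2, 2):.4f}")  # prints 0.0222
```

Under this model the result is 1/45, roughly 2%, which agrees with the approximate figure given above.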

When the persistence layer is configured, reads are served from the in-memory copy, and writes go through the data path.


Data on SSD/Flash

When data is written, a write latch is taken on the row to prevent concurrent writes to the same record. In some cluster states, the data must also be read from other nodes and conflicts resolved.

After the write is validated, the in-memory representation of the record is updated on the master node. The data to be written is added to a write buffer, and when the write buffer is full it is queued to be written to disk. Depending on the write buffer size (which is the same as the maximum row size) and the write throughput, there is some risk of uncommitted data.
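The buffering behaviour described here can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `WriteBuffer`, its methods, and the rule "flush when the next record does not fit" are hypothetical and do not reproduce Aerospike's storage engine. It shows why a larger buffer, or a lower write rate, leaves more data uncommitted at any moment.

```python
class WriteBuffer:
    """Toy model of a fixed-size write buffer that is flushed when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity  # bytes the buffer can hold
        self.pending = []         # records not yet written to the device
        self.used = 0             # bytes currently buffered
        self.device = []          # stands in for blocks written to disk

    def write(self, record: bytes) -> None:
        if len(record) > self.capacity:
            raise ValueError("record larger than write buffer")
        # Flush first if the new record would not fit.
        if self.used + len(record) > self.capacity:
            self.flush()
        self.pending.append(record)
        self.used += len(record)

    def flush(self) -> None:
        if self.pending:
            self.device.append(b"".join(self.pending))
            self.pending, self.used = [], 0

    def uncommitted_bytes(self) -> int:
        """Data that would be lost if the node stopped right now."""
        return self.used


buf = WriteBuffer(capacity=1024)
for _ in range(5):
    buf.write(b"x" * 300)
print(len(buf.device), buf.uncommitted_bytes())  # prints: 1 600
```

After five 300-byte writes, one 900-byte block has reached the simulated device and 600 bytes are still waiting in the buffer.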

If there are replicas, the replicas and their indexes are updated as well. Once all the in-memory copies have been updated, the result is returned to the client.

The system can also be configured to return the result before all writes have completed (deferred consistency).

Storing Data

Aerospike data types include integers, strings, binary objects (blobs), language-native serialized types, lists, maps, and LDTs.

Except in the more efficient single-bin mode, every bin (Aerospike's term for a column) has a bin name, which is stored through a string table. Each bin name is therefore stored only once, and a namespace can hold up to 32K unique bin names.

If you need more bin names, consider using a map. A map can hold any number of key-value pairs, and its data can be accessed efficiently through UDFs.

If you store data as a complex language type, such as a Java class, the Aerospike client uses the native serialization system of that language, and the data is stored as a blob type specific to the language. This lets clients written in the same language read the data with clean code, but the default serialization of most languages is rudimentary.

Integers are stored in 8 bytes, which limits the range of integer values in the current version. The Aerospike wire protocol allows variable-length integers.
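The range limit that follows from an 8-byte integer can be shown with Python's standard `struct` module. This is a minimal sketch, not Aerospike's storage code: a signed representation and big-endian byte order are assumptions made for the example, and `pack_int64` is a hypothetical helper.

```python
import struct

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

def pack_int64(value: int) -> bytes:
    """Pack an integer into 8 bytes (signed, big-endian: both assumptions)."""
    return struct.pack(">q", value)

print(len(pack_int64(INT64_MAX)))  # 8 bytes for any representable value

try:
    pack_int64(INT64_MAX + 1)
except (struct.error, OverflowError):
    print("out of range for a signed 8-byte integer")
```

Values outside the range from `INT64_MIN` to `INT64_MAX` cannot be represented in the fixed 8-byte form and have to be stored as another type, such as a string or blob.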

Strings are stored in UTF-8. For most strings, UTF-8 is more compact than other Unicode encodings such as UTF-16. For cross-language compatibility, the client library converts native Unicode strings into UTF-8.
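The size difference between encodings is easy to measure. The sketch below uses only Python's built-in `str.encode`; the helper name `encoded_sizes` is hypothetical. It also shows why the claim holds for most strings rather than all of them: text made mainly of CJK characters is smaller in UTF-16.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in three Unicode encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le avoids the 2-byte BOM
        "utf-32": len(text.encode("utf-32-le")),  # -le avoids the 4-byte BOM
    }

print(encoded_sizes("aerospike"))  # ASCII text: 1 byte per character in UTF-8
print(encoded_sizes("存储"))        # CJK text: 3 bytes per character in UTF-8
```

For the ASCII string the sizes are 9, 18, and 36 bytes; for the two CJK characters they are 6, 4, and 8 bytes.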

The most efficient approach is to use binary objects (blobs), whose size limit equals the record size limit. Many deployments use their own serialization and may store objects directly after compressing them. Doing so means that the data cannot easily be accessed through UDFs.

Complex types are stored natively in msgpack format. Complex objects are serialized on the client and sent over the wire protocol. For simple get/put operations, the wire format needs no further serialization or conversion and is written directly to storage.

Flash Optimizations


The defragmenter tracks the number of active records in each block on disk and reclaims blocks that fall below a minimum level of use. It scans the blocks continuously, looking for blocks that have a certain amount of free space.
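The reclamation rule can be sketched as follows. This is a simplified model under stated assumptions, not Aerospike's defragmenter: the `Block` class, the `defragment` function, the 50% threshold, and the repacking strategy are all hypothetical. Blocks whose live data falls below the threshold have their surviving records copied into fresh blocks, and the old blocks become free.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    size: int                                 # block capacity in bytes
    live: dict = field(default_factory=dict)  # key -> size of live record

    def used_bytes(self) -> int:
        return sum(self.live.values())

    def used_fraction(self) -> float:
        return self.used_bytes() / self.size

def defragment(blocks, min_used, block_size):
    """Return (blocks after defragmentation, number of blocks freed)."""
    kept = [b for b in blocks if b.used_fraction() >= min_used]
    sparse = [b for b in blocks if b.used_fraction() < min_used]
    # Copy the surviving records out of the sparse blocks into new blocks.
    repacked = []
    current = Block(block_size)
    for block in sparse:
        for key, size in block.live.items():
            if current.used_bytes() + size > block_size:
                repacked.append(current)
                current = Block(block_size)
            current.live[key] = size
    if current.live:
        repacked.append(current)
    return kept + repacked, len(sparse) - len(repacked)

blocks = [
    Block(1000, {"a": 900}),  # 90% used: left alone
    Block(1000, {"b": 200}),  # 20% used: reclaimed
    Block(1000, {"c": 300}),  # 30% used: reclaimed
    Block(1000),              # empty: reclaimed
]
result, freed = defragment(blocks, min_used=0.5, block_size=1000)
print(len(result), freed)  # prints: 2 2
```

The three sparse blocks collapse into a single block holding records `b` and `c`, so two blocks are returned to the free pool.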

Eviction Based on Storage

The cleaner (evictor) is responsible for removing expired records and for reclaiming memory when the system reaches a preset high watermark. When configuring a namespace, the administrator specifies the maximum amount of memory that the namespace may use. Normally the cleaner looks for expired data and frees the memory and disk space it occupies. The cleaner also tracks memory usage per namespace, and if usage reaches the preset high watermark, it removes older records even if they have not yet expired.

By allowing the cleaner to remove old data when memory reaches its limit, Aerospike can be used as an efficient LRU (least recently used) cache. Note that the age of a record is measured from its last modification time, and the application can change a record's time-to-live at any time. Applications can also specify that a record must never be reclaimed automatically.
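The two stages described here, expiry followed by eviction above the high watermark, can be sketched in a few lines. This is an illustrative model only: the `clean` function, the record layout, and the parameter names are hypothetical, and ordering eviction by last modification time follows the description in this article rather than any specific Aerospike release.

```python
def clean(records: dict, now: float, memory_limit: int, high_watermark: float) -> list:
    """Remove expired records, then evict until usage is under the watermark.

    `records` maps key -> {"size": int, "modified": float, "expires": float or None}.
    A record with expires=None is never removed automatically.
    Returns the removed keys in removal order.
    """
    removed = []
    # Stage 1: expired records are always removed.
    expired = [k for k, r in records.items()
               if r["expires"] is not None and r["expires"] <= now]
    for key in expired:
        del records[key]
        removed.append(key)
    # Stage 2: above the watermark, evict the least recently modified first.
    budget = memory_limit * high_watermark
    candidates = sorted(
        (k for k, r in records.items() if r["expires"] is not None),
        key=lambda k: records[k]["modified"],
    )
    for key in candidates:
        if sum(r["size"] for r in records.values()) <= budget:
            break
        del records[key]
        removed.append(key)
    return removed

records = {
    "a": {"size": 40, "modified": 1, "expires": 5},     # already expired
    "b": {"size": 40, "modified": 2, "expires": 100},   # oldest live record
    "c": {"size": 40, "modified": 3, "expires": None},  # never reclaimed
    "d": {"size": 40, "modified": 4, "expires": 100},
}
removed = clean(records, now=10, memory_limit=200, high_watermark=0.5)
print(removed)  # prints: ['a', 'b']
```

Record `a` is removed because it has expired. Record `b` is evicted because usage (120 bytes) is still above the 100-byte watermark and `b` is the oldest record that may be evicted. Records `c` and `d` remain.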

Large Records (Sub-record Storage Mechanism)

To support the storage of large objects, Aerospike provides a new underlying storage model called "sub-records". Sub-records are similar to regular records; the main difference is that they cannot be accessed directly. A sub-record is linked to a parent record and is accessed through it. A sub-record shares the partition ID and the internal record lock of its parent record, so it moves with the parent record during migration and is protected by the same isolation mechanism as the parent record.

Aerospike LDTs are built on this storage mechanism. An LDT bin is not stored contiguously with its record (the record that has the LDT bin) but is split across multiple sub-records, each 2 KB to 1 MB in size. A sub-record belongs to one bin and can contain multiple items (for example, an 8 KB sub-record can hold about 100 strings of 80 bytes each). Sub-records are linked to one another, and the links on the parent record allow efficient updates and lookups.
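The way items spread across sub-records can be modelled with a small sketch. It is a toy illustration under stated assumptions, not the LDT implementation: the class `LargeList` is hypothetical, the parent record is reduced to a list of links, and per-item overhead is ignored, which is why each 8 KB sub-record holds 102 items here rather than the rounded figure of about 100.

```python
class LargeList:
    """Toy parent record whose items live in fixed-capacity sub-records."""

    def __init__(self, subrecord_bytes: int = 8 * 1024):
        self.subrecord_bytes = subrecord_bytes
        self.subrecords = []  # links held by the parent record

    @staticmethod
    def _used(subrecord: list) -> int:
        return sum(len(item) for item in subrecord)

    def append(self, item: bytes) -> None:
        if len(item) > self.subrecord_bytes:
            raise ValueError("item larger than a sub-record")
        # Only the last sub-record is touched, never the whole object.
        last_is_full = (
            not self.subrecords
            or self._used(self.subrecords[-1]) + len(item) > self.subrecord_bytes
        )
        if last_is_full:
            self.subrecords.append([])
        self.subrecords[-1].append(item)


large_list = LargeList()
for _ in range(250):
    large_list.append(b"s" * 80)
print([len(sub) for sub in large_list.subrecords])  # prints: [102, 102, 46]
```

An append writes to a single sub-record, so the cost of an update stays proportional to the item being added rather than to the size of the whole object.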

LDT objects therefore use Aerospike's robust replication, rebalancing, and migration algorithms, which ensure immediate consistency and high availability. LDT operations are issued through the client API and processed on the database server side.

The sub-record mechanism has the following benefits:

    • It exploits the random-read capability of SSDs without any additional overhead, and it does not require the careful arrangement of data on storage that traditional database implementations have to worry about.
    • A specific LDT operation (for example, inserting 100 bytes) costs only as much as updating that LDT entry; the whole LDT does not have to travel between the client and the server.

Large records are stored differently, which allows their data to exceed the size limit of a single record. For details, see the Large Data Type architecture documentation.


Translator: Beijing It man son
