System Overview, Aerospike
Aerospike is a distributed, scalable NoSQL database built with three main objectives:
- Create a flexible, scalable platform that meets the needs of today's web-scale applications
- Provide the robustness and reliability (for example, ACID) expected of traditional databases
- Provide operational efficiency (minimal manual involvement)
It was first published in the Proceedings of VLDB (Very Large Databases) in 2011. The Aerospike architecture comprises three layers:
- The Client Layer is a cluster-aware, open-source client library that implements the Aerospike API. It tracks nodes and knows where data resides in the cluster.
- The Clustering and Data Distribution Layer manages cluster communication and automates failover, data replication, cross-data-center synchronization, intelligent rebalancing, and data migration.
- The flash-optimized Data Storage Layer reliably stores data in memory and flash.
Client layer
The Aerospike "Smart Client" is designed for speed. It is implemented as an open-source linkable library available for C, C#, Java, PHP, and Python. Developers are free to modify or redistribute the library as needed. The client does the following:
- Implements the Aerospike API and the client-server protocol, and connects directly to the cluster.
- Tracks nodes and knows where data is stored, and learns of cluster configuration changes immediately when a node starts or stops.
- Implements its own TCP/IP connection pool for efficiency. It also detects transaction failures that have not risen to the level of a node failure in the cluster and reroutes those transactions to nodes that hold copies of the data.
- Transparently sends requests directly to the node where the data is located and retries or reroutes them as needed, for example during cluster reconfiguration.
This architecture reduces transaction latency, offloads work from the cluster, and reduces the developer's workload. It also ensures that applications do not need to be restarted when a node is started or stopped. Finally, it eliminates the need for additional cluster management servers or proxy servers.
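The routing behaviour described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Aerospike client's implementation: the `PartitionMap` class, the partition count of 4096, the SHA-1 stand-in hash, and the round-robin placement of replicas are all hypothetical choices made for the example.

```python
import hashlib

N_PARTITIONS = 4096  # assumption for this sketch


def partition_id(set_name: str, user_key: str) -> int:
    """Map a record key to a partition using a stand-in hash (SHA-1)."""
    digest = hashlib.sha1(f"{set_name}:{user_key}".encode()).digest()
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS


class PartitionMap:
    """Hypothetical client-side view of which nodes hold each partition."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.rebuild(nodes)

    def rebuild(self, nodes):
        """Recompute the map after a cluster change (a node starts or stops)."""
        self.nodes = sorted(nodes)
        n = len(self.nodes)
        # Round-robin placement is a simplification chosen for the example.
        self.owners = [
            [self.nodes[(pid + r) % n] for r in range(min(self.replicas, n))]
            for pid in range(N_PARTITIONS)
        ]

    def route(self, set_name, user_key, failed=()):
        """Return the first node holding the record that is not marked failed."""
        for node in self.owners[partition_id(set_name, user_key)]:
            if node not in failed:
                return node
        raise RuntimeError("no reachable replica for this key")
```

Because the client resolves the owning node itself, a request goes straight to that node, and a failed node is handled by falling through to the next copy rather than by an intermediate proxy.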
Distribution layer
The Aerospike "shared nothing" architecture is designed to reliably store terabytes of data with automatic failover, replication, and cross-data-center synchronization. This layer scales linearly and implements ACID guarantees. The distribution layer also aims to eliminate manual operations by automating all cluster management functions of the system. It consists of three modules:
- Cluster Management Module: tracks the nodes in the cluster. The key algorithm is a Paxos-like consensus voting process that determines which nodes are considered part of the cluster. Aerospike implements a dedicated heartbeat (active and passive) to monitor connectivity between nodes.
- Data Migration Module: when a node is added or removed and cluster membership has been determined, each node uses a hash algorithm to divide the primary index space into data slices and assign owners. The Data Migration Module then intelligently balances the distribution of data across the nodes in the cluster and ensures that each piece of data is replicated across nodes and data centers according to the replication factor configured for the system. Data partitioning is purely algorithmic, so the system scales without a master and needs none of the additional configuration that a sharded environment requires.
- Transaction Processing Module: reads and writes data as requested and provides the consistency and isolation guarantees.
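To make "purely algorithmic" partitioning concrete, the sketch below lets every node derive the same owner list for a partition from the membership list alone, with no master. It uses rendezvous (highest-random-weight) hashing as a stand-in; the text does not specify Aerospike's actual algorithm, and the function names, the partition count, and the SHA-256 scoring are assumptions made for the example.

```python
import hashlib


def _score(node: str, pid: int) -> int:
    """Deterministic pseudo-random weight for a (node, partition) pair."""
    h = hashlib.sha256(f"{node}/{pid}".encode()).digest()
    return int.from_bytes(h[:8], "big")


def replica_list(nodes, pid, replication_factor=2):
    """Nodes that hold partition `pid`: master first, then replicas."""
    ranked = sorted(nodes, key=lambda n: _score(n, pid), reverse=True)
    return ranked[:replication_factor]


def migrations(old_nodes, new_nodes, n_partitions=4096, replication_factor=2):
    """Partitions whose owner list changes when cluster membership changes."""
    return [
        pid
        for pid in range(n_partitions)
        if replica_list(old_nodes, pid, replication_factor)
        != replica_list(new_nodes, pid, replication_factor)
    ]
```

With this scheme, removing a node only affects partitions that node was holding, which is the property a migration step relies on to keep data movement small.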
Clustering
Once the first cluster is up and running, you can install additional clusters in other data centers and configure cross-data-center replication between them. If the local cluster goes down, the remote cluster can take over the load.
Data storage layer
Aerospike stores key-value pairs in a schemaless data model. Data is organized into containers called namespaces, which are comparable to databases in a relational system. Within a namespace, data is subdivided into sets (similar to tables) and records (similar to rows). Within a set, each record has a unique indexed key and one or more bins (similar to columns) associated with it.
- Sets and bins do not need to be defined in advance; they can be added at runtime.
- Values in bins are strongly typed and can be of any supported data type. Bins themselves are not typed, so the same bin name can hold values of different types in different records.
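The namespace, set, record, and bin hierarchy can be pictured with a small in-memory model. This is a toy stand-in written for illustration, not the Aerospike client API; the `Namespace` class and its `put` and `get` methods are hypothetical names.

```python
class Namespace:
    """Toy in-memory stand-in for a namespace (hypothetical, not the client API)."""

    def __init__(self, name):
        self.name = name
        self.sets = {}  # set name -> {record key: {bin name: value}}

    def put(self, set_name, key, bins):
        """Write bins into a record, creating the set and bins on first use."""
        record = self.sets.setdefault(set_name, {}).setdefault(key, {})
        record.update(bins)

    def get(self, set_name, key):
        """Return a copy of the record's bins, or None if it does not exist."""
        record = self.sets.get(set_name, {}).get(key)
        return None if record is None else dict(record)


ns = Namespace("test")
ns.put("users", "u1", {"name": "Ada", "age": 36})
ns.put("users", "u2", {"name": "Bob", "age": "unknown"})  # same bin, different type
ns.put("users", "u1", {"email": "ada@example.com"})       # new bin added at runtime
```

Nothing is declared up front: the `users` set and the `email` bin exist only because a write referenced them, and the `age` bin holds an integer in one record and a string in another.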
For fast access, indexes (primary and secondary) are stored in memory, while data can be stored in memory or on SSDs. Each namespace is configured separately, so small namespaces can be kept in memory while larger ones are stored on SSD.
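Since the primary index lives in memory, its footprint can be estimated with simple arithmetic. The sketch below uses the figure of 64 bytes per key quoted later in this section; the helper name and the `replication_factor` parameter (which assumes an index entry is kept for every copy of a record) are assumptions added for the example.

```python
INDEX_ENTRY_BYTES = 64  # per-key index size quoted in this section


def index_memory_gb(record_count: int, replication_factor: int = 1) -> float:
    """Estimate primary index memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return record_count * INDEX_ENTRY_BYTES * replication_factor / 1e9


single_copy = index_memory_gb(100_000_000)        # 100 million keys, one copy
two_copies = index_memory_gb(100_000_000, 2)      # same keys, two copies
```

For 100 million keys this gives 100,000,000 x 64 bytes = 6.4 GB for a single copy, which is why a very large key space can still be indexed entirely in memory.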
The data layer is designed for speed and for lower hardware costs. It can run entirely in memory, which removes the need for a separate caching layer, or it can use storage optimized for flash. In either case, data is not lost.
- 100 million keys occupy only about 6.4 GB (100 million × 64 bytes). Although keys have no size limit, each key is stored efficiently in just 64 bytes.
- Native, multi-threaded, multi-core flash I/O and an Aerospike log-structured file system take advantage of low-level SSD read and write patterns. In addition, writes to disk are performed in large blocks to reduce latency. This mechanism bypasses the standard file system, which has traditionally been tuned for rotational disks.
- Built-in Smart Defragmenter and Intelligent Evictor. These processes work together to ensure that data in memory is not lost and is safely written to disk.
  - The Defragmenter tracks the number of active records in each block and reclaims blocks that fall below a minimum level of use.
  - The Evictor removes expired records and reclaims memory when the system reaches the high-water mark. Expiration times are configured per namespace. A record's age is measured from its last modification. The application can override the default lifetime and can specify that a record never expires.
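The expiry and high-water-mark behaviour described above can be sketched as follows. This is a simplified model under stated assumptions, not Aerospike's implementation: the `TtlStore` class, the use of a record count as the measure of memory use, and the policy of evicting the records closest to expiry first are all hypothetical.

```python
import time

_DEFAULT = object()  # sentinel: use the namespace default lifetime


class TtlStore:
    """Toy record store with per-record expiry and a high-water mark."""

    def __init__(self, default_ttl, capacity, high_water=0.5):
        self.default_ttl = default_ttl  # seconds; stands in for the namespace setting
        self.capacity = capacity        # maximum record count (stand-in for memory)
        self.high_water = high_water    # fraction of capacity that triggers eviction
        self.records = {}               # key -> (value, expires_at or None)

    def put(self, key, value, ttl=_DEFAULT, now=None):
        """Write a record; its lifetime restarts from this modification."""
        now = time.time() if now is None else now
        if ttl is _DEFAULT:
            ttl = self.default_ttl
        expires_at = None if ttl is None else now + ttl  # None means never expire
        self.records[key] = (value, expires_at)

    def evict(self, now=None):
        """Drop expired records, then trim down to the high-water mark."""
        now = time.time() if now is None else now
        for key, (_, exp) in list(self.records.items()):
            if exp is not None and exp <= now:
                del self.records[key]
        limit = int(self.capacity * self.high_water)
        excess = len(self.records) - limit
        if excess > 0:
            # Assumption: records nearest to expiry go first; never-expiring
            # records are not candidates.
            candidates = sorted(
                (exp, key) for key, (_, exp) in self.records.items() if exp is not None
            )
            for _, key in candidates[:excess]:
                del self.records[key]
```

Passing `ttl=None` models a record that the application has marked as never expiring, and rewriting a key resets its expiry because the lifetime is counted from the last modification.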
Operate Aerospike
In a traditional (non-distributed) database system, you define a schema and create databases and tables after installing the software. Aerospike works very differently.
In a distributed database, data is spread across the servers in the cluster, so no single server holds all of the data.
To create and manage an Aerospike database, follow these steps:
- Plan and configure the initial database settings. In Aerospike, a database is called a namespace. When you install the system, each node in the cluster must specify how each namespace is created and replicated. The database is created when you restart the service.
- Perform database operations through the application. The database schema is created when the application first references a set or bin; the application simply stores data in the specified bin. In Aerospike, tasks that a DBA would usually carry out with a command-line program are performed by the application instead.
- Modify the configuration as needed. To update the configuration parameters of a namespace, either change them dynamically or restart the service with a new configuration file.
To meet performance and redundancy requirements, you need to plan and configure the number of nodes. For details, see Capacity Planning.
You can use management utilities and monitoring tools to manage and monitor nodes as needed. The cluster reconfigures itself automatically when a node is added or goes down for an upgrade or maintenance. When a node fails, the cluster rebalances the load to minimize the impact on end users.
Build applications
Once a namespace has been created, Aerospike provides tools that let you verify that the database is storing data correctly. In a production database, data is distributed across the cluster. To perform database operations, you instantiate the Smart Client in your application. The Smart Client is location-aware and knows how to store and retrieve data in the cluster without affecting performance.
Aerospike provides APIs in multiple languages for building big data applications. For more information, see the client manual.
When you compile an application, the API library is included along with the Smart Client. To determine where data is located at any given time, the Smart Client continuously monitors the cluster state. The Smart Client's location awareness ensures that, in most cases, the required data can be retrieved in a single hop.
For big data applications, such as web-based applications, the Smart Client allows the application to ignore the details of data distribution. For more information, see the architecture guide.
In this document, the terms API and client are used interchangeably: an application that integrates the Aerospike API integrates the Smart Client at the same time.
<Http://www.aerospike.com/docs/architecture/>