Following the previous post, MemSQL Distributed Architecture Introduction (I), the original documentation is here: http://docs.memsql.com/latest/concepts/distributed_architecture/
First, I have drawn the diagrams according to my own understanding; if there are errors, please point them out.
Several concepts
1. MemSQL has two types of tables:
- Reference table: the data is present on the master aggregator and every leaf node, and each node holds a complete copy (no partitioning). Reference tables synchronize data from the master aggregator to each leaf node via replication. In addition, reference tables can only be written through the master aggregator.
- Sharded table: the data is distributed across the leaf nodes by hash sharding, so each leaf node stores only part of the data.
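To make the contrast concrete, here is a minimal Python sketch of how the two table types place a row across nodes. This is a hypothetical model for illustration only; the node names and functions are my own, not MemSQL's actual implementation.

```python
# Hypothetical cluster: three leaf nodes (names are made up for the sketch).
LEAF_NODES = ["leaf1", "leaf2", "leaf3"]

def place_reference_row(row):
    # A reference-table row is replicated: every leaf node
    # (and the master aggregator) gets a full copy.
    return {node: row for node in LEAF_NODES}

def place_sharded_row(row, primary_key):
    # A sharded-table row lands on exactly one leaf node,
    # chosen by hashing the primary key.
    node = LEAF_NODES[hash(primary_key) % len(LEAF_NODES)]
    return {node: row}

print(place_reference_row({"id": 1}))  # copy on every leaf
print(place_sharded_row({"id": 1}, 1))  # copy on a single leaf
```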
Querying MemSQL
Note: I understand "query" in the official documentation to also cover insert, update, and delete operations; if I have misunderstood, please point it out.
MemSQL compiles a statement and caches it in memory the first time that query is executed.
A user's query is always directed to an aggregator. DDL operations and writes to reference tables must go through the master aggregator, while other DML statements can go through any aggregator.
Queries that touch only reference tables are executed entirely on the aggregator. The aggregator does not send these queries to the leaf nodes, because every aggregator node and leaf node has a copy of each reference table.
Queries against sharded tables involve more steps:
- In the simplest case, the data a query needs lives in a single partition, so the query can be forwarded directly to the correct leaf node. For example, INSERT INTO db.table VALUES (15) is forwarded as INSERT INTO db_3.table VALUES (15): the database name is rewritten to map the statement to the target partition.
- If a query requires data from multiple partitions, the aggregator fetches data from multiple leaf nodes. For example, SELECT COUNT(*) FROM t sends a COUNT(*) to each partition, rolls up the partial results, and finally returns a single row to the user.
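The fan-out-then-rollup step above can be sketched in a few lines of Python. The partition names and row data are invented for the example; the real protocol between aggregator and leaves is internal to MemSQL.

```python
# Hypothetical partition contents (lists stand in for stored rows).
partitions = {
    "db_0": [10, 25, 31],
    "db_1": [7, 42],
    "db_2": [3, 18, 64, 99],
}

def count_star(partitions):
    # The aggregator sends COUNT(*) to every partition...
    partial_counts = [len(rows) for rows in partitions.values()]
    # ...then rolls the partial counts up into one row for the user.
    return sum(partial_counts)

print(count_star(partitions))  # → 9
```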
Some queries involve much more query rewriting and aggregation logic, but they all follow the same general process. You can use the EXPLAIN keyword on a query to show the execution plan between the aggregator and the leaf nodes, including the rewritten queries that will be sent to the leaf nodes.
Data distribution
MemSQL distributes the data of a sharded (hash-partitioned) table by hashing each row's primary key. Since each primary key is unique and the hash function is close to uniform, the cluster can spread data fairly evenly and minimize data skew.
When a database is created, MemSQL splits it into several partitions, each owning its own hash range. You can explicitly specify the number of partitions with the PARTITIONS=X option. By default, the total number of partitions is 8 times the number of leaf nodes.
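Putting the default partition count together with the database-name rewrite described earlier, a minimal sketch of single-partition routing might look like this. The function name and hashing scheme are assumptions for illustration; MemSQL's actual hash function and rewrite logic are internal.

```python
NUM_LEAVES = 2
NUM_PARTITIONS = 8 * NUM_LEAVES  # default: 8 partitions per leaf node

def rewrite_for_partition(db, table, key):
    # Hash the primary key to pick a partition, then rewrite the
    # database name so the INSERT targets that partition directly.
    p = hash(key) % NUM_PARTITIONS
    return f"INSERT INTO {db}_{p}.{table} VALUES ({key})"

print(rewrite_for_partition("db", "table", 15))
```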
The partitions on each leaf node are implemented as databases. When a sharded table is created, it is split according to the number of partitions in the database, and each partition stores its own slice of the table's data. Secondary indexes are managed locally by each partition, using the prefix of each row's primary key as a unique index.
If you run a query that needs a secondary index, the aggregator fans the query out to every partition in the cluster, and each partition performs its own secondary-index lookup.
A query that exactly matches the shard key is routed to a single leaf node (as I understand it, this applies to INSERT and other write statements whose hash value can be determined, because from the hash value MemSQL knows where the record lives). Otherwise, the aggregator broadcasts the query across the cluster and collects the results. You can use the EXPLAIN keyword on a query to inspect the leaf nodes involved and the query-distribution strategy.
Availability Groups
An availability group is a collection of leaf nodes that store redundant data to ensure high availability. Each availability group contains a copy of every partition (some as masters, some as slaves). Currently, MemSQL supports at most two availability groups; you can set the number through the redundancy_level variable on the master aggregator. The discussion here assumes the redundancy-2 scenario (two availability groups).
Each leaf node in an availability group has a paired node in the other availability group. When a partition fails, MemSQL automatically promotes that partition's slave copy to master.
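The redundancy-2 failover described above can be sketched as follows. The group and leaf names are invented, and this is only a toy model of the promotion behavior, not MemSQL's actual failover code.

```python
# Each partition has a master copy in one availability group
# and a slave copy in the other (names are hypothetical).
partitions = {
    "db_0": {"master": "group1/leaf1", "slave": "group2/leaf3"},
    "db_1": {"master": "group2/leaf4", "slave": "group1/leaf2"},
}

def fail_leaf(partitions, dead_leaf):
    for copies in partitions.values():
        if copies["master"] == dead_leaf:
            # Promote the slave copy to master; the failed
            # copy is gone until the leaf recovers.
            copies["master"], copies["slave"] = copies["slave"], None
    return partitions

fail_leaf(partitions, "group1/leaf1")
print(partitions["db_0"]["master"])  # → group2/leaf3
```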
Introduction to MemSQL Distributed Architecture (II)