1. What is Citus
Citus is a PostgreSQL extension: it is installed alongside PostgreSQL and then enabled in a database through SQL commands.
"Related Actions"
-- Create the Citus extension:
CREATE EXTENSION citus;
2. Nodes
2.1. Coordinator node (abbreviated CN)
The coordinator stores all of the metadata but none of the actual data. It serves the application, sends query requests to the worker nodes, and merges their results. To the application it acts as a server; to the worker nodes it acts somewhat like a client.
2.2. Worker node (abbreviated WN)
A worker stores no metadata but holds the actual data. It executes query requests sent by the coordinator. In principle it does not serve the application directly, although tables on a worker node can also be operated on directly.
"Related Actions"
-- Add worker nodes on the coordinator:
SELECT * FROM master_add_node('192.168.7.130', 5432);
SELECT * FROM master_add_node('192.168.7.131', 5432);
SELECT * FROM master_add_node('192.168.7.132', 5432);
-- List the active worker nodes:
SELECT * FROM master_get_active_worker_nodes();
   node_name   | node_port
---------------+-----------
 192.168.7.130 |      5432
 192.168.7.131 |      5432
 192.168.7.132 |      5432
3. Shards and placements (replicas)
The data of one logical table is stored in separate physical tables according to some distribution policy; each such physical table is called a shard. Shards and worker nodes are different concepts: one worker node can hold many shards, and a logical table split into two shards may even have both shards stored on the same worker node.
Sharding principles
When designing a distributed database, the designer must decide how data is distributed across sites: how the global data is partitioned logically and placed physically, which data should be distributed, which does not need distributed storage, and which data should be replicated. These choices must be weighed as a whole so that overall system performance is optimal. In any case, the following guidelines should be followed:
Completeness: every row of the global data is mapped to some shard.
Reconstructability: all shards together must be able to reconstruct the global data.
Disjointness: the data contained in different shards does not overlap.
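For a hash-range sharding scheme like the one Citus uses, the three guidelines above can be checked mechanically. A minimal sketch in Python (the four equal ranges below are illustrative, not Citus's actual shard assignments):

```python
# Illustrative hash-range sharding over the signed 32-bit hash space,
# split into 4 equal ranges (example values, not real Citus metadata).
HASH_MIN, HASH_MAX = -2**31, 2**31 - 1
WIDTH = 2**30
shards = [(HASH_MIN + i * WIDTH, HASH_MIN + (i + 1) * WIDTH - 1) for i in range(4)]

def complete(shards):
    # Completeness: every sampled hash value falls into some shard.
    return all(any(lo <= h <= hi for lo, hi in shards)
               for h in (HASH_MIN, -1, 0, HASH_MAX))

def disjoint(shards):
    # Disjointness: sorted ranges never overlap.
    s = sorted(shards)
    return all(s[i][1] < s[i + 1][0] for i in range(len(s) - 1))

def reconstructs(shards):
    # Reconstructability: ranges stitched together cover the whole hash space.
    s = sorted(shards)
    return (s[0][0] == HASH_MIN and s[-1][1] == HASH_MAX
            and all(s[i][1] + 1 == s[i + 1][0] for i in range(len(s) - 1)))

print(complete(shards), disjoint(shards), reconstructs(shards))  # True True True
```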
A placement is a replica of a shard, i.e. shard-level redundancy.
Related operations
-- Configure the distribution policy (after creating the table on the CN):
SELECT master_create_distributed_table('test_table', 'id', 'hash');
-- Create the shards (split the table into 3 shards with 2 replicas each):
SELECT master_create_worker_shards('test_table', 3, 2);
-- View the shards:
SELECT * FROM pg_dist_shard;
 logicalrelid | shardid | shardstorage | shardminvalue | shardmaxvalue
--------------+---------+--------------+---------------+---------------
 test_table   |  102001 | t            | -2147483648   | -1610612737
 test_table   |  102002 | t            | -1610612736   | -1073741825
 test_table   |  102003 | t            | -1073741824   | -536870913
-- View the shard placements:
SELECT * FROM pg_dist_shard_placement ORDER BY shardid, placementid;
 shardid | shardstate | shardlength |   nodename    | nodeport | placementid
---------+------------+-------------+---------------+----------+-------------
  102001 |          1 |           0 | 192.168.7.130 |     5432 |          33
  102001 |          1 |           0 | 192.168.7.131 |     5432 |          34
  102002 |          1 |           0 | 192.168.7.131 |     5432 |          35
  102002 |          1 |           0 | 192.168.7.132 |     5432 |          36
  102003 |          1 |           0 | 192.168.7.132 |     5432 |          37
  102003 |          1 |           0 | 192.168.7.130 |     5432 |          38
From the output above, the table test_table is split into 3 shards (102001, 102002, 102003), and each shard has 2 replicas, stored on two adjacent nodes, as shown in the following figure.
In the placement table, a shardstate of 1 means the replica is valid (in sync), and a shardstate of 3 means the replica is invalid (failed).
4. Data Access
When rows are inserted into test_table through the CN, Citus selects the target shard for each row according to the hash value of its id column, following the distribution policy defined above.
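The routing step can be sketched in Python: hash the distribution column, then find the shard whose [shardminvalue, shardmaxvalue] range contains the hash. Citus uses PostgreSQL's internal int4 hash function; the hash_id function below is a stand-in, and only the three shard ranges shown in the excerpt above are included:

```python
# Shard ranges as listed in pg_dist_shard above (only these three are known
# from the excerpt; real metadata would cover the whole hash space).
shards = {
    102001: (-2147483648, -1610612737),
    102002: (-1610612736, -1073741825),
    102003: (-1073741824, -536870913),
}

def hash_id(i: int) -> int:
    # Stand-in hash mapped into the signed 32-bit range.
    # NOT PostgreSQL's hashint4; for illustration only.
    return (i * 2654435761) % 2**32 - 2**31

def route(i: int):
    # Return the shardid whose range contains hash_id(i), if any.
    h = hash_id(i)
    for shardid, (lo, hi) in shards.items():
        if lo <= h <= hi:
            return shardid
    return None  # hash falls outside the ranges listed in the excerpt

print(route(2))
```

Because the excerpt lists only part of the hash space, route() can return None here; with complete metadata every id would map to exactly one shard.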
When WN3 is offline, querying test_table through the CN still returns correct results, because WN1 and WN2 together already hold all of the shards. Checking the placement table at this point shows that all replicas are still marked valid.
When WN3 is offline and test_table is modified (insert/update/delete) through the CN: if the affected rows belong to shard 102001, Citus modifies the test_table_102001 tables on WN1 and WN2, and no replica state changes; if the affected rows belong to shard 102002, Citus can only modify the test_table_102002 table on WN2 (since WN3 is offline), and placement 36 is marked invalid (failed) in the placement metadata. Note that placement 37 is still marked valid (in sync).
After WN3 comes back online, checking the placement metadata shows that placement 36 is still invalid (failed); it is not repaired automatically. From this point on, all reads and writes for shard 102002 are served only by placement 35.
5. Shard Repair
"Related Actions"
-- First view the shard placements:
SELECT * FROM pg_dist_shard_placement ORDER BY shardid, placementid;
 shardid | shardstate | shardlength |   nodename    | nodeport | placementid
---------+------------+-------------+---------------+----------+-------------
  102001 |          1 |           0 | 192.168.7.130 |     5432 |          33
  102001 |          1 |           0 | 192.168.7.131 |     5432 |          34
  102002 |          1 |           0 | 192.168.7.131 |     5432 |          35
  102002 |          3 |           0 | 192.168.7.132 |     5432 |          36
  102003 |          1 |           0 | 192.168.7.132 |     5432 |          37
  102003 |          1 |           0 | 192.168.7.130 |     5432 |          38
-- Overwrite placement 36 with the data of placement 35:
SELECT master_copy_shard_placement(102002, '192.168.7.131', 5432, '192.168.7.132', 5432);
-- View the shard placements again:
SELECT * FROM pg_dist_shard_placement ORDER BY shardid, placementid;
 shardid | shardstate | shardlength |   nodename    | nodeport | placementid
---------+------------+-------------+---------------+----------+-------------
  102001 |          1 |           0 | 192.168.7.130 |     5432 |          33
  102001 |          1 |           0 | 192.168.7.131 |     5432 |          34
  102002 |          1 |           0 | 192.168.7.131 |     5432 |          35
  102002 |          1 |           0 | 192.168.7.132 |     5432 |          36
  102003 |          1 |           0 | 192.168.7.132 |     5432 |          37
  102003 |          1 |           0 | 192.168.7.130 |     5432 |          38
-- Placement 36 has been repaired.
A repair is possible only if the shard was created with more than one replica and currently has at least one valid replica. As far as is known, Citus does not repair placements automatically. One can develop a daemon that monitors the state of each node and replica and, when an invalid replica is found, calls master_copy_shard_placement to implement automatic repair.
6. Cluster Performance
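The decision logic such a daemon would need can be sketched as a pure function: given the placement rows, pick a valid placement of each shard as the source and emit one master_copy_shard_placement call per invalid placement. This is a minimal sketch, not part of Citus itself; the row layout mirrors pg_dist_shard_placement (shardstate 1 = valid, 3 = invalid):

```python
# Sketch of an external repair daemon's decision step: for each shard with
# an invalid placement, generate the master_copy_shard_placement call that
# copies from a still-valid placement. (Illustrative helper, not a Citus API.)

def repair_calls(placements):
    """placements: iterable of (shardid, shardstate, nodename, nodeport, placementid)."""
    by_shard = {}
    for shardid, state, node, port, _pid in placements:
        by_shard.setdefault(shardid, []).append((state, node, port))
    calls = []
    for shardid, reps in by_shard.items():
        good = [(n, p) for s, n, p in reps if s == 1]
        bad = [(n, p) for s, n, p in reps if s == 3]
        if good:  # repair requires at least one valid replica of the shard
            src_node, src_port = good[0]
            for dst_node, dst_port in bad:
                calls.append(
                    "SELECT master_copy_shard_placement"
                    f"({shardid}, '{src_node}', {src_port}, '{dst_node}', {dst_port});"
                )
    return calls

rows = [  # state of shard 102002 after the WN3 failure described above
    (102002, 1, '192.168.7.131', 5432, 35),
    (102002, 3, '192.168.7.132', 5432, 36),
]
print(repair_calls(rows))
```

Running the daemon would then mean executing the generated statements on the coordinator and re-checking pg_dist_shard_placement.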
Using the TPC-B model of the pgbench test tool, a 1 CN + 2 WN Citus cluster (two shards, one replica each) was compared with a traditional single-node instance, both based on PostgreSQL 10. With 1 million rows, the results were tps[citus] = 258 and tps[pg10] = 688; that is, under this configuration the overall read/write throughput of the Citus cluster is about 37.5% of single-node PG10.
With reasonable sharding, most operations can be executed directly on the WNs, which effectively improves the efficiency of the Citus cluster. However, when replicas are used, the application must ensure consistency between the different replicas of the same shard.
7. Reference Documents
"The Partitioning Method of Distributed Databases"
"Deploying a Citus Distributed Database Based on PostgreSQL 10 on CentOS 7.2"
"Citus Preliminary Test"
"Citus Performance Test"
"A Study of Citus Data Shard Distribution (1: Operating Tables Directly on Worker Nodes)"
"A Study of Citus Data Shard Distribution (2: Replicas and Failures)"
"A Study of Citus Data Shard Distribution (3: Manual Repair of Node Failures)"