Original: http://mp.weixin.qq.com/s/cqIK5Bv1U0mT97C7EOxmnA
A minimalist tutorial on distributed unique IDs
One, introduction
Every business system needs to generate IDs: order IDs, product IDs, article IDs, and so on. The ID usually serves as the unique primary key in the database, and a clustered index is built on it.
ID generation has two core requirements:
Globally unique
Trend-increasing (roughly ordered)
Two, why IDs must be globally unique
The famous example is the ID card number. The number itself really is unique, but one person can hold several cards that carry it: if you lose your card and get a replacement, the number does not change.
The problem is that systems use the ID number as the unique primary key. If a card is stolen, there is no way to cancel it in the system, because the old card and the new card share the same "primary key", the ID number.
In other words, the old card is still in circulation and remains fully usable. Fortunately each card carries a validity period, so the only way to tell the cards apart is by that period. This is the root of much bank and telecom fraud today: someone who picks up a lost ID card can use it at many banks, mobile phone shops, and hotels. The ID card lacks a cancellation mechanism.
So experience tells us: do not trust your intuition. What the business calls "unique" is often unreliable and does not stand the test of time. You therefore need a primary key that is independent of the business; the technical term is surrogate key.
This is also why, among the database normal forms, having a unique primary key belongs to the first normal form.
Three, why IDs should be trend-increasing
Take MySQL as an example. An InnoDB table is an index-organized table (IOT) built on a B+ tree. Every table needs a clustered index, all row records are stored in the leaf nodes (leaf pages) of that B+ tree, and inserts, deletes, updates, and lookups that go through the clustered index are the most efficient. InnoDB chooses the clustered index as follows:
If a primary key (PRIMARY KEY) is defined, InnoDB uses it as the clustered index;
If no primary key is explicitly defined, InnoDB uses the first unique index whose columns are all NOT NULL as the clustered index;
If there is no such unique index, InnoDB uses a built-in 6-byte ROWID as a hidden clustered index (the ROWID increases as rows are written and, unlike Oracle's ROWID, cannot be referenced; it is implicit).
In summary, access is most efficient when rows are written to an InnoDB table in the same order as the leaf nodes of the B+ tree index. That is the case in the following two situations:
The primary key is an auto-increment column (INT/BIGINT). Writes arrive in increasing order, which matches the order in which the B+ tree leaf nodes split;
The table has neither an auto-increment primary key nor a unique index that can be chosen as the primary key (see above). InnoDB then uses the built-in ROWID as the primary key, and the write order matches the ROWID growth order;
By contrast, if an InnoDB table has no explicit primary key but does have a unique index that can be chosen as the primary key, and that index is not increasing (for example a string, a UUID, or a multi-column composite unique index), access to the table will be inefficient.
That is why our distributed IDs should be trend-increasing. So what are the common solutions when this kind of distributed ID requirement comes up in development?
Four, database auto-increment sequence or column
The most common approach. The database generates the ID, which is unique across that database.
Advantages:
1) Simple, easy to code, acceptable performance.
2) Numeric IDs sort naturally, which helps with pagination and with results that need sorting.
Disadvantages:
1) Syntax and implementation differ between databases, which has to be handled when migrating databases or supporting several database versions.
2) With a single database, read/write splitting, or one master with multiple slaves, only the one master can generate IDs, so there is a risk of a single point of failure.
3) It is hard to scale when performance falls short of requirements.
4) Merging multiple systems or migrating data is rather painful.
5) Sharding databases and tables causes trouble.
Optimization:
1) To address the single master, use several masters, give each a different starting number, and set the step to the number of masters. For example, Master1 generates 1, 4, 7, 10, Master2 generates 2, 5, 8, 11, and Master3 generates 3, 6, 9, 12. This produces IDs that are unique across the cluster and significantly reduces the ID generation load on any one database.
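The start-and-step scheme can be checked with a few lines of Python. This is only a sketch of the arithmetic, not of any database feature: `master_sequence` and `NUM_MASTERS` are hypothetical names, and the three "masters" are simulated in memory. In MySQL the same effect is usually configured per server with the `auto_increment_offset` and `auto_increment_increment` variables.

```python
from itertools import islice


def master_sequence(offset, step):
    """Yield the IDs one master would hand out: offset, offset + step, ..."""
    current = offset
    while True:
        yield current
        current += step


NUM_MASTERS = 3  # hypothetical cluster size, matching the example in the text

# Master k starts at k and steps by the number of masters,
# so no two masters can ever produce the same ID.
sequences = {
    f"master{k}": list(islice(master_sequence(k, NUM_MASTERS), 4))
    for k in range(1, NUM_MASTERS + 1)
}

for name, ids in sequences.items():
    print(name, ids)
```

The drawback is visible in the code: the step is fixed to the number of masters, so adding a fourth master later means renumbering every sequence.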
Five, UUID
A common approach. A UUID can be generated by the database or by the program, and is generally globally unique.
Advantages:
1) Simple, easy to code.
2) Generation is very fast; there are basically no performance problems.
3) Globally unique, so data migration, merging data from several systems, or changing databases can be handled calmly.
Disadvantages:
1) Not ordered, so a trend-increasing sequence cannot be guaranteed.
2) UUIDs are often stored as strings, which makes queries less efficient.
3) The storage footprint is relatively large; for a massive database, storage volume has to be considered.
4) More data to transmit.
5) Not human-readable.
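The storage cost mentioned above is easy to see with Python's standard `uuid` module. The snippet below is a small illustration, assuming random (version 4) UUIDs; the variable names are only for this example.

```python
import uuid

u = uuid.uuid4()      # random (version 4) UUID
as_text = str(u)      # canonical text form, e.g. stored in a CHAR(36) column
as_bytes = u.bytes    # raw form, e.g. stored in a BINARY(16) column

print(as_text, len(as_text))   # 36 characters including hyphens
print(len(as_bytes))           # 16 bytes

# Random UUIDs are unique in practice but carry no ordering:
batch = [uuid.uuid4() for _ in range(1000)]
print(len(set(batch)))         # 1000 distinct values
```

Storing the 16-byte form instead of the 36-character string reduces the space problem, but it does not make the values trend-increasing.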
Six, generating IDs with Redis
When generating IDs in the database is not fast enough, we can try generating them with Redis. This relies mainly on the fact that Redis executes commands on a single thread, so it can also be used to generate globally unique IDs. It can be implemented with the atomic Redis operations INCR and INCRBY.
You can use a Redis cluster to obtain higher throughput. Suppose a cluster has 5 Redis instances. Initialize them with the values 1, 2, 3, 4, 5 and set the step to 5. The IDs generated by each instance are:
a: 1, 6, 11, 16, 21
b: 2, 7, 12, 17, 22
c: 3, 8, 13, 18, 23
d: 4, 9, 14, 19, 24
e: 5, 10, 15, 20, 25
Requests can be load-balanced to any machine, but once the scheme is fixed it is hard to change later. Still, 3 to 5 servers are basically enough, and every one of them yields distinct IDs. The step and the initial values must be decided beforehand. Using a Redis cluster also helps avoid a single point of failure.
In addition, Redis is well suited to generating a serial number that restarts from 0 every day. For example, order number = date + that day's auto-increment number. Create one key per day in Redis and increment it with INCR.
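The "date + daily counter" order number can be sketched as follows. To keep the example runnable without a server, `FakeRedis` is an in-memory stand-in that only mimics INCR; it is not a real client. The key name `order:seq:<date>` and the 6-digit padding are assumptions made for this illustration.

```python
from collections import defaultdict
from datetime import date


class FakeRedis:
    """In-memory stand-in for a Redis client; it only mimics INCR."""

    def __init__(self):
        self._counters = defaultdict(int)

    def incr(self, key):
        # Like Redis INCR: a missing key counts as 0, so the first call returns 1.
        self._counters[key] += 1
        return self._counters[key]


def next_order_number(client, today):
    """Build an order number as date + zero-padded daily sequence."""
    day = today.strftime("%Y%m%d")
    seq = client.incr(f"order:seq:{day}")  # hypothetical key name, one key per day
    return f"{day}{seq:06d}"


client = FakeRedis()
print(next_order_number(client, date(2024, 1, 2)))
print(next_order_number(client, date(2024, 1, 2)))
print(next_order_number(client, date(2024, 1, 3)))
```

Because each day gets its own key, the counter restarts automatically when the date changes. With a real Redis server, the daily keys would normally also be given an expiry so that old counters do not accumulate.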
Advantages:
1) Does not depend on the database, is flexible and convenient, and performs better than the database.
2) Numeric IDs sort naturally, which helps with pagination and with results that need sorting.
Disadvantages:
1) If the system does not already use Redis, a new component has to be introduced, which increases system complexity.
2) Requires a fair amount of coding and configuration work.
Seven, Twitter's snowflake
While migrating its storage from MySQL to Cassandra, Twitter developed a globally unique ID generation service, snowflake, because Cassandra has no sequential ID generation mechanism. A snowflake ID consists of:
1) 41-bit timestamp (millisecond precision; 41 bits last about 69 years)
2) 10-bit machine ID (10 bits allow up to 1024 nodes)
3) 12-bit sequence number (12 bits let each node generate 4,096 IDs per millisecond)
The highest bit is the sign bit and is always 0.
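The bit layout above can be sketched in Python as follows. This is a minimal illustration of the layout, not Twitter's implementation: the class name `SnowflakeSketch`, the injectable `clock` parameter, and the `EPOCH_MS` value are assumptions made for the example. The 69-year figure follows from 2^41 ms / (1000 × 3600 × 24 × 365) ≈ 69.7 years.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch (an assumption for this sketch)
MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_MACHINE_ID = (1 << MACHINE_BITS) - 1   # 1023
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1    # 4095


class SnowflakeSketch:
    def __init__(self, machine_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id <= MAX_MACHINE_ID:
            raise ValueError("machine_id must fit in 10 bits")
        self.machine_id = machine_id
        self.clock = clock          # returns the current time in milliseconds
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                # Clock moved backwards: refuse rather than risk duplicate IDs.
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # All 4096 numbers of this millisecond are used: wait for the next.
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.sequence = 0
            self.last_ms = now
            # 41-bit timestamp | 10-bit machine ID | 12-bit sequence
            return (((now - EPOCH_MS) << (MACHINE_BITS + SEQUENCE_BITS))
                    | (self.machine_id << SEQUENCE_BITS)
                    | self.sequence)


gen = SnowflakeSketch(machine_id=5)
print(gen.next_id(), gen.next_id())
```

Raising an error on a backwards clock is only one possible policy; the sketch chooses it because it makes the clock dependency explicit.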
Advantages:
High performance, low latency, and independent of other applications;
Ordered by time.
Disadvantages:
Requires independent development and deployment.
Strongly depends on the clock: if the host clock is set back, duplicate IDs may be generated.
IDs are ordered but not consecutive.
Eight, MongoDB's ObjectId
MongoDB's ObjectId is similar to the snowflake algorithm. It is designed to be lightweight, so that different machines can easily generate globally unique IDs using the same method. MongoDB was designed as a distributed database from the outset, and handling multiple nodes is a core requirement, so IDs need to be easy to generate in a sharded environment.
An ObjectId occupies 12 bytes, laid out as follows:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| timestamp     | machine ID|  PID  |   counter   |
The first four bytes are a timestamp counted in seconds from the Unix epoch. It has the following characteristics:
1) Combined with the 5 bytes that follow it, the timestamp guarantees uniqueness at second granularity;
2) It ensures that documents are roughly sorted by insertion time;
3) It implies the creation time of the document;
4) The actual value of the timestamp is not important, and the servers do not need synchronized clocks (the machine ID and process ID already make the value unique, and uniqueness is the ultimate goal of ObjectId).
The machine ID identifies the server host, usually as a hash of the host name.
Several mongod instances can run on the same machine, so the process identifier (PID) is also included.
The first 9 bytes guarantee that ObjectIds generated in the same second by different processes on different machines are unique. The last three bytes are an auto-incrementing counter (each mongod process needs one global counter) that keeps ObjectIds generated by the same process within the same second unique. Each process can therefore generate up to 256^3 = 16777216 distinct ObjectIds per second.
Summary: the timestamp gives second-level uniqueness; the machine ID accounts for distribution and avoids the need for clock synchronization; the PID keeps multiple mongod instances on the same server unique; and the counter keeps IDs unique within one second (the number of bytes chosen balances storage cost against the upper limit of concurrent throughput).
"_id" can be generated on either the server or the client; generating it on the client reduces pressure on the server.
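The 12-byte layout described above can be assembled by hand to see how the parts fit together. The sketch below follows the layout in this article, not the code of any MongoDB driver; `objectid_sketch` and `creation_time` are hypothetical helper names, and using the first 3 bytes of a SHA-256 hash of the host name as the machine ID is an assumption.

```python
import hashlib
import itertools
import os
import socket
import struct
import time

# 3-byte machine ID: first 3 bytes of a hash of the host name (an assumption).
_MACHINE = hashlib.sha256(socket.gethostname().encode()).digest()[:3]
# Per-process counter, started at a random value.
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))


def objectid_sketch(now=None):
    """Build a 12-byte ID: timestamp + machine ID + PID + counter."""
    ts = int(time.time()) if now is None else now
    pid = os.getpid() & 0xFFFF            # keep the low 16 bits of the PID
    count = next(_counter) & 0xFFFFFF     # wrap the counter at 3 bytes
    return (struct.pack(">I", ts)         # bytes 0-3: seconds since the Unix epoch
            + _MACHINE                    # bytes 4-6: machine ID
            + struct.pack(">H", pid)      # bytes 7-8: process ID
            + count.to_bytes(3, "big"))   # bytes 9-11: counter


def creation_time(oid):
    """Recover the creation time (seconds) from the first 4 bytes."""
    return struct.unpack(">I", oid[:4])[0]


oid = objectid_sketch()
print(oid.hex(), creation_time(oid))
```

Big-endian packing puts the timestamp in the most significant bytes, which is why IDs compared as raw bytes sort roughly by creation time.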
Nine, snowflake-like algorithms
Many Chinese companies have adapted the snowflake algorithm, for example:
Baidu's uid-generator:
https://github.com/baidu/uid-generator
Meituan's Leaf:
https://github.com/zhuzhong/idleaf
These are basically further optimizations of snowflake, for example to solve the clock rollback problem.
Ten, summary
Overall, a distributed unique ID should meet the following criteria:
High availability: no single point of failure.
Global uniqueness: duplicate IDs must never occur; since the ID is a unique identifier, this is the most basic requirement.
Trend-increasing: MySQL's InnoDB engine uses a clustered index, and most RDBMSs store index data in B-tree structures, so we should choose an ordered primary key wherever possible to keep write performance high.
Time-ordered: IDs sort chronologically, or contain a timestamp. This saves one index and also makes it easy to separate hot and cold data.
Sharding support: the shard ID can be controlled. For example, all of one user's articles should land in the same shard, which makes queries efficient and modifications easy.
Monotonically increasing: the next ID is guaranteed to be greater than the previous one, as needed for transaction version numbers, incremental IM messages, sorting, and so on.
Moderate length: not too long, preferably 64 bits. A long is easy to work with; a 96-bit ID is awkward to shift, and some components may not support IDs that large.
Information security: if IDs are consecutive, scraping by malicious users becomes trivial, since they can simply download the target URLs in order; for order numbers it is even more dangerous, because competitors can directly learn our daily order volume. So some scenarios need IDs with no discernible pattern.