This article mainly describes how to generate globally unique IDs in a distributed system.

Preface
Simply generating a global ID is not a challenge; the challenge is generating globally unique IDs that satisfy the following requirements:
- The generated IDs are globally unique.
- Future data migration between shards is not constrained by how the IDs were generated.
- Ideally the ID carries time information, e.g. the first k bits of the ID are a timestamp, so that data can be sorted by time simply by sorting on the ID.
- IDs can be generated fast enough. In a high-throughput scenario you may need tens of thousands of IDs per second (Twitter's recent peak reached 143,199 tweets/s, i.e. more than 100,000/s).
- The whole service has no single point of failure.

Problem description
When the user base surges and the system architecture evolves to a certain stage, it is usually redesigned around sharded databases and tables. For example, the user table (T_user) is split by user ID: IDs in [0, 999999] are stored in T_user_0, IDs in [1000000, 1999999] in T_user_1, and so on. How do we assign a globally unique ID to each of these users?
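To make the range-based routing concrete, here is a small sketch; the table-name pattern and the range size of 1,000,000 follow the example above, and the helper itself is hypothetical, not part of the original scheme.

public class ShardRouter {
    private static final long RANGE_SIZE = 1_000_000L; // each table holds one million user IDs

    // Maps a user ID to the table that stores it, e.g. 42 -> T_user_0, 1500000 -> T_user_1.
    public static String tableFor(long userId) {
        return "T_user_" + (userId / RANGE_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(tableFor(42L));        // T_user_0
        System.out.println(tableFor(1_500_000L)); // T_user_1
    }
}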
Several ways to generate global IDs

1. Database auto-increment ID

When the service uses only a single database and a single table, the database's auto_increment can be used to generate a globally unique, incrementing ID.
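As an illustration of this scheme, the sketch below inserts a row and reads back the key that the database generated. It assumes a MySQL table t_user with an AUTO_INCREMENT primary key and a local JDBC connection; the table name and connection details are illustrative, not from the original article.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class AutoIncrementIdDemo {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/demo"; // assumed connection URL
        try (Connection conn = DriverManager.getConnection(url, "root", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO t_user (name) VALUES (?)",
                     Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, "alice");
            ps.executeUpdate();
            // The database assigns the AUTO_INCREMENT primary key; read it back.
            try (ResultSet keys = ps.getGeneratedKeys()) {
                if (keys.next()) {
                    System.out.println("generated id = " + keys.getLong(1));
                }
            }
        }
    }
}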
Advantages: simple; no extra code is needed to maintain it; the ID is incrementing, of bounded length, and stays unique within a single table.
Disadvantages: poor performance under high concurrency, since primary key generation is capped by a single database server. Horizontal scaling is difficult, and in a sharded or distributed database environment uniqueness can no longer be guaranteed.
2. UUID

UUIDs can be generated with the standard library of most programming languages, for example UUID.randomUUID().toString() in Java. The ID is generated locally by the service process and does not depend on a database.
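A minimal example of the local generation mentioned above:

import java.util.UUID;

public class UuidDemo {
    public static void main(String[] args) {
        // Generated entirely in-process; no database or remote call involved.
        String id = UUID.randomUUID().toString();
        System.out.println(id); // a 36-character string such as "3f2b8a4e-..."
    }
}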
Advantages: the ID is generated locally, so no remote call is needed; it is globally unique without duplicates; horizontal scalability is excellent.
Disadvantages: a UUID is 128 bits long, occupies a lot of space, and has to be stored as a string, so index efficiency is very low. The generated ID contains no timestamp, and there is no guarantee of an increasing trend.

3. Flickr global primary key generation scheme
Flickr cleverly combines MySQL's auto-increment ID with the REPLACE INTO syntax, implementing a shard ID generation service that is very simple. See: http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
For example, to generate 64-bit auto-increment IDs, first create a table:

CREATE TABLE `uid_sequence` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `stub` char(1) NOT NULL default '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM;
SELECT * FROM uid_sequence output:

+-------------------+------+
| id                | stub |
+-------------------+------+
| 72157623227190423 | a    |
+-------------------+------+
If I need a globally unique 64-bit UID, I execute:

REPLACE INTO uid_sequence (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
Description: the advantage of REPLACE INTO over INSERT INTO is that the table never grows large, so no periodic cleanup is needed. The stub field carries a unique index, and the sequence table holds only a single record, yet it can generate global primary keys for multiple tables at once. Only when a table's primary keys must be contiguous, such as a dedicated user_order_id, do you create a separate sequence table like user_order_id_sequence. In actual comparison tests, MyISAM performed better than InnoDB for this workload.
To avoid a single point of failure and to balance the load, Flickr runs two (or more) such databases as auto-increment sequence generators:
TicketServer1:
auto-increment-increment = 2
auto-increment-offset = 1
TicketServer2:
auto-increment-increment = 2
auto-increment-offset = 2
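A client can then alternate between the two ticket servers when it needs a new ID; the offsets of 1 and 2 with a step of 2 keep the two ID streams disjoint. The sketch below only illustrates that idea, assuming JDBC connections to the two servers and the uid_sequence table above; the class name and connection details are not from Flickr's article.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.atomic.AtomicInteger;

public class TicketClient {
    // Assumed JDBC URLs for TicketServer1 and TicketServer2.
    private final String[] urls = {
        "jdbc:mysql://ticket1:3306/tickets",
        "jdbc:mysql://ticket2:3306/tickets"
    };
    private final AtomicInteger counter = new AtomicInteger();

    public long nextId() throws SQLException {
        // Round-robin across the servers so neither one is a single point.
        String url = urls[Math.floorMod(counter.getAndIncrement(), urls.length)];
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement st = conn.createStatement()) {
            st.executeUpdate("REPLACE INTO uid_sequence (stub) VALUES ('a')");
            try (ResultSet rs = st.executeQuery("SELECT LAST_INSERT_ID()")) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}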
Advantages: simple and reliable.
Disadvantages: the ID is just a number; it carries no timestamp, shard ID, or other information.

4. Twitter Snowflake
Twitter uses ZooKeeper to implement snowflake, a global ID generation service: https://github.com/twitter/snowflake
The unique ID generated by snowflake is composed of (from high bits to low bits): 41 bits: timestamp (in milliseconds); 10 bits: node ID (datacenter ID 5 bits + worker ID 5 bits); 12 bits: sequence number.
That is 64 bits in total (the highest bit is 0).
Unique ID generation process:
- The 10-bit machine number is obtained from a ZooKeeper cluster when the ID-assigning worker starts, which guarantees that no two workers share a machine number.
- The 41-bit timestamp: every time a new ID is to be generated, the current timestamp is read, and the sequence number is then produced in one of two ways. If the current timestamp equals the timestamp of the previously generated ID (i.e. the same millisecond), the new sequence number is the previous ID's sequence number + 1 (12 bits); if all sequence numbers in this millisecond are exhausted, the worker waits for the next millisecond (no new IDs can be assigned while waiting). If the current timestamp is larger than the previous ID's timestamp, a random initial sequence number (12 bits) is generated as the first sequence number of this millisecond.
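A minimal Java sketch of this bit layout and sequence handling; it is a simplification rather than Twitter's actual code, and the custom epoch, the way the node ID is supplied, and the sequence reset to 0 (instead of a random start) are assumptions made here for brevity.

public class SnowflakeLikeGenerator {
    private static final long EPOCH = 1577836800000L; // assumed custom epoch: 2020-01-01T00:00:00Z
    private static final long NODE_BITS = 10L;        // datacenter ID 5 bits + worker ID 5 bits
    private static final long SEQ_BITS = 12L;

    private final long nodeId;                        // in the real service, assigned via ZooKeeper
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeLikeGenerator(long nodeId) {
        this.nodeId = nodeId & ((1L << NODE_BITS) - 1);
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            // Same millisecond: bump the 12-bit sequence; if exhausted, wait for the next millisecond.
            sequence = (sequence + 1) & ((1L << SEQ_BITS) - 1);
            if (sequence == 0) {
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { /* spin */ }
            }
        } else {
            // New millisecond: restart the sequence (a production version would also
            // guard against the clock moving backwards).
            sequence = 0;
        }
        lastTimestamp = now;
        // 41 bits of timestamp | 10 bits of node ID | 12 bits of sequence.
        return ((now - EPOCH) << (NODE_BITS + SEQ_BITS)) | (nodeId << SEQ_BITS) | sequence;
    }
}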
The only point in the whole process where a worker depends on anything external is at startup, when it obtains its worker number from ZooKeeper; after that each worker generates IDs independently, so the scheme is decentralized.

5. Instagram's practice
Building on Flickr's experience, Instagram uses features of the PostgreSQL database to achieve a simpler and more reliable ID generation service. Link: http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
The Instagram unique ID is composed of: 41 bits: timestamp (in milliseconds); 13 bits: the ID of the logical shard (up to 8 x 1024 = 8192 logical shards are supported); 10 bits: sequence number, so each shard can generate at most 1024 IDs per millisecond.
Instagram illustrates this with an example:
Suppose it is September 9th at 5:00pm and the millisecond count is 1387263000 (in the actual function below this is the current Unix time in milliseconds minus a custom epoch). First place this time value into the ID:
id = 1387263000 << (64-41)
Then place the shard ID into the ID. Suppose the user ID is 31341 and there are 2000 logical shards; the shard ID is then 31341 % 2000 = 1341:
id |= 1341 << (64-41-13)
Finally, place the auto-increment sequence, taken modulo 1024, into the ID. Suppose the previous sequence value was 5000, so the new value is 5001:
id |= (5001 % 1024)
This yields a complete, globally unique sharded ID.
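The same composition written out in Java, using the numbers from the example above (the epoch-relative millisecond value, user ID, shard count, and sequence value come from the example, not from Instagram's code):

public class InstagramStyleId {
    public static void main(String[] args) {
        long millisSinceEpoch = 1387263000L; // milliseconds since the custom epoch (from the example)
        long userId = 31341L;
        long logicalShards = 2000L;
        long sequence = 5001L;               // next value of the per-shard sequence

        long shardId = userId % logicalShards;   // 31341 % 2000 = 1341
        long id = millisSinceEpoch << (64 - 41); // 41 bits of time in the top bits
        id |= shardId << (64 - 41 - 13);         // 13 bits of logical shard ID
        id |= sequence % 1024;                   // 10 bits of sequence
        System.out.println(id);
    }
}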
We can return the ID to the application via the RETURNING keyword of the INSERT statement.
Here is a complete PL/pgSQL example (using the schema insta5):
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
And when creating the table, we do:
CREATE TABLE insta5.our_table (
    "id" bigint NOT NULL DEFAULT insta5.next_id(),
    ...rest of the table schema...
);
6. Other schemes
For example, MongoDB's ObjectId, which is 12 bytes long and encodes a timestamp. Link: https://docs.mongodb.com/manual/reference/method/ObjectId/
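The leading 4 bytes of an ObjectId are the creation time in seconds since the Unix epoch, so the timestamp can be read straight out of the hex string; a small illustrative sketch (the class and the sample ObjectId value are arbitrary, chosen here for demonstration):

import java.time.Instant;

public class ObjectIdTimestamp {
    public static void main(String[] args) {
        // A 24-hex-character ObjectId (12 bytes).
        String objectId = "507f1f77bcf86cd799439011";
        // The first 4 bytes (8 hex characters) encode seconds since the Unix epoch.
        long seconds = Long.parseLong(objectId.substring(0, 8), 16);
        System.out.println(Instant.ofEpochSecond(seconds));
    }
}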
References
http://darktea.github.io/notes/2013/12/08/Unique-ID