Every system needs to assign an ID to each user (or each piece of data) so it can be identified later. This requirement is very common and is trivial for systems with a small number of users, but it is far from simple for Internet systems with huge numbers of users. Below is an article from the developers of the hugely popular Instagram app; let's see how a company of only a dozen or so people solves this problem.
First, the original link: http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
What follows is a rough translation (for the full details, refer to the original article):
Sharding and IDs at Instagram
With more than 25 photos and 90 "likes" arriving every second, Instagram stores a large amount of data. To make sure all of our important data fits into memory and is available to users quickly, we shard our data; in other words, we place the data into many smaller buckets, each holding a part of the data.
We use Django with PostgreSQL as our backend database. After deciding to shard our data, the first problem we ran into was whether PostgreSQL could remain our primary data store, or whether we should switch to something else. We evaluated a few different NoSQL solutions, but ultimately decided that the solution best suited to our needs was to shard our data across a set of servers, each running a PostgreSQL database.
Before writing data into this group of PostgreSQL servers, however, we first had to solve how to assign a unique identifier to each piece of data in the database (for example, each photo posted to our system). The typical solution that works for a single database, simply using the database's auto-increment feature to assign unique IDs, no longer works when data is being inserted into many databases at the same time. The rest of this article explains how we tackled this problem.
Before getting started, we listed the features that were essential in our system:
1. Generated IDs should be sortable by time (so that, for example, a list of photo IDs can be sorted without looking up any other information about the photos).
2. IDs should ideally be 64 bits (for smaller indexes and better storage in systems like Redis).
3. The new system should introduce as few new "moving parts" (uncertainties) as possible; a large part of how we manage Instagram with so few engineers is choosing simple, easy-to-understand, reliable solutions.
Existing solutions:
There are already many existing solutions to the ID generation problem; these are the ones we considered:
Generate an ID in a web application
With this approach, all responsibility for ID generation is handed to your application code instead of the database.
For example, MongoDB's ObjectId is 12 bytes long and encodes a timestamp as its leading component. Another common approach is to use UUIDs (a brief sketch appears after the pros and cons below).
Advantages:
1. Each application thread generates IDs independently, minimizing points of failure and contention around ID generation.
2. If you use a timestamp as the first component of the ID, the IDs remain sortable by time.
Disadvantages:
1. Generally requires more storage space (96 bits or higher) to guarantee reasonable uniqueness.
2. Some UUID types are completely random and have no natural sort order.
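Purely as an illustration added in this translation (not part of the original article), here is a minimal sketch of a UUID-keyed table. The article is talking about generating IDs in the application, but the size and ordering trade-offs are the same if the random (version 4) UUID is produced in PostgreSQL instead; the table and column names below are made up for the example:
-- gen_random_uuid() comes from the pgcrypto extension (built in from PostgreSQL 13).
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE TABLE photos_by_uuid (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- 128 bits, not time-ordered
    posted_at timestamptz NOT NULL DEFAULT now(),
    data text
);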
Use a dedicated service to generate an ID
For example, Twitter's Snowflake is a service that uses Apache ZooKeeper to coordinate its nodes and then generates 64-bit unique IDs.
Advantages:
1. Snowflake IDs are 64 bits, half the size of a UUID.
2. A timestamp can be used as the leading component, so the IDs can be sorted by time.
3. It is a distributed system, so it can survive individual nodes going down.
Disadvantages:
1. It would introduce additional complexity and more uncertain "moving parts" (ZooKeeper, the Snowflake servers) into the overall architecture.
Database "ticket" Server
Use the auto-Increment Function of the database to ensure uniqueness. Flickr uses this method-but uses two database servers (one generates an odd number and the other generates an even number) to prevent single-point hosting.
Advantages:
1. Databases are well understood and have fairly predictable scaling behavior.
Disadvantages:
1. It can eventually become a write bottleneck (although Flickr reports that, even at huge scale, this has not been a problem).
2. It adds a couple of extra machines for administrators to manage.
3. With a single database, it is an easy single point of failure; with multiple databases, IDs can no longer be guaranteed to sort by time.
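For reference, here is a rough sketch of the ticket-server idea mentioned above, written in MySQL-flavoured SQL along the lines of Flickr's published design; the table name and values are illustrative rather than Flickr's exact code:
-- On ticket server A (my.cnf): auto_increment_increment = 2, auto_increment_offset = 1
-- On ticket server B (my.cnf): auto_increment_increment = 2, auto_increment_offset = 2
CREATE TABLE tickets64 (
    id bigint unsigned NOT NULL AUTO_INCREMENT,
    stub char(1) NOT NULL DEFAULT '',
    PRIMARY KEY (id),
    UNIQUE KEY (stub)
);
-- The application fetches a new ID with:
REPLACE INTO tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();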
Of all the approaches above, Twitter's Snowflake came closest to what we wanted, but the additional complexity of running a dedicated ID service counted against it.
Instead, we chose a conceptually similar approach and built it directly into PostgreSQL.
Our Solution
Our sharded system consists of several thousand "logical" shards that are mapped in code to a far smaller number of physical shards. With this approach we can start with just a few database servers and eventually move to many more: we only need to move a set of logical shards from one server to another, without having to re-bucket any of our data. To make this easy to script and administer, we used Postgres' schema feature.
Schemas (not the SQL schema of an individual table) are a logical grouping feature in Postgres. Each Postgres database can have several schemas, and each schema can contain one or more tables. Table names only have to be unique within a schema, not across the whole database, and by default Postgres places everything in a schema named "public".
In our system, each logical shard is a Postgres schema, and each sharded table exists inside each schema. We used PL/pgSQL (Postgres' built-in programming language) together with Postgres' own auto-increment (sequence) functionality to give each table in each shard its own ID generation function.
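As a hedged sketch of what that setup implies (the schema and sequence names match the example used later in this article, but the exact DDL is our addition, not taken from the original), each logical shard gets its own schema and its own sequence for the ID function to draw on:
-- One schema per logical shard; insta5 is the example shard used below.
CREATE SCHEMA insta5;
-- The per-shard sequence consumed by insta5.next_id() below.
CREATE SEQUENCE insta5.table_id_seq;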
Each ID consists of the following parts:
1. 41 bits for the time in milliseconds (measured from a custom epoch).
2. 13 bits that represent the logical shard ID.
3. 10 bits that store an auto-incrementing sequence value, modulo 1024. This means we can generate 1024 IDs per shard, per millisecond.
Let's walk through an example. Suppose that 1387263000 milliseconds have elapsed since the start of our custom epoch. To begin constructing our ID, we fill the left-most 41 bits with this value using a left shift:
id = 1387263000 << (64-41)
Next, we take the shard ID for the piece of data we're trying to insert. Suppose we shard by user ID, and there are 2000 logical shards in the system; if our user ID is 31341, then the shard ID is 31341 % 2000 -> 1341.
We fill the next 13 bits with this value:
id |= 1341 << (64-41-13)
Finally, we take the next value of the auto-incrementing sequence (a sequence that is unique to each table in each schema) and fill out the remaining bits. Suppose the table has already generated 5000 IDs, so the next value is 5001. We take it modulo 1024 (so it fits into 10 bits) and add it in:
id |= (5001 % 1024)
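As a quick sanity check (a sketch added in this translation, not part of the original article), the same composition and its reverse can be expressed as a single Postgres query using the bit layout described above:
-- Build the example ID from the three components, then split it apart again.
WITH example AS (
    SELECT ((1387263000::bigint << 23) | (1341 << 10) | (5001 % 1024)) AS id
)
SELECT id,
       id >> 23 AS millis_since_epoch,   -- 1387263000
       (id >> 10) & 8191 AS shard_id,    -- 1341 (13 bits)
       id & 1023 AS seq                  -- 5001 % 1024 = 905 (10 bits)
FROM example;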
We now have our ID, and we can hand it back to the application as part of the INSERT using the RETURNING keyword.
Here is the PL/pgSQL that implements all of this (for an example schema, insta5):
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE plpgsql;
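Assuming the insta5 schema and insta5.table_id_seq sequence sketched earlier already exist, the function can be exercised on its own (this test query is our addition, not part of the original article):
-- Returns a fresh 64-bit ID combining the current time, shard 5, and the next sequence value.
SELECT insta5.next_id();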
Create a database table as follows:
CREATE TABLE insta5.our_table (
    "id" bigint NOT NULL DEFAULT insta5.next_id(),
    ...rest of table schema...
);
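To show the RETURNING usage mentioned above (a hedged example added in this translation; it assumes the elided columns are nullable or have defaults), an insert hands the freshly generated, shard-aware ID straight back to the application:
-- The DEFAULT on "id" calls insta5.next_id(); RETURNING sends the new ID back to the app.
INSERT INTO insta5.our_table DEFAULT VALUES
RETURNING "id";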
And that's it! We now have primary keys that are unique across our application (and, as an added bonus, the shard ID is embedded in them for easier mapping). We've rolled this approach into production, and so far the results look quite satisfactory. Interested in helping us solve these problems? Here's our contact information:
Mike Krieger, co-founder