Summary of yupoo website architecture
I previously introduced the architecture of Flickr, the world's largest online image service. Yupoo (youpai) is the largest image service provider in China. Let's take a look at its architecture and at how it differs from Flickr, since both provide an image service. After reading this article, you can draw your own comparisons.
I. Basic information about the Yupoo website:
Bandwidth: 4000 Mb/s (reference)
Number of servers: About 60
Web servers: Lighttpd, Apache, and nginx
Application Server: Tomcat
Others: Python, Java, mogilefs, ImageMagick, etc.
[Figure: Yupoo architecture diagram]
II. About Squid and Tomcat
Squid and Tomcat are rarely seen in the architecture of Web 2.0 sites. I had some doubts about Squid in particular. Yupoo's explanation is: "We have not yet found a cache system more efficient than Squid. The hit rate was indeed poor at first. Later we put a layer of Lighttpd in front of Squid and hashed on the URL, so that the same image is always sent to the same Squid. The hit rate improved greatly."
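The URL-hashing idea in the quote can be sketched as follows. This is a minimal illustration, not Yupoo's Lighttpd configuration: the node names, the `pick_squid` function, and the choice of MD5 with a modulo are all assumptions made for the example.

```python
import hashlib

# Hypothetical cache node names; the source does not list the real Squid hosts.
SQUID_NODES = ["squid-1", "squid-2", "squid-3"]


def pick_squid(url: str, nodes=SQUID_NODES) -> str:
    """Map a URL to one cache node so repeated requests for it hit the same cache."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and reduce it to a node index.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

Because the mapping depends only on the URL, each image is cached on exactly one node instead of being duplicated across all of them, which is why the hit rate rises. A plain modulo remaps most URLs when a node is added or removed; the source does not say how Yupoo handles that case.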
As for Tomcat at the application server layer, Yupoo's engineers are gradually replacing it with lighter-weight alternatives. YPWS and YPFS are now developed in Python.
Terminology:
- YPWS (Yupoo Web Server): a small web server developed in Python. It provides basic web services and adds the logic that decides how users, images, and external-link websites are displayed. It can be installed on any server with idle resources, which makes horizontal scaling easy when a performance bottleneck appears.
- YPFS (Yupoo File System): similar to YPWS, an image upload server built on the same web server.
[Update: Some readers questioned the efficiency of Python. Yupoo's boss Liu Pingyang wrote on del.icio.us: "YPWS is written in Python. Each machine can process 294 requests per second, and the load is almost always below 10%."]
III. Image processing layer
The image process server handles the images uploaded by users. The software package used is again ImageMagick. During the last storage upgrade, the sharpening ratio was also adjusted (I personally feel that the result is indeed much better). Magickd is a remote interface service for image processing. Like memcached, it can be installed on any machine with idle CPU resources.
We know that Flickr originally generated its thumbnails with the ImageMagick package. After Flickr was acquired by Yahoo, it stopped using ImageMagick, for copyright reasons (?). Flickr extracts EXIF and IPTC data with Perl. I strongly recommend that Yupoo do more with EXIF, which is also a key to potential revenue.
IV. Image storage layer
Yupoo originally used disk array cabinets for storage, based on NFS. As the data volume increased, "the Yupoo development department began in March June to study a large-capacity, secure, and reliable storage system that can meet the needs of Yupoo's future development". Yupoo seems confident in this system and has high hopes for it. After all, it must support the storage and management of a massive number of images, measured in terabytes. Besides the original, each image also exists in several other sizes. These images are stored in MogileFS.
The remaining parts are the software commonly seen on Web 2.0 websites, such as MySQL, memcached, and Lighttpd. On the one hand, Yupoo uses a lot of relatively mature open-source software. On the other hand, it develops and customizes its own architecture components. This is also a path that a Web 2.0 company has to take.
V. Database sharding Design
Like many Web 2.0 sites that use MySQL, Yupoo's MySQL cluster has evolved from one master with one slave, to one master with multiple slaves, and then to multiple masters with multiple slaves.
Figure 3: Database evolution
The cluster originally consisted of one master database and one slave database. At that time the slave was used only for backup and disaster recovery: when the master failed, the slave was manually promoted to master. Normally the slave served no reads or writes (apart from replication). As the load increased, we added memcached, which at that time cached only single rows of data. However, caching single rows did not relieve the load much, because single-row queries are usually very fast anyway. We therefore moved some queries with low real-time requirements to the slave, and spread the query load further by adding more slaves. However, as the data volume grew, the write load on the master kept growing as well.
After studying some related products and the practices of other websites, we decided to split the database, that is, to store data on different database servers. Data can be split along two dimensions:
Vertical split: split by functional module. For example, group-related tables and photo-related tables are stored in different databases. The databases have different table structures.
Horizontal split: data from the same table is stored in different databases. The databases have identical table structures.
Splitting Method
Vertical splitting is usually considered first, because it is easy to implement: the application simply chooses the database by table name. However, vertical splitting cannot solve every load problem, and whether it helps depends on the type of application. Where it fits, it spreads the database load well. For example, I think vertical splitting suits Douban, because its core businesses/modules (books, movies, and music) are relatively independent and their data grows at a fairly steady rate. Yupoo is different: its core business object is the photos uploaded by users, and photo data grows faster and faster as the number of users increases. The load falls almost entirely on the photo table. Vertical splitting clearly does not solve our problem at its root, so we adopted horizontal splitting.
Sharding rule
Horizontal splitting is more complex to implement. First we need a splitting rule, that is, the condition by which data is divided. Web 2.0 websites are generally user-centered, and most data follows the user: a user's photos, friends, comments, and so on. A natural choice is therefore to split data by user. Each user is assigned to one database. To access a user's data, we first determine the database assigned to that user and then connect to it to read and write the actual data.
So how do we map users to databases? We have these options:
Matching by Algorithm
The simplest algorithm uses the parity of the user ID: users with odd IDs map to database A, and users with even IDs map to database B. The biggest problem with this method is that it supports only two databases. Another algorithm maps ranges of user IDs: for example, users with IDs between 0 and 10000 map to database A, users with IDs in the 10000-20000 range map to database B, and so on. Mapping by algorithm is convenient and efficient, but it cannot meet later scalability requirements. Adding database nodes means adjusting the algorithm or moving a large amount of data, so it is difficult to add nodes without stopping the service.
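The two algorithmic mappings described above can be sketched in a few lines. The function names, the database labels, and the choice to treat each range as half-open (0-9999, 10000-19999, ...) are assumptions for illustration, since the text leaves the boundary at 10000 ambiguous.

```python
def db_by_parity(user_id: int) -> str:
    """Odd IDs go to database A, even IDs to database B."""
    return "A" if user_id % 2 == 1 else "B"


RANGE_SIZE = 10000  # width of each ID range, as in the example above


def db_by_range(user_id: int, databases=("A", "B", "C")) -> str:
    """IDs 0-9999 go to the first database, 10000-19999 to the second, and so on."""
    index = user_id // RANGE_SIZE
    if index >= len(databases):
        # A new ID range needs a new database: the rule itself must be extended.
        raise ValueError("no database configured for this ID range")
    return databases[index]
```

Both functions are pure computations with no lookup cost, which is the efficiency the text mentions. The `ValueError` branch shows the scalability limit: the mapping is fixed in code, so growth means changing the rule or moving data.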
Mapping by index/mapping table
This method creates an index table that stores the mapping between each user's ID and a database ID. Every time user data is read or written, the corresponding database is first looked up in this table. When a new user registers, one of the available databases is chosen at random and an index entry is created for the user. This method is flexible and scales well. Its disadvantage is one extra database access, so it does not perform as well as mapping by algorithm.
After comparing the two, we adopted the index table method and accepted some loss of performance in exchange for its flexibility. Moreover, we also have memcached: because the index data basically never changes, the cache hit rate is very high, so the performance loss is greatly reduced.
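The index-table approach with a cache in front of it can be sketched as below. This is a hedged illustration only: `ShardIndex` and its attributes are hypothetical names, and plain dictionaries stand in for the MySQL index table and for memcached.

```python
import random


class ShardIndex:
    """User-to-database mapping table with a cache in front of it."""

    def __init__(self, available_dbs):
        self.available_dbs = list(available_dbs)
        self.index_table = {}  # stands in for the index table in MySQL
        self.cache = {}        # stands in for memcached
        self.table_reads = 0   # counts lookups that reached the index table

    def register_user(self, user_id):
        """Assign a new user to a randomly chosen available database."""
        db = random.choice(self.available_dbs)
        self.index_table[user_id] = db
        return db

    def lookup(self, user_id):
        """Return the user's database, consulting the cache first."""
        if user_id in self.cache:
            return self.cache[user_id]
        self.table_reads += 1
        db = self.index_table[user_id]
        self.cache[user_id] = db
        return db
```

After the first lookup for a user, later lookups are served from the cache, which is why the extra database access costs little in practice. Adding a node only means appending to `available_dbs`; existing entries are untouched.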
Figure 4: Data access process
With an index table, adding database nodes is easy: a new node only needs to be added to the list of available databases. Of course, to balance the load across nodes we still need to migrate data, but the amount to migrate is small and the migration can proceed step by step. To migrate user A's data, we first set the user's status to "migrating data". Users in this status cannot perform write operations, and a notice is shown on the page. We then copy all of user A's data to the new node, update the mapping table, set user A's status back to "normal", and delete the data from the original database. This process usually runs in the early hours of the morning, so few users ever encounter the "migrating data" status.
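The migration steps above can be sketched as follows. The class, the in-memory dictionaries, and the status strings are stand-ins chosen for this example; the real procedure runs against MySQL and the mapping table.

```python
class UserMigrator:
    """Moves one user's data between shards while blocking that user's writes."""

    def __init__(self, mapping, shards):
        self.mapping = mapping  # user_id -> shard name (the mapping table)
        self.shards = shards    # shard name -> {user_id: list of rows}
        self.status = {}        # user_id -> "normal" or "migrating"

    def can_write(self, user_id):
        """The application checks this before any write for the user."""
        return self.status.get(user_id, "normal") == "normal"

    def migrate(self, user_id, target):
        source = self.mapping[user_id]
        self.status[user_id] = "migrating"               # 1. block writes
        rows = list(self.shards[source][user_id])
        self.shards[target][user_id] = rows              # 2. copy data to the new node
        self.mapping[user_id] = target                   # 3. update the mapping table
        self.status[user_id] = "normal"                  # 4. allow writes again
        del self.shards[source][user_id]                 # 5. delete the old copy
```

The old copy is deleted only after the mapping points at the new node, so a reader is never routed to a shard that no longer holds the data.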
Of course, some data does not belong to any user, such as system messages and configuration. We store this data in a global database.
Problem
Database sharding can cause a lot of trouble in application development and deployment.
Cross-database Association queries cannot be performed
If the data to be queried is spread across different databases, we cannot fetch it with a join. For example, to get the latest photos of a user's friends, we cannot assume that all the friends' data is in the same database. One solution is to query several times and aggregate the results. We need to avoid requirements like this as much as possible. Some requirements can be met by storing multiple copies of the data. Suppose the databases of user-A and user-B are DB-1 and DB-2 respectively. When user-A comments on a photo of user-B, we save the comment in both DB-1 and DB-2: we first insert a new record into the photo_comments table in DB-2, and then insert a new record into the user_comments table in DB-1. Figure 5 shows the structure of the two tables. In this way, we can query the photo_comments table to get all the comments on one of user-B's photos, and we can query the user_comments table to get all the comments made by user-A. In addition, full-text retrieval tools can meet certain requirements. We use Solr to provide site-wide tag retrieval and photo search.
Figure 5: Comment table structure
Data Consistency/integrity cannot be guaranteed
Cross-database data has no foreign key constraints and no transaction guarantee. In the photo comment example above, the insert into photo_comments may succeed while the insert into user_comments fails. One approach is to open a transaction on each database, insert into photo_comments, insert into user_comments, and then commit both transactions. This still cannot fully guarantee that the operation is atomic.
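The two-transaction approach can be sketched with two in-memory SQLite databases standing in for DB-2 and DB-1. The column lists and function names are assumptions for this example (the table structure in Figure 5 is not reproduced here), and SQLite is used only so the sketch is runnable; the source uses MySQL.

```python
import sqlite3


def open_shard(table, columns):
    """Create an in-memory database holding one comment table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({columns})")
    conn.commit()
    return conn


def add_comment(db_photo_owner, db_commenter, photo_id, user_id, comment_id, text):
    """Write the comment to both shards; commit only if both inserts succeed."""
    try:
        db_photo_owner.execute(
            "INSERT INTO photo_comments VALUES (?, ?, ?, ?)",
            (photo_id, comment_id, user_id, text))
        db_commenter.execute(
            "INSERT INTO user_comments VALUES (?, ?, ?)",
            (user_id, comment_id, photo_id))
    except sqlite3.Error:
        # Either insert failed: undo the pending work on both shards.
        db_photo_owner.rollback()
        db_commenter.rollback()
        raise
    # The two commits are separate: a crash between them leaves the shards inconsistent.
    db_photo_owner.commit()
    db_commenter.commit()
```

This catches the common failure (an insert error) but, as the comment before the commits notes, it is not atomic: nothing protects the window between the first commit and the second.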
All queries must provide database clues
For example, to view a photo, the photo ID alone is not enough. You must also provide the ID of the user who uploaded the photo (that is, the database clue) so that its actual storage location can be found. We therefore had to redesign many URLs while keeping some old addresses valid. We changed the photo address to /photos/{username}/{photo_id}/. For photos uploaded before the system upgrade, we added a mapping table that stores the relationship between photo_id and user_id. When an old photo address is accessed, we query this table to find the user and redirect to the new address.
Auto-increment ID
If we use auto-increment fields in the node databases, we cannot guarantee that their values are globally unique. That is not a serious problem in itself, but it becomes troublesome when data on different nodes is related. Take the comment example above. Suppose comment_id in the photo_comments table is an auto-increment field. When we insert a new comment into DB-2.photo_comments we get a new comment_id, say 101. User-A's ID is 1, so we also insert a row (1, 101, ...) into DB-1.user_comments. User-A is a very active user and also comments on a photo of user-C, whose database is DB-3. By coincidence the ID of this new comment is also 101, which is quite likely to happen. We then insert another row (1, 101, ...) into DB-1.user_comments. So how do we define the primary key of the user_comments table (the key that identifies a row)? We could leave it undefined, but unfortunately it sometimes has to be set (because of the framework, the cache, and so on). We could use user_id, comment_id, and photo_id as a composite primary key, but photo_id may also coincide. It seems we would have to add photo_owner_id as well, which we find hard to accept: a complex composite key has some performance cost on writes, and such a natural key looks unnatural. We therefore gave up auto-increment fields on the nodes and looked for a way to make these IDs globally unique. To this end we added a database dedicated to generating IDs. Its table structure is very simple: there is only one auto-increment field, id.
When we want to insert a new comment, we first insert an empty record into the photo_comments table of the ID database to obtain a unique comment ID. This logic is encapsulated in our framework and is transparent to developers. Why not use another solution, such as a key-value database that supports an incr operation? We simply feel more at ease keeping the data in MySQL. In addition, we regularly clean up the data in the ID database to keep the generation of new IDs efficient.
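The dedicated ID database can be sketched as below. This is an illustration under stated assumptions: SQLite stands in for MySQL so that the example runs on its own, the `IdGenerator` class is a hypothetical name, and creating the ID table on first use is a convenience of the sketch rather than something the source describes.

```python
import sqlite3


class IdGenerator:
    """Hands out globally unique IDs from a table with one auto-increment column."""

    def __init__(self):
        # Stands in for the dedicated ID database.
        self.conn = sqlite3.connect(":memory:")

    def next_id(self, table):
        """Insert an empty record into the ID table and return the generated ID."""
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER PRIMARY KEY AUTOINCREMENT)")
        cursor = self.conn.execute(f"INSERT INTO {table} DEFAULT VALUES")
        self.conn.commit()
        return cursor.lastrowid

    def cleanup(self, table):
        """Delete used rows; AUTOINCREMENT keeps the counter from going backwards."""
        self.conn.execute(f"DELETE FROM {table}")
        self.conn.commit()
```

Every shard asks the same generator for a comment ID before inserting, so two comments stored on different shards can never share an ID, and (user_id, comment_id) becomes a workable key for user_comments.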
Implementation
We call a database node a shard. A shard consists of two physical servers, which we call node-A and node-B. They are configured as master-master and replicate to each other. Although this is a master-master deployment, we use only one of the two at any given time because of replication latency. In web applications we could store an A or B in the user session to ensure that a given user accesses only one database, which would avoid some latency issues. However, our Python tasks are stateless, so we cannot guarantee that they read and write the same database as the PHP application. So why not configure master-slave instead? We think it is wasteful to use only one of the two servers, so we create multiple logical databases on each server.
As shown in Figure 6, we created two logical databases, shard_001 and shard_002, on both node-A and node-B. shard_001 on node-A and shard_001 on node-B form one shard, and only one of the two logical databases is active at any time. To access shard_001 data we connect to shard_001 on node-A, while to access shard_002 data we connect to shard_002 on node-B. In this way the load is spread across both physical servers. Another advantage of master-master deployment is that we can upgrade the table structure without stopping the service. Before the upgrade, we stop replication, upgrade the inactive database, upgrade the application, switch the upgraded database to active and the previously active database to inactive, and then upgrade its table structure and resume replication. Of course, this procedure does not suit every upgrade. If the table structure change would cause replication to fail, you still need to stop the service for the upgrade.
Figure 6: Database layout
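The active/inactive arrangement can be expressed as a small routing table. The dictionary, the function names, and the "node/database" string format are assumptions for this sketch; the source does not show how the framework stores this configuration.

```python
# Hypothetical layout: which physical node is currently active for each logical shard.
ACTIVE_NODE = {
    "shard_001": "node-A",
    "shard_002": "node-B",
}


def connection_target(shard: str) -> str:
    """Return the 'node/database' pair the application should connect to."""
    return f"{ACTIVE_NODE[shard]}/{shard}"


def switch_active(shard: str) -> None:
    """Flip a shard to its other node, e.g. after upgrading the inactive copy."""
    ACTIVE_NODE[shard] = "node-B" if ACTIVE_NODE[shard] == "node-A" else "node-A"
```

With two shards, each node serves one active logical database, so neither server sits idle. During a schema upgrade, `switch_active` corresponds to the step where the upgraded copy becomes the active one.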
As mentioned above, when adding servers we need to migrate some data to the new servers to keep the load balanced. To avoid having to migrate user data in the short term, we deploy eight logical databases on each machine. After adding servers, we only need to move some of these logical databases to the new servers. It is best to double the number of servers each time and move half of the logical databases on each server to a new server, which balances the load well. Of course, once each machine holds only one logical database, data migration can no longer be avoided, but that should be a long way off.
We have encapsulated the database sharding logic in our PHP framework, and developers basically do not need to be troubled by these cumbersome tasks. The following are some examples of reading and writing photo data using our framework:
<?php
// The opening of this snippet is reconstructed from the description below; the table name "yp_photos" is assumed.
$Photos = new ShardedDBTable("Photos", "yp_photos", "user_id", array(
    "photo_id"    => array("type" => "long", "primary" => true, "global_auto_increment" => true),
    "user_id"     => array("type" => "long"),
    "title"       => array("type" => "string"),
    "posted_date" => array("type" => "date"),
));

$photo = $Photos->new_object(array("user_id" => 1, "title" => "Workforme"));
$photo->insert();

// Load the photo with ID 10001. Note that the first parameter is the user ID.
$photo = $Photos->load(1, 10001);

// Modify photo attributes
$photo->title = "Database Sharding";
$photo->update();

// Delete the photo
$photo->delete();

// Obtain the photos uploaded by the user with ID 1 after 2010-06-01
$photos = $Photos->fetch(array("user_id" => 1, "posted_date__gt" => "2010-06-01"));
?>
First, define a ShardedDBTable object. All APIs are exposed through this object. The first parameter is the object type name. If the name already exists, the previously defined object is returned. You can also use the get_table("Photos") function to obtain a previously defined table object. The second parameter is the corresponding database table name, and the third parameter is the database clue field. You will notice that the value of this field must be supplied in all subsequent API calls. The fourth parameter is the field definition. The global_auto_increment attribute of the photo_id field is set to true, which marks it as the global auto-increment ID described above. When this attribute is specified, the framework takes care of generating the ID.
To access data in the global database, we define a DBTable object.
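<?php
// The opening of this snippet is reconstructed; the object name, table name, and first field name are assumed.
$Users = new DBTable("Users", "yp_users", array(
    "user_id"  => array("type" => "long", "primary" => true, "auto_increment" => true),
    "username" => array("type" => "string"),
));
?>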
DBTable is the parent class of ShardedDBTable. Apart from the different constructor parameters (DBTable does not need a database clue field), DBTable provides the same APIs.
VI. Cache
Our framework provides caching, which is transparent to developers.
For example, in the load(1, 10001) call above, the framework first looks in the cache under the key photos-1-10001. If the photo is not found, the framework runs the database query and puts the result into the cache. When you change a photo attribute or delete a photo, the framework removes the photo from the cache. Caching a single object is relatively simple. Caching the results of list queries like the one below is a little more troublesome.
<?php
$photos = $Photos->fetch(array("user_id" => 1, "posted_date__gt" => "2010-06-01"));
?>
We split this query into two steps. The first step finds the IDs of the photos that meet the condition, and the second step loads the photo details by ID. This makes better use of the cache. The cache key for the first query is photos-list-{shard_key}-{MD5(query condition SQL statement)}, and the value is the list of photo IDs (separated by commas). Here shard_key is the value of user_id, 1. So far, caching the list does not look troublesome. However, if a user changes the upload time of a photo, the cached list may no longer satisfy the condition. We therefore need a mechanism that ensures we never read a stale list from the cache. We give each table a revision. Whenever data in the table changes (an insert/update/delete method is called), we update its revision, and we change the list cache key to photos-list-{shard_key}-{MD5(query condition SQL statement)}-{revision}. That way we no longer get stale lists.
The revision is itself stored in the cache, under the key photos-revision. This looks good, but the utilization of the list cache turns out not to be very high. Because we use the revision of the whole data type as the suffix of the cache key, the revision is updated very frequently: any user who modifies or uploads a photo updates it, even if that user is not in the shard we are querying. To isolate users' actions from one another, we narrow the scope of the revision by changing its cache key to photos-{shard_key}-revision. Now, when the user with ID 1 modifies a photo, only the revision under the key photos-1-revision is updated.
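The revision-scoped list cache can be sketched as follows. The key formats follow the text; the `ListCache` class, its method names, and the dictionary standing in for memcached are assumptions made for this example.

```python
import hashlib


class ListCache:
    """Caches ID lists under keys that embed a per-shard_key revision number."""

    def __init__(self):
        self.store = {}  # stands in for memcached

    def _revision(self, table, shard_key):
        return self.store.get(f"{table}-{shard_key}-revision", 0)

    def bump_revision(self, table, shard_key):
        """Call after every insert/update/delete for this shard_key."""
        self.store[f"{table}-{shard_key}-revision"] = self._revision(table, shard_key) + 1

    def list_key(self, table, shard_key, sql):
        digest = hashlib.md5(sql.encode("utf-8")).hexdigest()
        return f"{table}-list-{shard_key}-{digest}-{self._revision(table, shard_key)}"

    def get_ids(self, table, shard_key, sql, run_query):
        """Return the cached ID list, running the query only on a cache miss."""
        key = self.list_key(table, shard_key, sql)
        if key not in self.store:
            self.store[key] = ",".join(str(i) for i in run_query())
        value = self.store[key]
        return [int(i) for i in value.split(",")] if value else []
```

Stale lists are never deleted explicitly. Bumping the revision changes the key, so old entries simply stop being read and can be left to expire. A change by user 2 bumps only photos-2-revision, so user 1's cached lists stay valid.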
Because the global database has no shard_key, modifying a single row of a table in the global database still invalidates the cache for the entire table. In most cases, however, the data is regional. For example, the posts in our help forum belong to topics. When a post in one topic is modified, there is no need to invalidate the cache of all the other topics. We therefore added an attribute named isolate_key to DBTable.
<?php
// The opening of this snippet is reconstructed; the object name, table name, and first field name are assumed.
$HelpPosts = new DBTable("HelpPosts", "help_posts", array(
    "topic_id"    => array("type" => "long", "primary" => true),
    "post_id"     => array("type" => "long", "primary" => true, "auto_increment" => true),
    "author_id"   => array("type" => "long"),
    "content"     => array("type" => "string"),
    "posted_at"   => array("type" => "datetime"),
    "modified_at" => array("type" => "datetime"),
    "modified_by" => array("type" => "long"),
), "topic_id");
?>
Note the last constructor parameter, topic_id: it makes the topic_id field the isolate_key, which limits the scope of the revision in the same way that shard_key does.
ShardedDBTable inherits from DBTable, so it can also specify an isolate_key. Doing so narrows the scope of the revision much further. Take the table yp_album_photos, which associates albums with photos. Without an isolate_key, when a user adds a new photo to one of his albums, the cached photo lists of his other albums also become invalid. If we set the isolate_key of this table to album_id, the impact is limited to that one album.
Our cache has two levels. The first level is just a PHP array whose lifetime is a single request. The second level is memcached. The reason is that a lot of data is loaded several times within one request cycle, and the first level saves the corresponding network requests to memcached. In addition, our framework uses memcached's gets command wherever possible to fetch data, which further reduces network requests.
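The two-level lookup can be sketched as below. This is a simplified illustration: `TwoLevelCache` is a hypothetical name, a dictionary stands in for memcached, and a counter simulates network round trips so that the saving is visible.

```python
class TwoLevelCache:
    """Request-local cache in front of a shared cache, with batched fetches."""

    def __init__(self, remote):
        self.local = {}       # level 1: lives for one request
        self.remote = remote  # level 2: dict standing in for memcached
        self.round_trips = 0  # number of simulated network requests

    def get_many(self, keys):
        """Return {key: value} for the keys found, using one batched remote fetch."""
        found = {k: self.local[k] for k in keys if k in self.local}
        missing = [k for k in keys if k not in found]
        if missing:
            self.round_trips += 1  # one multi-key request instead of one per key
            for k in missing:
                if k in self.remote:
                    found[k] = self.local[k] = self.remote[k]
        return found
```

A new `TwoLevelCache` would be created at the start of each request so that the first level never outlives it. Keys already seen during the request cost no round trip at all, and any number of unseen keys cost exactly one.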
VII. References
http://www.kuqin.com/database/20100704/85908.html
http://www.dbanotes.net/arch/yupoo_arch.html