Weibo database: design ideas behind three stages of change

Editor's note: High-Availability Architecture shares and disseminates articles of typical significance in the architecture field. This article was shared by Xiao Peng in the High-Availability Architecture group. For more, follow "ArchNotes".



Xiao Peng, technical manager at the Weibo R&D center, is mainly responsible for business assurance, performance optimization, architecture design, and the surrounding automation systems for the Weibo databases (MySQL, Redis, HBase, and Memcached). With 10 years of experience in Internet database architecture and administration, he has been through every stage of the Weibo database's evolution, including service-assurance and SLA system construction, Weibo's multi-data-center deployment, and the Weibo platform transformation, and focuses on high-performance, high-availability technical support for databases.

 

The growth of database experts

"My relationship with MySQL started mainly out of interest. My first job was at a small company where, with limited manpower, I had to work across many areas, and I found databases interested me most, so I have been doing database-related work ever since. Experience accumulates with years on the job, and a database administrator (DBA) is a hands-on role: many theoretical things play out differently in practice, such as 'anti-normalization' (denormalized) design. So if you want to become a database expert, I recommend choosing the right environment: big platforms run into many challenges where quantitative change becomes qualitative change, and solving those problems is the only way to become a real expert." -- Xiao Peng


Changes in Weibo database experience

First, I would like to share with you several important stages of Weibo database experience.


Initial stage

As an early internal innovation product, Weibo had simple functions, and its database architecture adopted the standard one-master, one-slave (1M/1S) structure. It was designed around read/write splitting: writes go to the master and reads go to the slaves. If read pressure grew too high, capacity could be scaled out by adding slave databases.

In the figure, red indicates writes, green indicates reads, and black indicates the internal master-slave structure. As the figure shows, the business was only split vertically, that is, separate databases were used for business modules such as users, content, and relationships. For the early stage this was actually a very good architecture: it laid the foundation for decoupling by functional module, and when a problem occurred it was easy to locate it and degrade service by functional module right from the start.

 

Personally, I think this architecture was enough to carry the business growth of the early stage; there was no need to over-design. Making it too complicated at the beginning would have cost us agility.

 

Outbreak stage

As user activity grew after Weibo's launch, pressure on the database kept increasing. First, we bought high-performance hardware to scale up individual machines and support rapid business growth. Then, on top of that hardware, we vertically split Weibo's overall business, storing functional modules such as users, relationships, blog posts, reposts, and comments separately; on top of the vertical split, the modules expected to generate massive data were split again.

 

Let me say a bit more about hardware. Weibo hit its user-growth peak very early, when our technical accumulation was still thin and, more importantly, there was no time for an architecture overhaul, so many core businesses were carried by purchased PCIe flash devices. I still clearly remember how heavily the feed system depended on MySQL at the beginning: on the night of the 2012 Spring Festival Gala, write QPS to MySQL reached 35,000.

 

High-performance hardware may look far more expensive than ordinary hardware, but the time it buys is the most valuable thing. Early in a product's life, performance problems can easily cause faults that directly drive users away, a far greater loss. So in the early outbreak stage, I believe spending money to solve the problem is the most cost-effective option.

 

Back to database splitting, using blog posts as the example. Posts are the main content Weibo users produce, and it was foreseeable that they would eventually grow enormous along the time dimension. Meeting the business's performance requirements while using as little storage cost as possible was a challenging problem.

  • First, we separated the index from the content. The index needs little storage space while the content needs a lot, their usage patterns differ, and their access frequencies differ, so they need to be treated differently.

  • Then, we hashed the index and the content separately, and on top of the hash split them horizontally by the time dimension, keeping each table's size within a controllable range to guarantee query performance.

  • Finally, the business first obtains the ids of the required content from the index, then fetches the actual content from the content library, with memcached deployed to accelerate the whole flow. It looks like more steps, but the actual results fully meet the business's needs.
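As a toy illustration of this two-step read path, the sketch below uses plain Python dicts to stand in for the index library, the content library, and memcached; all names and data are invented, not Weibo's real code.

```python
# Two-step read path: filter the index store for blog ids, then resolve
# the ids to full content, going through a cache first.

index_store = {  # blog_id -> lightweight index row (small, hot)
    1: {"uid": 100, "ts": 1325347200},
    2: {"uid": 101, "ts": 1325433600},
    3: {"uid": 100, "ts": 1325520000},
}
content_store = {  # blog_id -> full post body (large, fetched on demand)
    1: "first post", 2: "second post", 3: "third post",
}
cache = {}  # stands in for memcached

def fetch_feed(uid):
    # Step 1: hit only the index to get the id list (small result set).
    ids = [bid for bid, row in index_store.items() if row["uid"] == uid]
    # Step 2: resolve ids to content, populating the cache on a miss.
    result = []
    for bid in ids:
        if bid not in cache:
            cache[bid] = content_store[bid]
        result.append(cache[bid])
    return result
```

The point of the split is that step 1 touches only small index rows, so MySQL's result set stays small even when the post bodies are large.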

 


At first glance this is just the database architecture diagram of the blog module, but you can see that both index and content are split across many instances (ports), each instance holds many databases, and the tables in each database are first hashed and then split by the time dimension. This way, when we hit a capacity or performance bottleneck later, we can choose to archive data or adjust the deployment structure very conveniently. Moreover, after archiving, different hardware can carry different services, improving hardware utilization and reducing cost.

 

At this stage we split and transformed many Weibo functions, such as users, relationships, blog posts, reposts, comments, and likes. With data sharded for essentially all core functions, the system could be transformed and adjusted according to plan whenever a bottleneck appeared.

 

Consolidation stage

In the previous stage, the Weibo databases went through a great deal of sharding and transformation, which directly caused exponential growth in scale; after that rapid growth, the business began to stabilize. At this stage we focused on automation, using automated tools to turn the experience accumulated during the rapid-expansion period into standardized, streamlined platform services. We successively built or rebuilt the backup system, monitoring system, AutoDDL system, MHA system, inspection system, slow-query system, and the maya middleware system. In addition to these internally facing management systems, we developed the iDB system for the users of the database platform, to improve their efficiency and cut down back-and-forth communication. Through iDB, business teams can easily see the running state of their databases and submit DDL change requests directly; the DBA only needs to click approve, and the change is handed to a robot for online execution, which improves not only efficiency but also safety and standardization.

 

Since many automation systems are involved, I will not describe them one by one. My personal understanding is that once a product develops to a certain stage, O&M inevitably enters an automation stage: human effort alone can no longer support the volume of changes and operations. Even so, many special cases still require human intervention, especially human judgment.

 

I want to emphasize the importance of standards here. With MySQL development standards, if conventions and restrictions are agreed in advance, developers may feel constrained, but completely uncontrollable online faults are avoided, and some classes of problems simply never happen because the standards exist.

 

For example, MySQL slow queries are the chief culprit of slow online performance, and in many cases the cause is not a missing index but sloppy code and implicit type conversion. For this, we generally recommend quoting every value in the WHERE conditions, which eliminates the possibility of implicit conversion outright; developers then no longer need to worry about whether a column is a string or an int when writing code.
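A minimal sketch of that convention, with an invented helper name: quote every literal unconditionally when building a WHERE clause, so a string-typed column is never compared against a bare number (which would trigger MySQL's implicit conversion and defeat the index). This is an illustration only, not a real SQL escaping library; production code should use the driver's parameterized queries.

```python
def build_where(conditions):
    """Build a WHERE clause with every value quoted as a string literal."""
    parts = []
    for col, val in conditions.items():
        # Quote unconditionally: '13800138000' matches a VARCHAR index,
        # while the bare integer 13800138000 would force a full scan.
        escaped = str(val).replace("'", "''")
        parts.append(f"{col} = '{escaped}'")
    return "WHERE " + " AND ".join(parts)
```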

 

Back to automation. After the initial and scale-up stages, there are ever more machines and relatively fewer people; that pressure naturally pushes everyone to seek automated solutions and carry out automation transformation. Of course, having time to develop once the business stabilizes is actually the more important reason.

 

Personally, I divide automation into two stages. The first is machines replacing manual labor: handing most mechanical work over to programs, solving batch operations and repetitive tasks. The second is machines replacing human decisions: the machine makes certain judgments and chooses on its own, freeing people up entirely. The second stage is the ideal state we keep pursuing; so far we have completed only some small, simple functions, such as dynamically adjusting max memory and other simple logic.

 

Optimization and design of Weibo database

Next, we will introduce some recent improvements and optimizations to the Weibo database platform.

 

The database platform does not only support MySQL; it also runs database services such as Redis, Memcached, and HBase. With cache being king, Weibo focused its database R&D effort on Redis in 2015.

 

Weibo adopted Redis early and at a large scale, so we ran into many practical problems in real use. Our internal branch version is optimized for those problems and has the following features.

  • Added pos-based (offset-based) synchronization. In Redis 2.4, once synchronization was interrupted, all data in the master was transferred to the slave again, causing an instantaneous network-bandwidth spike, and for businesses with large data volumes the slave took a long time to recover. To address this, we worked with the architecture team and borrowed MySQL's master-slave replication mechanism: the Redis AOF was modified to record a pos (position), and the slave records the pos it has synchronized to. This way, even if the network flaps, only the missing portion of data is transferred, and the business is unaffected.

  • Online hot upgrade. Early on, with new features being added constantly, the Redis version was upgraded again and again, and to avoid affecting the business every upgrade required a master switchover, a big burden on O&M. We therefore developed a hot-upgrade mechanism that dynamically loads libredis.so to change versions, so no master switchover is needed, greatly improving O&M efficiency and reducing the risk of changes.

  • Customization and transformation. Later on, because Weibo products have heavy counting requirements, we developed the Redis-protocol-compatible rediscounter specifically to store counting data; replacing the hash table with an array greatly reduced memory usage. After that, we developed phantom, based on a Bloom filter, to cover existence-check scenarios.
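The pos-based resync in the first bullet can be modeled in a few lines. The toy below keeps a replication backlog on the master; a reconnecting slave presents its offset, and if the offset is still covered by the backlog only the missing tail is shipped, otherwise a full copy is required. This is the same principle as Redis 2.8's later PSYNC, but the class and protocol here are invented for illustration, not Weibo's actual patch.

```python
class Master:
    """Toy master keeping a replication backlog for partial resync."""

    def __init__(self, backlog_size=4):
        self.log = []                    # full write history (demo only)
        self.backlog_size = backlog_size # how far back partial resync reaches

    def write(self, cmd):
        self.log.append(cmd)

    def sync_from(self, offset):
        missing = len(self.log) - offset
        if missing <= self.backlog_size:
            # Partial resync: ship only the commands after the slave's pos.
            return ("partial", self.log[offset:])
        # Offset too old for the backlog: fall back to a full resync.
        return ("full", list(self.log))
```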

 

Redis middleware

In 2015, our self-developed Redis middleware, the tribe system, was developed and launched. tribe adopts a proxy architecture with a central node, managing cluster nodes through a config server; it also borrows the slot-sharding design of the official Redis Cluster for data placement, and implements routing, sharding, automatic migration, and failover. APIs for operations and monitoring are also exposed to connect with our other automated O&M systems.

 


The main purpose of developing tribe was to solve automatic migration. Redis memory usage is volatile; it may be at 10% one day and 80% the next. Manual migration cannot keep up with such business changes, and if you happen to hit a physical-memory bottleneck it gets worse: rehashing the data with business involvement can cause faults.

 

Slot-based dynamic migration is, first of all, transparent to the business, and a whole spare server is no longer required: you only need to find a server with free memory and migrate some slots to it, which directly solves scaling and migration while greatly improving server utilization and reducing cost.
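A sketch of that rebalancing idea: instead of moving a whole server, pick hosts with spare memory and move individual slots until enough memory is reclaimed. The data structures, greedy strategy, and function name below are invented for illustration; tribe's real planner is not public.

```python
def plan_migration(slot_mem, node_free_mem, pressured_node, bytes_to_free):
    """Move slots off pressured_node until bytes_to_free is reclaimed.

    slot_mem: {node: {slot_id: bytes_used}}
    node_free_mem: {node: free_bytes} (mutated as targets are chosen)
    Returns a list of (slot_id, source_node, target_node) moves.
    """
    moves, freed = [], 0
    # Walk the pressured node's slots, largest first.
    slots = sorted(slot_mem[pressured_node],
                   key=slot_mem[pressured_node].get, reverse=True)
    for slot in slots:
        if freed >= bytes_to_free:
            break
        size = slot_mem[pressured_node][slot]
        # Greedily pick the node with the most free memory as the target.
        target = max(node_free_mem, key=node_free_mem.get)
        if target != pressured_node and node_free_mem[target] >= size:
            moves.append((slot, pressured_node, target))
            node_free_mem[target] -= size
            freed += size
    return moves
```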

 

The routing function lowers the development threshold: resource configuration no longer has to be written into code or front-end configuration files, and a release is no longer needed for every change. This greatly improves development efficiency and reduces the risk of faults introduced by online changes; after all, 90% of faults are caused by active changes.

 

Personally, I think every company has its own scenarios. Open-source software gives us a good solution but cannot fully fit our application scenarios, so modifying it is not unacceptable; some trade-offs simply have to be made.

 

Databus

Since we run databases such as Redis and HBase alongside MySQL, there is a scenario where data already written into MySQL also needs to be synchronized to other databases. We developed Databus to synchronize data to heterogeneous databases based on the MySQL binlog, with support for custom business logic. Currently the MySQL-to-Redis and MySQL-to-HBase flows are implemented; the next step is a Redis-to-MySQL flow.

 


The original motivation for developing Databus was the Redis write problem: some data needs to be written to both MySQL and Redis. Double-writing from the front end can solve it but complicates the code; implementing the data link in the back end keeps the code clearer and guarantees data consistency. Later, in practice, Databus gradually began to provide data-import functionality as well.

 

Next, let me describe the database design habits Weibo has accumulated. We generally adopt some "anti-normalization" (denormalized) design ideas; while denormalization brings convenience, it also brings problems, especially as data grows. There are several countermeasures.

  • Pre-splitting. Estimate capacity as soon as the requirement arrives; split vertically first, then horizontally. If the design can hinge on the time dimension, build in an archiving mechanism. Splitting databases and tables solves the capacity-storage problem.

  • Introduce a message queue. Use a queue's one-write, multiple-reads property (or multiple queues) to satisfy the multi-copy writes that redundant data requires. Only eventual consistency can be guaranteed, however, and data delay may occur in between.

  • Introduce an interface layer. Data from different business modules is aggregated by the interface layer before being returned to the application layer, reducing coding complexity for application developers.
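The message-queue approach in the second bullet can be sketched as follows: one write goes into the queue, and independent consumers replay it into each redundant store. The queue and stores here are plain Python objects standing in for real infrastructure (e.g. a message queue plus MySQL and Redis); since the consumers run independently in a real system, the copies are only eventually consistent.

```python
from collections import deque

queue = deque()
mysql_copy, redis_copy = {}, {}  # stand-ins for the redundant stores

def produce(key, value):
    # Single write from the business: one message into the queue.
    queue.append((key, value))

def consume_all():
    # Each store gets its own replay of the queued writes; in production
    # these consumers run independently and may lag behind one another.
    while queue:
        key, value = queue.popleft()
        mysql_copy[key] = value
        redis_copy[key] = value
```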

 

One more point: if a database is estimated to grow large, we follow the blog design from the start, separating index from content and designing hash plus time-dimension sharding, which minimizes the problems and pain of later splitting.

 

Plans for the future of Weibo database platform

Finally, I would like to share some thoughts on the future development of the Weibo database platform, hoping they offer you some ideas; of course, I also hope you will give me suggestions and opinions so we can take fewer detours and step into fewer pits.

 

As the business develops, we will meet more and more scenarios, and we hope to introduce the most suitable database for each, such as PostgreSQL and SSDB. At the same time, we keep an eye on new MySQL features, such as MySQL 5.7's parallel replication, GTID, and dynamic buffer-pool adjustment, and continuously optimize the performance and stability of existing services.

 

In addition, we are pushing the servitization of the existing NoSQL services: organizing storage nodes behind a proxy that serves the outside, which reduces development complexity for developers, gives finer-grained resource control, improves per-machine utilization internally, and removes the horizontal-scaling bottleneck at the resource layer.

 

At the same time, we are trying various cloud-computing resources to achieve dynamic expansion and contraction of the cache layer, making full use of elastic cloud resources to absorb fluctuations in business traffic.

Q & A

1. Is the separation of data and index implemented at the business layer or by middleware? It sounds like the middleware does a lot of work; could you expand on that part?

Since no middleware solution was considered during the initial splitting and transformation, the separation of index and content is implemented in the business logic. In my experience, even with a middleware solution, index and content should still be separated at the business-logic layer.


The core function of middleware is to isolate the program from the back-end resources: however many resources there are on the back end, the program sees one unified entry point. So middleware solves the horizontal-splitting problem, while separating index and content belongs to vertical splitting; personally, I don't think middleware should solve that.

 

2. Can you recall your most memorable database service failure and share some lessons?

The most memorable one: a colleague accidentally executed a drop table command. Anyone who knows databases knows how devastating that command is. Thanks to our architecture, we were able to urgently restore the single table; although that function was degraded for a while, users were largely unaffected.


The lesson here is, again, standards. Since then we have changed the whole procedure for table-deletion requests and enforce it strictly: no matter how urgent the deletion, a 24-hour cooling period is required, as follows.

  • Execute a rename table operation, renaming the table to table-will-drop.

  • Wait 24 hours before executing the actual drop operation.
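The two steps above can be sketched as a statement planner; the function name and exact statement text are illustrative, and in practice the statements would go through the normal change system with the cooling period enforced between them.

```python
COOLING_SECONDS = 24 * 3600  # enforced delay between the two statements

def plan_drop(table):
    """Return the rename-then-drop statements for a table-deletion request."""
    shadow = f"{table}-will-drop"
    return [
        # Step 1: hide the table instead of deleting it, so an accidental
        # request can still be reversed with a single rename.
        f"RENAME TABLE `{table}` TO `{shadow}`;",
        # Step 2: executed only after COOLING_SECONDS have elapsed.
        f"DROP TABLE `{shadow}`;",
    ]
```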

 

3. During Weibo's rapid growth, tables were split by "hashing the index and content, then splitting by the time dimension". Could you discuss the hash part further?

It is actually not that complicated. First we estimate the approximate volume expected within a year, then calculate how many tables to split into, trying to keep each table within about 3 million rows (that is just the hope, of course; reality proves that plans cannot keep up with change). For example, we take the blog id mod 1024 to distribute all blog posts across 1024 tables (the blog id involves our global UUID generator, which I won't expand on here).


Since most content Weibo users generate is tied to time, the time dimension is a strong attribute for us and is almost always present. Suppose a new set of tables is created every month; that means 1024 tables are produced per month. If the database hits a capacity bottleneck, we can solve it along the time dimension, for example by migrating all the 2010 tables to other databases.
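The routing just described fits in a few lines: mod 1024 for the hash dimension, a monthly suffix for the time dimension. The table-naming convention below is invented for illustration.

```python
import datetime

def route(blog_id, created_at):
    """Map a blog id and creation date to one of 1024 monthly tables."""
    bucket = blog_id % 1024              # hash dimension: 1024 buckets
    month = created_at.strftime("%Y%m")  # time dimension: one set per month
    return f"blog_content_{month}_{bucket:04d}"
```

Archiving a whole year then reduces to moving every table whose month prefix falls in that year, with no rehashing.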


4. LinkedIn launched a Databus-like project years ago, and some later open-source projects support synchronizing data from MySQL to HBase and ES. Will Weibo open-source its version?

Heterogeneous data synchronization like this indeed exists at many companies; as far as I know, Alibaba's DRC does the same thing. Our main idea is to rely on MySQL's binlog. As we all know, when MySQL's binlog is set to row format, it records all affected data in the log, which gives us every data change.


We parse the row-format MySQL binlog to read data changes into Databus, then apply the actual business logic: a business .so file is loaded into Databus, which reprocesses each change according to that logic and outputs it to the downstream resource.
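A toy version of that flow: row-format binlog events carry the affected rows, and a per-business handler (playing the role of the loaded .so logic) transforms each change before it reaches the downstream store. The event shape, handler, and table/column names are all invented for illustration; they are not Databus's real interfaces.

```python
downstream = {}  # stands in for Redis or HBase

def feed_count_handler(event):
    # Business logic: only propagate the follower count of user rows.
    if event["table"] == "user" and event["type"] in ("insert", "update"):
        row = event["row"]
        downstream[row["uid"]] = row["followers"]

def apply_binlog(events, handler):
    # Replay parsed binlog events through the business handler in order.
    for event in events:
        handler(event)
```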


Databus is an open-source project; you can search for "MyBus" directly on GitHub.


5. What exactly is the "index" in question 1? What algorithm finds the corresponding content?

The index is not an algorithm for locating content. For example, to store a blog post you must store a unique id to distinguish it, plus the post's attributes, such as who posted it and when; we treat these attributes and the id as the index, and the post body as the actual content. Because the content is relatively large, storing it together with the index would drag down MySQL performance; besides, many queries only need the blog ids, not the actual posts.


After filtering on the index we get the list of ids to return to the user, then fetch the actual content from the content library by that list. The smaller the result set MySQL returns, the higher the performance, hence the optimization effect.


6. What role does NoSQL play?

NoSQL plays an increasingly important role at Weibo. Take rediscounter, our self-developed counting service. Counts were originally stored in MySQL, but since every count is an update set count = count + 1, highly concurrent writes to a single row cause severe MySQL lock contention, and the higher the concurrency, the fiercer the locking.

We observed that once concurrency on a single row exceeds about 500, TPS drops from tens of thousands to hundreds. No matter how MySQL is tuned, it cannot support this business scenario well; Redis can.



Personally, I see NoSQL as a Swiss Army knife: in the right place, it is the best solution. I believe that is also the direction NoSQL will develop in, each kind with its own optimal scenario.


7. Our company will have many smart devices: perhaps more than 20,000 this year, more than doubling within a year. The data collected is in JSON message format, with fields in the JSON identifying different packet types, which seems unsuitable for storing as varchar or longtext strings. For DB storage, can we make full use of MySQL 5.7's native JSON and save just one column, instead of storing extended "wide columns"? Or do IPVs or other databases provide more convenient storage? Before MySQL 5.7 was final, we used MariaDB multi-master to spread the write load of many devices at the DB layer; what guidelines are there for writing data from large numbers of devices?

We are also looking at MySQL 5.7's new features; JSON has a big impact on database design. From your description, you indeed should not use varchar or text fields. But personally, for smart devices I suggest writing directly into HBase if possible, because the data volume will sooner or later become hard for MySQL to support, and MySQL has no inherent advantage there.


8. What does "separating data and index" mean? Are data files and index files stored separately on different machines?

It should be content and index. You can understand it as a vertical split at the business layer. A vertical split necessarily lands in different database instances, which can sit on the same physical machine or on different ones; we usually place them on different physical machines to avoid mutual interference.

 

9. What information is stored in MySQL, and what in NoSQL? Which kinds of NoSQL are used?

This is a question of tiered storage. Currently MySQL + Redis + mc is our mainstream setup: mc and Redis absorb hot data and traffic peaks while MySQL lands the data, ensuring the raw data is always traceable. Most requests are answered at the mc or Redis layer, and less than 1% fall through to MySQL.


There are exceptions, of course; for example, the counts stored in rediscounter have no copy kept in MySQL.


Finally, I personally think NoSQL's biggest advantage is development convenience: it is easier for developers to use NoSQL than MySQL. Being a KV structure, all queries are primary-key queries, with no index tuning or table design to worry about, which is an irresistible draw at today's Internet pace.


10. What are the advantages and disadvantages of a globally unique id generator versus auto-generated (auto-increment) ids?

There are many ways to implement a globally unique id generator. We modified Redis to speak the mc protocol, mainly in pursuit of performance. I have also seen MySQL auto-increment ids used as the generator, but MySQL's locking is too heavy and problems appear once the business takes off, so that approach was abandoned.


Another advantage is that a global id generator is not very hard to develop, and you can pack the attributes you want into the uuid. For example, with an id format like "timestamp + business flag + auto-increment sequence number", just having the id tells you the time and the business it belongs to; compared with MySQL's plain, meaningless serial number, you can do much more with it.
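A minimal sketch of such a "timestamp + business flag + sequence" layout. The bit widths here are invented, not Weibo's real format; the point is simply that the parts can be unpacked from the id later.

```python
def make_id(ts_seconds, biz_flag, seq):
    # Layout (illustrative): timestamp | 8-bit business flag | 16-bit sequence.
    return (ts_seconds << 24) | ((biz_flag & 0xFF) << 16) | (seq & 0xFFFF)

def unpack_id(uid):
    # Recover (timestamp, business flag, sequence) from a packed id.
    return uid >> 24, (uid >> 16) & 0xFF, uid & 0xFFFF
```

Because the timestamp occupies the high bits, ids also sort roughly by creation time, which plays well with the time-dimension sharding described earlier.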


Other articles shared by the Weibo technical team

  • Upsync: Weibo open-source dynamic traffic management solution based on Nginx containers

  • A lightweight RPC framework supporting hundreds of billions of calls on Weibo: Motan

  • Docker-based Weibo hybrid cloud platform design and practices

  • Experiences on deploying Weibo's "multiple active standbys in different regions"

  • Practices of Docker container-based hybrid cloud migration on Weibo

  • MySQL optimization and O & M for a single table with 6 billion records

  • Troubleshooting of Weibo's large-scale and high-load system problems


