Push and Pull Modes of the Weibo Feed System: An Architecture Discussion of the Time-Partitioned Pull Mode
[Author: Sun Li; link: http://www.cnblogs.com/sunli/; updated: 2010-08-24]
SNS and Weibo (microblogging) systems are built around a feed system (each microblog post or new item in an SNS is called a feed), whether it is Twitter.com, Sina Weibo in China, or Renren. At technical community conferences, these companies have shared their feed architectures, namely the push and pull modes (timyang also shared Sina Weibo's model last time). Below we discuss the push and pull modes of the Weibo feed and propose a new time-partitioned pull mode.
As we all know, when you post on Weibo, all of your followers receive the post within a certain period of time. It is a bit like sending a group email: all CC recipients receive it within a certain period of time. At this point you may feel there is nothing difficult about this. Let's take a look at the following:
Figure 1: Yao Chen on Sina Weibo
Figure 2: Feng Dahui on Twitter
Yao Chen has 2,594,751 followers on Sina Weibo. If she publishes a post, all 2,594,751 followers must receive it within a certain period of time. If Feng Dahui publishes a post on Twitter, 19,868 followers must receive it.
Conversely, Yao Chen needs to receive all updates from the 545 accounts she follows, while Feng Dahui needs to receive all updates from the 2,525 accounts he follows. Now do you start to see the challenge?
Let's take a look at the general overall structure of Weibo:
Figure 3: Overall Weibo Structure
The figure shows the overall data flow of Weibo. It is worth understanding this overall structure first, setting aside for the moment how posts reach followers through push or pull. Next, let's look at the push mode:
Figure 4: Push mode structure
In push mode, each post is pushed to all of the author's followers (fans). For Yao Chen, for example, a post must be written to the feeds tables of 2,594,751 users. Of course, the feeds table can be sharded well, and what it stores is only a few numeric fields, so the storage space may not be very large. When a user queries the feeds of everyone they follow, the query is fast and performance is very high, but the push volume is huge: each time Yao Chen publishes a post, more than 2 million rows are generated. Imagine how much data a Weibo system with a large number of users would generate in push mode.
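The write amplification described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the class `PushFeedStore` and its method names are hypothetical, and plain dictionaries stand in for the sharded per-user feeds tables.

```python
from collections import defaultdict


class PushFeedStore:
    """Toy model of push mode (fan-out on write). All names are hypothetical."""

    def __init__(self):
        self.followers = defaultdict(set)  # author uid -> set of follower uids
        self.inbox = defaultdict(list)     # follower uid -> post ids, oldest first
        self.next_post_id = 1

    def follow(self, follower_uid, author_uid):
        self.followers[author_uid].add(follower_uid)

    def publish(self, author_uid):
        """Write one inbox row per follower; return (post_id, rows_written)."""
        post_id = self.next_post_id
        self.next_post_id += 1
        targets = self.followers[author_uid]
        for follower_uid in targets:
            self.inbox[follower_uid].append(post_id)
        return post_id, len(targets)

    def home_timeline(self, uid, limit=20):
        """Reading is a single lookup in the reader's own inbox, newest first."""
        return self.inbox[uid][-limit:][::-1]
```

The number of rows written by `publish` equals the author's follower count, so one post by an author with 2,594,751 followers costs 2,594,751 writes, while `home_timeline` stays cheap no matter how many accounts the reader follows.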
The following figure shows the pull mode:
Figure 5: Pull Mode
In pull mode, posting only stores one row in the feeds table (the feeds table can be a temporary table that keeps only the data within an acceptable recent range). Each time a user requests their feed, the feeds table is queried. For example, when Yao Chen opens her Weibo homepage, the following query is run (fetching the latest N records) and the result is cached in memcached:
Select ID from feeds where uid in (following uid list) order by id desc limit N
The cached entry has the form:
Uidlist => {data: ID list, timeline: time of the newest record returned by the last query}
On the next refresh, only newer records are queried:
Select ID from feeds where uid in (following uid list) and timeline > (last timeline stored in memcached) order by id desc limit N
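The two queries and the cache entry above can be put together in a minimal sketch. The assumptions here are mine, not the article's: SQLite stands in for the real database, a plain dict stands in for memcached, the cache is keyed by the reader's uid, and the helper names `make_db`, `post`, and `pull_feed` are hypothetical. For brevity the first query reuses the refresh query with a timeline of -1.

```python
import sqlite3


def make_db():
    """Create an in-memory feeds table (columns are assumed from the queries)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE feeds (id INTEGER PRIMARY KEY, uid INTEGER, timeline INTEGER)"
    )
    db.execute("CREATE INDEX idx_feeds_uid ON feeds (uid, timeline)")
    return db


def post(db, uid, timeline):
    """Pull mode: posting writes exactly one row, however many followers exist."""
    cur = db.execute(
        "INSERT INTO feeds (uid, timeline) VALUES (?, ?)", (uid, timeline)
    )
    return cur.lastrowid


def pull_feed(db, cache, reader_uid, following, limit=20):
    """Return the newest post ids for a reader, querying only rows newer than the cache."""
    if not following:
        return []
    entry = cache.get(reader_uid)
    since = entry["timeline"] if entry else -1
    marks = ", ".join("?" * len(following))
    rows = db.execute(
        f"SELECT id, timeline FROM feeds WHERE uid IN ({marks}) "
        "AND timeline > ? ORDER BY id DESC LIMIT ?",
        (*following, since, limit),
    ).fetchall()
    new_ids = [row[0] for row in rows]
    old_ids = entry["data"] if entry else []
    ids = (new_ids + old_ids)[:limit]
    latest = max((row[1] for row in rows), default=since)
    cache[reader_uid] = {"data": ids, "timeline": latest}
    return ids
```

A first call fills the cache from the whole table; later calls add only rows whose timeline is newer than the cached one. The sketch ignores posts that share the exact cached timestamp, which a real system would need to handle.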
This mode is also relatively simple to implement; the main thing that needs care at query time is the cache structure. However, the feeds table comes under heavy pressure. However you look at it, the feeds table still has to keep the last ten days to half a month of data, which is a relatively large volume for a large system, and if a user follows many accounts the database pressure is very high. Online users and clients also typically poll on a schedule, which further increases the query load.
Next, we make some improvements and optimizations to the pull mode.
Figure 6: Pull mode, improved (time-partitioned pull mode)
The improvement to the pull mode is mainly in how feeds are stored: they are partitioned by time. The storage can be divided into a most recent period (such as the last hour), a recent period, and a relatively long period. Let's look at the query process. For example, if Yao Chen logs in to the Weibo homepage and there is no data in the cache, we query the long-period feeds table and put the result into the cache. On the next query, we look up the timeline of the cached data. If that timeline is within the last hour, only the last-hour feeds table needs to be queried. The last-hour feeds table is much smaller than the single feeds table of the basic pull mode (Figure 5), so the query can be several orders of magnitude faster.
The key point of the improved mode is the time-partitioned storage of feeds: the timeline of the last query determines which table the next query falls into. Online users, clients that poll frequently, and users who log in often all fall within the range of the most recent feeds table, so their queries are efficient. Users who log in only once every ten days to half a month have to query the large long-period feeds table, but after that one query their subsequent queries fall into the most recent partition, so overall efficiency is very high.
The time partitions must be split sensibly according to data volume and user access patterns. If the data volume is very large, more partitions can be created.
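The routing step, choosing a feeds table from the cached timeline, might look like the sketch below. The table names, the one-day middle tier, and the 15-day upper bound are illustrative assumptions; the article names only "the last hour" explicitly. The sketch also assumes each table is cumulative, so the longer tables contain everything the shorter ones do.

```python
HOUR = 3600
DAY = 24 * HOUR

# (hypothetical table name, how far back it reaches in seconds), smallest first.
PARTITIONS = [
    ("feeds_last_hour", HOUR),
    ("feeds_last_day", DAY),
    ("feeds_last_15_days", 15 * DAY),
]


def pick_partition(last_timeline, now):
    """Return the smallest table that covers everything newer than last_timeline."""
    if last_timeline is None:
        # Cold cache: fall back to the long-period table.
        return PARTITIONS[-1][0]
    age = now - last_timeline
    for name, span in PARTITIONS:
        if age <= span:
            return name
    # Older than every partition: the largest table is the best available.
    return PARTITIONS[-1][0]


def build_query(last_timeline, now, following, limit=20):
    """Build the pull query and its parameters against the chosen partition."""
    table = pick_partition(last_timeline, now)
    marks = ", ".join("?" * len(following))
    sql = f"SELECT id FROM {table} WHERE uid IN ({marks})"
    params = list(following)
    if last_timeline is not None:
        sql += " AND timeline > ?"
        params.append(last_timeline)
    sql += " ORDER BY id DESC LIMIT ?"
    params.append(limit)
    return sql, params
```

A client that polled ten minutes ago is routed to `feeds_last_hour`, while a user returning after a week hits `feeds_last_15_days` once and is then routed to the small table on later refreshes. Adding more partitions for larger data volumes only means adding entries to `PARTITIONS`.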
The push mode and the pull mode each have their own characteristics. The time-partitioned pull mode makes up for the shortcomings of the basic pull mode shown in Figure 5 and is a low-cost solution. Of course, the time-partitioned pull mode can also be combined with the push mode, depending on specific characteristics, to improve system performance.
Postscript: The purpose of this article is to introduce the time-partitioned pull mode. I am not clear about the details of the push and pull modes used by Sina Weibo and Twitter.