Citation: https://www.zhihu.com/question/19645686/answer/85075806?from=profile_answer_card
How are the feed streams in Weibo and Zhihu implemented? On Weibo, each user follows hundreds or thousands of people, and on Zhihu users also follow topics and specific questions. How can these feeds be produced while keeping the query load on the system low, especially since the feed results need a specific sort order? I do not know how this is optimized. Could someone give a few pointers?

In short, a feed system does two things: it generates feeds and it updates feeds. Generating feeds means that when someone you already follow performs some activity, that activity has to be added to your feed so that you receive it. Updating feeds covers more cases. One is that your follow list changes: for example, when you newly follow someone, their activities have to be added to your existing feed, and unfollowing is handled in a similar way. The other is that someone you follow changes an earlier activity: for example, one of them unfollows a question they had followed.
Let's start with feed generation. Suppose user A does something, such as following a question. We first look up all of user A's followers and then push the activity to the followers who should receive it. Think of each user's feed as a sequential list: a push simply appends the activity to the end of each follower's list. Simple, isn't it? :)
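The push described above can be sketched in a few lines. This is only a minimal in-memory illustration; the names `followers`, `feeds`, `follow`, and `publish` are made up for the example and do not come from any of the systems discussed.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the follower graph and per-user feeds.
followers = defaultdict(set)   # user -> set of users who follow them
feeds = defaultdict(list)      # user -> list of activities, oldest first


def follow(follower, followee):
    followers[followee].add(follower)


def publish(actor, activity):
    """Fan out one activity by appending it to every follower's feed."""
    for follower in followers[actor]:
        feeds[follower].append((actor, activity))


follow("bob", "alice")
follow("carol", "alice")
publish("alice", "followed question 42")
```

After the call, both `feeds["bob"]` and `feeds["carol"]` end with the new activity, while the actor's own feed is untouched.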
Next, feed updates. The first case and the second case are really the same: we need to update feeds for the activities involved. For example, when you newly follow someone, we take that person's activity history and insert it into your feed in chronological order. This operation is more expensive, so a suitable data structure is needed to get the best performance; currently it is O(log(N)).
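A rough sketch of the follow and unfollow updates, assuming feed entries are `(timestamp, actor, action)` tuples kept in chronological order. The helper names and sample data are invented for illustration, and `heapq.merge` is a stand-in for whatever sorted structure the real system uses.

```python
import heapq


def add_followee_history(feed, history):
    """Merge a new followee's history into a feed; both are sorted by timestamp."""
    return list(heapq.merge(feed, history))


def remove_followee(feed, followee):
    """Drop every entry produced by a user who was just unfollowed."""
    return [entry for entry in feed if entry[1] != followee]


# Entries are (timestamp, actor, action) tuples; the values are made up.
feed = [(1, "alice", "asked"), (4, "alice", "answered")]
history = [(2, "dave", "upvoted"), (3, "dave", "asked")]

feed = add_followee_history(feed, history)   # dave's history is interleaved by time
feed = remove_followee(feed, "alice")        # only dave's entries remain
```

Note that this merge is linear in the combined length of the two lists. The O(log(N)) figure in the answer presumably refers to inserting a single entry into a sorted structure such as a skip list or balanced tree.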
Of course, the real situation is not this simple; there is a lot of other logic and many details to watch. For example, the feed has to deduplicate when many people you follow do the same thing (you can turn the ordered list above into an ordered set), the time of a piece of content is taken from the most recent activity on it, and many activities by one person have to be merged. For performance, the storage behind the feeds you see uses memory; in addition, all activities have to be persisted, otherwise we would not be able to update feeds :)
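The deduplication rule can be illustrated as follows: one entry per piece of content, ranked by the latest activity on it. This is a sketch under assumed data shapes (`(timestamp, actor, content_id)` events); the function name `collapse` is hypothetical.

```python
def collapse(events):
    """Keep one feed entry per content item, ranked by its latest activity."""
    latest = {}   # content_id -> (latest timestamp, actors in arrival order)
    for ts, actor, content_id in events:
        prev_ts, actors = latest.get(content_id, (ts, []))
        if actor not in actors:
            actors.append(actor)
        latest[content_id] = (max(prev_ts, ts), actors)
    # Newest content first, mirroring an ordered set scored by timestamp.
    return sorted(
        ((ts, cid, actors) for cid, (ts, actors) in latest.items()),
        reverse=True,
    )


# Two different users act on q1; it should appear once, at the later time.
events = [(10, "alice", "q1"), (12, "bob", "q2"), (15, "carol", "q1")]
entries = collapse(events)
```

Here `q1` appears once, carries both actors, and sorts ahead of `q2` because its most recent activity is newer.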
Now for the technical challenges and related technologies. Leaving aside our own current technical decisions and stack, I will share the engineering decisions and optimizations of a few teams abroad.
First, Strava. They use Kafka, a distributed publish-subscribe messaging system, for publishing events, and Storm, a distributed real-time computation platform, which subscribes to the Kafka events and does the corresponding processing. They make one optimization here: an event is not pushed to all followers, only to active users (how to decide that a user is active depends on the actual situation and can be tuned against the data itself).
Next, Instagram. The read-to-write ratio of their product reaches 100:1, which is in fact typical of most Internet products, and this is why push is the better fit: a push may cost a bit more, but pushes (that is, writes) happen far less often than reads. Because some well-known accounts have very many followers, up to hundreds of millions, the push is executed asynchronously so that it completes reliably. That requires a task scheduler and a message queue. For task scheduling they chose Celery. For the message queue: Redis relies on subscribers polling, has no replication, and its heavy dependence on memory is a drawback, so it was not suitable; Beanstalk was good in other respects but still did not support replication, so it was dropped; in the end they chose RabbitMQ, which is fast, efficient, supports replication, and works very well with Celery.
Then Pinterest, whose focus is building a smart feed: the feed includes popular and recommended content and is sorted by a specific algorithm. When an event occurs, it passes through a series of stages before it reaches the user's feed. It is first handled by the smart feed worker, which receives the event and scores it for a specific user. The scored events are then inserted into sorted feed pools, with different types of events going into their own pools; at present they implement this priority queue using HBase's key-based ordering. Next the smart feed content generator takes over: it takes items out of several pools and may even reject some. Finally, the user-facing smart feed service combines the old feed with the newly generated one to produce the home feed the user sees.
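To make the worker, pool, and generator stages concrete, here is a toy version. It uses Python's `heapq` as the priority queue instead of HBase key ordering, and every name, score, and threshold is an assumption for illustration rather than Pinterest's actual design.

```python
import heapq
from collections import defaultdict

pools = defaultdict(list)  # event type -> heap of (-score, event_id)


def insert_event(event_type, event_id, score):
    """Worker step: file a scored event into the pool for its type."""
    heapq.heappush(pools[event_type], (-score, event_id))


def generate(per_pool, min_score):
    """Generator step: take the best few events per pool, rejecting low scores."""
    picked = []
    for heap in pools.values():
        for _ in range(min(per_pool, len(heap))):
            neg_score, event_id = heapq.heappop(heap)
            if -neg_score >= min_score:
                picked.append((-neg_score, event_id))
    # Highest score first across all pools.
    return [event_id for _, event_id in sorted(picked, reverse=True)]


insert_event("followed", "pin-1", 0.9)
insert_event("followed", "pin-2", 0.2)
insert_event("recommended", "pin-3", 0.7)
new_items = generate(per_pool=2, min_score=0.5)
```

The low-scoring `pin-2` is rejected, and the remaining items come out ordered by score. A serving layer would then combine `new_items` with the feed the user already had.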
Finally, briefly, Facebook, where the number of users is higher still and every user has a unique, dynamically generated home page. Many teams keep feeds in a key-value store such as Redis, keyed by user ID, but remote procedure calls over a network connection are too slow to meet real-time requirements, so the Facebook team started using an embedded database. They also open-sourced their own, RocksDB.
These are all mentioned in their technical blog and are linked here:
Strava Engineering
How Instagram Feeds Work: Celery and RabbitMQ
Making Pinterest
https://www.facebook.com/notes/10151822347683920/
Just as movie effects serve the plot, technology serves the product. Different business scenarios call for different technologies. As the product is adjusted and the business grows, the technology evolves and is adjusted with it. Different problems need different technical solutions, and feed generation is no exception; when necessary we will adjust these solutions too, but the goal stays the same: fast and stable.
If anything here is wrong, I hope the experts will point it out. If you have a better approach or suggestion, you are welcome to discuss it.

Edited on 2015-12-31

Zhihu user (focus on network security, focus on PHP technology):

I was just thinking of writing an article about feed mechanisms, so I might as well answer this.
The answer above has already described the general picture very clearly. I would like to share the feed implementation we currently run online; it has been in production for half a year. In theory, millions of users are no problem.
To save everyone's mobile data, this whole answer has no pictures (in fact, I could not be bothered to draw any). I hope you will send me the money you save on data as a red envelope. Applause.
First, set the database aside. I assume you know how to design it, but querying it is bound to be troublesome, so I use Redis as a redundant layer. Before starting, you need to understand the push and pull modes. The two articles mentioned in another answer, "Sina Weibo architecture and feed architecture analysis" (paper0023, Sina Blog) and "Analysis of the push mode, pull mode, and time-partitioned pull mode architectures of the micro-blog feed system", are enough to understand them. Dear readers, if you do not know the push and pull modes, please read those articles first and then come back to my answer; my answer only elaborates on the implementation. Thank you.
So now I will talk about how to do push-pull with Redis while using as little memory as possible and being as fast as possible (of course, this is only my own understanding of saving memory and improving speed; if readers have other views, I have two requests: one, go easy on me; two, let me finish before you criticize).
1. Implementation
First, to solve publishing and receiving, there are generally the following approaches:
1. Push mode
What is push mode? Suppose user A follows user B. Each time user B posts an update, the backend iterates over user B's fans and pushes the update into each fan's feed.
2. Pull mode
In contrast to push mode, in pull mode each time the user refreshes the first page of the feed, the system iterates over the people the user follows and pulls back their latest updates.
However, push mode and pull mode share a problem: if the number of followed users or the number of fans is too large, the traversal takes too long. How to solve it? This is where the third mode comes in: push-pull mode.
3. Push-Pull Mode
This is a compromise: push to online users, pull for offline users. Even for an account with millions or tens of millions of fans, the number who are online at the moment you publish is at most tens or hundreds of thousands, and such big-V accounts are rare anyway. So push only to the online fans, and let offline fans pull the updates when they come online. Whichever mode is used, each user maintains something like an Outbox and an Inbox, holding their own updates and their feed respectively (see below), to complete the push and pull.
What we use here is the push-pull mode. User A follows user B; when user B publishes an update, it is pushed into user A's feed. This is implemented with Redis zsets. The score is the time (remember to use a millisecond timestamp; with second-level precision, once the data reaches a certain volume, some entries can be missed on read, for example when the timestamp is used as the paging cursor), and the value is the ID of the update. Why the ID? The reason is simple: the content of an update can be cached separately, so everything in Redis passes IDs around and an edit to the content only has to change one place; the content itself can be stored in a hash. Each user maintains one zset for the updates they have published and one zset for their feed, with an expiration time of 3 to 7 days depending on the situation. Why there is an expiration time is explained later.
Next, maintain a global list of online users. How to design this is up to you. Because a user may leave the app in the background while the server already considers them offline, it is best to treat anyone active within the last 1 to 3 hours, or offline for no more than 3 hours, as online. Again, this depends on the situation.
Then, when a user posts an update, the backend does the following:
Online push: iterate over the online fans and add the update's ID to each fan's feed.
Offline pull: when an offline user opens the app, the app requests a common initialization endpoint that mainly does statistics and other setup. There we also start an asynchronous thread that updates the user's feed, so the user does not wait too long when entering the app; after all, some users certainly follow thousands of people (in fact, traversals below the order of ten thousand are all very fast). The pull takes the timestamp of the last entry in the user's own feed, iterates over the update lists of the people the user follows, and pulls back every ID newer than that timestamp. After entering the app, the user refreshes to see the latest updates.
Also: if you need a hint showing the number of new feed messages, increment a counter on every push and pull, and reset it to zero when the feed is refreshed.
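The online push and offline pull steps above can be sketched together. This is an in-memory approximation, not the author's Redis code: sorted Python lists of `(timestamp_ms, post_id)` stand in for the two zsets, and the 3-hour window, dictionary names, and function names are assumptions made for the example.

```python
import bisect
from collections import defaultdict

ONLINE_WINDOW_MS = 3 * 60 * 60 * 1000  # treat users seen within 3 hours as online

following = defaultdict(set)   # user -> users they follow
fans = defaultdict(set)        # user -> users who follow them
outbox = defaultdict(list)     # user -> sorted [(timestamp_ms, post_id)] they published
inbox = defaultdict(list)      # user -> sorted [(timestamp_ms, post_id)] feed
last_seen = {}                 # user -> timestamp_ms of their last request


def is_online(user, now_ms):
    return now_ms - last_seen.get(user, float("-inf")) <= ONLINE_WINDOW_MS


def publish(author, post_id, now_ms):
    """Online push: write to the author's outbox and to online fans' inboxes."""
    outbox[author].append((now_ms, post_id))
    for fan in fans[author]:
        if is_online(fan, now_ms):
            bisect.insort(inbox[fan], (now_ms, post_id))


def pull_on_open(user, now_ms):
    """Offline pull: fetch followees' posts newer than the newest inbox entry."""
    newest = inbox[user][-1][0] if inbox[user] else -1
    for followee in following[user]:
        for item in outbox[followee]:
            if item[0] > newest:
                bisect.insort(inbox[user], item)
    last_seen[user] = now_ms   # the user now counts as online


fans["bob"] = {"alice", "carol"}
following["alice"] = {"bob"}
following["carol"] = {"bob"}
last_seen["alice"] = 1_000              # alice is online; carol has never been seen
publish("bob", "post-1", now_ms=2_000)  # pushed to alice only
pull_on_open("carol", now_ms=3_000)     # carol catches up when she opens the app
publish("bob", "post-2", now_ms=4_000)  # both are online now
```

Only IDs travel through the inboxes; the post content would live elsewhere, as the answer suggests. In Redis, the pull step would presumably map to a range query by score on each followee's zset.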
At this point the basic problem is solved, but another one appears: user feeds grow too long, so how do we keep memory use down?
Here is what I do. First, a user's feed is capped at 500 entries, which in our app is 50 pages; anything beyond that is read from the database. Deep paging is really a pseudo-requirement and a performance drain. Only someone using the app for the first time scrolls all the way to the end, and how many updates can a first-time user have? Users who have used the app more than once generally reach the place where they stopped last time after a few pages. So for someone who follows a normal number of people, 500 entries are enough content to consume, even to the point of fatigue. There may be users whose feeds receive a great many updates every day, but needless to say these are advertisers who follow a crowd of people hoping to be followed back; such people do not consume content, and 50 pages would tire them out anyway. Of course, that does not mean giving up on them: when the feed runs out there is still the database. Scroll if you like; go ahead and page past page 50.
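A small sketch of the cap, assuming 10 entries per page (500 entries over 50 pages). `load_from_database` is a hypothetical placeholder for the slower database query, and a plain list stands in for the Redis zset.

```python
FEED_CAP = 500   # entries kept in memory: 50 pages of 10 in the answer's app
PAGE_SIZE = 10


def push_capped(feed, post_id):
    """Add the newest post at the front and drop anything past the cap."""
    feed.insert(0, post_id)
    del feed[FEED_CAP:]


def load_from_database(offset, limit):
    """Hypothetical placeholder for the slower database query."""
    return [f"db-{i}" for i in range(offset, offset + limit)]


def read_page(feed, page):
    """Serve pages from memory; anything past the cap falls through to the database."""
    start = page * PAGE_SIZE
    if start < len(feed):
        return feed[start:start + PAGE_SIZE]
    return load_from_database(start, PAGE_SIZE)


feed = []
for i in range(600):
    push_capped(feed, f"post-{i}")

first_page = read_page(feed, 0)    # served from memory
deep_page = read_page(feed, 50)    # page 51 goes to the database
```

With a Redis sorted set, the trimming step could presumably be done with `ZREMRANGEBYRANK` after each insert.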
There is one more problem. Every user maintains their own update queue and feed queue; with millions of users the memory used is certainly not small. When is it appropriate to release that memory?
This brings us back to the earlier question: why give the feed key an expiration time, and why 3 to 7 days?
The reasons are as follows:
One, a user who has not opened the app for 3 to 7 days has probably lost interest in it; the chance they open it again is small, or they have already uninstalled it, so keeping the feed has no value.
Two, when a user has not logged in for 3 to 7 days, the people they follow have posted a lot in the meantime, so there is a lot of data that has not been pulled into the feed. Traversing to pull it incrementally would move a large amount of data, so it is better to simply rebuild the whole feed, or to pull the data produced since the user's last login time.
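The expiry behaviour could look roughly like this. It is an in-memory imitation of a key with a time-to-live, not Redis itself; the class and function names are invented, and renewing the expiry on every visit is an assumption the answer does not state.

```python
TTL_SECONDS = 7 * 24 * 60 * 60   # upper end of the 3-7 day range


class ExpiringFeeds:
    """In-memory stand-in for feed keys that expire after a period of inactivity."""

    def __init__(self):
        self._store = {}   # user -> (expires_at, feed list)

    def get(self, user, now):
        entry = self._store.get(user)
        if entry is None or entry[0] <= now:
            self._store.pop(user, None)   # expired: free the memory
            return None
        return entry[1]

    def put(self, user, feed, now):
        self._store[user] = (now + TTL_SECONDS, feed)


def open_app(feeds, user, now, rebuild):
    """Reuse a live feed, or rebuild it from scratch once the key has expired."""
    feed = feeds.get(user, now)
    if feed is None:
        feed = rebuild(user)
    feeds.put(user, feed, now)   # assumption: every visit renews the expiry
    return feed


feeds = ExpiringFeeds()
feeds.put("alice", ["post-1"], now=0)
fresh = open_app(feeds, "alice", now=100, rebuild=lambda u: ["rebuilt"])
stale = open_app(feeds, "alice", now=100 + TTL_SECONDS + 1, rebuild=lambda u: ["rebuilt"])
```

The first visit reuses the cached feed; the second comes after the key has expired, so the feed is rebuilt in full instead of being patched incrementally.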
That is almost everything. This covers most business logic and the speed is satisfactory: the model has been live for six months, and a feed request generally completes in 10 to 80 ms.
Well, that's probably it.
Finally, one more thing:
<?php
echo 'PHP is the best language in the world!!!';
?>

Published on 2016-02-06

Zhihu user (Daze engineer):

The physical distribution of data is not considered here; only the business-level design is discussed.
There are three basic ideas:
Idea 1: all updates that are generated go into one shared index, which takes all writes as well as all reads. Each user has their own filter rule, covering who is blocked, who is followed, and which questions are blocked.
When a user reads their feed, the system traverses the index in time order, applying the user's filter, until it has collected enough entries.
Advantages: the clearest business logic and stable performance.
Disadvantage: technically the most difficult.
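Idea 1 can be illustrated with a single shared list and a per-user filter. The sketch assumes events are `(timestamp, actor, topic)` tuples and that reading starts from the newest entry; `FilterRule` and `read_from_index` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class FilterRule:
    following: set = field(default_factory=set)
    blocked_users: set = field(default_factory=set)
    blocked_topics: set = field(default_factory=set)

    def accepts(self, event):
        _, actor, topic = event
        return (actor in self.following
                and actor not in self.blocked_users
                and topic not in self.blocked_topics)


def read_from_index(index, rule, limit):
    """Scan the shared index newest-first until enough entries pass the filter."""
    page = []
    for event in reversed(index):
        if rule.accepts(event):
            page.append(event)
            if len(page) == limit:
                break
    return page


# Shared index of (timestamp, actor, topic), appended in chronological order.
index = [(1, "alice", "python"), (2, "bob", "php"), (3, "alice", "php"), (4, "dave", "go")]
rule = FilterRule(following={"alice", "bob"}, blocked_topics={"php"})
page = read_from_index(index, rule, limit=2)
```

Changing the rule takes effect on the very next read, which is why the business logic is so clean. The hard part, as the answer says, is making one index absorb all reads and writes.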
Idea 2: read diffusion.
Everyone has their own feeds queue. When a user opens their home page, the system reads the feeds queues of other users according to the user's follow list and blocking rules, then combines and sorts them.
Advantages: the simplest to implement, and the most responsive to changes in the follow list.
Disadvantage: the worst performance, and it is consistently poor.
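Read diffusion amounts to a merge at read time. In this sketch each user's queue is assumed to be sorted newest first, and `heapq.merge` does the combining; the data and the function name `read_by_pulling` are made up.

```python
import heapq
from itertools import islice


def read_by_pulling(outboxes, following, blocked, limit):
    """Merge followees' queues at read time; each queue is sorted newest first."""
    sources = [outboxes.get(u, []) for u in following if u not in blocked]
    merged = heapq.merge(*sources, reverse=True)
    return list(islice(merged, limit))


# Each queue holds (timestamp, author, post_id), newest first.
outboxes = {
    "alice": [(5, "alice", "a2"), (1, "alice", "a1")],
    "bob": [(4, "bob", "b2"), (2, "bob", "b1")],
    "eve": [(6, "eve", "e1")],
}
page = read_by_pulling(outboxes, following={"alice", "bob", "eve"}, blocked={"eve"}, limit=3)
```

Every page load touches every followed user's queue, which is where the consistently poor performance comes from.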
Idea 3: write diffusion.
Each person has a queue of the feed items they generate and a queue of feed items waiting to be read. Every update a user generates is pushed into the to-read queues of their followers; before pushing, the blocking rules are checked to decide whether to push it.
Advantage: each user's page opens fastest, with the highest performance.
Disadvantage: slightly less responsive to changes in the follow list, but if the queue is filtered by the rules once more when it is read, this is not a big problem.
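A sketch of write diffusion with the blocking check on write and the second filter on read. All names and data are illustrative assumptions.

```python
from collections import defaultdict, deque

followers = defaultdict(set)   # author -> followers
blocked = defaultdict(set)     # reader -> authors the reader has blocked
to_read = defaultdict(deque)   # reader -> queue of (author, post), newest first


def write_fanout(author, post):
    """Push a post into each follower's queue unless that follower blocks the author."""
    for reader in followers[author]:
        if author not in blocked[reader]:
            to_read[reader].appendleft((author, post))


def read_queue(reader):
    """Re-apply current rules on read, so later unfollows and blocks still take effect."""
    return [(a, p) for a, p in to_read[reader]
            if reader in followers[a] and a not in blocked[reader]]


followers["alice"] = {"bob", "carol", "dave"}
blocked["carol"].add("alice")
write_fanout("alice", "post-1")      # reaches bob and dave, skips carol
followers["alice"].discard("bob")    # bob unfollows after the push
```

Bob's queue still physically holds the post, but `read_queue("bob")` hides it, which is the read-time filter compensating for the lower responsiveness to follow-list changes.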
Personal recommendation: Idea 1.
By my own simple guess,
the Zhihu home page seems to use Idea 2.
Reasons for the guess:
1 - When I unfollow someone and refresh, the page still shows that user's updates, so Idea 3 is very likely.
2 - When I newly follow someone and refresh, the page shows that user's updates; there may be special handling for recent follow actions, with a small amount of read diffusion.
3 - B follows A. Suppose account A generates an update, such as upvoting an answer, and then cancels the upvote. The full activity list of account A then no longer has the upvote record, but B's home page still shows the upvote.
-----------------------
Update at 2014-12-08
Today I happened to notice that some answers by people I follow were not appearing on my home page.
After repeated comparison, I estimate that 4 or 5 updates from about 4 days ago did not appear on my home page.
Judging from this, the home page is probably mostly write diffusion.
Also, when a write fails during diffusion, it seems the update is simply dropped.

Edited on 2016-02-02

Pound, poetry, chief of the World:

Back when Renren was popular and UCHome came out, everyone was studying ways to implement feed streams; by the time of Weibo probably nobody was still discussing it. The various push and pull modes can be found with an online search.
Sorting is another question, and everyone is still exploring it.
(Personal opinion; does not represent my company's position.)

Posted on 2014-09-03

Chu Dongfang (focus on product manager, e-commerce, network marketing, entrepreneurship ...):

Here is a technical talk given by Sina Weibo that is worth a look. Although it has few implementation details, the overall framework is described. It was shared a few years ago, so the architecture has probably changed a lot, but as a reference for a small site it is enough: "Sina Weibo architecture and feed architecture analysis" (Renren architecture, paper0023, Sina Blog).
Another article, by a netizen, analyzes the push and pull modes and the time-partitioned pull mode: "Analysis of the push mode, pull mode, and time-partitioned pull mode architectures of the micro-blog feed system".

Posted on 2015-03-17

Anonymous user:

Think of everyone's feed as a list, generally stored in a mature key-value store: Redis, Cassandra, HBase, or MySQL (sharded, of course). All events generated by users on the front end are sent uniformly to an event bus, such as Kafka or one of the various message queues. Application-layer servers then listen to these events and insert each new item into the feeds of downstream users according to the business logic. Although the business rules can be very complicated, the implementation is a prepend operation.
That is basically the idea. You can call it push mode; few seem to use pull, so push is the industry mainstream. There is really nothing mysterious about it.
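As a toy version of the event bus and prepend described above: `queue.Queue` from the Python standard library stands in for Kafka or another message queue, and the follower graph and names are invented for the example.

```python
import queue
from collections import defaultdict, deque

event_bus = queue.Queue()                  # stand-in for Kafka or another message queue
followers = {"alice": ["bob", "carol"]}    # made-up follower graph
feeds = defaultdict(deque)                 # user -> feed, newest item at the left


def consume_all():
    """Drain the bus and prepend each event to the feeds of the actor's followers."""
    while True:
        try:
            actor, item = event_bus.get_nowait()
        except queue.Empty:
            return
        for user in followers.get(actor, []):
            feeds[user].appendleft(item)


event_bus.put(("alice", "item-1"))
event_bus.put(("alice", "item-2"))
consume_all()
```

After draining the bus, each follower's feed holds the items newest first. A production consumer would run continuously and apply the business rules before the prepend.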