Fashiolista is an online fashion-sharing site where users can create their own profiles and share both their own fashion items and the ones they come across while browsing the Web. Fashiolista currently has millions of users in more than 100 countries, who share more than 5 million fashion items every day. For a social sharing site, the feed system is at the core of the architecture. Fashiolista's founder and CTO, Thierry Schellenbach, wrote a blog post sharing his experience of building the site's feed system, as follows:
Fashiolista started as a hobby project that we built in our spare time, and we never imagined it would grow into such a large online fashion-sharing site. The earliest version took about two weeks to develop, and its feed system was quite simple. Here we share some of what we learned while scaling our feed system.
For many large startups, such as Pinterest, Instagram, Wanelo, and Fashiolista, the feed is a core component. On Fashiolista, the flat feeds, the aggregated feeds, and the notification system are all powered by the feed system. This article describes the problems we ran into while scaling our feed system and the design decisions you will face when building your own. As more and more applications depend on feeds, understanding how a feed system works becomes critical.
In addition, feedly, the Python library behind Fashiolista's feed system, has been open-sourced.
Introduction to feeds
The problem of scaling feed systems has attracted a lot of attention. The solutions all aim to build a feed page similar to Facebook's news feed, the Twitter stream, or the Fashiolista feed that keeps working under heavy traffic. What these systems have in common is that they show users the activity of the people they follow. Our activity streams are built on the same standard, with entries such as "Thierry added an outfit to his Fashiolista list" or "Tommaso posted a tweet".
There are two strategies for building this feed system:
- Pull: the feed is assembled at read time.
- Push: the feed is computed in advance, at write time.
Most real-time production applications combine the two approaches. The process of pushing an activity to your followers is called fanout.
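The difference between the two strategies can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the in-memory dictionaries stand in for real storage, and every name in it (`follow`, `publish`, `read_feed_push`, `read_feed_pull`) is hypothetical rather than part of feedly.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for real storage.
followers = defaultdict(set)    # producer -> users who follow that producer
following = defaultdict(set)    # user -> producers that user follows
activities = defaultdict(list)  # producer -> that producer's own activities
feeds = defaultdict(list)       # user -> precomputed feed (push model)


def follow(user, producer):
    followers[producer].add(user)
    following[user].add(producer)


def publish(producer, timestamp, text):
    activity = (timestamp, producer, text)
    activities[producer].append(activity)
    # Push (fanout): copy the activity into every follower's feed at write time.
    for user in followers[producer]:
        feeds[user].append(activity)


def read_feed_push(user, limit=10):
    # Push model: the feed already exists, so a read is a single lookup.
    return sorted(feeds[user], reverse=True)[:limit]


def read_feed_pull(user, limit=10):
    # Pull model: collect the activities of everyone the user follows at read time.
    merged = [a for producer in following[user] for a in activities[producer]]
    return sorted(merged, reverse=True)[:limit]
```

Push makes reads cheap and writes expensive (one write per follower), while pull does the opposite, which is why the two are usually combined.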
History and background
Fashiolista's feed system has gone through three major versions. The first was based on a PostgreSQL database, the second used Redis, and the current one uses Cassandra. To help readers understand when and why we moved from one version to the next, I will first give some background.
Phase I: the database
The first version used a simple database query, similar to this:
SELECT * FROM Love WHERE user_id IN (...)
Surprisingly, this system turned out to be quite robust. It worked well when the number of loves (a "love" is like "liking" a piece of clothing) reached one million, and it still had no problems at more than 5 million. We had even bet that it would not hold up at tens of millions of loves, yet it kept working fine. With only minor changes, this simple system took us to millions of users and billions of loves. As the number of users kept growing, however, performance began to fluctuate and some users saw delays of several seconds. After studying many feed system architectures, we developed the first Redis-based version of feedly.
Phase II: Redis and feedly
We stored a feed in Redis for each user. When you love a piece of clothing, that activity is fanned out to all of your followers. We tried a few tricks to reduce memory consumption (described below), and Redis was really easy to set up and maintain. We used Twemproxy to shard across several Redis machines and Sentinel for automatic failover.
Redis was a good solution, but several things forced us to look for an alternative. First, we wanted to support multiple content types, which made falling back to the database harder and increased our storage requirements. Second, the database fallback was getting slower and slower as the business grew. Both problems could only be solved by storing more data in Redis, and the cost of doing so was too high.
Phase III: Cassandra and feedly
After comparing HBase, DynamoDB, and Cassandra 2.0, we chose Cassandra because it has few moving parts, because Instagram uses it, and because it is supported by DataStax. Fashiolista currently uses a pure push approach for the flat feed and a combination of push and pull for the aggregated feed. We keep up to 3,600 activities in each user's feed, which currently takes up 2.12 TB of storage. We have also taken several measures to dampen the load spikes caused by star users (users with very many followers), including priority queues, spare capacity, and auto-scaling.
Feed design
The way Fashiolista's design evolved is fairly representative, and there are several important design questions to consider when building a feed system (especially with feedly).
1. Denormalized vs. normalized
With normalized storage, the feed of the people you follow holds only the ID of each activity. With denormalized storage, the feed holds all of the information about each activity.
Storing only the ID significantly reduces memory consumption, but it means that the database has to be queried again every time the feed is loaded. Which one to choose depends on how often the data would be copied if you denormalized it. For example, there is a big difference between a notification system and a feed system: each action in a notification system is sent to only a few users, whereas each activity in a feed system may be copied to thousands of followers.
The choice also depends on your storage backend. With Redis, memory is the resource that needs special attention. With Cassandra you have plenty of storage space, but it is hard to use with normalized data.
For notification feeds and Cassandra-based feeds, I recommend denormalizing your data. For Redis-based feeds, you need to minimize memory consumption, so keep your data normalized. Both options are easy to implement with feedly.
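To make the trade-off concrete, here is a minimal sketch of the two storage layouts. The activity fields (actor, verb, object) follow the activity streams convention; the `activity_db` dictionary and both reader functions are hypothetical stand-ins, not feedly's API.

```python
# Hypothetical activity table, standing in for the primary database.
activity_db = {
    1: {"actor": "thierry", "verb": "love", "object": "outfit:42"},
    2: {"actor": "tommaso", "verb": "love", "object": "shoes:7"},
}

# Normalized feed: only activity IDs are stored per user (small, but needs a lookup).
normalized_feed = [2, 1]

# Denormalized feed: a full copy of every activity is stored per user.
denormalized_feed = [dict(activity_db[i], id=i) for i in (2, 1)]


def read_normalized(feed, db):
    # Each read goes back to the database to hydrate the IDs.
    return [dict(db[activity_id], id=activity_id) for activity_id in feed]


def read_denormalized(feed):
    # Everything needed is already in the feed; no extra query.
    return list(feed)
```

Both readers return the same data. The normalized layout pays with an extra query per read, while the denormalized layout pays with one full copy of the activity per follower.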
2. Selective fanout based on producers
Adam Silberstein and his colleagues at Yahoo have proposed a way to push to user feeds selectively, and Twitter uses a similar approach. Fanout for star users can put a sudden and enormous load on the system, which means that spare capacity has to be reserved to keep the feeds real-time. The paper recommends reducing the load caused by these star users by fanning out their messages selectively. Twitter uses this method and loads the tweets of star users only when a follower reads the feed, which has improved performance considerably.
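A minimal sketch of this idea follows, assuming a simple follower-count threshold decides who counts as a star. The threshold `FANOUT_LIMIT`, the in-memory dictionaries, and all function names are hypothetical; neither the paper's algorithm nor Twitter's implementation is reproduced here.

```python
from collections import defaultdict

FANOUT_LIMIT = 2  # hypothetical threshold; a real system would use a far larger value

followers = defaultdict(set)        # producer -> users who follow that producer
following = defaultdict(set)        # user -> producers that user follows
own_activities = defaultdict(list)  # producer -> that producer's own activities
pushed_feeds = defaultdict(list)    # user -> activities pushed at write time


def follow(user, producer):
    followers[producer].add(user)
    following[user].add(producer)


def is_star(producer):
    return len(followers[producer]) > FANOUT_LIMIT


def publish(producer, timestamp, text):
    """Store the activity and return the number of feeds written."""
    activity = (timestamp, producer, text)
    own_activities[producer].append(activity)
    if is_star(producer):
        return 0  # no fanout: followers pull this activity at read time
    for user in followers[producer]:
        pushed_feeds[user].append(activity)
    return len(followers[producer])


def read_feed(user, limit=10):
    # Start from the pushed activities, then merge in star producers on read.
    # A set avoids duplicates if a producer became a star after earlier pushes.
    merged = set(pushed_feeds[user])
    for producer in following[user]:
        if is_star(producer):
            merged.update(own_activities[producer])
    return sorted(merged, reverse=True)[:limit]
```

The write cost for a star drops from one write per follower to a single write, at the price of extra work on every read by a follower of that star.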
3. Selective fanout based on consumers
Another option is to fan out only to active users, for example users who have logged in during the past week. We use a modified version of this method: for active users we store the latest 3,600 activities, and for inactive users only 180. Reading beyond those 180 requires going back to the database. This gives inactive users a worse experience, but it effectively reduces memory consumption.
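The trimming rule described above could look roughly like this. The limits (3,600 and 180) and the one-week window come from the text; the function names and the `load_from_database(offset, count)` callable are hypothetical.

```python
from datetime import datetime, timedelta

ACTIVE_WINDOW = timedelta(days=7)  # "logged in during the past week"
ACTIVE_FEED_LENGTH = 3600
INACTIVE_FEED_LENGTH = 180


def max_feed_length(last_login, now):
    is_active = now - last_login <= ACTIVE_WINDOW
    return ACTIVE_FEED_LENGTH if is_active else INACTIVE_FEED_LENGTH


def add_to_feed(feed, activity, last_login, now):
    """Prepend an activity (newest first) and trim the feed to the user's limit."""
    feed.insert(0, activity)
    del feed[max_feed_length(last_login, now):]
    return feed


def read_page(feed, offset, limit, load_from_database):
    """Serve a page from the stored feed; go back to the database past its end.

    load_from_database is a hypothetical callable taking (offset, count).
    """
    page = feed[offset:offset + limit]
    if len(page) < limit:
        page += load_from_database(offset + len(page), limit - len(page))
    return page
```

An inactive user who scrolls past item 180 triggers the slower database path, which is the degraded experience the text accepts in exchange for lower memory use.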
Silberstein and his co-authors argue that selective push works best for pairs of producers and consumers where:
- The producer only occasionally produces activities
- The consumer frequently requests the feed
So far, Fashiolista has not needed such a complex system, and I am curious at what scale a business would need this solution.
4. Priorities
An alternative strategy is to give fanout tasks different priorities: high priority for fanout to active users and low priority for fanout to inactive users. Fashiolista keeps a large buffer of capacity for the high-priority tasks so that spikes can be handled at any time. For the low-priority tasks, we rely on auto-scaling and spot instances. In practice, this means that the feeds of inactive users are updated with some delay. Using priorities reduces the load that star users put on the system. It does not solve the fundamental problem, but it significantly reduces the size of the load peaks.
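The prioritization can be illustrated with a single in-process priority queue. This is a simplified sketch: the names are hypothetical, and a real deployment would more likely use separate task queues with their own worker pools rather than one heap.

```python
import heapq
import itertools

HIGH, LOW = 0, 1              # lower number is served first
_counter = itertools.count()  # tie-breaker that keeps FIFO order within a priority
queue = []


def enqueue_fanout(activity, follower, follower_is_active):
    priority = HIGH if follower_is_active else LOW
    heapq.heappush(queue, (priority, next(_counter), follower, activity))


def process(max_tasks):
    """Deliver up to max_tasks fanout tasks, active followers first."""
    delivered = []
    for _ in range(min(max_tasks, len(queue))):
        _, _, follower, activity = heapq.heappop(queue)
        delivered.append((follower, activity))
    return delivered
```

When capacity is limited (a small `max_tasks`), inactive followers simply wait longer, which matches the delay described above.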
5. Redis vs. Cassandra
Both Fashiolista and Instagram started with Redis and later moved to Cassandra. I recommend starting with Redis because it is easier to set up and maintain.
Redis has its limits, however. All data has to be kept in RAM, which is expensive. In addition, Redis does not support sharding out of the box, which means that you have to shard across nodes yourself (Twemproxy is a good choice). That part is easy, but handling the data when you add or remove nodes is complicated. You can of course work around these limits by using Redis as a cache and falling back to the database. But once the database fallback becomes too expensive, I suggest moving from Redis to Cassandra.
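The "Redis as a cache, then fall back to the database" pattern can be sketched as follows. A plain dictionary stands in for Redis so that the example stays self-contained; `CachedFeed`, `CACHE_LENGTH`, and the `load_from_database(user, offset, count)` callable are all hypothetical.

```python
CACHE_LENGTH = 3  # hypothetical; only this many recent IDs are kept in the cache


class CachedFeed:
    """Keeps the newest activity IDs in a cache and falls back to the database."""

    def __init__(self, load_from_database):
        self.cache = {}  # user -> newest activity IDs, newest first (stand-in for Redis)
        self.load_from_database = load_from_database
        self.database_reads = 0

    def add(self, user, activity_id):
        feed = self.cache.setdefault(user, [])
        feed.insert(0, activity_id)
        del feed[CACHE_LENGTH:]  # bound the memory used per user

    def read(self, user, offset, limit):
        page = self.cache.get(user, [])[offset:offset + limit]
        missing = limit - len(page)
        if missing > 0:
            # The cache ran out: fetch the remainder from the database.
            self.database_reads += 1
            page = page + self.load_from_database(user, offset + len(page), missing)
        return page
```

The `database_reads` counter makes the cost visible: the deeper users scroll and the more feeds exceed the cache, the more reads hit the database, which is the point at which the text suggests switching to Cassandra.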
The Python ecosystem for Cassandra is changing rapidly. cqlengine and python-driver are great projects, but they take some time to learn.
Conclusion
When building your own feed solution, there are a number of factors to weigh: Which storage backend do you choose? How do you deal with the load peaks caused by star users? To what extent do you denormalize your data? I hope this article has given you some suggestions.
Feedly does not make any of these choices for you. It is just a framework for building a feed system, and you decide on the internal technical details yourself. See the feedly introduction for an overview, or follow the tutorial to build a Pinterest-like application.
Please note that you only need to solve this problem once your database holds millions of users. Fashiolista's simple database solution took us to millions of users and billions of loves.
For more on the design of feed systems, I strongly recommend these articles:
- Yahoo! Paper
- Twitter Redis based, with fallback
- Cassandra at Instagram
- Etsy Feed Scaling
- Facebook history
- Django project, with good naming conventions. (But database only)
- http://activitystrea.ms/specs/atom/1.0/ (actor, verb, object, target)
- Quora post on best practises
- Quora scaling a social network feed
- Redis Ruby Example
- FriendFeed approach
- Thoonk Setup
- Twitter's approach
Original article: Design Decisions for Scaling Your High Traffic Feeds (compiled by Zhou Xiaolu, reviewed by Zhonghao)
Scaling the feed system of a fashion-sharing website with millions of users: lessons from practice