Scaling the Feed System of a Fashion-Sharing Website with Millions of Users

Last Update:2015-02-01 Source: Internet

Author: User

Tags cassandra


Fashiolista is an online fashion-sharing site where users can create their own profiles and share fashion items, both their own and those they find while browsing the web. Fashiolista currently has millions of users from more than 100 countries, who share more than 5 million fashion items a day. For a social sharing site, the feed system sits at the core of the architecture. Thierry Schellenbach, Fashiolista's founder and CTO, wrote a blog post sharing his experience of building the site's feed system, as follows:

Fashiolista started as a side project that we built in our spare time, and we never imagined it would grow into such a large online fashion-sharing site. The first version was developed in about two weeks, and its feed system was quite simple. Here we share some of what we learned while scaling our feed system.

For many large startups, such as Pinterest, Instagram, Wanelo, and Fashiolista, the feed is a core component. On Fashiolista, the flat feeds, aggregated feeds, and notification system are all powered by the feed system. This article describes the problems we ran into while scaling the feed system and the design decisions you will face in your own scenario. As more and more applications depend on feeds, understanding how a feed system works becomes critical.

In addition, Feedly, the Python library behind Fashiolista's feed system, has been open sourced.

Feed Introduction

Scaling feed systems has attracted a lot of attention. The problem to solve is how to build, under heavy traffic, a feed page like Facebook's news feed, the Twitter stream, or Fashiolista's feed. What these systems have in common is that they show users the activity of the people they follow. We build our activity streams on this basis, with entries such as "Thierry added an outfit to his list on Fashiolista" or "Tommaso posted a tweet".

There are two strategies for building this feed system:

Pull: the feed is assembled at read time.
Push: the feed is precomputed at write time.

Most real-time production applications use a combination of the two. The process of pushing an activity to your followers is called fanout.
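
To make the two strategies concrete, here is a minimal in-memory sketch. It is not how Fashiolista or Feedly implement feeds; the dictionaries and function names are assumptions made up for illustration. Both read paths return the same feed, but pull does its work at read time and push does it at write time.

```python
from collections import defaultdict

# Hypothetical in-memory stores, for illustration only.
followers = defaultdict(set)    # producer -> users who follow them
following = defaultdict(set)    # consumer -> users they follow
activities = defaultdict(list)  # producer -> [(timestamp, text), ...]
feeds = defaultdict(list)       # consumer -> precomputed feed (push model)


def follow(consumer, producer):
    following[consumer].add(producer)
    followers[producer].add(consumer)


def publish(producer, timestamp, text):
    """Store the activity, then fan it out to every follower (push)."""
    activity = (timestamp, text)
    activities[producer].append(activity)
    for consumer in followers[producer]:
        feeds[consumer].append(activity)


def read_feed_pull(consumer, limit=10):
    """Pull: gather activities from everyone the consumer follows at read time."""
    merged = [a for p in following[consumer] for a in activities[p]]
    return sorted(merged, reverse=True)[:limit]


def read_feed_push(consumer, limit=10):
    """Push: the feed was already built at write time; just read it."""
    return sorted(feeds[consumer], reverse=True)[:limit]


follow("alice", "thierry")
follow("alice", "tommaso")
publish("thierry", 1, "Thierry added an item to his list")
publish("tommaso", 2, "Tommaso posted a tweet")
```

Note that in this sketch a new follower does not receive older activities through the push path; a real system would have to backfill them.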

History and background

Fashiolista's feed system has gone through three major redesigns. The first version was based on PostgreSQL, the second used Redis, and the current version uses Cassandra. To help readers understand when and why each version was replaced, I will first give some background.

Phase I--the database

The database query in the first version was simple, similar to this:

SELECT * FROM Love WHERE user_id IN (...)

Surprisingly, this system turned out to be quite robust. It worked well when the number of loves (a love is like "liking" a piece of clothing) reached the millions, and it still had no problems past 5 million. We bet that it would not hold up at tens of millions, but it kept working fine when the loves reached tens of millions. With only minor changes, this simple system carried us to millions of users and billions of loves. Then, as the number of users grew, performance began to fluctuate and some users saw delays of several seconds. After studying many feed system architectures, we built the first Redis-based version of Feedly.
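
The pull-style query above can be tried out with Python's built-in sqlite3 module. This is only a sketch: the table columns, the ordering by id as a stand-in for recency, and the LIMIT are assumptions, and the original system ran on PostgreSQL rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the real Love table is not described in the article.
conn.execute("CREATE TABLE Love (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO Love (user_id, item) VALUES (?, ?)",
    [(1, "red dress"), (2, "blue jeans"), (3, "green hat"), (2, "white shirt")],
)


def feed_for(following_ids, limit=20):
    """Pull-style feed: the newest loves from the users being followed."""
    placeholders = ",".join("?" for _ in following_ids)
    query = (
        f"SELECT id, user_id, item FROM Love "
        f"WHERE user_id IN ({placeholders}) ORDER BY id DESC LIMIT ?"
    )
    return conn.execute(query, [*following_ids, limit]).fetchall()


rows = feed_for([1, 2])  # loves by users 1 and 2, newest first
```

Every feed load runs this query again, which is why the cost grows with the number of people a user follows and the size of the table.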

Phase II--Redis and Feedly

We set up a feed stored in Redis for each user; when you love a piece of clothing, that activity is fanned out to all your followers. We used a few tricks to reduce memory consumption (described below), and Redis really is simple to set up and maintain. We used Twemproxy to shard across several Redis machines and Sentinel for automatic failover.

Redis is a good solution, but several reasons forced us to look for something new. First, we wanted to support multiple content types, which made falling back to database queries more difficult and increased our storage requirements. In addition, as the business grew, the database fallback became slower and slower. These problems could only be solved by storing more data in Redis, but that was too expensive.

Phase III--Cassandra and Feedly

After comparing HBase, DynamoDB, and Cassandra 2.0, we chose Cassandra because it has few moving parts, because Instagram uses it, and because it is supported by DataStax. Fashiolista currently uses a pure push approach for the flat feed, while the aggregated feed uses a combination of push and pull. We keep up to 3,600 activities in each user's feed, which currently takes up 2.12 TB of storage. We also mitigate the load spikes caused by celebrity users in several ways, including priority queues, spare capacity, and auto-scaling.

Feed design

I think the evolution of Fashiolista's design is quite representative. There are several important design questions to consider when building a feed system (especially with Feedly).

1. Denormalized vs. normalized

With normalized storage, the feed of the people you follow holds only the ID of each activity; with denormalized storage, it holds all the information about each activity.

Storing only the ID significantly reduces memory consumption, but it means going back to the database every time the feed is loaded. The choice depends on how often the data is copied when you store it denormalized. For example, there is a big difference between a notification system and a feed system: each action in a notification system is sent to only a few users, whereas each activity in a feed system may be copied to thousands of followers.

The choice also depends on your storage backend. With Redis, memory is the thing to watch. With Cassandra, storage space is plentiful, but it is not easy to use with normalized data.

For notification feeds and Cassandra-based feeds, I recommend denormalizing your data. For Redis-based feeds you want to minimize memory consumption and keep your data normalized. Both options are easy to implement with Feedly.
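
The trade-off can be seen in a small sketch. The activity fields and the `activity_db` dictionary below are assumptions for illustration and are not Feedly's storage format: the normalized feed is tiny but needs a lookup on every read, while the denormalized feed is self-contained but much larger.

```python
import json

# Hypothetical activity store standing in for the main database.
activity_db = {
    101: {"actor": "thierry", "verb": "love", "object": "red dress"},
    102: {"actor": "tommaso", "verb": "love", "object": "blue jeans"},
}

# Normalized feed: only IDs are stored per user; reads need a database lookup.
normalized_feed = [102, 101]

# Denormalized feed: the full activity is copied into every follower's feed.
denormalized_feed = [json.dumps(activity_db[i], sort_keys=True) for i in (102, 101)]


def read_normalized(feed):
    """One extra lookup per item (a real system would batch these)."""
    return [activity_db[activity_id] for activity_id in feed]


def read_denormalized(feed):
    """No lookup needed; everything is already in the feed."""
    return [json.loads(item) for item in feed]


def feed_size_chars(feed):
    """Rough size of the stored feed entries, in characters."""
    return sum(len(str(item)) for item in feed)
```

Multiplying the per-feed difference by thousands of followers per activity gives a feel for why memory matters so much with Redis.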

2. Selective fanout based on producers

Adam Silberstein and his colleagues at Yahoo have proposed a way to push user feeds selectively, and Twitter uses a similar approach. Fanout for celebrity users can put a sudden, enormous load on the system, which means extra capacity must be kept in reserve to keep the feeds real-time. The paper recommends reducing the load caused by these users by fanning out selectively. Twitter uses this approach and loads the tweets of celebrity users when a user reads the feed, which has greatly improved performance.
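
A minimal sketch of producer-based selective fanout follows. It is not the algorithm from the Yahoo paper or Twitter's implementation; the `FANOUT_LIMIT` threshold and all names are assumptions for illustration. Ordinary producers are pushed at write time, while celebrity producers are merged in at read time.

```python
from collections import defaultdict

FANOUT_LIMIT = 3  # hypothetical threshold: producers with more followers are pulled

followers = defaultdict(set)
following = defaultdict(set)
own_activities = defaultdict(list)  # producer -> [(timestamp, text), ...]
pushed_feeds = defaultdict(list)    # consumer -> activities pushed at write time


def follow(consumer, producer):
    followers[producer].add(consumer)
    following[consumer].add(producer)


def is_celebrity(producer):
    return len(followers[producer]) > FANOUT_LIMIT


def publish(producer, timestamp, text):
    """Always store the activity; fan out only for ordinary producers.

    Returns the number of feed writes performed.
    """
    activity = (timestamp, text)
    own_activities[producer].append(activity)
    if is_celebrity(producer):
        return 0  # no write-time work: followers pull at read time
    for consumer in followers[producer]:
        pushed_feeds[consumer].append(activity)
    return len(followers[producer])


def read_feed(consumer, limit=10):
    """Merge the pushed feed with celebrity activities pulled at read time."""
    pulled = [
        a for p in following[consumer] if is_celebrity(p) for a in own_activities[p]
    ]
    # set() removes duplicates if a producer became a celebrity after a push.
    return sorted(set(pushed_feeds[consumer] + pulled), reverse=True)[:limit]


for fan in ("a", "b", "c", "d"):
    follow(fan, "star")
follow("a", "friend")
star_writes = publish("star", 1, "star post")        # celebrity: no fanout
friend_writes = publish("friend", 2, "friend post")  # ordinary: one write
```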

3. Selective fanout based on consumers

Another selective fanout strategy is to fan out only to active users, such as those who have logged in during the past week. We use a modified version of this approach: for active users we store the latest 3,600 activities, and for inactive users we store 180. Reading past those 180 items requires going back to the database. This makes the experience worse for inactive users, but it effectively reduces memory consumption.
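
The per-user limit can be sketched as follows. The limits of 3,600 and 180 and the one-week window come from the text; the function names and the use of a plain Python list as the feed are assumptions for illustration.

```python
from datetime import datetime, timedelta

ACTIVE_WINDOW = timedelta(days=7)  # "logged in during the past week"
MAX_LEN_ACTIVE = 3600              # figures quoted in the text
MAX_LEN_INACTIVE = 180


def max_feed_length(last_login, now):
    """Pick the feed size limit from how recently the user logged in."""
    return MAX_LEN_ACTIVE if now - last_login <= ACTIVE_WINDOW else MAX_LEN_INACTIVE


def add_to_feed(feed, activity, last_login, now):
    """Prepend the newest activity and trim the feed to the user's limit."""
    feed.insert(0, activity)
    del feed[max_feed_length(last_login, now):]
    return feed
```

For example, calling `add_to_feed` on a feed that already holds 180 items drops the oldest item for an inactive user, but lets the feed grow to 181 items for an active one.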

According to Silberstein et al., the scenario best suited to selective push is:

Producers that only occasionally produce activities
Consumers that request their feeds often

Fashiolista has not needed such a complex system so far, and I am curious at what scale a business has to adopt this kind of solution.

4. Priorities

An alternative strategy is to give fanout tasks different priorities: high priority for fanout to active users and low priority for fanout to inactive users. Fashiolista reserves a large capacity buffer for high-priority tasks so that it can handle spikes at any time. For low-priority tasks we rely on auto-scaling and spot instances. In practice this means that the feeds of inactive users are updated with some delay. Using priorities reduces the load that celebrity users put on the system. It does not solve the underlying problem, but it significantly lowers the size of the load peaks.
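
The idea can be sketched with a single in-process priority queue. This is only an illustration with made-up names; a real deployment would more likely use separate task queues and worker pools per priority.

```python
import heapq
import itertools

HIGH, LOW = 0, 1              # smaller number = served first
_counter = itertools.count()  # tie-breaker keeps FIFO order within a priority
queue = []


def enqueue_fanout(activity, follower, is_active):
    """Queue one fanout task; active followers get high priority."""
    priority = HIGH if is_active else LOW
    heapq.heappush(queue, (priority, next(_counter), follower, activity))


def run_next():
    """Pop the most urgent task and return the follower it was delivered to."""
    _priority, _order, follower, _activity = heapq.heappop(queue)
    return follower


enqueue_fanout("love:42", "inactive_1", is_active=False)
enqueue_fanout("love:42", "active_1", is_active=True)
enqueue_fanout("love:42", "active_2", is_active=True)
order = [run_next() for _ in range(3)]  # active followers are served first
```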

5. Redis vs. Cassandra

Both Fashiolista and Instagram started with Redis and later moved to Cassandra. I recommend starting with Redis because it is easier to set up and maintain.

However, Redis has its limits. All the data has to be stored in RAM, which is expensive. In addition, Redis does not natively support sharding, which means you have to shard across nodes yourself (Twemproxy is a good choice). Sharding itself is easy, but moving the data when you add or remove nodes is complicated. You can of course work around these limits by using Redis as a cache and falling back to the database. But once falling back to the database becomes too expensive, I suggest using Cassandra instead of Redis.
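
To see why adding or removing nodes is the hard part, consider the simplest client-side sharding scheme, a hash of the key modulo the number of nodes. This is an assumption for illustration and not necessarily how Twemproxy was configured at Fashiolista. With a uniform hash, growing from 4 to 5 nodes relocates about four out of five keys in expectation.

```python
import hashlib


def shard_for(key, num_nodes):
    """Map a feed key to a node index with a stable hash modulo N."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes


def moved_keys(keys, old_nodes, new_nodes):
    """Keys whose node changes when the cluster is resized."""
    return [k for k in keys if shard_for(k, old_nodes) != shard_for(k, new_nodes)]


# Hypothetical key naming scheme: one feed key per user.
keys = [f"feed:{user_id}" for user_id in range(10000)]
moved = moved_keys(keys, old_nodes=4, new_nodes=5)
fraction_moved = len(moved) / len(keys)
```

Schemes such as consistent hashing reduce the number of keys that move, but the data belonging to those keys still has to be migrated or rebuilt.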

The Python ecosystem for Cassandra is changing rapidly. cqlengine and python-driver are great projects, but they take some time to learn.

Conclusion

There are a number of factors to consider when building your own feed solution. Which storage backend do you choose? How do you deal with the load peaks caused by celebrity users? To what extent do you denormalize your data? I hope this article has given you some useful suggestions.

Feedly won't make any of these choices for you. It is just a framework for building feed systems, and the internal technical details are up to you. See Feedly's introduction for an overview, or the manual for building a Pinterest-style application.

Please note that you only need to solve this problem once your database has reached millions of users. A simple database solution carried Fashiolista to millions of users and billions of loves.

For more on the design of feed systems, I strongly recommend these articles:

Yahoo! Paper
Twitter Redis based, with fallback
Cassandra at Instagram
Etsy Feed Scaling
Facebook history
Django project, with good naming conventions. (But database only)
http://activitystrea.ms/specs/atom/1.0/ (actor, verb, object, target)
Quora post on best practises
Quora scaling a social network feed
Redis Ruby Example
FriendFeed approach
Thoonk Setup
Twitter's approach

Original article: Design Decisions for Scaling Your High Traffic Feeds (translated by Zhou Xiaolu, reviewed by Zhonghao)


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
