The architectural challenges behind Tumblr's 15 billion page views a month

Source: Internet
Author: User
Tags: Apache, PHP, Redis, ZooKeeper, Varnish, HAProxy

This article is eye-opening, so it is reprinted here.

English Original address: http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html

CSDN Translation Original address: http://www.csdn.net/article/2012-02-14/311806

(Part One)

---------------

Abstract: Tumblr serves more than 15 billion page views a month and has become a popular blogging community. Users may like its simplicity, its beauty, its intense attention to the user experience, or its friendly and busy way of communicating; in short, people like it. Growth of more than 30% per month does not come without challenges, with reliability problems being particularly daunting: 500 million page views per day, a peak of 40,000 requests per second, 3TB of new data to store every day...

Introduction: Like many new websites, the well-known light-blogging service Tumblr has run into system architecture bottlenecks during its rapid growth. 500 million page views a day, a peak of 40,000 requests per second, 3TB of new data to store daily, and more than 1,000 servers: under these conditions, keeping the old system running smoothly while transitioning to a new one is an enormous challenge for Tumblr. Recently, Todd Hoff of the High Scalability website interviewed the company's distributed systems engineer Blake Matheny, and the resulting article describes the site's architecture in valuable detail. We very much hope that domestic companies and teams will share in a similar way; contributing to the community also raises their profile, which benefits recruiting and business development. You are welcome to send contributions to us via the @csdn Cloud Computing microblog.

The first part of the translation follows; the second part appears further below. (The small text in parentheses is CSDN editor's notes.)

Tumblr serves more than 15 billion page views per month and has become a popular blogging community. Users may like its simplicity, its beauty, its strong focus on the user experience, or its friendly and busy way of communicating; in short, it has won people over.

Growth of more than 30% per month certainly does not come without challenges, with reliability problems being particularly daunting. 500 million page views per day, 40,000 requests per second at peak, 3TB of new data stored per day, all running on more than 1,000 servers: this is the scale at which Tumblr operates.

To succeed, a startup has to get through the threshold of a dangerous, fast-growth period: finding talent, constantly transforming the infrastructure, maintaining the old architecture while facing huge traffic growth every month, and at one point with only 4 engineers. That means having to choose carefully what to do and what not to do. This is Tumblr's situation. Fortunately, there are now 20 engineers, enough capacity to solve problems and to develop some interesting solutions.

Tumblr started as a fairly typical LAMP application. It is now evolving toward a distributed services model built on Scala, HBase, Redis (the well-known open source key-value store), Kafka (an Apache project, the distributed publish-subscribe messaging system that came out of LinkedIn), and Finagle (Twitter's open source fault-tolerant, protocol-agnostic RPC system), plus an interesting cell-based architecture to support the dashboard (CSDN note: Tumblr's distinctive user interface, similar to a microblogging timeline).

Tumblr's biggest problem now is how to transform itself into a massive website. The system architecture is evolving from LAMP to a far more advanced technology stack, and the team has to grow from a small startup into a fully armed, always-ready development organization, constantly building new capabilities and infrastructure. Below is Blake Matheny's introduction to the Tumblr system architecture.

Website: http://www.tumblr.com/

Key numbers

- 500 million page views (PV) per day
- More than 15 billion page views per month
- About 20 engineers
- Peak of nearly 40,000 requests per second
- More than 1TB of data per day flowing into the Hadoop cluster
- Several TB of data generated per day by MySQL/HBase/Redis/Memcache
- 30% growth per month
- Nearly 1,000 hardware nodes in the production environment
- Each engineer is responsible for hundreds of millions of page views per month
- About 50GB of posts uploaded per day; about 2.7TB of dashboard (follower feed) updates per day (CSDN note: the ratio of these two numbers looks unreasonable; according to Tumblr data scientist Adam Laiacano on Twitter, the former figure refers only to post text and metadata and excludes the multimedia content stored on S3)

Software environment

- Development on OS X, production on Linux (CentOS/Scientific)
- Apache
- PHP, Scala, Ruby
- Redis, HBase, MySQL
- Varnish, HAProxy, nginx
- Memcache, Gearman (an application framework for distributing tasks across multiple languages), Kafka, Kestrel (Twitter's open source distributed message queue), Finagle
- Thrift, HTTP
- Func (a secure, scriptable remote control framework and API)
- Git, Capistrano (a multi-server scripted deployment tool), Puppet, Jenkins

Hardware environment

- 500 web servers
- 200 database servers (organized in pools, 20 shards)
- 30 memcache servers
- 22 Redis servers
- 15 Varnish servers
- 25 HAProxy nodes
- 8 nginx servers
- 14 work-queue servers (Kestrel + Gearman)

Architecture

1. Compared with other social sites, Tumblr has a unique usage pattern:
- More than 50 million posts are updated every day, and the average post reaches hundreds of people. Most users have only hundreds of followers. This is very different from other social sites, where a few users have millions of followers, and it makes Tumblr's scalability challenging.
- Measured by time spent, Tumblr is already the second-ranked social site. Its content is very engaging: there are many images and videos, and posts are often not short; they are allowed to be very long and tend to be fairly in-depth, so users spend more time reading.
- Once users have connected with other users, they may page back hundreds of pages on the dashboard and read one by one, which is fundamentally different from other sites.
- A large number of users, a wider average reach per user, and more frequent posting mean a huge volume of updates to handle.

2. Tumblr is currently running in a managed data center, with geographical distribution in mind.

3. Tumblr as a platform consists of two components: public tumblelogs and the dashboard.
- Public tumblelogs are similar to blogs (Tumblr users, please correct me): they are not very dynamic and are easy to cache.
- The dashboard is similar to the Twitter timeline: users see real-time updates from everyone they follow. Unlike the blogs, caching helps much less here, because every request is different, especially for active followers, and it needs to be real-time and consistent. Posts add only about 50GB per day, but dashboard feed updates add about 2.7TB per day; all multimedia data is stored on S3.
- Most people use Tumblr as a content-browsing tool: more than 500 million pages are viewed per day, and 70% of browsing happens on the dashboard.
- Dashboard availability is good, but tumblelog availability is not good enough, because that infrastructure is old and hard to migrate; with manpower short, it has been set aside for now.

The old architecture

Tumblr was first hosted on Rackspace, where each custom-domain blog had an A record. When Rackspace could no longer keep up with its growth in 2007, Tumblr had to migrate, and a large number of users needed to move at the same time. So the custom domain names were kept on Rackspace, with HAProxy and Varnish routing them to the new data center. Many lingering problems like this remain.

The initial architecture evolved along a typical LAMP route. It was originally developed in PHP, and almost all the programmers used PHP. It started with three servers: one web, one database, one PHP. To scale, they began using Memcache, then introduced front-end caching, then put HAProxy in front of the caches, then sharded MySQL (which was very effective), all while squeezing everything possible out of each single server. In the past year, two backend services have been written in C: an ID generator and Staircar (dashboard notifications backed by Redis).

The dashboard used a "scatter-gather" approach: when a user visits the dashboard, events are pulled from the users they follow and then displayed. This carried them for 6 months. Because the data is sorted by time, the sharding scheme is not of much use here.
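To make the scatter-gather pattern concrete, here is a minimal sketch (not Tumblr's code): each followed user's recent posts are fetched in parallel and then merged by time. The Post fields, the recentPostsOf lookup, and the use of standard-library Scala futures are assumptions for illustration only.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    // Hypothetical post record; the field names are illustrative only.
    case class Post(authorId: Long, createdAt: Long, body: String)

    object ScatterGatherDashboard {
      // Stand-in for a per-user lookup that would really hit a MySQL shard or a cache.
      def recentPostsOf(userId: Long): Future[Seq[Post]] = Future {
        Seq(Post(userId, System.currentTimeMillis(), s"post by user $userId"))
      }

      // Scatter: one query per followed user. Gather: merge everything and sort by time.
      def dashboard(followedIds: Seq[Long], pageSize: Int): Future[Seq[Post]] =
        Future.sequence(followedIds.map(recentPostsOf))
          .map(_.flatten.sortBy(-_.createdAt).take(pageSize))

      def main(args: Array[String]): Unit = {
        val page = Await.result(dashboard(Seq(1L, 2L, 3L), pageSize = 20), 5.seconds)
        page.foreach(println)
      }
    }

The weakness the article describes is visible here: the work per dashboard view grows with the number of followed users, which is why time-based sharding does not help much.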

The new architecture

For reasons such as hiring and development speed, the JVM became the new center of gravity. The goal is to move everything out of PHP into services, leaving the application as a thin layer over services that handles things like request authentication and presentation.

Among these choices, Scala and Finagle are the most important. Many people on the team had Ruby and PHP experience, so Scala was attractive. Finagle was one of the most important factors in choosing Scala. This library from Twitter solves most distributed-systems concerns, such as distributed tracing, service discovery, and service registration. Once on the JVM, Finagle provided all the basic functionality the team needed (Thrift, ZooKeeper, and so on) without having to write a lot of networking code, and team members knew some of the project's developers. Foursquare and Twitter use Finagle, and Meetup also uses Scala. The application interface is Thrift-like and has excellent performance. The team liked Netty (a Java asynchronous networking framework; version 3.3.1 Final was released on February 4) but did not want to use Java, so Scala was a good choice. In short, they chose Finagle because it is cool and because they knew a few of its developers.
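For readers who have not seen Finagle, here is a minimal HTTP service sketch written against a recent finagle-http release (the builder-style API of 2012 looked different); the port and the response text are made up, and this shows only the flavor of the programming model, not Tumblr's code.

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.{Await, Future}

    // A trivial Finagle service: every request gets a plain-text reply.
    object HelloFinagle {
      val service: Service[Request, Response] = new Service[Request, Response] {
        def apply(req: Request): Future[Response] = {
          val rep = Response()
          rep.contentString = "hello from finagle"
          Future.value(rep)
        }
      }

      def main(args: Array[String]): Unit = {
        // Bind the service on port 8080; Finagle handles the networking details.
        val server = Http.serve(":8080", service)
        Await.ready(server)
      }
    }

The same Service[Req, Rep] shape is used for Thrift services as well, which is part of what makes the RPC layer protocol-agnostic.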

The reason for not choosing Node.js is that it is easier to scale on the JVM. Node is young and short on standards, best practices, and a large body of well-tested code. With Scala, all existing Java code can be used. There is also not much accumulated knowledge about scaling Node, while the targets are 5-millisecond response times, four-nines availability, 40,000 requests per second, and at times 400,000 requests per second. Moreover, the Java ecosystem is much larger, with many more resources available.

Internal services are shifting from a C/libevent base to a Scala/Finagle base.

New NoSQL stores such as HBase and Redis have started to be adopted. However, the bulk of the data is still stored in a heavily sharded MySQL setup; HBase has not replaced MySQL. HBase mainly backs the short-URL service (billions of URLs), historical data, and analytics, and it has been very solid. HBase is also used in scenarios with heavy write demands, such as millions of writes per second when the dashboard is refreshed. HBase has not replaced MySQL because they cannot risk the business on it; for now, relying on people who know MySQL feels safer, and they are gaining HBase experience first on smaller, less critical projects. The problem with MySQL and sharding time-series data is that one shard is always too hot. They also run into read-replication lag due to insert concurrency on the slaves.
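As an illustration of the kind of write path a short-URL service backed by HBase might use, here is a small sketch against the modern HBase client API (the 2012-era client differed); the table name, column families, counter row, and target URL are all invented for the example.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // Hypothetical short-URL schema: a counter row hands out IDs, one row per ID stores the target.
    object ShortUrlStore {
      def main(args: Array[String]): Unit = {
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("short_urls"))
        try {
          // Atomically draw the next numeric ID from a single counter cell.
          val nextId = table.incrementColumnValue(
            Bytes.toBytes("counter"), Bytes.toBytes("c"), Bytes.toBytes("seq"), 1L)

          // Store the mapping short-id -> long URL; HBase absorbs the write volume.
          val put = new Put(Bytes.toBytes(nextId))
          put.addColumn(Bytes.toBytes("u"), Bytes.toBytes("target"),
            Bytes.toBytes("http://example.com/some/long/path"))
          table.put(put)
          println(s"assigned short id $nextId")
        } finally {
          table.close()
          conn.close()
        }
      }
    }

At billions of URLs a single counter row would become a hotspot, so a real service would likely hand out ID ranges in batches; the sketch only shows the basic shape of the writes.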

In addition, a common services framework has been developed:
- A lot of time went into solving the operational side of managing a distributed system.
- A Rails-style scaffolding for services was built, and internal templates are used to bootstrap a new service.
- From an operations point of view all services look the same: statistics, monitoring, starting, and stopping work the same way for every service.
- For tooling, the build process revolves around SBT (a Scala build tool), with plugins and helper programs for common operations such as tagging in git and publishing to the repository (a small sketch follows below). Most programmers no longer need to worry about the details of the build system.
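The article does not show the internal tooling, but as a rough idea of what build helpers around SBT can look like, here is a tiny build.sbt sketch with a hand-rolled git-tagging task; the organization, project name, versions, and task are assumptions, not Tumblr's actual build.

    // build.sbt - a minimal sketch of shared service build settings, not Tumblr's actual build.
    import scala.sys.process._

    ThisBuild / organization := "com.example"
    ThisBuild / scalaVersion := "2.13.14"

    // A helper task of the kind the article describes: tag the current release in git.
    lazy val gitTagRelease = taskKey[Unit]("Tag the current version in git")

    lazy val root = (project in file("."))
      .settings(
        name := "example-service",
        version := "0.1.0",
        gitTagRelease := {
          val tag = s"v${version.value}"
          // Shell out to git; in a real setup this would live in a shared plugin.
          Seq("git", "tag", "-a", tag, "-m", s"release $tag").!
        }
      )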

Many of the 200 database servers exist to improve availability. They use commodity hardware, but its MTBF (mean time between failures) is abnormally low, so plenty of spare capacity is kept in case of failure.

There are 6 backend services supporting the PHP applications, and a team dedicated to developing backend services. Rolling out a new service takes about three weeks; services include dashboard notifications, the dashboard secondary index, short-URL generation, and a memcache proxy to handle transparent sharding. A lot of time has gone into MySQL sharding. MongoDB, although very popular in New York, is not used; they believe MySQL is scalable enough.

Gearman is used for long-running work that requires no manual intervention.

Availability is measured in terms of reach: can users access custom domains or the dashboard? Error rate is also used.

Historically the highest-priority problem of the moment always got solved first; now failure modes are analyzed and resolved systematically, with the goal of setting success metrics from the perspective of users and applications. (The last sentence of the original does not appear to be complete.)

At first Finagle was used with the actor model, but this was later abandoned. For work that needs no human intervention once it is running, task queues are used. Twitter's util library provides a Future implementation, and services are implemented in terms of futures (in Scala, a handle to a result that may not be ready yet; the caller blocks only if it demands the result before the associated parallel operation has completed). When a thread pool is needed, futures are passed into a future pool, and everything is submitted to the future pool for asynchronous execution.
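Here is a minimal sketch of the future-pool pattern using Twitter's util library (a recent com.twitter.util release is assumed); the fixed-size pool, the fake database call, and all names are illustrative.

    import java.util.concurrent.Executors
    import com.twitter.util.{Await, Future, FuturePool}

    object FuturePoolExample {
      def main(args: Array[String]): Unit = {
        // Wrap a bounded thread pool in a FuturePool so blocking work returns a Future.
        val executor = Executors.newFixedThreadPool(4)
        val pool     = FuturePool(executor)

        // A blocking call (for example a database read) submitted for asynchronous execution.
        def loadFromDatabase(id: Long): Future[String] =
          pool { Thread.sleep(10); s"row-$id" }

        // Compose futures without touching threads or locks directly.
        val combined: Future[Seq[String]] =
          Future.collect(Seq(1L, 2L, 3L).map(loadFromDatabase))
        println(Await.result(combined))

        executor.shutdown()
      }
    }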

Scala encourages avoiding shared state. Because Finagle has been exercised in Twitter's production environment, it can be assumed to be solid. The architecture built on Scala and Finagle needs to avoid mutable state and long-running state machines: state is pulled from the database, used, and written back to the database. The advantage is that developers do not have to worry about threads or locks.

There are 22 Redis servers, each running 8-32 instances, so more than 100 Redis instances are in production. Redis is primarily the backend store for dashboard notifications. A notification is an event such as one user liking another user's post; notifications appear in a user's dashboard to show what other users have done with their content. The high write rate made MySQL unable to cope. Notifications are fleeting, so even dropping some is not a serious problem, which makes Redis a good fit for this scenario; it also gave the team a chance to learn Redis. Redis has been absolutely problem-free and the community is great. A Redis interface based on Scala futures was developed and has since been folded into the cell architecture. The short-URL generator uses Redis as a first-level cache with HBase for permanent storage. The dashboard's secondary index is built on Redis. Redis also serves as Gearman's persistence layer, accessed through the memcache proxy built with Finagle. The site is slowly moving from memcache to Redis, hoping to end up with a single cache service; Redis performance is on par with memcache.
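The futures-based Redis interface itself is not shown in the article, so here is a small sketch of what a notification store along those lines could look like, assuming the Jedis client wrapped in a Twitter FuturePool; the key layout and the 100-entry trim policy are invented.

    import scala.jdk.CollectionConverters._
    import com.twitter.util.{Await, Future, FuturePool}
    import redis.clients.jedis.JedisPool

    object NotificationStore {
      private val redis = new JedisPool("localhost", 6379)
      private val pool  = FuturePool.unboundedPool

      private def notificationsKey(userId: Long) = s"notifications:$userId"

      // Record that `actorId` liked `postId`, keeping only the newest 100 entries.
      def pushLike(userId: Long, actorId: Long, postId: Long): Future[Unit] = pool {
        val jedis = redis.getResource
        try {
          val key = notificationsKey(userId)
          jedis.lpush(key, s"like:$actorId:$postId")
          jedis.ltrim(key, 0, 99) // notifications are fleeting; dropping old ones is acceptable
          ()
        } finally jedis.close()
      }

      def recent(userId: Long): Future[Seq[String]] = pool {
        val jedis = redis.getResource
        try jedis.lrange(notificationsKey(userId), 0, -1).asScala.toSeq
        finally jedis.close()
      }

      def main(args: Array[String]): Unit = {
        Await.result(pushLike(userId = 42L, actorId = 7L, postId = 1001L))
        println(Await.result(recent(42L)))
        redis.close()
      }
    }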

(That is all for the first part. Please look forward to the second part, which covers how Kafka, Scribe, and Thrift are used to implement the internal activity stream, the dashboard cell architecture, the development process, and lessons learned.)

(Part Two)

---------------------

Internal firehose (messaging pipeline)
- Internal applications need an active stream of information: user create/delete events, liking/unliking notifications, and so on. The challenge is to distribute and process this data in real time. They want to be able to observe how the system is behaving internally, let the application ecosystem grow reliably, and build a control center for the distributed system.
- Previously this information flowed through a Scribe (Facebook's open source distributed logging system)/Hadoop pipeline: services logged to Scribe, writes kept being tailed, and the data was then shipped to applications. This pattern stops scaling almost immediately, especially at peak times when thousands of messages are created per second. Don't expect people to publish files and grep through them.
- The internal firehose acts like a bus carrying messages; services and applications talk to the firehose through Thrift (a scalable, cross-language service development framework).
- LinkedIn's Kafka is used to store the messages, and internal consumers connect to the firehose over HTTP. Facing such a huge flow of data, MySQL is clearly not a good idea, and partitioning would quickly become routine. (A small consumer sketch follows below.)
- The firehose model is very flexible, unlike Twitter's firehose, where data is assumed to be lost. The firehose stream can be replayed: it keeps one week of data and can serve data from any point in that window.
- Multiple client connections are supported without duplicate data. Each client has an ID. Kafka supports consumer groups: every consumer in a group shares the same ID and none of them reads duplicate data. You can create multiple clients with the same ID and still never see duplicates, which guarantees independent, parallel processing. Kafka uses ZooKeeper (Apache's open source distributed coordination service) to periodically track how far consumers have read.
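As a flavor of the consumer-group behavior described above, here is a minimal firehose reader written against today's Kafka consumer API (the Kafka 0.7 client of 2012, which coordinated through ZooKeeper directly, looked quite different); the broker address, topic name, and group ID are made up.

    import java.time.Duration
    import java.util.{Collections, Properties}
    import scala.jdk.CollectionConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer

    object FirehoseReader {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        // All consumers sharing this group id split the stream and see no duplicates.
        props.put("group.id", "dashboard-inbox-writers")
        props.put("key.deserializer",
          "org.apache.kafka.common.serialization.StringDeserializer")
        props.put("value.deserializer",
          "org.apache.kafka.common.serialization.StringDeserializer")

        val consumer = new KafkaConsumer[String, String](props)
        consumer.subscribe(Collections.singletonList("firehose"))

        while (true) {
          // Each record is one activity event (post created, like, unlike, and so on).
          val records = consumer.poll(Duration.ofMillis(500))
          records.asScala.foreach { r =>
            println(s"offset=${r.offset()} event=${r.value()}")
          }
        }
      }
    }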

Cell architecture for the dashboard inbox
- The scatter-gather architecture that currently supports the dashboard has very limited headroom, and that situation will not last long. The solution is a cell-based inbox model, very similar to Facebook Messages. The inbox is the opposite of scatter-gather: each user's dashboard, made up of the posts and actions of the people they follow, is stored in chronological order.
- Being an inbox solves the scatter-gather problem. The question to ask is what you put in the inbox so that it stays cheap. This approach will last a long time.
- Rewriting the dashboard is very difficult. The data is distributed, but the transactional quality of the data exchanges generated by users' updates has not yet been fully handled.
- The amount of data is staggering. The average post is delivered to hundreds of different users, which is far harder than what Facebook faces: big data plus a high fan-out rate plus multiple data centers.
- A million writes per second and 50,000 reads per second. Without duplication and compression the data grows by 2.7TB per day, with the million writes per second coming from 24-byte row keys. Applications that are already popular run this way.
- Each cell is independent and holds all the data for a certain number of users; all the data shown in a user's dashboard lives in that cell. Users are mapped to cells, and a data center has many cells.
- Each cell has an HBase cluster, a service cluster, and a Redis cache cluster. Users belong to a cell, and all cells together hold every user's posts.
- Each cell is built on Finagle (Twitter's asynchronous RPC library) and on HBase, with Thrift used for the connection to the firehose and for various requests and databases. (Please correct me if this is wrong.)
- A user comes to the dashboard; that user belongs to a particular cell; the cell's service reads their dashboard from HBase and returns the data. Background processing writes followed users' posts into the current user's inbox table and handles requests. The Redis cache layer inside the cell is used for posts.
- Request flow: a user publishes a post; the post is written to the firehose; every cell processes the post and writes its text into its post database; each cell checks whether any of the author's followers live in that cell, and if so, updates all of those followers' inboxes with the post's ID. (Please correct me if this is wrong.)
- Advantages of the cell design: large-scale requests are processed in parallel, and components are isolated from one another without interference. The cell is the unit of parallelism, so capacity can be adjusted to match the growth of the user base. Cell failures are independent: the failure of one cell does not affect the others. Cells also lend themselves to experimentation: upgrade tests, rolling upgrades, and testing different versions of the software.
- The key idea, which is easy to miss: every post is copied to every cell, so each cell stores a single copy of all posts and can fully satisfy dashboard rendering requests on its own. The application does not need to request the IDs of all posts, only the IDs of the relevant users ("those users" is unclear in the original; corrections welcome), and it can then return the dashboard content. Each cell can meet all the dashboard's needs without talking to any other cell.
- Two HBase tables are used (a toy sketch follows at the end of this list). One table stores a copy of each post; it is relatively small, and within a cell the data is stored along with each poster's ID. The second table describes the user's dashboard and does not need to hold everything from all followed users. When a user reads a post from a different device, it does not mean the post is stored or counted twice; the inbox model ensures it can still be read.
- Posts do not go directly into the inbox, because they are too big. Instead the post's ID goes into the inbox, while the post content goes into the cell's post table. This model sharply reduces storage requirements, and only a time-ordered view of the user's inbox needs to be returned. The downside is that every cell keeps a copy of every post. Surprisingly, all the posts together are smaller than the inbox entries. (Please correct me if this is wrong.) Posts grow by 50GB per cell per day, while inboxes grow by 2.7TB per day: users consume far more than they produce.
- A user's dashboard does not contain post content, only the posters' IDs, so the main growth comes from IDs. (Tumblr users, please correct.)
- This design is also safe when followers change, because all posts are already in the cell. If only followed users' posts were kept in a cell, backfill work would be needed whenever the follower set changed. An alternative design would be a separate post storage cluster, but its drawback is that if that cluster fails, the whole site is affected. By contrast, the cell design and the copying of posts to every cell create a very resilient architecture.
- A user with millions of followers poses a very big difficulty. The answer is to handle such users and their access patterns selectively (see the Feeding Frenzy paper): different users get different, appropriate access and distribution models. There are two distribution modes, one for popular users and one for the general public. Depending on the type of user, different data-handling methods apply; posts from very popular users are not actually fanned out everywhere but are materialized selectively. (Tumblr users, please correct.) Users who follow millions of users are treated similarly to users with millions of followers.
- The size of a cell is very hard to determine, and it directly affects the success or failure of the site. The number of users per cell is one of the factors. You have to weigh the user experience you are willing to accept against how much you invest.
- Reading from the firehose will be the biggest strain on the network. Within a cell the network traffic is manageable. As more cells are added, they can be organized into cell groups that read from the firehose, giving a hierarchical data replication scheme. This will also help with migrating to multiple data centers.
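To pin down the two-table layout sketched in the list above, here is a toy example using the modern HBase client. The table names, column families, row-key encoding, and the idea of passing in the followers that live in this cell are all assumptions for illustration; the article does not give the real schema.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object CellInboxWriter {
      def handlePost(conn: Connection, postId: Long, authorId: Long, body: String,
                     followersInThisCell: Seq[Long]): Unit = {
        val posts = conn.getTable(TableName.valueOf("cell_posts"))
        val inbox = conn.getTable(TableName.valueOf("cell_inbox"))
        try {
          // Table 1: one copy of every post, stored once per cell.
          val postPut = new Put(Bytes.toBytes(postId))
          postPut.addColumn(Bytes.toBytes("p"), Bytes.toBytes("author"), Bytes.toBytes(authorId))
          postPut.addColumn(Bytes.toBytes("p"), Bytes.toBytes("body"), Bytes.toBytes(body))
          posts.put(postPut)

          // Table 2: the inbox records only post IDs, keyed so the newest come first.
          val reversedTs = Long.MaxValue - System.currentTimeMillis()
          followersInThisCell.foreach { followerId =>
            val rowKey = Bytes.add(Bytes.toBytes(followerId), Bytes.toBytes(reversedTs))
            val inboxPut = new Put(rowKey)
            inboxPut.addColumn(Bytes.toBytes("i"), Bytes.toBytes("post"), Bytes.toBytes(postId))
            inbox.put(inboxPut)
          }
        } finally {
          posts.close()
          inbox.close()
        }
      }

      def main(args: Array[String]): Unit = {
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        try handlePost(conn, postId = 1001L, authorId = 7L, body = "hello tumblr",
                       followersInThisCell = Seq(42L, 43L))
        finally conn.close()
      }
    }

Rendering a dashboard then becomes a scan over the viewer's prefix in the inbox table followed by point lookups into the post table, which is why a cell never needs to talk to another cell.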

Running a startup in New York: New York has a unique environment with plenty of money and advertising, but recruiting is very challenging because there is little startup experience there. In the past few years New York has been working to promote entrepreneurship: New York University and Columbia University have programs that encourage students to intern at startups rather than just on Wall Street, and the mayor is setting up a college focused on technology.

