This article is very eye-opening, so it is reproduced here.
English Original address: http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html
Csdn translated original address: http://www.csdn.net/article/2012-02-14/311806
(Part One)
---------------
Summary: With more than 15 billion page views per month, Tumblr has become a popular blogging community. Users may like its simplicity, its beauty, its intense attention to the user experience, or its friendly yet busy way of communicating; in short, people like it. Growth of more than 30% per month does not come without challenges, and reliability issues are particularly daunting. 500 million page views per day, a peak of 40,000 requests per second, 3TB of new data stored per day ...
Introduction: As with many emerging websites, the famous light-blogging service Tumblr is facing system architecture bottlenecks amid its rapid growth. 500 million page views per day, 40,000 requests per second, 3TB of new data stored per day, more than 1,000 servers: in such a situation, keeping the old system running smoothly while transitioning smoothly to a new system is a huge challenge for Tumblr. Recently, Todd Hoff of the HighScalability website interviewed the company's distributed systems engineer Blake Matheny, who systematically introduced the site's architecture; the content is very valuable. We very much hope that domestic companies and teams will do similar sharing: besides contributing to the community, it raises their standing in the industry and brings considerable benefits for recruiting and business development. You are welcome to contribute to us through the @CSDN cloud computing Weibo account.
The first part of the translation follows; for the second part, click here. (The small text in parentheses is the CSDN editor's note.)
With more than 15 billion page views per month, Tumblr has become a popular blogging community. Users may like its simplicity, its beauty, its strong focus on user experience, or its friendly yet busy way of communicating; in short, it has won people's love.
Monthly growth of more than 30% certainly does not come without challenges, and reliability issues are particularly daunting. 500 million page views per day, spikes of 40,000 requests per second, 3TB of new data stored per day, all running on more than 1,000 servers: these numbers convey the enormous scale of Tumblr's operations.
For a startup to succeed, it has to make it through the dangerous threshold of rapid growth: finding talent, constantly rebuilding the infrastructure, maintaining the old architecture, and facing month-over-month traffic increases, at one point with only 4 engineers. This means having to choose carefully what to do and what not to do. That is Tumblr's situation. Fortunately there are now 20 engineers, enough to have the energy to work through the problems and develop some interesting solutions.
Tumblr started out as a very typical LAMP application. It is currently evolving toward a distributed services model based on Scala, HBase, Redis (the well-known open source key-value store), Kafka (an Apache project, a distributed publish-subscribe messaging system from LinkedIn), and Finagle (a fault-tolerant, protocol-agnostic RPC system from Twitter), plus an interesting cell-based architecture to support the dashboard (CSDN note: Tumblr's feature-rich user interface, similar to the Weibo timeline).
Tumblr's biggest problem now is how to transform itself into a massive website: the system architecture is evolving from LAMP to a state-of-the-art technology stack, and the team is evolving from a small startup into a fully equipped, battle-ready, regular development team that keeps creating new features and infrastructure. The following is Blake Matheny's introduction to the Tumblr system architecture.
Website: http://www.tumblr.com/

Key statistics:
- 500 million page views (PV) per day
- More than 15 billion PV per month
- About 20 engineers
- Peak of nearly 40,000 requests per second
- More than 1TB of data per day flowing into the Hadoop cluster
- MySQL/HBase/Redis/memcache generate several TB of data per day
- 30% growth per month
- Nearly 1,000 hardware nodes in the production environment
- On average, each engineer is responsible for hundreds of millions of page views per month
- About 50GB of posts uploaded per day; follower-list (dashboard feed) updates amount to about 2.7TB per day (CSDN note: the ratio of these two figures looks unreasonable; according to Tumblr data scientist Adam Laiacano's explanation on Twitter, the former figure should refer only to the text content and metadata of posts, excluding the multimedia content stored on S3)

Software environment:
- Development on OS X, production on Linux (CentOS/Scientific Linux)
- Apache
- PHP, Scala, Ruby
- Redis, HBase, MySQL
- Varnish, HAProxy, Nginx
- memcache, Gearman (a task distribution framework with multi-language support), Kafka, Kestrel (Twitter's open source distributed message queue), Finagle
- Thrift, HTTP
- Func (a secure, scriptable remote control framework and API)
- Git, Capistrano (a multi-server scripted deployment tool), Puppet, Jenkins

Hardware environment:
- 500 web servers
- 200 database servers (pools, 20 shards)
- 30 memcache servers
- 22 Redis servers
- 15 Varnish servers
- 25 HAProxy nodes
- 8 Nginx servers
- 14 job queue servers (Kestrel + Gearman)

Architecture
1. In contrast to other social networking sites, Tumblr has its own unique usage pattern:
- More than 50 million posts are updated every day, and on average each post reaches hundreds of users. Users generally have only a few hundred followers, which is very different from other social sites where a small number of users have millions of followers, and it makes Tumblr's scalability challenge a different one.
- Measured by time spent, Tumblr is already the number-two social site. The content is very engaging, with many images and videos; posts are often not short, though generally not extremely long, and users are allowed to write very long ones. Content tends to be deeper, so users spend more time reading it.
- Once users have made connections with other users, they may page back through hundreds of pages of their dashboard, which is essentially different from other sites.
- A large number of users, a wider average reach per user, and frequent posting together mean there is a huge volume of updates to process.
2. Tumblr is currently running in a managed data center and is already considering geographical distribution.
3. Tumblr is a platform consisting of two components: public tumblelogs and the dashboard.
- Public tumblelogs are similar to blogs (this wording was corrected by a Tumblr user): they are not dynamic and are easy to cache.
- The dashboard is similar to the Twitter timeline: users see real-time updates from all the users they follow. Its scaling characteristics are very different from blogs: caching does not help as much, because each request is different, especially for active followers, and the data needs to be real-time and consistent. Posts account for only about 50GB of new data per day, while follower-feed updates amount to about 2.7TB per day; all multimedia is stored on S3.
- Most users use Tumblr as a content-browsing tool, viewing more than 500 million pages per day, and 70% of that browsing happens on the dashboard.
- The availability of the dashboard has been good, but the tumblelogs' has not been good enough, because the infrastructure is old and hard to migrate; with the team short-handed, it has not yet been attended to.

The old architecture
Tumblr was initially hosted at Rackspace, and every custom-domain blog had an A record there. When Rackspace could no longer keep up with Tumblr's growth in 2007 and a migration became necessary, a large number of users would have had to migrate at the same time. So the custom domains were kept at Rackspace, with HAProxy and Varnish used to route requests to the new data center. There are many legacy problems like this.
The architecture started out on a typical LAMP path:
- Originally developed in PHP; almost all the programmers used PHP.
- Initially there were three servers: one web, one database, one PHP.
- To scale, they started using memcache, then introduced front-end caching, then put HAProxy in front of the caches, then added MySQL sharding (which was very effective; a minimal shard-routing sketch follows this list).
- The approach was to squeeze everything possible out of a single server. Over the past year, two backend services were developed in C: an ID generator and Staircar (Redis-backed dashboard notifications).
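As a rough illustration of how user-based MySQL sharding typically routes queries (the shard count, JDBC URLs, credentials, and table below are hypothetical, not Tumblr's configuration), here is a minimal Scala sketch:

```scala
import java.sql.{Connection, DriverManager}

// Hypothetical shard map: userId -> one of N MySQL shards.
// Tumblr's real pool/shard layout is not public; this only shows the routing idea.
object ShardRouter {
  private val shardUrls: Vector[String] = Vector(
    "jdbc:mysql://db-shard-00/tumblr",
    "jdbc:mysql://db-shard-01/tumblr",
    "jdbc:mysql://db-shard-02/tumblr",
    "jdbc:mysql://db-shard-03/tumblr"
  )

  // A stable function of the user id picks the shard, so all of a user's rows live together.
  def shardFor(userId: Long): String =
    shardUrls((userId % shardUrls.length).toInt)

  def withConnection[A](userId: Long)(f: Connection => A): A = {
    val conn = DriverManager.getConnection(shardFor(userId), "app", "secret")
    try f(conn) finally conn.close()
  }
}

// Usage: fetch one user's posts from whichever shard owns that user.
// ShardRouter.withConnection(42L) { conn =>
//   val rs = conn.createStatement().executeQuery("SELECT id FROM posts WHERE user_id = 42")
//   while (rs.next()) println(rs.getLong("id"))
// }
```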
The dashboard used a scatter-gather approach: when a user accesses the dashboard, events from the users they follow are pulled, merged, and displayed. This held up for about six months, but because the data is sorted by time, the sharding scheme does not work well for it (a minimal sketch of scatter-gather follows).
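A minimal sketch of the scatter-gather (fan-out-on-read) idea using plain Scala futures; the types and the per-user lookup are hypothetical stand-ins, not Tumblr's implementation:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

object DashboardScatterGather extends App {
  case class Post(authorId: Long, createdAt: Long, body: String)

  // Hypothetical per-user lookup: recent posts of one followed user.
  def recentPostsOf(userId: Long): Future[Seq[Post]] =
    Future(Seq(Post(userId, System.currentTimeMillis(), s"post by $userId")))

  // Scatter: one query per followed user. Gather: merge and sort by time.
  def dashboard(followed: Seq[Long], limit: Int = 20): Future[Seq[Post]] =
    Future.sequence(followed.map(recentPostsOf))
      .map(_.flatten.sortBy(-_.createdAt).take(limit))

  println(Await.result(dashboard(Seq(1L, 2L, 3L)), 5.seconds))
}
```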
The new architecture
For reasons such as recruiting and development speed, the JVM became the center of the new architecture. The goal is to move everything out of the PHP application and into services, making the application a thin layer on top of services that handles request authentication, presentation, and so on.
In this process, the choice of Scala and Finagle was very important. Many people on the team had Ruby and PHP experience, so Scala was attractive. Finagle was one of the most important factors in choosing Scala: this library from Twitter solves most distributed-systems problems, such as distributed tracing, service discovery, and service registration. Once on the JVM, Finagle provided all the basic functionality the team needed (Thrift, ZooKeeper, and so on) without having to write a lot of network code, and team members knew some of the project's developers. Foursquare and Twitter are both using Finagle, and Meetup is also using Scala. The application interface is Thrift-like and has excellent performance. The team liked Netty (a Java asynchronous network application framework, whose 3.3.1 final version was released on February 4), but did not want to use Java; Scala was a good choice. In short, Finagle was chosen because it is cool and they knew a few of its developers (a minimal sketch of the programming model follows).
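For readers unfamiliar with Finagle, here is a minimal, hypothetical sketch of the Service/Future programming model the article refers to (not Tumblr's code):

```scala
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response, Status}
import com.twitter.util.{Await, Future}

object EchoExample extends App {
  // A Finagle service is just an asynchronous function Request => Future[Response].
  val echo: Service[Request, Response] = new Service[Request, Response] {
    def apply(req: Request): Future[Response] = {
      val rep = Response()
      rep.status = Status.Ok
      rep.contentString = s"you asked for ${req.path}"
      Future.value(rep)
    }
  }

  // Serve it on a port, then call it through a Finagle client.
  val server = Http.serve(":8080", echo)
  val client: Service[Request, Response] = Http.newService("localhost:8080")
  println(Await.result(client(Request("/dashboard"))).contentString)
  Await.ready(server.close())
}
```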
Node.js was not selected, because it is easier to scale on a JVM base. Node's history is still short; it lacks standards, best practices, and a large body of well-tested code. With Scala, all existing Java code can be used. Although that does not in itself provide much scalability, nor does it solve the problems of 5 ms response times, four nines of availability, and 40,000 requests per second (sometimes even 400,000 requests per second), the Java ecosystem is much larger and has many resources to draw on.
Internal services are shifting from a C/libevent base to Scala/Finagle.
Newer NoSQL stores such as HBase and Redis are starting to be used. However, a large amount of data is still stored in a heavily partitioned MySQL architecture; HBase has not replaced MySQL. HBase mainly backs the short-URL service (around a billion URLs) as well as historical data and analytics, and it has held up very well. HBase is also used in high-write scenarios, such as the millions of writes per second behind dashboard refreshes. HBase has not replaced MySQL because they cannot risk the business on it; for now it feels safer to rely on people being responsible for it, so experience is being gained first on small, less critical projects. The problem with MySQL and sharded time-series data is that there is always one shard that is too hot. In addition, because of concurrent inserts on the slaves, read-replication lag is a problem.
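As a rough illustration of the key-value access pattern a short-URL store on HBase involves, here is a sketch using the standard HBase Java client from Scala; the table name, column family, and row-key scheme are assumptions, not Tumblr's actual schema:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object ShortUrlStore {
  // Hypothetical table: row key = short code, column family "u" holding the target URL.
  private val conf = HBaseConfiguration.create()
  private val connection = ConnectionFactory.createConnection(conf)
  private val table = connection.getTable(TableName.valueOf("short_urls"))

  def save(code: String, targetUrl: String): Unit = {
    val put = new Put(Bytes.toBytes(code))
    put.addColumn(Bytes.toBytes("u"), Bytes.toBytes("url"), Bytes.toBytes(targetUrl))
    table.put(put)
  }

  def resolve(code: String): Option[String] = {
    val result = table.get(new Get(Bytes.toBytes(code)))
    Option(result.getValue(Bytes.toBytes("u"), Bytes.toBytes("url"))).map(Bytes.toString)
  }
}

// ShortUrlStore.save("abc123", "https://example.com/some/long/path")
// println(ShortUrlStore.resolve("abc123"))  // Some(https://example.com/some/long/path)
```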
In addition, a common services framework was developed. A lot of time had been spent solving the operational problems of managing a distributed system, so a Rails-style scaffolding for services was built, and templates are used to bootstrap new services. From an operational point of view all services look the same: statistics, monitoring, starting, and stopping work the same way for every service. For tooling, the build process revolves around SBT (a Scala build tool), using plug-ins and helper programs to manage common operations such as tagging in git and publishing to the repository. Most programmers no longer have to worry about the details of the build system.
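For illustration, a small hypothetical build.sbt of the kind such a setup standardizes; the names, versions, and repository URL are invented:

```scala
// build.sbt -- a hypothetical service built with SBT.
// Settings like these are typically factored into a shared plug-in so that
// every service builds, tests, tags, and publishes the same way.
name := "dashboard-notifications"
organization := "com.example"
version := "1.0.0"
scalaVersion := "2.13.14"

libraryDependencies ++= Seq(
  "com.twitter" %% "finagle-http" % "24.2.0",
  "org.scalatest" %% "scalatest" % "3.2.18" % Test
)

// Publish to an internal repository; a release plug-in would also tag the commit in git.
publishTo := Some("internal" at "https://repo.example.com/releases")
```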
Many of the 200 database servers exist to improve availability. They run on commodity hardware, whose MTBF (mean time between failures) is quite low, but in case of failure there are plenty of spares.
To support the PHP application there are 6 backend services, and there is a team dedicated to developing backend services. Rolling out a new service takes about three weeks; services include dashboard notifications, the dashboard secondary index, short-URL generation, and a memcache proxy that handles transparent sharding. A great deal of time was spent on MySQL sharding. Although MongoDB is very popular in New York, they do not use it; they consider MySQL scalable enough.
Gearman is used for long-running work that needs no manual intervention.
Availability is measured in terms of reach: can a user access custom domains or the dashboard? Error rates are also used.
Historically, whatever problem had the highest priority at the moment got fixed; now failure modes are analyzed and resolved systematically, with success measured from the user and application perspective. (The latter sentence appears incomplete in the original.)
At first the actor model was used with Finagle, but it was later abandoned. For fire-and-forget work that needs no manual intervention, a task queue is used. Twitter's util library has a Future implementation, and services are written in terms of futures (a future represents the result of a concurrent operation that may not have completed yet; forcing its value blocks the caller until it does). When a thread pool is needed, futures are passed into a future pool. Everything is submitted to the future pool for asynchronous execution.
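A minimal sketch of the Twitter util Future/FuturePool style described above (generic example code, not Tumblr's services):

```scala
import java.util.concurrent.Executors
import com.twitter.util.{Await, Future, FuturePool}

object FuturePoolExample extends App {
  // A future pool wraps a JVM thread pool and turns blocking work into Futures.
  private val pool = FuturePool(Executors.newFixedThreadPool(4))

  // Blocking work (e.g. a database read) is submitted to the pool...
  def loadNotification(id: Long): Future[String] =
    pool { s"notification $id" } // runs on the pool, returns immediately as a Future

  // ...and composed without blocking the calling thread.
  val rendered: Future[Seq[String]] =
    Future.collect(Seq(1L, 2L, 3L).map(loadNotification))

  println(Await.result(rendered)) // blocking only here, at the edge
}
```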
Scala encourages avoiding shared state. Since Finagle has been exercised in Twitter's production environment, it is assumed to be solid in that respect. The architecture built on Scala and Finagle avoids mutable state and does not use long-running state machines: state is pulled from the database, used, and then written back to the database. The advantage is that developers do not have to worry about threads and locks.
There are 22 Redis servers, each with 8 to 32 instances, so more than 100 Redis instances are in use in production.
- Redis is primarily used as backend storage for dashboard notifications (see the sketch after this list). A notification is an event such as a user liking a post; notifications are shown in the user's dashboard, telling them what other users have done with their content.
- The high write rate made MySQL unable to cope. Notifications are fleeting, so even occasionally dropping one is not a serious problem; Redis is therefore a good fit for this scenario, and it gave the team a chance to learn about Redis.
- There have been no problems at all with Redis, and the community is great.
- A Scala-futures-based Redis interface was developed and has since been incorporated into the cell architecture.
- The short-URL generator uses Redis as the primary cache and HBase for permanent storage.
- The dashboard secondary index was built on top of Redis.
- Redis is also used as Gearman's persistent storage layer, via the memcache proxy built with Finagle.
- The site is slowly moving from memcache to Redis, hoping to end up with only one caching service. Redis performance is comparable to memcache.
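A minimal sketch of a notification inbox as a capped Redis list, using the Jedis Java client as a stand-in (Tumblr's own Scala-futures-based Redis interface is not public; the key layout and cap below are invented):

```scala
import redis.clients.jedis.Jedis
import scala.jdk.CollectionConverters._

// One Redis list per user, newest notification first.
object Notifications {
  private val redis = new Jedis("localhost", 6379)
  private val MaxKept = 1000 // notifications are fleeting, so dropping old ones is acceptable

  private def key(userId: Long) = s"notifications:$userId"

  def push(userId: Long, message: String): Unit = {
    redis.lpush(key(userId), message)        // prepend the newest notification
    redis.ltrim(key(userId), 0, MaxKept - 1) // cap the list length
  }

  def recent(userId: Long, n: Int = 20): Seq[String] =
    redis.lrange(key(userId), 0, n - 1).asScala.toSeq
}

// Notifications.push(42L, "alice liked your post 123")
// Notifications.recent(42L).foreach(println)
```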
(That is all for this part. Please look forward to the next installment, which covers how Kafka, Scribe, and Thrift are used to implement the internal activity stream, the dashboard cell architecture, the development process, lessons learned, and other great content.)
(Part Two)
---------------------
Internal firehose (message pipeline)

Internal applications need an activity stream. These messages include user create/delete events, liking/unliking notices, and so on. The challenge is to distribute this data in real time. The team wanted something that could monitor internal health, let the application ecosystem grow reliably, and serve as the control hub of the distributed system. Previously, this information flowed through a Scribe (Facebook's open source distributed logging system) / Hadoop pipeline: services logged to Scribe, the logs were continuously tailed, and the resulting files were fed to applications. This model stopped scaling almost immediately, especially at peak times when thousands of messages per second need to be created; you cannot expect people to publish files and grep through them as a way of streaming data.

The internal firehose acts as a bus carrying the messages; services and applications talk to the firehose through Thrift (a scalable, cross-language service development framework). LinkedIn's Kafka is used to store the messages, and internal consumers connect to the firehose over HTTP. The data volume is often enormous, so MySQL is clearly not a good idea, and partitioned setups become increasingly common.

The firehose model is very flexible, unlike Twitter's firehose, in which data is assumed to be lost. The firehose stream can be replayed in time: it retains one week of data, so data from any point within that window can be pulled back out. Multiple client connections are supported without seeing duplicate data; each client has an ID. Kafka supports consumer groups: every consumer in a group uses the same group ID, and together they do not read duplicate data. You can create many clients under the same ID and none of them will see duplicates, which allows the data to be processed independently and in parallel. Kafka uses ZooKeeper (Apache's open source distributed coordination service) to periodically checkpoint how far each consumer has read.
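For readers unfamiliar with Kafka's consumer groups (the mechanism behind "clients sharing an ID do not see duplicate data"), here is a minimal consumer sketch. It uses today's Kafka client API rather than the 2012-era client Tumblr would have used, and the broker address, group ID, and topic name are made up:

```scala
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object FirehoseConsumer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("group.id", "dashboard-indexer") // all members of this group share the work
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(java.util.List.of("firehose"))

  while (true) {
    // Offsets are tracked per group, so a restarted consumer resumes where its group left off
    // and consumers sharing the group id never see the same record twice.
    val records = consumer.poll(Duration.ofSeconds(1))
    for (record <- records.asScala)
      println(s"${record.key} -> ${record.value}")
  }
}
```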
Cell design for the dashboard inbox

The scatter-gather architecture currently powering the dashboard has very limited runway left; this situation will not last much longer. The solution is an inbox model based on a cell architecture, very similar to Facebook Messages. The inbox is the opposite of scatter-gather: each user's dashboard, made up of the posts and actions of the users they follow, is stored in chronological order, so the inbox solves the scatter-gather problem in advance. It is reasonable to ask what you put in the inbox so that it stays cheap; this scheme will last a long time.

Rewriting the dashboard is very difficult. The data is already distributed, but the problem of the data exchanges generated by users' incremental updates has not been fully solved, and the data volume is staggering: on average, every post is delivered to hundreds of different users, which is a harder problem than the one Facebook faces. Big data + a high distribution ratio + multiple data centers; millions of writes per second and 50,000 reads per second; the data, neither replicated nor compressed, grows by 2.7TB per day, and the millions of writes per second consist of 24-byte row keys. Already-popular applications run this way.

Cells:
- Each cell is independent and holds all the data for a certain group of users. All the data displayed in a user's dashboard is also in that cell.
- Users are mapped to cells, and a data center has many cells.
- Each cell has an HBase cluster, a service cluster, and a Redis cache cluster.
- Users are homed to a cell, and all the cells together serve every user's posts.
- Each cell is built on Finagle (Twitter's asynchronous RPC library) and HBase, and Thrift is used for the links to the firehose and for the various requests and database calls. (Corrections welcome.)
- A user requests their dashboard; the user belongs to a specific cell, and a service node in that cell reads their dashboard from HBase and returns the data. Background processing folds the followed users' content into the current user's table and handles the request. The Redis cache layer is used inside the cell for handling users' posts.

Request flow: a user publishes a post; the post is written to the firehose; every cell processes the post and writes the text into its database; each cell then checks whether any of the poster's followers live in that cell, and if so, those followers' inboxes are updated with that user's ID. (Corrections welcome.)

Advantages of the cell design: large volumes of requests are processed in parallel, and components are isolated from each other without interference. The cell is the unit of parallelism, so its specification can be adjusted freely to accommodate growth of the user base. Cell failures are independent: the failure of one cell does not affect the other cells. Cells also make it easy to run upgrade tests, perform rolling upgrades, and test different versions of the software.

The key idea is easy to miss: all posts are replicated to all cells, and each cell stores a single copy of every post. Each cell can therefore fully satisfy a dashboard rendering request: the application does not have to fetch data from all the posters' cells; it only needs to request the IDs of those users ("those users" is unclear here; corrections welcome), and the cell can return the dashboard content itself. Each cell can meet all of the dashboard's needs without communicating with any other cell. A user-to-cell mapping sketch follows.
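A tiny sketch of the user-to-cell mapping idea; the cell count, naming, and hash-based assignment are invented for illustration, since the article does not describe Tumblr's actual assignment scheme:

```scala
// Each cell owns the inboxes of a slice of users; all cells hold a full copy of posts.
case class Cell(id: Int, hbaseQuorum: String, redisHost: String)

object CellRegistry {
  // Hypothetical fixed set of cells in one data center.
  val cells: Vector[Cell] = (0 until 8).map { i =>
    Cell(i, s"hbase-cell$i.internal:2181", s"redis-cell$i.internal")
  }.toVector

  // A stable hash of the user id picks the home cell, so a user's inbox
  // always lives (and is always read) in the same place.
  def cellFor(userId: Long): Cell =
    cells(math.abs(userId.hashCode) % cells.length)
}

// On a dashboard request, route to the user's home cell and read only there:
// val cell = CellRegistry.cellFor(currentUserId)
// // read inbox rows from cell.hbaseQuorum, hydrate posts from the cell's local post copy
```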
Two HBase tables are used:
- One table stores a copy of every post; it is relatively small. Within the cell, this data is stored together with each poster's ID.
- The second table drives what a user's dashboard displays; it does not need to contain content from everyone the user follows. When a user views a post from different devices, that does not mean it is stored or read twice; the inbox model simply guarantees that it can be read.

Posts do not go directly into the inbox, because they are too big; instead, the poster's ID is written to the inbox and the post itself goes into the cell. This model greatly reduces storage requirements, with the user's time-ordered browsing of posts reconstructed from the inbox. The drawback is that each cell keeps a copy of every post. Surprisingly, all the posts together are smaller than the inbox images. (Corrections welcome.) The posts in each cell grow by about 50GB per day, while the inboxes grow by 2.7TB per day: users consume far more than they produce. A user's dashboard does not contain post content, only IDs, so the main growth comes from the IDs. (Tumblr users, please correct any errors.)

This design is also safe when the set of followed users changes, because all posts are already stored in the cell. If the cell only stored the followed users' posts, then some backfill work would be needed whenever the followed users changed. An alternative design would be a separate post-storage cluster, but its disadvantage is that if that cluster fails, the whole site is affected. By contrast, the cell design, with posts replicated to all cells, results in a very robust architecture.

Users with millions of followers create great difficulty. The answer is selective handling of users based on their followers and access patterns (see "Feeding Frenzy"): different users get the access patterns and distribution models appropriate to them, with two distribution modes, one suited to popular users and one for the general public. Data is processed differently depending on the type of user: posts by very active users would not actually be published in full but materialized selectively. (Tumblr users, please correct any errors.) Users who follow millions of users are treated similarly to users who have millions of followers.

Cell size is very hard to determine, and it directly affects the site's success or failure; the number of users per cell is one of the factors. You have to weigh what kind of user experience you are willing to accept against how much you are willing to invest in it. Reading data from the firehose will be the biggest test of the network; traffic within a cell is manageable. As more cells are added, they can be organized into cell groups that read data from the firehose, following a tiered data replication plan; this also helps with migrating to multiple data centers.
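A compact sketch of the two-table idea and the dashboard read path, using in-memory maps in place of the two HBase tables. All names are illustrative; the translation is ambiguous about whether the inbox rows carry post IDs or poster IDs, and this sketch uses post IDs for concreteness:

```scala
// Post bodies: one copy of every post, stored in each cell (relatively small).
case class Post(id: Long, authorId: Long, createdAt: Long, body: String)

class CellStore {
  // "Posts" table stand-in: postId -> post body.
  private var posts = Map.empty[Long, Post]
  // "Inbox" table stand-in: userId -> newest-first list of IDs to show on the dashboard.
  private var inboxes = Map.empty[Long, List[Long]]

  // Firehose handler: every cell stores the post; only followers homed in this
  // cell get an inbox entry (the ID only, never the body).
  def onPost(post: Post, followersInThisCell: Seq[Long]): Unit = {
    posts += post.id -> post
    followersInThisCell.foreach { f =>
      inboxes += f -> (post.id :: inboxes.getOrElse(f, Nil))
    }
  }

  // Dashboard read: IDs from the inbox, bodies from the local post copy.
  // No other cell is consulted.
  def dashboard(userId: Long, limit: Int = 20): Seq[Post] =
    inboxes.getOrElse(userId, Nil).take(limit).flatMap(posts.get)
}

// val cell = new CellStore
// cell.onPost(Post(1L, authorId = 7L, System.currentTimeMillis(), "hello"), followersInThisCell = Seq(42L))
// cell.dashboard(42L).foreach(p => println(p.body))
```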
Launched in New York: New York has a unique environment, with ample funding and a strong advertising industry. Recruiting is challenging because the city lacks startup experience, but over the past few years New York has been working to promote entrepreneurship. New York University and Columbia University have programs that encourage students to intern at startups rather than only on Wall Street, and the mayor is establishing a college focused on technology.
Team structure