On August 3, 2013, the Japanese TV broadcast of the anime film "Castle in the Sky" drove Twitter to a new peak of 143,199 tweets per second, roughly 25 times Twitter's average rate of 5,700 tweets per second (TPS).
Notably, even without any advance warning of the spike, the new Twitter platform absorbed the flood of tweets without delays, let alone downtime.
Performance comparison of the old and new Twitter architectures
Just three years earlier, during the 2010 World Cup, a single penalty kick or red card could set off a storm of tweets that temporarily knocked the Twitter service out; the site that calls itself the pulse of the planet suffered frequent "heart attacks." For the following three years, Twitter engineers worked around the clock to patch the system piece by piece, but as Twitter kept growing rapidly, those fixes brought only fleeting relief.
Ultimately, Twitter decided to re-architect the system, and the new platform delivered large gains in performance and reliability: neither the "Castle in the Sky" spike nor the Super Bowl could bring Twitter down, and the new architecture also provided a solid foundation for launching new features such as multimedia Tweet cards and cross-device message synchronization.
Recently, Raffi Krikorian, Vice President of Platform Engineering at Twitter, shared the methodology and lessons behind the new architecture on the official Twitter blog. The key points are summarized below:
Reasons for the restructuring and the crux of the problem
After repeated outages during the 2010 World Cup, we reviewed the system and found the following:
We were running the world's largest Ruby on Rails application, with 200 engineers developing it. As users and services grew rapidly, all of the code, from database management and memcache connections to the public API, lived in a single code base, which made it very hard for engineers to learn, maintain, and develop against concurrently.
Our MySQL storage system had hit performance bottlenecks; the database was riddled with read and write hotspots.
Throwing hardware at the problem could not fix it: our front-end Ruby servers handled far fewer transactions per second than we expected, and far fewer than their hardware should have allowed.
On the software side, we had fallen into an "optimization trap," sacrificing the readability and flexibility of the code base in exchange for performance and efficiency.
Re-examining the system: three major goals and challenges
First, the new architecture had to excel in performance, efficiency, and reliability: it should cut latency and noticeably improve the user experience while reducing the number of servers to one tenth, and it should isolate hardware failures so they could not snowball into large-scale outages.
Second, to escape the problems of a single monolithic code base, we wanted to try a loosely coupled, service-oriented model. The goal was to encourage the same best practices of encapsulation and modularity, but at the system level rather than at the level of class libraries, modules, and packages.
Third, and most importantly, the architecture had to support the rapid release of new features. We wanted small, fully empowered teams to be able to make decisions independently and ship user-facing features on their own.
In the first phase of the work we built several proof-of-concept models, and from them we settled on the principles, tools, and architecture for the rebuild.
Key measures in rebuilding the system
First, front-end services: replace the Ruby VM with the JVM. Rewriting the code base to move services from the Ruby VM onto the JVM yielded roughly a 10x performance improvement; throughput now reaches 10-20k requests/sec/host.
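The summary does not name the JVM stack, but Twitter's JVM services are built on its open-source Finagle RPC library, so a minimal Finagle HTTP service is a reasonable sketch of what such a front-end endpoint looks like. The object name, port, and response body below are placeholders, not Twitter's actual code.

```scala
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response, Status}
import com.twitter.util.{Await, Future}

object TimelineFrontend extends App {
  // A Finagle Service is just an asynchronous function Request => Future[Response].
  val service = new Service[Request, Response] {
    def apply(req: Request): Future[Response] = {
      val rep = Response(req.version, Status.Ok)
      rep.contentString = "home timeline placeholder\n"
      Future.value(rep)
    }
  }

  // Serve the endpoint on port 8080; Finagle handles the non-blocking I/O underneath.
  val server = Http.serve(":8080", service)
  Await.ready(server)
}
```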
Second, the programming model: the system was organized by service type, with a unified client/server library that takes care of connection management, load balancing, failover policies, and the like, so that engineers can concentrate on the application and its service interfaces.
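To illustrate the "unified client library with load balancing and failover built in" idea, here is a hedged Finagle client sketch; the destination hosts and request path are made up. The point is that the engineer only names the destination, and the library handles balancing requests across hosts and routing around ones that go down.

```scala
import com.twitter.finagle.Http
import com.twitter.finagle.http.{Method, Request, Response}
import com.twitter.util.Await

object TimelineClient extends App {
  // newService takes a destination with several hosts (placeholders here) and
  // returns a single Service that load-balances across them and avoids dead hosts.
  val client = Http.client.newService(
    "tls1.example.internal:8080,tls2.example.internal:8080")

  val response: Response = Await.result(client(Request(Method.Get, "/home_timeline")))
  println(response.contentString)
}
```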
Third, adopting a service-oriented architecture (SOA) made concurrent development possible, with each team owning its service behind a well-defined interface, as sketched below.
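One way to picture that split: a team owns a narrow service contract, and everyone else programs against the contract rather than the implementation. The TweetService trait below is hypothetical, purely to show the shape of such a boundary.

```scala
import com.twitter.util.Future

// Hypothetical data type and service interface. The owning team can implement,
// test, and deploy the service independently; consumers depend only on the trait.
case class Tweet(id: Long, userId: Long, text: String)

trait TweetService {
  def getTweet(id: Long): Future[Option[Tweet]]
  def postTweet(userId: Long, text: String): Future[Tweet]
}
```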
Fourth, distributed storage of tweets. Even after the monolithic application was decomposed into services, storage remained a huge bottleneck. Twitter had been writing tweets to a single master MySQL database, which could only scale linearly, so it adopted a new partitioning strategy for tweet storage, using the Gizzard framework to build a fault-tolerant, sharded, distributed store. That ruled out relying on MySQL's unique-ID generation, so Twitter solved ID generation with Snowflake.
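Snowflake's scheme is public: a 64-bit ID packs a millisecond timestamp, a worker id, and a per-millisecond sequence number, so many machines can mint IDs independently while the IDs still sort roughly by time. Below is a minimal sketch of that scheme, not Twitter's actual implementation; the bit widths follow the published layout, and assigning worker ids is left to the caller.

```scala
// Layout: 41-bit millisecond timestamp | 10-bit worker id | 12-bit sequence.
class SnowflakeIdSketch(workerId: Long, epoch: Long = 1288834974657L) {
  require(workerId >= 0 && workerId < 1024, "worker id must fit in 10 bits")

  private var lastTimestamp = -1L
  private var sequence = 0L

  def nextId(): Long = synchronized {
    var now = System.currentTimeMillis()
    if (now < lastTimestamp)
      throw new IllegalStateException("clock moved backwards; refusing to generate id")

    if (now == lastTimestamp) {
      sequence = (sequence + 1) & 0xFFF        // 12-bit sequence within the same millisecond
      if (sequence == 0) {                     // sequence exhausted: spin until the next millisecond
        while (now <= lastTimestamp) now = System.currentTimeMillis()
      }
    } else {
      sequence = 0L
    }

    lastTimestamp = now
    ((now - epoch) << 22) | (workerId << 12) | sequence
  }
}
```

Because the timestamp occupies the high bits, IDs generated anywhere in the cluster remain roughly time-ordered, which is what lets sharded storage do without a central MySQL auto-increment counter.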
Fifth, monitoring and statistics. Once a single application becomes a complex SOA system, matching tools are needed to keep it under control. Twitter ships features quickly and needs data to back its decisions, so the Twitter Runtime Systems team built two tools for engineers: Viz and Zipkin.
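Zipkin's core idea is simple enough to sketch: every request gets a trace id, and each hop records a span with timing annotations, so a slow request can be decomposed service by service. The case classes below are an illustrative approximation of that data model, not Zipkin's actual API.

```scala
// e.g. "cs" (client send), "sr" (server receive), "ss" (server send), "cr" (client receive)
case class Annotation(timestampMicros: Long, value: String)

case class Span(
  traceId: Long,               // shared by every span belonging to one request
  spanId: Long,                // this particular hop
  parentSpanId: Option[Long],  // the caller's span, if any
  name: String,                // e.g. "timeline_service.get_home_timeline" (hypothetical)
  annotations: Seq[Annotation]
) {
  // Rough duration of this hop: latest annotation minus earliest.
  def durationMicros: Long =
    if (annotations.isEmpty) 0L
    else annotations.map(_.timestampMicros).max - annotations.map(_.timestampMicros).min
}
```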