Manhattan: Twitter's real-time, multi-tenant distributed database

Building a fully service-oriented system has long been Twitter's goal. A previous article gave an overview of how Twitter's systems cope with a peak of 143,000 TPS, but did not go into the individual service components. The company has now disclosed Manhattan, its internally developed database system, which is offered as an internal service and provides features such as multi-tenancy and high availability.

The following is the translation:

As Twitter has grown into a global platform for communication and self-expression, its storage needs have grown with it. In the past few years, we found that we urgently needed a storage system that could serve millions of queries per second with very low latency in a real-time environment. Availability and speed are crucial. The ideal system not only needs to be fast, it also needs to scale across many regions of the world.

In the past few years we have contributed significantly to many open source databases, but Twitter's real-time nature means that no existing open source system meets our low-latency requirements. We were spending a great deal of time standing up new storage capacity for different products, a labor-intensive process repeated for every new use case. Given our experience developing and operating storage in production at Twitter's scale, that state of affairs was unsustainable. So we set out to build the next generation of Twitter's distributed storage system, which we call Manhattan. Manhattan has to satisfy not only our existing needs but our future ones as well.

Twitter storage system overview

Many databases today offer many features, but based on our experience we identified several requirements that match our expectations for the future, cover most of the situations we will face, and address the problems that come up in real-time use, such as correctness, operability, visibility, performance and customer support. Our requirements are as follows:

1. Reliability: Twitter's services need a durable data store with predictable performance. We demand that the store remain trustworthy through failures, slowdowns, expansions, hot spots, or anything else we encounter.

2. Availability: Most of our use cases favor availability over consistency, so we need an always-on, eventually consistent database.

3. Extensibility: We want a system that can handle changing needs in the future, so we need a solid, modular foundation on which we can build everything from new storage engines to strong consistency. In addition, an unstructured key-value data model fits our customers' needs best, while leaving them room to add structure later.

4. Operability: As clusters grow from hundreds of nodes to thousands, even the simplest operations become time-consuming and painful for operators. To use human resources effectively, we optimized for ease of operation from the very beginning. For every new feature we consider how convenient it will be to operate and how problems will be diagnosed.

5. Low latency: As a real-time service, Twitter's products require consistently low latency, so we have to make deliberate trade-offs.

6. Scalability in production: In distributed systems, scalability challenges are everywhere. Twitter needs a database in which every metric can keep climbing to new heights - cluster size, queries per second, data size, geographic spread, and number of tenants - without sacrificing cost efficiency or ease of operation.

7. Developer productivity: Developers in the company should be able to store whatever they need to build their services, on a self-service platform that does not require intervention from storage engineers, backed by a system that, in essence, "just works".

Reliability at Twitter scale

When we started building Manhattan, Twitter already ran many large storage clusters, so we understood well the challenges of operating systems at that scale; that experience told us what to strive for and what to avoid in the new system.

A reliable system is one whose performance is predictable and that can be trusted under any operation, and predictable performance can be very difficult to achieve. The key to assessing such a system is the worst case; average performance matters much less. In a properly configured, well-implemented system, average performance is rarely a problem. When we look at our metrics, we look at p999 and p9999 latencies: we care about how slow the slowest 0.01% of requests are, and we plan for the worst case. For example, if a periodic batch job degrades performance for an hour every day, then otherwise acceptable steady-state performance is irrelevant.
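
To make p999 and p9999 concrete, here is a small illustrative sketch (not Twitter's actual tooling) that computes tail-latency percentiles from a batch of simulated latency samples using the nearest-rank method:

```python
import math
import random

def percentile(samples, q):
    """Return the q-th percentile (0 < q <= 100) of latency samples
    using the nearest-rank method: sort and pick the ceil(q% * n)-th value."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(q / 100.0 * len(ordered))
    return ordered[max(rank - 1, 0)]

# Simulated latencies in milliseconds: mostly fast, with a long slow tail.
latencies_ms = [random.expovariate(1 / 2.0) for _ in range(100_000)]

print("p50   :", percentile(latencies_ms, 50))
print("p999  :", percentile(latencies_ms, 99.9))   # slowest 0.1%
print("p9999 :", percentile(latencies_ms, 99.99))  # slowest 0.01%
```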

Because predictability is the priority, we need to behave well under any potential problem or failure mode. Customers are not interested in our implementation details or our excuses; our service either works for them and for Twitter or it does not. Even when that means making trade-offs for situations that are unlikely to occur, we do so, keeping in mind that at scale rare things do happen.

Today, scale comes not only from the number of machines, the volume of requests and the size of the data, but also from the size of the workforce - the number of people using and supporting the system. We manage this by focusing on the following:

If a user causes a problem, the problem should stay confined to that user and not spread to others.

It should be easy, for us and for our users, to tell whether a problem comes from the storage system or from the user's application.

Once a potential problem is discovered and diagnosed, we minimize the time it takes to restore the system.

We must understand how each failure mode presents itself to users.

Performing routine operations and diagnosing and repairing most problems should not require an operator with deep, comprehensive knowledge of the system's internals.

Ultimately, we built Manhattan on our experience of operating at scale, where complexity is our greatest enemy. In the end, simple and working beats flashy. We favor a simple, reliable system with consistent behavior and good visibility over one that is theoretically perfect but does not hold up in real time, offers poor visibility or operability, or conflicts with our other core needs.

Building a storage system

When building the new generation of storage system, we decided to layer it so that a modular, stable core serves as the foundation on which we can add new features without major redesign.

Here are the design goals:

Keep the core lean and simple

Bring value sooner rather than later (focus on the incremental)

Multi-tenancy, QoS and self-service as first-class features

Focus on predictability

Storage is more than just a technology, it's a service

Layers

We split Manhattan into four layers: interface, storage services, storage engine, and kernel.

Kernel

The kernel is the most critical part of the storage system and must be highly stable and robust. The kernel handles failure, eventual consistency, routing, topology management, inter-datacenter replication, and conflict resolution. Key parts of the kernel are fully pluggable, so we can iterate quickly on the design and improvements and unit test components effectively.

Operators can change the topology at any time to add or remove capacity, so visibility and robust topology management are paramount. We store topology information in ZooKeeper because of its strong coordination guarantees and because it is a managed component in Twitter's infrastructure, although ZooKeeper is not on the critical path for reads and writes. We also put a lot of effort into giving the kernel the best possible visibility, tracking the correctness and performance of every host through an extensive set of Ostrich metrics.
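
The article does not describe how Manhattan lays out its topology data in ZooKeeper, but the general pattern of keeping topology off the read/write critical path while still reacting to changes can be sketched with the kazoo client. The znode path and JSON payload below are purely illustrative assumptions:

```python
import json
from kazoo.client import KazooClient

# Hypothetical znode holding the cluster topology; not Manhattan's real layout.
TOPOLOGY_PATH = "/storage/topology"

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Cache the topology locally so reads and writes never block on ZooKeeper:
# the hot path only consults this in-memory copy.
current_topology = {}

@zk.DataWatch(TOPOLOGY_PATH)
def on_topology_change(data, stat):
    """Called once at registration and again whenever the znode changes."""
    global current_topology
    if data is not None:
        current_topology = json.loads(data.decode("utf-8"))
        print("topology version", stat.version, "->", len(current_topology), "nodes")
```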

Consistency model

Many Twitter applications fit the eventually consistent model well. Availability over consistency is what almost all of our use cases prefer, so we built Manhattan to be eventually consistent at its core. However, there will always be applications that require strong consistency for their data, and building such a capability was a high priority for attracting more customers. Strong consistency is an opt-in model, and developers must weigh the trade-offs: in a strongly consistent system someone must hold mastership over a subset of the partitions, and many use cases on Twitter cannot tolerate the few seconds of unavailability that follow a mastership failure. We give developers good defaults and help them understand the trade-offs between the two models.

Achieving consistency

To achieve consistency in an eventually consistent system, a mechanism known as replica reconciliation is required. This is an incremental, always-running process that reconciles data across replicas. It addresses issues such as bit rot, software bugs, lost writes (for example, during long node downtime), and network partitions between data centers. In addition to replica reconciliation, two other mechanisms are used as optimizations for faster convergence: read-repair, which uses the rate at which data is read so that frequently accessed data converges faster, and hinted-handoff, a secondary delivery mechanism for when a node is flapping or offline for a period of time.
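
The article names read-repair but not its implementation. Under a simple last-write-wins assumption, the idea can be sketched as follows; the Replica objects and their get/put methods are hypothetical stand-ins for real storage nodes:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Versioned:
    value: Optional[bytes]
    timestamp: int  # writer-assigned version; last-write-wins

class Replica:
    """Toy in-memory replica standing in for a real storage node."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key, Versioned(None, 0))
    def put(self, key, versioned):
        # Only accept the write if it is newer than what we already have.
        if versioned.timestamp > self.data.get(key, Versioned(None, 0)).timestamp:
            self.data[key] = versioned

def read_with_repair(replicas, key):
    """Read from all replicas, return the newest value, and push it back
    to any replica that returned a stale or missing copy."""
    responses = [(r, r.get(key)) for r in replicas]
    newest = max((v for _, v in responses), key=lambda v: v.timestamp)
    for replica, seen in responses:
        if seen.timestamp < newest.timestamp:
            replica.put(key, newest)   # the "repair" part of read-repair
    return newest.value

replicas = [Replica() for _ in range(3)]
replicas[0].put("user:1", Versioned(b"alice", timestamp=2))  # two replicas are stale
print(read_with_repair(replicas, "user:1"))                   # repairs them, prints b'alice'
```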

Storage engine

At the bottom of a storage system is how data is stored on disk and how its data structures are kept in memory. To reduce the complexity and risk of managing multiple storage engines across multiple data centers, we decided to design our initial storage engines in-house, with the flexibility to plug in external storage engines for additional requirements. This lets us focus on the features we need most and carefully audit the changes that go in. We currently have three storage engines:

seadb, a read-only file format for serving bulk data generated in Hadoop.

sstable, a log-structured merge tree format for heavy-write workloads.

btree, a format for read-heavy, light-write workloads.

All storage engines support block-based compression. A minimal sketch of the log-structured idea behind sstable follows.
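
As a rough illustration of what "log-structured merge tree" means here, the toy code below buffers writes in a sorted in-memory memtable, freezes it into immutable sorted tables, and checks the newest tables first on reads. It is not Manhattan's actual sstable format (a real engine adds write-ahead logging, compaction, block indexes and compression):

```python
import bisect

class ToyLSM:
    """Minimal log-structured merge idea: writes land in an in-memory memtable;
    when it grows too large it is frozen into an immutable, sorted 'sstable';
    reads check the memtable first, then sstables from newest to oldest."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []            # list of sorted [(key, value), ...], newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):           # newest first
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def _flush(self):
        # A real engine would write a compressed, block-indexed file to disk here.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

db = ToyLSM()
for i in range(10):
    db.put(f"key{i}", f"value{i}")
print(db.get("key3"))   # found in an older frozen table
```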

Storage services

On top of the Manhattan kernel we created additional services that give Manhattan the robust functionality developers have long been asking for. Some examples follow.

Hadoop batch import: Manhattan's initial use case was as an efficient serving layer on top of data generated in Hadoop. We designed an import pipeline that lets customers generate datasets in HDFS in a simple format and specify their location through a self-service interface. Our watchers automatically pick up new datasets, convert them into seadb files in HDFS, and import them into the cluster so they can be served quickly from SSDs or memory. We keep this pipeline streamlined and simple so developers can iterate quickly on evolving datasets. One thing we learned from our users is that they tend to produce large datasets, on the order of several gigabytes, where subsequent versions usually change less than 10% to 20% of the data. To reduce network bandwidth we added an optimization: we produce a binary difference that is applied when downloading data to replicas, which drastically reduces import time across data centers.
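
The format of that binary difference is not described; one simple way to get most of the benefit is chunk-level diffing, sketched below with fixed-size chunks and content hashes. The chunk size and function names are illustrative assumptions:

```python
import hashlib

CHUNK = 1 << 20  # 1 MiB chunks; purely an illustrative choice

def chunk_hashes(blob: bytes):
    return [hashlib.sha256(blob[i:i + CHUNK]).hexdigest()
            for i in range(0, len(blob), CHUNK)]

def make_diff(old: bytes, new: bytes):
    """Return only the chunks of `new` whose hashes differ from `old`."""
    old_hashes = chunk_hashes(old)
    diff = []
    for offset in range(0, len(new), CHUNK):
        i = offset // CHUNK
        chunk = new[offset:offset + CHUNK]
        if i >= len(old_hashes) or hashlib.sha256(chunk).hexdigest() != old_hashes[i]:
            diff.append((i, chunk))       # (chunk index, new bytes)
    return diff

def apply_diff(old: bytes, diff, new_len: int):
    """Rebuild the new version on a replica that already holds `old`."""
    chunks = [old[i:i + CHUNK] for i in range(0, len(old), CHUNK)]
    for i, data in diff:
        while len(chunks) <= i:
            chunks.append(b"")
        chunks[i] = data
    return b"".join(chunks)[:new_len]
```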

Strong consistency service: the strong consistency service lets users perform a sequence of operations with strong consistency. We use a consensus algorithm together with a replicated log to ensure that events reach all replicas in order. This allows us to support operations such as check-and-set (CAS), strong read, and strong write. We support two modes today: LOCAL_CAS and GLOBAL_CAS. Global CAS gives developers strong consistency across a quorum of our data centers, while local CAS gives strong consistency within a designated data center. Each mode has its own advantages and disadvantages in terms of latency and data modelling for the application.
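
The client API for LOCAL_CAS and GLOBAL_CAS is not published in this article, so the snippet below only illustrates how a check-and-set primitive is typically used: read the current value and version, attempt a conditional write, and retry on conflict. The method names and FakeStore are invented for the example:

```python
import random
import time

class CasConflict(Exception):
    """Raised by the (hypothetical) client when the expected version is stale."""

def increment_counter(client, key, retries=10):
    """Optimistic-concurrency update built on a check-and-set primitive.

    `client` is assumed to expose:
      strong_read(key) -> (value, version)
      cas(key, expected_version, new_value) -> raises CasConflict on mismatch
    These names are illustrative, not Manhattan's real API.
    """
    for attempt in range(retries):
        value, version = client.strong_read(key)
        try:
            client.cas(key, expected_version=version, new_value=(value or 0) + 1)
            return
        except CasConflict:
            # Someone else updated the key first; back off briefly and retry.
            time.sleep(random.uniform(0, 0.01 * (attempt + 1)))
    raise RuntimeError(f"could not apply CAS update after {retries} retries")

class FakeStore:
    """Tiny in-memory stand-in for the strong-consistency service."""
    def __init__(self):
        self.value, self.version = 0, 0
    def strong_read(self, key):
        return self.value, self.version
    def cas(self, key, expected_version, new_value):
        if expected_version != self.version:
            raise CasConflict()
        self.value, self.version = new_value, self.version + 1

store = FakeStore()
increment_counter(store, "user:123:followers")
print(store.value)  # 1
```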

Timeseries counter service: we developed a dedicated service in Manhattan for handling a large volume of timeseries counters. The "customer" driving this requirement was our observability infrastructure, which needed a system that could handle tens of thousands of increments per second. At that scale, our engineers worked through many design trade-offs, such as durability, the delay acceptable before increments become visible to alerting, and the sub-second traffic patterns users could tolerate. The final solution is a lightweight, efficient counting layer on top of an optimized Manhattan cluster, which meets our requirements and improves the system's reliability.
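
The counter service is described only as a lightweight computation layer on top of Manhattan. A common way to build such a layer is to aggregate increments in memory and flush the summed deltas on a short interval, trading a bounded visibility delay for far fewer writes; the sketch below is a generic illustration, not the actual service:

```python
import threading
import time
from collections import defaultdict

class BufferedCounters:
    """Aggregate high-rate increments in memory and flush the summed deltas
    on a fixed interval: one write per key per interval instead of one per
    increment, at the cost of a bounded delay before values become visible."""

    def __init__(self, flush_fn, interval_s=1.0):
        self.flush_fn = flush_fn          # e.g. a function that writes to storage
        self.interval_s = interval_s
        self._deltas = defaultdict(int)
        self._lock = threading.Lock()
        self._schedule()

    def incr(self, counter_key, amount=1):
        with self._lock:
            self._deltas[counter_key] += amount

    def _schedule(self):
        timer = threading.Timer(self.interval_s, self._flush)
        timer.daemon = True
        timer.start()

    def _flush(self):
        with self._lock:
            pending, self._deltas = self._deltas, defaultdict(int)
        if pending:
            self.flush_fn(dict(pending))  # one aggregated write per key per interval
        self._schedule()

counters = BufferedCounters(flush_fn=lambda deltas: print("flushed", deltas), interval_s=0.5)
for _ in range(10_000):
    counters.incr("increments:demo")
time.sleep(0.6)  # let one flush cycle run for the demo
```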

Interface

The interface layer is how users interact with the storage system. Today we expose a key/value interface to users, and we are working on additional interfaces, such as a graph-based interface for interacting with edges.
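
The key/value interface is not documented in this article, so the snippet below only suggests what using such an interface could look like, with an in-memory stand-in client and the eventual-versus-strong choice from the consistency section exposed as a per-request option. None of these names are Manhattan's real API:

```python
class FakeManhattanClient:
    """In-memory stand-in for a hypothetical key/value client; the method
    names, the dataset argument and the `consistency` option are illustrative."""
    def __init__(self, dataset):
        self.dataset, self._kv = dataset, {}
    def put(self, key, value):
        self._kv[key] = value
    def get(self, key, consistency="eventual"):
        # A real client would route "strong" reads through the consensus path.
        return self._kv.get(key)

client = FakeManhattanClient(dataset="user_settings")
client.put(b"user:123:theme", b"dark")
assert client.get(b"user:123:theme") == b"dark"            # eventually consistent default
value = client.get(b"user:123:theme", consistency="strong") # opt-in strong read
```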

Tools

To keep clusters simple to run, we put extra effort into designing the best possible tools for day-to-day operations. We want the system to handle as much operational complexity as possible and to hide those implementation details from operators behind high-level, semantic commands. The first tools we built included one that changes the topology of the whole system simply by editing a file of host groups and weights, and one that restarts all nodes with a single command. When even these early tools became too cumbersome, we built an automated agent that accepts simple commands as goal states for the cluster and can queue, combine, and safely execute the necessary operations without operator attention.
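
The automated agent is described only as taking simple goal states for the cluster. The sketch below illustrates that general reconcile pattern: diff the declared target topology against the current one and emit ordered, batched steps. The data structures and operation names are assumptions for illustration, not Manhattan's actual tooling:

```python
def plan_topology_change(current, target, max_parallel_moves=1):
    """Given current and target maps of node -> weight, produce batched
    operations. Adds and reweights come before removals so capacity never
    dips, and batching limits how much data moves at once."""
    ops = []
    for node, weight in target.items():
        if node not in current:
            ops.append(("add_node", node, weight))
        elif current[node] != weight:
            ops.append(("reweight", node, weight))
    for node in current:
        if node not in target:
            ops.append(("drain_and_remove", node, 0))

    # The operator-facing command stays "apply this file"; the agent decides
    # how to split the work into safe batches.
    return [ops[i:i + max_parallel_moves] for i in range(0, len(ops), max_parallel_moves)]

current = {"host-a": 100, "host-b": 100}
target = {"host-a": 100, "host-b": 50, "host-c": 100}
for batch in plan_topology_change(current, target):
    print(batch)
```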

Storage as a service

One thing we observed about existing databases is that they are designed, built, and managed for specific use cases. As Twitter's internal services multiplied, we realized this approach could not keep up with the business. Our answer is storage as a service. By building a fully self-service storage system for engineers, we significantly increased the productivity of both the engineering and operations teams. Engineers can specify what their application needs (storage size, queries per second, and so on) and start using storage within seconds, without installing hardware or setting up schemas. Internal customers run in a multi-tenant environment managed by the operations team. Self-service and multi-tenant cluster management bring their own challenges, so we treat this service layer as a first-class feature: we give operators visibility into customers' data and workloads; we have built-in quota enforcement and rate limiting, with alerts that notify engineers when they cross a threshold; and the information we collect feeds directly into our capacity and fleet management teams. By making it easy for engineers to launch new features, we see new use cases emerge through experimentation and then grow incrementally. To handle this better, we developed internal APIs that publish cost-analysis data for each use case, which helps us determine which use cases cost the most and which are rarely used.
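
The article lists what a self-service request contains (storage size, queries per second, and so on) but not its form; the sketch below imagines it as a small validated structure plus a quota check, purely for illustration, with made-up team names and limits:

```python
from dataclasses import dataclass

@dataclass
class ProvisionRequest:
    # Fields mirror the prose above; the names and limits are made up.
    application: str
    storage_gb: int
    reads_per_sec: int
    writes_per_sec: int

# Illustrative per-team quotas enforced before anything is created.
TEAM_QUOTAS = {"timeline-team": {"storage_gb": 10_000, "qps": 500_000}}

def provision(team: str, req: ProvisionRequest):
    quota = TEAM_QUOTAS.get(team)
    if quota is None:
        raise PermissionError(f"unknown team {team!r}")
    if req.storage_gb > quota["storage_gb"]:
        raise ValueError("requested storage exceeds team quota")
    if req.reads_per_sec + req.writes_per_sec > quota["qps"]:
        raise ValueError("requested QPS exceeds team quota")
    # A real system would now create the dataset and record it for capacity
    # planning and cost reporting; here we just acknowledge the request.
    return {"dataset": f"{team}/{req.application}", "status": "created"}

print(provision("timeline-team",
                ProvisionRequest("user_settings", 500, 20_000, 5_000)))
```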

Focus on customers

Although our customers are fellow Twitter employees, we still treat them as customers: we provide support, we are on call, we isolate applications from one another, and we consider the user experience in everything we do. Most developers are familiar with having to read detailed documentation just to make a small addition or adjustment to a storage system; making such features genuinely self-service requires designing for many different needs. When a customer has a problem, we need the service designed so that we can quickly and accurately find the root cause, including issues and traffic bursts caused by different customers and applications sharing the same system. We have succeeded in building Manhattan as a service, not just a piece of technology.

Multi-tenancy and QoS

Support for multi-tenancy - allowing many different applications to share the same resources - was a critical requirement from the beginning. In the systems Twitter used earlier, we built a dedicated cluster for every feature. That added operational burden, wasted resources, and slowed customers down in launching new features. As mentioned above, allowing multiple customers to share the same cluster raises the bar for the system: we must now consider isolation, resource management, capacity modelling for multiple customers, rate limiting, QoS, quotas, and more. To give our customers the visibility they need, we designed our own rate-limiting service to enforce customers' resource usage and quotas. By watching whether metrics cross their thresholds, we can ensure that one application does not affect the other applications on the system.

Rate limiting is not done at a coarse granularity; it is enforced at a sub-second level and tolerates the bursts that occur in real traffic. Beyond automatic enforcement, we also consider what manual controls operators should have for dealing with problems and how to minimize any negative impact on other customers. We built an API that pulls usage data for each customer and feeds it to our capacity team, who make sure resources are available whenever a customer asks for anything of "Twitter scale", so that engineers can get to work without extra help from us. Integrating all of this directly into the self-service system lets customers launch new features faster on large multi-tenant clusters, and lets us absorb traffic spikes more easily because most customers do not use all of their allotted resources at once.
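
Sub-second rate limiting that tolerates short bursts is commonly implemented as a token bucket that refills continuously; the sketch below is a generic version of that idea, not Manhattan's actual limiter:

```python
import time

class TokenBucket:
    """Allow a sustained rate of `rate` requests per second while tolerating
    short bursts up to `burst` requests. Refilling continuously means the
    limit is enforced at sub-second granularity, not per whole second."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# 1,000 requests/second sustained, bursts of up to 200 extra requests allowed.
limiter = TokenBucket(rate=1000, burst=200)
accepted = sum(limiter.allow() for _ in range(500))
print(f"accepted {accepted} of 500 back-to-back requests")
```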

Focus on the future

There is much more we want to do. The challenges keep growing, and the features built into Manhattan are growing rapidly with them. Pushing ourselves to become better and smarter is what drives the core storage team. We take pride in our values: everything we do should make Twitter better, and we keep asking how we can make our customers more successful. We will be releasing a white paper covering more of Manhattan's technical details and our two years of experience running it. Stay tuned!

Original link: http://www.csdn.net/article/2014-04-08/2819197-manhattan-realtime-distributed-database-at-twitter-scale
