Even if a system works reliably today, that does not mean it will necessarily keep working reliably in the future. One common cause of degradation is increased load: perhaps the number of concurrent users has grown from 10,000 to 100,000, or from 1 million to 10 million. Perhaps the system is processing much larger volumes of data than it did before. Scalability is the term we use to describe a system's ability to cope with increased load. Note, however, that it is not a one-dimensional label we can attach to a system: it is meaningless to say "X is scalable" or "Y doesn't scale". Rather, discussing scalability means considering questions like "if the system grows in a particular way, what are our options for coping with the growth?" and "how can we add computing resources to handle the additional load?"
1.3.1 Describing Load
First, we need to succinctly describe the current load on the system; only then can we discuss questions of growth (what happens if our load doubles?). Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of the system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate of a cache, or something else. Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.
To make this idea concrete, let's consider Twitter as an example, using data published in November 2012. Two of Twitter's main operations are:
Post tweet
A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).
Home timeline
A user can view tweets posted by the people they follow (300k requests/sec).
Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy. However, Twitter's scaling challenge is not primarily due to tweet volume, but due to fan-out: each user follows many people, and each user is followed by many people. There are broadly two ways of implementing these two operations:
1. When posting a tweet, simply insert the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time). In a relational database like the one in Figure 1-2, you could write a query such as:
SELECT tweets.*, users.* FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
Figure 1-2 Simple relational schema for implementing a Twitter home timeline
2. Maintain a cache for each user's home timeline, like a mailbox of tweets for each recipient user (see Figure 1-3). When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time. (A minimal sketch of this approach appears after Figure 1-3.)
Figure 1-3 Twitter's data pipeline for delivering tweets to followers, with load parameters as of November 2012.
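The sketch below illustrates approach 2 (fan-out on write) in Python. It is only a toy in-memory model under assumed names (follower_index, timeline_cache, TIMELINE_LIMIT); it is not Twitter's actual implementation, which would use a social-graph service and a distributed cache instead.

from collections import defaultdict, deque

TIMELINE_LIMIT = 800                 # assumed cap on cached entries per user
follower_index = defaultdict(list)   # user_id -> follower ids (toy stand-in for the social graph)
timeline_cache = defaultdict(deque)  # user_id -> deque of tweets, newest first

def post_tweet(sender_id, text, timestamp):
    tweet = {"sender_id": sender_id, "text": text, "timestamp": timestamp}
    # Fan out at write time: push the tweet into every follower's home timeline cache.
    for follower_id in follower_index[sender_id]:
        cache = timeline_cache[follower_id]
        cache.appendleft(tweet)
        while len(cache) > TIMELINE_LIMIT:
            cache.pop()  # drop the oldest cached entry

def home_timeline(user_id):
    # Reading is cheap: the result was precomputed when the tweets were posted.
    return list(timeline_cache[user_id])

For example, if users 2 and 3 both follow user 1 (follower_index[1] = [2, 3]), then post_tweet(1, "hello", 1) performs two cache writes, and home_timeline(2) returns the tweet without touching the global tweet collection.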
The first version of Twitter used approach 1, but the system struggled to keep up with the load of home timeline queries, so the company switched to approach 2. This works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, and so in this case it is preferable to do more work at write time and less at read time.
However, the downside of approach 2 is that posting a tweet now requires a lot of extra work. On average a tweet is delivered to about 75 followers, so 4.6k writes per second for posting tweets become 345k writes per second to the home timeline caches. But this average hides the fact that the number of followers varies wildly: some users have more than 30 million followers, which means a single tweet may result in more than 30 million writes to home timelines. Doing this in a timely way is a significant challenge: Twitter tries to deliver tweets to all followers within five seconds.
In the Twitter example, the distribution of followers per user (perhaps weighted by how often those users tweet) is a key load parameter for discussing scalability, since it determines the fan-out load. Your application may have very different characteristics, but you can apply similar reasoning to its load.
The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users' tweets continue to be fanned out to home timelines at the time of posting, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user follows are fetched separately and merged into that user's home timeline when it is read, as in approach 1. This hybrid approach is able to deliver consistently good performance. We will revisit this example in Chapter 12, after we have covered some more technical ground.
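Continuing the sketch above, here is a rough illustration of the hybrid read path: the precomputed cache covers ordinary accounts, while tweets from followed celebrities are fetched and merged at read time. The helpers followed_celebrities and recent_tweets_by are hypothetical placeholders, not real Twitter APIs.

def followed_celebrities(user_id):
    # Placeholder: followed accounts whose follower count exceeds some threshold.
    return []

def recent_tweets_by(celebrity_id, limit):
    # Placeholder: fetch that account's recent tweets from the global tweet store.
    return []

def home_timeline_hybrid(user_id, limit=50):
    merged = list(timeline_cache[user_id])              # fan-out-on-write part
    for celebrity_id in followed_celebrities(user_id):  # fan-out-on-read part
        merged.extend(recent_tweets_by(celebrity_id, limit))
    merged.sort(key=lambda t: t["timestamp"], reverse=True)
    return merged[:limit]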
1.3.2 Describing Performance
Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways: when you increase a load parameter and keep the system resources unchanged, how is the performance of the system affected? And when you increase a load parameter, how much do you need to increase the resources to keep performance unchanged?
Both questions require performance numbers, so let's look briefly at how to describe the performance of a system.
In a batch processing system such as Hadoop, we usually care about throughput: the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. In online systems, what usually matters more is the service's response time: the time between a client sending a request and receiving a response.
Latency and response time
Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request spends waiting to be handled, during which it is latent, awaiting service.
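As a rough decomposition of the terms above (a simplified sketch, not a formal definition):

response time ≈ network delay (to and from the client) + queueing delay + service time (actual request processing)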
Even if you repeat the same request over and over again, you will get a slightly different response time on every try. In practice, a system handling a variety of requests exhibits widely varying response times. We therefore cannot treat response time as a single number, but as a distribution of values that you can measure.
In Figure 1-4, each gray bar represents a request to a service, and its height shows how long that request took. Most requests are reasonably fast, but there are occasional outliers that take much longer. Perhaps the slow requests are intrinsically more expensive because they process more data. But even in a scenario where you would think all requests should take the same time, you get variation: random additional latency can be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack, or many other causes.
Figure 1-4 Illustrating mean and percentiles: response times for a sample of 100 requests to a service
It is common to see the average response time of a service reported. (Strictly speaking, "average" doesn't refer to any particular formula, but in practice it is usually understood as the arithmetic mean.) However, the mean is not a very good metric if you want to know your "typical" response time, because it doesn't tell you how many users actually experienced that delay.
Usually it is better to use percentiles. If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point: for example, if your median response time is 200 ms, that means half your requests return in less than 200 ms, and the other half take longer than that.
This makes the median a good metric if you want to know how long users typically have to wait: half of user requests are served in less than the median response time, and the other half take longer. The median is also known as the 50th percentile, sometimes abbreviated as p50. Note that the median refers to a single request; if a user makes several requests (over the course of a session, or because a single page includes several resources), the probability that at least one of them is slower than the median is much greater than 50%.
To figure out how bad your outliers are, you can look at higher percentiles: commonly the 95th, 99th, and 99.9th percentiles. They are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 take 1.5 seconds or more. This is illustrated in Figure 1-4.
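As a minimal sketch of the sort-and-index approach described above (one of several common percentile definitions; the sample numbers are made up):

def percentile(sorted_times, p):
    # Nearest-rank style: the value below which roughly p percent of samples fall.
    index = round(p / 100 * (len(sorted_times) - 1))
    return sorted_times[index]

response_times_ms = sorted([30, 32, 35, 40, 42, 45, 50, 55, 80, 1500])
print(percentile(response_times_ms, 50))   # median (p50): 42
print(percentile(response_times_ms, 95))   # p95: 1500
print(percentile(response_times_ms, 99))   # p99: 1500

Note how a single outlier dominates the high percentiles while barely moving the median.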
High percentiles of response times, also known as tail latencies, are important because they directly affect users' experience of the service. For example, Amazon describes response time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests. This is because the customers with the slowest requests are often those with the most data on their accounts because they have made many purchases, that is, they are the most valuable customers. It's important to keep those customers happy by ensuring the website is fast for them: Amazon has also observed that a 100 ms increase in response time reduces sales by 1%, and others report that a 1-second slowdown reduces a customer satisfaction metric by 16%.
On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and not to yield enough benefit for Amazon's purposes. Reducing response times at very high percentiles is difficult because they are easily affected by random events outside your control, and the benefits are diminishing.
For example, percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the target performance and availability of a service. An SLA may state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s (if the response time is longer, it might as well be considered down), and the service may be required to be up at least 99.9% of the time. These metrics set expectations for clients of the service and allow customers to demand compensation if the SLA is not met.
Queueing delays often account for a large part of the response time at high percentiles. Since a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it takes only a small number of slow requests to hold up the processing of subsequent requests, an effect sometimes known as head-of-line blocking. Even if those subsequent requests are fast to process, the client will see a slow overall response time due to the time spent waiting for the earlier requests to complete. Because of this effect, it is important to measure response times on the client side.
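A toy simulation of this effect on a single worker (all numbers are illustrative): one slow request at the front of the queue inflates the client-observed response times of the fast requests queued behind it.

service_times = [5.0, 0.01, 0.01, 0.01]  # seconds; the first request is slow

clock = 0.0
for i, service_time in enumerate(service_times):
    queueing_delay = clock          # all requests arrived at time 0
    clock += service_time
    response_time = queueing_delay + service_time  # what the client observes
    print(f"request {i}: service time {service_time:.2f}s, response time {response_time:.2f}s")

Requests 1 to 3 each take only 10 ms of processing, yet their clients see roughly 5-second response times because they were stuck behind request 0.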
When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements.
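The sketch below shows what such an open-loop load generator might look like; the target URL, the request rate, and the use of the third-party aiohttp client are illustrative assumptions, not a description of any particular load-testing tool.

import asyncio
import time

import aiohttp  # third-party async HTTP client, assumed to be installed

TARGET_URL = "http://localhost:8080/ping"  # hypothetical endpoint under test
RATE_PER_SEC = 100
DURATION_SEC = 10

async def fire(session, results):
    start = time.monotonic()
    try:
        async with session.get(TARGET_URL) as resp:
            await resp.read()
    except aiohttp.ClientError:
        pass  # for simplicity, failed requests are timed like successful ones
    results.append(time.monotonic() - start)  # client-side response time

async def main():
    results = []
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(RATE_PER_SEC * DURATION_SEC):
            # Schedule the request and immediately move on; do NOT wait for the
            # response before sending the next request (that would be closed-loop).
            tasks.append(asyncio.create_task(fire(session, results)))
            await asyncio.sleep(1 / RATE_PER_SEC)
        await asyncio.gather(*tasks)
    results.sort()
    print("p50:", results[len(results) // 2], "p95:", results[int(len(results) * 0.95)])

asyncio.run(main())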
Percentiles in practice
High percentiles become especially important in backend services that are called many times as part of serving a single end-user request. Even if the calls are made in parallel, the end-user request still has to wait for the slowest of the parallel calls to complete. It takes just one slow call to make the entire end-user request slow, as shown in Figure 1-5. Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases when an end-user request requires multiple backend calls, and so a higher proportion of end-user requests end up being slow (an effect known as tail latency amplification).
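A back-of-the-envelope illustration of this amplification (the specific numbers are assumptions for the example, and the calculation treats the backend calls as independent): if each backend call has a 1% chance of being slower than its 99th percentile, the chance that a user request touching n backends hits at least one such slow call grows quickly with n.

for n in (1, 10, 30, 100):
    p_at_least_one_slow = 1 - 0.99 ** n
    print(f"{n:3d} backend calls -> {p_at_least_one_slow:.0%} of user requests see a slow call")

With 100 parallel backend calls, roughly 63% of end-user requests are slower than a single backend's p99, even though each individual call is slow only 1% of the time.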
If you want to add response time percentiles to the monitoring dashboards for your services, you need to calculate them efficiently on an ongoing basis. For example, you may want to keep a rolling window of the response times of requests in the last 10 minutes. Every minute, you calculate the median and various percentiles over the values in that window and plot those metrics on a graph.
The naive implementation is to keep a list of the response times of all requests within the time window and to sort that list every minute. If that is too inefficient for you, there are algorithms that can calculate a good approximation of percentiles at minimal CPU and memory cost, such as forward decay, t-digest, or HdrHistogram. Beware that averaging percentiles, e.g. to reduce the time resolution or to combine data from several machines, is mathematically meaningless; the right way of aggregating response time data is to add the histograms.
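The following sketch shows the histogram-based aggregation just described; the bucket boundaries are arbitrary choices for illustration, not those of any particular library such as HdrHistogram. Each machine counts responses per latency bucket, the per-bucket counts are added together, and percentiles are read off the merged histogram.

BUCKETS_MS = [10, 20, 50, 100, 200, 500, 1000, float("inf")]  # bucket upper bounds

def record(hist, response_ms):
    # Increment the count of the first bucket whose upper bound fits the sample.
    for i, upper in enumerate(BUCKETS_MS):
        if response_ms <= upper:
            hist[i] += 1
            return

def merge(*hists):
    # Aggregating histograms is just adding the per-bucket counts.
    return [sum(counts) for counts in zip(*hists)]

def percentile_from_hist(hist, p):
    # Walk the buckets until p percent of all samples are covered.
    threshold = sum(hist) * p / 100
    running = 0
    for upper, count in zip(BUCKETS_MS, hist):
        running += count
        if running >= threshold:
            return upper  # the percentile lies at or below this bucket's bound
    return BUCKETS_MS[-1]

machine_a = [0] * len(BUCKETS_MS)
machine_b = [0] * len(BUCKETS_MS)
for t in (12, 18, 45, 300):
    record(machine_a, t)
for t in (8, 95, 40, 1500):
    record(machine_b, t)
print(percentile_from_hist(merge(machine_a, machine_b), 50))  # approximate median: 50

The approximation error is bounded by the bucket width, which is why real implementations use many fine-grained buckets.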
Figure 1-5 When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request
1.3.3 Approaches for Coping with Load
Now that we have discussed the parameters for describing load and the metrics for measuring performance, we can start discussing scalability in earnest: how do we maintain good performance when the load increases by several orders of magnitude?
An architecture that is appropriate for one level of load is unlikely to cope with ten times that load. If you are working on a fast-growing service, you will therefore probably need to rethink your architecture at every order-of-magnitude increase in load, or perhaps even more often.
People often talk of a dichotomy between scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines). The latter is also known as a shared-nothing architecture. A system that can run on a single machine is often simpler, but high-end machines can be very expensive, so very intensive workloads often cannot avoid scaling out. In reality, good architectures usually involve a pragmatic mixture of approaches: for example, using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.
Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises (see "Rebalancing Partitions" on page 209).
It is fairly straightforward to distribute stateless services across multiple machines, but taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity. For this reason, the common wisdom until recently was to keep your database on a single node until scaling cost or high-availability requirements forced you to make it distributed.
As the tools and abstractions for distributed systems get better, this may change, at least for some kinds of applications. It is conceivable that distributed data systems will become the default choice in the future, even for use cases that don't handle large volumes of data or traffic. In the rest of this book we will cover many kinds of distributed data systems and discuss how they fare not just in terms of scalability, but also ease of use and maintainability.
The architecture of systems that operate at large scale is usually highly specific to the application; there is no such thing as a generic, one-size-fits-all scalable architecture. The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or (usually) some mixture of all of these and more.
For example, a system designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system designed for 3 requests per minute, each 2 GB in size, even though the two systems have the same data throughput.
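To see why the two workloads have the same data throughput (a rough estimate, taking 1 GB ≈ 1,000,000 kB):

100,000 requests/sec × 1 kB per request = 100,000 kB/sec ≈ 100 MB/sec
3 requests/min × 2 GB per request = 6 GB per 60 sec ≈ 100 MB/sec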
An architecture that scales well for a particular application is built around assumptions about which operations will be common and which will be rare, i.e., the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive. In an early-stage startup or an unproven product, it is usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.
Even though they are specific to a particular application, scalable architectures are nevertheless usually built from general-purpose building blocks, arranged in familiar patterns. In this book we discuss those building blocks and patterns.