Reliable, Scalable, and Maintainable Data Systems -- Designing Data-Intensive Applications, Reading Notes 1

Frankly, it was something of a coincidence that I entered the field of distributed systems during my postgraduate studies. Whether for large-scale storage or large-scale computation, the core need is to use distributed techniques to exploit parallelism for data-intensive applications. I recently started chewing through the tome Designing Data-Intensive Applications. Its author, Martin Kleppmann, has a deep background in distributed data systems, and in this book he thoroughly lays out the technical logic behind all kinds of complex designs; the compromises between different architectures, and the ways each goes beyond the others, are well worth reading for developers and architects alike.
It is a pity that there is no Chinese translation yet. This series is my reading reflections, with some of my own understanding and opinions mixed in, offered to spark discussion; I hope we can exchange ideas and learn from each other. The book has 12 chapters, and I will post one reading note per chapter. (I have clearly dug myself another hole.) I also hope a domestic publisher can acquire the rights soon; I would love to take part in the translation work!

1. Data-intensive applications

As developers, the vast majority of applications we build today are data-intensive, not compute-intensive. Raw CPU power is rarely the limiting factor for these applications; the more pressing problems are the complexity of the data, the complexity of its structure, and the performance of the application.

Let's look at the data systems we often deal with:

    • Store data so that it, or another application, can find it again later (databases)
    • Remember the result of an expensive operation, to speed up reads (caches)
    • Allow users to search data by keyword or filter it in various ways (search indexes)
    • Send a message to another process, to be handled asynchronously (stream processing)
    • Periodically crunch a large amount of accumulated data (batch processing)

Much of the time, the main job of our so-called application is to combine these data systems and add our business logic on top, but how to integrate these data systems sensibly is still a question worth studying and thinking about. Data systems are also becoming more and more alike, each learning from the others' strengths: caching systems such as Redis can persist data to disk and in many cases can replace a traditional RDBMS, and message queues such as Kafka can also persist the messages they store. Understanding these data systems more deeply, so as to strike a better balance in architectural design, is a very deep topic.
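
To make this concrete, here is a minimal sketch (my own illustration, not from the book) of the kind of glue code such an application ends up with: a cache-aside read path that combines a Redis cache with a relational database. The `users` table, the key format, and the expiry time are assumptions for the example.

```python
import json
import sqlite3

import redis  # redis-py client

cache = redis.Redis(host="localhost", port=6379)
db = sqlite3.connect("app.db")  # hypothetical application database

def get_user(user_id):
    """Cache-aside read: try the cache first, fall back to the database."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    row = db.execute(
        "SELECT id, name, email FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None

    user = {"id": row[0], "name": row[1], "email": row[2]}
    cache.set(key, json.dumps(user), ex=300)  # remember the result for 5 minutes
    return user
```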

A typical application is composed of a variety of such data systems, and as the volume of data and the complexity of the data logic grow, it becomes a data-intensive application.

2. Three principles for designing data-intensive applications
    • Reliability
      With fault tolerance (in the face of hardware or software failure, or even human error), the system should continue to function properly (performing the correct function at the desired performance level).
    • Scalability
      As the system grows (in data volume, traffic, or complexity), there should be a reasonable way to handle this growth.
    • Maintainability
      Over time, many different people will be working to improve the data system (both to maintain current behavior and to adapt the system to the new environment), and they should all be able to work productively.

Clearly, these three principles are not only for data-intensive applications; they matter in most software systems. Let's go through them one by one.

(1) Reliability
    • Hardware failure
      Hard disks crash, memory fails, the power grid goes down, someone unplugs the wrong network cable; in a data center, hardware faults of one kind or another are happening all the time.
      Solution:

      • Build redundancy at both the hardware and software level, so that a hardware fault does not escalate into a system failure.
    • Human errors
      People are notoriously unreliable; the evolution of driving technology alone shows how much damage human negligence can cause, and people make mistakes all the time.
      Solution:

      • Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do "the right thing" and discourage "the wrong thing".

      • Decouple the places where people make the most mistakes from the places where mistakes can cause failures.

      • Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests.

      • Allow quick and easy recovery from human error, to minimize the impact when a failure does occur. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bug affects only a small subset of users), and provide tools to recompute data (in case the old computation was incorrect). (A small sketch of gradual rollout follows below.)
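
As a small illustration of the "roll out new code gradually" point above, here is a hedged sketch (my own, not from the book) of a percentage-based rollout check; the feature name, user id, and hashing scheme are hypothetical.

```python
import hashlib

def in_rollout(user_id, feature, percent):
    """Deterministically place a user into the first `percent`% bucket for a feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Serve the new code path to roughly 5% of users; rolling back is just setting it to 0.
if in_rollout("user-42", "new-timeline-renderer", 5):
    print("new code path")
else:
    print("old code path")
```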

(2) Scalability

Even if a system works reliably today, it does not mean that it will work reliably in the future. A common cause is increased load: Perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million.

"If the system grows in a certain way, what are our options for growth?" " How can we increase compute resources to handle the extra load?" "

    • Describe the load
      First, we need to succinctly describe the current load of the system, and the load can be described by several numbers, which we call load parameters.
      The choice of parameters depends on the architecture of the system, such as:
    • Requests to the Web server per second
    • Read-write ratio in database
    • Number of active users in the chat room
    • Cache Hit Ratio

    • Describe performance
      Once the load on the system has been described, we can discuss what happens when the load increases. We can look at it in two ways:
      1. When a load parameter increases and the system resources (CPU, memory, network bandwidth, etc.) stay unchanged, how is the system's performance affected?
      2. If we want to keep performance unchanged while the load increases, how many additional resources do we need?

So we need a yardstick for describing performance:

    • Average response time: the arithmetic mean of n measured values (add them all up and divide by n). However, this is not a very good metric, because it does not tell you how many users actually experienced that delay.
    • Median response time: sort the list of response times from fastest to slowest; the median is the halfway point. For example, if the median response time is 200 milliseconds, half of the requests return in less than 200 milliseconds and half take longer than that.
    • Higher percentiles of response time: to see how bad the outliers are, look at higher percentiles; the 95th, 99th, and 99.9th percentiles (abbreviated p95, p99, and p999) are common thresholds for response time. (A small computation sketch follows below.)
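
As a quick illustration of why the mean and the percentiles tell different stories, here is a minimal sketch (my own, not from the book) that computes the mean and nearest-rank percentiles over a made-up list of request latencies; the sample values and the nearest-rank method are assumptions for the example.

```python
import math

def percentile(latencies, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds; one slow outlier at the end.
latencies_ms = [12, 15, 18, 22, 25, 31, 40, 55, 90, 480]

print("mean:", sum(latencies_ms) / len(latencies_ms))  # 78.8 ms, dragged up by the outlier
print("p50: ", percentile(latencies_ms, 50))           # 25 ms, what a typical request sees
print("p99: ", percentile(latencies_ms, 99))           # 480 ms, the tail latency
```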

Load and performance both matter, because sometimes the bottleneck of a system is caused by a few extreme cases. The author gives a Twitter example, and I think it is worth sharing in detail here:

The story of Twitter

Data released by Twitter on November 16, 2012.
The two main operations of Twitter are:

    • Post a tweet
      A user can publish a new tweet to their followers. (4.6k requests/second on average, with peaks over 12,000 requests/second.)
    • Get tweets (home timeline)
      A user can view tweets posted by the people they follow. (About 300k requests/second.)

Twitter's scaling challenge comes not primarily from the volume of tweets, but from fan-out: each user follows many people, and each user is followed by many people. There are basically two ways to implement these two operations:

    • 1. Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up everyone they follow, find all the tweets for each of those users, and merge them (sorted by time). In a relational database, you could write a query such as:

      ```sql
      SELECT tweets.*, users.*
        FROM tweets
        JOIN users   ON tweets.sender_id    = users.id
        JOIN follows ON follows.followee_id = users.id
       WHERE follows.follower_id = current_user
      ```
    • 2. Maintain a cache of each user's home timeline, like a mailbox of tweets for each recipient. When a user posts a tweet, look up all the people who follow that user and push the new tweet into each of their caches. Reading the home timeline is then cheap, because its result has been computed ahead of time.

Method 2 is clearly the better fit for this workload, because tweets are published at a rate almost two orders of magnitude lower than they are read, so it is better to do more work at write time and less at read time. However, method 2 does not work well for accounts with a huge number of followers: if someone has 30 million followers, a single tweet can turn into an enormous amount of write work. So Twitter now mixes the two approaches. Most users' tweets are still fanned out to home timeline caches when they are published, but a small number of users with very large followings (i.e. celebrities) are excepted: tweets from any celebrities a user follows are fetched separately at read time and merged with that user's timeline cache. This hybrid approach consistently delivers good performance.
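
To make the hybrid approach concrete, here is a rough sketch (my own illustration, not Twitter's actual code) of fan-out on write with a read-time exception for celebrities. The follower threshold, the in-memory dictionaries, and the function names are all assumptions for the example.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 100_000          # assumed cut-off for "too many followers"

followers = defaultdict(set)           # user_id -> set of follower ids
home_timelines = defaultdict(list)     # user_id -> cached list of tweets (fan-out on write)
celebrity_tweets = defaultdict(list)   # celebrity user_id -> their own tweets

def post_tweet(user_id, text):
    if len(followers[user_id]) >= CELEBRITY_THRESHOLD:
        # Celebrity: skip the fan-out and do the work at read time instead.
        celebrity_tweets[user_id].append(text)
    else:
        # Ordinary user: push the tweet into every follower's home timeline cache.
        for follower_id in followers[user_id]:
            home_timelines[follower_id].append(text)

def read_home_timeline(user_id, followed_celebrities):
    # Precomputed cache, plus a read-time merge of any followed celebrities' tweets.
    # (A real system would sort the merged result by timestamp; omitted here.)
    merged = list(home_timelines[user_id])
    for celeb_id in followed_celebrities:
        merged.extend(celebrity_tweets[celeb_id])
    return merged
```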

(This example is a very concise illustration of the compromises and subtlety of architectural design: optimizing the data system around the characteristics of the business to get the best performance out of it. I admire the Twitter engineers' skill in architecture design, and I'm also curious whether services such as Weibo use a similar design.)

    • How to scale
      The two options are scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines). In practice, a good architecture usually involves a pragmatic mixture of both: for example, using a few fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines. Distributing carelessly adds complexity to the system and is one of the risky parts of software engineering; while distributing stateless services across multiple machines is fairly straightforward, moving stateful data systems from a single node to a distributed setup brings a great deal of extra complexity.
      There is no generic, one-size-fits-all scalable architecture for every application. (Well put.)

(3) Maintainability

This section describes some approaches to building maintainable systems. Most of the cost of software is not in its initial development, but in ongoing maintenance: fixing bugs, keeping the system operational, adapting it to new business needs, and adding new features.

    • Operability
      Make it easy for the operations team to keep the system running smoothly.

    • Simplicity
      Make it easy for new engineers to understand the system, by removing as much complexity as possible.

    • Evolvability
      Make it easy for engineers to change the system in the future, adapting it to unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.

(It is really painful to maintain a mess left behind by someone else; documentation and comments really are the most important things!)
