Scalable Web Architecture and Distributed System Design

Open source software has become the basic building block of some of the largest websites. As these sites have evolved, best practices and guiding principles surrounding their architecture have emerged. This chapter aims to introduce some of the key issues that need to be considered when designing a large website, as well as some of the building blocks used to achieve these goals.

This chapter focuses mainly on Web systems, although some materials are also applicable to other distributed systems.

1.1 Principles of Web Distributed System Design
What does it mean to build and operate a scalable website or application? At its most basic, it simply means connecting users with remote resources via the Internet; the part that makes it scalable is that the resources, or access to those resources, are distributed across multiple servers.

Like most things in life, taking the time to plan ahead when building a web service can help in the long run; understanding some of the considerations and trade-offs behind big websites can lead to more informed decisions when creating smaller websites. The following are some of the key principles that influence the design of large-scale web systems:

Availability: The uptime of a website is critical to the reputation and functionality of many companies. For some large online retail sites, being unavailable for even a few minutes can result in thousands or millions of dollars in lost revenue, so designing their systems to be always available and resilient to failure is both a fundamental business and a technology requirement. High availability in distributed systems requires the careful consideration of redundancy for key components, rapid recovery in the event of partial system failures, and graceful degradation when problems occur.
Performance: Website performance has become an important consideration for most websites. The speed of a website affects usage and user satisfaction, as well as search engine rankings, which are factors directly related to revenue and retention. Therefore, creating a system optimized for fast response and low latency is key.
Reliability: A system needs to be reliable, such that a request for data will consistently return the same data. If the data changes or is updated, then that same request should return the new data. Users need to know that if something is written to the system, or stored, it will persist and can be relied on to be in place for future retrieval.
Scalability: When it comes to any large distributed system, size is only one aspect of scale that needs to be considered. Equally important is the effort required to increase the capacity to handle larger loads, commonly referred to as system scalability. Scalability can refer to many different parameters of the system: how much additional traffic it can handle, how easy it is to add more storage capacity, and even how many transactions it can handle.
Manageability: Designing a system that is easy to operate is another important consideration. The manageability of the system equates to the scalability of operations: maintenance and updates. Things to consider for manageability are the ease of diagnosing and understanding problems when they occur, the ease of making updates or modifications, and how simple the system is to operate. (That is, does it routinely operate without failures or exceptions?)
Cost: Cost is an important factor. This may obviously include hardware and software costs, but it is also important to consider other aspects required to deploy and maintain the system. The developer time required to build the system, the operational workload required to run the system, and even the amount of training required should all be considered. Cost is the total cost of ownership.
Each of these principles provides the basis for decisions in designing a distributed web architecture. However, they can also be at odds with one another, such that achieving one objective comes at the cost of another. A basic example: choosing to address capacity by simply adding more servers (scalability) can come at the price of manageability (you have to operate an additional server) and cost (the price of the servers).

When designing any type of web application, it is important to consider these key principles, even if it is to acknowledge that the design may sacrifice one or more of them.

1.2 Basics
When it comes to system architecture, there are a few things to consider: what are the right pieces, how these pieces fit together, and what are the right trade-offs. Investing in scaling before it is needed is generally not a smart business proposition; however, some foresight in the design can save substantial time and resources in the future.

This section focuses on the core factors of almost all large-scale web applications: services, redundancy, partitioning, and handling failures. Each of these factors involves choices and compromises, especially in the context of the principles described in the previous section. To explain this in detail, it is best to start with an example.

Example: Image hosting application
At some point, you may have posted a picture online. For large sites that host and provide a large number of images, there are challenges in building a cost-effective, high-availability, and low-latency (fast retrieval) architecture.

Imagine a system where users can upload their images to a central server, and the images can be requested via a web link or API, just like Flickr or Picasa. For simplicity's sake, let's assume that this application has two key parts: the ability to upload (write) an image to the server, and the ability to query for an image. While we certainly want the upload to be efficient, we care most about having very fast delivery when someone requests an image (for example, images could be requested for a web page or other application). This is very similar to the functionality a web server or Content Delivery Network (CDN) edge server provides (a CDN stores content in many locations so the content is geographically/physically closer to users, resulting in faster performance).

Other important aspects of the system are:

There is no limit to the number of images to be stored, so storage scalability needs to be considered in terms of the number of images.
Image download/request requires low latency.
If the user uploads an image, the image should always exist (data reliability of the image).
The system should be easy to maintain (manageability).
Since image hosting has low profit margins, the system needs to be cost-effective.

In this image hosting example, the system must be perceptibly fast, its data stored reliably, and all of these attributes highly scalable. Building a small version of this application would be trivial and easily hosted on a single server; however, that would not be interesting for this chapter. Let's assume that we want to build something that could grow as big as Flickr.
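To make concrete how trivial the single-server version is, here is a minimal sketch in which uploads and reads share one process and one store. All class and method names here are illustrative assumptions, not from the text, and a dict stands in for real disk or object storage.

```python
# Minimal single-server image host: the write path and the read path
# live in the same process and contend for the same store.

class ImageHost:
    def __init__(self):
        self._store = {}  # image_id -> raw bytes

    def upload(self, image_id: str, data: bytes) -> None:
        # Write path: a real system would persist to disk or object storage.
        self._store[image_id] = data

    def get(self, image_id: str) -> bytes:
        # Read path: served from the same store, competing with writes.
        return self._store[image_id]

host = ImageHost()
host.upload("cat.jpg", b"\xff\xd8\xff")  # stand-in for JPEG bytes
assert host.get("cat.jpg") == b"\xff\xd8\xff"
```

This version works, but as the next sections show, coupling reads and writes behind one server is exactly what limits it.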

Services
When considering a scalable system design, it helps to decouple functions and treat each part of the system as its own service with a well-defined interface. In fact, the system designed in this way is said to have a service-oriented architecture (SOA). For these types of systems, each service has its own unique functional context, and interacts with anything outside of that context through an abstract interface (usually the public-facing API of another service).

Deconstructing the system into a set of complementary services separates the operations of these parts from each other. This abstraction helps to establish a clear relationship between the service, its underlying environment, and the users of the service. Creating these clear descriptions can help isolate the problem, but it also allows each part to expand independently of each other. This service-oriented system design is very similar to object-oriented programming design.

In our example, all requests to upload and retrieve images are processed by the same server; however, as the system needs to scale, it makes sense to break out these two functions into their own services.

Fast-forward and assume that the service is in heavy use; such a scenario makes it easy to see how longer writes will impact the time it takes to read the images (since those two functions will be competing for shared resources). Depending on the architecture, this effect can be substantial. Even if the upload and download speeds are the same (which is not true of most IP networks, since most are designed for at least a 3:1 download-to-upload speed ratio), files are typically read from cache, while writes ultimately have to go to disk (and may be written several times in eventually consistent situations). Even when everything is in memory or read from disks (such as SSDs), database writes are almost always slower than reads. (Pole Position is an open source tool for DB benchmarking: http://polepos.org/, with results at http://polepos.sourceforge.net/results/PolePositionClientServer.pdf.)

Another potential problem with this design is that a web server like Apache or lighttpd typically has an upper limit on the number of simultaneous connections it can maintain (defaults are around 500, but can go much higher), and under high traffic, writes can quickly consume all of those. Since reads can be asynchronous, or take advantage of other performance optimizations like gzip compression or chunked transfer encoding, the web server can switch between clients quickly, serving many more requests per second than its maximum number of connections (with Apache and max connections set to 500, it is not uncommon to serve several thousand read requests per second). Writes, on the other hand, tend to maintain an open connection for the duration of the upload, so uploading a 1MB file could take more than 1 second on most home networks, meaning that web server could only handle 500 such simultaneous writes.
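The asymmetry above can be made concrete with some back-of-envelope arithmetic. The 500-connection limit and the roughly 1-second upload time come from the paragraph above; the 10 ms time to serve a cached read is an illustrative assumption, not a figure from the text.

```python
# Back-of-envelope math for the connection-limit bottleneck.

MAX_CONNECTIONS = 500
WRITE_SECONDS = 1.0    # ~1 s to upload a 1 MB file on a home connection
READ_SECONDS = 0.010   # assumed time to serve one cached read

# Each connection slot can be reused once its request finishes, so
# throughput is (connections) / (time each request holds a connection).
max_writes_per_sec = MAX_CONNECTIONS / WRITE_SECONDS   # 500.0
max_reads_per_sec = MAX_CONNECTIONS / READ_SECONDS     # 50000.0

print(max_writes_per_sec, max_reads_per_sec)
```

Even with generous assumptions, the same 500 slots support two orders of magnitude more reads than writes, which is why slow writes can starve the read path.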

Planning for this sort of bottleneck makes a good case for splitting out reads and writes of images into their own services, as shown in Figure 1.2. This allows us to scale each of them independently (since it is likely we will always do more reading than writing), but it also helps clarify what is going on at each point. Finally, this separates future concerns, making it easier to troubleshoot and scale a problem like slow reads.

The advantage of this approach is that we are able to solve problems independently of one another; we don't have to worry about writing and retrieving new images in the same context. Both of these services still leverage the global corpus of images, but they are free to optimize their own performance with service-appropriate methods (for example, queuing up requests, or caching popular images; more on this below). From a maintenance and cost perspective, each service can scale independently as needed, which is great because if they were combined and intermingled, one could inadvertently affect the performance of the other, as in the scenario discussed above.
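One way to picture the split described above: two independent services wrapping the same global image corpus, each free to scale and optimize on its own. The class names, the read-side cache, and the storage-as-dict shorthand below are all hypothetical sketch choices, not part of any real implementation.

```python
# Sketch of the read/write split: each service sits behind its own
# interface, so each can be scaled and tuned independently
# (e.g. caching on the read side, queuing on the write side).

class ImageStore:
    """Shared global image corpus (a dict stands in for object storage)."""
    def __init__(self):
        self.blobs = {}

class ImageWriteService:
    def __init__(self, store: ImageStore):
        self.store = store

    def upload(self, image_id: str, data: bytes) -> None:
        # A real write service might queue this work before persisting.
        self.store.blobs[image_id] = data

class ImageReadService:
    def __init__(self, store: ImageStore):
        self.store = store
        self.cache = {}  # popular images could be kept here

    def get(self, image_id: str) -> bytes:
        # Serve from cache when possible; fall back to the shared corpus.
        if image_id not in self.cache:
            self.cache[image_id] = self.store.blobs[image_id]
        return self.cache[image_id]

store = ImageStore()
writer = ImageWriteService(store)
reader = ImageReadService(store)
writer.upload("dog.png", b"\x89PNG")
assert reader.get("dog.png") == b"\x89PNG"
```

Because each service exposes only its interface, you could later run many read instances against few write instances without either side knowing.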

Of course, the above example can work well when you have two different endpoints (in fact, this is very similar to several cloud storage providers' implementations and Content Delivery Networks). There are lots of ways to address these types of bottlenecks, though, and each presents different trade-offs.

For example, Flickr solves this read/write problem by distributing users across different shards, such that each shard can only handle a set number of users, and as users increase, more shards are added to the cluster. In the first example, it is easier to scale hardware based on actual usage (the number of reads and writes across the whole system), whereas Flickr scales with its user base (but forces the assumption of equal usage across users, so there can be extra capacity). In the former, an outage or issue with one of the services brings down functionality across the whole system (no one can write files, for example), whereas an outage with one of Flickr's shards will only affect those users. In the first example, it is easier to perform operations across the whole dataset (for example, updating the write service to include new metadata, or searching across all image metadata), whereas with the Flickr architecture each shard would need to be updated or searched (or a search service would need to be created to collate that metadata, which is in fact what they do).
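A minimal sketch of the user-sharding idea: each user is deterministically assigned to one shard, so a shard outage affects only its own users. The hashing scheme below is an illustrative assumption, not Flickr's actual mechanism.

```python
import hashlib

def assign_shard(user_id: str, num_shards: int) -> int:
    """Map a user to a shard via a stable hash of the user id."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# The same user always maps to the same shard.
assert assign_shard("alice", 8) == assign_shard("alice", 8)

# Note the trade-off: changing num_shards remaps users, which is one
# reason shard assignments are often stored in a lookup table rather
# than recomputed from a formula.
```

This also illustrates the cross-dataset cost mentioned above: a query over all users must now fan out to every shard, or go through a separate collating service.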

When it comes to these systems, there is no right answer, but it helps to go back to the principles at the start of this chapter, determine the system needs (heavy reads or writes or both, level of concurrency, queries across the data set, ranges, sorts, etc.), benchmark different alternatives, understand how the system will fail, and have a solid plan for when failure happens.
