Practical experience in highly available and scalable architecture

Last Update:2017-03-29 Source: Internet

Author: User


The maturity of mobile Internet, cloud computing, and big data allows more good ideas to be implemented as products in a short time. If a product accurately meets user needs, the number of users is likely to grow explosively, without the several years of careful operation that used to be required. However, rapid growth in the number of users (especially explosive growth within a short period) usually brings severe technical challenges for application developers: how to avoid service unavailability caused by the failure of a single machine, how to avoid a decline in user experience when service capacity is insufficient, and so on. Adopting a highly available and scalable architecture at the beginning of system construction effectively avoids these problems.

How can we build a highly available and scalable architecture? Drawing on many years of practical experience, Li Daobing, chief architect of Qiniu Cloud Storage, explains in detail how to build highly available and scalable systems for less complex business scenarios, covering the entry layer, business layer, cache layer, and database layer. I hope that after reading this article you will feel that high availability and scalability are not unattainable, and that at a low investment cost you can build them into the architecture design at an early stage of the project.

How to achieve high availability
Entry layer

The entry layer usually refers to Nginx or Apache and serves as the service entry point of the application (whether a Web application or a mobile application). We usually point the service at a single IP address. If the server behind that IP address goes down, user access is interrupted. In this case, you can use keepalived to make the entry layer highly available. For example, if the IP address of machine A is 1.2.3.4 and the IP address of machine B is 1.2.3.5, apply for a third IP address, 1.2.3.6 (called the heartbeat IP), which is normally bound to machine A. If A goes down, the IP address is automatically bound to machine B. If B goes down, the IP address is automatically bound to machine A. In this way, we only need to point the DNS record at the heartbeat IP to achieve high availability at the entry layer.
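
The failover rule in this example can be sketched in a few lines. This is only a simulation of the decision that keepalived makes, not its configuration or its actual protocol; the function name `vip_holder` and the idea of a fixed priority order are assumptions made for illustration, and the addresses are the ones from the example.

```python
from typing import Dict, List, Optional

VIP = "1.2.3.6"  # the heartbeat IP from the example


def vip_holder(priority: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the machine that should hold the heartbeat IP.

    `priority` lists machine IPs from most to least preferred; the first
    healthy machine wins. Returns None when every machine is down.
    """
    for ip in priority:
        if healthy.get(ip, False):
            return ip
    return None


machines = ["1.2.3.4", "1.2.3.5"]  # machine A, machine B

# Normal operation: A is healthy, so A holds the heartbeat IP.
normal = vip_holder(machines, {"1.2.3.4": True, "1.2.3.5": True})
# A goes down: the heartbeat IP moves to B.
a_down = vip_holder(machines, {"1.2.3.4": False, "1.2.3.5": True})
print(f"{VIP} normally on {normal}, after failure of A on {a_down}")
```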

However, this solution has a few small problems. First, service may be interrupted for one or two seconds during the switchover; unless availability is required at the millisecond level, this is not a problem. Second, it wastes machines at the entry layer, because of the two entry machines only one may actually be in use. Third, for applications that use persistent connections, the switchover may interrupt service, and the client has to cooperate by re-creating the connection. Simply put, this solution solves part of the problem for ordinary businesses.

Note that keepalived has some restrictions.

The two machines must be in the same CIDR block; the heartbeat IP does not work across different CIDR blocks.
Intranet services can also use a heartbeat IP. Note, however, that we used to bind intranet services to intranet IP addresses to avoid security problems, whereas with keepalived the service must listen on all IP addresses (if it listens only on the heartbeat IP address, the service cannot start while the machine does not hold that address). A simple solution is to enable iptables to prevent intranet services from being accessed from the Internet.
Server utilization is lower. You can consider mixed deployment (running several services on the same machines) to improve it.

A common mistake is to take two machines with two public IP addresses, point the domain name at both addresses in DNS at the same time, and consider the service highly available. This is not high availability at all: if one machine goes down, about half of the users will be unable to access the service.
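
A quick calculation makes the "about half" concrete. The sketch below assumes that clients are spread evenly (round-robin) over the addresses returned by DNS and never retry against the other address; real resolvers and browsers behave less uniformly, and `failed_fraction` is a name invented for this illustration.

```python
from itertools import cycle, islice
from typing import Dict, List


def failed_fraction(dns_ips: List[str], up: Dict[str, bool], requests: int) -> float:
    """Fraction of requests that fail when clients are spread evenly
    over the IPs returned by DNS and never retry."""
    targets = islice(cycle(dns_ips), requests)
    failures = sum(1 for ip in targets if not up[ip])
    return failures / requests


# Two public IPs behind one domain name; the machine at 1.2.3.4 is down.
fraction = failed_fraction(
    ["1.2.3.4", "1.2.3.5"],
    {"1.2.3.4": False, "1.2.3.5": True},
    requests=1000,
)
print(fraction)  # 0.5
```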

In addition to keepalived, LVS can also be used to solve the high availability problem at the entry layer. However, compared with keepalived, LVS is more complex and has a higher barrier to entry.

Business layer

The business layer is usually composed of logic code written in PHP, Java, Python, Go, and so on, and depends on back-end databases and caches. How can we achieve high availability at the business layer? The core principle is that the business layer should not hold state; state should be pushed down to the cache layer and the database. At present, people tend to put the following types of data in the business layer.

The first is the session, that is, data related to the user's login. The best practice is to put the session in the database or in a relatively stable cache system.

The second is the cache. When a database query is slow, it is tempting to keep the result temporarily in the process so that the next query does not need to access the database. The problem with this approach is that when the business layer has more than one server, the data is hard to keep consistent, and the data obtained from the cache may be wrong.
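
The inconsistency is easy to reproduce. In the following sketch a plain dictionary stands in for the database, and `AppServer` is a hypothetical class that caches query results inside its own process; none of these names come from a real framework.

```python
from typing import Dict


class AppServer:
    """A business layer server that caches query results in its own process."""

    def __init__(self, database: Dict[str, str]) -> None:
        self.database = database               # shared by all servers
        self.local_cache: Dict[str, str] = {}  # private to this process

    def read(self, key: str) -> str:
        if key not in self.local_cache:
            self.local_cache[key] = self.database[key]
        return self.local_cache[key]

    def write(self, key: str, value: str) -> None:
        self.database[key] = value
        self.local_cache[key] = value  # only this server's cache is refreshed


database = {"nickname": "old"}
server_1, server_2 = AppServer(database), AppServer(database)

server_1.read("nickname")          # both servers now cache "old"
server_2.read("nickname")
server_1.write("nickname", "new")  # the update goes through server 1

fresh = server_1.read("nickname")  # "new"
stale = server_2.read("nickname")  # still "old": server 2 never saw the update
```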

A simple principle is that the business layer should not hold state. When the business layer is stateless and one business layer server crashes, Nginx/Apache automatically sends all requests to another business layer server. Because there is no state, the two servers are interchangeable, and the user does not notice anything. If the session is kept in the business layer, a user who logged in through one machine is logged out when the process on that machine is killed.
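
A minimal sketch of the stateless arrangement follows. `SessionStore` stands in for the database or cache system that holds the sessions, and both class names are assumptions for illustration; the point is only that a session created through one server remains valid on another.

```python
import secrets
from typing import Dict, Optional


class SessionStore:
    """Stands in for a database or a stable cache shared by all servers."""

    def __init__(self) -> None:
        self._sessions: Dict[str, str] = {}

    def create(self, user: str) -> str:
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = user
        return session_id

    def lookup(self, session_id: str) -> Optional[str]:
        return self._sessions.get(session_id)


class StatelessServer:
    """Keeps no per-user data in the process; everything lives in the store."""

    def __init__(self, store: SessionStore) -> None:
        self.store = store

    def login(self, user: str) -> str:
        return self.store.create(user)

    def whoami(self, session_id: str) -> Optional[str]:
        return self.store.lookup(session_id)


store = SessionStore()
server_a, server_b = StatelessServer(store), StatelessServer(store)

session_id = server_a.login("alice")               # the user logs in through server A
del server_a                                       # server A crashes
user_after_failover = server_b.whoami(session_id)  # server B still knows the user
```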

Friendly reminder: cookie sessions were popular for a while. The session data is encrypted and stored in a cookie delivered to the client, so the server can be completely stateless. However, this approach has many pitfalls; use it only if you can avoid them. The first pitfall is keeping the encryption key secret: once the key is leaked, an attacker can forge anyone's identity. The second pitfall is replay attacks: for example, by saving a cookie and replaying it, an attacker can keep retrying a verification code, and there are other attacks of this kind. If you cannot solve these two problems, use cookie sessions with great caution. It is best to put the session in a database with good performance. If the database performance is not good enough, putting the session in the cache is still better than putting it in a cookie.
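
To make the two pitfalls concrete, here is a minimal sketch of a cookie session using only the Python standard library. Note that it signs the cookie with HMAC rather than encrypting it, so the content is protected against forgery but is not hidden from the client. The expiry time only limits how long a captured cookie can be replayed; it does not prevent replay within that window. `SECRET_KEY`, `make_cookie`, and `read_cookie` are hypothetical names.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET_KEY = b"replace-me-and-never-send-me-to-the-client"  # hypothetical key


def make_cookie(user: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Serialize the session, attach an expiry time, and sign both."""
    issued = time.time() if now is None else now
    payload = json.dumps({"user": user, "exp": issued + ttl_seconds}).encode()
    body = base64.urlsafe_b64encode(payload).decode()
    signature = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{signature}"


def read_cookie(cookie: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user name, or None if the cookie is forged or expired."""
    body, _, signature = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None  # tampered with, or signed with a different key
    session = json.loads(base64.urlsafe_b64decode(body))
    current = time.time() if now is None else now
    if current > session["exp"]:
        return None  # expired: bounds how long a captured cookie can be replayed
    return session["user"]


cookie = make_cookie("alice", ttl_seconds=600, now=1000.0)
```

Anyone who obtains `SECRET_KEY` can call `make_cookie` for any user, which is exactly the first pitfall described above.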

Cache layer

A very simple architecture has no cache layer. However, when traffic grows, MySQL and other databases cannot keep up. For example, when MySQL runs on SATA disks and QPS reaches 200, 300, or even 500, MySQL performance drops sharply. In this case, you can use a cache layer to absorb most requests and increase the overall capacity of the system.
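
The way a cache layer absorbs requests can be sketched with the common read-through pattern: look in the cache first and query the database only on a miss. The source does not prescribe a particular pattern, so this is one possible arrangement; a dictionary stands in for the cache layer, and `Database` and `CachedReader` are illustrative names.

```python
from typing import Dict


class Database:
    """Counts queries so the effect of the cache is visible."""

    def __init__(self, rows: Dict[str, str]) -> None:
        self.rows = rows
        self.queries = 0

    def query(self, key: str) -> str:
        self.queries += 1
        return self.rows[key]


class CachedReader:
    def __init__(self, database: Database) -> None:
        self.database = database
        self.cache: Dict[str, str] = {}  # stands in for the cache layer

    def get(self, key: str) -> str:
        if key in self.cache:             # cache hit: the database is not touched
            return self.cache[key]
        value = self.database.query(key)  # cache miss: fall through to the database
        self.cache[key] = value
        return value


database = Database({"article:1": "hello"})
reader = CachedReader(database)
results = [reader.get("article:1") for _ in range(1000)]
print(database.queries)  # 1
```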

A simple way to make the cache layer highly available is to split it more finely. For example, if the cache layer is a single machine, then when that machine goes down, all the pressure from the application layer falls on the database. If the database cannot handle the pressure, the entire website (or application) goes down with it. If the cache layer is split across four machines, each machine carries only 1/4 of the load. When one of them goes down, only 1/4 of the total traffic falls on the database, and if the database can survive that, the website stays stable until the cache layer is restarted. In practice, 1/4 is obviously not fine enough; we split further until the database can survive the loss of a single cache machine. At small or medium scale, the cache layer and the business layer can be deployed on the same machines to save servers.
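
How fine is fine enough can be estimated with simple arithmetic. The sketch below assumes that keys are spread evenly over the cache machines and that a failed machine's hits all fall through to the database; the traffic figures are made up for illustration and are not measurements from the article.

```python
import math


def db_load_with_one_shard_down(total_qps: float, hit_rate: float, shards: int) -> float:
    """Database QPS when one of `shards` equally loaded cache machines is down.

    Normal misses reach the database anyway; on top of that, the failed
    machine's share of the hits (1/shards of them) falls through too.
    """
    misses = total_qps * (1 - hit_rate)
    hits = total_qps * hit_rate
    return misses + hits / shards


def min_shards(total_qps: float, hit_rate: float, db_capacity_qps: float) -> int:
    """Smallest shard count for which the database survives one cache failure."""
    headroom = db_capacity_qps - total_qps * (1 - hit_rate)
    if headroom <= 0:
        raise ValueError("the database is already overloaded by normal cache misses")
    return max(1, math.ceil(total_qps * hit_rate / headroom))


# Hypothetical numbers: 5000 QPS in total, 95% hit rate, database good for 600 QPS.
load_with_4 = db_load_with_one_shard_down(5000, 0.95, 4)
needed = min_shards(5000, 0.95, 600)
print(f"4 machines: {load_with_4:.0f} QPS on the database; machines needed: {needed}")
```

With these numbers, four machines would leave the database far over capacity after a single failure, which matches the remark that 1/4 is not fine enough.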

Database layer

High availability at the database layer is usually achieved at the software level. For example, MySQL supports master-slave mode and master-master mode, and MongoDB has the concept of a replica set. These can basically meet everyone's needs.

In short, to achieve high availability, do the following: use a heartbeat IP at the entry layer, keep the business layer servers stateless, reduce the granularity of the cache layer, and run the database in master-slave mode. With this approach, high availability does not require many servers: all of these services can be deployed on two servers at the same time. Two servers are then enough to meet early high availability requirements, and users are completely unaware when either server goes down.

How to achieve scalability
Entry layer

To make the entry layer scalable, you can simply add machines horizontally and then add their IP addresses to DNS. However, although resolving a domain name to dozens of IP addresses is allowed, many browser clients only use the first few. Some domain name providers have optimized for this (for example, by returning the IP addresses in random order each time), but the effect of this optimization is unstable.

We recommend using a small number of Nginx machines as the entry point, with the business servers hidden in the intranet (most HTTP services work this way). Alternatively, you can send all IP addresses to the client and do some scheduling on the client side (especially for non-HTTP services such as games and live broadcasting).

Business layer

How is the business layer made scalable? As with high availability, the key to scalability at the business layer is to keep it stateless. Beyond that, simply keep adding machines horizontally.

Cache layer

The troublesome part is the scalability of the cache layer. What is the simplest and crudest method? In the middle of the night, when traffic is relatively low, take the entire cache layer offline and bring a new cache layer online. After the new cache layer starts, wait for the cache to warm up gradually. Of course, this requires that your database can withstand the requests during the off-peak period. What if it cannot? The answer depends on the type of cache, so let us first distinguish the cache types.

Strongly consistent cache: wrong data from the cache is unacceptable (for example, a user's balance, or data that downstream systems will cache again).
Weakly consistent cache: wrong data from the cache is acceptable for a period of time (for example, the number of reposts of a Weibo post).
Immutable cache: the value corresponding to a cache key never changes (for example, a password derived from SHA1, or the result of some other complex formula).

Which types of cache are easier to scale? Weakly consistent and immutable caches are very convenient to expand: just use consistent hashing. Strongly consistent caches are a little more complicated and are discussed later. The reason for using consistent hashing rather than simple hashing is the proportion of cache entries that are invalidated. If the cache is expanded from 9 machines to 10, about 90% of the cached entries become invalid immediately with simple hashing, whereas with consistent hashing only about 10% become invalid.
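
The 90% versus 10% figures can be checked with a small experiment. The sketch below compares modulo hashing with a basic consistent hash ring when the cache grows from 9 to 10 machines. The ring with 200 virtual points per machine and the SHA-256 based `stable_hash` are choices made for this illustration, not something the article specifies, and the measured percentages are approximate.

```python
import bisect
import hashlib
from typing import List


def stable_hash(text: str) -> int:
    """Deterministic hash (Python's built-in hash() is randomized per process)."""
    return int(hashlib.sha256(text.encode()).hexdigest(), 16)


def simple_node(key: str, node_count: int) -> int:
    """Simple hash: machine index = hash modulo the number of machines."""
    return stable_hash(key) % node_count


class ConsistentHashRing:
    def __init__(self, nodes: List[str], vnodes: int = 200) -> None:
        # Each machine is placed on the ring at `vnodes` pseudo-random points.
        points = {stable_hash(f"{node}#{i}"): node for node in nodes for i in range(vnodes)}
        self._hashes = sorted(points)
        self._owners = [points[h] for h in self._hashes]

    def node(self, key: str) -> str:
        # A key belongs to the first point clockwise from its own hash.
        index = bisect.bisect_right(self._hashes, stable_hash(key)) % len(self._hashes)
        return self._owners[index]


keys = [f"key-{i}" for i in range(10000)]
nine = [f"cache-{i}" for i in range(9)]
ten = nine + ["cache-9"]

simple_moved = sum(simple_node(k, 9) != simple_node(k, 10) for k in keys) / len(keys)

ring_9, ring_10 = ConsistentHashRing(nine), ConsistentHashRing(ten)
moved_keys = [k for k in keys if ring_9.node(k) != ring_10.node(k)]
consistent_moved = len(moved_keys) / len(keys)

print(f"simple hash: {simple_moved:.0%} of keys moved")
print(f"consistent hash: {consistent_moved:.0%} of keys moved")
```

With the ring, every key that moves lands on the newly added machine; the other nine machines keep the rest of their entries.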

So what are the problems with a strongly consistent cache? The first problem is that cache clients receive the configuration update at slightly different times, and stale data may be read during this window. The second problem is that removing a node after an expansion can produce dirty data. For example, key a is on machine 1; after the expansion it is on machine 2, where the data is updated; after the node is removed, the key goes back to machine 1, and the stale value there is read.
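
The second problem can be replayed step by step. In this sketch, dictionaries stand in for the database and the two cache machines, and the `read` and `write` helpers are hypothetical; the caller passes in whichever machine currently owns the key.

```python
from typing import Dict

database: Dict[str, str] = {"a": "v1"}
machine_1: Dict[str, str] = {}
machine_2: Dict[str, str] = {}


def read(key: str, cache: Dict[str, str]) -> str:
    """Read through whichever cache machine currently owns the key."""
    if key not in cache:
        cache[key] = database[key]
    return cache[key]


def write(key: str, value: str, cache: Dict[str, str]) -> None:
    """Update the database and the machine that currently owns the key."""
    database[key] = value
    cache[key] = value


# 1. Before the expansion, key "a" lives on machine 1.
first = read("a", machine_1)          # machine 1 now caches "v1"

# 2. After the expansion, key "a" maps to machine 2, and the data is updated there.
write("a", "v2", machine_2)           # machine 1 still holds "v1"

# 3. Machine 2 is removed, so key "a" maps back to machine 1.
after_removal = read("a", machine_1)  # dirty read: "v1" instead of "v2"
```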

Problem 2 is relatively simple to solve: either never remove nodes, or make the interval between node changes longer than the validity period of the data. Problem 1 can be solved with the following steps:

Push both hash configurations to every client, but keep using the old configuration;
Client by client, switch to using the cache only when the two hash configurations give the same result; in all other cases, read from the database but still write to the cache;
Client by client, notify the clients to use the new configuration.
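
The three steps above can be sketched as a client with three phases. The steps do not say which machine receives a write while the two configurations disagree, so this sketch refreshes both candidates; that choice, the `Phase` names, and the locator functions are all assumptions for illustration.

```python
from enum import Enum
from typing import Callable, Dict

Locator = Callable[[str], str]  # maps a key to the name of a cache machine


class Phase(Enum):
    OLD_ONLY = 1    # step 1: both configurations loaded, old one in use
    AGREE_ONLY = 2  # step 2: cache used only where old and new agree
    NEW_ONLY = 3    # step 3: new configuration in use


class MigratingCacheClient:
    def __init__(self, old: Locator, new: Locator,
                 nodes: Dict[str, Dict[str, str]], database: Dict[str, str]) -> None:
        self.old, self.new = old, new
        self.nodes = nodes  # machine name -> that machine's cached data
        self.database = database
        self.phase = Phase.OLD_ONLY

    def read(self, key: str) -> str:
        if self.phase is Phase.AGREE_ONLY and self.old(key) != self.new(key):
            return self.database[key]  # configurations disagree: bypass the cache
        locate = self.new if self.phase is Phase.NEW_ONLY else self.old
        cache = self.nodes[locate(key)]
        if key not in cache:
            cache[key] = self.database[key]
        return cache[key]

    def write(self, key: str, value: str) -> None:
        self.database[key] = value
        if self.phase is Phase.OLD_ONLY:
            targets = {self.old(key)}
        elif self.phase is Phase.NEW_ONLY:
            targets = {self.new(key)}
        else:
            # Assumption: during the transition, refresh both candidate machines.
            targets = {self.old(key), self.new(key)}
        for node in targets:
            self.nodes[node][key] = value


def old_locator(key: str) -> str:
    return "node-1"  # hypothetical old layout: everything on node-1


def new_locator(key: str) -> str:
    return "node-2" if key == "a" else "node-1"  # key "a" moves to node-2


nodes: Dict[str, Dict[str, str]] = {"node-1": {}, "node-2": {}}
database = {"a": "a1", "b": "b1"}
client = MigratingCacheClient(old_locator, new_locator, nodes, database)

client.phase = Phase.AGREE_ONLY
read_a = client.read("a")  # configurations disagree: served from the database
read_b = client.read("b")  # configurations agree: cached on node-1
```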

Memcache was designed relatively early, so it does not give much consideration to scalability and high availability. Redis has made many improvements in this regard. In particular, the @ngaut team developed Codis on top of Redis, which solves most of the problems of the cache layer in one go. We recommend taking a look.

Database layer

There are many ways to scale at the database layer, and plenty of documentation exists, so I will not repeat it here. The general approaches are horizontal splitting, vertical splitting, and periodic rolling.

In short, the methods and technologies described above can be used to achieve high availability and scalability at the entry layer, business layer, cache layer, and database layer. Specifically: at the entry layer, use a heartbeat IP for high availability and parallel deployment for scaling; keep the business layer stateless; at the cache layer, reduce the granularity to make high availability easier and use consistent hashing for scalability; at the database layer, use master-slave mode for high availability and splitting and rolling for scalability.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
