Xu Hanbin: Building a Billion-Scale Web System, from a Single Machine to a Distributed Cluster

Source: Internet
Author: Xu Hanbin
Tags: connection pooling, Redis, domain name server, DNS, Redis cluster

The architecture of a high-traffic website is always "grown" gradually. Along the way, many problems arise, and it is in the process of continuously solving them that the web system becomes larger and larger. New challenges often emerge on top of old solutions. I hope this article can offer some reference and help to fellow engineers.

The following is the original text.

As a web system grows from 100,000 users to 10 million, and even past 100 million, the pressure on it keeps rising, and along the way we run into many problems. To cope with this performance pressure, we need to build multiple layers of caching at the architecture level. Different stages of pressure bring different problems, which we solve by building different services and structures.

Web Load Balancing

Web load balancing, simply put, means distributing "work tasks" across our server cluster; choosing an appropriate distribution strategy is important for protecting the backend web servers.


There are many load balancing strategies; let's start with the simple ones.

1. HTTP redirection

When a user sends a request, the web server returns a new URL in the Location header of the HTTP response, and the browser then requests that new URL; this is essentially a page redirect. Load balancing is achieved through the redirection. For example, when we download a PHP source package and click the download link, the site returns a download address close to us, to account for different download speeds across countries and regions. The HTTP return code for the redirect is 302, as shown in the following figure:


Implemented in PHP, the logic might look like the following:
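A minimal sketch (the article's original code screenshot is not reproduced here; the server list and the random selection policy are illustrative):

    <?php
    // Candidate download/mirror servers (illustrative addresses).
    $servers = [
        'http://download1.example.com',
        'http://download2.example.com',
        'http://download3.example.com',
    ];

    // Pick one; a real policy might weight by region or current load.
    $target = $servers[array_rand($servers)];

    // 302 redirect: the browser then re-requests the chosen server's URL.
    header('Location: ' . $target . $_SERVER['REQUEST_URI'], true, 302);
    exit;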


This kind of redirection is very easy to implement and allows all sorts of custom policies. However, it performs poorly under heavy traffic, and the user experience suffers: every actual request costs an extra redirect, which adds network latency.

2. Reverse Proxy Load Balancing

The core job of a reverse proxy service is to forward HTTP requests, acting as the relay between the browser side and the backend web servers. Because it works at the HTTP layer (the application layer, the seventh layer of the seven-layer network model), it is also called "layer-7 load balancing". There are many reverse proxy packages; one of the most common is Nginx.


Nginx is a very flexible reverse proxy: you can customize forwarding policies and assign weights to distribute server traffic. A common problem with reverse proxying is the session data stored on the web servers. Because the usual load-balancing strategy assigns requests randomly, requests from the same logged-in user are not guaranteed to land on the same web machine, so the session may not be found.
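As a rough sketch, weighted forwarding in Nginx looks like this (the upstream name and backend addresses are illustrative):

    http {
        # Two backend web servers; 'weight' skews traffic toward the stronger machine.
        upstream web_backend {
            server 192.168.0.11:8080 weight=3;
            server 192.168.0.12:8080 weight=1;
        }

        server {
            listen 80;
            location / {
                proxy_pass http://web_backend;
            }
        }
    }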

There are two main solutions. One is to configure the reverse proxy's forwarding rules so that the same user's requests always land on the same machine (by analyzing cookies); complex forwarding rules consume more CPU and add load to the proxy server. The other, recommended, approach is to store session data in a separate, dedicated service such as Redis or Memcached.
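For instance, with the phpredis extension, PHP sessions can be pointed at a shared Redis instance so that any web machine behind the proxy sees the same session; a minimal sketch (address illustrative):

    <?php
    // Store sessions in shared Redis instead of each machine's local files.
    // Requires the phpredis extension.
    ini_set('session.save_handler', 'redis');
    ini_set('session.save_path', 'tcp://192.168.0.100:6379');

    session_start();
    $_SESSION['user_id'] = 12345; // now visible to every web server in the cluster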

The reverse proxy service can also enable caching; if turned on, it increases the proxy's burden, so use it with care. This load-balancing strategy is simple to implement and deploy and performs well. However, it has a "single point of failure" problem: if the proxy goes down, it brings a lot of trouble. And as the number of web servers keeps growing later on, the proxy itself can become the system's bottleneck.

3. IP Load Balancing

IP load balancing works at the network layer (modifying IPs) and the transport layer (modifying ports, layer four), which gives it much higher performance than working at the application layer (layer seven). The principle is that it modifies the IP address and port information of packets at the IP layer to achieve load balancing, which is why this approach is also called "layer-4 load balancing". A common implementation is LVS (Linux Virtual Server), realized through IPVS (IP Virtual Server).


When the load-balancing server receives an IP packet from the client, it modifies the packet's destination IP address or port and then delivers it, otherwise intact, to the internal network, where the packet flows to the actual web server. After the real server finishes processing, the response packet goes back through the load-balancing server, which rewrites the packet's source address to its own outward-facing address and finally returns it to the client.


The method above is called LVS-NAT; besides it, there are LVS-DR (direct routing) and LVS-TUN (IP tunneling). All three are modes of LVS with certain differences between them; for reasons of length, we will not elaborate.
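As a rough illustration, an LVS-NAT service is typically configured with the ipvsadm tool; a sketch with illustrative addresses, not a complete setup:

    # Create a virtual service on the VIP with round-robin scheduling (-s rr)
    ipvsadm -A -t 10.0.0.1:80 -s rr

    # Add two real servers behind it; -m selects NAT (masquerading) mode
    ipvsadm -a -t 10.0.0.1:80 -r 192.168.0.11:80 -m
    ipvsadm -a -t 10.0.0.1:80 -r 192.168.0.12:80 -m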

IP load balancing performs better than Nginx's reverse proxying: it only processes packets up to the transport layer, without any further parsing, before handing them straight to the actual servers. However, it is more complex to configure and set up.

4. DNS Load Balancing

DNS (Domain Name System) provides domain name resolution. A domain name is really an alias for a server; what it actually maps to is an IP address, and resolution is the DNS service completing the domain-to-IP mapping. A single domain name can be configured to map to multiple IPs, so DNS can also serve as a load-balancing service.
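You can observe such a multi-IP mapping from PHP itself; for example (domain illustrative):

    <?php
    // Resolve all IPv4 addresses configured for a domain name.
    // A DNS-load-balanced domain returns several IPs, and clients pick among them.
    $ips = gethostbynamel('www.example.com');
    print_r($ips); // e.g. Array ( [0] => 1.2.3.4 [1] => 5.6.7.8 ... )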


This load-balancing strategy is simple to configure and performs excellently. However, you cannot define rules freely, changing a mapped IP is troublesome when a machine fails, and DNS takes effect with a delay because resolvers cache the entries.

5. DNS/GSLB Load Balancing

Our familiar CDN (Content Delivery Network) implementations go one step beyond mapping one domain name to multiple IPs: through GSLB (Global Server Load Balance, global load balancing), the IP for a domain name is handed out according to specified rules. Usually the rule is geographic: return the IP nearest to the user, cutting the cost of hops between routing nodes during network transmission.


The "looking up" in the figure works as follows: the LDNS (Local DNS) first queries a root name server to get the name server for the top-level domain (for .com, for example), then obtains the authoritative DNS for the specific domain name, and from it finally gets the actual server IP.


In a web system, a CDN is normally used to load large static resources (HTML/JS/CSS/images and so on), so that this content, which depends heavily on network download speed, is served from as close to the user as possible, improving the experience.

For example, I accessed an image on imgcache.gtimg.cn (Tencent's own CDN; the qq.com domain is deliberately not used, to keep HTTP requests from carrying extra cookie information), and the IP I obtained was 183.60.217.90.


Like the DNS load balancing above, this approach not only performs excellently but also supports configuring multiple policies. However, building and maintaining it costs a great deal. First-tier internet companies build their own CDN services; small and medium-sized companies generally use a third-party CDN.


Establishing and Optimizing the Web System's Caching Mechanisms

We have just covered the web system's external network environment; now we turn to the performance of the web system itself. As traffic to our site grows, many challenges appear, and solving them is not as simple as adding machines: establishing and using the right caching mechanisms is fundamental.

In the beginning, our web system architecture may look like the following, with each link possibly served by only one machine.


Let us start with the most fundamental part: data storage.

I. Using MySQL's Internal Caches

Let's look at MySQL's caching mechanisms from inside MySQL itself; the discussion below centers on the most common storage engine, InnoDB.

1. Establishing an appropriate index

The simplest measure is building indexes. When table data is relatively large, indexes make retrieval fast, but they also have costs. First, they occupy a certain amount of disk space; composite indexes are the most prominent case and should be used with caution, since they can produce an index even larger than the source data. Second, inserts/updates/deletes on indexed data take longer, because the existing indexes must be updated as well. Of course, in practice our systems are dominated by SELECT queries, so indexes still improve system performance markedly.
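For example (table and column names are illustrative):

    -- Single-column index: fast lookups by user
    ALTER TABLE orders ADD INDEX idx_user (user_id);

    -- Composite index: speeds up queries filtering on user_id plus status,
    -- but takes noticeably more disk space and slows writes further
    ALTER TABLE orders ADD INDEX idx_user_status (user_id, status);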

2. Database connection thread pool cache

If every database request created and destroyed its own connection, the overhead on the database would be huge. To reduce this kind of overhead, MySQL lets you configure thread_cache_size, which indicates how many threads are kept for reuse: when threads run short, more are created; when too many sit idle, they are destroyed.
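For example, in my.cnf (the value is illustrative; tune it to your workload):

    [mysqld]
    # Number of idle threads kept around for reuse by new connections
    thread_cache_size = 64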


There is actually a more aggressive approach: using pconnect (persistent database connections), where a connection, once created, is kept alive for a long time. But with heavy traffic and many machines, this usage easily leads to "database connection exhaustion", because connections are never reclaimed and the database's max_connections limit is eventually reached. Persistent connections therefore usually require a "connection pool" service between the CGI processes and MySQL, to rein in the CGI machines' "blind" creation of connections.


There are many ways to implement a database connection pool service; for PHP, I recommend Swoole (a PHP extension for network communication).
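A minimal sketch using Swoole's built-in coroutine PDO pool (available in Swoole 4.4+; host and credentials are illustrative):

    <?php
    use Swoole\Database\PDOConfig;
    use Swoole\Database\PDOPool;

    Swoole\Coroutine\run(function () {
        // At most 64 pooled connections shared by all coroutines,
        // instead of one fresh connection per request.
        $pool = new PDOPool(
            (new PDOConfig())
                ->withHost('127.0.0.1')
                ->withPort(3306)
                ->withDbName('test')
                ->withUsername('root')
                ->withPassword('secret'),
            64
        );

        $pdo = $pool->get();                 // borrow a connection
        $row = $pdo->query('SELECT 1')->fetch();
        $pool->put($pdo);                    // return it for reuse
    });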

3. InnoDB Cache Settings (innodb_buffer_pool_size)

innodb_buffer_pool_size defines the memory buffer InnoDB uses to cache indexes and data. If the machine serves MySQL exclusively, a common recommendation is about 80% of the machine's physical memory. In table-data retrieval scenarios, it reduces disk IO; generally, the larger the value, the higher the cache hit rate.
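For example, on a dedicated machine with 16 GB of RAM (value illustrative):

    [mysqld]
    # Roughly 80% of physical memory on a MySQL-only machine
    innodb_buffer_pool_size = 12G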

4. Splitting databases/tables/partitions

A MySQL table generally carries data on the order of millions of rows; beyond that, performance drops markedly. So when we anticipate the data volume will exceed this level, splitting databases/tables/partitions is recommended. The best approach is to design the service around such split storage from the very beginning, fundamentally eliminating the risk later on. This does sacrifice some convenience, such as cross-table list queries, and increases maintenance complexity. But once the data volume reaches the tens of millions or beyond, we find it is all worth it.
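A minimal sketch of routing by user ID to split tables (table names and shard count are illustrative):

    <?php
    // Route a user's rows to one of 64 split tables: order_00 .. order_63.
    // The shard function must never change, or old data becomes unreachable.
    function orderTableFor(int $userId): string
    {
        return sprintf('order_%02d', $userId % 64);
    }

    $table = orderTableFor(1234567);
    $sql   = "SELECT * FROM {$table} WHERE user_id = ?";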

II. Setting Up Multiple MySQL Servers

A single MySQL machine is in fact a high-risk single point, because if it goes down, our web service becomes unusable. And as access to the web system keeps growing, one day we find that one MySQL server cannot keep up, and we start using more MySQL machines. Once multiple MySQL machines are introduced, many new problems arise.

1. MySQL master/slave, with the slave as a backup

This arrangement purely addresses the "single point of failure" problem: when the master fails, we switch to the slave. However, it effectively wastes resources, because the slave otherwise sits idle.


2. MySQL read-write separation: the master writes, the slave reads.

The two databases perform read-write separation: the master handles write-type operations, the slave handles reads. Moreover, if the master fails, reads are unaffected, and both reads and writes can be switched to the slave temporarily (watch the traffic, though: if it is too heavy, it may drag the slave down too).
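A minimal sketch of read-write splitting at the application layer (DSNs and credentials are illustrative):

    <?php
    // Send writes to the master and reads to the slave.
    function pdoFor(string $sql): PDO
    {
        static $master = null, $slave = null;
        $master = $master ?? new PDO('mysql:host=10.0.0.1;dbname=app', 'app', 'secret');
        $slave  = $slave  ?? new PDO('mysql:host=10.0.0.2;dbname=app', 'app', 'secret');

        // Anything that is not a SELECT must go to the master.
        return preg_match('/^\s*SELECT\b/i', $sql) ? $slave : $master;
    }

    $sql  = 'SELECT nickname FROM user WHERE id = 1';
    $rows = pdoFor($sql)->query($sql)->fetchAll(PDO::FETCH_ASSOC);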


3. Master-master mutual backup.

Two MySQL servers act as each other's slave and, at the same time, each other's master. This scheme not only spreads the traffic pressure but also solves the "single point of failure" problem: if either one fails, another full set of services is still available.


However, this scheme can only be used in a two-machine scenario. If the business is still growing fast, you can split by business line and set up multiple master-master pairs.

III. Data Synchronization Between MySQL Machines

Whenever we solve one problem, a new problem is born on the old solution. Once we have multiple MySQL machines, the data between the libraries is likely to lag at business peaks. Network and machine load also affect synchronization latency. We have seen a special scenario, with daily traffic approaching 100 million, in which the slave's data took many days to catch up with the master's. In a scenario like that, the slave loses essentially all of its usefulness.

So solving the synchronization problem is the next point we need to focus on.

1. MySQL's own multi-threaded synchronization

Starting with MySQL 5.6, synchronization between the master and the slave can use multiple threads. But the limitation is also obvious: parallelism is only per database (schema). MySQL replicates data through the binlog, and the operations the master writes to the binlog are sequential; in particular, SQL operations that change table structure affect the SQL statements that follow. Therefore, within a single database, the slave must apply the data in a single process.
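On the slave, this is enabled roughly as follows (MySQL 5.6; the worker count is illustrative):

    [mysqld]
    # Parallel replication appliers on the slave; in MySQL 5.6 the
    # parallelism is only per database (schema)
    slave_parallel_workers = 4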

2. Parse the binlog yourself and write with multiple threads.

Taking the database table as the unit, parse the binlog into multiple tables' streams and synchronize the data in parallel. Doing so speeds up data synchronization, but a write-ordering problem appears if there are structural relationships or data dependencies between tables. This approach suits tables that are relatively stable and relatively independent.


Most of China's first-tier internet companies use this approach to speed up data synchronization. There is an even more aggressive approach: parse the binlog directly and write in parallel while ignoring even the table as a unit. But that is complex to implement and even more limited in scope, usable only on databases in special scenarios (no table-structure changes, no data dependencies between tables).

IV. Establish a Cache Between the Web Server and the Database

In fact, solving the heavy-traffic problem cannot focus on the database level alone. Following the "80/20 rule", 80% of requests concentrate on 20% of the data, the hot data. So we should build a caching mechanism between the web server and the database. The cache can be a disk cache or an in-memory cache. Through it, most queries for hotspot data are intercepted before they ever reach the database.


1. Page staticization

When a user visits a page on the site, most of the page's content may not change for a long time. A news article, for example, almost never changes once it is published. In such cases, the static HTML page generated by CGI is cached on the web server's local disk. Except for the first request, which runs the dynamic CGI database query, every later request returns the local disk file directly to the user.
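A minimal sketch (paths are illustrative, and renderNewsPageFromDatabase is a hypothetical function standing in for the dynamic CGI logic):

    <?php
    // Serve a cached HTML copy from local disk if it exists; otherwise
    // generate it once (hitting the database) and cache it for next time.
    $newsId    = (int) $_GET['id'];
    $cacheFile = '/data/static_cache/news_' . $newsId . '.html';

    if (is_file($cacheFile)) {
        readfile($cacheFile);   // cache hit: no CGI work, no DB query
        exit;
    }

    $html = renderNewsPageFromDatabase($newsId); // hypothetical renderer
    file_put_contents($cacheFile, $html);
    echo $html;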


This approach seems perfect while the web system is small. But once the system grows larger, say to 100 web servers, there will be 100 copies of these disk files, which wastes resources and is hard to maintain. At this point someone will think: could we store them centrally on one server? The caching approach below does exactly that.

2. Single Memory cache

The page-staticization example shows that a "cache" built on the web machine itself is hard to maintain and brings more problems (for instance, PHP's APC extension can operate on the web server's local memory through key/value operations). Therefore, the memory caching service we build must also be an independent service.

The main choices for a memory cache are Redis and Memcached. Performance-wise, the two differ little; in functional richness, Redis has the edge.
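A minimal read-through pattern with phpredis (key names, the 300-second TTL, and addresses are illustrative):

    <?php
    // Try Redis first; on a miss, fall back to MySQL and then populate
    // the cache with a TTL so hot data stays in memory.
    function getUser(Redis $redis, PDO $db, int $id): array
    {
        $key    = "user:{$id}";
        $cached = $redis->get($key);
        if ($cached !== false) {
            return json_decode($cached, true);       // cache hit
        }

        $stmt = $db->prepare('SELECT * FROM user WHERE id = ?');
        $stmt->execute([$id]);
        $user = $stmt->fetch(PDO::FETCH_ASSOC) ?: [];

        $redis->set($key, json_encode($user), 300);  // keep for 5 minutes
        return $user;
    }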


3. Memory Cache Cluster

Once we build a single memory cache node, we face a single point of failure, so we have to turn it into a cluster. A simple approach is to add a slave for it as a backup machine. But what if requests are truly numerous and we find the cache hit rate too low, needing more machine memory? Then we recommend configuring it as a proper cluster, such as Redis Cluster.

In Redis Cluster, the Redis nodes form master/slave groups with one another, and every node can accept requests, which makes expanding the cluster convenient. A client may send a request to any node: if the content is within that node's "responsibility", the node returns it directly; otherwise, it looks up the Redis node actually responsible, tells the client its address, and the client requests that node.


For clients using the caching service, all of this is transparent.
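With phpredis, for example, the cluster client only needs a few seed nodes and follows the cluster's redirections by itself (addresses illustrative):

    <?php
    // RedisCluster discovers the rest of the cluster from the seeds and
    // transparently follows redirections to the responsible node.
    $cluster = new RedisCluster(null, [
        '192.168.0.101:7000',
        '192.168.0.102:7000',
        '192.168.0.103:7000',
    ]);

    $cluster->set('hot:item:42', 'cached-value');
    echo $cluster->get('hot:item:42');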


Switching memory cache services carries some risk. When switching from cluster A to cluster B, cluster B must be "warmed up" in advance (the hot data in cluster B's memory should match cluster A's as closely as possible; otherwise, at the moment of the switch, a flood of requests will miss cluster B's cache, the traffic will hit the backend database directly, and it will very likely bring the database down).

4. Reduce database "Write"

All the mechanisms above reduce database "read" operations, but write operations are also a major pressure. Writes cannot be reduced, but requests can be merged to relieve the pressure. For that, we need a change-synchronization mechanism between the memory cache cluster and the database cluster.

First apply the change to the cache, so that outside queries still look normal, then put those SQL modifications into a queue; when the queue is full, or at fixed intervals, merge them into a single request that updates the database.
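A minimal sketch using a Redis list as the queue (key names and the merge rule are illustrative; a real scheme also needs failure handling and replay):

    <?php
    $redis = new Redis();
    $redis->connect('192.168.0.100', 6379);

    // Producer: apply the change to the cache, then enqueue the write.
    $redis->hIncrBy('stats:post:42', 'views', 1);   // queries now see the new value
    $redis->rPush('write_queue', json_encode(['post' => 42, 'views' => 1]));

    // Consumer (a separate worker, run when the queue fills or on a timer):
    $pending = [];
    while (($job = $redis->lPop('write_queue')) !== false) {
        $j = json_decode($job, true);
        $pending[$j['post']] = ($pending[$j['post']] ?? 0) + $j['views'];
    }

    $db = new PDO('mysql:host=10.0.0.1;dbname=app', 'app', 'secret');
    foreach ($pending as $postId => $views) {
        // Many queued increments collapse into one database write per post.
        $stmt = $db->prepare('UPDATE post SET views = views + ? WHERE id = ?');
        $stmt->execute([$views, $postId]);
    }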


Besides improving write performance by changing the system architecture, MySQL itself can adjust its disk-write policy via the parameter innodb_flush_log_at_trx_commit. If the machine budget allows, you can attack the problem at the hardware level by choosing the older RAID (Redundant Array of Independent Disks) or the newer SSDs (solid-state drives).
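For example (trading some durability for write throughput; choose deliberately):

    [mysqld]
    # 1 = write and flush the log to disk at every commit (safest, slowest, default)
    # 2 = write at every commit, flush to disk about once per second
    # 0 = write and flush about once per second (fastest, least safe)
    innodb_flush_log_at_trx_commit = 2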

5. NoSQL Storage

Whether for database reads or writes, when traffic increases further, we eventually hit the point where scaling runs out of road: the cost of continuing to add machines is high, and it does not always truly solve the problem. At this point, for some of the core data, consider a NoSQL database. Most NoSQL storage is key-value based; here we recommend the Redis introduced above. Redis itself is a memory cache, but it can also be used as storage, letting it persist data directly to disk.

In this way, we separate out some of the data that the database reads and writes most frequently and place it in our newly built Redis storage cluster, further reducing the pressure on the original MySQL database; and because Redis itself is a memory-level cache, read and write performance improves greatly.


Domestic first-tier internet companies adopt many architectures similar to the scheme above, but the cache service they use is not necessarily Redis: they have a richer set of alternatives, and some even develop their own NoSQL services around their business characteristics.

6. The empty-node query problem

When we have finished building all the services above, we may think the web system is already strong. But, as we keep saying, new problems will come. Empty-node queries are requests for data that does not exist in the database at all. For example, if I request the profile of a person who does not exist, the system searches every level of cache, finally reaches the database itself, arrives at the conclusion "not found", and returns that to the front end. Because every cache level misses for it, such a request is very expensive in system resources, and a flood of empty-node queries can batter the system's services.


In my past work experience, I was burned badly by this. So, to keep the web system stable, a suitable filtering mechanism for empty nodes is necessary.

The approach we used back then was to design a simple record-mapping table: store the records that do exist and place them in the memory cache, so that empty-node queries are blocked at the cache layer.
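A minimal sketch using a Redis set of existing IDs (key name illustrative; a Bloom filter is a common space-saving variant of the same idea):

    <?php
    // Keep the set of IDs that actually exist at the cache layer. A query
    // for an ID not in the set is rejected before it can touch any deeper
    // cache or the database.
    $redis = new Redis();
    $redis->connect('192.168.0.100', 6379);

    $userId = (int) $_GET['id'];
    if (!$redis->sIsMember('existing_user_ids', $userId)) {
        http_response_code(404);   // empty-node query blocked here
        exit;
    }
    // ...otherwise continue with the normal cache/database lookup...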


Offsite deployment (geographically distributed)

Now that we have completed the architecture above, is our system strong enough? Of course not: there is no limit to optimization. Although the web system looks more powerful on the surface, the experience it gives users is not necessarily the best, because a student in the Northeast visiting a website served from Shenzhen will still feel the network distance as slowness. At this point, we need to deploy in multiple locations, bringing the web system closer to its users.

I. Concentrate the core, disperse the nodes

Anyone who has played a large online game knows that such games have many regions, generally divided by geography, such as a Guangdong region and a Beijing region. If a player in Guangdong goes to play in the Beijing region, he will feel noticeably laggier than in the Guangdong region. In fact, the region names already tell you where the servers are, so of course the network is slower when a Guangdong player connects to a server in Beijing.

When a system and its services are large enough, you must start considering multi-location deployment, bringing the service as close as possible to users. We already mentioned that the web's static resources can be stored on a CDN and then dispersed "all over the country" via DNS/GSLB. But a CDN only solves the static-resource problem; it does not solve the problem of the huge backend system services remaining concentrated in one fixed city.

This is where offsite deployment begins. It generally follows the principle: concentrate the core, disperse the nodes. Concentrate the core: in an actual deployment, there is always some data and some services that cannot be deployed in multiple copies, or whose duplication would cost a fortune; for these, keep a single deployment, choose a geographically fairly central location for it, and communicate with each node over dedicated internal lines. Disperse the nodes: deploy some services in multiple copies, distributed across city nodes, so that user requests can be served by the nearest possible node.

For example, we choose Shanghai as the core node and Beijing, Shenzhen, Wuhan, and Shanghai as dispersed nodes (Shanghai itself is also a dispersed node). Our service architecture is as follows:


It should be added that, in the figure above, the Shanghai node and the core node are in the same data center, while each of the other dispersed nodes has its own data center.

Many large online games follow roughly this architecture. They place a small amount of core data, such as user accounts, on the core node, while most game data, such as equipment and quests, and the corresponding services live on the regional nodes. Of course, there is also a caching layer between the core node and the regional nodes.

II. Node disaster tolerance and overload protection

Node disaster tolerance means that if a node fails, we need a mechanism that keeps the service available. The most common approach, no doubt, is switching to a nearby city node: if the system's Tianjin node fails, we switch network traffic to the nearby Beijing node. Considering load balancing, it may be necessary to switch traffic to several nearby nodes at once. On the other hand, the core node itself also needs its own disaster recovery and backup, because if the core node fails, the whole country's service is affected.

Overload protection means that when a node reaches its maximum capacity and cannot accept more requests, the system must have a protection mechanism. If a service already at full load keeps accepting new requests, the likely result is a crash, affecting the whole node's service. To keep at least the majority of users working normally, overload protection is necessary.

There are generally two directions for overload protection. Denial of service: after detecting full load, accept no new connection requests; for example, queuing at online game login. Diversion to other nodes: in this case, the system is more complex and again involves load-balancing problems.

Summary

A web system grows with its traffic, gradually evolving from one server that meets the demand into a "giant" cluster. And the process of the web system getting bigger is really our process of solving problems: at different stages we solve different problems, and new problems are born on top of the old solutions.

There is no limit to optimizing a system. Software and system architecture keep developing rapidly, and new solutions solve old problems while bringing new challenges.

About the author: Xu Hanbin has more than four years of technical R&D experience at Alibaba and Tencent, and is currently at Xiaoman Technology (a startup).

