Author: skate
Time: 2010-07-13
Internet Architecture Evolution Model
There have already been some articles about the evolution of large website architectures, such as those on LiveJournal and eBay, and they are well worth reading. However, they tend to describe the result of each evolution without explaining in detail why that evolution was needed. I have also noticed recently that many people find it hard to understand why a website needs such complex technology. That is why I decided to write this article. It describes a typical architecture evolution, and the body of knowledge to be mastered, as an ordinary website grows into a large one. I hope it gives some preliminary ideas to those who want to work in the Internet industry. :)
Please point out any mistakes in this article, so that it can serve as a starting point for further discussion.
Step 1 of Architecture Evolution: physically separate the webserver and the database
In the beginning, acting on some idea, you build a website on the Internet. At this point the host may even be rented, but since this article focuses only on the evolution of the architecture, assume that you already have one hosted machine with a certain amount of bandwidth. Because the website has some distinctive features, it attracts visitors. Gradually you find that the load on the system keeps rising and responses keep getting slower. The most obvious problem at this point is that the database and the application affect each other: when the application has problems the database tends to have problems too, and vice versa. So the first stage of evolution begins: physically separate the application from the database and run them on two machines. This requires no new technology, yet it works. The system returns to its previous response speed and supports higher traffic, and the database and the application no longer interfere with each other.
Take a look at the system diagram after this step is completed:
This step involves these knowledge systems:
This step requires essentially no new technical knowledge.
Step 2 of Architecture Evolution: add page Cache
The good times do not last long. As more and more people visit the site, you find that responses start to slow down again. Looking for the cause, you find that too many operations access the database, which leads to fierce competition for database connections and therefore slow responses. Yet you cannot open too many database connections, or the load on the database machine becomes too high. So you consider a caching mechanism to reduce both the competition for database connections and the read load on the database. At this point you may first choose something like squid to cache the relatively static pages in the system (for example, pages that are updated only every day or two); of course, you could also generate static pages instead. This reduces the load on the webserver and the competition for database connections without modifying the program. So you start using squid to cache relatively static pages.
Take a look at the system diagram after this step is completed:
This step involves these knowledge systems:
Front-end page caching technology, such as squid. To use it well, you need a thorough understanding of how squid is implemented and of its cache invalidation algorithms.
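To make the idea of a page cache with expiry concrete, here is a minimal sketch in Python. It is not how squid works internally; the class name `PageCache`, the `render` callback, and the fixed time-to-live are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Whole-page cache with a fixed time-to-live (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # url -> (expiry time, cached html)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str, render: Callable[[str], str]) -> str:
        now = self.clock()
        entry = self._entries.get(url)
        if entry is not None and entry[0] > now:
            # Cache hit: neither the webserver nor the database is touched.
            return entry[1]
        # Cache miss or expired entry: ask the backend and remember the result.
        html = render(url)
        self._entries[url] = (now + self.ttl_seconds, html)
        return html
```

While an entry is fresh, repeated requests for the same URL never reach the application, which is where the relief for the webserver and the database connections comes from. The price is that visitors may see a page that is up to one TTL out of date.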
Step 3 of Architecture Evolution: add page fragment Cache
After squid is added for caching, the overall system is indeed faster and the load on the webserver drops. However, as traffic grows, the system starts to slow down again. Having tasted the benefits of caching with tools like squid, we start to think about whether the relatively static parts of dynamic pages could be cached as well. So we consider a page fragment caching strategy such as ESI, and begin to use ESI to cache the relatively static parts of dynamic pages.
Take a look at the system diagram after this step is completed:
This step involves these knowledge systems:
Page fragment caching technology, such as ESI. To use it well, you also need to understand how ESI is implemented.
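The following Python sketch shows the basic idea behind fragment caching: the dynamic page is generated on every request, but the fragments it includes come from a cache. The `<esi:include src="..."/>` tag is standard ESI syntax; the `assemble` function, the `fetch` callback, and the plain dictionary used as a cache are assumptions for illustration, and a real ESI processor also handles expiry, error cases, and other ESI tags.

```python
import re
from typing import Callable, Dict

# Matches self-closing include tags such as <esi:include src="/frag/nav"/>.
ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')


def assemble(template: str, fetch: Callable[[str], str],
             cache: Dict[str, str]) -> str:
    """Replace every include tag with its fragment, fetching each fragment once."""

    def replace(match):
        src = match.group(1)
        if src not in cache:
            # First request for this fragment: fetch it and keep it.
            cache[src] = fetch(src)
        return cache[src]

    return ESI_INCLUDE.sub(replace, template)
```

Two different users' pages can then share one cached navigation fragment while the personalized part of each page is still produced per request.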
Step 4 of Architecture Evolution: data caching
After using ESI and similar techniques to improve the cache effect once more, the load on the system is indeed reduced further. However, as traffic grows, the system slows down yet again. After some investigation, we find that some parts of the system fetch the same data repeatedly, such as user information. So we start to consider whether this data can be cached as well, and cache it in local memory. After the change, the result is exactly as expected: the response speed of the system is restored and the load on the database drops considerably.
Take a look at the system diagram after this step is completed:
This step involves these knowledge systems:
Caching technology, including map data structures, cache algorithms, and the implementation mechanism of the chosen framework itself.
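As one example of a map-based cache with an eviction algorithm, here is a small least-recently-used (LRU) cache in Python. LRU is only one of several possible cache algorithms, and the `LRUCache` class and the `load` callback (standing in for a database query) are assumptions for illustration.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LRUCache:
    """Local in-memory cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable, load: Callable[[Hashable], Any]) -> Any:
        if key in self._data:
            # Hit: mark the entry as most recently used.
            self._data.move_to_end(key)
            return self._data[key]
        # Miss: load from the slow source (for example the database).
        value = load(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            # Drop the entry that has gone unused for the longest time.
            self._data.popitem(last=False)
        return value
```

A bounded capacity matters here because the cache lives in the memory of the webserver process; without eviction it would grow with the number of distinct users.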
Step 5 of Architecture Evolution: Add Webserver
The good times do not last long. As traffic grows, the load on the webserver machine rises to a fairly high level during peak periods, so we start to consider adding a webserver. This also addresses availability: with a single webserver, the site is unusable whenever that machine goes down. After these considerations, we decide to add one more webserver. Doing so raises some problems, typically:
1. How do we distribute requests across the two machines? The usual options are the load balancing that comes with Apache, or a software load balancer such as LVS;
2. How do we keep state information, such as user sessions, synchronized? The usual options are writing it to the database, writing it to shared storage, keeping it in cookies, or synchronizing session information between the servers;
3. How do we keep cached data, such as the previously cached user data, synchronized? The usual options are cache synchronization or a distributed cache;
4. How do we make sure that functions such as file upload keep working? The usual options are a shared file system or shared storage;
After solving these problems, we finally have two webservers, and the system is back to its previous speed.
Take a look at the system diagram after this step is completed:
This step involves these knowledge systems:
Load balancing technology (including but not limited to hardware load balancing, software load balancing, load-balancing algorithms, Linux forwarding protocols, and the implementation details of the chosen technology); master/standby technology (including but not limited to ARP spoofing and Linux Heartbeat); state and cache synchronization technology (including but not limited to cookies, the UDP protocol, broadcasting of state information, and the implementation details of the chosen cache synchronization technology); shared file technology (including but not limited to NFS); and storage technology (including but not limited to storage devices).
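Round-robin is one of the simplest load-balancing algorithms, and a sketch of it in Python shows how adding a second webserver helps both load and availability. This is not how Apache or LVS implement it; the `RoundRobinBalancer` class and its method names are assumptions for illustration, and real balancers detect failed servers through health checks rather than manual marking.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Hands out servers in turn and skips the ones marked as down."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.servers: List[str] = list(servers)
        self.down: Set[str] = set()
        self._next = 0

    def mark_down(self, server: str) -> None:
        self.down.add(server)

    def mark_up(self, server: str) -> None:
        self.down.discard(server)

    def pick(self) -> str:
        # Try each server at most once, starting where the last pick stopped.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server not in self.down:
                return server
        raise RuntimeError("no webserver available")
```

Because consecutive requests from the same user can land on different machines, this is exactly the point where session state and locally cached data have to be shared or synchronized.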
Step 6 of Architecture Evolution: Database sharding
After enjoying a period of rapid traffic growth, we find that the system is slowing down again. What is it this time? Investigation shows fierce competition for database connections among some write and update operations, which slows the system down. What now? The available options are database clustering and database sharding. Some databases do not support clustering very well, so database sharding becomes the more common strategy. Sharding means the original program has to be modified. After a round of changes to implement it, the goal is reached, and the system is even faster than before.
Take a look at the system diagram after this step is completed:
This step involves these knowledge systems:
This step mainly requires a sensible division of the business so that the database can be split along business lines; there are no other specific technical requirements;
However, as the data volume grows and the database is sharded, database design, tuning, and maintenance all have to improve, so this step places high demands on those skills.
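Splitting the database along business lines can be pictured as a routing table from tables to databases. The Python sketch below assumes three hypothetical business domains and made-up table and database names; a real split depends entirely on the website's own business.

```python
from typing import Dict

# Hypothetical mapping from business domain to the database that owns it.
DOMAIN_TO_DATABASE: Dict[str, str] = {
    "user": "user_db",
    "trade": "trade_db",
    "product": "product_db",
}

# Hypothetical mapping from table name to business domain.
TABLE_TO_DOMAIN: Dict[str, str] = {
    "users": "user",
    "user_profiles": "user",
    "orders": "trade",
    "payments": "trade",
    "products": "product",
}


def database_for(table: str) -> str:
    """Return the database that holds the given table."""
    try:
        return DOMAIN_TO_DATABASE[TABLE_TO_DOMAIN[table]]
    except KeyError:
        raise LookupError("no database assigned for table %r" % table) from None


def can_join(table_a: str, table_b: str) -> bool:
    """Tables can only be joined in SQL when they live in the same database."""
    return database_for(table_a) == database_for(table_b)
```

The `can_join` helper hints at why the program has to change: queries that used to join tables across business domains must now be done in the application.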
Step 7 of Architecture Evolution: table sharding, DAL, and distributed cache
As the system keeps running, the amount of data grows significantly, and we find that queries are still somewhat slow after database sharding. So, following the same idea as database sharding, we start sharding tables. This again requires some changes to the program. At this point we may find that having the application itself deal with the database and table sharding rules is rather complicated. Could we add a general framework that handles data access across sharded databases and tables? This corresponds to the DAL in the eBay architecture. This evolution takes a relatively long time, and it is also possible that work on the general framework only starts after table sharding is finished. At the same stage, we may find that the earlier cache synchronization scheme has problems: the data volume is now too large to keep the cache locally and synchronize it, so a distributed cache is needed. After another round of investigation and suffering, a large amount of cached data is finally moved to a distributed cache.
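The core job of such a data access layer is to hide the sharding rule from the application. Here is a minimal Python sketch under simple assumptions: rows are sharded by a numeric user id with a plain modulo rule, and the names `ShardingRule`, `user_db_N`, and `user_NN` are hypothetical. It says nothing about how eBay's DAL actually works.

```python
from typing import Tuple


class ShardingRule:
    """Maps a user id to a (database, table) pair with a modulo rule."""

    def __init__(self, db_count: int, tables_per_db: int) -> None:
        self.db_count = db_count
        self.tables_per_db = tables_per_db

    def locate(self, user_id: int) -> Tuple[str, str]:
        slot = user_id % (self.db_count * self.tables_per_db)
        db_index = slot // self.tables_per_db
        return "user_db_%d" % db_index, "user_%02d" % slot


class UserDal:
    """The application calls this layer and never sees the sharding rule."""

    def __init__(self, rule: ShardingRule) -> None:
        self.rule = rule

    def select_by_id(self, user_id: int) -> Tuple[str, str, tuple]:
        database, table = self.rule.locate(user_id)
        # The table name comes from the rule, never from user input.
        sql = "SELECT * FROM %s WHERE id = %%s" % table
        return database, sql, (user_id,)
```

A real DAL would also own connection management and timeouts; the sketch only returns which database to use and which statement to run there.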
Take a look at the system diagram after this step is completed:
This step involves these knowledge systems:
Table sharding is mostly a matter of business partitioning as well. Technically, it involves dynamic hash algorithms and consistent hash algorithms;
The DAL involves many complex techniques, such as database connection management (timeouts, exceptions), control of database operations (timeouts, exceptions), and encapsulation of the database and table sharding rules;
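Consistent hashing, mentioned above, is worth a small example because it also underlies many distributed caches. The Python sketch below is one common formulation with virtual nodes; the class name, the node names, and the choice of SHA-256 as the hash function are assumptions for illustration.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List


def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Hash ring where each node owns many points (virtual nodes)."""

    def __init__(self, nodes: Iterable[str] = (), replicas: int = 100) -> None:
        self.replicas = replicas
        self._ring: List[int] = []        # sorted hash points
        self._owner: Dict[int, str] = {}  # hash point -> node
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.replicas):
            point = _hash("%s#%d" % (node, i))
            if point in self._owner:
                continue  # hash collision, extremely unlikely
            self._owner[point] = node
            bisect.insort(self._ring, point)

    def remove(self, node: str) -> None:
        self._ring = [p for p in self._ring if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise LookupError("the ring has no nodes")
        # First point clockwise from the key's hash, wrapping around at the end.
        index = bisect.bisect(self._ring, _hash(key)) % len(self._ring)
        return self._owner[self._ring[index]]
```

The property that matters is that adding a node only moves keys onto that new node, whereas a plain modulo rule would reshuffle almost every key when the number of nodes changes.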
Step 8 of Architecture Evolution: add more webservers
After database and table sharding, the load on the database drops to a fairly low level, and we go back to the happy life of watching traffic surge every day. Then one day we find that the system is slowing down again. We check the database first, and its load is normal. Then we check the webservers and find that Apache is blocking a lot of requests, even though the application server handles each request fairly quickly. It seems that there are simply too many requests, so they have to queue and responses slow down. This is easy to solve. Generally there is some money by this point, so we add more webservers. In doing so, we may face several challenges:
1. Apache's software load balancing or LVS can no longer handle the scheduling of such heavy web traffic (number of connections, network throughput, and so on). If funds permit, the solution is to buy a hardware load balancer such as F5, NetScaler, or Athelon. If funds do not permit, the solution is to classify the applications logically and distribute them across different software load-balancing clusters;
2. Some of the existing state synchronization and file sharing solutions may become bottlenecks and need to be improved. At this point we may write a distributed file system tailored to the website's business needs;
After completing this work, we enter an era of seemingly perfect, unlimited scaling: whenever website traffic increases, the solution is simply to add more webservers.
Take a look at the system diagram after this step is completed:
This step involves these knowledge systems:
At this point, as the number of machines keeps growing, the data volume keeps growing, and the availability requirements keep rising, we need a deeper understanding of the technologies we use, and we need to build more customized products based on the needs of the website.
Step 9 of Architecture Evolution: data read/write splitting and low-cost storage solutions
Then one day we find that this perfect era is coming to an end and the database nightmare is back. Because so many webservers have been added, database connections are once again insufficient, even though the databases and tables have already been sharded. When we analyze the database load, we may find that the read-to-write ratio is very high. At this point we usually think of read/write splitting, although this solution is not easy to implement. In addition, some data may be wasteful to keep in the database, or may take up too many database resources. So the architecture that may emerge at this stage implements read/write splitting and adds cheaper, self-written storage solutions along the lines of BigTable.
Take a look at the system diagram after this step is completed:
This step involves these knowledge systems:
Read/write splitting requires an in-depth understanding of database replication, standby, and similar strategies, and also requires some self-implemented technology;
The low-cost storage solution requires a deep understanding of how the operating system stores files, and of how file operations are implemented in the programming language being used.
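The routing decision at the heart of read/write splitting can be sketched in a few lines of Python. The `ReadWriteRouter` class and the server names are assumptions for illustration, and the rule used here (statements starting with SELECT go to a replica) is deliberately naive.

```python
import itertools
from typing import Iterator, Optional, Sequence


class ReadWriteRouter:
    """Sends reads to replicas in turn and everything else to the primary."""

    def __init__(self, primary: str, replicas: Sequence[str]) -> None:
        self.primary = primary
        self._replicas: Optional[Iterator[str]] = (
            itertools.cycle(replicas) if replicas else None
        )

    def route(self, sql: str) -> str:
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb == "SELECT" and self._replicas is not None:
            # Naive rule: plain reads may be served by any replica.
            return next(self._replicas)
        # Writes, and anything we are unsure about, go to the primary.
        return self.primary
```

What makes the real thing hard is everything this sketch leaves out: replicas lag behind the primary, so a read issued right after a write may return stale data, and reads inside a transaction or locking reads such as `SELECT ... FOR UPDATE` still belong on the primary.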
Step 10 of Architecture Evolution: entering the era of large-scale distributed applications and the dream age of cheap server clusters
After the long and painful process above, we finally enter a perfect age again: adding webservers keeps supporting ever higher traffic. A large website is, without doubt, popular, and with popularity comes a surge of feature requests. At this point we suddenly find that the web application deployed on the webservers has become very large. When multiple teams modify it at once, it is really inconvenient, and reusability is quite poor: basically every team ends up doing more or less duplicated work. Deployment and maintenance are also troublesome, because copying and starting a huge application package on N machines takes a long time, and problems are hard to track down. Worse, a bug in one part of the application can make the entire site unavailable. There are other issues too, such as the difficulty of tuning (because the application on each machine has to do everything, no targeted optimization is possible). Based on this analysis, we make up our minds to split the system by responsibility, and a large distributed application is born. This step usually takes quite a long time, because it faces many challenges:
1. After the split into a distributed architecture, a high-performance and stable communication framework is needed, and it must support several different communication and remote call styles;
2. Splitting a huge application takes a long time and requires sorting out the business logic and keeping system dependencies under control;
3. How to operate and maintain this huge distributed application (dependency management, runtime status management, error tracking, tuning, monitoring, and alerting).
After this step, the architecture of such a system enters a relatively stable stage, and large numbers of cheap machines can be used to support huge volumes of traffic and data. With this architecture and the experience gained from so many rounds of evolution, we can adopt various other methods to support ever-growing traffic.
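As a small illustration of the dependency control mentioned in the list above, the Python sketch below computes a safe start-up order for a set of services and rejects circular dependencies. It uses the standard-library `graphlib` module (Python 3.9 or later); the `startup_order` function and the service names in the usage note are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def startup_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Order services so that each one starts after the services it calls.

    `dependencies` maps a service name to the set of services it depends on.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the detected cycle.
        cycle = " -> ".join(exc.args[1])
        raise ValueError("circular dependency: %s" % cycle) from None
```

For example, if a hypothetical `web-frontend` depends on `order-service` and `user-service`, and `order-service` depends on `user-service`, the function returns `user-service` first and `web-frontend` last. Two services that depend on each other raise an error, which is the kind of situation that has to be untangled during the split.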
Take a look at the system diagram after this step is completed:
This step involves these knowledge systems:
This step involves a great deal of knowledge. It requires an in-depth understanding of communication, remote calls, messaging mechanisms, and so on, and each of them has to be clearly understood at the level of theory, hardware, operating system, and programming language.
Operations and maintenance also involves a lot of knowledge. In most cases you need to master distributed parallel computing, reporting, monitoring technology, rules and policies, and so on.
All of this is easy enough to describe. The classic evolution of a website architecture is broadly similar to the above, although the solution adopted at each step may differ from what is described here. In addition, because websites serve different businesses, their technical emphasis differs as well. This post mostly explains the evolution from an architectural perspective. Of course, many technologies are not mentioned here, such as database clusters, data mining, and search. In a real evolution, traffic is also supported by upgrading hardware, improving the network environment, tuning the operating system, using CDN mirrors, and so on, so a real development process will differ in many ways. A large website also has to do far more than the above, including security, operations and maintenance, business operations, services, and storage. Building a large website is really not easy. I wrote this article mainly as a starting point, hoping to draw out more articles about the evolution of large website architectures.
In addition, as the architecture evolution above shows, caching plays a very important role in improving system performance: bring data closer to the CPU.
---- End ---