There is a lot of material on the internet about website architecture. Much of it analyzes the problem mainly from the operations and infrastructure angle (adding machines, building clusters) and is so preoccupied with implementation details that ordinary developers struggle to take away the basics.
This article mainly introduces how the infrastructure of a large website is scaled out; the next part will focus on the application perspective and describe how the site architecture expands and evolves.
In the grassroots stage, the site is developed quickly and put online. It is usually just a first test of the water: a real user base has not yet formed, and both the budget and the investment are very limited.
Once there is a certain amount of business and a real user base, you want the site to be faster, and so the cache appears.
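As a concrete illustration, here is a minimal cache-aside sketch in Java. The in-process map and the loadUserFromDb placeholder are assumptions for the example; in practice the cache is usually an external store such as Memcached or Redis, which the author does not specify.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal cache-aside sketch: check the cache first, fall back to the DB,
// then populate the cache so the next request is served from memory.
public class UserCache {
    private final Map<Long, String> cache = new ConcurrentHashMap<>();

    public String getUser(long userId) {
        String user = cache.get(userId);          // 1. try the cache
        if (user == null) {
            user = loadUserFromDb(userId);        // 2. cache miss: hit the DB
            cache.put(userId, user);              // 3. populate the cache
        }
        return user;
    }

    // Placeholder for the real database query.
    private String loadUserFromDb(long userId) {
        return "user-" + userId;
    }
}
```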
The market response is good and the number of users grows every day, the database is being read and written furiously, and it gradually becomes clear that a single server cannot hold up. So the decision is made to separate the DB and the application onto different machines.
Soon the single database also starts to struggle, so the usual next step is read/write separation, which pays off because most internet workloads read far more than they write. The number of slaves depends on the read/write ratio of the business.
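A minimal sketch of how an application might route traffic after read/write separation: writes go to the master, reads to a randomly chosen slave. The JDBC URLs and credentials are purely illustrative, and a real setup would normally use a connection pool or middleware rather than hand-rolled routing.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Route writes to the master and reads to a randomly chosen slave.
public class ReadWriteRouter {
    private static final String MASTER = "jdbc:mysql://db-master:3306/app";
    private static final List<String> SLAVES = List.of(
            "jdbc:mysql://db-slave1:3306/app",
            "jdbc:mysql://db-slave2:3306/app");

    public Connection connectionFor(boolean isWrite) throws SQLException {
        String url = isWrite
                ? MASTER
                : SLAVES.get(ThreadLocalRandom.current().nextInt(SLAVES.size()));
        return DriverManager.getConnection(url, "app", "secret");
    }
}
```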
With the database tier relieved, the application tier becomes the bottleneck: traffic keeps increasing, the code written by early programmers of limited skill is often poor, and staff turnover makes it hard to maintain and optimize. So the most common remedy is to "pile on machines."
Anyone can add machines; the key is what effect adding them has and what problems appear afterwards. Very common examples: page output caching and local cache consistency issues, and session storage issues...
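One common fix for the session problem is to move sessions out of each web server's memory into a shared store, so any node behind the load balancer can serve any user. A minimal sketch, assuming Redis accessed through the Jedis client (the author does not name a specific store; host and timeout are illustrative):

```java
import redis.clients.jedis.Jedis;

// Keep sessions in a shared Redis instance instead of in-process memory,
// so requests can land on any application server.
public class SharedSessionStore {
    private static final int TTL_SECONDS = 1800; // 30-minute session timeout

    private final Jedis jedis = new Jedis("session-redis", 6379);

    public void put(String sessionId, String userData) {
        jedis.setex("session:" + sessionId, TTL_SECONDS, userData);
    }

    public String get(String sessionId) {
        return jedis.get("session:" + sessionId);
    }
}
```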
At this point the DB tier and the application tier have both basically been scaled out, and you can start to pay attention to other aspects, such as the accuracy of site search: reduce the reliance on the DB and introduce full-text indexing.
In the Java world the common choices are Lucene, Solr, and the like, while in the PHP world sphinx/coreseek are used more.
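A minimal Lucene sketch showing the two halves of full-text search: building an index (normally done by a background job) and querying it at request time. The field name and index path are illustrative only.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/site-index"));

        // Index a document.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("title", "Evolution of large-scale website architecture", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the index.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("title", new StandardAnalyzer()).parse("architecture");
            TopDocs hits = searcher.search(query, 10);
            System.out.println("matches: " + hits.totalHits);
        }
    }
}
```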
So far we have basically covered a medium-sized website architecture capable of handling millions of visits per day. Of course, every scaling step involves many implementation details, and later articles will analyze those details separately.
After scaling out to meet basic performance requirements, attention gradually turns to "availability" (the SLA people like to brag about, the "how many nines"). How to guarantee genuinely high availability is a problem in its own right.
Almost all mainstream large and medium-sized internet companies use a similar architecture; only the number of nodes differs.
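High availability ultimately rests on detecting failed nodes and taking them out of rotation. Here is a minimal health-check sketch, assuming each application server exposes a /health endpoint; the endpoint and host names are assumptions, not something the author specifies.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Periodically probe each backend; only healthy ones stay in rotation.
public class HealthChecker {
    private final HttpClient client = HttpClient.newHttpClient();
    private final List<String> backends =
            new CopyOnWriteArrayList<>(List.of("http://app1:8080", "http://app2:8080"));
    private final List<String> healthy = new CopyOnWriteArrayList<>();

    public void checkOnce() {
        healthy.clear();
        for (String backend : backends) {
            try {
                HttpRequest request = HttpRequest.newBuilder(URI.create(backend + "/health"))
                        .timeout(Duration.ofSeconds(2)).GET().build();
                HttpResponse<Void> response =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                if (response.statusCode() == 200) {
                    healthy.add(backend);   // node is up, keep it in rotation
                }
            } catch (Exception e) {
                // timeout or connection error: node stays out of rotation
            }
        }
    }

    public List<String> healthyBackends() {
        return healthy;
    }
}
```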
Another widely used technique is separating static from dynamic content. You can either ask developers to cooperate (put static resources on a separate site) or avoid requiring their cooperation (have a layer-7 reverse proxy decide, based on information such as the file suffix, what kind of resource it is). Once there is a dedicated static file server, storage itself becomes an issue and also needs to scale. How do you keep files consistent across multiple servers when you cannot afford shared storage? This is where a distributed file system comes in handy.
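The routing decision a layer-7 proxy makes can be illustrated in a few lines of Java: classify the request by suffix and pick the static file servers or the application cluster as the upstream. The host names below are illustrative.

```java
import java.util.Set;

// Sketch of the suffix-based decision a layer-7 reverse proxy makes:
// static resources go to the file servers, everything else to the app cluster.
public class StaticDynamicRouter {
    private static final Set<String> STATIC_SUFFIXES =
            Set.of("css", "js", "png", "jpg", "gif", "ico", "woff2");

    public String upstreamFor(String path) {
        int dot = path.lastIndexOf('.');
        String suffix = dot >= 0 ? path.substring(dot + 1).toLowerCase() : "";
        return STATIC_SUFFIXES.contains(suffix)
                ? "http://static-files.internal"   // dedicated static file servers
                : "http://app-cluster.internal";   // dynamic requests to the app tier
    }
}
```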
CDN acceleration is another very common technique, both in China and abroad. Competition in this field is now fierce, so it has become relatively cheap. In China the "north-south" interconnection problem between carriers is fairly serious, and a CDN solves it effectively.

The basic principle of a CDN is not complicated: it can be understood as intelligent DNS plus a Squid-style reverse proxy cache, backed by many data-center nodes that serve the requests.
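The "intelligent DNS" half can be sketched as a lookup from the client's region (inferred from the resolver's IP) to the nearest edge node; the regions and addresses below are purely illustrative.

```java
import java.util.Map;

// Toy version of "intelligent DNS": answer with the edge node closest
// to the region the client's resolver appears to come from.
public class GeoDnsResolver {
    private static final Map<String, String> EDGE_NODES = Map.of(
            "north-china", "203.0.113.10",
            "south-china", "203.0.113.20",
            "overseas",    "203.0.113.30");

    public String resolve(String clientRegion) {
        // Fall back to a default node if the region is unknown.
        return EDGE_NODES.getOrDefault(clientRegion, "203.0.113.10");
    }
}
```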
So far, none of these measures has required changing the application architecture itself, or to put it plainly, none of them requires large-scale code changes.
What if all of the above measures have been exhausted and the load still cannot be supported? Simply piling on more machines is not the answer forever.
As the business grows more complex and the site gains more features, the deployment layer may already be a cluster, but at the application level the architecture is still "centralized." That brings heavy coupling, makes development and maintenance harder, and means a single failure can drag everything down. So the site is usually split vertically into separate sub-sites that are deployed independently.
Once the applications have been split apart, and because a single database is limited in connections, QPS, TPS, and I/O capacity, the DB tier can also be split vertically, one database per business domain.
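After a vertical split, each business domain owns its own database and the application simply picks the connection for the domain it belongs to. A minimal sketch with hypothetical domains and URLs:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Map;

// After vertical partitioning, each business domain has its own database.
public class VerticalDbRouter {
    private static final Map<String, String> DOMAIN_URLS = Map.of(
            "user",  "jdbc:mysql://db-user:3306/user",
            "order", "jdbc:mysql://db-order:3306/order",
            "item",  "jdbc:mysql://db-item:3306/item");

    public Connection connectionFor(String domain) throws SQLException {
        String url = DOMAIN_URLS.get(domain);
        if (url == null) {
            throw new IllegalArgumentException("unknown business domain: " + domain);
        }
        return DriverManager.getConnection(url, "app", "secret");
    }
}
```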
Even after splitting the applications and the DBs, quite a few problems remain. Different sub-sites may contain code with the same logic and functionality. For some basic functionality we can of course package DLLs or jar files and reference them everywhere, but such strong dependencies easily cause their own problems (versioning, transitive dependencies, and so on are all cumbersome to handle). This is where the much-touted value of SOA shows itself.
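The SOA idea is to replace the shared jar/DLL with a service called over the network behind a stable contract. A minimal sketch using a plain HTTP call; the author does not name a specific RPC framework, and the endpoint is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Instead of linking a shared jar into every sub-site, call a user service
// over the network; callers depend only on the contract, not the code.
public class UserServiceClient {
    private final HttpClient client = HttpClient.newHttpClient();
    private final String baseUrl = "http://user-service.internal"; // hypothetical endpoint

    public String getUserProfile(long userId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create(baseUrl + "/users/" + userId)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body(); // e.g. a JSON document describing the user
    }
}
```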
There are still dependencies between applications and services, and this is when the high-throughput decoupling tool, the message queue, makes its appearance.
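The decoupling pattern is simple: the producer publishes an event and returns immediately, and a consumer processes it asynchronously. The sketch below uses an in-process BlockingQueue purely as a stand-in for a real message broker, which the author does not name.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-process stand-in for a message broker: the order site publishes an event
// and moves on; a separate consumer thread handles the slow follow-up work.
public class OrderEvents {
    private static final BlockingQueue<String> QUEUE = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        // Consumer: runs independently of the producer.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String event = QUEUE.take();
                    System.out.println("sending notification for " + event);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // Producer: publish and return immediately instead of calling downstream services.
        QUEUE.put("order-1001-created");
        Thread.sleep(100); // give the demo consumer a moment before the JVM exits
    }
}
```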
Finally, there is the signature move of the big internet companies: sharding, splitting databases and tables. From personal experience, unless the business and every other aspect of the site make it truly pressing, do not take this step lightly.

Anyone can split databases and tables; the key is how you cope with everything that comes after the split. At present there is no fully open-source, free solution that lets you solve the database sharding problem once and for all.
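The routing half of sharding is easy to sketch: hash the shard key to pick a physical database or table. The hard parts, such as cross-shard queries, re-sharding, and distributed transactions, are exactly what no free solution handles end to end. The URLs and counts below are illustrative.

```java
import java.util.List;

// Route each user to one of N physical databases by hashing the shard key.
// This is the easy half of sharding; queries that span shards are the hard half.
public class ShardRouter {
    private static final List<String> SHARDS = List.of(
            "jdbc:mysql://db-shard0:3306/app",
            "jdbc:mysql://db-shard1:3306/app",
            "jdbc:mysql://db-shard2:3306/app",
            "jdbc:mysql://db-shard3:3306/app");

    public String shardFor(long userId) {
        int index = (int) (userId % SHARDS.size()); // simple modulo sharding
        return SHARDS.get(index);
    }

    // Table sharding follows the same idea: suffix the table name.
    public String tableFor(long userId, int tableCount) {
        return "orders_" + (userId % tableCount);
    }
}
```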
Evolution of large-scale website architecture system