A brief talk on high-performance, high-concurrency website architecture for sites at the tens-of-millions PV/IP scale
Original: http://blog.51cto.com/oldboy/736710
Structure diagram of the article:
The core principle of handling highly concurrent access is actually simple: push every user request as far toward the front end as possible.
If we liken visiting users to an invading "enemy", we must keep them eight hundred li away; that is, we must not let their requests reach our headquarters (the command center being the database and distributed storage).
For example: if content can be cached on the user's own computer, do not let the request hit the CDN. If the CDN can serve from its cache, do not let the request reach the origin (the static servers). If a static server can answer it, do not touch the dynamic servers. And so on: any request that does not have to reach the database and storage must never reach the database and storage.
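To make the idea concrete, here is a conceptual sketch in Python (the layer functions are made-up stubs, not a real API): each layer answers from its own cache when it can, and only a request that no layer can answer falls through to the database.

```
# A conceptual sketch of "pushing requests forward": each layer tries to answer
# from its own cache and only falls through to the next one on a miss, so the
# database ("headquarters") is the last resort. Layer functions are illustrative stubs.
def browser_cache(req):  return None            # miss: nothing cached on the client
def cdn_cache(req):      return "cached page"   # hit: CDN answers, origin untouched
def static_server(req):  return "static page"
def dynamic_server(req): return "rendered page"
def database(req):       return "row from DB"

def serve(req):
    for layer in (browser_cache, cdn_cache, static_server, dynamic_server):
        resp = layer(req)
        if resp is not None:            # answered at this layer; stop here
            return resp
    return database(req)                # only unanswerable requests reach the DB

print(serve("/index.html"))             # -> "cached page" (stopped at the CDN)
```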
This is easy to say and not so easy to do, but with a little effort it can be done. Doesn't Google handle over a billion independent IPs a day? Next to Google, our tens-of-millions-PV site is a hut looking up at a mansion. So let's start with our hut! Ha ha! The architecture introduced below starts from a site with tens of millions of PV, but it can also support a site at the billion-PV level.
A high-performance, high-concurrency, highly scalable site architecture handles access in several layers:
Some people will ask: we keep saying "push user access to the business as far forward as possible", but how exactly do we push, and push to where? Below, old boy walks through it for everyone.
The first layer: start at the user's browser. Use Apache's mod_deflate for compressed transfer, and likewise the Expires feature; using deflate and Expires well greatly improves the user experience, reduces the site's bandwidth, and reduces the pressure on the back-end servers. Of course, there are many other things that can be done here.
Tip: the same applies to software such as Nginx and Lighttpd, which provide equivalent compression and expires features.
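As an illustration of what mod_deflate and mod_expires accomplish, here is a toy Python stand-in (not Apache configuration; the sample page and the 30-day lifetime are made up): the response is gzip-compressed when the browser accepts it and carries Expires/Cache-Control headers so the browser keeps a local copy.

```
# A minimal sketch of the browser-side layer: compress the response body and
# attach Expires/Cache-Control headers so the client caches the asset locally.
import gzip
from datetime import datetime, timedelta, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer

class StaticHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>static page</body></html>"
        accepts_gzip = "gzip" in self.headers.get("Accept-Encoding", "")
        if accepts_gzip:
            body = gzip.compress(body)          # what mod_deflate does on the fly
        expires = datetime.now(timezone.utc) + timedelta(days=30)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        if accepts_gzip:
            self.send_header("Content-Encoding", "gzip")
        # what mod_expires does: tell the browser to keep this copy for 30 days
        self.send_header("Cache-Control", "max-age=2592000")
        self.send_header("Expires", expires.strftime("%a, %d %b %Y %H:%M:%S GMT"))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), StaticHandler).serve_forever()
```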
The second layer: page elements, such as images, JS, CSS, and static HTML data. This is the page cache layer, for example a CDN (which works better than deploying Squid/Nginx ourselves: the providers are more professional, inexpensive, such as Fastweb or CC, at roughly 80 yuan per Mbps per month or lower, and they cover more city nodes). Building a small CDN ourselves with Squid/Nginx caches is the second choice (very large companies may weigh the risks and combine self-built nodes with purchased service). Our own front end should at most provide the data source for the CDN, so as to reduce the pressure on our back-end servers and storage, rather than serve the cache directly to end users. Taobao's CDN once came under pressure because some images were too large; they even shrank the large images in order to reduce traffic and bandwidth.
Tip: we can also set up a cache layer of our own to act as the data source for the CDN we purchase; the usable software includes caches such as Varnish, Nginx, and Squid. This relieves the pressure on the third layer, the static data layer. In front of this layer we can also set up our own DNS servers, to support cross-datacenter business deployment and intelligent DNS resolution.
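The role of Squid/Varnish/Nginx at this layer can be sketched as a small caching proxy that sits in front of the static origin and serves as the data source for the purchased CDN; the origin address and the 300-second TTL below are illustrative assumptions, not a real deployment.

```
# A toy sketch of a caching layer in front of the static origin: serve repeat
# requests from memory so the origin (and everything behind it) is untouched.
import time
import urllib.request

ORIGIN = "http://static.example.com"   # hypothetical static source station
TTL = 300                              # seconds a cached object stays fresh
_cache = {}                            # path -> (fetched_at, body)

def get(path):
    """Return the object at `path`, from cache when possible."""
    hit = _cache.get(path)
    if hit and time.time() - hit[0] < TTL:
        return hit[1]                                   # cache hit: origin untouched
    with urllib.request.urlopen(ORIGIN + path) as resp: # cache miss: go to origin
        body = resp.read()
    _cache[path] = (time.time(), body)
    return body
```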
The third layer: the static server layer, generally image servers, video servers, and static HTML servers. This layer is the link between the front-end cache layer and the dynamic server layer behind it. Large companies (Sina, 163, and so on) publish news and similar content directly from the publishing system to each cache node, which may not match an ordinary company's business, so it cannot be imitated blindly; the same goes for SNS sites such as Renren.
We can use a message queue to distribute writes asynchronously, and publish dynamic data (data in the database) as static content: push it up to this static layer, or publish it to the cache nodes by other means, rather than letting every user hit the database directly. Has anyone noticed that a news item on the qq.com portal can carry hundreds of thousands of comments? If every user reading the news loaded all of those comments from the database, it would be strange if the database did not go down. Their comments go through "moderation" (moderation in name only; in reality this is asynchronous processing, and the comments are likely published statically or served from something like a memory cache). This may be something a site like 51cto.com should learn from: open a 51CTO blog post and you will see the comments below already rendered, possibly paginated; but if they are read straight from the database, then once traffic grows the database pressure is inevitable. This is not to say the 51CTO site is bad: all sites start from a similar program architecture and evolve from there. CU is probably in the same situation.
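Here is a minimal sketch of the "write asynchronously, publish statically" pattern described above, using an in-process queue.Queue as a stand-in for a real message queue (the file name and comment format are made up): posting a comment only enqueues it, and a background worker persists it and regenerates a static fragment that readers and the cache layer fetch instead of the database.

```
# Asynchronous write + static publish: users never block on the database, and
# readers never query it for comments; they read a pre-rendered fragment.
import json
import queue
import threading

comment_queue = queue.Queue()
comments = []                          # stand-in for the database table

def worker():
    while True:
        item = comment_queue.get()     # blocks until a new comment arrives
        comments.append(item)          # "write to the database"
        # Re-render the first page of comments as a static file; readers hit
        # this file (or its cached copy on the CDN), not the database.
        with open("comments_page1.json", "w") as f:
            json.dump(comments[-20:], f, ensure_ascii=False)
        comment_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

# "User" side: posting a comment only enqueues it and returns immediately.
comment_queue.put({"user": "reader1", "text": "nice post"})
comment_queue.join()
```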
The fourth layer: the dynamic server layer: PHP, Java, and so on. Only requests that make it through the previous three layers reach this layer, and only here might the database and storage be touched. After the filtering of the first three layers, the requests reaching this layer are generally few, mostly first views of newly published content, such as blog posts (including Weibo and the like) and BBS posts.
Special note: much can be done in the program at this layer, for example accessing the cache layer (Memcache, MemcacheDB, TC, MySQL, Oracle): implementing distributed access at the program level, distributed read/write separation, and program-level distributed access to each DB cache node; a business line, or a business line split across several groups of servers, can also be load balanced. Such a structure greatly reduces the pressure on the database and storage layers behind it; reaching here is like reaching the outer ring of the command center.
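Program-level caching at this layer usually follows a cache-aside pattern; the sketch below assumes a Memcached-style client and a DB-API style MySQL connection (the `cache` and `db` handles, the key format, and the `posts` table are hypothetical placeholders, not a specific API).

```
# Cache-aside read: try the DB cache layer first, fall back to MySQL only on a
# miss, then populate the cache so later readers never touch the database.
POST_TTL = 600          # seconds a post stays in the cache (illustrative)

def get_post(cache, db, post_id):
    key = "post:%d" % post_id
    post = cache.get(key)               # 1) try the cache layer first
    if post is not None:
        return post                     # hit: the database is never touched
    cur = db.cursor()                   # 2) miss: read from MySQL (e.g. a slave)
    cur.execute("SELECT title, body FROM posts WHERE id = %s", (post_id,))
    post = cur.fetchone()
    if post is not None:
        cache.set(key, post, POST_TTL)  # 3) populate the cache for later readers
    return post
```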
The fifth layer: the database cache layer, for example Memcache, MemcacheDB, TC, and so on.
Choose the right cache database for the business, according to business needs. For Memcache, MemcacheDB, TTserver, and related NoSQL databases, distributed access to this layer can be implemented in the program at the fourth layer, and each distributed access node may itself be a load-balanced group (dozens of machines).
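A minimal sketch of how the fourth layer can spread keys across these cache nodes from within the program; the node addresses are made up, and a real deployment would typically use consistent hashing rather than the plain modulo shown here, so that adding or removing a node remaps only a fraction of the keys.

```
# Pick a cache node for a key: every app server maps the same key to the same node.
import hashlib

CACHE_NODES = [
    "10.0.0.11:11211",   # each entry may itself be a load-balanced group
    "10.0.0.12:11211",
    "10.0.0.13:11211",
]

def node_for(key):
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return CACHE_NODES[int(digest, 16) % len(CACHE_NODES)]

print(node_for("post:12345"))
```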
The sixth layer: the database layer. Sites that are not huge generally use a MySQL master-slave structure (163, Sina, and Kaixin all do), with distributed read/write separation done at the program layer, one master (or dual masters) and many slaves. When traffic grows, this can be extended to multi-master rings with slaves, and multiple load-balanced groups can be set up for the front-end distributed program to call. If traffic is still large, the business needs to be split. For example, at a company where I worked part-time I found a site similar to 51CTO: the WWW service, blog service, and BBS service were all on one server, replicated master-slave. In that situation, once traffic grows you can simply split the WWW, blog, and BBS services onto separate groups of servers, which is not hard to do. Of course, if traffic keeps growing you can continue to split within a single service, for example splitting the WWW databases, giving each database its own load-balanced group, and then splitting tables within a database. Where high availability is required, it can be provided with tools such as DRBD. If writes are heavy, master-master or multi-master MySQL replication can be used; for Oracle, a group of Oracle DG (one master, multiple standbys) is enough, and 11g DG can, like MySQL replication, support read/write separation. There are also MySQL Cluster and Oracle RAC, but running them requires better hardware and a lot of maintenance cost after deployment; if you have to consider them, your traffic is already very large, so congratulations: at least tens of millions or even billions of PV.
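Program-level read/write separation for a one-master/many-slave MySQL setup can be sketched like this (the DSNs and table names are made up): writes always go to the master, reads rotate across the slaves.

```
# Route SQL statements: SELECTs go to a read-only slave (round-robin),
# everything else (INSERT/UPDATE/DELETE/DDL) goes to the master.
import itertools

MASTER = "mysql://master.db:3306"                  # hypothetical addresses
SLAVES = itertools.cycle([
    "mysql://slave1.db:3306",
    "mysql://slave2.db:3306",
    "mysql://slave3.db:3306",
])

def pick_server(sql):
    """Return the server a statement should be sent to."""
    if sql.lstrip().lower().startswith("select"):
        return next(SLAVES)        # reads spread across slaves
    return MASTER                  # writes always hit the master

print(pick_server("SELECT * FROM bbs_posts LIMIT 10"))
print(pick_server("INSERT INTO bbs_posts (title) VALUES ('hi')"))
```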
Giant companies like Baidu, apart from conventional MySQL and Oracle databases, use NoSQL databases heavily in areas with higher performance requirements, then add DNS and load balancing in front, distributed read/write separation, and in the end still split the business, and split again, refining step by step, until each point can be one or several groups of machines.
Special note: the hardware of the database layer also determines how much traffic can be carried, and disk I/O in particular must be considered. Large companies tend to work hard on cost-effectiveness here, for example using hardware NetApp/EMC and SAN fiber architecture for the core business, and SAS or SSD disks for resource data storage such as video. If the data volume is large, a hot-spot split can be used: for example, the most frequently accessed 10-20% of data goes on SSD storage, the middle 20-30% on SAS disks, and the remaining 40-50% can use cheap SATA.
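The hot-spot split can be sketched as a simple ranking by access count; the sample counts and the 20%/30%/50% cut-offs below are illustrative, not a rule.

```
# Assign storage tiers by access frequency: hottest objects to SSD,
# warm objects to SAS, the cold bulk to cheap SATA.
def assign_tiers(access_counts):
    """access_counts: {object_id: hits}. Returns {object_id: tier}."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    n = len(ranked)
    tiers = {}
    for i, obj in enumerate(ranked):
        if i < n * 0.2:
            tiers[obj] = "SSD"      # hottest slice
        elif i < n * 0.5:
            tiers[obj] = "SAS"      # warm slice
        else:
            tiers[obj] = "SATA"     # cold bulk
    return tiers

print(assign_tiers({"img1": 900, "img2": 40, "img3": 300, "img4": 5, "img5": 120}))
```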
The seventh layer: the storage layer. For a site with tens of millions of PV, if the design is reasonable, one or two NFS servers will suffice. I have maintained (part-time) or seen many sites past tens of millions of PV that used NFS on ordinary servers for storage, just with more disks, for example six 15K SAS disks, or a Dell 6850; a few groups of NFS storage are enough for a small or medium-sized site. Of course, it can also be built as DRBD+Heartbeat+NFS in an active/active (A/A) setup.
If the design requirements of this article can be met, the back-end database and storage pressure of a medium-sized site will be very small. Portal-level sites, such as XX, will adopt hardware NetApp/EMC storage devices or SAN fiber architecture for the core business, and SAS or SSD disks for resource data storage such as video; if the data volume is large, the same hot-spot split described above can be applied (roughly the hottest 10-20% on SSD, the middle 20-30% on SAS, and the remaining 40-50% on cheap SATA).
Giant companies like XX will use distributed storage architectures such as Hadoop, with multiple cache layers and multiple load balancers in front, and they too split by business, for example crawler storage, index storage, service-layer storage, and so on, ever finer, using whatever means it takes to cope with the pressure.
For special businesses, such as some SNS portals, including portal comments and Weibo, writes are mostly asynchronous; whether reading or writing, concurrent access to the database is kept very small.
If all of layers 1-7 above are in place, very little traffic will slip through to the fourth layer, the dynamic server layer. For an average medium-sized site, this will never put too much pressure on the database. Program-level distributed access is what carries a site from tens of millions of PV to billion-level PV; of course, special businesses need special architectures to make reasonable use of the database and storage.