Worth sharing: System architecture for large, high-concurrency, high-load websites

When reprinting, please retain the source: June-Lin Michael's blog (http://www.toplee.com/blog/?p=71)
Trackback URL: http://www.toplee.com/blog/wp-trackback.php?p=71

I worked on the CERNET dial-up access platform, then on search engine front-end development at Yahoo & 3721, and later on the architecture upgrade of MOP's Hodgepodge, a large community site. Along the way I have also worked on and developed modules for a number of large and medium-sized websites. As a result I have accumulated some experience with handling high load and high concurrency on large websites, and I would like to discuss it with you here.


A small site, such as a personal site, can be built from the simplest static HTML pages, with a few images for decoration, all stored in one directory. Such a site makes very modest demands on system architecture and performance. As internet businesses have grown richer, website technology has developed over the years into many fine-grained specialties. For large websites in particular, the technology involved spans a very wide range, from hardware to software, programming languages, databases, web servers, firewalls, and more, with demands far beyond what a simple static HTML site faces.

A large website, such as a portal, faces a huge number of user visits and highly concurrent requests. The basic solutions concentrate on a few areas: high-performance servers, high-performance databases, efficient programming languages, and high-performance web containers. But these measures alone cannot fundamentally solve the high-load and high-concurrency problems that large websites face.

The solutions above also imply heavier investment to some degree, and they have bottlenecks and do not scale well. Below I share some of my experience from the perspective of low cost, high performance, and high scalability.

1. Static HTML
As we all know, the most efficient and least resource-hungry page is a purely static HTML page, so we try to serve as many pages of our site as possible as static pages. The simplest method is in fact the most effective. But for sites with a lot of frequently updated content, we cannot produce every page by hand, which is where the familiar content management system (CMS) comes in. The news channels of the portals we visit, and even their other channels, are managed and published through such systems. At its simplest, a CMS turns entered content into static pages automatically; it can also provide channel management, permission management, automatic crawling, and other functions. For a large website, an efficient, manageable CMS is essential.
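As a rough illustration of the "enter content, get a static page" step, here is a minimal Python sketch. The function name `publish_article`, the file layout, and the page template are assumptions made for this example; they are not taken from any particular CMS.

```python
import html
from pathlib import Path


def publish_article(out_dir: Path, article_id: int, title: str, body: str) -> Path:
    """Render one article to a static HTML file and return its path.

    Hypothetical helper: a real CMS would use templates, channels, and so on.
    """
    page = (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        f"<body><h1>{html.escape(title)}</h1>\n"
        f"<p>{html.escape(body)}</p></body></html>\n"
    )
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{article_id}.html"
    # Write to a temporary name first, then rename, so the web server
    # never serves a half-written page.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(page, encoding="utf-8")
    tmp.replace(target)
    return target
```

Calling the same function again when an article is edited overwrites the old file, so the web server only ever serves plain files and the database is touched only at publish time.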

Beyond portals and content-publishing sites, making pages static wherever possible is also a necessary way to improve performance on highly interactive community sites. Turning posts and articles into static pages in real time, and regenerating them when they are updated, is a widely used strategy; MOP's Hodgepodge uses it, as do the NetEase communities and others. Many blogs are also generated statically these days. WordPress, the program this blog runs on, is not, so www.toplee.com would certainly not hold up under high-load traffic.

Static HTML is also a tool for certain caching strategies. For parts of a system that query the database frequently but whose content changes very rarely, consider static HTML; a forum's public settings are an example. Mainstream forum software lets you manage these settings in the back end and stores them in the database. Front-end code reads them constantly, yet they are updated very rarely, so this content can be written out as static files whenever it is updated in the back end, avoiding a large number of database requests.

Static HTML can also be used in a compromise form: the front end stays dynamic, and pages are made static on a timer and checked for freshness according to some policy. This allows a lot of flexibility. My billiards site (www.8zone.cn) works this way: I set time intervals after which dynamic page content is cached as static HTML, so that static pages absorb most of the load. The approach suits small and medium-sized sites. By the way, friends who like billiards are welcome to support my free site: http://www.8zone.cn
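The timed approach can be sketched as follows. This is only an illustration under stated assumptions: `serve_page`, its parameters, and the use of the file's modification time as the freshness check are choices made for the example, not a description of how 8zone.cn is actually implemented.

```python
import time
from pathlib import Path
from typing import Callable


def serve_page(cache_file: Path, render: Callable[[], str], max_age: float) -> str:
    """Return the cached static page, regenerating it once it is older
    than max_age seconds. `render` is the (expensive) dynamic page builder."""
    if cache_file.exists():
        age = time.time() - cache_file.stat().st_mtime
        if age < max_age:
            # Fresh enough: serve the static copy, no database work at all.
            return cache_file.read_text(encoding="utf-8")
    # Missing or stale: run the dynamic code once and store the result.
    page = render()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(page, encoding="utf-8")
    return page
```

With `max_age` set to, say, 300 seconds, the dynamic code runs at most about once every five minutes per page, no matter how many visitors request it in between.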

2. Separate image servers
As you know, for a web server, whether Apache, IIS, or another container, images consume the most resources. So we need to separate images from pages. Basically every large site adopts this strategy: they have dedicated image servers, often many of them. Such an architecture reduces the load on the servers that handle page requests and ensures that the system does not crash because of image problems.

The application servers and image servers can be tuned differently. For example, Apache can be configured to support as few content types and load as few modules (LoadModule) as possible, keeping resource consumption low and execution efficiency high.

My billiards site 8zone.cn also separates out its image server. For now the separation is only architectural, not physical, because I have no money to buy more servers :). You can see that image links on the site use URLs such as img.9tmd.com or img1.9tmd.com.
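One way page code might spread image links over several image hosts is sketched below. The two host names come from the text; the `image_url` helper and the choice of a CRC32 hash are assumptions for illustration, not the site's actual code.

```python
import zlib

# Host names taken from the text; more can be added as servers are bought.
IMAGE_HOSTS = ["img.9tmd.com", "img1.9tmd.com"]


def image_url(path: str) -> str:
    """Map an image path to one image host.

    The same path always maps to the same host, so browser and proxy
    caches keep working across page views.
    """
    index = zlib.crc32(path.encode("utf-8")) % len(IMAGE_HOSTS)
    return f"http://{IMAGE_HOSTS[index]}/{path.lstrip('/')}"
```

Because pages only ever reference the image host names, moving images to a physically separate machine later is a DNS change rather than a code change.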

In addition, for serving static pages, images, JS, and similar files, consider using lighttpd instead of Apache; it is more lightweight and more efficient.

3. Database clusters and database/table hashing
Large websites have complex applications, and these applications must use databases. Under heavy traffic the database quickly becomes the bottleneck, and a single database soon cannot keep up, so we need database clusters or database/table hashing.

For database clusters, many databases have their own solutions. Oracle, Sybase, and others offer good ones, and the master/slave replication provided by the commonly used MySQL is a similar approach. Whatever database you use, follow its corresponding solution.

The database clusters mentioned above are constrained in architecture, cost, and scalability by the type of database used, so we also need to improve the system architecture from the application side; database/table hashing is the most common and most effective approach. We split the databases by business, application, or functional module, so that different modules map to different databases or tables, and then hash a given page or function into smaller databases or tables according to some policy. The user table, for example, can be hashed by user ID. This improves performance at low cost and scales well. Sohu's forum uses such an architecture: users, settings, posts, and other data are split into separate databases, and posts and users are then hashed into databases and tables by board and ID. In the end, a simple change in a configuration file is enough to add a low-cost database at any time to boost system performance.
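Hashing a user table by user ID might look roughly like the sketch below. The database names, the number of tables per database, and the `users_N` naming are all hypothetical; a real deployment would read them from the configuration file mentioned above.

```python
from typing import NamedTuple


class Shard(NamedTuple):
    database: str
    table: str


# Hypothetical layout: two databases, four user tables in each.
DATABASES = ["user_db_0", "user_db_1"]
TABLES_PER_DB = 4


def shard_for_user(user_id: int) -> Shard:
    """Pick the database and table that hold a given user's row."""
    slot = user_id % (len(DATABASES) * TABLES_PER_DB)
    db_index, table_index = divmod(slot, TABLES_PER_DB)
    return Shard(DATABASES[db_index], f"users_{table_index}")
```

Note that with plain modulo hashing, changing the number of databases changes where existing users map to, so adding capacity usually also involves moving some rows or keeping a lookup table; the sketch leaves that part out.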

4. Caching
Anyone who works in technology has come across the word cache, and caches are used in many places. Caching is also very important in website architecture and web development. Here I first describe the two most basic kinds of cache; advanced and distributed caching will be described later.

Architecture-level caching: people familiar with Apache will know that it provides its own mod_proxy cache module, and Squid can also be added for caching. Both can effectively improve Apache's response capacity.

Caching in website application code: memcached on Linux is a commonly used caching solution, and many web programming languages, including PHP, Perl, C, and Java, provide memcached client interfaces that can be used in web development. Data and objects can be cached in real time or on a schedule (cron), and the policy is very flexible. Some large communities use such an architecture.
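The usual pattern in application code is "look in the cache first, fall back to the database, then store the result". The sketch below shows the pattern with a small in-process stand-in so that it runs without a memcached server; `TTLCache` and `cached_query` are names invented for this example and are not a memcached client API.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-process stand-in for a memcached-style get/set cache with expiry."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._items[key]  # expired: behave like a cache miss
            return None
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._items[key] = (time.monotonic() + ttl, value)


def cached_query(cache: TTLCache, key: str, load: Callable[[], Any], ttl: float = 60.0) -> Any:
    """Return the cached value for key, calling load() only on a miss."""
    value = cache.get(key)
    if value is None:
        value = load()  # e.g. the real database query
        cache.set(key, value, ttl)
    return value
```

In production the stand-in would be replaced by a real memcached client shared by all web servers, which is what lets the cache take load off the database for the whole cluster.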

In addition, every web development language has its own caching modules and methods. PHP has the PEAR Cache module and the eAccelerator acceleration and caching module, as well as the well-known APC and XCache (developed by Chinese developers; worth supporting!) caching modules. Java has even more. I am not very familiar with .NET, but I believe it must have them too.

5. Mirroring
Mirroring is often used by large websites to improve performance and data safety. It can even out the differences in access speed between users on different networks and in different regions; for example, the gap between ChinaNet and EduNet has prompted many sites to build mirror sites inside the education network, with data updated on a schedule or in real time. I will not go deep into the details of mirroring here; there are many professional off-the-shelf architectures and products to choose from. There are also inexpensive software approaches, such as rsync on Linux.

6. Load balancing
Load balancing is the ultimate solution for large websites facing high-load traffic and large numbers of concurrent requests.

Load balancing technology has been developing for many years, and there are many professional vendors and products to choose from. I have personally worked with several solutions, two of which are described here for reference. I will not say much about entry-level DNS round-robin load balancing or the more professional CDN architectures.

6.1 Hardware layer-4 switching
Layer-4 switching uses the layer-3 and layer-4 header information of packets to identify traffic flows by application session, and assigns all the traffic of a session to the appropriate application server for processing. A layer-4 switch acts like a virtual IP that points to physical servers. The traffic it carries follows many different protocols: HTTP, FTP, NFS, Telnet, and others. Spreading this traffic over the physical servers requires sophisticated load balancing algorithms. In the IP world, the service type is determined by the destination TCP or UDP port, and an application session in layer-4 switching is determined jointly by the source and destination IP addresses and the TCP or UDP ports.
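The core idea, a virtual IP and port standing in front of a pool of physical servers, can be modeled in a few lines. This is a toy model of the dispatch decision only, using simple round robin; the class name, addresses, and server names are made up for the example, and real switches work on packets and support far more elaborate scheduling algorithms.

```python
import itertools
from typing import Dict, Iterator, List, Tuple

VirtualService = Tuple[str, int]  # (virtual IP, destination TCP/UDP port)


class Layer4Dispatcher:
    """Toy model: each virtual IP + port fronts a pool of real servers."""

    def __init__(self, pools: Dict[VirtualService, List[str]]) -> None:
        # One endless round-robin iterator per virtual service.
        self._cycles: Dict[VirtualService, Iterator[str]] = {
            service: itertools.cycle(servers) for service, servers in pools.items()
        }

    def pick(self, vip: str, port: int) -> str:
        """Choose the real server for a new connection to vip:port."""
        return next(self._cycles[(vip, port)])
```

For example, a dispatcher built with `{("10.0.0.1", 80): ["web1", "web2"]}` hands successive connections to `web1`, `web2`, `web1`, and so on, while clients only ever see the address 10.0.0.1.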

In hardware layer-4 switching there are well-known products to choose from, such as Alteon and F5. They are expensive but worth the money, offering excellent performance and very flexible management. Yahoo China initially handled nearly 2,000 servers with just three or four Alteon units.

6.2 Software layer-4 switching
Once you understand the principle of hardware layer-4 switching, software layer-4 switching based on the OSI model follows naturally. The principle is the same, though performance is somewhat lower. It still handles a fair amount of load with ease, and some say the software approach is actually more flexible; its capacity depends entirely on how well you know its configuration.

For software layer-4 switching we can use the widely used LVS on Linux. LVS stands for Linux Virtual Server. It provides real-time failover based on heartbeat links (Heartbeat), which improves the robustness of the system, and it offers flexible virtual IP (VIP) configuration and management that can meet a variety of application needs, which is essential for distributed systems.

A typical load balancing strategy is to build a Squid cluster on top of software or hardware layer-4 switching. Many large websites, including search engines, use this approach. Such an architecture is low in cost, high in performance, and highly scalable; adding or removing nodes at any time is very easy. I plan to write this architecture up in detail and discuss it with you later.

Summary:
For a large website, every method mentioned above may be in use at the same time. My introduction here is fairly superficial; many implementation details only come with familiarity and experience. Sometimes a single small Squid or Apache parameter setting has a very large effect on system performance. I hope we can discuss these things together, and that this article gets the discussion started.

Responses to this article

1 Pi1ot says:
April 29th, 2006 at 1:00 pm

Asynchronous queues for communication between modules or processes are also very important in general; they let you balance light-load response performance against system pressure. Database pressure can be offloaded to the file system by using files, and file system I/O pressure can in turn be offloaded with a memory cache. The effect is very good.

2 Exception says:
April 30th, 2006 at 4:40 pm

Well written. There are not many articles like this online these days; I learned a lot from reading it.

3 guest says:
May 1st, 2006 at 8:13 am

Total nonsense!
"As you know, for the Web server, whether it is Apache, IIS or other containers, the picture is the most resource-consuming": do you think the images are generated dynamically in memory? Whatever the file is, the container just reads the file and writes it to the response; what does the file type have to do with it?

The key is that static files and dynamic pages call for different strategies. Static files should be cached as much as possible, because the output is the same no matter how many times they are requested; if 20 users are on a page, there is no need to request it 20 times, the cache should be used instead. Dynamic pages produce different output on every request (otherwise they should be static), so they should not be cached.

So even on a single server, static and dynamic resources can be optimized differently. A dedicated image server is there for the convenience of resource management and has nothing to do with the performance you are talking about.

4 Michael says:
May 2nd, 2006 at 1:15 am

I guess the poster above has never encountered a case of dynamic caching. When we processed Inktomi search results, we cached dynamic content across the board: for the same keywords and query conditions, such caching is very important. For dynamic content caching, using appropriate header parameters in your code makes it easy to manage caching policy, such as expiration times.
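As a small illustration of the "header parameters" mentioned here, a dynamic page can tell caches in front of it how long a response may be reused. The helper below is a sketch under assumptions: the function name and the choice of `Cache-Control: public, max-age=...` together with `Expires` are for illustration and are not taken from the Inktomi setup.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from typing import Dict, Optional


def cache_headers(max_age_seconds: int, now: Optional[datetime] = None) -> Dict[str, str]:
    """Build HTTP response headers that let proxies and browsers reuse
    a dynamic response for max_age_seconds."""
    if now is None:
        now = datetime.now(timezone.utc)
    expires = now + timedelta(seconds=max_age_seconds)
    return {
        # Understood by HTTP/1.1 caches.
        "Cache-Control": f"public, max-age={max_age_seconds}",
        # Absolute expiry time in HTTP date format, for older caches.
        "Expires": format_datetime(expires, usegmt=True),
    }
```

A search results page for a given keyword could send `cache_headers(300)`, so that a caching proxy such as Squid answers repeated identical queries for five minutes without touching the back end.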

The performance impact of images that we are talking about generally comes from the fact that, for most pages we visit, the images account for more traffic than the HTML code. At the same network bandwidth, images take longer to transmit, and establishing connections is itself costly, so the HTTP connections from clients are held on the server for longer. That certainly degrades Apache's performance. If everything you return is static, you can reduce connection handling time with the KeepAlive settings in httpd.conf; but if there are too many images, the number of connections rises and performance is consumed all the same.

In addition, the theory we are discussing applies mainly to large clusters. In such an environment, separating images effectively improves the architecture, which in turn improves performance. You have to understand why we talk about architecture at all: architecture may serve security, resource allocation, or more disciplined development and management, but in the end it is all about performance.

It is also easy to find the descriptions of MIME types and content length in RFC 1945, the HTTP/1.0 specification, which make the performance impact of images easy to understand.

The poster above is behaving pettily. I hope you will not hide behind "guest" when talking to me; is a grown man afraid of letting others know his name? Besides, even if something I said is wrong, there is no need to pick a fight with words like "nonsense". What matters is communicating and learning from each other. I am no expert, at most an ordinary programmer.

5 Ken Kwei says:
June 3rd, 2006 at 3:42 pm

Hello Michael, I have read this article several times and have a question. Your article contains the following paragraph:

"For highly interactive community sites, making pages static wherever possible is also a necessary way to improve performance. Turning posts and articles into static pages in real time, and regenerating them when they are updated, is a widely used strategy; MOP's Hodgepodge uses it, as do the NetEase communities and others."

For a large site, the databases and web servers are generally distributed and deployed in several regions, and a user visiting from a given region is served by the corresponding node. If community posts are made static in real time and regenerated on update, how are the nodes synchronized immediately? How is this implemented on the database side? If users do not see their post, they will think posting failed and post again. So how do you pin a user to one node, and how do you solve these problems? Thank you.

6 Michael says:
June 3rd, 2006 at 3:57 pm

Pinning a user to a node is usually done through layer-4 switching; if the application is relatively small, it can be done in program code. Large applications typically manage user connections through layer-4 switching such as LVS or hardware switches, and can apply a policy so that a user's connections stay on one node for their lifetime.
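For the "small application, done in program code" case, one simple way to keep a user on one node is to derive the node from the client's IP address. The sketch below is an assumption-laden illustration (the function name and node names are invented); it is similar in spirit to source-address hashing in a load balancer, not a copy of any product's algorithm.

```python
import ipaddress
from typing import List


def node_for_client(client_ip: str, nodes: List[str]) -> str:
    """Map a client IP to one node; the same IP always gets the same
    node as long as the node list does not change."""
    key = int(ipaddress.ip_address(client_ip))  # works for IPv4 and IPv6
    return nodes[key % len(nodes)]
```

A user who has just posted is then sent back to the node that already holds the new post, which avoids the "my post disappeared" problem while the other nodes catch up. Users behind a shared proxy all land on the same node, which is one reason large sites leave this job to the switch.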

There are many strategies for static generation and synchronization. The usual approach is centralized or distributed storage; static pages are generally kept in centralized storage, with a group of front-end proxies handling caching and sharing the load.

7 Javaliker says:
June 10th, 2006 at 6:38 pm

I hope to find time to consult you about website load problems.

8 Barrycmster says:
June 19th, 2006 at 4:14 pm

Great website! Bookmarked! I am impressed by your work!

9 Heiyeluren says:
June 21st, 2006 at 10:39 am

In general, for a medium-sized website with a great deal of interaction and around a million page views (PV), how should the load be handled sensibly?

10 Michael says:
June 23rd, 2006 at 3:15 pm

Heiyeluren on June 21st, 2006 at 10:39 am said:

In general, for a medium-sized website with a great deal of interaction and around a million page views (PV), how should the load be handled sensibly?

If there is a great deal of interaction, consider a cluster plus an in-memory cache: put the constantly changing data that needs to be synchronized into the memory cache and read it from there. The specific design has to be worked out for the specific situation.

11 Donald says:
June 27th, 2006 at 5:39 pm

Excuse me: if a website is still in its technical development phase, which of these optimizations should be implemented first and which later?
Or, in terms of cost (technical, human, and financial), which one gives the greatest effect if implemented first?

12 Michael says:
June 27th, 2006 at 9:16 pm

Donald on June 27th, 2006 at 5:39 pm said:

Excuse me: if a website is still in its technical development phase, which of these optimizations should be implemented first and which later?
Or, in terms of cost (technical, human, and financial), which one gives the greatest effect if implemented first?

Start with the things that are easy to do: server performance tuning and code optimization, including tuning the web server and database server configuration, and static HTML. Squeeze the most out of these first, then consider investing in architecture, such as clustering and load balancing. Those are better considered once the site has developed and accumulated some scale.

13 Donald says:
June 30th, 2006 at 4:39 pm

I see. Thank you, Michael, for your patient explanation.

14 Ade says:
July 6th, 2006 at 11:58 am

Well written, and helpful to others.

15 Ssbornik says:
July 17th, 2006 at 2:39 pm

Very good site. For author!

16 Echonow says:
September 1st, 2006 at 2:28 pm

First, praise: this is a very good article, though I am afraid truly mastering its contents will take time and practice.

Let me ask a question about the image server.

"My billiards site 9tmd.com also separates out its image server. For now the separation is only architectural, not physical, because I have no money to buy more servers :). You can see that image links on the site use URLs such as img.9tmd.com or img1.9tmd.com."

So the original poster's img.9tmd.com is a virtual host, that is, just another service provided by the same Apache? Does that really do much for performance? Or is it only groundwork, to make physical separation easier later?

17 Michael says:
September 1st, 2006 at 3:05 pm

Echonow on September 1, 2006 at 2:28 pm said:

First, praise: this is a very good article, though I am afraid truly mastering its contents will take time and practice.

Let me ask a question about the image server.

"My billiards site 9tmd.com also separates out its image server. For now the separation is only architectural, not physical, because I have no money to buy more servers :). You can see that image links on the site use URLs such as img.9tmd.com or img1.9tmd.com."

So the original poster's img.9tmd.com is a virtual host, that is, just another service provided by the same Apache? Does that really do much for performance? Or is it only groundwork, to make physical separation easier later?

This friend is right. Because there is only one server, real physical separation is impossible, so for now I use a virtual host. It is done for the flexibility of the design and the site architecture: if I get a new server, I only need to mirror or sync the files over and point the DNS for img.9tmd.com at the new server, and the separation happens naturally. If the separation is not built into the architecture and the code now, doing it later will be much more painful.
