High-performance website optimization and system architecture (reprinted)

Source: Internet
Author: User
Tags: website performance, wordpress, blog

System Architecture of large, high-concurrency and high-load websites
When reprinting, please keep the source: Junlin Michael's blog (http://www.toplee.com/blog?p=71)
Trackback URL: http://www.toplee.com/blog/wp-trackback.php?p=71

 
I built a dial-up access platform at CERNET, then worked on front-end development for search engines at Yahoo and 3721, and handled architecture upgrades for the large community site MoP. I have also taken part in developing many large and medium-sized website modules. I have therefore accumulated some experience with how large websites cope with high load and high concurrency, and I would like to discuss it with you.

 
A small website, such as a personal site, can be built with the simplest static HTML pages plus some images for decoration, with all pages stored in a single directory. Such a site makes very simple demands on architecture and performance. As Internet services have grown more diverse, website technology has been subdivided over the years into many specialties; large websites in particular draw on a wide range of techniques, from hardware to software, programming languages, databases, web servers, firewalls and more, with high requirements in every area, far beyond the original simple static HTML site.

Large websites, such as portals, face massive user access and highly concurrent requests. The basic solutions center on a few things: high-performance servers, high-performance databases, efficient programming languages, and high-performance web containers. But these measures alone cannot fundamentally solve the high-load, high-concurrency problems that large websites face.

The approaches above also imply larger investment, and they have bottlenecks and poor scalability. Below I will share some of my experience from the angles of low cost, high performance, and high scalability.

1. Static HTML generation
 
As we all know, pure static HTML pages are the most efficient to serve and consume the fewest resources, so we try to make as many pages on the site static as possible. This simplest method is in fact the most effective one. However, for sites with large amounts of frequently updated content, we cannot generate every page by hand, which is why the common content management system (CMS) appeared. The news channels, and often the other channels, of the portal sites we visit are managed through such information publishing systems, which range from simple information entry with automatic static page generation up to channel management, permission management, automatic crawling, and more. An efficient, manageable CMS is essential for a large website.

Besides portals and information publishing sites, for community sites with high interactivity requirements, making content as static as possible is also a necessary means of improving performance: posts and articles are statically generated in real time and regenerated when updates occur. Strategies like this are widely used; MoP's hodgepodge uses one, and so does the NetEase community. Many blogs nowadays are also statically generated. The WordPress blog program I use has not been made static, which is why www.toplee.com cannot handle high-load access.

 
At the same time, static HTML generation is also a technique used by some caching strategies. It works well for parts of the system that query the database frequently but are rarely updated, such as a forum's public configuration information. Mainstream forum software lets administrators manage this in the back end and stores it in the database; the front end reads it constantly, yet it changes rarely, so it can be regenerated as static content whenever the back end updates it. This avoids a large number of database requests.

When generating static HTML, a compromise is to keep the front end dynamic but regenerate static copies on a schedule, or decide at request time under certain policies whether to serve the static copy. This allows a lot of flexibility. The billiards website www.8zone.cn that I developed works this way: I set static regeneration intervals so that dynamic content is cached as HTML, and the static pages absorb most of the load. This suits small and medium website architectures. Origin site: http://www.8zone.cn. By the way, friends who like billiards, please support my free website :)
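The compromise above can be sketched as follows. This is a minimal illustration, not the actual code behind www.8zone.cn; `CACHE_DIR`, `STATIC_TTL`, and `render_page` are assumed names for this sketch.

```python
import os
import time

CACHE_DIR = "cache"   # hypothetical directory for generated pages
STATIC_TTL = 300      # regenerate at most every 5 minutes

def render_page(page_id):
    # Stand-in for the real dynamic rendering (DB queries, templates, ...).
    return f"<html><body>content for page {page_id}</body></html>"

def get_page(page_id):
    """Serve a cached static HTML file if it is fresh, else regenerate it."""
    path = os.path.join(CACHE_DIR, f"{page_id}.html")
    # Fresh enough: serve the static copy without touching the database.
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < STATIC_TTL:
        with open(path) as f:
            return f.read()
    # Stale or missing: regenerate and write the static copy back.
    html = render_page(page_id)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        f.write(html)
    return html
```

Within the TTL, every request is a plain file read; only the first request after expiry pays the dynamic-rendering cost.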

2. Image server separation
 
As we all know, images are the most resource-consuming part of serving a website, whether the container is Apache, IIS, or something else, so we should separate images from pages. This is what basically all large websites do: one or more dedicated image servers. Such an architecture reduces the pressure on the servers that handle page requests and ensures the system will not crash because of image traffic.

Different configuration optimizations can then be applied to the application server and the image server. For example, Apache on the image server can be built with as few LoadModules as possible and a minimal set of ContentTypes, keeping resource consumption low and execution efficiency high.

My billiards site's new home, 8zone.cn, also uses the image-server separation architecture. Currently it is only separated in architecture, not physically, because there is no money for more servers :). You can see the image links on the old pages are URLs like img.9tmd.com or img1.9tmd.com.

In addition, Lighttpd can be used instead of Apache to serve static pages, images, and JS; it is more lightweight and efficient for this workload.

3. Database clusters and table hashing
Large websites have complex applications, and these applications all rely on databases. Facing massive access, the database bottleneck appears quickly: a single database soon cannot satisfy the application. We then need database clustering or database/table hashing.

For clustering, many databases have their own solutions; Oracle and Sybase both offer good ones. The Master/Slave replication commonly used with MySQL is a similar approach. Whichever DB you use, refer to its corresponding solution.

 
Because the database clusters described above are constrained by the DB vendor in architecture, cost, and scalability, we also need to improve the system from the application side; hashing across databases and tables is the most common and effective approach. We split the database by business, application, or functional module, with different modules mapped to different databases or tables, then hash a page or feature onto smaller databases by some policy, for example hashing the user table by user ID. This improves performance at low cost and scales well. Sohu's forum adopts this architecture: forum users, settings, posts, and other data live in separate databases, posts and users are hashed across databases and tables by board and ID, and in the end a low-cost database server can be added at any time through a simple change in a configuration file to supplement system capacity.
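A minimal sketch of hashing a user table by user ID, as described above. The shard counts and the `user_db_*`/`user_*` naming scheme are illustrative assumptions, not Sohu's actual layout.

```python
# Illustrative shard layout; real deployments read these from a config file
# so capacity can be added without code changes.
NUM_DBS = 4        # physical databases
NUM_TABLES = 16    # user tables per database

def locate_user(user_id):
    """Map a numeric user ID to a (database, table) pair deterministically."""
    db_index = user_id % NUM_DBS
    table_index = (user_id // NUM_DBS) % NUM_TABLES
    return f"user_db_{db_index}", f"user_{table_index}"

# Every user lands in exactly one shard, so a page handler only needs the
# user ID to know which database and table to query.
```

Because the mapping is pure arithmetic, any front-end node can compute it without coordination; the trade-off is that changing `NUM_DBS` later requires redistributing rows.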

4. Caching
The word "cache" is used in many areas; caching is also essential in website architecture and development. Here I describe the two most basic kinds; advanced and distributed caching are described later.

For architecture-level caching, anyone familiar with Apache knows it ships with its own proxy/cache modules, and an external Squid can also be used for caching; both effectively improve Apache's response capability.

memcached on Linux is a common cache solution. Many web programming languages provide memcache client interfaces, including PHP, Perl, C, and Java, so it can be used in web development to cache data and objects, populated in real time or by cron, with flexible policies. Some large communities use this architecture.
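The usual way memcached is used is the cache-aside pattern sketched below. A small in-process stand-in plays the role of the memcached client here so the sketch is self-contained (real clients expose a similar get/set interface); `expensive_query` is a hypothetical slow database call.

```python
import time

class FakeMemcache:
    """Stand-in for a memcached client, mimicking its get/set-with-expiry shape."""
    def __init__(self):
        self._store = {}
    def set(self, key, value, expire=0):
        deadline = time.time() + expire if expire else None
        self._store[key] = (value, deadline)
    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, deadline = item
        if deadline is not None and time.time() > deadline:
            del self._store[key]     # expired: behave as a miss
            return None
        return value

mc = FakeMemcache()

def expensive_query(user_id):
    # Stand-in for a slow database query.
    return {"id": user_id, "name": f"user{user_id}"}

def get_user(user_id):
    """Cache-aside: try the cache first, fall back to the DB and populate."""
    key = f"user:{user_id}"
    user = mc.get(key)
    if user is None:
        user = expensive_query(user_id)
        mc.set(key, user, expire=60)   # cache for 60 seconds
    return user
```

Only the first request within the expiry window hits the database; every subsequent read is served from memory.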

 
In addition, each web language basically has its own cache modules and methods: PHP has PEAR's Cache module, the eAccelerator accelerator/cache, and the well-known APC and XCache (developed in China, support it!) cache modules; Java has even more; I am not very familiar with .NET, but I am sure equivalents exist.

5. Mirroring
 
Mirroring is often used by large websites to improve performance and data availability. It addresses the differences in access speed caused by different network providers and regions; for example, the gap between ChinaNet and CERNET (EduNet) has prompted many websites to set up mirror sites inside CERNET, with data updated periodically or in real time. I will not go deep into mirroring details and techniques here; there are many professional off-the-shelf solutions and products, as well as low-cost software approaches, such as rsync and other tools on Linux.

6. Load balancing
Load balancing will be the ultimate solution for a large website facing high load and a large number of concurrent requests.

Load balancing technology has been developing for many years, with many professional vendors and products to choose from. I have personally used a few solutions, two of which I will describe for your reference. I will not say much about elementary DNS round-robin or about professional CDN architectures.

6.1 Hardware layer-4 switching
 
Layer-4 switching uses the header information of layer-3 and layer-4 packets to identify traffic flows by application, and distributes the traffic of an entire segment to the appropriate application server. The layer-4 switch behaves like a virtual IP address that points at the physical servers. The services it carries can follow a variety of protocols, including HTTP, FTP, NFS, and Telnet, and these services require complex load-balancing algorithms on top of the physical servers. In the IP world, the service type is determined by the TCP or UDP port of the endpoint; in layer-4 switching, the application flow is determined by the source and destination IP addresses together with the TCP and UDP ports.

Among hardware layer-4 switching products there are some well-known choices, such as Alteon and F5. They are expensive but worth the money, providing excellent performance and flexible management. Yahoo China once used three or four Alteons in front of nearly 2,000 servers.

6.2 Software layer-4 switching
Once you understand the principle of hardware layer-4 switching, software layer-4 switches based on the OSI model follow naturally. They implement the same principle with lower performance, but can still comfortably handle a fair amount of load. Some say the software approach is actually more flexible, its processing capability depending entirely on how well you understand the configuration.

For software layer-4 switching we can use LVS, which is common on Linux. LVS, the Linux Virtual Server, provides a heartbeat-based real-time failover solution to improve system robustness, along with flexible virtual IP (VIP) configuration and management that can serve multiple applications at once, which is essential for distributed systems.

A typical load-balancing strategy is to build a Squid cluster behind layer-4 software or hardware switching. Many large websites, including search engines, adopt this idea; the architecture is low-cost and high-performance, scales well, and makes it easy to add or remove nodes at any time. I plan to write up this architecture in detail separately and discuss it with you.
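The dispatch idea behind such a cluster can be sketched as follows. Real layer-4 switches (LVS, Alteon, F5) support richer algorithms such as least-connections and weighted scheduling; this only shows the simplest round-robin case, and the node names are hypothetical.

```python
import itertools

class RoundRobinBalancer:
    """Minimal round-robin dispatch across a pool of cache/proxy nodes."""
    def __init__(self, nodes):
        # itertools.cycle yields the nodes in order, forever.
        self._cycle = itertools.cycle(nodes)
    def pick(self):
        """Return the node that should receive the next request."""
        return next(self._cycle)

# Hypothetical Squid pool behind the layer-4 switch.
squids = ["squid1:3128", "squid2:3128", "squid3:3128"]
lb = RoundRobinBalancer(squids)
```

Adding or removing a node is just a change to the pool list, which is what makes this architecture easy to scale.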

Summary:
For a large website, every method mentioned above may be in use at the same time. This has been only a brief introduction from Michael; many implementation details have to be learned and understood through practice. Sometimes a single small Squid or Apache parameter will have a great impact on system performance.



65 responses to "Talk about the system architecture of large, high-concurrency and high-load websites"
1
Pi1ot says:

April 29th, 2006
Asynchronous queues for inter-module or inter-process communication are also very important: under heavy load they balance response performance against system pressure. Database pressure can be offloaded to the file system through a file cache, and file-system I/O pressure can in turn be offloaded through a memory cache. Very nice.

3
Guest says:

May 1st, 2006 at 8:13 am
All nonsense!
"We all know that for web servers, images are the most resource-consuming, whether it is Apache, IIS, or other containers." Do you think images are generated dynamically in memory? Whatever the file is, when the container serves it, it simply reads the file and writes it to the response. What does that have to do with images in particular?

The key point is that static files and dynamic pages call for different policies. Static files should be cached as much as possible, because no matter how many times they are requested the output is the same; if 20 users request the same page there is no need to generate it 20 times, the cache should serve it. Dynamic pages produce different output per request (otherwise they should be made static), so they should not be cached.

So you can optimize static and dynamic resources on the same server; a dedicated image server makes resource management easier but has nothing to do with performance.

4
Michael says:

May 2nd, 2006
I guess the friend above has never dealt with dynamic caching. When we processed Inktomi search results, what we used was precisely a dynamic cache: for identical keywords and query conditions, this kind of caching matters a great deal. For dynamic content caching, you can set reasonable header parameters in your program, such as the expiration time, to manage the cache policy conveniently.

As for the impact of images on performance: on most pages, images usually account for more traffic than the HTML itself. Given the same network bandwidth, image transfers take longer, and each transfer has to establish a connection, which prolongs the HTTP connections between client and server. For Apache this certainly degrades concurrency, unless all of your responses are static. You can set KeepAlive to Off in httpd.conf to shorten connection-holding time, but if there are many images, the number of connections established increases and performance is still consumed.

Besides, what we discussed applies mostly to large clusters. In such an environment, separating image servers genuinely improves the architecture and therefore performance. Why do we talk about architecture at all? Architecture may also serve security, resource allocation, and more scientific development and management, but the ultimate goal is performance.

In addition, the descriptions of MIME types and Content-Length in RFC 1945, the HTTP/1.0 specification, are easy to find and make the effect of images on performance easy to understand.

As for the friend above: please don't hide behind "Guest" to take shots at me. Afraid of letting people know your name? And even if I were wrong, "nonsense" is no way to point it out. Let's focus on communication and learning; I am no guru either, at most an ordinary programmer.

5
Ken Kwei says:

June 3rd, 2006
Hello, Michael. I have read this article several times and have a question. Your article mentions the following passage:

"For community-type websites with high interactivity requirements, making content as static as possible is also a necessary means of improving performance, with posts and articles statically generated in real time and regenerated when updates occur; MoP's hodgepodge uses such a strategy, as does the NetEase community, and so on."

For a large site, the databases and web servers are generally distributed across multiple regions, with users in one region accessing a nearby node. If community posts are statically generated in real time and regenerated on update, how are the nodes synchronized immediately afterwards? How is this handled at the database level? If a user cannot see their own post, won't they post it again? Can users be pinned to a single node, and how is that done? Thank you.

6
Michael says:

June 3rd, 2006
Pinning a user to a node is implemented through layer-4 switching. In general, a small application can do it in program code; large applications usually manage user connections through layer-4 switching such as LVS or hardware switches, with policies that keep a user's connections on one node for the lifetime of the session.
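The pinning idea can be sketched as a hash of the client's source address, similar in spirit to LVS's source-hashing scheduler; the node names here are hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical application nodes

def node_for_client(client_ip):
    """Pin each client IP to one node via a stable hash of the source address.

    Because the mapping depends only on the client IP, every request from
    the same client reaches the same node for its whole session, without
    any shared session state between nodes.
    """
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]
```

The trade-off is that adding or removing a node remaps some clients, which is why real deployments pair this with session synchronization or graceful draining.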

There are many strategies for static generation and synchronization. The common approach is centralized or distributed storage: static files are written to centralized storage, and a front-end proxy tier then provides caching and relieves the pressure.

As for a typical medium-sized website with many interactive operations and a daily PV of around one million: how should the load be distributed reasonably?

If there is a lot of interaction, consider adding a memory cache tier to the cluster and serving the constantly changing, synchronized data from it for reads; the concrete solution has to be analyzed case by case.

11
Donald says:

June 27th, 2006
If a website is still in its technical development stage, which of the optimization methods above should be implemented first?
In terms of cost (technical, human, and financial), which gives the greatest effect?

12
Michael says:

June 27th, 2006
Donald on June 27th, 2006 said:

If a website is still in its technical development stage, which of the optimization methods above should be implemented first?
In terms of cost (technical, human, and financial), which gives the greatest effect?

Start with server performance tuning and code optimization, including web server and DB server configuration, static HTML generation, and other easy wins; these steps squeeze the maximum utilization out of what you have. Only then consider investing in architecture, such as clusters and load balancing, which are best planned after a certain amount of growth has accumulated.

16
Echonow says:

September 1st, 2006
This is indeed a good article, but it will take time and practice to fully grasp!

Let me first ask a question about the image server!

"My billiards site 9tmd.com also uses the image-server separation architecture. Currently it is only separated in architecture, not physically, because there is no money for more servers :). You can see the image links on the old pages are URLs like img.9tmd.com or img1.9tmd.com."

Here img.9tmd.com is a virtual host, i.e. still a service provided by the same Apache. Does that help performance at all, or is it just groundwork to make physical separation easier later?

17
Michael says:

September 1st, 2006
Echonow on September 1st, 2006 said:

This is indeed a good article, but it will take time and practice to fully grasp!

Let me first ask a question about the image server!

"My billiards site 9tmd.com also uses the image-server separation architecture. Currently it is only separated in architecture, not physically, because there is no money for more servers :). You can see the image links on the old pages are URLs like img.9tmd.com or img1.9tmd.com."

Here img.9tmd.com is a virtual host, i.e. still a service provided by the same Apache. Does that help performance at all, or is it just groundwork to make physical separation easier later?

This friend is right. Because there is currently only one server, physical separation is not possible for the moment, so a virtual host is used to keep the programming and site architecture flexible. If a new server is added later, I only need to copy or sync the images over and point img.9tmd.com's DNS at the new server to complete the separation. Without preparing this way, separating later would be much more painful :)

18
Echonow says:

September 7th, 2006 at 4:59 pm
Thanks for the reply, LZ. The main implementation problem now is how to upload material directly to the image server at upload time, rather than uploading to the web server first and then synchronizing to the image server each time.

19
Michael says:

September 7th, 2006 at 11:25 pm
Echonow on September 7th, 2006 said:

Thanks for the reply, LZ. The main implementation problem now is how to upload material directly to the image server at upload time, rather than uploading to the web server first and then synchronizing to the image server each time.

Samba or NFS is a simple way to do it. Then put a Squid cache in front to reduce the access load, improve effective disk performance, and extend disk life.

20
Echonow says:

September 8th, 2006
Thanks for your patient guidance; I will study it first. Storing the data in a shared area is a good idea!

21
Michael says:

September 8th, 2006
Echonow on September 8th, 2006 said:

Thanks for your patient guidance; I will study it first. Storing the data in a shared area is a good idea!

You are welcome!
