I built a dial-up access platform at CERNET, then worked on front-end platform development for the search engine at Yahoo 3721, and later handled the architecture upgrade and other work for the large "hodgepodge" community at Mop (maopu). I have also worked with and developed modules for many large and medium-sized websites, so I have accumulated some experience with high-load, high-concurrency solutions for large websites, and I would like to discuss it with you.
A small website, such as a personal website, can be built from the simplest static HTML pages, with a few images for decoration and all the pages stored in one directory. The system architecture and performance requirements of such a website are very simple. As Internet services have grown richer, website-related technology has, after years of development, been subdivided into many specialized areas. For large websites in particular, the technology involved covers a wide range, from hardware to software, programming languages, databases, web servers, and firewalls. All of these areas have demanding requirements and are no longer comparable to the original simple static HTML website.
Large websites, such as portals, face a large number of user visits and highly concurrent requests. The basic solutions concentrate on a few points: high-performance servers, high-performance databases, efficient programming languages, and high-performance web containers. However, these measures alone cannot fundamentally solve the high-load and high-concurrency problems that large websites face.
The solutions above also mean greater investment to some extent, and they have bottlenecks and do not scale well. Below I will share some of my experience from the perspectives of low cost, high performance, and high scalability.
1. Static HTML
As we all know, purely static HTML pages are the most efficient and consume the fewest resources, so we try to serve as many pages of our website as possible as static pages. This simplest method is actually the most effective one. However, for websites with a large amount of frequently updated content, we cannot produce every page by hand, which is why the information publishing system (CMS) is so common. The news channels of the portals we often visit, and even their other channels, are managed and generated through a CMS. A CMS can automatically generate static pages from the simplest information input, and it can also provide channel management, permission management, automatic crawling, and other functions. For a large website, an efficient and manageable CMS is essential.
Beyond portals and information publishing websites, community websites with high interactivity requirements also need to make as much content static as possible to improve performance. Rendering posts and articles to static pages in real time, and re-rendering them when they are updated, is a widely used strategy. The Mop hodgepodge community uses this strategy, as does the NetEase community.
Static HTML is also a technique used by some caching strategies. For parts of an application that query the database frequently but whose content rarely changes, consider serving them as static HTML. An example is the public settings of a forum. In current mainstream forums these settings are managed in the back end and stored in the database. The front-end program reads them constantly, but they are updated very rarely. You can regenerate this content as static output whenever it is updated in the back end, which avoids a large number of database requests.
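To make the "re-render when there is an update" idea concrete, here is a minimal sketch in Python. It is not taken from any particular CMS; the template, the `publish_static` function, and the file layout are all hypothetical names chosen for illustration. The assumption is that the back end calls `publish_static` every time an article is created or edited, so the web server only ever serves the resulting file.

```python
import html
from pathlib import Path

# Hypothetical page template; a real CMS would use a template engine.
PAGE_TEMPLATE = (
    "<html><head><title>{title}</title></head>"
    "<body><h1>{title}</h1><p>{body}</p></body></html>"
)


def render_article(title: str, body: str) -> str:
    """Render an article to HTML, escaping user-supplied text."""
    return PAGE_TEMPLATE.format(title=html.escape(title), body=html.escape(body))


def publish_static(out_dir: Path, article_id: int, title: str, body: str) -> Path:
    """Write (or overwrite) the static page for one article.

    The page is written to a temporary file first and then renamed, so a
    request arriving mid-update never sees a half-written page.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{article_id}.html"
    tmp = out_dir / f"{article_id}.html.tmp"
    tmp.write_text(render_article(title, body), encoding="utf-8")
    tmp.replace(target)  # rename over the old page
    return target
```

Calling `publish_static` again with the same `article_id` simply replaces the old page, which is the "re-static on update" step.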
2. Image server separation
As you know, for web servers, whether Apache, IIS, or another container, images consume the most resources. Therefore we need to separate images from pages. This is a strategy that basically all large websites adopt: they have independent image servers, often many of them. This architecture reduces the pressure on the servers that handle page requests and ensures that the system does not go down because of image problems. The application servers and image servers can also be tuned differently. For example, an Apache instance that serves only images can support as few content types and load as few modules (LoadModule) as possible, which keeps resource consumption low and execution efficiency high.
3. Database clusters and database table hashing
Large websites have complex applications, and these applications must use databases. Under a large number of requests, the database quickly becomes the bottleneck, and a single database soon can no longer keep up with the application. At that point we need a database cluster or database table hashing.
For database clusters, many databases have their own solutions. Oracle, Sybase, and others all have good ones, and the Master/Slave replication commonly used with MySQL is a similar approach. Whichever database you use, refer to its corresponding solution for the implementation.
Because the database clusters mentioned above are limited by the database type in terms of architecture, cost, and expansion, we need to improve the system architecture from the application side, and database table hashing is a common and very effective way to do so. We split the database by business, application, or functional module, so that different modules correspond to different databases or tables. Then, according to some policy, we hash a given page or function into smaller databases or tables. For example, the user table can be hashed by user ID. This improves both performance and scalability at low cost. Sohu's forum adopts this architecture: it separates forum users, settings, posts, and other information into different databases, and then hashes the post and user databases and tables by section and ID. With a simple change to a configuration file, the system can add another low-cost database at any time to increase capacity.
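As a rough illustration of hashing a user table by user ID, the Python sketch below maps an ID to a database and a table. The database names, the number of tables per database, and the modulo policy are all assumptions made for this example; they are not Sohu's actual scheme. In practice the shard list would come from the configuration file mentioned above.

```python
import zlib

# Hypothetical shard layout; in practice this would be loaded from a config file.
USER_DATABASES = ["user_db_0", "user_db_1", "user_db_2", "user_db_3"]
POST_DATABASES = ["post_db_0", "post_db_1"]
TABLES_PER_DATABASE = 8


def shard_for_user(user_id: int) -> tuple[str, str]:
    """Pick the database and table that hold a given user's row."""
    db_count = len(USER_DATABASES)
    database = USER_DATABASES[user_id % db_count]
    table = f"users_{(user_id // db_count) % TABLES_PER_DATABASE}"
    return database, table


def shard_for_post(section: str, post_id: int) -> tuple[str, str]:
    """Pick the database by forum section and the table by post ID.

    zlib.crc32 is used instead of the built-in hash() because it gives
    the same value in every process.
    """
    database = POST_DATABASES[zlib.crc32(section.encode("utf-8")) % len(POST_DATABASES)]
    table = f"posts_{post_id % TABLES_PER_DATABASE}"
    return database, table
```

One design caveat: with a plain modulo policy, adding a database changes where existing IDs map, so existing rows have to be migrated or the policy has to account for the old layout.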
4. Cache
The word cache comes up in many areas, and caching is very important in both website architecture and website development. Here I will first describe the two most basic kinds of cache. Advanced and distributed caching are described later.
For caching at the architecture level, anyone familiar with Apache knows that Apache provides its own cache module. You can also add an external Squid cache. Both methods can effectively improve Apache's response to requests.
For caching in website program development, the memory cache available on Linux is a commonly used cache interface. For example, in Java development you can call the memory cache for data caching and for sharing data between components. Some large communities use this architecture. In addition, web development languages basically all have their own cache modules and methods. PHP has the PEAR Cache module, and Java has even more options. I am not very familiar with .NET, but I believe it must have them too.
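The basic pattern behind program-level data caching can be sketched in a few lines of Python. This is an in-process stand-in, not the interface of any particular cache server or language module; the `TTLCache` class and its method names are hypothetical. The idea is to look in the cache first, fall back to the expensive loader (typically a database query) on a miss, and keep the result for a limited time.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """A tiny in-process cache with a fixed time-to-live per entry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        """Return the cached value, calling loader() only on a miss or expiry."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop an entry, for example right after the underlying data is updated."""
        self._store.pop(key, None)
```

A shared cache server plays the same role across many web servers; the lookup, load, and invalidate steps stay the same.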
5. Mirroring
Mirroring is commonly used by large websites to improve performance and data security. Mirroring can solve the differences in access speed between users of different network providers and in different regions. For example, the difference between ChinaNet and EduNet has prompted many websites to build mirror sites inside CERNET, with data updated on a schedule or in real time. I will not go into the details of mirroring here. There are many professional off-the-shelf solutions and products to choose from, and there are also low-cost software approaches, such as rsync and similar tools on Linux.
6. Server load balancing
Load balancing is the ultimate solution for large websites dealing with high-load access and a large number of concurrent requests.
Load balancing technology has been developing for many years, and there are many professional vendors and products to choose from. I have personally worked with several solutions, including the two architectures below, which I offer for your reference.
Hardware layer-4 switching
Layer-4 switching uses the header information of layer-3 and layer-4 packets to identify traffic flows by application session, and assigns all the traffic of a session to a suitable application server for processing. A layer-4 switch acts like a virtual IP address that points to physical servers. The traffic it carries uses a variety of protocols, including HTTP, FTP, NFS, Telnet, and others, and these services need load-balancing algorithms that take the physical servers into account. In the IP world, the service type is determined by the destination TCP or UDP port, and the application session in layer-4 switching is determined jointly by the source and destination IP addresses and the TCP or UDP ports.
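The way a session is pinned to one server can be sketched as follows. This Python example uses a simple hash of the session tuple; that choice is an assumption made for illustration, since real layer-4 switches offer several algorithms (round robin, least connections, and so on) and implement them in hardware or in the kernel. The `pick_backend` function is a hypothetical name.

```python
import zlib
from typing import Sequence


def pick_backend(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                 protocol: str, backends: Sequence[str]) -> str:
    """Map one session (addresses, ports, protocol) to a backend server.

    Every packet of the same session produces the same key, so the whole
    session is handled by the same physical server.
    """
    if not backends:
        raise ValueError("at least one backend is required")
    key = f"{protocol}|{src_ip}:{src_port}|{dst_ip}:{dst_port}".encode("ascii")
    return backends[zlib.crc32(key) % len(backends)]
```

Clients keep connecting to the single virtual address (`dst_ip` here), while the function decides which physical server actually receives the traffic.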
Among hardware layer-4 switching products there are some well-known choices, such as Alteon and F5. These products are expensive but worth the money: they provide excellent performance and flexible management. Yahoo China handled nearly 2000 servers with only three or four Alteon units.
Software layer-4 switching
Once you understand how hardware layer-4 switches work, software layer-4 switching based on the OSI model follows naturally. The principle is the same, but the performance is somewhat lower. It can still handle a fair amount of load with ease. Some people say the software approach is actually more flexible, and that its processing capacity depends entirely on how well you know the configuration.
For software layer-4 switching we can use LVS, which is commonly used on Linux. LVS stands for Linux Virtual Server. It provides heartbeat-based real-time failover, which improves the robustness of the system, and it offers flexible virtual IP (VIP) configuration and management that can serve several applications at once. This is essential for distributed systems.
A typical load balancing strategy is to build a Squid cluster on top of software or hardware layer-4 switching. Many large websites, including search engines, adopt this approach. The architecture is low-cost, high-performance, and highly scalable, and it is easy to add or remove nodes at any time. I plan to write up the details of this architecture separately and discuss it with you.
For large websites, every method mentioned above may be in use at the same time. I have only given a brief introduction here. The implementation involves many details that you need to know well. Sometimes a single small Squid or Apache parameter has a great impact on system performance. I hope we can discuss these topics together and get a good exchange going.
A very good article. It covers basically everything a large website needs to do. I have also worked at one of the three major portals, where I managed more than 100 Squid servers, and I would like to share my experience and opinions.
1. Image server separation
I have always supported this idea. If the program and the images sit on the same Apache server, every image request may tie up an httpd process, and if httpd includes the PHP module, each process occupies a lot of memory for no reason.
An independent image server not only avoids this situation, it also lets you set different expiration times for images that are used in different ways. The same user then does not fetch the image from the server again when the same image appears on different pages (the cache answers the request instead), which is both faster and saves bandwidth. The cache time can also be adjusted as needed.
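One way to express "different expiration times for different kinds of images" is a small lookup that produces the `Cache-Control` header for a request path. The Python sketch below is only an illustration: the categories, the lifetimes, and the convention that the first path segment names the image type are all assumptions, not the configuration of any real image server.

```python
# Hypothetical lifetimes per image category, in seconds.
IMAGE_TTL_SECONDS = {
    "logo": 30 * 24 * 3600,   # site chrome that almost never changes
    "photo": 7 * 24 * 3600,   # user photos
    "avatar": 3600,           # avatars, which users change more often
}
DEFAULT_TTL_SECONDS = 600


def cache_headers(path: str) -> dict:
    """Return caching headers for an image, based on its top-level directory."""
    category = path.strip("/").split("/", 1)[0]
    ttl = IMAGE_TTL_SECONDS.get(category, DEFAULT_TTL_SECONDS)
    return {"Cache-Control": f"public, max-age={ttl}"}
```

For example, `cache_headers("/avatar/1.jpg")` yields a one-hour lifetime, while paths under `/logo/` are cached for thirty days. The same policy could equally be written in the web server's or cache server's own configuration.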
On the image servers I used to manage, we not only separated images from applications and pages, we also used different domain names for different types of images, to spread the load of each image type. For example, the domain photo.img.domain.com serves the photography service and normally runs on five cache servers. After the May 1 (5.1) long holiday, however, it may need to grow to ten independent hosts. The five extra servers can be borrowed temporarily from other, lightly loaded image servers.
2. Database Cluster
A clustered Oracle RAC setup costs about 40 W (400,000). For ordinary companies this expense is unnecessary, because web application logic is relatively simple and the value of a large Oracle database lies in data mining rather than simple storage. In practice, MySQL or PostgreSQL is the more sensible choice.
Simple MySQL replication already gives good results: read from the slaves, and go to the master only when writing. In real use MySQL replication performs very well and basically does not introduce much update latency. With the balance software (http://www.inlab.de/balance.html) listening on port 3306 locally (127.0.0.1) and mapping that port to multiple slave databases, you can load-balance the reads.
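The "read from slaves, write to master" rule can also live in the application. The following Python sketch is a hypothetical router, not part of MySQL or of the balance tool: it looks only at the first keyword of a statement and hands reads to the slaves in round-robin order. The host names are placeholders.

```python
import itertools
from typing import Sequence

READ_VERBS = ("SELECT", "SHOW")


class ReplicatedRouter:
    """Choose a database host for a statement: slaves for reads, master for writes."""

    def __init__(self, master: str, slaves: Sequence[str]) -> None:
        if not slaves:
            raise ValueError("at least one slave is required")
        self._master = master
        self._slaves = itertools.cycle(slaves)  # simple round-robin over the slaves

    def route(self, sql: str) -> str:
        words = sql.split(None, 1)
        verb = words[0].upper() if words else ""
        if verb in READ_VERBS:
            return next(self._slaves)
        return self._master  # INSERT, UPDATE, DELETE, DDL, and anything unknown


router = ReplicatedRouter("master.example:3306",
                          ["slave1.example:3306", "slave2.example:3306"])
```

Note that a read issued immediately after a write may not yet see the change on a slave, so reads that must observe the latest data should be sent to the master explicitly.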
3. Are images stored on disks or databases?
I have thought about this question carefully. With the ext3 file system you run into a limit on the number of directories that can be created, whereas XFS has no such limit. If you need to store a large number of images, you must spread them across many small directories, otherwise you hit the ext3 limit of about 3 W (30,000) directories. In addition, too many files and directories hurt disk performance. There are no problems such as space waste.
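A common way to spread files across many small directories is to derive the directory names from a hash of the file name. The Python sketch below is one such scheme, chosen for illustration; the two-level layout, the use of MD5, and the `image_path` function name are assumptions rather than anything prescribed above. With two hex characters per level, each directory holds at most 256 subdirectories, far below the ext3 limit.

```python
import hashlib
from pathlib import PurePosixPath


def image_path(root: str, filename: str, levels: int = 2) -> PurePosixPath:
    """Return the storage path for an image, fanned out by hash prefix.

    MD5 is used here only to spread names evenly, not for security.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)
```

Because the path depends only on the file name, any server can compute where an image lives without a lookup table.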
More importantly, backing up a huge number of small files consumes a lot of resources and takes a very long time. Given these problems, saving images in the database may be another option.
You can try saving the images in the database, using a PHP program to return the actual image to the front end, and placing a Squid server in front of it to avoid performance problems. You can then use MySQL replication to back up the images, which solves the backup problem effectively.
4. Static pages: I will not say much here. My own WordPress site is fully static, and it still handles dynamically generated data well.
5. Cache
I have proposed using memcached before, but it was not very satisfactory in actual use. Of course, results may differ between application environments, so this is not important. Use it if you find it works well for you.
6. Software layer-4 switching
LVS performs very well. A friend's website uses LVS as the scheduler for load balancing. The data volume is very large and LVS handles it easily. Of course, DR (direct routing) mode is used.
In fact, I have also thought about using LVS for CDN scheduling. For example, a BGP data center in Beijing accepts the user's request, and LVS in TUN (IP tunneling) mode dispatches the request to the real physical servers in a China Telecom or China Netcom data center, which return the data directly to the user.
This is WAN scheduling, and F5 hardware devices use this technique too. With LVS, however, the cost is greatly reduced.