The evolution of large website image server architecture


On mainstream websites, images are an indispensable element of almost every page, and large sites in particular must all face the storage, access, and related technical problems posed by "massive image resources." When extending an image server architecture, teams often take many detours and learn painful lessons (especially when early planning is insufficient, leaving the later architecture hard to keep compatible and hard to extend).

This article shares with everyone the real evolution of the image server architecture of a vertical portal website.

Websites built on the Windows platform are often regarded as "conservative," or even somewhat backward, by many people in the industry. A large part of the reason lies in the closed nature of the Microsoft technology stack and the short-sightedness of some technical staff (mainly a human problem, of course). The long absence of open-source support forces many developers to work "behind closed doors," which easily leads to narrow thinking and skill gaps. Take the image server as an example: without capacity planning and an extensible design, then as image files and traffic grow, the lack of design for performance, fault tolerance/disaster recovery, scalability, and so on will create many problems for development, operations, and maintenance later, and in serious cases can even affect the normal operation of the website's business and the growth of the company (this is not alarmist).

Many companies choose the Windows (.NET) platform to build websites and image servers, mostly because of the founding team's technical background: the early engineers may have been more familiar with .NET, or the team leader believed that Windows/.NET's ease of use, "rapid development" model, and talent costs better suited an early-stage start-up, so Windows was the natural choice. Once the business reaches a certain scale, it becomes difficult to move the overall architecture to other open-source platforms. Of course, for building large-scale Internet systems an open-source architecture is recommended, because there are many mature cases and the support of the open-source ecosystem (there will be plenty of pitfalls; the question is whether you hit them first yourself, or use components that others have already hit and fixed), avoiding reinventing the wheel and paying high licensing costs. For applications that are hard to migrate, a hybrid architecture of Linux, Mono, Jexus, MySQL, Memcached, Redis, and so on can also support Internet applications with high concurrent access and large data volumes.

Image server architecture in the single-machine era (centralized)

In the start-up period, because of time constraints and the limited experience of the developers, an upload subdirectory is usually created directly under the directory where the website files reside, to store user-uploaded image files. If subdivision by business is needed, different subdirectories can be created under the upload directory, for example: upload\qa, upload\face, and so on.

The relative path, such as "upload/qa/test.jpg", is also saved to a database table.

Users access the file as follows:

http://www.yourdomain.com/upload/qa/test.jpg

How the program uploads and writes the file:

Programmer A configures the physical directory D:\Web\yourdomain\upload in Web.config and then writes the file through a stream;

Programmer B obtains the physical directory from the relative path via Server.MapPath, and then writes the file through a stream.
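The original code was .NET (Web.config plus Server.MapPath); as a language-neutral illustration of the same idea, here is a minimal Python sketch: write the uploaded bytes under an upload/<business> subdirectory and keep only the relative path for the database record. The directory names and the in-memory "database" list are assumptions for illustration.

```python
import os

WEB_ROOT = "/tmp/yourdomain"       # stands in for D:\Web\yourdomain
db_records = []                    # stands in for the database table

def save_upload(business, filename, data):
    """Write an uploaded file under upload/<business>/ and record its relative path."""
    rel_path = f"upload/{business}/{filename}"           # e.g. upload/qa/test.jpg
    abs_path = os.path.join(WEB_ROOT, rel_path)
    os.makedirs(os.path.dirname(abs_path), exist_ok=True)
    with open(abs_path, "wb") as f:                      # "write the file through a stream"
        f.write(data)
    db_records.append(rel_path)                          # only the relative path is stored
    return rel_path

rel = save_upload("qa", "test.jpg", b"\xff\xd8fake-jpeg-bytes")
print(rel)                                               # upload/qa/test.jpg
print("http://www.yourdomain.com/" + rel)                # the user-facing URL
```

Because only the relative path is stored, the same database record keeps working later when the physical location changes (virtual directory, shared storage, or a separate image domain).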

Pros: the simplest implementation. Without any complex technology, a user-uploaded file can be written to the specified directory, and saving the database record and accessing the file are also very convenient.

Cons: the upload logic is ad hoc and inconsistent, which seriously hinders extending the website.

The most primitive architecture described above mainly faces the following problems:

    1. As more and more files accumulate in the upload directory, the partition (for example, D:) is hard to scale when capacity runs out; the only option is to take the site offline, replace the storage with a larger-capacity device, and then import the old data.
    2. When deploying a new version (which requires backing up the site files first) and during daily backups of the website files, the files in the upload directory must be handled at the same time. And if, to cope with growing traffic, a load-balanced cluster of multiple web servers is deployed, keeping the files synchronized in real time across the cluster nodes becomes a challenge.

Image server architecture in the era of clustering (real-time synchronization)

Under the website, create a virtual directory named upload. Thanks to the flexibility of virtual directories, it can to some extent replace the physical directory while remaining compatible with the original image upload and access code. The user-facing URL stays the same:

http://www.yourdomain.com/upload/qa/test.jpg

Advantages: more flexible configuration, and compatibility with the old version's upload and access methods.

Because it is a virtual directory, it can point to any directory on any local drive letter. This allows a single machine to be scaled up by mounting external storage.

Disadvantage: when deployed as a cluster of multiple web servers, the virtual directories on the web servers (cluster nodes) must be synchronized with each other in real time. Because of the limits of synchronization efficiency and latency, it is difficult to guarantee that at any given moment the files on every node are exactly the same.

The basic architecture looks like this:

As you can see, the web server layer itself is already "scalable and highly available"; the main problems and bottlenecks are concentrated in file synchronization between the servers.

The architecture above can only do "incremental synchronization" between the web servers, so delete and update operations on files are not propagated by the synchronization.

The early idea was to solve it at the application level: when a user's upload is written to the WEB1 server, the upload interfaces of the other web servers are called at the same time. This was obviously not worth the trouble, so we chose the rsync tool to do scheduled file synchronization, saving the cost of "reinventing the wheel" and reducing risk.

For synchronization, there are two classic models, push and pull: "pull" means polling to fetch updates, while "push" means actively pushing changes to the other machines after they happen. Of course, a more advanced event-notification mechanism can also be used to accomplish this kind of action.
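As a toy sketch of the "pull" model (and of the limitation noted above, that plain incremental copying cannot propagate deletes), here is a minimal mtime-based sync in Python. The /tmp paths stand in for two cluster nodes and are assumptions for illustration; real deployments would use rsync.

```python
import os
import shutil

def pull_sync(src, dst):
    """Incrementally 'pull' new or updated files from src into dst (mtime-based).

    Note: files deleted on src are NOT removed from dst -- plain incremental
    copying cannot propagate deletes."""
    copied = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(dst, os.path.relpath(s, src))
            if not os.path.exists(d) or os.path.getmtime(s) > os.path.getmtime(d):
                os.makedirs(os.path.dirname(d), exist_ok=True)
                shutil.copy2(s, d)            # copy2 preserves the mtime
                copied.append(os.path.relpath(s, src))
    return copied

# demo: node 2 periodically pulls from node 1
shutil.rmtree("/tmp/web1", ignore_errors=True)
shutil.rmtree("/tmp/web2", ignore_errors=True)
os.makedirs("/tmp/web1/upload/qa")
with open("/tmp/web1/upload/qa/a.jpg", "wb") as f:
    f.write(b"img")
print(pull_sync("/tmp/web1", "/tmp/web2"))    # ['upload/qa/a.jpg']
os.remove("/tmp/web1/upload/qa/a.jpg")        # delete on the source node...
pull_sync("/tmp/web1", "/tmp/web2")
print(os.path.exists("/tmp/web2/upload/qa/a.jpg"))  # True -- the delete did not propagate
```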

In scenarios with highly concurrent writes, synchronization has efficiency and latency problems, and synchronizing large numbers of files consumes system and bandwidth resources (even more noticeably across network segments).

Image server architecture improvements in the cluster era (shared storage)

Use a virtual directory pointing to a UNC (network) path to implement shared storage (point the upload virtual directory at the UNC path).

User access Mode 1:

http://www.yourdomain.com/upload/qa/test.jpg

User access Mode 2 (can be configured with a separate domain name):

http://img.yourdomain.com/upload/qa/test.jpg

On the server hosting the UNC share, a standalone domain name can be configured and a lightweight web server set up, turning it into a standalone image server.

Pros: performing reads and writes through the UNC (network) path avoids the synchronization problems between multiple servers. It is relatively flexible and supports capacity expansion. It can be configured as a standalone image server with its own domain name, and it remains fully compatible with the old access rules.

Cons: the UNC configuration is cumbersome and incurs some performance loss (in reads/writes and security checks). There may be a "single point of failure," and if the storage layer has no RAID or more advanced disaster-recovery measures, data can be lost.

The basic architecture looks like this:

In the early days, many sites built on the open-source Linux stack used NFS if they did not want to synchronize images. NFS has proven problematic under high concurrency, heavy read/write loads, and mass storage, and is not the best choice, so most Internet companies do not use NFS for this kind of application. It can also be done with Windows' built-in DFS, whose drawbacks are "complex configuration, low efficiency, and a lack of large-scale real-world cases." In addition, some companies use FTP or Samba.

In all of the architectures above, upload/download operations go through the web server (even though in the shared-storage architecture a separate domain name and site can serve image reads, upload writes still pass through the web application). This undoubtedly puts enormous pressure on the web server. It is therefore recommended to use a separate image server and a separate domain name to provide image upload and access.

Benefits of a standalone image server / standalone domain name
    1. Image access consumes significant server resources (it involves operating-system context switches and disk I/O). After separation, the web/app server can concentrate on dynamic processing.
    2. Independent storage makes capacity expansion, disaster recovery, and data migration more convenient.
    3. Browsers limit the number of concurrent connections per domain name; a separate image domain works around this performance penalty.
    4. When images are requested, the cookies attached to every request under the main domain also cause a performance penalty; a separate (cookie-free) image domain avoids it.
    5. It is easy to load-balance image requests, convenient to apply various caching strategies (HTTP headers, proxy caches, etc.), and easier to migrate to a CDN later.

......

We can use a lightweight web server such as Lighttpd or Nginx to build the standalone image server.
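A minimal Nginx server block for such a standalone image server might look like the sketch below; the domain name and storage path are assumptions for illustration, and the cache headers anticipate the caching strategies mentioned above.

```nginx
server {
    listen 80;
    server_name img.yourdomain.com;    # the standalone image domain

    root /data/images;                 # the shared / standalone image storage
    access_log off;                    # image requests generate huge log volume

    location /upload/ {
        expires 30d;                   # let browsers and proxies cache aggressively
        add_header Cache-Control "public";
    }
}
```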

The current image server architecture (distributed file system + CDN)

Before building the current image server architecture, the web server could be set aside entirely and a separate image server/domain name configured. However, the following problems remained:

    1. What about the old image data? Can the old image path access rules remain compatible?
    2. A standalone image server needs to provide its own upload/write interface (published as a service API); how is its security guaranteed?
    3. Likewise, with more than one standalone image server, should a scalable shared-storage solution or a real-time synchronization mechanism be used?

These problems are simplified by the popularity of application-level (not system-level) distributed file systems (for example, FastDFS, HDFS, MogileFS, MooseFS, TFS): they perform redundant backups, support automatic synchronization and linear scaling, provide client APIs for uploads/downloads/deletes in mainstream languages, and some also support file indexing or HTTP-based access.

Considering the characteristics of each DFS, client API language support (C# support was required), documentation and cases, and community support, we finally chose FastDFS for deployment.

The only remaining problem is that the old access rules might not be compatible. The old images could be imported into FastDFS in one pass, but because the old image access paths are scattered across various tables in different business databases, updating them all is very difficult, so the old access rules must remain supported. Upgrading an architecture is often harder than building a new one, precisely because it must also stay compatible with previous versions. (Changing the engine in mid-air is much harder than building a new plane.)

The solution is as follows:

First, close the old version's upload entry point (to avoid data inconsistency from continued use). Then migrate the old image data once, via rsync, to the standalone image server (that is, the "legacy image server" described above). At the front end (a layer-7 proxy such as HAProxy or Nginx), use ACLs (access rule control) to match requests for old images against URL rules (regular expressions) and forward those requests directly to a specified list of web servers. On the servers in that list, configure sites that serve the images over HTTP and add a caching policy. This separates and caches the legacy image servers, stays compatible with the old image access rules, improves access efficiency for the old images, and avoids the problems caused by real-time synchronization.
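Sketched as an Nginx front-end rule (HAProxy would use similar ACLs), the routing just described might look like the fragment below; the upstream addresses and the URL pattern are assumptions for illustration.

```nginx
upstream legacy_image_servers {        # servers holding the migrated old images
    server 10.0.0.11:80;
    server 10.0.0.12:80;
}
upstream fastdfs_web_tier {            # the new FastDFS-backed image tier
    server 10.0.1.11:80;
}

server {
    listen 80;
    server_name img.yourdomain.com;

    # old-style paths (e.g. /upload/qa/test.jpg) match the legacy URL rule
    location ~ ^/upload/ {
        proxy_pass http://legacy_image_servers;
        # plus a caching policy (proxy_cache zone) for these immutable old images
    }

    # everything else is served by the new tier
    location / {
        proxy_pass http://fastdfs_web_tier;
    }
}
```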

Overall architecture

Although the FastDFS-based standalone image server cluster architecture was already very mature, because of China's "north-south interconnection" problem and IDC bandwidth costs (images consume a lot of traffic), we eventually adopted a commercial CDN. The implementation is very easy and the principle is actually simple; here is a brief introduction:

CNAME the image domain to the domain designated by the CDN vendor. When a user requests an image, the CDN vendor's intelligent DNS resolution returns the address of the nearest service node (there may of course be other, more complex policies, such as load and health status). The user's request then arrives at that node, which provides a proxy-cache service similar to Squid/Varnish: if the path is being requested for the first time, the image is fetched from the origin (source station) and returned to the client browser; if it is already in the cache, it is served directly from the cache, completing the request/response cycle.
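The cache-or-origin decision each CDN edge node makes can be sketched as follows (a toy model; origin_fetch stands in for the real back-to-origin HTTP request):

```python
cache = {}                                  # the edge node's local cache (toy model)

def origin_fetch(path):
    """Stand-in for fetching the image from the origin (source station)."""
    return b"bytes-of-" + path.encode()

def edge_get(path):
    """Serve from cache if present; otherwise fetch from origin and cache it."""
    if path in cache:
        return cache[path], "HIT"
    body = origin_fetch(path)
    cache[path] = body
    return body, "MISS"

print(edge_get("/upload/qa/test.jpg")[1])   # MISS: first request goes back to origin
print(edge_get("/upload/qa/test.jpg")[1])   # HIT: second request served from cache
```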

Because we used a commercial CDN service, we did not build our own front-line proxy cache with Squid/Varnish.

The whole cluster architecture above can be scaled out easily and can meet the image-serving needs of large websites in typical vertical domains (a super-large site like Taobao may, of course, be a different matter). In our tests, a single Nginx server (quad-core Xeon E5 CPU, 16 GB RAM, SSD) serving small static pages (about 10 KB after compression) handled thousands of concurrent requests without strain. Of course, since an image is much larger than a plain-text static page, the concurrency an image server can sustain is usually limited by the disk's I/O throughput and the bandwidth the IDC provides. Nginx's concurrency handling is very strong and its resource consumption very low, especially for static resources, so there seems little to worry about there. Depending on actual traffic, further optimization is possible by tuning Nginx parameters, tuning the Linux kernel, and adding a tiered caching strategy; capacity can also be expanded by adding servers or upgrading their configuration, or most directly by purchasing higher-end storage devices and more bandwidth to meet larger traffic demands.
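The Nginx-side knobs mentioned above typically include settings like the following (the values are illustrative assumptions, to be tuned against real traffic, not recommendations from the original team):

```nginx
worker_processes auto;            # one worker per CPU core

events {
    worker_connections 10240;     # raise the per-worker connection ceiling
}

http {
    sendfile on;                  # kernel-level file transfer, avoids user-space copies
    tcp_nopush on;                # send headers and file start in one packet
    open_file_cache max=10000 inactive=30s;   # cache descriptors/metadata for hot images
    expires 30d;                  # let clients and proxies cache images
}
```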

It is worth mentioning that now that "cloud computing" is popular, fast-growing websites are also advised to use "cloud storage" solutions, which can solve all kinds of storage, scaling, and disaster-recovery problems and provide CDN acceleration as well. Most importantly, the price is not too high.

To summarize, extending an image server architecture revolves roughly around these issues:

    1. Capacity planning and scaling issues.
    2. Data synchronization, redundancy, and disaster recovery.
    3. The cost and reliability trade-offs of hardware (ordinary mechanical hard drives, SSDs, or higher-end storage devices and solutions).
    4. The choice of file system: based on file attributes such as file size and read/write ratio, choose from ext3/4 or open-source (distributed) file systems such as NFS/GFS/TFS.
    5. Accelerating image access: a commercial CDN, or a self-built proxy cache / static web cache architecture.
    6. Compatibility with old image paths and access rules, application-level scalability, upload and access performance and security, and so on.

Reprinted from: http://www.cnblogs.com/dinglang/p/4608915.html

