On mainstream websites, images are an indispensable page element; large websites in particular almost all face technical problems around the storage and access of "massive image resources". As it expands, an image server architecture will go through many twists and turns, sometimes even tears (especially when a lack of early planning makes the later architecture hard to keep compatible and scalable).
This article shares the real evolution of the image server architecture at a vertical portal website.
Websites built on the Windows platform are often regarded as "conservative", even somewhat backward, by many in the industry. Much of this stems from the closed nature of Microsoft's technology stack and the short-sightedness of some technical staff (it is mainly a people problem, of course). With little open-source support for a long time, many teams could only "build the car behind closed doors", which easily breeds limited and narrow thinking. Take image servers as an example: without capacity planning and scalable design in the early stage, then as image files and access volume grow, insufficient design in performance, fault tolerance/disaster recovery, and scalability will confront development and operations with many problems later on, and in severe cases may even affect the normal operation of the website and the development of the Internet company (this is not alarmist).
Many companies choose the Windows (.NET) platform to build websites and image servers largely because of the founding team's technical background: the early technical staff may be more familiar with .NET, or the team lead may feel that the ease of use of Windows/.NET, its "short and fast" development model, talent costs, and so on better suit an early startup team, so Windows is the natural choice. Once the business grows to a certain scale, migrating the overall architecture to an open-source platform is difficult. That said, we do recommend building large-scale Internet applications on an open-source architecture, because of the many mature cases and the support of the open-source ecosystem (there are also plenty of pitfalls, but rather than being the first to step into them, you can adopt components after others have fixed the issues), avoiding reinventing the wheel and high licensing fees. For applications that are hard to migrate, I personally recommend a mixed architecture of Linux, Mono, Jexus, MySQL, Memcached, Redis, and the like, which can also support Internet applications with high concurrency and large data volumes.
Image server architecture in the standalone era (centralized)
During the start-up period, time was tight and the developers' experience limited, so an upload subdirectory was simply created under the website's directory to store the image files uploaded by users. To subdivide further by business, different subdirectories can be created under upload, for example upload\QA, upload\Face, and so on.
A relative path such as "upload/qa/test.jpg" is also saved in the database table.
The user's access method is as follows:
http://www.yourdomain.com/upload/qa/test.jpg
How the program uploads and writes the file:
Programmer A configures the physical directory D:\web\yourdomain\upload in Web.config and then writes the file through a stream;
Programmer B obtains the physical directory from the relative path via Server.MapPath and then writes the file through a stream.
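For illustration, a minimal C# sketch of the two approaches (the appSettings key, class name, and paths are hypothetical):

    // Hypothetical sketch of the two write approaches described above.
    using System.Configuration;
    using System.IO;
    using System.Web;

    public static class UploadHelper
    {
        // Approach A: physical root read from <appSettings> in Web.config,
        // e.g. <add key="UploadRoot" value="D:\web\yourdomain\upload" />
        public static void SaveViaConfig(Stream input, string relativePath)
        {
            string root = ConfigurationManager.AppSettings["UploadRoot"];
            string fullPath = Path.Combine(root, relativePath);
            Directory.CreateDirectory(Path.GetDirectoryName(fullPath));
            using (var fs = File.Create(fullPath))
                input.CopyTo(fs); // write the uploaded stream to disk
        }

        // Approach B: physical path resolved from the relative path via Server.MapPath.
        public static void SaveViaMapPath(Stream input, string relativePath)
        {
            string fullPath = HttpContext.Current.Server.MapPath("~/upload/" + relativePath);
            Directory.CreateDirectory(Path.GetDirectoryName(fullPath));
            using (var fs = File.Create(fullPath))
                input.CopyTo(fs);
        }
    }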
Advantage: it is the simplest to implement; user-uploaded files can be written to the specified directory without any complicated techniques, and saving database records and handling access are convenient as well.
Disadvantage: the upload handling is disorganized, which works against future expansion of the website.
This most primitive architecture faces the following problems:
As files in the upload directory multiply, the capacity of that partition (drive D, for example) is hard to expand; the only option is to stop the service, replace it with larger storage, and then import the old data.
When deploying a new version (the site must be backed up beforehand) and during routine backups of website files, the files in the upload directory have to be handled at the same time; and once a load-balanced cluster of multiple Web servers is deployed at the back end, real-time file synchronization between the cluster nodes becomes a problem.
Image server architecture in the cluster era (real-time synchronization)
Create a virtual directory named upload under the website. Thanks to the flexibility of virtual directories, it can replace the physical directory to some extent and stay compatible with the original image upload and access methods. The user's access method is still:
http://www.yourdomain.com/upload/qa/test.jpg
Advantage: the configuration is more flexible and remains compatible with the upload and access methods of earlier versions.
The virtual directory can point to any directory under any local drive letter, so single-machine capacity can be expanded by attaching external storage (a command-line sketch follows this list).
Disadvantage: when deployed as a cluster of multiple Web servers, the files under the virtual directory must be synchronized in real time across the Web servers (cluster nodes); given the limited efficiency and timeliness of synchronization, it is hard to guarantee that the files on every node are fully consistent at any given moment.
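As a sketch, on IIS 7 and later such a virtual directory can be created from the command line (the site name and physical path here are hypothetical):

    REM Map the /upload virtual directory to a directory on another drive (hypothetical paths).
    %windir%\system32\inetsrv\appcmd.exe add vdir /app.name:"yourdomain/" /path:/upload /physicalPath:"E:\storage\upload"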
As you can see, the Web server tier of this architecture is itself "scalable and highly available"; the main problems and bottlenecks concentrate on file synchronization between the servers.
In the above architecture, the Web servers can only "incrementally synchronize" with one another; as a result, synchronization of "delete" and "update" operations on files is not supported.
The early idea was to handle it at the application layer: when an upload request was written on the web1 server, the upload interfaces on the other Web servers would be called synchronously as well, but that is clearly not worth the candle. We therefore chose rsync for scheduled file synchronization, which saves the cost of "reinventing the wheel" and reduces risk.
For synchronization there are two classic models, push and pull: "pull" means polling the other side for updates, while "push" means actively pushing local changes to the other machines. More advanced event-notification mechanisms can of course also drive such actions.
In high-concurrency write scenarios, synchronization can run into efficiency and timeliness problems, and synchronizing large numbers of files also consumes substantial system and bandwidth resources (all the more noticeably across network segments).
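A minimal sketch of the "push" model with rsync on a schedule (host names and paths are hypothetical); the --delete flag also propagates deletions, easing the "incremental only" limitation noted above:

    # crontab entry: push local upload changes to a peer node every 5 minutes.
    # -a archive mode, -z compress in transit,
    # --delete remove files on the target that were deleted locally.
    */5 * * * * rsync -az --delete /data/web/upload/ web2:/data/web/upload/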
Image server architecture improvement in the cluster era (shared storage)
Using a virtual directory, shared storage can be implemented over a UNC (network) path by pointing the upload virtual directory at the UNC share.
User access method 1:
http://www.yourdomain.com/upload/qa/test.jpg
User access method 2 (an independent domain name can be configured):
http://img.yourdomain.com/upload/qa/test.jpg
On the server hosting the UNC share, an independent domain name can be pointed at it and a lightweight web server configured, yielding an independent image server.
Advantage: performing reads/writes over the UNC (network) path avoids the synchronization problems between multiple servers. It is relatively flexible and supports scale-out and capacity expansion, supports an independent image server and domain-name access, and is fully compatible with the old version's access rules.
Disadvantage: the UNC configuration is cumbersome and incurs some performance loss (in reads/writes and security checks). A "single point of failure" can occur, and without RAID or more advanced disaster recovery measures at the storage level, data loss is possible.
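For illustration, writing through a path on the UNC share is ordinary file I/O (the share name below is hypothetical); the main catch is that the application pool identity needs write permission on the share, which is part of the "cumbersome configuration":

    using System.IO;

    public static class ShareWriter
    {
        public static void Save(byte[] imageBytes)
        {
            // The upload virtual directory points at a UNC share (hypothetical name).
            string uncPath = @"\\img-storage\upload\qa\test.jpg";
            Directory.CreateDirectory(Path.GetDirectoryName(uncPath));
            File.WriteAllBytes(uncPath, imageBytes); // same API as local disk writes
        }
    }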
(Figure: basic architecture of the shared-storage solution.)
In the early days of many websites built on the Linux open-source stack, NFS was often used when image synchronization was to be avoided. Practice has shown that NFS has efficiency problems under high-concurrency reads/writes and mass storage and is not the best choice, so most Internet companies do not use NFS for this kind of application. It can also be implemented with the DFS provided by Windows, whose disadvantages are complex configuration, unknown efficiency, and a lack of practical cases. In addition, some companies use FTP or Samba.
In all the architectures above, upload and download operations pass through the Web server (although with shared storage an independent domain name and site can also be configured to serve image access, upload writes must still be handled by the application on the Web server), which undoubtedly puts heavy pressure on the Web server. We therefore recommend using an independent image server with an independent domain name for image upload and access.
Benefits of independent image servers/independent domain names
Image access consumes a lot of server resources (it involves operating-system context switches and disk I/O). After the separation, the Web/App servers can focus on dynamic processing.
Independent storage makes expansion, disaster tolerance, and data migration easier.
Browsers limit concurrent connections per domain name, so serving pages and images from the same domain costs performance.
When images are served from the main domain, every request carries cookie data, which also costs performance.
It facilitates load balancing of image access requests and the application of various caching policies (such as HTTP headers and proxy caches), and makes migration to a CDN more convenient.
......
We can use a lightweight web server such as Lighttpd or Nginx to build the independent image server, for example:
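A minimal Nginx sketch of such an independent image server (the domain and paths are hypothetical):

    # Minimal static image server (hypothetical domain and paths).
    server {
        listen       80;
        server_name  img.yourdomain.com;
        root         /data/img;      # image files live here

        location ~* \.(jpg|jpeg|png|gif)$ {
            expires    30d;          # let browsers and proxies cache aggressively
            access_log off;          # cut log I/O on high-volume image hits
        }
    }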
Current image server architecture (distributed file system + CDN)
Before building the current image server architecture, the Web server could be bypassed entirely by directly configuring a separate image server and domain name. However, the following problems would be faced:
What to do with the old image data? Can the old image path access rules remain supported?
An independent image server must expose an independent upload/write interface (a published service API); how is its security ensured?
Similarly, if there are multiple independent image servers, will they use a scalable shared storage solution or a real-time synchronization mechanism?
These problems were only simplified once application-level (as opposed to system-level) distributed file systems such as FastDFS, HDFS, MogileFS, MooseFS, and TFS became popular: they support redundant backup, automatic synchronization, linear scaling, client APIs in mainstream languages for upload, download, and deletion, file indexing, and Web access.
Weighing the features of the various DFS options, client API language support (C# was a must), documentation and case studies, and community support, we finally chose FastDFS for our deployment.
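As a rough sketch, FastDFS ships with command-line client tools; uploading a file looks like this (the config path is the conventional default, and the returned file ID is illustrative):

    # Upload a local file through the FastDFS client tool.
    fdfs_upload_file /etc/fdfs/client.conf test.jpg
    # Prints the file ID assigned by the cluster, e.g.
    # group1/M00/00/00/wKgAxxxxxxxx.jpg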
The only problem is that the old version's access rules are incompatible with it. Importing the old images into FastDFS in one pass looks tempting, but the old image access paths are scattered across tables in different business databases, and updating them wholesale would be very difficult, so the old access rules must remain supported. Upgrading an architecture is harder than building a new one precisely because it must stay backward compatible. (Changing an airplane's engine in mid-air is much harder than building a new airplane.)
The solution is as follows:
First, close the old upload entry point (to avoid data inconsistency from continued use). Then migrate the old image data in one pass to independent image servers (the "old image servers" referred to below) using rsync. At the front end (a layer-7 proxy such as HAProxy or Nginx), use ACLs (access rule control) to match requests against the URL pattern (a regular expression) of old images and forward them directly to the specified list of Web servers; on those servers, configure sites that serve the images over the Web and add caching policies. This separates and caches the old image servers: it stays compatible with the old image access rules, improves old-image access efficiency, and avoids the problems brought by real-time synchronization.
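A minimal Nginx sketch of this front-end routing (the upstream addresses, regex, and cache settings are hypothetical):

    # Route old-style /upload/... image URLs to the legacy image servers, with caching.
    proxy_cache_path /data/nginx/cache levels=1:2 keys_zone=oldimg:64m inactive=7d;

    upstream old_img_servers {
        server 10.0.0.11;
        server 10.0.0.12;
    }

    server {
        listen      80;
        server_name img.yourdomain.com;

        # Old access rule: /upload/qa/test.jpg and the like.
        location ~* ^/upload/ {
            proxy_pass        http://old_img_servers;
            proxy_cache       oldimg;
            proxy_cache_valid 200 7d;   # cache successful responses for a week
        }

        # New FastDFS-backed image access would be configured here (omitted).
    }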
Overall architecture
Although the FastDFS-based independent image server cluster architecture is quite mature, domestic problems such as "north-south interconnection" (poor peering between carriers in northern and southern China) and IDC bandwidth costs (images are extremely traffic-hungry) led us in the end to a commercial CDN, which was also very easy to adopt. The principle is actually simple, so only a brief introduction follows:
CNAME the img domain name to the domain specified by the CDN vendor. When a user requests an image, the CDN vendor's intelligent DNS resolution returns the address of the nearest service node (or one chosen by more complex policies such as load and health status). The user's request then reaches that node, which runs a proxy cache service similar to Squid/Varnish: if the path is requested for the first time, the image is fetched from the origin site and returned to the client browser; if it is already cached, it is served straight from the cache, completing the request/response.
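A sketch of the DNS record involved (the CDN target domain is hypothetical):

    ; Delegate image traffic to the CDN via a CNAME (hypothetical target).
    img.yourdomain.com.    IN    CNAME    img.yourdomain.com.cdn-vendor-gslb.net.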
Because a commercial CDN service is used, we have not considered building a front-side proxy cache with Squid/Varnish ourselves.
The cluster architecture above scales out easily and can meet the image service needs of medium and large websites in typical vertical fields (ultra-large sites such as Taobao may be another matter, of course). Our tests showed that a single Nginx server dedicated to image access (Xeon E5 quad-core CPU, 16 GB RAM, SSD) handles thousands of concurrent requests for small static files (around 10 KB after compression) without pressure. Of course, since images are much larger than plain-text static pages, the concurrency an image server can sustain is often bounded by disk I/O capacity and the bandwidth the IDC provides. Nginx's own concurrency handling is very strong with very low resource usage, especially for static resources, so there is little to worry about on that front. Based on actual traffic, you can tune Nginx parameters, optimize the Linux kernel, and add tiered caching policies; you can also scale out by adding servers or scale up by upgrading server configurations, the most direct route being to buy higher-end storage devices and more bandwidth to meet growing traffic demands.
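As a sketch, that tuning typically touches knobs like these (the values are illustrative starting points, not recommendations):

    # nginx.conf fragment: common static-file tuning (illustrative values).
    worker_processes  auto;           # one worker per CPU core

    events {
        worker_connections  10240;    # raise the per-worker connection ceiling
    }

    http {
        sendfile         on;          # kernel-space file transmission
        tcp_nopush       on;          # coalesce headers with the start of the file
        open_file_cache  max=100000 inactive=60s;  # cache descriptors of hot images
    }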
It is worth mentioning that in the "cloud computing" era, websites in a rapid growth phase are also advised to consider "cloud storage" solutions, which not only solve the various storage, scaling, and disaster recovery problems but also come with CDN acceleration. Best of all, the price is not expensive.
To sum up, expanding an image server architecture revolves around these issues:
Capacity planning and expansion problems.
Data synchronization, redundancy, and disaster tolerance.
Costs and reliability of hardware (ordinary mechanical hard drives, SSDs, or higher-end storage devices and solutions).
File system selection. Choose an open-source (distributed) file system such as ext3/4 or NFS/GFS/TFS based on file characteristics (such as file size and read/write ratio).
Accelerating image access. Use a commercial CDN, or a self-built architecture of proxy caches and static web caches.
Compatibility with the paths and access rules of old images.