Inside Taobao's storage and processing architecture for 28.6 billion images (repost)
[IT168] On the afternoon of August 27, at the Storage and System Architecture Forum of the IT168 System Architects Conference, Zhang Wensong, chairman of the Taobao Technical Committee and a core Taobao engineer, gave a detailed account of the architecture of Taobao's image processing and storage system. Dr. Zhang's talk covered Taobao's overall system architecture, the architecture of its image storage system, TFS (the cluster file system Taobao developed in-house), the front-end CDN system, and Taobao's use and exploration of energy-saving servers.
This article focuses on the back-end architecture of Taobao's image storage system, including the TFS cluster file system and the front-end image processing servers.
A system nightmare: massive numbers of small files under high concurrency
For a high-traffic e-commerce site like Taobao, the demands on the image system are on a completely different level from those of everyday photo sharing. Shared photos are usually viewed by a limited circle of friends, so their access volume is not particularly high. Product photos in Taobao shops, especially those of popular items, attract very heavy traffic. For sellers, an image works far better than a text description, so they care a great deal about image display quality, upload time, and access speed. According to Taobao's traffic analysis, images account for more than 90% of the site's total traffic, while the main site's web pages account for less than 10%.
Screenshot of the Taobao home page. Taobao's back-end system stores more than 28.6 billion image files, and images account for more than 90% of Taobao's overall traffic. The average image size is 17.45 KB; images smaller than 8 KB make up 61% of the total image count and 11% of the overall system capacity.
Storing and reading these images also brings some headaches. For example, each image needs thumbnails of different sizes depending on where it is displayed. Given the many display scenarios and the possibility of site redesigns, a single image may need more than 20 thumbnails of different sizes.
Taobao's image storage system has a total capacity of 1800 TB (1.8 PB), of which 990 TB (about 1 PB) is occupied. It holds more than 28.6 billion image files, a figure that includes the thumbnails generated from the originals. The average image size is 17.45 KB; images smaller than 8 KB account for 61% of the total image count and 11% of the storage capacity.
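As a rough sanity check on these figures, the sketch below multiplies the file count by the average file size. The replica count of 2 is purely an assumption (the article only says that blocks are kept in multiple copies), and so is the use of decimal units.

```python
FILE_COUNT = 28.6e9    # image files, from the article
AVG_SIZE_KB = 17.45    # average image size in KB, from the article
REPLICAS = 2           # ASSUMPTION: the article does not state the replica count


def logical_size_tb(file_count: float, avg_size_kb: float) -> float:
    """Size of one copy of the data in decimal TB (1 TB = 1e9 KB)."""
    return file_count * avg_size_kb / 1e9


one_copy_tb = logical_size_tb(FILE_COUNT, AVG_SIZE_KB)
with_replicas_tb = one_copy_tb * REPLICAS
print(f"one copy: {one_copy_tb:.0f} TB, with {REPLICAS} replicas: {with_replicas_tb:.0f} TB")
```

Under these assumptions a single copy comes to roughly 499 TB. Two copies would land close to the occupied space quoted above, but the article does not confirm that breakdown.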
This poses a huge challenge for Taobao's systems. It is well known that for most systems the biggest headache is storing and reading very large numbers of small files: the disk head has to seek and switch tracks frequently, so reads tend to incur long delays. Under heavy concurrent access, this becomes a system nightmare.
In-house development versus commercial systems: an economic analysis
Taobao was founded in 2003, and it has made many attempts and explorations in building and planning its overall system.
The figure below shows Taobao's image storage system before 2007. Until then Taobao relied on commercial storage, using NetApp file storage systems. With Taobao's image files growing by a factor of two every year (that is, to three times the previous size), the back-end NetApp systems were migrated from low-end to high-end models, until in 2006 even NetApp's highest-end product could no longer meet Taobao's storage requirements.
Architecture of Taobao's image storage system before 2007. Because Taobao's image volume kept growing at this rate every year, commercial systems could no longer meet its storage needs. Taobao now uses its in-house TFS cluster file system to handle the storage and reading of huge numbers of small images.
Dr. Zhang summarized some of the limitations and shortcomings of commercial storage systems:
First, commercial storage systems are not specifically optimized for storing and reading small files. Second, the number of files is so large that network storage devices cannot support it. Third, as more and more servers are attached to the system, the number of network connections has reached the limit of the network storage devices. In addition, expanding a commercial storage system is expensive (10 TB of capacity costs on the order of millions of ¥), and such systems have single points of failure, so disaster tolerance and safety cannot be well guaranteed.
Comparing the economics of commercial systems with in-house development, Dr. Zhang shared the following lessons:
1. Commercial software struggles to meet the needs of large-scale systems, whether for storage, CDN, or load balancing, because vendors find it hard to test with data at such a scale in their labs.
2. Combining open source with in-house development gives better control: when the system has a problem, it can be solved thoroughly from the bottom up, and the system is also more scalable.
Comparison of the economic benefits of in-house development and of adopting commercial systems
3. Beyond a certain scale, the investment in in-house development pays off. The figure above compares the input-output ratio of in-house development with that of purchasing commercial systems. To the left of the intersection point in the figure, buying commercial systems is the more practical and economical choice; only when the scale exceeds the intersection point does in-house development yield better economic returns. Few companies actually reach that scale, but Taobao has gone far beyond the intersection point.
4. A system developed in-house can be optimized continuously at multiple levels, in both software and hardware.
Version 1.0 of the TFS cluster file system
In 2006, Taobao decided to develop its own file system, aimed at the challenge of storing huge numbers of small files, to solve its image storage problem. In June 2007, TFS (Taobao File System) officially went online. The production cluster reached 200 PC servers (146 GB × 6 SAS 15K RPM disks, RAID5), and the number of files reached the billion level. Deployed storage capacity: 140 TB; actual storage capacity: TB; each data server supported 200+ random IOPS and 3 MB/s of traffic.
Logical architecture of the first version of Taobao's cluster file system, TFS 1.0. The most distinctive feature of TFS is that it hides part of the metadata in the file name under which an image is saved. This greatly simplifies the metadata and removes the management node as a constraint on overall system performance, an idea quite similar to the "object storage" currently popular in the industry.
The figure shows the logical architecture of TFS 1.0. The cluster consists of a pair of name servers and multiple data servers. The two name servers run as a redundant two-machine pair and play the role of the cluster file system's management node.
• Each data server runs on an ordinary Linux host
• Data is stored in block files (typically 64 MB per block)
• Each block is kept in multiple copies to ensure data safety
• Data files are stored on the ext3 file system
• Disks use RAID5 for data redundancy
• File names carry built-in metadata, and users keep the mapping between TFS file names and their actual files themselves, so the amount of metadata is very small
The cleverest part of the core TFS design concerns metadata. In a traditional cluster file system there is only one copy of the metadata, usually managed by the management node, which therefore easily becomes a bottleneck. Taobao's users do not care what name an image file is actually saved under, so TFS was designed to hide some metadata in the saved file name, such as the image's size, time, and access frequency, as well as its logical block number. Very little information then has to be kept as metadata proper, so the metadata structure is very simple: a single file ID is enough to pinpoint where a file is.
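To illustrate the idea of carrying locating information in the file name, here is a minimal sketch. The name layout (a `T` prefix followed by a base32-encoded block number and file ID) and the helper names `encode_name` and `decode_name` are hypothetical; the article does not describe the real TFS name format.

```python
import base64
import struct

PREFIX = "T"  # hypothetical marker, not taken from the article


def encode_name(block_id: int, file_id: int) -> str:
    """Pack a logical block number and a file ID into an opaque file name."""
    raw = struct.pack(">IQ", block_id, file_id)  # 4 + 8 bytes, big-endian
    return PREFIX + base64.b32encode(raw).decode("ascii").rstrip("=").lower()


def decode_name(name: str) -> tuple[int, int]:
    """Recover (block_id, file_id) from a name produced by encode_name."""
    body = name[len(PREFIX):].upper()
    body += "=" * (-len(body) % 8)  # restore the base32 padding that was stripped
    block_id, file_id = struct.unpack(">IQ", base64.b32decode(body))
    return block_id, file_id


name = encode_name(block_id=1024, file_id=42)
assert decode_name(name) == (1024, 42)
```

Because the client keeps the name, a management node in such a scheme would only need to map block numbers to data servers rather than track every file.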
Because so much file information is hidden in the file name, the whole system discards the traditional directory tree, which is the most expensive part of a file system. With the tree removed, the scalability of the whole cluster improves greatly. This design is in fact similar to the "object storage" concept now current in the industry. TFS has since been updated to version 1.3, its performance has been validated in production, and it continues to be improved and optimized. Taobao is now at the forefront of object storage research.
Version 1.3 of the TFS cluster file system
In June 2009, TFS 1.3 went online and was deployed to Taobao's production image system. The cluster was greatly expanded, from the original 200 PC servers to 440 PC servers (300 GB × 12 SAS 15K RPM disks) plus 30 PC servers (600 GB × 12 SAS 15K RPM disks). The number of files supported grew to the order of tens of billions. Deployed storage capacity: 1800 TB (1.8 PB); current actual storage capacity: 995 TB; each data server supports 900+ random IOPS and 15 MB/s+ of traffic; the name server currently runs with 217 MB of physical memory (the servers use gigabit NICs).
Logical structure of TFS 1.3
The figure shows the logical structure of TFS 1.3. In this version, Taobao's software team focused on improving heartbeat and synchronization performance; with the latest heartbeat and synchronization, a failover switch completes within a few seconds. The version also adds new optimizations, such as keeping metadata in memory and reclaiming disk space. The performance-related changes include:
• A completely flat data organization, discarding the traditional file system directory structure.
• Its own file system built directly on block devices, reducing the performance loss caused by data fragmentation in ext3 and similar file systems.
• One process managing one disk, removing the RAID5 mechanism.
• A central control node with an HA mechanism, striking a balance between safety and stability on one side and performance and complexity on the other.
• Minimal metadata size, with all metadata loaded into memory to speed up access.
• Load balancing and redundancy policies that span racks and IDCs.
• Completely smooth capacity expansion.
The "Image server deployment and caching" section below describes the topology of the whole Taobao image processing system in detail. As shown there, the Taobao deployment places two layers of cache in front of TFS, so the requests that reach TFS are highly scattered. For this reason TFS keeps no in-memory data cache of its own, not even the memory buffer of a traditional file system.
The main performance figure for TFS is not I/O throughput but the random read/write IOPS that a single PC server can deliver. Because hardware models differ, and for reasons of technical confidentiality, Taobao finds it hard to give a reference value. In general, though, TFS reaches about 60% of the maximum random IOPS of a single disk, and the output of a whole machine grows linearly with the number of disks.
TFS 2.0 in development and the open-sourcing of TFS
TFS 2.0 is already under development, and the main problem it addresses is large file storage. TFS was originally developed for frequent concurrent reads of small files. Its block size is 64 MB, which means every file must be smaller than 64 MB. That is enough for ordinary image storage but becomes a bottleneck for large files.
TFS 2.0 will focus on storing large files across blocks. It will also include optimizations for the characteristics of SSDs and SAS disks. According to Taobao's data, SSD storage costs about 20¥ per GB, SAS disks about 5-6¥ per GB, and SATA disks less than 1¥ per GB. As applications demand more performance, the use of SSDs is the future trend, so it is necessary to optimize for the access characteristics of each kind of disk.
In addition, Zhang Wensong announced that TFS will be fully open-sourced in September. Full open source means that Taobao will release all of the source code, and the open source TFS will be exactly the same as the system Taobao runs in production.
Image server deployment and caching
Global topology of Taobao's image storage and processing system. Level-1 and level-2 cache servers sit in front of the image servers so that as many image requests as possible hit the cache and image hotspots are avoided as far as possible. The traffic that actually reaches the TFS back end is already very scattered and even.
The figure above shows the topology of the whole Taobao image system. The system as a whole resembles one large server, with a processing unit, a cache unit, and a storage unit. The TFS cluster file storage system at the back end was described in detail above. In front of TFS, more than 200 image file servers are deployed; they are implemented with Apache and perform the thumbnail generation.
It should be added that, under Taobao's thumbnail rules, thumbnails are generated in real time. This has two advantages. First, it keeps down the number of images stored on the back-end image servers and greatly reduces the demand for back-end storage space: by Taobao's calculation, generating thumbnails in real time saves 90% of the storage space compared with generating all thumbnails in advance, that is, it needs only 10% of the space of the latter approach. Second, generating thumbnails in real time is more flexible.
In front of the image file servers are the level-1 and level-2 caches, and in front of those is global load balancing, which together deal with hotspots in image access. Image access hotspots will always exist, so the important thing is to have images hit the cache as often as possible. Taobao currently places level-2 caches at each carrier's central node and a level-1 cache at the center of the overall system. Combined with global load balancing, this makes the traffic passed to the TFS back end very balanced and scattered, and greatly improves front-end response performance.
Under Taobao's caching strategy, most images are served from the cache whenever possible. If the cache misses, the local server checks whether it holds the original image and, if so, generates the thumbnail from it. Only if that also misses does the request go back to the TFS cluster file storage system. The traffic that finally reaches TFS is therefore greatly reduced.
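The lookup order described above can be sketched as follows. This is a simplified illustration, not Taobao's implementation: the `ImageServer` class, the in-memory dictionaries, and the `fetch_from_tfs` and `make_thumbnail` callbacks are all hypothetical stand-ins.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (image name, thumbnail size such as "100x100")


class ImageServer:
    """Hypothetical image server: thumbnail cache, then local original, then TFS."""

    def __init__(self,
                 fetch_from_tfs: Callable[[str], bytes],
                 make_thumbnail: Callable[[bytes, str], bytes]) -> None:
        self.thumb_cache: Dict[Key, bytes] = {}
        self.originals: Dict[str, bytes] = {}
        self.fetch_from_tfs = fetch_from_tfs
        self.make_thumbnail = make_thumbnail
        self.tfs_reads = 0  # counts how often the back end is actually hit

    def get(self, name: str, size: str) -> bytes:
        key = (name, size)
        if key in self.thumb_cache:            # 1. thumbnail already cached
            return self.thumb_cache[key]
        original = self.originals.get(name)    # 2. original held locally?
        if original is None:                   # 3. only now go back to TFS
            original = self.fetch_from_tfs(name)
            self.tfs_reads += 1
            self.originals[name] = original
        thumb = self.make_thumbnail(original, size)
        self.thumb_cache[key] = thumb
        return thumb


# Stand-in callbacks for the demo; a real server would read TFS and resize images.
server = ImageServer(
    fetch_from_tfs=lambda name: b"original:" + name.encode(),
    make_thumbnail=lambda original, size: original + b"@" + size.encode(),
)
server.get("shoe.jpg", "100x100")
server.get("shoe.jpg", "310x310")   # different size, same original: no TFS read
server.get("shoe.jpg", "100x100")   # thumbnail cache hit
print(server.tfs_reads)
```

In this sketch three requests for the same product image cause a single back-end read, which is the effect the caching strategy aims for.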
Taobao implemented image processing and caching as an Nginx module. Taobao considers Nginx the highest-performing HTTP server currently available (in user space), with clear code and very good modularity. Taobao uses GraphicsMagick for image processing and a cache file system designed for small objects. At the front end, LVS + HAProxy schedule requests for an original image and all of its thumbnails to the same image server.
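One way to send an original image and all of its thumbnails to the same server is to hash only the part of the path that identifies the original. The sketch below illustrates that idea in Python; it is not LVS or HAProxy configuration, and the server names and the `_<width>x<height>.jpg` thumbnail suffix are assumptions made for the example.

```python
import hashlib
import re

SERVERS = ["img-01", "img-02", "img-03", "img-04"]   # hypothetical server pool
THUMB_SUFFIX = re.compile(r"_\d+x\d+\.jpg$")         # hypothetical URL convention


def original_of(path: str) -> str:
    """Strip the thumbnail suffix so every size maps back to its original."""
    return THUMB_SUFFIX.sub("", path)


def pick_server(path: str, servers=SERVERS) -> str:
    """Choose a server from a hash of the original image's path."""
    digest = hashlib.sha256(original_of(path).encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


assert pick_server("/shop/shoe.jpg") == pick_server("/shop/shoe.jpg_100x100.jpg")
```

Routing this way means a server that already holds an original can generate any thumbnail size locally instead of fetching the original again.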
To locate a file, the cache keeps an in-memory index built with a hash algorithm, so a lookup reads the disk at most once. Writes to disk use append mode, and eviction follows a FIFO policy. The main consideration is to reduce disk write operations. There is no need to push the cache hit rate any higher, because the image servers and TFS sit in the same data center, where reads are very efficient.
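A minimal sketch of such a cache is shown below, assuming a single append-only data file and a fixed maximum number of entries. The class name `FifoAppendCache` and its interface are hypothetical, and the sketch does not reclaim the space of evicted entries.

```python
import io
from collections import OrderedDict


class FifoAppendCache:
    """In-memory hash index over an append-only data file, with FIFO eviction."""

    def __init__(self, data_file, max_entries: int) -> None:
        self.data = data_file              # binary file-like object, readable and writable
        self.max_entries = max_entries
        self.index = OrderedDict()         # key -> (offset, length), oldest first

    def put(self, key: str, value: bytes) -> None:
        if key in self.index:
            return                         # already cached; FIFO never reorders
        if len(self.index) >= self.max_entries:
            self.index.popitem(last=False)  # evict the oldest entry
        offset = self.data.seek(0, io.SEEK_END)
        self.data.write(value)             # append only, no in-place rewrite
        self.index[key] = (offset, len(value))

    def get(self, key: str):
        entry = self.index.get(key)        # hash lookup in memory
        if entry is None:
            return None
        offset, length = entry
        self.data.seek(offset)
        return self.data.read(length)      # a single read per hit


cache = FifoAppendCache(io.BytesIO(), max_entries=2)
cache.put("a", b"AAAA")
cache.put("b", b"BB")
cache.put("c", b"C")                       # evicts "a", the oldest entry
```

Unlike LRU, FIFO needs no bookkeeping on reads, so a hit costs one index lookup and one disk read and never triggers a write.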
Speaker Profile
Dr. Zhang Wensong is a researcher at Taobao. He is mainly responsible for developing core infrastructure software, driving performance optimization of network hardware and software, and building the next generation of highly scalable, low-carbon, low-cost e-commerce infrastructure for Taobao. He is also an open source and Linux kernel developer, and the founder and main developer of the well-known Linux cluster project LVS (Linux Virtual Server); the LVS cluster code is included in the official Linux 2.4 and 2.6 kernels. He has extensive experience in the design and architecture of large-scale systems, system software development, the Linux operating system, system security, and software development management. He has long devoted his time to developing free software and enjoys doing so.