Photo sharing is one of the most popular features on Facebook. To date, users have uploaded more than 1.5 billion photos, making Facebook the largest photo-sharing website. For each uploaded photo, Facebook generates and stores four images of different sizes, which translates into a total of 6 billion image files and more than a petabyte of storage. The collection is currently growing by about 220 million new photos per week, adding roughly 25 TB of storage weekly, and at peak the system must serve about 550,000 photos per second. These numbers pose a major challenge to Facebook's photo storage infrastructure.
Old NFS Photo Architecture
The architecture of the old photo system is divided into the following layers:
The upload layer receives pictures uploaded by users and stores them in the NFS storage layer.
The photo serving layer receives HTTP requests and serves photos from the NFS storage layer.
The NFS storage layer is built on a commercial storage system.
Because each photo is stored as an individual file, the enormous number of directories and files produces a huge amount of metadata on the NFS storage layer, far exceeding what the NFS storage layer can cache. As a result, each photo read request incurs multiple I/O operations. This metadata overhead became the bottleneck of the entire photo architecture, and it is the main reason Facebook relied so heavily on CDNs. To mitigate these problems, two optimizations were made:
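The difference between the two access patterns can be sketched as follows. This is our own illustration, not Facebook's code: the first function models the old per-photo-file design, where every read pays a path lookup and inode fetch; the second models the Haystack-style approach, where all photos sit in one large store file and the application remembers each photo's offset and size, so a read is a single positioned read.

```python
import os
import tempfile

# Old design (illustrative): one file per photo, so open() triggers a
# directory lookup and inode load -- extra metadata I/O on every read.
def read_photo_as_file(directory: str, photo_id: int) -> bytes:
    with open(os.path.join(directory, f"{photo_id}.jpg"), "rb") as f:
        return f.read()

# Haystack-style (illustrative): all photos appended to one store file;
# the application-level index maps photo id -> (offset, size), so a read
# is one positioned read with no per-photo metadata lookup.
def read_photo_at_offset(store_fd: int, offset: int, size: int) -> bytes:
    return os.pread(store_fd, size, offset)

# Tiny demo with two fake "photos" appended to one store file.
store = tempfile.NamedTemporaryFile(delete=False)
index = {}
for photo_id, blob in [(1, b"JPEG-bytes-1"), (2, b"JPEG-bytes-2")]:
    index[photo_id] = (store.tell(), len(blob))
    store.write(blob)
store.flush()

fd = os.open(store.name, os.O_RDONLY)
offset, size = index[2]
print(read_photo_at_offset(fd, offset, size))  # b'JPEG-bytes-2'
os.close(fd)
os.unlink(store.name)
```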
Cachr: a caching server for small Facebook photos, such as profile pictures.
NFS file handle cache: deployed on the photo output layer to reduce the metadata overhead of the NFS storage layer.
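The idea behind the file handle cache can be sketched with local file descriptors standing in for NFS file handles (the class name and sizing are ours, and this is a simplification of what a real NFS handle cache does): keep recently used handles open so that repeated reads of the same file skip the path resolution and inode lookup.

```python
import os
import tempfile
from collections import OrderedDict

class FDCache:
    """Illustrative file-handle cache: keep recently used descriptors
    open so repeated reads skip path lookup and inode I/O, analogous
    to caching NFS file handles on the photo serving layer."""

    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._fds = OrderedDict()

    def get(self, path: str) -> int:
        if path in self._fds:               # cache hit: no metadata I/O
            self._fds.move_to_end(path)
            return self._fds[path]
        fd = os.open(path, os.O_RDONLY)     # miss: pay the lookup once
        self._fds[path] = fd
        if len(self._fds) > self.capacity:  # evict least-recently-used
            _, old_fd = self._fds.popitem(last=False)
            os.close(old_fd)
        return fd

# Demo: the second get() returns the cached descriptor, no new open().
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"photo")
cache = FDCache(capacity=4)
fd1 = cache.get(f.name)
fd2 = cache.get(f.name)
print(fd1 == fd2)  # True
os.close(fd1)
os.unlink(f.name)
```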
New Haystack Photo Architecture
The new photo architecture merges the serving layer and the storage layer into a single physical layer, built as an HTTP-based photo server that stores photos in an object store called Haystack. The goal is to eliminate unnecessary metadata overhead during photo reads, so that I/O operations touch only actual photo data (rather than file system metadata). Haystack can be divided into the following functional layers:
HTTP Server
Photo storage
Haystack Object Storage
File System
Storage space
The sections below describe each of these functional layers in turn.
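Before drilling into the layers, the role of the HTTP-based photo server can be sketched. This is our own minimal illustration, not Facebook's code: the handler resolves a photo id through an in-memory index and returns the bytes directly, with no separate NFS storage tier between the HTTP layer and the data.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Stand-in photo store: id -> photo bytes (a real server would read
# from the Haystack store file at a known offset).
PHOTOS = {"1": b"\xff\xd8fake-jpeg-bytes"}

class PhotoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        photo = PHOTOS.get(self.path.lstrip("/"))
        if photo is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "image/jpeg")
        self.send_header("Content-Length", str(len(photo)))
        self.end_headers()
        self.wfile.write(photo)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), PhotoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

body = urlopen(f"http://127.0.0.1:{server.server_port}/1").read()
print(body == PHOTOS["1"])  # True
server.shutdown()
```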
Storage space
Haystack is deployed on commodity storage blade servers, typically configured as 2U machines with:
Two 4-core CPUs
16-32 GB of memory
A hardware RAID controller with 256-512 MB of NVRAM cache
12 or more 1 TB SATA hard drives
Each blade server provides about 10 TB of usable storage, configured as hardware RAID-6. RAID-6 achieves good performance and redundancy while keeping costs low. Its poor write performance is mitigated by the RAID controller's NVRAM write-back cache. Because reads are mostly random, the NVRAM cache is reserved entirely for writes.
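The usable-capacity figure follows from the RAID-6 arithmetic (our rough illustration; exact figures depend on formatting overhead):

```python
# RAID-6 dedicates two drives' worth of capacity to parity,
# so usable capacity is (n - 2) * drive_size.
drives, drive_tb = 12, 1.0
parity_drives = 2
usable_tb = (drives - parity_drives) * drive_tb
print(usable_tb)  # 10.0, matching the ~10 TB per server quoted above
```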
File System
The Haystack object store is built on top of files in a single file system created on the 10 TB volume.
Photo read requests issue read system calls at known offsets within these files, but to perform the read the file system must first locate the data on the physical volume. Each file in the file system is identified by an inode structure, and the inode holds a mapping between logical file offsets and physical block offsets on disk. For a large file, this block map can itself be quite large.
Block-based file systems store the logical-to-physical block mapping for a large file not in the inode itself but in indirect address blocks, which must be traversed when the file is read. Because there can be several levels of indirect address blocks, a single read can incur multiple I/O operations, depending on whether those indirect blocks are cached.
Extent-based file systems maintain mappings only for contiguous ranges of blocks. The block map of a large, physically contiguous file can be described by a single extent, which fits in the inode itself. However, if the file is fragmented into discontiguous blocks, its block map can grow very large. Fragmentation can be reduced by having the file system preallocate large amounts of space for large files.
The file system currently used is XFS, which provides efficient file preallocation.
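Preallocation can be requested from user space. A minimal sketch (our own, using POSIX `posix_fallocate`; a production deployment would preallocate far more than the 1 MiB reserved here) that asks the file system to reserve mostly contiguous blocks up front:

```python
import os

# Create the store file and preallocate space without writing data,
# so the file system can assign largely contiguous extents.
path = "haystack_store.data"
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
os.posix_fallocate(fd, 0, 1 << 20)  # reserve 1 MiB (illustrative size)
os.close(fd)

size = os.stat(path).st_size
os.unlink(path)
print(size)  # 1048576
```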
Haystack Object Storage
Haystack is a simple log-structured (append-only) object store containing needles that hold the stored data objects. A haystack consists of two files: the haystack store file containing the needles, plus an index file. The following figure shows the layout of the haystack store file:
The first 8 KB of the haystack store file is occupied by the superblock. Following the superblock are needles, each consisting of a header, data, and footer: