How does Facebook store billions of photos?


Photo sharing is one of the most popular features on Facebook. To date, users have uploaded more than 15 billion photos, making Facebook the biggest photo-sharing site. For each uploaded photo, Facebook generates and stores four images of different sizes, which translates to 60 billion images and more than 1.5PB of total storage. Photos are currently being added at a rate of 220 million per week, which consumes roughly 25TB of additional storage weekly. At peak times, 550,000 images must be served per second. These numbers pose a major challenge for Facebook's photo storage infrastructure.

Old NFS Photo Architecture

The old photo system architecture is divided into the following layers:

The upload tier receives photos uploaded by users and saves them to the NFS storage tier.

The photo serving tier receives HTTP requests and serves photos from the NFS storage tier.

The NFS storage tier is built on commercial storage appliances.

Because each photo is stored as a separate file, the huge number of directories and files generates an enormous amount of metadata on the NFS storage tier, far exceeding its caching capacity, so each photo read request results in multiple I/O operations. This metadata became the bottleneck of the entire photo architecture, and it is why Facebook relied so heavily on CDNs. To mitigate these problems, two optimizations were made:


Cachr: a caching server tier that caches Facebook's small user profile photos.

NFS file handle caching: deployed on the photo serving tier to reduce the metadata overhead of the NFS storage tier.

New Haystack Photo Architecture

The new photo architecture merges the serving tier and the storage tier into a single physical tier: an HTTP-based photo server that stores photos in an object store called Haystack, eliminating unnecessary metadata overhead from photo-read operations. In the new architecture, I/O operations are spent only on actual photo data (not on filesystem metadata). Haystack can be subdivided into the following functional layers:

HTTP Server

Photo Storage

Haystack Object Storage

File system

Storage space

In the sections that follow, we describe each of these functional layers in detail.

Storage space

Haystack is deployed on commodity storage blade servers, typically 2U servers containing:

Two quad-core CPUs

16GB–32GB of memory

A hardware RAID controller with 256MB–512MB of NVRAM cache

12 or more 1TB SATA hard drives

Each blade server provides approximately 10TB of usable storage, configured as hardware RAID-6. RAID-6 provides good redundancy and read performance at a low cost. Its poor write performance is mitigated by the RAID controller's NVRAM write-back cache; since reads are mostly random, the NVRAM cache is reserved entirely for writes.

File system

The Haystack object store is built on top of files in a single filesystem created on the 10TB volume.

Photo read requests invoke the read system call at known offsets within a file, but to perform the read, the filesystem must first locate the data on the actual physical volume. Each file in the filesystem is represented by a structure called an inode, which contains a mapping from logical file offsets to physical block offsets on disk. For large files, the block map can be quite large, depending on the type of filesystem in use.

Block-based filesystems maintain a mapping for every logical block, so for a large file this information typically does not fit in the cached inode and is stored in indirect address blocks instead, which must be traversed in order to read the file's data. There can be several levels of indirect address blocks, so a single read can generate multiple I/Os depending on whether those blocks are cached.

Extent-based filesystems maintain mappings only for contiguous ranges of blocks (extents). The block map of a contiguous large file can be described by a single extent, which fits in the inode itself. However, if the file is fragmented into discontiguous blocks, its block map can still grow large. On such filesystems, fragmentation can be reduced by aggressively preallocating large chunks of space whenever the physical file grows.

The filesystem currently in use is XFS, an extent-based filesystem that provides efficient file preallocation.
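To illustrate the preallocation idea, here is a minimal Python sketch that reserves a large contiguous region before writing, assuming a Linux host; the 256MB growth increment is a made-up value, not Facebook's actual setting.

    import os

    CHUNK = 256 * 1024 * 1024  # hypothetical growth increment (256MB)

    fd = os.open("haystack_store.dat", os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Reserving a large contiguous region up front lets an extent-based
        # filesystem cover it with few extents, keeping the block map small.
        os.posix_fallocate(fd, 0, CHUNK)
    finally:
        os.close(fd)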

Haystack Object Storage

Haystack is a simple log-structured (append-only) object store containing needles that represent the stored objects. A Haystack consists of two files: the store file containing the needles, and an index file.

The first 8KB of the haystack store file is occupied by the superblock. Immediately following the superblock are needles, each consisting of a header, the data, and a footer.

A needle is uniquely identified by its <Offset, Key, Alternate Key, Cookie> tuple, where the offset is the needle's offset in the haystack store file. Haystack places no restriction on key values, so needles with duplicate keys are allowed. Each needle also has a corresponding record in the index file.
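To make the needle layout concrete, the following toy Python encoder packs a needle as header + data + footer. The field widths, field order, and CRC32 checksum are illustrative assumptions; the actual on-disk format is not specified in this article.

    import struct
    import zlib

    # Illustrative header: key (u64), alternate key (u32), cookie (u64),
    # flags (u8), data size (u32); footer: CRC32 checksum of the data (u32).
    HEADER = struct.Struct("<QIQBI")
    FOOTER = struct.Struct("<I")

    def pack_needle(key, alt_key, cookie, data, flags=0):
        header = HEADER.pack(key, alt_key, cookie, flags, len(data))
        footer = FOOTER.pack(zlib.crc32(data))
        return header + data + footer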

There is an index record for each needle in the haystack store file, and the order of the index records must match the order of the associated needles in the store file. The index file provides the minimal metadata required to locate a particular needle in the store file; loading and organizing the index records into an efficient lookup data structure is the responsibility of the Haystack application. The index file is not critical, as it can be rebuilt from the haystack store file. Its main purpose is to let needle metadata be loaded into memory quickly, without traversing the much larger store file, since the index is typically less than 1% of the store file's size.
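A minimal sketch of that loading step, assuming a flat index file of fixed-size records; the real record format is not given here. Because records appear in store-file order, a later record for the same key simply replaces the earlier, stale one.

    import struct

    # Assumed index record: key (u64), alternate key (u32), flags (u8),
    # needle offset (u64), data size (u32).
    REC = struct.Struct("<QIBQI")

    def load_index(path):
        index = {}
        with open(path, "rb") as f:
            while True:
                raw = f.read(REC.size)
                if len(raw) < REC.size:
                    break  # end of file (or a truncated trailing record)
                key, alt_key, flags, offset, size = REC.unpack(raw)
                # A later record for the same key overwrites the earlier,
                # stale one (largest offset wins).
                index[(key, alt_key)] = (flags, size, offset)
        return index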

Haystack Write operation

A Haystack write synchronously appends the needle to the haystack store file; once needles have accumulated to a certain point, the corresponding index records are generated and written to the index file. Because the index file is not critical, its records are written asynchronously for faster performance.
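A sketch of the append path under those rules, reusing the hypothetical pack_needle() from the earlier snippet; the buffering of index records is likewise an assumption about mechanics the article leaves open.

    import os

    def write_needle(store_fd, pending_records, key, alt_key, cookie, data):
        offset = os.lseek(store_fd, 0, os.SEEK_END)  # synchronous append
        os.write(store_fd, pack_needle(key, alt_key, cookie, data))
        # Index records are only buffered here; a background task flushes
        # them to the index file later. Losing the tail of the index is
        # acceptable because it can be rebuilt from the store file.
        pending_records.append((key, alt_key, offset, len(data)))
        return offset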

To limit the damage from hardware failures, the index file is also periodically flushed to storage. In the event of a crash or sudden power loss, the recovery process discards any partial needles in the store and truncates the haystack store file to the last valid needle. It then appends index records for any trailing needles that are missing from the index file.
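A recovery sketch under the toy needle layout used above: scan forward from the last indexed offset, stop at the first partial or corrupt needle, and truncate the store file there. The exact validation rules are assumptions.

    import os
    import struct
    import zlib

    HEADER = struct.Struct("<QIQBI")  # toy layout from the earlier sketch
    FOOTER = struct.Struct("<I")

    def recover(store_path, start_offset):
        with open(store_path, "r+b") as f:
            end = os.fstat(f.fileno()).st_size
            pos = start_offset
            while pos < end:
                f.seek(pos)
                raw = f.read(HEADER.size)
                if len(raw) < HEADER.size:
                    break                       # partial trailing header
                key, alt_key, cookie, flags, size = HEADER.unpack(raw)
                body = f.read(size + FOOTER.size)
                if len(body) < size + FOOTER.size:
                    break                       # partial trailing needle
                (crc,) = FOOTER.unpack_from(body, size)
                if zlib.crc32(body[:size]) != crc:
                    break                       # corrupt needle: stop here
                pos = f.tell()                  # advance past valid needle
            f.truncate(pos)                     # drop everything after it
        return pos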

Haystack does not allow existing needles to be overwritten in place. If a needle's data must be modified, a new version is appended with the same <Key, Alternate Key, Cookie> tuple. Applications can then assume that, among needles with duplicate keys, the one with the largest offset is the most recent.

Haystack Read operation

The parameters passed to a Haystack read operation are the needle offset, key, alternate key, cookie, and data size. Haystack adds the header and footer lengths to the data size and reads the whole needle from the file in a single operation. The read succeeds only if the key, alternate key, and cookie match the ones passed as parameters, the data passes its checksum, and the needle has not been deleted (see below).
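A sketch of that validation logic, again assuming the toy layout from the pack_needle() snippet and an assumed position for the deleted flag bit:

    import struct
    import zlib

    HEADER = struct.Struct("<QIQBI")   # toy layout from the earlier sketch
    FOOTER = struct.Struct("<I")
    FLAG_DELETED = 0x1                 # assumed flag bit

    def read_needle(store, offset, key, alt_key, cookie, size):
        store.seek(offset)
        raw = store.read(HEADER.size + size + FOOTER.size)  # one sequential read
        n_key, n_alt, n_cookie, flags, n_size = HEADER.unpack_from(raw)
        if (n_key, n_alt, n_cookie, n_size) != (key, alt_key, cookie, size):
            raise ValueError("key/alternate key/cookie mismatch")
        if flags & FLAG_DELETED:
            raise ValueError("needle has been deleted")
        data = raw[HEADER.size:HEADER.size + size]
        (checksum,) = FOOTER.unpack_from(raw, HEADER.size + size)
        if zlib.crc32(data) != checksum:
            raise ValueError("checksum mismatch")
        return data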

Haystack Delete operation

The delete operation is simpler: it only needs to set the deleted bit in the flags field of the needle in the haystack store file. The associated index record is not modified at all, so an application can end up referring to a deleted needle; a read of such a needle returns an appropriate error to the application. The space of a deleted needle is not reclaimed in any way; the only way to reclaim it is to compact the haystack (see below).
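A sketch of the in-place delete under the toy header layout, where the flags byte sits 20 bytes into the needle; the actual flag position and bit value are assumptions.

    import os
    import struct

    # In the toy header <key u64, alt key u32, cookie u64, flags u8, size u32>,
    # the flags byte sits 8 + 4 + 8 = 20 bytes into the needle.
    FLAGS_OFFSET = 8 + 4 + 8
    FLAG_DELETED = 0x1

    def delete_needle(store_fd, needle_offset):
        # Flip the deleted bit in place; the needle's space is not
        # reclaimed until the haystack is compacted.
        os.pwrite(store_fd, struct.pack("<B", FLAG_DELETED),
                  needle_offset + FLAGS_OFFSET)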

Photo Storage Server

The photo store server is responsible for accepting HTTP requests and translating them into the corresponding Haystack operations. To minimize I/O when retrieving photos, the server maintains an in-memory index cache of all the files in its Haystacks; these indexes are read into the cache when the server starts. Because each node holds millions of photos, the index must not exceed the server's physical memory, so only the minimal metadata needed to locate a photo is kept in memory.

When a user uploads a photo, the system assigns it a unique 64-bit ID, and the photo is then scaled to four different sizes. Each scaled image has the same random cookie and 64-bit key, with the image size (large, medium, small, thumbnail) encoded in the alternate key. The upload server then notifies the photo store server to store all of this information, together with the image, in the haystack.
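A small sketch of this identity scheme as described: one 64-bit key and one random cookie shared by all four scaled images, with the size encoded in the alternate key. The numeric size codes are invented for illustration.

    import secrets

    SIZES = {"large": 1, "medium": 2, "small": 3, "thumbnail": 4}  # assumed codes

    def new_photo_identity():
        key = secrets.randbits(64)      # 64-bit photo key
        cookie = secrets.randbits(64)   # random cookie shared by all sizes
        return key, cookie

    def alternate_key(size_name):
        return SIZES[size_name]         # size is encoded in the alternate key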

The in-memory index cache entry for each photo maps the photo's key and alternate key to the needle's flags, size, and offset in the store file.

Because Google's open-source sparse hash data structure has only 2 bits of overhead per entry, Haystack uses it to keep the in-memory index cache as small as possible.

Photo store write/modify operation

A write operation writes the photo data to the haystack store file and then updates the in-memory index. If the index already contains a record with the same key, this is a modification of an existing photo, and the operation simply updates the offset in the index record to reflect the location of the new image in the store file. The photo store always assumes that if there are duplicate images (images with the same key), the one stored at the larger offset is the valid one.
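In sketch form, a modify is just an append plus an index update, reusing the hypothetical write_needle() from the Haystack write section:

    def modify_photo(store_fd, index, pending_records, key, alt_key, cookie, data):
        # Append the new version; the old needle stays in the store file.
        offset = write_needle(store_fd, pending_records, key, alt_key, cookie, data)
        index[(key, alt_key)] = (0, len(data), offset)  # larger offset now wins
        return offset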

Photo store read operation

The parameters passed to a read operation are the haystack ID and the photo's key, size, and cookie. The server first looks up the photo's key in the in-memory cache to find the offset of the desired file; if it is found, a request is issued to the haystack to read the image. As noted above, Haystack deletes do not update the index records, so the index loaded into memory can contain entries for previously deleted photos. When a previously deleted photo is read, the error is detected and the photo's offset in the in-memory index is set to 0.
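A sketch of this read path, assuming the in-memory index maps (key, alternate key) to (flags, size, offset) and reusing read_needle() from the Haystack read sketch:

    def read_photo(store, index, key, alt_key, cookie):
        entry = index.get((key, alt_key))
        if entry is None:
            return None                 # photo unknown to this server
        flags, size, offset = entry
        if offset == 0:
            return None                 # offset 0 marks a deleted photo
        try:
            return read_needle(store, offset, key, alt_key, cookie, size)
        except ValueError:
            # Stale entry for a needle deleted after the index was loaded:
            # record the deletion by zeroing the offset.
            index[(key, alt_key)] = (flags, size, 0)
            return None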

Photo store delete operation

After notifying Haystack of the delete operation, the server updates the in-memory index cache, setting the photo's offset to 0 to indicate that the photo has been deleted.

Compaction

Compaction is an online operation that reclaims the space used by deleted and duplicate needles (needles with the same key). It creates a new haystack by copying the needles while skipping any duplicate or deleted entries. Once the operation completes, it swaps in the new files and in-memory structures.
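A self-contained sketch of the live-needle selection at the heart of compaction; writing the new store and index files and swapping them in is elided.

    FLAG_DELETED = 0x1  # assumed flag bit, as in the earlier sketches

    def compact(needles):
        """needles: list of (key, alt_key, flags, data) in store-file order."""
        live = {}
        for key, alt_key, flags, data in needles:
            if flags & FLAG_DELETED:
                live.pop((key, alt_key), None)   # drop deleted photos
            else:
                live[(key, alt_key)] = data      # later version supersedes earlier
        # `live` now holds exactly the needles to copy into the new store
        # file; once written, the new store and index are swapped in.
        return list(live.items())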

HTTP Server

The HTTP framework is the simple evhttp server that ships with the open-source libevent library. It uses multiple threads, each of which handles a single HTTP request at a time. Because the system spends most of its time on I/O, the performance of the HTTP server itself is not critical.
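The production tier is C code built on libevent's evhttp; the Python stand-in below only shows the shape of the threading model, where each request is handled by a single thread.

    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    class PhotoHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # A real handler would parse the haystack ID, key, alternate
            # key, and cookie from self.path, then call the photo store
            # read path (read_photo above).
            self.send_response(404)    # placeholder: no photo store attached
            self.end_headers()

    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8080), PhotoHandler).serve_forever()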

Conclusion

Haystack is an HTTP-based object store containing needles that point to the stored data. It eliminates filesystem metadata overhead and, by keeping all index metadata directly in memory, allows photos to be stored and read with a minimal number of I/O operations.

This article was written by Facebook engineers Peter Vajgel, Doug Beaver, and Jason Sobel; it is a translation of their original note.

Original article: http://www.facebook.com/note.php?note_id=76191543919&ref=mf
