Understanding Haystack: Small-File Storage

Autumn Harvest

Living among all this reinforced concrete, we barely feel the turning of the seasons anymore, and we overlook nature's organisms, their metabolism, their endless cycle. Autumn is short, but it is the season of harvest and of overflowing joy: the first cool wind, the golden waves of wheat.

A Missed Encounter with Distributed File Systems

In 2013 we began planning database splits. Our consumer-facing classified-information site had obvious time-series characteristics: historical data was archived every day, continuously slimming the tables, the classic time-based "horizontal sharding". As the business grew and changed, hot data reached hundreds of GB. After free posting was opened in particular, data exploded within a short time; the second-hand housing listings table, for example, reached 100GB in a single table, primary-key queries became very slow, and the system was extremely unstable.

Due to a design flaw, MySQL stored the raw HTML of the detail page's description section directly. Analysis showed the description text field averaged about 6KB; of the 100GB single table, the post's property data was only 20GB, the rest being description text. We decided to split the field out, the so-called "vertical sharding". There were two options for storing the split TEXT field: MySQL or a distributed file system. After some research we had no operational experience with the latter, so for stability and non-invasiveness we chose to keep the data in MySQL. Later, when we built Automan for automated SQL rollout, batch file uploads were saved on the local file system and never migrated to a distributed file system either. A pity; we brushed right past it.

Last week I saw that Mao Jian was writing a Haystack-based file system, so I took the chance to revisit the topic.

The Old Facebook Picture Architecture


Picture Access Process

The figure above shows the standard picture access flow: the user uploads an image, the web server stores it directly in the back-end store, and the picture URL is saved to a database such as MySQL. The URL resembles the following format:

http://<machine id>/file/path

When a request comes in, the web server fetches the image URL from MySQL, wraps it with CDN information, and returns it to the user's browser. The browser then renders the page by following the wrapped URL, which may look like this:

http://<CDN>/<TAG>/<machine id>/file/path

When the CDN handles a request and the picture is not cached, it strips out the real picture URL according to the protocol and fetches the image from the origin to refresh its cache. This is the CDN "back-to-origin" request; monitoring the back-to-origin rate is how you keep optimizing the business.
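As a purely illustrative example, here is a minimal Go sketch of the URL wrapping step described above. The host names, tag value, and function name are made up for illustration; the real wrapping protocol depends on the CDN vendor.

```go
package main

import "fmt"

// wrapWithCDN builds the browser-facing URL from the stored picture path.
// cdnHost and tag are illustrative placeholders; the actual protocol fields
// depend on the CDN vendor.
func wrapWithCDN(cdnHost, tag, machineID, filePath string) string {
	// stored form:   http://<machine id>/<file path>
	// wrapped form:  http://<CDN>/<TAG>/<machine id>/<file path>
	return fmt.Sprintf("http://%s/%s/%s/%s", cdnHost, tag, machineID, filePath)
}

func main() {
	fmt.Println(wrapWithCDN("cdn.example.com", "photo", "srv-12", "2013/10/cat.jpg"))
	// Output: http://cdn.example.com/photo/srv-12/2013/10/cat.jpg
}
```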


NFS-based Design

Naturally, the old Facebook architecture also saved pictures to a POSIX-like file system: commercial shared storage mounted over NFS. This model supported Facebook's rapid early growth, much like Alibaba's IOE stack around 2005. Eventually the architecture hit bottlenecks:

1. Facebook keeps each picture in four sizes. As volume grew, the long tail became prominent, and having the CDN cache that long tail was too expensive.

2. With POSIX-style storage, fetching a file requires many IOs just to access metadata. Attributes such as permissions are unnecessary here, and caching all of the metadata in memory costs too much.

The Facebook Haystack Architecture

Haystack is described in Facebook's 2010 paper, "Finding a Needle in Haystack: Facebook's Photo Storage", which details their image storage architecture. Haystack was built to solve exactly these problems, with four design goals:

1. High throughput and low latency. Like an OLTP database workload, it requires high throughput and low latency: popular pictures are cached in the CDN, while on the storage side the number of disk IOs is reduced by keeping metadata in memory, so fetching a picture takes only one IO.

2. Fault-tolerant. High availability is an unavoidable topic; the Haystack design accounts for disaster recovery across data centers and across racks. A redundancy policy is specified when images are uploaded to the server.

3. Cost-effective. Commercial shared storage is expensive; ordinary PC servers with large hard drives are cheaper, but their failure rate is higher, which puts more pressure on fault tolerance and high availability.

4. Simple. A simple architecture is easier to deploy and operate. I recently studied Vitess, whose architecture and deployment are very complex, which is one reason it is not popular.

Based on these ideas, Haystack's designers bypassed the POSIX file system and turned Haystack into a key-value style store, sometimes called NOFS. Each image gets an fid and is no longer stored as a separate file in the file system; instead, all pictures belonging to the same physical volume are appended to one large file. The store (volume) server keeps an in-memory mapping from fid to <volume, offset, size>, holds the volume file handles open, and needs only a single read IO to serve a picture. A minimal sketch of that in-memory index follows.
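As a rough illustration of the idea, here is a minimal Go sketch of the kind of in-memory index a volume server could keep. The field and type names are simplified and my own; they are not the exact structures from the paper.

```go
package main

import "fmt"

// NeedleValue is the per-photo metadata a store server keeps in memory:
// where the needle starts in the volume file and how many bytes it spans.
type NeedleValue struct {
	Offset uint64 // byte offset inside the volume file
	Size   uint32 // needle size in bytes
}

// VolumeIndex maps photo key (fid) -> location, so a read needs only one
// disk IO: seek to Offset in the open volume file and read Size bytes.
type VolumeIndex map[uint64]NeedleValue

func main() {
	idx := VolumeIndex{}
	idx[42] = NeedleValue{Offset: 4096, Size: 6 * 1024}

	if nv, ok := idx[42]; ok {
		fmt.Printf("photo 42: read %d bytes at offset %d\n", nv.Size, nv.Offset)
	}
}
```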


Haystack Components

The architecture is relatively simple and has three parts: Haystack Directory, Haystack Cache, and Haystack Store.

Directory: the meta server, so to speak

1. Generates fids and maintains the mapping between logical volumes and physical volumes, which solves load balancing on upload.

2. Newly added Store servers register themselves here.

3. Maintains the read-only flag of logical volumes; a read-only logical volume no longer accepts upload requests.

4. Decides whether a request should go to the CDN or to the internal Haystack Cache servers.

Cache: the internal CDN, so to speak

1. Image fids are mapped to cache nodes with a consistent hashing algorithm (a minimal sketch follows this list).

2. Caches only requests coming from users, not requests coming from the CDN.

3. Caches only images from write-enabled stores. Because uploads are time-ordered, this effectively means caching only the most recently created images; a picture a user has just uploaded can be placed in the Cache and warmed up immediately.
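For illustration, here is a minimal Go sketch of consistent hashing as it might be used to pick a cache node for a given fid. It is a bare-bones ring without virtual nodes and is not Haystack's actual implementation.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// Ring is a minimal consistent-hash ring: each cache node is hashed onto
// the ring once (no virtual nodes, for brevity), and a photo fid is served
// by the first node clockwise from its own hash.
type Ring struct {
	points []uint32
	nodes  map[uint32]string
}

func NewRing(nodes []string) *Ring {
	r := &Ring{nodes: map[uint32]string{}}
	for _, n := range nodes {
		h := crc32.ChecksumIEEE([]byte(n))
		r.points = append(r.points, h)
		r.nodes[h] = n
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// NodeFor returns the cache node responsible for the given fid.
func (r *Ring) NodeFor(fid string) string {
	h := crc32.ChecksumIEEE([]byte(fid))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.nodes[r.points[i]]
}

func main() {
	ring := NewRing([]string{"cache-1", "cache-2", "cache-3"})
	fmt.Println(ring.NodeFor("3,01637037d6")) // the fid format here is illustrative
}
```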

Store: the storage service where data finally lands

1. Pictures are appended in order to one large file, and the store maintains index information, namely each picture's offset and size within that file.

2. To make reloads after a restart fast, the index information is also saved separately in an index file.

Store storage format

Storage involves two kinds of files: the store file and the index file.


Store File Layout

The store file is one large file. Its header is a superblock holding the global version number and other information, and each picture is a Needle structure appended to the end of the file. Each needle holds the image's cookie, key, flags, size, data, checksum, and so on. Because Facebook stores four sizes of every uploaded image, the four copies share the same key and are distinguished by an alternate key.
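The needle fields listed above could be sketched in Go roughly as follows. The field widths, ordering, and padding are simplified assumptions, not the paper's exact on-disk layout.

```go
package haystack

// Needle is a simplified sketch of one record appended to the store file.
// Field widths and ordering here are illustrative; the paper's on-disk
// layout also includes magic numbers and alignment padding.
type Needle struct {
	Cookie       uint32 // random value issued at upload, checked on read
	Key          uint64 // photo key
	AlternateKey uint32 // distinguishes the four sizes sharing one key
	Flags        byte   // e.g. deleted marker
	Size         uint32 // length of Data
	Data         []byte // the image bytes
	Checksum     uint32 // integrity check over Data
}
```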


Index File Layout

When a machine restarts, the in-memory image index must be rebuilt quickly from the index file. Without an index file it can also be rebuilt by sequentially scanning the store file, but that is time-consuming. Assuming each needle's index entry takes 24 bytes, a machine with 128GB of memory can hold metadata for roughly 5.7 billion images. In the paper the index file is written asynchronously, so after a reboot the entries for orphaned photos (already appended to the store file but not yet indexed) may need to be rebuilt from the store file. A sketch of the rebuild from the index file follows.
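Assuming a fixed 24-byte index record of key (8B), alternate key (4B), offset (8B) and size (4B), an illustrative layout rather than the paper's exact format, the restart-time rebuild might look roughly like this in Go:

```go
package haystack

import (
	"encoding/binary"
	"io"
	"os"
)

// indexRecord mirrors an assumed 24-byte index file entry:
// key(8) + alternate key(4) + offset(8) + size(4). This layout is an
// illustration of the idea, not the exact format from the paper.
type indexRecord struct {
	Key    uint64
	AltKey uint32
	Offset uint64
	Size   uint32
}

type location struct {
	Offset uint64
	Size   uint32
}

// loadIndex scans the index file sequentially and rebuilds the in-memory
// map, which is far cheaper than scanning the multi-GB store file itself.
// For brevity the map is keyed on Key only; real Haystack keys also carry
// the alternate key to distinguish the four sizes.
func loadIndex(path string) (map[uint64]location, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	idx := make(map[uint64]location)
	buf := make([]byte, 24)
	for {
		if _, err := io.ReadFull(f, buf); err != nil {
			if err == io.EOF {
				break
			}
			return nil, err // io.ErrUnexpectedEOF here means a torn tail record
		}
		rec := indexRecord{
			Key:    binary.BigEndian.Uint64(buf[0:8]),
			AltKey: binary.BigEndian.Uint32(buf[8:12]),
			Offset: binary.BigEndian.Uint64(buf[12:20]),
			Size:   binary.BigEndian.Uint32(buf[20:24]),
		}
		idx[rec.Key] = location{Offset: rec.Offset, Size: rec.Size}
	}
	return idx, nil
}
```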

Image upload, update and delete

Pictures have an obvious time-series character, so uploads, updates and deletes are all append operations. The web server asks the Directory server for a volume id, key, alternate key, and cookie, then uploads these together with the picture data to the designated Store machine. Data redundancy is handled synchronously by the Store, giving strong consistency.

An update works the same way as an upload: append the data to the store file and update the in-memory index. A delete marks the flags as deleted both in memory and in the index file. Heavy deletion creates file holes, which must be reclaimed according to some policy. A sketch of the append path follows.
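Here is a minimal Go sketch of the append-only write path, using the same kind of simplified in-memory index as above (names and encoding details are assumptions):

```go
package haystack

import (
	"io"
	"os"
)

type needleLocation struct {
	Offset uint64
	Size   uint32
}

// appendNeedle sketches the write path: uploads, updates and deletes are all
// appends, so the volume file only ever grows at the tail. `encoded` is an
// already-serialized needle (header + photo data + checksum).
func appendNeedle(f *os.File, idx map[uint64]needleLocation, key uint64, encoded []byte) error {
	off, err := f.Seek(0, io.SeekEnd) // append-only: always write at the tail
	if err != nil {
		return err
	}
	if _, err := f.Write(encoded); err != nil {
		return err
	}
	// For an update this simply overwrites the index entry so readers see the
	// newest copy; the old bytes become a hole reclaimed later by compaction.
	idx[key] = needleLocation{Offset: uint64(off), Size: uint32(len(encoded))}
	return nil
}
```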

Picture Reading

The image's volume id, key, alternate key, and cookie can all be parsed out of the URL. The web server asks the Directory server which Store server holds that volume id. The Store server then looks up its in-memory index and checks the flags to see whether the picture has been deleted. If not, it reads the data from the store file at the recorded offset and size, decodes the stored data and cookie, and verifies that the cookie matches the one in the request, returning an error otherwise. A sketch of this read path follows.
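And a matching Go sketch of the read path. For brevity the cookie is kept in the in-memory entry here; in Haystack it is stored in the needle on disk and compared after the read, as described above.

```go
package haystack

import (
	"errors"
	"os"
)

const deletedFlag = 1 << 0

type needleMeta struct {
	Offset uint64
	Size   uint32
	Flags  byte
	Cookie uint32
}

var (
	errNotFound  = errors.New("photo not found or deleted")
	errBadCookie = errors.New("cookie mismatch")
)

// readNeedle sketches the read path: one in-memory lookup, a deleted-flag
// check, then a single positioned read from the open volume file. The cookie
// from the URL must match the stored one, which stops URL guessing.
func readNeedle(f *os.File, idx map[uint64]needleMeta, key uint64, cookie uint32) ([]byte, error) {
	m, ok := idx[key]
	if !ok || m.Flags&deletedFlag != 0 {
		return nil, errNotFound
	}
	if m.Cookie != cookie {
		return nil, errBadCookie
	}
	buf := make([]byte, m.Size)
	if _, err := f.ReadAt(buf, int64(m.Offset)); err != nil { // the single disk IO
		return nil, err
	}
	// A full implementation would now decode buf (strip the needle header,
	// verify the checksum) before returning the image bytes.
	return buf, nil
}
```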

Weedfs

Weedfs is an open-source implementation in Go, and the code is comparatively simple and easy to understand. Reading the source carefully deepens one's understanding of Haystack, and judging by the number of companies using it, it is reasonably stable. A minimal upload sketch against its HTTP API follows the list below.

1. Facebook's paper leaves many details undisclosed, such as how the Directory server is made highly available and how consistency across multiple replicas is guaranteed. From the source, Weedfs uses Raft to make multiple master servers highly available.

2. Weedfs pads needles to 8-byte boundaries and stores offsets in 4 bytes, so a single volume file is at most 2^32 × 8 bytes = 32GB.

3. Compared with Haystack it adds many useful features: gzip compression, index information optionally stored in LevelDB, multi-master high availability, a filer server, and so on.
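For a feel of how Weedfs is used, here is a hedged Go sketch of the two-step upload flow: ask the master (playing the Directory role) for an fid via /dir/assign, then POST the file to the assigned volume server. The endpoint, the default port 9333, and the JSON field names are based on the project's documentation as I remember it and may differ between versions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"mime/multipart"
	"net/http"
)

// assignResult mirrors the fields the Weedfs master is expected to return
// from /dir/assign: the fid and the public URL of a writable volume server.
type assignResult struct {
	Fid       string `json:"fid"`
	PublicUrl string `json:"publicUrl"`
}

func main() {
	// Step 1: ask the master for an fid and a writable volume server.
	resp, err := http.Get("http://localhost:9333/dir/assign")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var a assignResult
	if err := json.NewDecoder(resp.Body).Decode(&a); err != nil {
		panic(err)
	}

	// Step 2: upload the file bytes to the assigned volume server under that fid.
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, _ := w.CreateFormFile("file", "cat.jpg")
	part.Write([]byte("...image bytes...")) // placeholder payload
	w.Close()

	url := fmt.Sprintf("http://%s/%s", a.PublicUrl, a.Fid)
	up, err := http.Post(url, w.FormDataContentType(), &body)
	if err != nil {
		panic(err)
	}
	defer up.Body.Close()
	fmt.Println("upload status:", up.Status)
	fmt.Println("read back at:", url)
}
```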

Beansdb, FastDFS and HDFS

Many teams use FastDFS for picture storage; its tracker server corresponds to Haystack's Directory server. But FastDFS stores files on a POSIX file system, so the IO pressure is somewhat high.

Beansdb is nice to use for this kind of storage: it speaks the Memcache protocol, is based on the Bitcask model, and guarantees consistency with an R + W > N quorum (read and write quorums overlap, so a read always touches at least one up-to-date replica). Douban has been using it to store mp3, text, and picture data.

Both of these open-source products are also optimized for small-file storage, and it would be worth reading their source when time allows. HDFS, by contrast, is meant for large files split into blocks; it suits batch workloads with high throughput, but its latency is higher.

Conclusion

I had meant to start the Redis Proxy step-by-step series; it has slipped by a week, and I will pick it up next week. To finish, I recommend the song "Selfie" by Blackstar.
