LOSF (lots of small files) is a problem that many Internet companies run into. Text, pictures, and music are typical small-file scenarios; sites such as 58.com, Taobao, Xiami, and Autohome all need to store huge numbers of small files.
The problems of small file storage are concentrated in the following areas:
1. Too many small files for a single machine to store
2. Access performance of small files
3. Efficient backup and recovery of small files
Problem 1 is mainly solved with distributed technology: when a single machine cannot hold all the data, the data is spread across multiple machines, and a software layer provides a unified storage interface so that users of the storage service do not need to care about how or where the data is stored.
For problem 2, in a single-machine file system backed by disk (such as ext4 or XFS), files are organized in a directory tree. When there are many files, directories hold many entries and the hierarchy becomes deep, so path lookup (sys_open) hurts performance badly: resolving a single path name may require multiple disk IOs.
Problem 3, backup and restore of small files, is really the same LOSF problem: since accessing small files performs poorly, backing them up and restoring them is bound to be inefficient. Once problems 1 and 2 are solved, this problem is usually solved as well.
TFS (Taobao File System) is a distributed file system developed by Taobao for massive small-file storage. It solves problem 1 by distributing data across multiple storage nodes, problem 2 by packing many small files into large files (blocks) and using a flat directory structure, and problem 3 by keeping multiple replicas of each block and replicating at block granularity.
This article focuses on packing multiple small files into large files. Besides TFS, systems such as HDFS and FastDFS have similar solutions for small files. Packing small files into large files has to answer two questions: how to store a file and how to read it back. The typical approach is to append a small file to the end of a block on write and index it for fast location; on read, the index gives the file's offset and size within the block, and the file data is then read from the block. The key question is how the index is stored.
To simplify the discussion, the large file that holds many small files is called a block, typically 64 MB in size, and each file stored inside a block is identified by a fileid.
Scheme 1
The index is not persisted; it lives only in memory, organized as a hash table (or a sorted array, if binary-search performance is acceptable). When a file is appended to the end of a block, its offset within the block and its size are inserted into the hash table. When a file is accessed, the fileid is first looked up in the hash table to get the offset and size, and the file data is then read from the corresponding location in the block. Because the hash table is entirely in memory, accessing a file costs only one IO.
The advantage of this scheme is that storing a file requires only one IO, and the index can never be inconsistent with the actual file data in the block. The disadvantage is that the index exists only in memory, so when the service restarts the index must be rebuilt from the block data. This requires storing some extra header information with each file in the block, such as a magic number, so that the files in a block are self-describing and the index can be regenerated at each startup by scanning the block data.
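Below is a minimal sketch of Scheme 1 in Python. The record layout (a magic number, the fileid, and the size in front of each file's data) and all names are illustrative, not TFS's actual format:

```python
import os
import struct

MAGIC = 0x5A5A5A5A                       # marks the start of a record (made-up format)
HEADER = struct.Struct("<IQI")           # magic, fileid, size

class Block:
    """One block file plus an in-memory index: fileid -> (offset, size)."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        if os.path.exists(path):
            self.rebuild_index()         # Scheme 1's cost: full scan at every startup
        else:
            open(path, "wb").close()

    def append_file(self, fileid, data):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            start = f.tell()
            f.write(HEADER.pack(MAGIC, fileid, len(data)))   # self-describing header
            f.write(data)
        # index is updated only in memory: one disk IO per write
        self.index[fileid] = (start + HEADER.size, len(data))

    def read_file(self, fileid):
        offset, size = self.index[fileid]            # pure memory lookup
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)                      # one disk IO

    def rebuild_index(self):
        """Regenerate the index by scanning the whole block (run at every restart)."""
        self.index.clear()
        with open(self.path, "rb") as f:
            while True:
                hdr = f.read(HEADER.size)
                if len(hdr) < HEADER.size:
                    break
                magic, fileid, size = HEADER.unpack(hdr)
                assert magic == MAGIC, "corrupt or misaligned record"
                self.index[fileid] = (f.tell(), size)
                f.seek(size, os.SEEK_CUR)
```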
Take a 2 TB disk with 64 MB blocks as an example: there are roughly 30,000 blocks on the disk. Assuming scanning one block takes 1 s and the disk is 80% full, startup takes about 30,000 s × 0.8 = 24,000 s, roughly 400 minutes. Clearly, the overhead of scanning every block to regenerate the index at each startup is unacceptable.
Scheme 2
Building on Scheme 1, each block gets a corresponding index file. When a file is written, its data is appended to the end of the block, its index entry is inserted into the in-memory hash table, and the same entry is appended to the index file. When the storage service restarts, each block can quickly rebuild its in-memory hash table from its index file.
This solves index rebuilding, but every write now takes two IOs, so write latency is higher. To reduce write latency, Facebook took a compromise approach in Haystack: when a file is written, the file data appended to the block is flushed to disk synchronously, the record is then inserted into the in-memory hash table, and as soon as the index record has been appended the write is considered successful; that is, the index is not flushed to disk immediately, so write latency is shortened back to one IO.
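A rough sketch of this write ordering, under the same made-up record layout as above; block_f and index_f are assumed to be the block and its index file opened for appending, mem_index is an ordinary dict, and none of these names come from Haystack:

```python
import os
import struct

MAGIC = 0x5A5A5A5A
HEADER = struct.Struct("<IQI")        # magic, fileid, size: self-describing block record
INDEX_ENTRY = struct.Struct("<QQI")   # fileid, offset of the data, size

def write_file(block_f, index_f, mem_index, fileid, data):
    """Haystack-style write: the block append is flushed to disk, the index append is not."""
    block_f.seek(0, os.SEEK_END)
    start = block_f.tell()
    block_f.write(HEADER.pack(MAGIC, fileid, len(data)))
    block_f.write(data)
    block_f.flush()
    os.fsync(block_f.fileno())                       # the one synchronous IO: data is durable

    mem_index[fileid] = (start + HEADER.size, len(data))     # memory update, no IO

    index_f.write(INDEX_ENTRY.pack(fileid, start + HEADER.size, len(data)))
    index_f.flush()                                  # handed to the page cache, not fsynced;
    # if the machine crashes before writeback, this entry must be recovered at restart
```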
Appending the index asynchronously can cause a problem: a file may exist in the block while its record is missing from the index file, so the in-memory hash table rebuilt from the index file is incomplete and some files cannot be accessed. Haystack handles this at startup by scanning from the end of the block during rebuild, finding every file whose index entry may be missing, and generating and appending the missing entries to the index file. (Because files appear in the same order in the block and in the index file, the missing entries must be a contiguous batch at the end of the block.)
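A sketch of that recovery step under the same assumed layouts: load whatever the index file already has, then scan the block forward from the last covered offset and append the missing entries:

```python
import os
import struct

MAGIC = 0x5A5A5A5A
HEADER = struct.Struct("<IQI")        # magic, fileid, size (same assumed block record layout)
INDEX_ENTRY = struct.Struct("<QQI")   # fileid, offset of the data, size

def recover_missing_index(block_path, index_path, mem_index):
    """Rebuild the memory index from the index file, then re-index any block records
    written after the last persisted index entry."""
    covered = 0                                      # end of the region the index file covers
    with open(index_path, "rb") as idx:
        while True:
            raw = idx.read(INDEX_ENTRY.size)
            if len(raw) < INDEX_ENTRY.size:
                break                                # end of file (or a torn final entry)
            fileid, offset, size = INDEX_ENTRY.unpack(raw)
            mem_index[fileid] = (offset, size)
            covered = offset + size

    # Missing entries, if any, are a contiguous batch at the block tail: scan only from there.
    with open(block_path, "rb") as blk, open(index_path, "ab") as idx:
        blk.seek(covered)
        while True:
            hdr = blk.read(HEADER.size)
            if len(hdr) < HEADER.size:
                break
            magic, fileid, size = HEADER.unpack(hdr)
            assert magic == MAGIC, "corrupt record at block tail"
            offset = blk.tell()
            mem_index[fileid] = (offset, size)
            idx.write(INDEX_ENTRY.pack(fileid, offset, size))
            blk.seek(size, os.SEEK_CUR)
        idx.flush()
        os.fsync(idx.fileno())
```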
Scheme 3
In Scheme 2 the index data effectively exists in two copies: the in-memory hash table and the index file. Because of the Linux page cache, the index file may also be cached in memory, so Scheme 2's memory utilization is not optimal. The idea here is to merge the index file and the in-memory hash table: organize the index file itself as a hash table and map it directly into memory with mmap.
With the index file and the in-memory hash table merged, operating on the in-memory index is operating on the index file, which makes index management convenient. To handle hash conflicts, each index entry needs an extra next field to link entries on the same conflict chain.
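A minimal sketch of such an mmap'ed index, assuming a made-up file layout of an entry counter, a bucket-head array, and chained entries (again, not TFS's real format):

```python
import mmap
import os
import struct

COUNT = struct.Struct("<I")          # file header: number of entries written so far
BUCKET = struct.Struct("<I")         # bucket head: file offset of first chain entry, 0 = empty
ENTRY = struct.Struct("<QQII")       # fileid, offset in block, size, next (offset of next entry)

class MmapIndex:
    """Index file laid out as a hash table and mapped into memory with mmap:
    operating on the in-memory index *is* operating on the index file."""

    def __init__(self, path, buckets=4096, max_entries=100_000):
        self.buckets = buckets
        size = COUNT.size + buckets * BUCKET.size + max_entries * ENTRY.size
        new = not os.path.exists(path)
        self.f = open(path, "w+b" if new else "r+b")
        if new:
            self.f.truncate(size)                    # zero-filled: count = 0, all buckets empty
        self.mm = mmap.mmap(self.f.fileno(), size)

    def _bucket_off(self, fileid):
        return COUNT.size + (fileid % self.buckets) * BUCKET.size

    def put(self, fileid, offset, size):
        (n,) = COUNT.unpack_from(self.mm, 0)
        entry_off = COUNT.size + self.buckets * BUCKET.size + n * ENTRY.size
        b = self._bucket_off(fileid)
        (head,) = BUCKET.unpack_from(self.mm, b)
        ENTRY.pack_into(self.mm, entry_off, fileid, offset, size, head)  # prepend to chain
        BUCKET.pack_into(self.mm, b, entry_off)
        COUNT.pack_into(self.mm, 0, n + 1)

    def get(self, fileid):
        pos = BUCKET.unpack_from(self.mm, self._bucket_off(fileid))[0]
        while pos:                                    # walk the conflict chain via `next`
            fid, offset, size, nxt = ENTRY.unpack_from(self.mm, pos)
            if fid == fileid:
                return offset, size
            pos = nxt
        return None
```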
This scheme also has a hash-expansion problem. When many small files are stored in a block, the bucket count estimated in advance (the estimate is usually kept small, since too many buckets wastes space) can lead to long conflict chains and poor hash-lookup efficiency, so the bucket count must be expanded. With Scheme 1, expansion only costs some extra memory copying; with this scheme, the entire index file has to be rewritten, which costs extra IO. A sketch of that cost follows below.
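Continuing the hypothetical MmapIndex above, growing the bucket array means rehashing every entry into a new index file and swapping it in, i.e. a full rewrite on disk rather than a memory copy:

```python
import os

def expand(old_path, new_path, old_buckets, new_buckets, max_entries=100_000):
    """Grow the bucket array by rehashing every entry into a brand-new index file.
    With the in-memory hash table of Scheme 1/2 this is only a memory copy; here it is file IO."""
    old = MmapIndex(old_path, buckets=old_buckets, max_entries=max_entries)
    new = MmapIndex(new_path, buckets=new_buckets, max_entries=max_entries)
    (n,) = COUNT.unpack_from(old.mm, 0)
    entries_start = COUNT.size + old_buckets * BUCKET.size
    for i in range(n):                                        # replay entries in write order
        fid, offset, size, _ = ENTRY.unpack_from(old.mm, entries_start + i * ENTRY.size)
        new.put(fid, offset, size)
    new.mm.flush()
    os.replace(new_path, old_path)                            # swap the rewritten file into place
```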
When reprinting, please note that this article is from Yun Notes. Original link: Small file merge storage problem