GridFS: MongoDB-based Distributed File Storage System

Last Update:2014-06-15 Source: Internet

Author: User

GridFS is a distributed file storage mechanism built on MongoDB. It relies on MongoDB's distributed storage and keeps both file data and file metadata in the database, combining the advantages of a document database with those of a file system. GridFS emerged in response to the growth of big data and the demand for complex data analysis.

Put simply, GridFS implements a file system by storing file data and file metadata in MongoDB. It uses replication for failover and data redundancy; replicas can also serve read scaling, hot backup, or act as a data source for offline batch processing. Sharding splits the data automatically, which provides large-scale storage and load balancing. Because the files live in collections, they can be managed and queried through the database (including MapReduce), which gives GridFS a lightweight file system interface together with search and analysis.

The basic idea of GridFS is to divide a large file into multiple chunks and store each chunk as a separate document, so that files too large for a single document can still be stored. Because MongoDB supports storing binary data in documents, the storage overhead per chunk is kept to a minimum. GridFS relies on MongoDB mechanisms such as replication and sharding to provide distributed file storage, and on MongoDB itself for management and complex analysis.
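The splitting step described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code a MongoDB driver runs; the function name `split_into_chunks` and the 300-byte chunk size are assumptions chosen for the example.

```python
from typing import List


def split_into_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut data into consecutive pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


# 1024 bytes of sample data, split with a hypothetical 300-byte chunk size.
payload = bytes(range(256)) * 4
chunks = split_into_chunks(payload, 300)

# Only the last chunk may be shorter than chunk_size.
print([len(c) for c in chunks])
```

Joining the chunks in order gives back the original bytes, which is what makes the scheme lossless.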

GridFS uses two collections to store files. One holds the chunks of each file; the other holds the file's metadata and the information needed to locate its chunks. By default these collections are named fs.chunks and fs.files.

The chunks collection:

{
  "_id": <ObjectId>,
  "files_id": <ObjectId>,
  "n": <num>,
  "data": <binary>
}

Each document in the chunks collection has the following fields:

chunks._id: the unique ID of the chunk.

chunks.files_id: the _id of the parent document in the files collection.

chunks.n: the sequence number of the chunk. GridFS manages it, and numbering starts from 0.

chunks.data: the chunk's payload, stored as the BSON binary type.
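To make the role of files_id and n concrete, here is a minimal sketch that builds chunk documents as plain dictionaries and reassembles a file from them. It assumes nothing about a real driver: `make_chunk_docs` and `reassemble` are hypothetical helpers, and a random hex string stands in for the ID that MongoDB would generate.

```python
import random
import uuid


def make_chunk_docs(files_id, data, chunk_size):
    """Build one chunk document per piece of data, numbered from 0."""
    docs = []
    for n, offset in enumerate(range(0, len(data), chunk_size)):
        docs.append({
            "_id": uuid.uuid4().hex,   # stand-in for a database-generated ID
            "files_id": files_id,      # points at the parent files document
            "n": n,                    # position of this chunk in the file
            "data": data[offset:offset + chunk_size],
        })
    return docs


def reassemble(chunk_docs, files_id):
    """Select the chunks of one file and join them in order of n."""
    own = sorted(
        (d for d in chunk_docs if d["files_id"] == files_id),
        key=lambda d: d["n"],
    )
    return b"".join(d["data"] for d in own)


content = b"GridFS stores each chunk as its own document."
docs = make_chunk_docs("file-1", content, chunk_size=8)
random.shuffle(docs)                   # storage order is not guaranteed
restored = reassemble(docs, "file-1")
print(restored == content)
```

Sorting by n is what lets the chunks be stored and returned in any order without corrupting the file.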

The chunks collection uses a compound index on files_id and n. The files collection:

{
  "_id": <ObjectId>,
  "length": <num>,
  "chunkSize": <num>,
  "uploadDate": <timestamp>,
  "md5": <string>,
  "filename": <string>,
  "contentType": <string>,
  "aliases": <string array>,
  "metadata": <dataObject>
}

Each document in the files collection has the following fields. An application may also add fields of its own:

files._id: the unique identifier of the file. By default MongoDB uses a BSON ObjectId.

files.length: the size of the file in bytes.

files.chunkSize: the size of each chunk. GridFS splits the file into chunks of this size. The default is 255 KB in current MongoDB releases (256 KB in older ones).

files.uploadDate: the time when GridFS first stored the file. The type is ISODate.

files.md5: the MD5 hash of the file, stored as a string.

files.filename: optional. A human-readable file name.

files.contentType: optional. A valid MIME type for the file.

files.aliases: optional. An array of alias strings.

files.metadata: optional. Custom file metadata.
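As a rough illustration of how these fields relate to one another, the sketch below fills in a files document for a byte string and derives the number of chunks from length and chunkSize. The helpers `make_files_doc` and `expected_chunk_count` are assumptions made for this example, and the document is an ordinary dictionary rather than anything written by a driver.

```python
import hashlib
import math
from datetime import datetime, timezone


def make_files_doc(files_id, data, chunk_size, filename=None, metadata=None):
    """Build a files document describing data (illustrative only)."""
    doc = {
        "_id": files_id,
        "length": len(data),                       # size in bytes
        "chunkSize": chunk_size,                   # size of each chunk
        "uploadDate": datetime.now(timezone.utc),  # time of first storage
        "md5": hashlib.md5(data).hexdigest(),      # hex string
    }
    if filename is not None:
        doc["filename"] = filename                 # optional field
    if metadata is not None:
        doc["metadata"] = metadata                 # optional custom data
    return doc


def expected_chunk_count(files_doc):
    """Number of chunk documents implied by length and chunkSize."""
    return math.ceil(files_doc["length"] / files_doc["chunkSize"])


doc = make_files_doc("file-1", b"hello", chunk_size=2, filename="hello.txt")
print(doc["length"], doc["md5"], expected_chunk_count(doc))
```

A 5-byte file with a 2-byte chunk size needs three chunks, the last of which holds a single byte.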

GridFS can be used through the mongofiles command-line tool or through a MongoDB driver. It mainly provides five operations:

List: list the stored files

Get: retrieve a file

Put: write a file

Search: search for files by file name

Delete: delete a file
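The five operations above can be mimicked with a small in-memory model. The class `ToyGridStore` below is purely hypothetical: it keeps the two collections as Python structures and is meant to show how each operation touches files and chunks, not how mongofiles or a driver is implemented. Matching by substring in `search` is likewise an assumption.

```python
import itertools


class ToyGridStore:
    """In-memory stand-in for the files and chunks collections."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.files = {}                  # _id -> files document
        self.chunks = []                 # chunk documents
        self._ids = itertools.count(1)   # simple stand-in for generated IDs

    def put(self, filename, data):
        """Write a file: one files document plus its chunk documents."""
        file_id = next(self._ids)
        self.files[file_id] = {"_id": file_id, "filename": filename,
                               "length": len(data),
                               "chunkSize": self.chunk_size}
        for n, off in enumerate(range(0, len(data), self.chunk_size)):
            self.chunks.append({"files_id": file_id, "n": n,
                                "data": data[off:off + self.chunk_size]})
        return file_id

    def _find(self, filename):
        for doc in self.files.values():
            if doc["filename"] == filename:
                return doc
        raise KeyError(filename)

    def get(self, filename):
        """Read a file back by joining its chunks in order of n."""
        doc = self._find(filename)
        parts = sorted((c for c in self.chunks if c["files_id"] == doc["_id"]),
                       key=lambda c: c["n"])
        return b"".join(c["data"] for c in parts)

    def list_files(self):
        """Return (filename, length) pairs for every stored file."""
        return sorted((d["filename"], d["length"]) for d in self.files.values())

    def search(self, text):
        """Return the names of files whose name contains text."""
        return [name for name, _ in self.list_files() if text in name]

    def delete(self, filename):
        """Remove the files document and all chunks that point at it."""
        doc = self._find(filename)
        del self.files[doc["_id"]]
        self.chunks = [c for c in self.chunks if c["files_id"] != doc["_id"]]


store = ToyGridStore(chunk_size=4)
store.put("report.txt", b"hello gridfs")
store.put("notes.txt", b"abc")
print(store.list_files())
print(store.search("report"))
print(store.get("report.txt"))
store.delete("notes.txt")
print(store.list_files())
```

Note that delete has to clean up both collections; removing only the files document would leave orphaned chunks behind.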

Because the metadata of GridFS files is stored in the files collection, GridFS makes file management easy: files can be queried by file name, upload time, file size, or custom metadata, and MapReduce can be used for more complex analysis. This is one of the many benefits of combining a traditional file system with a database.
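The kind of metadata query and aggregate analysis described above can be imitated over a plain list of files documents. The sample documents, dates, and sizes below are invented for illustration, and the grouping loop only approximates what a database-side MapReduce job would compute.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical files documents; only the fields used below are included.
files = [
    {"filename": "a.png", "length": 2048, "contentType": "image/png",
     "uploadDate": datetime(2014, 6, 1)},
    {"filename": "b.png", "length": 512, "contentType": "image/png",
     "uploadDate": datetime(2014, 6, 10)},
    {"filename": "c.pdf", "length": 4096, "contentType": "application/pdf",
     "uploadDate": datetime(2014, 5, 20)},
]

# Query by upload time and file size: uploaded in June 2014 or later
# and at least 1 KB large.
recent_large = [f["filename"] for f in files
                if f["uploadDate"] >= datetime(2014, 6, 1)
                and f["length"] >= 1024]

# Map each document to (contentType, length), then reduce by summing.
bytes_per_type = defaultdict(int)
for f in files:
    bytes_per_type[f["contentType"]] += f["length"]

print(recent_large)
print(dict(bytes_per_type))
```

Neither step reads any chunk data; both work on the metadata alone, which is why such queries stay cheap even when the files themselves are large.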

Advantages over traditional file systems

Distributed: GridFS is built on MongoDB, so it can use MongoDB's replication and sharding mechanisms directly to provide data reliability and horizontal scalability. GridFS also avoids disk fragmentation, because MongoDB allocates its data files in units of up to 2 GB.

MapReduce: supports complex management and query analysis.

Index and cache: the metadata is stored in MongoDB, which makes indexing very convenient. Both the standard file fields and the custom file metadata can be indexed to improve system efficiency.

Checksum: GridFS generates a hash value for each file, which can be used to verify the file's integrity.

Developer friendly: GridFS can simplify requirements and reduce development costs. If an application already uses MongoDB, GridFS removes the need for a separate file storage architecture, and code and data remain cleanly separated for easy management.

Other: GridFS avoids some problems that file systems have when storing user-uploaded content. For example, GridFS has no difficulty handling a large number of files that would otherwise sit in the same directory.
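The checksum advantage listed above can be sketched as follows. The example assumes the stored hash is an MD5 hex string, as in the md5 field of the files collection, and computes the digest incrementally over the chunks so that the whole file never has to be held in memory. The helpers `md5_of_chunks` and `verify` are hypothetical names.

```python
import hashlib


def md5_of_chunks(chunks):
    """Hash a file chunk by chunk and return the hex digest."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def verify(chunks, stored_md5):
    """Return True when the chunks still match the stored hash."""
    return md5_of_chunks(chunks) == stored_md5


original = b"file contents to protect"
stored_md5 = hashlib.md5(original).hexdigest()   # value kept with the metadata

# Split into 5-byte chunks, as a store would before writing them.
chunks = [original[i:i + 5] for i in range(0, len(original), 5)]
ok = verify(chunks, stored_md5)

# Simulate corruption of the second chunk.
corrupted = [chunks[0], b"XXXXX"] + chunks[2:]
bad = verify(corrupted, stored_md5)

print(ok, bad)
```

Hashing the chunks in order produces the same digest as hashing the whole file at once, so the check works on the reassembled stream without an extra copy.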

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
