GridFS: MongoDB-based Distributed File Storage System

Source: Internet
Author: User
Tags: md5, hash, mongodb, driver

GridFS is a distributed file storage system built on MongoDB. It relies on MongoDB's distributed storage mechanisms and keeps both file data and file metadata in MongoDB, so it combines the advantages of a document database with those of a file system. GridFS emerged in response to the growth of big data and the demand for complex data analysis.
 
Put simply, GridFS implements a file system by storing file data and file metadata in MongoDB. Replication provides failover and data redundancy, and can also be used for read scaling, hot backup, or as a data source for offline batch processing. Sharding splits data automatically, which enables large-scale storage and load balancing. Because files are stored as documents in collections, they can be managed and queried through the database (including MapReduce), which gives GridFS a lightweight file system interface together with search and analysis capabilities.
 
The basic idea of GridFS is to divide a large file into multiple chunks and store each chunk as a separate document, which makes it possible to store large files. Because MongoDB supports binary data in documents, the storage overhead per chunk is minimal. GridFS relies on MongoDB mechanisms such as replication and sharding to implement distributed file storage, and on MongoDB itself for management and complex analysis.
 
GridFS uses two collections to store files. One stores the chunks of each file, and the other stores information about the chunks together with the file metadata. By default these collections are named fs.chunks and fs.files.
 
Chunks collection:
 
{
  "_id": <ObjectId>,
  "files_id": <ObjectId>,
  "n": <num>,
  "data": <binary>
}
 
Each document in the chunks collection contains the following fields:
 
chunks._id: the unique ID of the chunk.
 
chunks.files_id: the _id of the parent document in the files collection.
 
chunks.n: the sequence number of the chunk. It is managed by GridFS and starts from 0.
 
chunks.data: the chunk's share of the file data, stored as the BSON binary type.
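
The chunk layout described above can be sketched in plain Python without a database. This is only an illustration of the idea, not how a driver is implemented: the helper names `make_chunks` and `reassemble` are hypothetical, the chunk size of 4 bytes is chosen just to keep the output short, and a random hex string stands in for the ObjectId that MongoDB would normally assign.

```python
import uuid


def make_chunks(files_id, data, chunk_size):
    """Split `data` into documents shaped like entries in fs.chunks."""
    return [
        {
            "_id": uuid.uuid4().hex,  # stand-in for a MongoDB ObjectId (assumption)
            "files_id": files_id,     # links the chunk to its document in fs.files
            "n": n,                   # sequence number, starting from 0
            "data": data[offset:offset + chunk_size],
        }
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]


def reassemble(chunks):
    """Rebuild the file by concatenating chunk data in order of n."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))


payload = b"abcdefghij"  # 10 bytes
chunks = make_chunks("file-1", payload, chunk_size=4)
print([(c["n"], c["data"]) for c in chunks])
# [(0, b'abcd'), (1, b'efgh'), (2, b'ij')]
print(reassemble(list(reversed(chunks))) == payload)
# True
```

Sorting on n before concatenating is what makes the stored order of the chunk documents irrelevant when a file is read back.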
 
The chunks collection uses a unique compound index on files_id and n. The files collection:
 
{
  "_id": <ObjectId>,
  "length": <num>,
  "chunkSize": <num>,
  "uploadDate": <timestamp>,
  "md5": <string>,
  "filename": <string>,
  "contentType": <string>,
  "aliases": <string array>,
  "metadata": <dataObject>
}
 
Documents in the files collection contain the following fields. An application can also add fields of its own:
 
files._id: the unique identifier of the file. By default MongoDB uses a BSON ObjectId.
 
files.length: the size of the file in bytes.
 
files.chunkSize: the size of each chunk. The default is 255 KB (256 KB in older MongoDB versions). GridFS divides the file into chunks based on this value.
 
files.uploadDate: the time when GridFS first stored the file. The type is ISODate.
 
files.md5: the MD5 hash of the file, stored as a string.
 
files.filename: Optional. A human-readable file name.
 
files.contentType: Optional. A valid MIME type for the file.
 
files.aliases: Optional. An array of alias strings.
 
files.metadata: Optional. Custom file metadata.
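
As a rough sketch of how these fields relate to each other, the following builds a dictionary shaped like a files document and derives the number of chunks from length and chunkSize. The helper names, the file name, and the `owner` metadata key are hypothetical, and the 255 KB default is an assumption based on current MongoDB documentation rather than something stated in the text above.

```python
import hashlib
import math
from datetime import datetime, timezone

DEFAULT_CHUNK_SIZE = 255 * 1024  # assumed default; older versions used 256 KB


def make_files_doc(file_id, data, filename, chunk_size=DEFAULT_CHUNK_SIZE, metadata=None):
    """Build a dictionary shaped like a document in fs.files."""
    return {
        "_id": file_id,                             # an ObjectId in a real deployment
        "length": len(data),                        # file size in bytes
        "chunkSize": chunk_size,                    # size of every chunk except the last
        "uploadDate": datetime.now(timezone.utc),   # time of first storage
        "md5": hashlib.md5(data).hexdigest(),       # hash of the whole file
        "filename": filename,
        "metadata": metadata or {},                 # free-form, application-defined
    }


def chunk_count(files_doc):
    """Number of chunk documents needed for this file."""
    return math.ceil(files_doc["length"] / files_doc["chunkSize"])


doc = make_files_doc("file-1", b"x" * (600 * 1024), "report.bin", metadata={"owner": "alice"})
print(doc["length"], chunk_count(doc))
# 614400 3
```

A 600 KB file therefore occupies two full 255 KB chunks plus one shorter final chunk.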
 
GridFS can be used through the mongofiles command-line tool or a MongoDB driver. It mainly provides five operations:
 
list: list the stored files
 
get: retrieve a file
 
put: write a file
 
search: search for files by file name
 
delete: delete a file
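
With a driver, the same five operations might look like the sketch below. It assumes the PyMongo driver and its `gridfs.GridFS` class; the function name `exercise_gridfs` and the file name `hello.txt` are hypothetical. The function takes the GridFS object as a parameter so that the connection details stay outside the example.

```python
def exercise_gridfs(fs):
    """Run the five basic operations against a GridFS object.

    `fs` is assumed to be a PyMongo `gridfs.GridFS` instance, for example
    `gridfs.GridFS(MongoClient()["demo_db"])` (the database name is made up).
    """
    # put: write a file; GridFS returns the _id of the new files document.
    file_id = fs.put(b"hello gridfs", filename="hello.txt")

    # list: names of all files currently stored.
    names = fs.list()

    # search: find files by file name.
    matches = [f.filename for f in fs.find({"filename": "hello.txt"})]

    # get: read the file content back by _id.
    content = fs.get(file_id).read()

    # delete: remove the files document and all of its chunks.
    fs.delete(file_id)

    return names, matches, content, fs.exists(file_id)
```

Against a running MongoDB instance, the returned content would be expected to equal the bytes that were written, and the final `exists` check would be expected to be false once the file has been deleted.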
 
Because the metadata of GridFS files is stored in the files collection, file management is straightforward: files can be queried by name, upload time, size, or custom metadata, and MapReduce can be used for more complex data analysis. This is one of the many benefits of combining a traditional file system with a database.
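
Such queries are ordinary MongoDB filter documents on the files collection. The helper below is a hypothetical sketch that only builds the filter; the function name, its parameters, and the `owner` metadata key are assumptions made for illustration.

```python
from datetime import datetime, timezone


def build_files_filter(filename=None, min_length=None, uploaded_after=None, metadata=None):
    """Build a MongoDB filter document for the fs.files collection."""
    query = {}
    if filename is not None:
        query["filename"] = filename
    if min_length is not None:
        query["length"] = {"$gte": min_length}         # size in bytes
    if uploaded_after is not None:
        query["uploadDate"] = {"$gt": uploaded_after}
    for key, value in (metadata or {}).items():
        query[f"metadata.{key}"] = value               # dotted path into custom metadata
    return query


query = build_files_filter(
    min_length=1024 * 1024,
    uploaded_after=datetime(2024, 1, 1, tzinfo=timezone.utc),
    metadata={"owner": "alice"},
)
print(sorted(query))
# ['length', 'metadata.owner', 'uploadDate']
```

With PyMongo, a filter like this could be passed to `db["fs.files"].find(query)` or to `GridFS.find(query)`.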
 
Advantages over traditional file systems
 
Distributed: GridFS is a MongoDB-based distributed file system, so it can directly use MongoDB's replication and sharding mechanisms to provide data reliability and horizontal scalability. GridFS also avoids disk fragmentation, because MongoDB allocates data file space in units of 2 GB.
 
MapReduce: complex management, query, and analysis tasks are supported.
 
Index and cache: because the metadata is stored in MongoDB, it is easy to index. Both file data and file metadata can be indexed to improve system efficiency.
 
Checksum: GridFS generates an MD5 hash for each file, which can be used to verify the file's integrity.
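
A minimal sketch of such an integrity check, assuming the chunks have already been read into memory as dictionaries with n and data fields: the hash is recomputed chunk by chunk in order of n and compared with the stored md5 value. The function name `verify` is hypothetical.

```python
import hashlib


def verify(chunks, expected_md5):
    """Recompute the MD5 over the chunks in order and compare with the stored hash."""
    digest = hashlib.md5()
    for chunk in sorted(chunks, key=lambda c: c["n"]):
        digest.update(chunk["data"])  # incremental hashing, one chunk at a time
    return digest.hexdigest() == expected_md5


chunks = [{"n": 1, "data": b"lo"}, {"n": 0, "data": b"hel"}]
stored_md5 = hashlib.md5(b"hello").hexdigest()
print(verify(chunks, stored_md5))
# True

chunks[0]["data"] = b"lO"  # simulate a corrupted chunk
print(verify(chunks, stored_md5))
# False
```

Hashing incrementally means the whole file never has to be held in memory at once.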
 
Developer friendly: GridFS can simplify requirements and reduce development costs. If MongoDB is already in use, no separate file storage architecture is needed, and code and data are cleanly separated, which makes them easier to manage.
 
Other: GridFS avoids some of the problems that arise when a file system is used to store user-uploaded content. For example, it has no trouble handling a very large number of files that would otherwise sit in the same directory.
