"Abstract" Today has entered the era of large data, especially large-scale Internet web2.0 application development and cloud computing needs of the mass storage and massive computing development, the traditional relational database can not meet this demand. With the continuous development and maturation of nosql database, it can solve the application demand of mass storage and massive computation. This paper focuses on the application of MongoDB database in the mass data storage as one of the NoSQL.
1 Introduction
NoSQL, short for "Not Only SQL", refers to non-relational databases. Databases of this kind are mainly non-relational, distributed, open source, and horizontally scalable. They were originally aimed at large-scale Web applications; the idea was proposed early in a new wave of database development, and the trend grew rapidly around 2009. Typical characteristics of non-relational data stores include schema freedom, easy replication support, simple APIs, eventual consistency (rather than ACID), and very large data volumes. There is a wide variety of them, such as column databases (Hadoop/HBase, Cassandra, Hypertable, Amazon SimpleDB, etc.), document databases (MongoDB, CouchDB, OrientDB, etc.), key-value databases (Azure Table Storage, Membase, Redis, Berkeley DB, MemcacheDB, etc.), graph databases (Neo4j, InfiniteGraph, Sones, BigData, etc.), object-oriented databases (db4o, Versant, Objectivity, Starcounter, etc.), grid and cloud databases (GigaSpaces, Queplix, Hazelcast, etc.), XML databases (MarkLogic Server, EMC Documentum xDB, BaseX, Berkeley DB XML, etc.), multivalue databases (U2, OpenInsight, OpenQM, etc.), and other non-relational databases (such as FileDB).
MongoDB, one of the NoSQL databases, is an open-source, schema-free, document-oriented, distributed database developed by 10gen, a product positioned between relational and non-relational databases. Written in C++, it is designed to provide scalable, high-performance data storage solutions for Web applications. The data structures it supports are very loose, stored in a JSON-like BSON format, so it can hold relatively complex data types.
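As a minimal illustration of this JSON-like document model (the field names and values are hypothetical, and the BsonDocument type comes from MongoDB.Bson.dll in the official C# driver), the following sketch builds a document containing an embedded array and an embedded sub-document, something a flat relational row cannot express directly:

using System;
using MongoDB.Bson;

class BsonDocumentExample
{
    static void Main()
    {
        // A single document can nest arrays and sub-documents (hypothetical data).
        var employee = new BsonDocument
        {
            { "name", "Zhang San" },
            { "age", 30 },
            { "skills", new BsonArray { "C#", "MongoDB" } },   // embedded array
            { "address", new BsonDocument {                    // embedded sub-document
                { "city", "Beijing" },
                { "zip", "100000" } } }
        };
        Console.WriteLine(employee.ToJson());                  // prints the JSON form of the document
    }
}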
It runs on Solaris, Linux, Windows, and OS X, in both 32-bit and 64-bit builds; in the 32-bit build a single database is limited to about 2 GB, while in the 64-bit build storage capacity is limited only by the available disk space. Drivers are provided for Java, C#, PHP, C, C++, JavaScript, Python, Ruby, Perl, and other languages. The latest production version is 2.0, available from the official download page: http://www.mongodb.org/downloads. More than 100 websites and enterprises currently use it, such as Visual China, Dianping, Taobao, Shanda, Foursquare, Wordnik, OpenShift, SourceForge, GitHub, and so on.
With the continuous accumulation and growth of enterprise data and the continuous development of Web 2.0 applications, we have entered the age of personal information. A large or medium-sized enterprise may produce a large amount of data every day across all kinds of systems, such as documents of various types (OA documents, project documents, etc.), design drawings, high-definition pictures, and video; employees, in turn, are most concerned with the storage and processing of their personal information. When the volume of information becomes large enough, extracting or analyzing the data in real time is difficult with traditional centralized methods, so distributed storage and computation become the inevitable choice: on the one hand to solve the mass storage problem, on the other hand to solve the massive computation problem. MongoDB database technology can effectively support such distributed applications; this paper focuses on the application of MongoDB to mass data storage.
2 Overview
2.1 Main features of MongoDB
(1) The file storage format is BSON, which uses JSON-style syntax that is easy to grasp and understand. Compared with JSON, BSON offers better performance, mainly faster traversal, simpler operations, and additional data types.
(2) Schema freedom: embedded sub-documents and arrays are supported, no data structure needs to be defined in advance, and the denormalized data model helps improve query speed.
(3) Dynamic queries: a rich query language expressed as JSON-style tags makes it easy to query embedded objects, arrays, and sub-documents within a document (see the example after this list).
(4) Full index support, including indexes on embedded objects and arrays within documents; the MongoDB query optimizer analyzes the query expression and generates an efficient query plan.
(5) Efficient binary data storage, suitable for storing large objects (such as high-definition pictures and video).
(6) Support for multiple replication modes, providing redundancy and automatic failover: master-slave, replica pairs / replica sets, and a limited master-master mode.
(7) Support for server-side scripts and MapReduce, enabling computation over massive data, that is, cloud-computing-style processing.
(8) High performance and speed. In most cases its query speed is much faster than MySQL's, with very low CPU consumption. Deployment is simple, requiring almost no configuration.
(9) Automatic sharding, which enables horizontal scaling of a database cluster; nodes can be added or removed dynamically.
(10) Built-in GridFS, supporting mass file storage.
(11) Network access through an efficient native MongoDB wire protocol, whose performance is superior to HTTP or REST protocols.
(12) Rich third-party support and an active MongoDB community: more and more companies and websites use MongoDB in production as part of their technical stack, and 10gen provides strong commercial technical support.
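As a small, hedged illustration of features (3) and (4), the following sketch assumes the document layout from the example in the Introduction and uses the Query and IndexKeys builders of the official C# driver 1.x; the connection address and database name are assumptions, and this code is not taken from the paper's own test program:

using System;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

class DynamicQueryExample
{
    static void Main()
    {
        // Connect to a local mongod/mongos (hypothetical address and database name).
        var server = MongoServer.Create("mongodb://127.0.0.1:27017");
        var collection = server.GetDatabase("test").GetCollection<BsonDocument>("employee");

        // Feature (4): an index can cover an embedded field as well.
        collection.EnsureIndex(IndexKeys.Ascending("address.city"));

        // Dot notation reaches into the embedded "address" sub-document.
        var byCity = Query.EQ("address.city", "Beijing");

        // Matching a scalar against an array field matches any element of the array.
        var bySkill = Query.EQ("skills", "MongoDB");

        // Conditions can be combined dynamically at run time.
        foreach (var doc in collection.Find(Query.And(byCity, bySkill)))
        {
            Console.WriteLine(doc["name"]);
        }
    }
}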
2.2 Applicable scenarios for MongoDB
The main goal of MongoDB is to bridge the gap between key/value stores (high performance and high scalability) and traditional RDBMSs (rich functionality), combining the advantages of both.
(1) Website data: MongoDB is well suited to real-time inserts, updates, and queries, and provides the replication and high scalability that real-time website data storage requires.
(2) Caching: because of its high performance, MongoDB is also suitable as a caching layer in an information infrastructure. After a system restart, a persistent cache layer built on MongoDB can prevent the underlying data source from being overloaded.
(3) Large-volume, low-value data: storing such data in a traditional relational database can be expensive, and programmers have often resorted to plain files for storage instead.
(4) Highly scalable scenarios: MongoDB is ideal for databases made up of dozens or hundreds of servers, and its roadmap already includes built-in support for a MapReduce engine.
(5) Storage of objects and JSON data: MongoDB's BSON data format is ideal for storing and querying data in document form.
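For scenario (5), the following minimal sketch (the Employee class, connection string, and names are hypothetical; the API is the official C# driver 1.x) shows how a plain object is stored and read back as a document without declaring any table schema, since the driver maps the class to BSON automatically:

using System;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

public class Employee
{
    public ObjectId Id { get; set; }   // mapped to the document's _id field
    public string Name { get; set; }
    public int Age { get; set; }
}

class ObjectStorageExample
{
    static void Main()
    {
        var server = MongoServer.Create("mongodb://127.0.0.1:27017");
        var employees = server.GetDatabase("test").GetCollection<Employee>("employee");

        // The object is serialized to a BSON document on insert; no schema is declared anywhere.
        employees.Insert(new Employee { Name = "Li Si", Age = 28 });

        // It comes back as a typed object.
        var found = employees.FindOne(Query.EQ("Name", "Li Si"));
        Console.WriteLine("{0}, {1}", found.Name, found.Age);
    }
}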
2.3 Architecture of MongoDB
A MongoDB database is a set of physical files (data files, log files, etc.) corresponding to a logical structure (collections, documents, etc.).
MongoDB's logical structure is a hierarchy of documents (corresponding to rows in a relational database), collections (corresponding to tables in a relational database), and databases (corresponding to databases in a relational system).
A MongoDB instance supports multiple databases. Internally, each database consists of an .ns file and a number of data files. MongoDB pre-allocates space so that spare data files are always available, which effectively avoids the disk pressure caused by sudden data growth. Each pre-allocated file is filled with zeros, and each newly allocated data file is twice the size of the previous one, up to a maximum of 2 GB per data file.
2.4 Comparison of MongoDB and MS SQL Server statements
MongoDB provides rich query expressions that can implement the functionality of most relational database SQL statements. Taking the table employee (Id, name, age) as an example, the correspondence is illustrated in Figure 1 below.
Figure 1 Comparison of MongoDB and MS SQL Server statements
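Figure 1 itself is not reproduced here. As a hedged illustration of the correspondence it describes, the following sketch uses the official C# driver 1.x (rather than whatever notation the original figure used); the table and field names come from employee (Id, name, age) above, and everything else is an assumption:

using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

class StatementComparison
{
    static void Main()
    {
        var server = MongoServer.Create("mongodb://127.0.0.1:27017");
        var employee = server.GetDatabase("test").GetCollection<BsonDocument>("employee");

        // SQL: INSERT INTO employee (Id, name, age) VALUES (1, 'Tom', 25)
        employee.Insert(new BsonDocument { { "Id", 1 }, { "name", "Tom" }, { "age", 25 } });

        // SQL: SELECT * FROM employee
        var all = employee.FindAll();

        // SQL: SELECT * FROM employee WHERE age > 25
        var older = employee.Find(Query.GT("age", 25));

        // SQL: UPDATE employee SET age = 26 WHERE name = 'Tom'
        employee.Update(Query.EQ("name", "Tom"), Update.Set("age", 26));

        // SQL: DELETE FROM employee WHERE name = 'Tom'
        employee.Remove(Query.EQ("name", "Tom"));
    }
}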
3 Process analysis and testing
3.1 GridFS overview
Because the size of a BSON object in MongoDB is limited, a single BSON object had a maximum size of 4 MB before version 1.7 and has a maximum size of 16 MB from version 1.7 onward [5]. For ordinary file storage, a single-object capacity of 4 to 16 MB is sufficient, but it is not enough for large files such as high-definition pictures, design drawings, and video. For mass data storage MongoDB therefore provides the built-in GridFS, which splits a large file into a number of smaller documents; the chunking rules can be specified, and the splitting is transparent to the user. GridFS uses two kinds of collections to store a file: files (containing the metadata object) and chunks (binary blocks together with some related information). So that multiple GridFS stores can be kept in a single database, the files and chunks collections carry a prefix; the default prefix is fs, and the user is free to change it.
GridFS is supported by drivers for Java, C#, Perl, PHP, Python, Ruby, and other languages, and provides a good API interface, as sketched below.
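As a brief, hedged sketch of how GridFS and its prefix work with the official C# driver 1.x (the file path is illustrative, and the database name EcDocs and the filedocs root anticipate the sharding test later in this paper), the following stores one file and reports where its metadata and chunks end up:

using System;
using System.IO;
using MongoDB.Driver;
using MongoDB.Driver.GridFS;

class GridFsPrefixExample
{
    static void Main()
    {
        var server = MongoServer.Create("mongodb://127.0.0.1:27017");
        var database = server.GetDatabase("EcDocs");

        // Use a custom root ("filedocs") instead of the default "fs"; the file's
        // metadata then goes to EcDocs.filedocs.files and its binary blocks to
        // EcDocs.filedocs.chunks.
        var settings = new MongoGridFSSettings { Root = "filedocs" };
        var gridFs = database.GetGridFS(settings);

        using (var stream = File.OpenRead(@"C:\temp\design.dwg"))   // hypothetical large file
        {
            var fileInfo = gridFs.Upload(stream, "design.dwg");
            Console.WriteLine("Stored {0} bytes in chunks of {1} bytes, id = {2}",
                fileInfo.Length, fileInfo.ChunkSize, fileInfo.Id);
        }
    }
}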
3.2 Test of mass data storage based on GridFS
This article uses the latest version, MongoDB 2.0, together with the official C# driver for testing; the C# driver can be downloaded from https://github.com/mongodb/mongo-csharp-driver.
The bin directory of MongoDB provides a series of useful tools that make operation and maintenance very convenient:
(1) bsondump: dumps files in BSON format as JSON-formatted data.
(2) mongo: client command-line tool that supports JavaScript syntax.
(3) mongod: the database server; each instance starts one process and can be forked to run in the background.
(4) mongodump: database backup tool.
(5) mongorestore: database recovery tool.
(6) mongoexport: data export tool.
(7) mongoimport: data import tool.
(8) mongofiles: GridFS management tool for storing and retrieving binary files (a usage example follows this list).
(9) mongos: the sharding router; when the sharding feature is used, the application connects to mongos rather than to mongod.
(10) mongosniff: a tool similar to tcpdump, except that it only monitors MongoDB-related packets and outputs them in a specified readable form.
(11) mongostat: real-time performance monitoring tool.
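For tool (8), mongofiles, a couple of typical invocations follow the standard put, list, and get commands; the database name EcDocs and the file name are illustrative:

C:\mongodb 2.0.0\bin>mongofiles -d EcDocs put design.dwg
C:\mongodb 2.0.0\bin>mongofiles -d EcDocs list
C:\mongodb 2.0.0\bin>mongofiles -d EcDocs get design.dwg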
In addition, several third-party graphical client tools, such as MongoVUE, RockMongo, and MongoHub, are available to ease management and maintenance.
Combining GridFS with automatic sharding and automatic replication makes it possible to build a high-performance distributed database cluster architecture for massive data storage, as shown in Figure 2 below.
Figure 2 High-performance distributed database cluster architecture
A MongoDB sharding cluster requires three different roles:
(1) Shard Server: stores the actual data shards; each shard can be a single mongod instance or a replica set of mongod instances.
(2) Config Server: stores the configuration information of all shard nodes, including the shard key range of each chunk, the distribution of chunks across shards, and the sharding configuration of every database and collection in the cluster.
(3) Route Process: a front-end router. The client connects to it; it asks the config servers which shard to query or write, connects to the corresponding shard to perform the operation, and finally returns the result to the client. The whole process is transparent to the client, which does not need to care which shard holds the records being manipulated.
To simplify testing, a simple sharding cluster is built on a single physical machine, as shown in Figure 3 below.
Figure 3 Structure of the simple sharding cluster
Configure the test environment as follows:
Two shard servers, one config server, and one route process are simulated, all running on the local host 127.0.0.1 and differing only in port:
(1) Shard Server 1: 127.0.0.1:27020.
(2) Shard Server 2: 127.0.0.1:27021.
(3) Config Server: 127.0.0.1:27022.
(4) Route Process: 127.0.0.1:27017.
Start the related service processes:
C:\mongodb 2.0.0\bin>mongod --shardsvr --dbpath "C:\mongodb 2.0.0\db" --port 27020
D:\mongodb 2.0.0\bin>mongod --shardsvr --dbpath "D:\mongodb 2.0.0\db" --port 27021
E:\mongodb 2.0.0\bin>mongod --configsvr --dbpath "E:\mongodb 2.0.0\db" --port 27022
E:\mongodb 2.0.0\bin>mongos --configdb 127.0.0.1:27022
Configure sharding:
(1) E:\mongodb 2.0.0\bin>mongo
(2) use admin
(3) db.runCommand({ addshard: "127.0.0.1:27020", allowLocal: 1, maxSize: 2, minKey: 1, maxKey: 10 })
(4) db.runCommand({ addshard: "127.0.0.1:27021", allowLocal: 1, minKey: 100 })
(5) config = connect("127.0.0.1:27022")
(6) config = config.getSisterDB("config")
(7) EcDocs = db.getSisterDB("EcDocs")
(8) db.runCommand({ enablesharding: "EcDocs" })
(9) db.runCommand({ shardcollection: "EcDocs.filedocs.chunks", key: { files_id: 1 } })
(10) db.runCommand({ shardcollection: "EcDocs.filedocs.files", key: { _id: 1 } })
Here EcDocs is the database name, filedocs is the user-defined GridFS root (file collection prefix), and the system default root is fs.
Using the official C# driver, the program must reference MongoDB.Driver.dll and MongoDB.Bson.dll. Sample code that repeatedly adds the same file to GridFS is shown in Figure 4 below.
Figure 4 Code for repeatedly adding the same file to GridFS
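Figure 4's listing is not reproduced here. The following is a hedged sketch of what such a loop might look like with the official C# driver 1.x; the connection string, database name EcDocs, filedocs root, file path, and the count of 100 copies (taken from step 4 of the test described below) are assumptions, and the timing mirrors what the test in Figure 5 measures:

using System;
using System.Diagnostics;
using System.IO;
using MongoDB.Driver;
using MongoDB.Driver.GridFS;

class GridFsLoadTest
{
    static void Main()
    {
        // Connect through the route process (mongos) so writes are sharded transparently.
        var server = MongoServer.Create("mongodb://127.0.0.1:27017");
        var database = server.GetDatabase("EcDocs");
        var gridFs = database.GetGridFS(new MongoGridFSSettings { Root = "filedocs" });

        var watch = Stopwatch.StartNew();

        // Repeatedly add the same file to GridFS (step 4 of the test adds 100 copies).
        for (int i = 0; i < 100; i++)
        {
            using (var stream = File.OpenRead(@"C:\temp\sample.jpg"))
            {
                // A distinct remote name per iteration keeps the copies apart.
                gridFs.Upload(stream, "sample_" + i + ".jpg");
            }
        }

        watch.Stop();
        Console.WriteLine("Stored 100 copies in {0} seconds", watch.Elapsed.TotalSeconds);
    }
}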
The test configuration environment is as follows:
Operating system: Windows XP Professional 32-bit SP3.
Processor (CPU): Intel Xeon W3503 @ 2.40 GHz.
Memory: 3567 MB (DDR3 1333 MHz).
Hard drive: Seagate ST3250318AS (250 GB / 7200 RPM).
Since this machine runs a 32-bit operating system, a single service instance can hold GridFS files totaling only about 0.9 GB; with two shard service instances the total size of stored files can reach about 1.8 GB. A 64-bit operating system does not have this limitation.
This test mainly examines GridFS performance and sharding capability for large files; the results are shown in Figure 5 below.
As Figure 5 shows, in steps 1 to 3, when only single files were added, Shard2 received no sharded data; only in step 4, after 100 copies of the same file had been added in succession, did Shard2 receive chunks. Adding a single file of three or four hundred megabytes took only a little over 11 seconds, whereas even an ordinary copy of a file of that size takes at least twenty or thirty seconds, which shows that MongoDB delivers very high performance for bulk file storage.
Detailed sharding information can be viewed by entering the db.printShardingStatus() command in the client mongo shell, as shown in Figure 6 below.
As Figure 6 shows, 6 chunks are allocated on Shard1 and 7 chunks on Shard2, so the data is distributed fairly evenly across the shards.
The tests above show that GridFS can store large amounts of data and that a large database cluster can be built from low-cost servers. It is very easy to scale out and deploy, and programming against it is straightforward, so it can effectively support cloud storage applications and meet the needs of large-scale data storage.
Figure 5 GridFS large-file test results
Figure 6 GridFS large-file sharding information
4 Conclusion
With the continuous growth of enterprise and personal data and the rapid development of cloud computing, more and more applications need to store massive amounts of data and place ever higher demands on high concurrency and on the processing of that data. Traditional relational databases struggle to meet these requirements, while MongoDB, as one of the NoSQL databases, can fully satisfy and solve the demands of mass data storage applications, and more and more large websites and enterprises are choosing MongoDB in place of MySQL for storage.