As the data produced by organizations grows explosively, from gigabytes to terabytes and from terabytes to petabytes, traditional databases can no longer manage it through vertical scaling alone. The cost of traditional data storage and processing rises significantly as the data volume increases. This has pushed many organizations to look for more economical solutions, such as NoSQL databases, which provide the required storage and processing capacity, scalability, and cost efficiency. NoSQL databases do not use SQL as their query language. They come in several types, such as document stores, key-value stores, graph databases, and object databases.
The NoSQL database used in this article is MongoDB, an open-source document database written in C++. It provides an efficient document-oriented storage structure and supports processing stored documents with MapReduce programs, which can be used for data aggregation. It is highly scalable and supports automatic partitioning. Data is stored in BSON (Binary JSON) format, and the storage structure supports dynamic schemas and dynamic queries. Unlike SQL queries in an RDBMS, MongoDB queries are expressed as JSON-style documents.
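To make the idea of a JSON-style query concrete, here is a minimal sketch in plain JavaScript. It is not MongoDB's implementation: the matches helper is hypothetical and handles only equality plus the $gt and $lt operators, the collection name sales is an assumption, and the documents are invented for illustration.

```javascript
// Hypothetical matcher for a tiny subset of MongoDB-style query documents.
// Supports exact equality and the $gt / $lt comparison operators only.
function matches(doc, query) {
  return Object.keys(query).every(function (field) {
    var cond = query[field];
    if (cond !== null && typeof cond === "object") {
      return Object.keys(cond).every(function (op) {
        if (op === "$gt") return doc[field] > cond[op];
        if (op === "$lt") return doc[field] < cond[op];
        throw new Error("operator not supported in this sketch: " + op);
      });
    }
    return doc[field] === cond;
  });
}

// SQL:   SELECT * FROM Sales WHERE Region = 'West' AND Quantity > 10
// Shell: db.sales.find(query)   (collection name "sales" is an assumption)
var query = { Region: "West", Quantity: { $gt: 10 } };

// Invented documents, used only to exercise the matcher.
var docs = [
  { OrderId: 1, Region: "West", Quantity: 12 },
  { OrderId: 2, Region: "East", Quantity: 30 },
  { OrderId: 3, Region: "West", Quantity: 5 }
];

var selected = docs.filter(function (d) { return matches(d, query); });
console.log(selected.map(function (d) { return d.OrderId; })); // [ 1 ]
```

The point is only that the filter is itself a document: field names map to either a literal value or a nested object of operators.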
MongoDB provides an aggregation framework that includes common functions such as count, distinct, and group. However, more advanced aggregate functions, such as sum, average, max, min, variance, and standard deviation, need to be implemented through MapReduce.
This article describes how to use MapReduce to implement common aggregate functions such as sum, average, max, min, variance, and standard deviation. Typical applications of aggregation include business reports on sales data, for example grouping data by region to calculate total sales, and financial reports.
We start by installing the software required for the example application in this article.
Software Installation
First, install and set up the MongoDB service on the local machine.
Download MongoDB from the Mongo website and decompress it to a local directory, such as C:\Mongo
Create a data directory in that folder, for example, C:\Mongo\Data
If the data files are stored elsewhere, you must add the --dbpath parameter to the command line when using mongod.exe to start MongoDB.
Start the service
MongoDB ships with two executables: mongod.exe starts the database server, and mongo.exe then starts the command-line shell, which can be used for management operations. Both executables are located in the Mongo\bin directory.
Go to the bin directory of the Mongo installation directory, for example, C:\> cd Mongo\bin
There are two startup methods:
mongod.exe --dbpath C:\Mongo\data or mongod.exe --config mongodb.config, where mongodb.config is a configuration file under the Mongo\bin directory. You must specify the location of the data directory in this configuration file (for example, dbpath = C:\Mongo\Data).
Connect to MongoDB. At this point the mongod background service has been started, and you can check it on port 27017, MongoDB's default port. Now that MongoDB is running, let's look at its aggregate functions.
Implement Aggregate functions
In relational databases, we can execute SQL statements that apply predefined aggregate functions, such as SUM(), COUNT(), MAX(), and MIN(), to numeric fields. In MongoDB, however, MapReduce is required to implement aggregation and batch processing. It plays a role similar to the GROUP BY clause used for aggregation in SQL. The next section describes SQL-based aggregation in relational databases and the corresponding aggregation through MapReduce in MongoDB.
To discuss this topic, we consider the Sales table shown below, which is stored in MongoDB in denormalized form.
Sales table

#    Column name          Data type
1    OrderId              INTEGER
2    OrderDate            STRING
3    Quantity             INTEGER
4    SalesAmt             DOUBLE
5    Profit               DOUBLE
6    CustomerName         STRING
7    City                 STRING
8    State                STRING
9    ZipCode              STRING
10   Region               STRING
11   ProductId            INTEGER
12   ProductCategory      STRING
13   ProductSubCategory   STRING
14   ProductName          STRING
15   ShipDate             STRING
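As an illustration of what one row of the Sales table might look like as a single denormalized document, here is a sketch in JavaScript. Every value is invented for this example, and the date format used for OrderDate and ShipDate is an assumption, since the table only says they are strings.

```javascript
// Illustrative Sales document; every value below is invented for this sketch.
var sampleSale = {
  OrderId: 1001,                 // INTEGER
  OrderDate: "2012-01-15",       // STRING (the date format is an assumption)
  Quantity: 3,                   // INTEGER
  SalesAmt: 149.97,              // DOUBLE
  Profit: 30.25,                 // DOUBLE
  CustomerName: "Jane Doe",      // STRING
  City: "Seattle",               // STRING
  State: "Washington",           // STRING
  ZipCode: "98101",              // STRING
  Region: "West",                // STRING
  ProductId: 501,                // INTEGER
  ProductCategory: "Office Supplies",
  ProductSubCategory: "Paper",
  ProductName: "Copy Paper, 500 sheets",
  ShipDate: "2012-01-17"         // STRING
};

// Customer, location, and product details live inside the order document
// itself instead of being split across joined tables.
console.log(Object.keys(sampleSale).length); // 15
```

Because everything needed for a report sits in one document, a map function can read fields such as Region and SalesAmt directly, without a join.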
Implementation based on SQL and MapReduce
We provide a sample set of queries. These queries use aggregate functions, filter conditions, and grouping clauses, each paired with its equivalent MapReduce implementation. In other words, MapReduce is MongoDB's equivalent of the GROUP BY clause in SQL, and it is very useful for performing aggregation over documents stored in MongoDB. One limitation of this approach is that aggregate functions (such as SUM, AVG, MIN, and MAX) are not provided for you: you must implement them yourself with mapper and reducer functions.
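The correspondence between GROUP BY and a mapper/reducer pair can be sketched in plain JavaScript without a running server. The runMapReduce driver and the emit stand-in below are hypothetical helpers written for this example, not MongoDB APIs, and the documents are invented. In the mongo shell, the same mapFn and reduceFn would typically be passed to db.sales.mapReduce(mapFn, reduceFn, { out: { inline: 1 } }), assuming a collection named sales.

```javascript
// SQL being mirrored:
//   SELECT Region, SUM(SalesAmt) FROM Sales GROUP BY Region

// Illustrative documents; the values are invented for this sketch.
var sales = [
  { Region: "West", SalesAmt: 100.5 },
  { Region: "East", SalesAmt: 200.0 },
  { Region: "West", SalesAmt: 49.5 }
];

// Stand-in for MongoDB's emit(): collects each value under its key.
var emitted = {};
function emit(key, value) {
  if (!Object.prototype.hasOwnProperty.call(emitted, key)) emitted[key] = [];
  emitted[key].push(value);
}

// Map: called once per document, with the document bound to `this`.
// The emitted key plays the role of the GROUP BY column.
function mapFn() {
  emit(this.Region, this.SalesAmt);
}

// Reduce: collapses all values emitted for one key into a single value.
function reduceFn(key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) total += values[i];
  return total;
}

// Hypothetical in-memory driver: run map over every document,
// then call reduce once per distinct key.
function runMapReduce(docs, map, reduce) {
  emitted = {};
  docs.forEach(function (doc) { map.call(doc); });
  var out = {};
  Object.keys(emitted).forEach(function (key) {
    out[key] = reduce(key, emitted[key]);
  });
  return out;
}

var totals = runMapReduce(sales, mapFn, reduceFn);
console.log(totals); // { West: 150, East: 200 }
```

A WHERE clause would correspond to filtering the documents before the map step; the sketch omits it to keep the grouping logic visible.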
MongoDB does not support user-defined functions (UDFs). However, it lets you create and save JavaScript functions with the db.system.js.save command, and these functions can be reused in MapReduce. The following table shows implementations of some common aggregate functions. Later, we will discuss how to use these functions in MapReduce jobs.
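Aggregate function / JavaScript function

SUM

    db.system.js.save({
        _id: "Sum",
        value: function (key, values) {
            var total = 0;
            for (var i = 0; i < values.length; i++)
                total += values[i];
            return total;
        }
    });

AVERAGE

    db.system.js.save({
        _id: "Avg",
        value: function (key, values) {
            // Reuses the stored Sum function defined above.
            var total = Sum(key, values);
            var mean = total / values.length;
            return mean;
        }
    });

MAX

    db.system.js.save({
        _id: "Max",
        value: function (key, values) {
            var maxValue = values[0];
            // The source listing is cut off after the loop variable;
            // the loop condition and body are a reconstruction.
            for (var i = 1; i < values.length; i++)
                if (values[i] > maxValue) maxValue = values[i];
            return maxValue;
        }
    });

MIN

    db.system.js.save({
        _id: "Min",
        value: function (key, values) {
            var minValue = values[0];
            // The source listing is cut off after the loop variable;
            // the loop condition and body are a reconstruction.
            for (var i = 1; i < values.length; i++)
                if (values[i] < minValue) minValue = values[i];
            return minValue;
        }
    });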
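Since the stored functions are ordinary JavaScript, their logic can be tried outside MongoDB. The sketch below restates Sum, Avg, Max, and Min as plain functions and adds Variance and StdDev in the same style. The last two names are hypothetical, chosen here to match the functions this article mentions; they compute the population variance and standard deviation, and the input values are invented.

```javascript
// Plain-JavaScript versions of the stored functions, for checking the logic.
function Sum(key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) total += values[i];
  return total;
}

function Avg(key, values) {
  return Sum(key, values) / values.length;
}

function Max(key, values) {
  var maxValue = values[0];
  for (var i = 1; i < values.length; i++)
    if (values[i] > maxValue) maxValue = values[i];
  return maxValue;
}

function Min(key, values) {
  var minValue = values[0];
  for (var i = 1; i < values.length; i++)
    if (values[i] < minValue) minValue = values[i];
  return minValue;
}

// Hypothetical helper: population variance, the mean of the squared
// deviations from the mean.
function Variance(key, values) {
  var mean = Avg(key, values);
  var squared = values.map(function (v) { return (v - mean) * (v - mean); });
  return Avg(key, squared);
}

// Hypothetical helper: population standard deviation.
function StdDev(key, values) {
  return Math.sqrt(Variance(key, values));
}

// Invented values standing in for everything emitted under one key.
var amounts = [2, 4, 4, 4, 5, 5, 7, 9];
console.log(Sum("West", amounts));      // 40
console.log(Avg("West", amounts));      // 5
console.log(Max("West", amounts));      // 9
console.log(Min("West", amounts));      // 2
console.log(Variance("West", amounts)); // 4
console.log(StdDev("West", amounts));   // 2
```

One caveat when moving these into a real job: MongoDB may call the reduce function more than once for the same key, feeding earlier reduce results back in as values. Sum, Max, and Min tolerate that, but an average or variance computed directly inside reduce does not, so a common approach is to have reduce carry running totals (for example a count and a sum) and derive the final statistic in a finalize step.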