now for some data analysis (OLAP)-related data, just Mondrian provides a footmart data set for testing, which mainly records a number of fact tables and dimension tables on sales data, rich in content, and provide the Foormart.xml file, this file defines all the cubes that need to use the definition set, but we now use our own development of the system defined cube, so this configuration file should not be used, the main need to generate the database data, first put it on MySQL, and then pass it through the Sqoo P Import into hive, using our OLAP report system to define the cube and report execution query, this set of processes go down basically the OLAP query basic process to go through, the next major work is how to optimize the performance of multidimensional queries. First of all, this database data is generated by program execution, first we need to download a version of Mondrian (according to official documents), in http://sourceforge.net/projects/mondrian/, But one problem is that after 3.8.0, only the jar package can be found, not the full zip package, so I discard the latest version, download a newer version of the 3.7.0 Zip package (http://sourceforge.net/projects/ Mondrian/files/mondrian/mondrian-3.7.0/mondrian-3.7.0.0-752.zip, size 119M), after decompression in testsrc/main/mondrian/test/ Loader directory has a Mondrianfoodmartloader.java file, this Java program is to generate test data, of course, the test data is stored in a. sql file, this file in the demo directory, the filename is foodmartcreatedata.z IP, after extracting it, compile the Mondrianfoodmartloader program, and then run the following command:
-verbose-tables-data-indexes-jdbcdrivers=com.mysql.jdbc.driver -outputjdbcurl=jdbc:mysql:// 172.17.3.102:16666/foodmart -outputjdbcuser=root-outputjdbcpassword=root-outputjdbcschema=foodmart- Outputjdbcbatchsize=50-inputfile=c:\users\administrator\desktop\foodmartcreatedata.sql
Depending on the parameter name you can see that the meaning of the parameter is:-verbose Verbose output log information-tables CREATE table if table does not exist-data If data exists delete all data in load , do not import data if not selected-indexes Decide whether to create an index-jdbcdrivers database Driver-outputjdbcurl The user name and password for the exported database Url-outputjdbcuser, -outputjdbcpassword Export database-outputjdbcschema target database name, If you do not specify then use the default-outputjdbcbatchsize If BULK INSERT, default is 50 record batch to insert-inputfile Imported data file, which is after the decompression provided in Mondrian. After the SQL file path creation is complete, there are some of the following tables:
Mysql> Show tables;+-------------------------------+| Tables_in_foodmart |+-------------------------------+| Account | | agg_c_10_sales_fact_1997 | | agg_c_14_sales_fact_1997 | | agg_c_special_sales_fact_1997 | | agg_g_ms_pcat_sales_fact_1997 | | agg_l_03_sales_fact_1997 | | agg_l_04_sales_fact_1997 | | agg_l_05_sales_fact_1997 | | agg_lc_06_sales_fact_1997 | | agg_lc_100_sales_fact_1997 | | agg_ll_01_sales_fact_1997 | | agg_pl_01_sales_fact_1997 | | Category | | Currency | | Customer | | Days | | Department | | Employee | | Employee_closure | | Expense_fact | | inventory_fact_1997 | | inventory_fact_1998 | | Position | | Product | | Product_class | | Promotion | | Region || Reserve_employee | | Salary | | sales_fact_1997 | | sales_fact_1998 | | sales_fact_dec_1998 | | Store | | store_ragged | | Time_by_day | | Warehouse | | Warehouse_class |+-------------------------------+
you can see the inside of thisStore_sales,Store_cost,Unit_sales can be measured as a column of measures, and then calculated using some aggregation methods. The rest for exampleproduct_id,time_id,customer_id,promotion_id,store_id is the primary key of the other dimension table, which is associated with the dimension table, for example, the table associated with time_id is time_by_day, and the table structure is as follows:
| Time_by_day | CREATE TABLE ' time_by_day ' ( ' time_id ' int (one) not null, ' the_date ' datetime DEFAULT NULL, ' the_day ' varchar (+) default null, ' the_month ' varchar (() default null, ' the_year ' smallint (6) default null, ' Day_of_ Month ' smallint (6) default null, ' week_of_year ' int (one) default null, ' month_of_year ' smallint (6) default NULL, ' quarter ' varchar (+) default null, ' fiscal_period ' varchar (+) default null) Engine=ntse default charset= UTF8 |
so we can set the time dimension, which is the dimension table associated with this dimension. Time_by_day, you can select the columns in the table that can be enumerated as levels of the dimension (for example,The_month,The_day,The_year,Day_of_month,Week_of_year, etc.), these levels can have hierarchical relationships, one level is the parent of another level, and in the current project only one level is supported with one parent and one child relationship. For example, year this level corresponds to the 'The_year' This column, its children can be 'Quarter' column as a level, for example, in the current table ' the_year ' column has only two values:
Mysql> SELECT distinct the_year from time_by_day;+----------+| the_year |+----------+| 1998 | | 1997 |+----------+
and 'Quarter' column has four values:
Mysql> select distinct quarter from time_by_day;+---------+| quarter |+---------+| Q1 | | Q2 | | Q3 | | Q4 |+---------+
When you set the relationship between this level, you can perform a drill down after the year is specified, so that the measure is calculated for each value under year=1997 quarter, so the level is set primarily to set the aggregation of different granularity for the same dimension (by year, By month or by quarter). Of course, with this level of relationship, for the roll up and drill down also provides the basis for the operation.
In the definition of cube is the definition of dimensions and measures, of course, in the table we can see some of the tables beginning with agg, these tables are created to Mondrian implementation of the time to optimize, In the Foodmart.xml file provided by Mondrian, you can see at the beginning of the file a number of aggregation tables, which are equivalent to some of the measures performed in different dimensions of the estimate, when executing the query if the match can be queried from the aggregation table instead of the dynamic calculation, of course, the calculation for the data update or At a certain price.
The biggest problem with the current Mondrian-based OLAP query engine is performance, because most MDX query operations are dynamically translated into SQL and then perform query operations from the metabase, which inevitably results in poor performance, especially for hive databases that do not target response time. Query speed that is quite slow, the current optimization method is basically the way to perform the estimate, when the cube is created in advance to the entire cube or can support the entire cube of all the measures are calculated, query time from the estimated results of the query or re-aggregation, Because the expected results can be stored in memory or higher performance Key-value database, this way than directly from the meta-database lookup performance will be a certain increase, but when the data volume is large, the estimated results can be very large, if the space saving cache is also a big problem.
Mondrian test database Footmart generation record