Our project recently needed to use Mondrian, so I started reading the Mondrian source code. Overall, Mondrian is a ROLAP engine, but it lives in a library to be embedded as a component rather than run as a service. For example, it keeps all of its cache inside the process, so when I create multiple cubes in one process and keep their connections open, it eventually leads to an OOM. The first step in turning Mondrian into a service is therefore to move Mondrian's cache out of the process and into a third-party storage system such as Redis or a NoSQL store like HBase. On the one hand this solves the in-process cache problem mentioned above; on the other hand, moving the state out makes the server stateless, so it can be scaled horizontally to ensure high availability. Besides the caching step, the SQL that Mondrian generates also needs optimization. The SQL Mondrian generates falls into two main categories: listing all members of a level, and looking up the result of a cell. The former is nothing more than select distinct xxx or select xxx group by xxx (Mondrian generally uses SELECT DISTINCT), and there is not much to optimize there. The other kind of query computes the value of a measure aggregated over several dimensions, generally a GROUP BY over multiple dimensions that computes SUM, AVG and the like over a column. But if some members need to be filtered out before the GROUP BY, Mondrian translates the remaining members into where xxx in (xxx1, xxx2, ...) (this can be seen in the execution log). Such SQL executes rather inefficiently even if that column has a primary key, so one improvement is to rewrite the IN list into BETWEEN or greater-than/less-than range conditions wherever possible; this can be given a lower priority.
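To make the first step concrete, below is a minimal sketch of an out-of-process cell cache backed by Redis, using the Jedis client. Everything in it (the CubeCellCache class name, the key scheme of cube name plus member coordinates, the TTL) is my own assumption for illustration; a real integration would plug into Mondrian's own external-cache SPI rather than a hand-rolled wrapper like this.

```java
import redis.clients.jedis.Jedis;

/**
 * Rough sketch of an out-of-process cell cache backed by Redis.
 * Key = cube name + member coordinates, value = serialized aggregate value.
 * The names here are illustrative only, not Mondrian API.
 */
public class CubeCellCache {
    private final Jedis jedis;
    private final int ttlSeconds;

    public CubeCellCache(String host, int port, int ttlSeconds) {
        this.jedis = new Jedis(host, port);
        this.ttlSeconds = ttlSeconds;
    }

    private String makeKey(String cube, String... memberCoordinates) {
        return cube + ":" + String.join("|", memberCoordinates);
    }

    /** Cache an aggregated measure value for one combination of members. */
    public void putCell(String cube, double value, String... members) {
        jedis.setex(makeKey(cube, members), ttlSeconds, Double.toString(value));
    }

    /** Return the cached value, or null if the cell must be computed from the data source. */
    public Double getCell(String cube, String... members) {
        String v = jedis.get(makeKey(cube, members));
        return v == null ? null : Double.valueOf(v);
    }
}
```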
The above concerns changes to Mondrian itself, but the performance of a ROLAP system like Mondrian is not very satisfactory. The OLAP engines the industry has produced (such as Baidu's Palo, eBay's Kylin, etc.) generally aim for second-level query latency, while Mondrian's cache is quite poor: going to the data source and executing SQL on every query is slow, so this performance problem needs special attention. Palo maintains its own data warehouse, i.e. all the data to be queried is managed by the system itself, so performance optimization there is relatively easy. Mondrian, by contrast, works on data stored in the data source (a relational database or a big data system), and we do not change the way that data is stored, so most of the performance depends on the data source (whether there is a primary key, whether SSD storage is used, how many nodes the Hadoop cluster has, etc.). Given this limitation, I think there is only one way to improve performance: cache, cache, cache (important things are said three times). But the cache for a whole cube is very large (a rough calculation of the cache needed for one cube is given below). In fact, if we had a large enough cache, we would in theory be building a data warehouse, which is still worth considering. Since the cache can serve as a data warehouse, we then need something like an ETL step that pre-loads the user's data into the cache, i.e. the so-called cube precomputation. Unlike a real data warehouse, though, which stores all the data to be analyzed globally, the precomputation only covers the data a single cube can use. Kylin's precomputation approach is worth learning from here. Kylin's main target, however, is data analysis on the Hadoop stack (data stored in Hadoop, queried through Hive), and because Hive's speed is really hard to accept (especially for the SQL generated by OLAP queries), it is essential for Kylin to precompute and cache the results in HBase. Since we are using an engine like Mondrian, we want our data source to be a relational database, a system like Hive, or even HBase (through the SQL interface provided by Phoenix); it is worth considering here whether a relational database is necessary at all.
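Cube precomputation itself can be sketched just as roughly: enumerate every subset of the dimension columns (each subset is one cuboid, to borrow Kylin's term, and there are 2^n of them), run one GROUP BY query per cuboid against the fact table over JDBC, and push each aggregated row into the cache. The table and column names, the SUM-only measure, and the reuse of the hypothetical CubeCellCache above are all assumptions made for the example.

```java
import java.sql.*;
import java.util.*;

/** Illustrative cube precomputation: one GROUP BY query per cuboid (subset of dimensions). */
public class CubePrecompute {

    /** Enumerate all 2^n subsets of the dimension columns. */
    static List<List<String>> cuboids(List<String> dims) {
        List<List<String>> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << dims.size()); mask++) {
            List<String> subset = new ArrayList<>();
            for (int i = 0; i < dims.size(); i++) {
                if ((mask & (1 << i)) != 0) subset.add(dims.get(i));
            }
            result.add(subset);
        }
        return result;
    }

    static void precompute(Connection conn, String factTable,
                           List<String> dims, String measure,
                           CubeCellCache cache) throws SQLException {
        for (List<String> cuboid : cuboids(dims)) {
            String groupBy = String.join(", ", cuboid);
            String sql = cuboid.isEmpty()
                ? "SELECT SUM(" + measure + ") FROM " + factTable
                : "SELECT " + groupBy + ", SUM(" + measure + ") FROM " + factTable
                  + " GROUP BY " + groupBy;
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    // The leading columns are the cuboid's dimension values, the last is the aggregate.
                    String[] coords = new String[cuboid.size()];
                    for (int i = 0; i < cuboid.size(); i++) coords[i] = rs.getString(i + 1);
                    double value = rs.getDouble(cuboid.size() + 1);
                    cache.putCell(factTable, value, coords);
                }
            }
        }
    }
}
```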
Once precomputation is done, i.e. once we rely on a cache, the problem becomes cache consistency: how to ensure that when the underlying data is updated, the cube's cached data can also be brought up to date. Of course, for an OLAP system the freshness of the data is not the most important thing; refreshing every few hours or once a day is generally acceptable, and it is acceptable for an update to the data source not to be reflected in the cube immediately as long as eventual consistency is reached. But when the data source imports new data, the values of the whole cube may change, and computing the new cell values quickly is a problem that still has to be solved. Some systems I have seen use an inverted-index approach (which I do not yet understand) to implement incremental computation; for a concrete implementation one can refer to the Kylin code.
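Given that a refresh every few hours or once a day is acceptable, the simplest consistency scheme is just a scheduled full rebuild of the precomputed cells; a tiny sketch, with the daily interval and the rebuild callback chosen arbitrarily for illustration:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Simplest form of eventual consistency: periodically rebuild the cached cube from the data source. */
public class CubeRefreshScheduler {
    public static void start(Runnable rebuildCube) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Rebuild once a day; staleness of a few hours to a day is treated as acceptable here.
        scheduler.scheduleAtFixedRate(rebuildCube, 0, 24, TimeUnit.HOURS);
    }
}
```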
The next task and direction is to become familiar with the implementations of both Mondrian and Kylin, and based on them to implement our own OLAP engine supporting MDX and SQL queries (the main direction at the moment is MDX). However, Mondrian's MDX support has gaps. The main problem is that it currently does not support subqueries (presumably because of the intermediate results involved), so some advanced filtering operations cannot be implemented yet. For example: take all years whose aggregate sales are below 10,000, then run the MDX query over the remaining data along other dimensions. That is equivalent to filtering the data source first: use MDX to identify the years to filter out, then exclude all rows in the data source whose year equals one of them (equivalent to adding a WHERE condition to the underlying fact table for every query). Implementing this kind of filter in an outer layer is quite cumbersome, and at the moment I do not see a good way to work around this missing functionality. On the other hand, when defining the fact table Mondrian can point not only at a physical table but also at an arbitrary view, which means the result of any SQL statement can serve as the fact table. That can be used to satisfy the above requirement, but then every query gets translated into SQL with multiple layers of subqueries, and the performance can be imagined. Later, if we can cache the entire cube, it may be possible to satisfy this requirement directly on top of the cube.
In addition, dimensions inside Mondrian are hierarchical. If we define a month level whose parent level is year, then the members of the month level have to carry the specific year: when the result is aggregated only by month we get values like [1997].[1] and [1998].[1] rather than [1] and [2]. Producing the latter kind of result would require splitting each dimension's hierarchy, which is undoubtedly complex; I do not know whether MDX has a keyword that can declare a level to be independent of its parent in the hierarchy.
The next step is to read the Mondrian code. Mondrian really has two main entry points: the first is creating a connection, which includes the process of loading the cube; the other is executing an MDX query. The query process is the more complex one: it involves the cache and how SQL is generated, and the structure of Mondrian's current cache and how that cache is managed also deserve close attention.
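For orientation, these two entry points look roughly like this when Mondrian is driven through olap4j; the JDBC URL, schema path and MDX below are placeholders (the FoodMart demo schema), not something taken from this article:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import org.olap4j.CellSet;
import org.olap4j.OlapConnection;
import org.olap4j.OlapStatement;

public class MondrianEntryPoints {
    public static void main(String[] args) throws Exception {
        // Entry point 1: create a connection; Mondrian parses the schema and loads the cube here.
        Class.forName("mondrian.olap4j.MondrianOlap4jDriver");
        Connection raw = DriverManager.getConnection(
            "jdbc:mondrian:"
            + "Jdbc=jdbc:mysql://localhost/foodmart;"      // placeholder JDBC URL
            + "JdbcDrivers=com.mysql.jdbc.Driver;"
            + "Catalog=file:/path/to/FoodMart.xml;");      // placeholder schema path
        OlapConnection conn = raw.unwrap(OlapConnection.class);

        // Entry point 2: execute an MDX query; caching and SQL generation happen behind this call.
        OlapStatement stmt = conn.createStatement();
        CellSet cells = stmt.executeOlapQuery(
            "SELECT {[Measures].[Unit Sales]} ON COLUMNS, "
            + "{[Time].[Year].Members} ON ROWS FROM [Sales]");
        System.out.println(cells.getCell(0).getFormattedValue());
        conn.close();
    }
}
```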
Finally, a rough calculation of the size of one cube. Assume a star schema with one fact table and 3 dimension tables: time, region and product. The time dimension has three levels, year, quarter and month; assume the year level has 10 members, the quarter level 4*10 members, and the month level 10*4*3 = 120 members. The region dimension has three levels, country, province and city, with 100 countries, 1,000 provinces and 5,000 cities. The product dimension has two levels, product category and product brand, with 16 members in the former and 500 in the latter. The entire cube is then the set of all possible aggregations across the dimensions, where each dimension also contains a special "all" member belonging to a special all level (following Mondrian's model), and the all level is the top level of each dimension; so the time dimension has depth 4, the region dimension depth 4, and the product dimension depth 3. If we combine from the lowest levels, the time dimension has 120 members at its lowest level, the region dimension 5,000 and the product dimension 500, so there are 120*5000*500 = 300 million possible combinations at the lowest level. All of these combinations can be viewed as a cube 120 long, 500 wide and 5,000 high, and these 300 million elements make up the whole cube; each element is a cell containing the aggregated values for one combination (month = xxx, city = xxx, product brand = xxx). A cell contains all the measures, so with this the whole cube is established and holds the complete values, and every other combination can be computed from these cells. For example, to compute total food sales for the year 1997 in China, we take all months under 1997 (12 of them), all cities in China (assume 100) and all brands whose category is food (assume 50): the aggregated values of all 12*100*50 = 60,000 cells are combined into a new cell and returned as the result. If we want to cache the whole cube, even in the best case we still need to cache all members of all the lowest levels (because lower-level information cannot be derived from higher levels, only fetched from the data source), which is quite expensive. Generally we would also index the members (give each member a subscript), and then use the combination of subscripts as the cache key and the measure values of that combination as the cache value. But even inserting a single row into the data source can change a huge number of cell values across the cube, and that incremental computation is quite scary.
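As a sanity check on those numbers, a few lines of arithmetic; the 16 bytes per cell is only an assumed order of magnitude for one key plus one measure, not a measured figure:

```java
public class CubeSizeEstimate {
    public static void main(String[] args) {
        long months = 10L * 4 * 3;        // lowest level of the time dimension
        long cities = 5_000L;             // lowest level of the region dimension
        long brands = 500L;               // lowest level of the product dimension

        long baseCells = months * cities * brands;   // 120 * 5000 * 500
        System.out.println(baseCells);               // 300,000,000 base-level cells

        // Cells touched by the example query: year 1997, country China, category food.
        long queryCells = 12L * 100 * 50;
        System.out.println(queryCells);              // 60,000 cells aggregated into one result

        // Very rough memory estimate, assuming ~16 bytes per cached cell (key + one measure).
        System.out.printf("~%.1f GB%n", baseCells * 16.0 / (1 << 30));
    }
}
```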
As for incremental computation, this is really a trade-off between cache size and query speed. If memory is not sufficient, we are bound to cache only some of the higher levels, so queries against the lower levels will have to go to the data source and performance will be poor; if we cache the member combinations of the lowest levels, then no query needs to go to the data source at all and everything is computed directly in memory. Of course, the best approach is to judge which levels are used most often and how many members each level has: a level with too many members may not be suitable for caching, while a higher level that is queried frequently is more suitable for caching.
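Purely to illustrate that judgment, here is a toy heuristic of my own (the scoring rule and the member budget are assumptions, not anything taken from Mondrian or Kylin): score each level by query frequency divided by member count and cache the best-scoring levels until a memory budget is used up.

```java
import java.util.*;

/** Toy heuristic for choosing which levels to cache: favour frequently queried, small levels. */
public class LevelCachePlanner {

    static class LevelStats {
        final String name;
        final long memberCount;   // size of the level
        final long queryCount;    // how often it appeared in recent queries
        LevelStats(String name, long memberCount, long queryCount) {
            this.name = name; this.memberCount = memberCount; this.queryCount = queryCount;
        }
        double score() { return (double) queryCount / memberCount; }
    }

    /** Pick levels in descending score order until the member budget is exhausted. */
    static List<String> pickLevelsToCache(List<LevelStats> levels, long memberBudget) {
        List<LevelStats> sorted = new ArrayList<>(levels);
        sorted.sort(Comparator.comparingDouble(LevelStats::score).reversed());
        List<String> chosen = new ArrayList<>();
        long used = 0;
        for (LevelStats l : sorted) {
            if (used + l.memberCount <= memberBudget) {
                chosen.add(l.name);
                used += l.memberCount;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<LevelStats> levels = Arrays.asList(
            new LevelStats("[Time].[Month]", 120, 900),
            new LevelStats("[Region].[City]", 5_000, 300),
            new LevelStats("[Product].[Brand]", 500, 600));
        System.out.println(pickLevelsToCache(levels, 1_000));
    }
}
```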
The task is arduous, so better to take it step by step. The first step is to move Mondrian's cache out of the process and consider whether the cache structure can be optimized; the second step is to analyze the SQL Mondrian executes and see whether there is room for optimization; the third step is to store the precomputed results in the cache, without worrying about dynamic incremental updates at that point; and finally, consider how to do incremental computation.
The conception of transforming Mondrian