Ideas for Transforming Mondrian, and an Understanding of Mondrian
Mondrian is going to be used in a project soon, so the first step was to read through its source code. In general, Mondrian is a ROLAP engine, but it is positioned as a library rather than a service; for example, it keeps all of its caches inside the process. When I create multiple cubes in one process and keep their connections open, this leads to OOM. The primary transformation needed to turn Mondrian into a service is therefore to move its cache out of the process into a third-party storage system such as Redis, HBase, or another NoSQL store. On the one hand this solves the in-process cache problem mentioned above; on the other hand it makes the server stateless, so it can scale horizontally and provide high availability.

Besides caching, the SQL that Mondrian generates also needs optimization. The SQL it generates mainly falls into two kinds: listing all member values under a level, and computing the value of a cell. The former is nothing more than select distinct xxx or select xxx ... group by xxx; Mondrian generally uses the select distinct form, and there is little to optimize there. The latter queries the value of a measure aggregated over several dimensions, typically the sum or avg of a column after grouping by several dimension columns. However, when some members have to be filtered out before grouping, Mondrian translates the remaining members into where xxx IN (xxx1, xxx2, ...) (you can see this in the execution log), and this SQL executes relatively slowly even if the xxx column has a primary key. The transformation here is to rewrite the IN clause into one or more BETWEEN ranges (or greater-than / less-than comparisons) wherever possible; this optimization can take a lower priority.
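To make the IN-to-range rewrite concrete, here is a minimal sketch. It assumes the filtered column is an integer surrogate key and that the member keys have already been resolved into a list; the class and method names are illustrative, not part of Mondrian.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative helper: collapse "col IN (k1, k2, ...)" into range predicates. */
public class InToRangeRewriter {

    /** Rewrites a list of integer keys into BETWEEN / = predicates joined by OR. */
    public static String rewrite(String column, List<Integer> keys) {
        List<Integer> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        List<String> predicates = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            int start = sorted.get(i);
            int end = start;
            // extend the run while the next key is contiguous
            while (i + 1 < sorted.size() && sorted.get(i + 1) == end + 1) {
                end = sorted.get(++i);
            }
            predicates.add(start == end
                ? column + " = " + start
                : column + " BETWEEN " + start + " AND " + end);
            i++;
        }
        return "(" + String.join(" OR ", predicates) + ")";
    }

    public static void main(String[] args) {
        // e.g. IN (1,2,3,4,7,9,10) -> (time_id BETWEEN 1 AND 4 OR time_id = 7 OR time_id BETWEEN 9 AND 10)
        System.out.println(rewrite("time_id", List.of(1, 2, 3, 4, 7, 9, 10)));
    }
}
```

Whether this actually beats IN depends on the database and its optimizer, so it would have to be measured against the target data sources first.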
The points above are about fixing Mondrian's own shortcomings, but the raw query performance of a ROLAP system like Mondrian is indeed not very satisfying. Some OLAP engines in the industry (such as Baidu's Palo and eBay's Kylin) can keep query latency at the level of seconds, while Mondrian, whenever its cache misses or is invalidated, has to execute SQL against the data source every time, which can be very slow. So this performance aspect deserves special attention. Palo built its own data warehouse: all the data to be queried is maintained by the system itself, which makes performance easy to optimize. The data Mondrian uses lives in external data sources (relational databases or big-data systems) and we cannot change the storage format, so most of the performance depends on the data source (whether there is a primary key, whether SSD storage is used, how many nodes the Hadoop cluster has, and so on). Under this restriction, I think there is only one way to improve performance: caching, and caching the important things. The problem is that the cache for a whole cube is still very large (a rough calculation of the cache volume for a cube follows below). In fact, if the cache is large enough, we have in theory built our own data warehouse, which is still worth considering.

Since the cache would serve as a data warehouse, we need an ETL-like step that loads user data into the cache in advance, that is, cube pre-computation. Unlike a real data warehouse, which stores all the data to be analyzed globally, this pre-computation only targets the data used by a particular cube. Kylin's pre-computation approach is worth learning from, but Kylin mainly solves analysis over data stored in Hadoop and queried through Hive, and because Hive's speed is really unacceptable (especially for the SQL generated by OLAP-style queries), it has to pre-compute and cache the results in HBase. Since we use an engine like Mondrian, we would like our data sources to be relational databases, Hive-like systems, or even HBase (Phoenix provides a SQL interface), so it is worth asking whether this kind of pre-computation is really necessary when the source is a relational database.
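To make "cube pre-computation" concrete: Kylin-style pre-computation materializes the aggregates for combinations of dimensions (the cuboid lattice) ahead of time. The sketch below only enumerates those combinations and the GROUP BY statements they imply; the table and column names are made up for illustration, and a real implementation would also execute the statements and store the results.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: enumerate the cuboids (dimension subsets) of a cube for pre-computation. */
public class CuboidEnumerator {

    public static void main(String[] args) {
        // Hypothetical dimension columns of a fact table named "sales_fact".
        String[] dims = {"month_id", "city_id", "brand_id"};
        String measure = "sum(sales) AS total_sales";

        // 2^n - 1 non-empty subsets; each one is a GROUP BY that could be pre-aggregated.
        for (int mask = 1; mask < (1 << dims.length); mask++) {
            List<String> groupBy = new ArrayList<>();
            for (int d = 0; d < dims.length; d++) {
                if ((mask & (1 << d)) != 0) {
                    groupBy.add(dims[d]);
                }
            }
            String cols = String.join(", ", groupBy);
            System.out.println(
                "SELECT " + cols + ", " + measure
                + " FROM sales_fact GROUP BY " + cols + ";");
        }
    }
}
```

With n dimension columns there are 2^n - 1 such group-by combinations, which is exactly why pre-computation trades storage and build time for query speed.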
Pre-computation is caching, so the unavoidable problem is cache consistency: how do we ensure that the cube's cached data is brought up to date when the underlying data changes? Of course, real-time freshness is not the most important property for an OLAP system. Updating the data every few hours, or once a day, is usually acceptable; it is enough that after the data source is updated, the cube data is updated soon afterwards and becomes eventually consistent. However, when the data source imports new data, cell values throughout the cube may change, so computing the new cell values quickly is itself a problem to solve. Some systems currently use inverted indexes (which I am not yet familiar with) to implement incremental computation; for a concrete implementation, the Kylin code is a reference.
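For additive measures such as sum and count, incremental maintenance can in principle be done by adding the delta of each new fact row to every cached cell whose coordinates cover that row; non-additive measures such as distinct count need different machinery. The sketch below shows only that additive idea against an in-memory map and only the leaf and ALL levels; the key format and class are hypothetical, not Kylin's or Mondrian's mechanism.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: apply one new fact row to additive cells keyed by "month:city:brand". */
public class AdditiveCellUpdater {

    private final Map<String, Double> cellSums = new HashMap<>();

    /** A new fact row (month, city, brand, sales) updates every aggregate that covers it. */
    public void applyFact(String month, String city, String brand, double sales) {
        // "*" stands for the ALL member of a dimension: 2^3 covering cells for 3 dimensions.
        String[] months = {month, "*"};
        String[] cities = {city, "*"};
        String[] brands = {brand, "*"};
        for (String m : months) {
            for (String c : cities) {
                for (String b : brands) {
                    cellSums.merge(m + ":" + c + ":" + b, sales, Double::sum);
                }
            }
        }
    }

    public Double get(String month, String city, String brand) {
        return cellSums.get(month + ":" + city + ":" + brand);
    }
}
```

A real cube would also have to update the intermediate levels (year, quarter, province, and so on), which is where the combinatorial cost of incremental maintenance comes from.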
The next tasks and directions are to become familiar with how Mondrian and Kylin are implemented; on that basis we can build our own OLAP engine supporting MDX queries and SQL queries (MDX is the main direction at the moment). However, Mondrian's support for MDX is still flawed. The main problem is that it currently does not support subqueries (perhaps because the intermediate results would be too large), so some advanced filtering operations cannot be expressed. For example: retrieve all the years whose total sales, aggregated by year, are below a given threshold, and then run an MDX query over the remaining data along other dimensions. This is effectively filtering the data source: first use MDX to find the years to be filtered out, then drop all the rows whose year equals one of those values (equivalent to adding a WHERE condition to every query against the fact table). Implementing such filtering in an outer layer would be far too cumbersome, so it is not yet clear how to address this functional gap. In addition, when defining a fact table, Mondrian can take not only a physical table but also an arbitrary view, which means the result of any SQL statement can serve as the fact table; this approach can satisfy the requirement above. However, every query then gets translated into SQL with several levels of subqueries, and the performance can be imagined. In the future, if we can cache the entire cube, this requirement could be met directly on top of the cube.
In addition, dimensions in Mondrian are hierarchical. When we define a month level whose parent level is year, the month-level members must carry the specific year, so if we aggregate only by month, the results are values like [1997].[1] and [1998].[1] rather than [1] or [2]. To get the latter, the level would have to be split out of every dimension hierarchy, which is undoubtedly very complicated. I do not know whether MDX has any keyword that can break the hierarchy relationship when a level is specified.
Next, let's take a closer look at the Mondrian code. The main Mondrian interfaces are really two: the first creates a connection, which includes loading the cube, and the second executes an MDX query. The query path is relatively complex and involves the cache and how SQL is generated. In addition, Mondrian's current cache structure and cache management also need close attention.
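As a reference point for those two interfaces, here is a minimal sketch of opening a Mondrian connection and executing an MDX statement through the olap4j API, which Mondrian implements. The JDBC URL, catalog path, and MDX statement are placeholders; the FoodMart names are only assumed for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import org.olap4j.CellSet;
import org.olap4j.OlapConnection;
import org.olap4j.OlapStatement;

public class MondrianQueryExample {
    public static void main(String[] args) throws Exception {
        // Register Mondrian's olap4j driver.
        Class.forName("mondrian.olap4j.MondrianOlap4jDriver");

        // Connection creation: this is where Mondrian loads the schema/cube definition.
        Connection jdbc = DriverManager.getConnection(
            "jdbc:mondrian:"
            + "Jdbc=jdbc:mysql://localhost/foodmart;"   // placeholder data source
            + "Catalog=file:/path/to/FoodMart.xml;"      // placeholder schema file
            + "JdbcDrivers=com.mysql.jdbc.Driver;");
        OlapConnection conn = jdbc.unwrap(OlapConnection.class);

        // Query execution: the MDX is parsed and cells are resolved from the cache or generated SQL.
        OlapStatement stmt = conn.createStatement();
        CellSet cellSet = stmt.executeOlapQuery(
            "SELECT {[Measures].[Unit Sales]} ON COLUMNS, "
            + "{[Time].[1997].Children} ON ROWS "
            + "FROM [Sales]");

        System.out.println(cellSet.getCell(0).getFormattedValue());
        conn.close();
    }
}
```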
Finally, a rough calculation of the size of a cube. Assume a star schema consisting of one fact table and three dimension tables: time, region, and product. The time dimension has three levels: year, quarter, and month; suppose year has 10 members, quarter has 4*10 members, and month has 3*4*10 = 120 members. The region dimension has three levels: country, province, and city, with 100, 1,000, and 5,000 members respectively. The product dimension has two levels: product category and product brand, with 16 and 500 members respectively. The whole cube is the set of possible aggregate values over all dimension combinations. Each dimension also contains a special member ALL, belonging to a special level ALL (following Mondrian's model), which is the highest level of each dimension; so the depth of the time dimension is 4, the depth of the region dimension is 4, and the depth of the product dimension is 3.

Now combine the dimensions at their lowest levels: the lowest level of the time dimension has 120 members, the lowest level of the region dimension has 5,000 members, and the lowest level of the product dimension has 500 members, so the number of possible combinations is 120 * 5,000 * 500 = 300 million. These are the lowest-level combinations; together they can be pictured as a cuboid 120 long, 500 wide, and 5,000 high, and those 300 million elements make up the whole cube. Each unit in this cube holds the aggregate for one combination (month = xxx, city = xxx, brand = xxx), and a cell contains all the measure values; that is how the entire cube is built up. This cube holds the full detail, and any other combined value can be computed by collapsing a sub-cube into a single cell. For example, to compute total sales for year = 1997, country = China, product category = food, we aggregate all 12 months of 1997, all the cities in China (assume 100), and all the brands under the food category (assume 50): 12 * 100 * 50 = 60,000 cells are combined into one new cell as the result. If we want to cache the whole cube, it is best to cache the combinations of all members at the lowest level (because lower-level information cannot be derived from higher levels; it can only come from the data source), and that cost is quite large. Generally we would index the members (assign each member a subscript), then use the combination of subscripts as the key and the combination of measure values as the value. However, when the data source inserts a single row, a large number of cell values across the whole cube change, and that incremental computation is daunting.
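Here is a minimal sketch of that key/value layout, assuming Redis through the Jedis client as the external store; the key format, cube name, and measure encoding are illustrative choices, not Mondrian's actual cache format.

```java
import redis.clients.jedis.Jedis;

/** Illustrative sketch: store lowest-level cells as (member subscripts -> measure values) in Redis. */
public class CellCacheSketch {

    private final Jedis jedis = new Jedis("localhost", 6379);

    /** Key: cube name plus the subscripts of the month, city, and brand members. */
    private String key(int monthIdx, int cityIdx, int brandIdx) {
        return "cube:sales:" + monthIdx + ":" + cityIdx + ":" + brandIdx;
    }

    /** Value: the measures of one cell, encoded as a simple delimited string. */
    public void putCell(int monthIdx, int cityIdx, int brandIdx, double sales, long quantity) {
        jedis.set(key(monthIdx, cityIdx, brandIdx), sales + "|" + quantity);
    }

    public double[] getCell(int monthIdx, int cityIdx, int brandIdx) {
        String value = jedis.get(key(monthIdx, cityIdx, brandIdx));
        if (value == null) {
            return null; // cache miss: fall back to the data source
        }
        String[] parts = value.split("\\|");
        return new double[] {Double.parseDouble(parts[0]), Double.parseDouble(parts[1])};
    }
}
```

With 300 million lowest-level cells, even a compact encoding like this multiplies out to many gigabytes, which is exactly the cache-volume versus query-speed trade-off discussed next.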
Setting incremental computation aside, this is really a trade-off between cache volume and query speed. If the cache capacity is insufficient, we are forced to cache higher-level information, and then queries at the lower levels still have to go through the data source, so performance is poor. If we cache the combinations of members at the lowest level, no query has to touch the data source and everything is computed directly in memory. The best approach is to decide based on which levels are queried most often and how many members each level has: a level with too many members is not well suited to caching, while a level that is queried frequently is a better candidate.
This is a tough task, so let's take it step by step. First, move Mondrian's cache out of the process and see whether the cache structure can be improved compared with the current one. Second, analyze the SQL that Mondrian executes and check whether there is room for optimization. Third, store the pre-computed results in the cache, without worrying about dynamic incremental updates at this stage. Finally, consider how to perform incremental computation.