Today another friend ran into slow processing of an SSAS database. The aggregated data volume is genuinely large, and each processing run takes more than thirty minutes. Is there any way to reduce the processing time?
I have worked in BI for seven years, and questions like this would definitely make the top 10 FAQ in this field. These problems are often a bit complicated, so here I will share some experience based on scenarios I have encountered.
Due to space limitations, I will only outline the solutions. For detailed steps I will link to my other articles for reference during actual operation, and many of the suggestions also cite the official documentation so that you can dig into more detail.
Improve the query efficiency of the relevant tables at the data warehouse layer
When an SSAS database is processed, SQL queries are issued against the data warehouse layer, so optimizing the corresponding dimension and fact tables is the key to this step.
I once saw a project whose fact table was actually a view built from complicated calculations and joins, so every time the cube was processed there was a long wait before the query even started returning data. I later suggested regularly materializing the view's data into a table, so that every read of the fact table hits the table directly instead of going through the view.
This is the simplest and most direct approach: make the fact data "physical", so that the view's logic is computed once and the results are saved to a table, ensuring that query and analysis applications can get results quickly.
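As a minimal sketch of this materialization (the table, view, and column names here are hypothetical), a scheduled job can reload a physical fact table from the slow view before the cube is processed:

```sql
-- Hypothetical objects: vw_FactSales is the expensive view, FactSales is the physical table.
-- Run on a schedule (for example, a SQL Agent job) before cube processing starts.
TRUNCATE TABLE dbo.FactSales;

INSERT INTO dbo.FactSales (DateKey, ProductKey, SalesAmount)
SELECT DateKey, ProductKey, SalesAmount
FROM dbo.vw_FactSales;   -- the complicated joins and calculations run only once, here
```

This way the cost of the view is paid once per load cycle instead of on every cube processing run.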
The rest is basic database optimization, such as index tuning, plus big-data solutions such as Hadoop or PDW. That goes far beyond the scope of this article, so I will not explain it in detail here.
Incremental update
This is the most common method. If each cycle generates 100 MB of data, the first few processing runs may be fine, but if your processing cycle is weekly or daily, the historical data keeps growing over time, and it is not wise to reprocess all of it every time. Therefore, we need to process the data incrementally.
In SSAS, incremental processing requires you to specify an incremental query. In other words, you need a strict data flow: prepare the incremental data before incremental processing, and handle it properly afterwards (for example, in the relevant tables or views) to avoid doubling the data through repeated incremental processing.
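A minimal sketch of this "prepare, then mark as processed" flow, assuming a hypothetical batch-id column on the fact table and a hypothetical etl.ProcessLog table that records what has already been fed to the cube:

```sql
-- Hypothetical objects: FactSales carries a LoadBatchId, and etl.ProcessLog records
-- the last batch that was already incrementally processed into the cube.
CREATE VIEW dbo.vw_FactSales_Increment
AS
SELECT f.DateKey, f.ProductKey, f.SalesAmount
FROM dbo.FactSales AS f
WHERE f.LoadBatchId > (SELECT MAX(ProcessedBatchId) FROM etl.ProcessLog);
```

Point the incremental (ProcessAdd) query at this view, and after the processing run succeeds, insert the new batch id into etl.ProcessLog so the same rows are never picked up twice.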
If the data in the warehouse can be updated, you can also consider a 1-1+1 scheme when designing the warehouse. This is only one specific approach, and you can adapt the design to your own system.
For the detailed procedure, refer to one of my earlier notes:
BI Notes - Incremental Processing of Cubes
That article describes how to generate test data and then uses it to demonstrate a basic incremental update, which should give you a basic understanding of incremental updates to cubes.
Create partitions
Like tables in a database, SSAS cubes can also be partitioned. In theory, creating partitions does not by itself speed up processing significantly; the real reason it is listed here is that partitions let you implement "incremental update" indirectly.
The previous section shows how fiddly a true incremental update is. With partitions you can be lazier. The idea is to partition the cube along some dimension, by time or by space; for example, create partitions by month on the time dimension, and then process only the partition that the new data falls into.
The key to this method is how to automatically identify the partition that needs to be processed. Personally, I believe cube design should follow a strict standard, for example a strict naming rule for partitions, so that code can easily locate the right partition.
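As an illustration (the object names and naming convention are hypothetical), monthly partitions are usually bound to source queries that differ only in their date filter, so a partition named, say, FactSales_201907 maps directly to a query like this:

```sql
-- Hypothetical source query bound to a partition named FactSales_201907.
-- Every monthly partition uses the same query with a different date range,
-- so a processing script can derive both the partition name and its query from the month.
SELECT DateKey, ProductKey, SalesAmount
FROM dbo.FactSales
WHERE DateKey >= 20190701 AND DateKey < 20190801;
```

With such a convention, the processing script only has to work out which month the new data belongs to, build the corresponding partition name, and issue a process command for that single partition.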
For the specific operation, refer to another of my earlier notes:
BI Notes - A Solution for Cube Incremental Processing
That article describes how to programmatically locate the partitions to be processed according to the agreed rules and then process them.
How large a cube partition is "too large" is also a frequently asked question. The documentation gives a reference:
Split a large partition, that is, one containing more than 20 million rows or larger than 250 MB, into smaller partitions to improve performance.
Source:
http://technet.microsoft.com/zh-cn/library/bb630302(v=sql.105).aspx
This is only a general reference; for the row-count guideline you also need to consider the data size of each row.
Set dimension attribute relationships properly
Set the attribute relationships within each dimension reasonably, and set the relationship type (rigid or flexible) appropriately. Below is a brief excerpt of the relevant content from the Microsoft documentation.
On attribute relationships, the documentation says:
Attribute relationships have the following advantages:
Reduce the amount of memory needed for dimension processing, which speeds up the processing of dimensions, partitions, and queries.
Improve query performance, because storage access is faster and execution plans are better optimized.
If user-defined hierarchies are defined along the relationship paths, the aggregation design algorithm can select more effective aggregations.
Reference address:
http://technet.microsoft.com/zh-cn/library/ms174878.aspx
The following is an excerpt describing rigid versus flexible relationship types:
Indicates whether member relationships change over time. The values are Rigid and Flexible: the former means that relationships between members do not change over time, and the latter means that they do. The default is Flexible. If you define a relationship as flexible, aggregations are deleted and recomputed as part of an incremental update (they are not deleted if only new members are added). If you define a relationship as rigid, Analysis Services retains the aggregations when the dimension is incrementally updated. If a relationship that is defined as rigid actually changes, Analysis Services generates an error during incremental processing. Specifying the appropriate relationships and relationship properties improves query and processing performance.
Reference address:
http://technet.microsoft.com/zh-cn/library/ms176124.aspx
In general, although setting attribute relationships and relationship types does not have an obvious impact on processing time, doing so is a good standard and habit when designing an SSAS database.
Coarsen the data granularity
Many projects collect data at a very fine granularity just to make the data warehouse "big enough". For example, a system may collect one GB of data per day, yet after going through all the reports you find that most of them are at month granularity and only a few are at day granularity.
Of course, I do not deny that the finer the data granularity, the easier it is to discover useful information. However, at the SSAS database layer, ordinary statistical analysis does not require a very fine granularity. You can consider rolling fact data up one level, for example from seconds to hours, or from hours to days, thereby reducing the number of fact rows.
For fine-grained analysis, I suggest keeping only the data of a recent period, which usually satisfies most requirements. How far to raise the granularity is a matter of judgment; I recommend re-examining the granularity against your actual needs.
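A minimal sketch of such a roll-up, with hypothetical hour-grain and day-grain fact tables (the cube's measure group would then be bound to the smaller day-grain table):

```sql
-- Hypothetical roll-up from an hour-grain fact table to a day-grain fact table.
-- Grouping by DateKey drops the hour level and shrinks the row count accordingly.
INSERT INTO dbo.FactSalesDaily (DateKey, ProductKey, SalesAmount)
SELECT DateKey, ProductKey, SUM(SalesAmount)
FROM dbo.FactSalesHourly
GROUP BY DateKey, ProductKey;
```

The hour-grain table can then be kept only for a recent window if fine-grained analysis is still needed.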
In short, the principle is: when resources are limited, spend them where they matter most, and then design separately according to the characteristics of different requirements.
Sample the data during development and testing
During development and testing, it is not necessary to run against all of the historical data. The main reason is that every step may involve a long wait, and when development or testing fails or an error is found and the process is corrected, you have to run everything again from scratch.
You might think that a process that runs overnight is fine, since you can see the results when you get to work the next day. But if a bug is found during the subsequent testing and data-verification steps, that means another night, and the project schedule becomes hard to guarantee. Even for a one-hour process, how many times a day can you repeat the develop, test, and verify cycle?
Therefore, I suggest using only a small amount of data during development and testing. For example, out of ten years of data use only one year or one month, or out of all product brands use only one or a few brands to test the project's entire BI process. Final verification is also done on this small slice of data; once the small data set processes successfully, then process the complete data.
The sample can be restricted in the data source view or controlled dynamically through partitions. I personally suggest the former, which is easier to operate and does not require changing the cube structure frequently.
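In the data source view the restriction can be done by replacing the fact table with a named query; a minimal sketch, assuming hypothetical table and column names and a one-brand, one-year sample:

```sql
-- Hypothetical named query used in the data source view during development:
-- it replaces the full fact table with a single-brand, single-year slice.
SELECT f.DateKey, f.ProductKey, f.SalesAmount
FROM dbo.FactSales AS f
JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
WHERE p.Brand = 'BrandA'
  AND f.DateKey BETWEEN 20190101 AND 20191231;
```

When moving to full processing, switch the named query back to the full table; the cube structure itself stays the same.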
Summary:
To address slow processing, you can choose different solutions for different scenarios from the angles of performance, method, and design.
In addition, you can refer to the design warning rules (Analysis Services - Multidimensional Data):
http://technet.microsoft.com/zh-cn/library/bb630321(v=sql.105).aspx
In short, there are many ways to solve this problem. Here I have only listed some common scenarios and my personal suggestions; you are welcome to raise other representative problems for further discussion.
Finally, I hope this article will help you.