This article will simulate a data warehouse system with user data, product data, and order data. Create a multi-dimensional dataset based on the data structure and process it incrementally.
The incremental approach is to consider the growth of data in fact tables. Assuming that it will grow to several billion in the future, full processing will become unrealistic, therefore, the solution focuses on the incremental processing of multi-dimensional datasets.
The key to incremental processing of multi-dimensional datasets is to divide fact data into two parts: an incremental fact table and a historical fact table. For the first time, a multi-dimensional dataset processes data in a historical fact table, each Periodic processing in the future is to process the data in the incremental table.
The and Visual Studio mentioned in this article are both version 2008 and Version 2005 are also applicable.
Data hypothesis: A User table, a product table, an order table, and who bought the order. The demand for multi-dimensional dataset statistics is to count who bought what according to the order.
First, create a data warehouse and create a bidemo library under the Data Engine.
Next, create a user table with the following structure:
There are also product tables:
As well as historical order tables and incremental order tables, their structures are the same:
For testing convenience, we add some test data to the user table:
Then add some test data to the product table.
As for fact tables, it is unrealistic to manually add Test Data. Therefore, a program is written here to add test data using random numbers:
The code of this program can be found in this article. The generated data is basically as follows:
By now, the data structure and data of the test have been prepared, which is equivalent to a small data warehouse.
Next, create a bi solution in Visual Studio. The solution is divided into an SSIS project and an SSAs project.
To create a data source and Data Source view under the SSAS project, you must note that the fact table uses a historical table instead of an incremental table, even though it does not have data.
First, create a data source, connect to the database you just created, and define the relationship in the data source view, such:
Then, create a multi-dimensional dataset Based on the Data Source view. You need to select the History Table for the measurement and the user and product tables for the dimension.
Finally, deploy the multi-dimensional dataset. You only need to deploy it here. You do not need to process it. The processing task will be processed in the future SSIS package.
The following describes the SSIS project. Create four task modules in the SSIS package. The types are as follows:
The first two cube processing modules are used to process multidimensional datasets. Data streams are used to import incremental fact table data to historical fact tables. Finally, an SQL task is executed to delete the data in the incremental table.
Two multi-dimensional dataset modules, the first one is dedicated to processing dimensions, and the second one is processing cubes. Here, we need to put the Multidimensional Dataset dimension processing in front of each other because, in my experience, although all the multidimensional datasets are processed, the newly added dimension data is not aggregated into it, therefore, you need to put it in front of it.
(Remark: I have never understood this topic very well. It is reasonable to say that since it is all handled, how can I not even deal with the dimension? I still need to submit a ticket)
The following is the dimension processing module. On the page, select a dimension.
Then the cube processing module, such.
Then, specify the incremental update table and configure the incremental update table. Here, specify the incremental table.
After the multi-dimensional dataset is processed, you can put the incremental table data into the history table to ensure that the data added to the next day is incremental data.
It should be noted that in actual operation, it is necessary to ensure that no data occurs in the business system during the Bi processing process, otherwise it will cause data omission and lead to inequality. Therefore, Bi is generally processed in the early morning.
Next is the data flow module in step 3. The main task of this part is to transfer the incremental table data to the historical table.
The last SQL task is a delete or truncate table task, which clears the data in the incremental table.
The final task process is as follows:
Run the package, as shown in:
After successful execution, open the history table and you can find that the data is already in it, and the data in the incremental table does not exist.
Query the multi-dimensional dataset and you can see that the new data is aggregated to it.
The Pivot table above clearly shows who bought the product.
Run rubbish again to add several pieces of data to the incremental table, and then run the SSIS package again. You can find that the new data has been aggregated into the multi-dimensional data set. Note that the processing method is incremental.
The data structure model mentioned in this article is very simple. It mainly introduces the process and method of multi-dimensional dataset processing, focuses on the incremental part of the solution, and issues needing attention. I hope that some brothers who know better ways can discuss and discuss them together.
Download related databases, project files, and programs mentioned in this Article
FAQ:
1. How does incremental data come from?
I personally think this should be done with the business system, such as adding triggers. You can also use the timestamp to extract data from the business system.
2. What should I do if there are any updates or deletions?
Generally, a measurement value bit is added to the scheme mentioned in this article and marked as 1 to indicate the addition. A deleted record is actually added with the same record and expressed as-1. Two records are added for an update. The value of-1 indicates that the record is deleted. The value of 1 indicates that the modified record is deleted, time stamps are used to mark the modified records. This identifier is used as a measurement for statistics.