The common dwh architecture is as simple as figures 2 and 3. Generally, for an enterprise, the data lifecycle is 5-7 years, especially for detailed data. The lower the data granularity level, the shorter the lifecycle, the higher the data granularity and the longer the lifecycle. For the flow account data of a bank, you may want to query the details of the record 10 years later, but the probability is very low, it is impossible for our company to keep the data for the past 10 years in a database with Real-Time query in order to have only one thousandth operation probability after 10 years, of course, I am only talking about the data warehouse of the current application, and the transaction system is another matter.
Figure 1 does not explicitly show how to process historical data, but is just a simple application architecture.
Figure 1:
The difference between figure 2 and Figure 1 lies in the difference between 2 and 3 (only in terms of the graphical representation), but the meaning is actually different, although dwh (data warehousing) compared with dwhhistory (Data Warehouse history database), data is essentially the same, but we can predict the efficiency is different. Their storage is significantly different in order of magnitude.
Figure 2:
Figure 3 only merges images 1 and 2, but we can see the structural integrity.
Figure 3:
Before proceeding, let's clarify several important words.
1. Data granularity: The data includes summarized data and detailed data. The daily data is detailed data, and the annual data is summarized data every month. Data granularity exists along with dimensions. The lower the dimension level, the lower the data granularity.
2: Data Life Cycle: data has a life, and its usage frequency is his life. The data in the current month is generally the highest life value, and the longer the data is, generally, the usage frequency is lower. Therefore, when planning the dwh architecture, we can sort the data according to the business usage value, storage space size, and cost, it can be defined as different lifecycles. For example, day, month, and year are simple time dimensions. The Life Cycle of daily data can be defined as 2 years, and the monthly data can be defined as 5 years, the annual data can be defined as 10 years. If the current date is January 1, January 1, 2008, the system can summarize the data from January 1, 2005 and earlier from monthly and annual data, and then put all the daily data before January 1, January 1, 2006 into the historical database, delete the relevant data in the current data warehouse, summarize all the data from the month before January 1, January 2003 as the annual data, load the data to the historical database, and then delete the data. This is also the case for handling annual data. For example
Next, let's talk about problems with historical data.
How to Apply historical data? There are two ways to apply historical data. The first is to create a web application for the historical data warehouse separately, and the second is to use it with the current database, but to differentiate the data source query. For example, you can set an interface for querying historical information and an interface for querying the current application. If the historical database and the current database are both on the same server, you can use the view to query.
How to store historical data? Historical data can be stored in different servers of the current database or different databases of the same server. If the same server is used, the cost of data management is relatively low. At that time, it was not stable enough because the server crash can directly cause the two databases to be offline. However, the cost of storing data on different servers increases, but it is very convenient for applications.
What is the structure of historical data?? The structure of the historical data table, involving the table name and column name. If the data is distinguished by time, enterprises with large data volumes can store historical data by year. For example, the name of the table in the current database is t_name, in this case, the data in the historical database for June 2000 can be t_name_2000. Of course, these tables must have timestamps. Generally, tables with a large amount of data without timestamps are rare, if this happens, you can place it in the same table. The column name can be consistent with the current database or different from the current database. If it is different, you must create a view to map the source table.
How to load data in a simple and effective manner?We recommend that you use a log table. Each data interaction between dwh and dwhhistory is described in detail. There are several types of records
L batchid)
L operation time
L source database name
L target database name
L source data table
L target data table
L start time of data migration (for example, if you want to migrate data from 2000-1-1 to 2000-12-31 to dwhhistory, the time here is 2000-1-1)
L end time of data migration (same as above, 2000-12-31)
L loading start time (from when to start ETL loading)
L loading End Time (when to complete ETL loading)
L load data volume (quantity of data in this batch operation)
L summary dimension Granularity
L operation mark (delete, insert or update)
L complete identification
L load description
L error message
L number of reloads
How to present data in a data mart? If the dataset contains both monthly and annual data of two years and detailed daily data of two years, the same metric has data of different granularities. Simply put, this cannot be displayed in the same cube, because the cube is automatically aggregated unless dimensions are processed according to measurement variables. Therefore, it is best to separate the detailed data presentation from the total monthly data and the annual data presentation.
How does one deal with the need to analyze historical detailed data in a dataset?For such an application, We can insert historical data into the dataset, because the data in the data mart does not need to be reversed, that is, unlike dwh, data needs to be written back to dwhhistory.