My view of the data warehouse (a design article)

Building a data warehouse: what do you want to do?
Generally speaking, building a data warehouse involves two main areas of work:

1. Designing the interface to the operational databases.

2. Designing the data warehouse itself.

It sounds simple, but it is not. Suppose I am the database designer: I can charge ahead without thinking twice, load some data first, and hand it over to the DSS analyst (don't forget, the DSS analyst is where the data warehouse requirements really come from). Once he has looked at it and given his opinion, it is still not too late to adjust.

Next, I will work through this the way a schoolbook does: raise the questions first, then solve them one by one.


What are the main difficulties in building a data warehouse?
First, let's correct a widespread misconception: that building a data warehouse is simply a matter of extracting data from the operational systems. That is wrong, mainly because operational data is largely non-integrated (who has ever seen a billing program that can total bill entries across several years?). You cannot just pull out what you actually need, such as this month's average cost or Ma Lei's overtime this month. The reason is obvious: operational data exists to serve the applications, and each system or application has its own "independence"; when it was developed, nobody imagined that the old accounts would ever be dug up again.

So let's look at the problem in a new light: if it is not just extraction, then what are the real problems? They are as follows.

The first question: system integration. When hundreds of tables are thrown together and you need to produce statistics across them, can you be sure that a field in this table and a field with the same name in another table mean the same thing? Or, conversely, that a field in this table and a differently named field in another table have nothing to do with each other? These problems boil down to one thing: the lack of integration across systems. The only remedy is careful design of your own database, and that takes patience. Then there is the problem of field conversion. Look at this example: gender (Sex) is expressed in many different ways across databases; it may be written as M/F, or as 0/1 to represent male/female, and so on. What do you do? To ensure that the data brought into the data warehouse is correct, we must build mappings (sorry, put simply: values that mean the same thing but are expressed differently must be converted into one common form), and that, too, is patient work.
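As a minimal sketch of this mapping work (the source encodings and the field names customer_id and sex are invented for illustration, not taken from any real system), the conversion amounts to a small lookup applied while records are copied toward the warehouse:

```python
# Sketch: unify differently encoded "gender" values while loading records.
# The source codes ("m"/"f", "0"/"1", etc.) and field names are assumptions
# for illustration, not taken from any particular system.

GENDER_MAP = {
    "m": "M", "f": "F",         # source A uses m/f
    "0": "M", "1": "F",         # source B uses 0/1
    "male": "M", "female": "F", # source C spells it out
}

def unify_gender(raw_value):
    """Convert any known source encoding to the warehouse form (M/F)."""
    key = str(raw_value).strip().lower()
    if key not in GENDER_MAP:
        raise ValueError(f"unknown gender encoding: {raw_value!r}")
    return GENDER_MAP[key]

def integrate_record(source_record):
    """Map one operational record into the warehouse layout."""
    return {
        "customer_id": source_record["id"],
        "gender": unify_gender(source_record["sex"]),
    }

print(integrate_record({"id": 42, "sex": "f"}))   # {'customer_id': 42, 'gender': 'F'}
print(integrate_record({"id": 43, "sex": "0"}))   # {'customer_id': 43, 'gender': 'M'}
```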

The second question: the efficiency of getting data out of the existing systems. This is a real issue: when a large number of tables and files have to be scanned, who can say exactly which files have already been scanned and which have not? And if the existing system holds a great deal of data and you end up scanning the whole database just to pick out a small part of it, that is a tragedy. Nobody wants that to happen; concrete solutions are given below.

Before asking "how do we avoid these problems", first sort out what kinds of loading work move data from the operational environment into the data warehouse (which one would you choose?):

• Loading archived (historical) data. (Think of the dusty old ledgers and you know what these files are.)

• Loading the current data held in the operational systems. (Data that is in the systems but has not yet been archived.)

• Loading the changes that have occurred in the operational environment (updates to the databases) since the last refresh of the data warehouse.

The first option is very simple: who can't flip through old ledgers? So the difficulty is small. But as a DSS analyst with current data at hand, would you really want to analyze data from ten years ago? Many enterprises have found that in most environments using this old archived data is more trouble than it is worth.

The second option is not difficult either, because it only needs to be done once. Usually we can dump the operational environment into a sequential file, and that sequential file can then be loaded into the data warehouse without disturbing the online environment. (Sounds pretty good.)

The third option is rather more complicated, because the database keeps changing while you load, and capturing those changes efficiently is not easy. That is why scanning the existing files (or tables) has become a major headache for data warehouse architects. What to do, what to do... In fact there are several ways; five of them:

1. Scan time-stamped data. If records carry a timestamp, you can tell clearly which data has been updated recently, and at the very least you can avoid data that is inconsistent in time; a small sketch of this technique follows the list. (Unfortunately, not much data actually carries a timestamp.)

2. Scan a delta (incremental) file. (What is a delta file? I am not entirely sure, but it is clearly generated by the application and records only the data that changed.) Unfortunately, not many programs produce delta files. :(

3. Scan audit files and log files. These are essentially the same idea as a delta file, except that they carry a bit more useless data and the interface programs are harder to write; otherwise, no harm done. :)

4. Modify the application code. (This seems a lot to ask: in order to design the data warehouse, you make other people rewrite their own applications.) It is not commonly used, since much application code is old and hard to modify. :(

5. The fifth way is... no way! Just kidding. Everything in the book tries to talk us out of it, so I will just say a few words: take image snapshots of the data at intervals and compare them to find the differences. Even so, I feel the method is not only troublesome and complex but also consumes all sorts of resources, so you really don't have to use it. :)
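Here is a minimal sketch of the first technique, scanning time-stamped rows. The table and column names (accounts, last_updated), the ISO date format, and the idea of remembering the previous refresh time are all assumptions made for illustration; sqlite3 merely stands in for the operational database.

```python
# Sketch of change capture via timestamps: pull only the rows touched since
# the previous warehouse refresh.
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id TEXT, balance REAL, last_updated TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [("A-1", 120.0, "2024-04-20T10:00:00"),
     ("A-2", 95.5,  "2024-05-03T09:30:00")],   # only this row changed after the refresh
)

def extract_changes_since(conn, last_refresh):
    """Return operational rows updated after the previous warehouse refresh."""
    cur = conn.execute(
        "SELECT account_id, balance, last_updated FROM accounts WHERE last_updated > ?",
        (last_refresh.isoformat(),),
    )
    return cur.fetchall()

last_refresh = datetime(2024, 5, 1)   # remembered from the previous load run
for row in extract_changes_since(conn, last_refresh):
    print(row)                        # ('A-2', 95.5, '2024-05-03T09:30:00')
```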

The third problem: the shift in time basis is hard to handle. Existing operational data normally holds the current value; it is accurate at the moment of access and it can be updated. Data in the data warehouse, by contrast, cannot be updated, so every record must carry an element of time. In practice this means that when data moves from the operational systems into the warehouse, it has to undergo a fairly sweeping change of form. At the same time you have to think about condensing the data: there is no way around it, time-variant data keeps growing, and the space in the data warehouse is limited!
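As a small sketch of what "adding the element of time and condensing" might look like (the record layout and the keep-the-latest-value-per-month rule are assumptions for illustration), a stream of current-value updates can be reduced to one dated record per month:

```python
# Sketch: condense a stream of current-value balance updates into one
# time-stamped, non-updatable record per month.
updates = [  # (account_id, "YYYY-MM-DD", balance) as seen in the operational system
    ("A-1", "2024-04-03", 120.0),
    ("A-1", "2024-04-28", 95.5),
    ("A-1", "2024-05-02", 300.0),
]

def condense_to_monthly(updates):
    """Keep only the last balance seen in each month, stamped with that month."""
    latest = {}  # (account_id, "YYYY-MM") -> (day, balance)
    for account_id, day, balance in updates:
        key = (account_id, day[:7])
        if key not in latest or day > latest[key][0]:
            latest[key] = (day, balance)
    return [
        {"account_id": acc, "month": month, "balance": bal}
        for (acc, month), (_, bal) in sorted(latest.items())
    ]

for record in condense_to_monthly(updates):
    print(record)
# {'account_id': 'A-1', 'month': '2024-04', 'balance': 95.5}
# {'account_id': 'A-1', 'month': '2024-05', 'balance': 300.0}
```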

So far we have covered three problems and their solutions, but that is still not enough to build a data warehouse of our own, because we have not yet learned a concrete method. That is what the next section is for!
Data/process model and Architecture design methodology
First, let's introduce two concepts: process modeling and data modeling. Simply put, process modeling is like the flowchart we used to draw before writing a program: it has a beginning and an end. Data modeling is more like being handed cabbage, radish, vinegar and salt and being asked what you can cook; naturally you answer: vinegar, cabbage and radish soup. There is no deep reason it has to be that dish; it is simply what the ingredients allow. :)

Process modeling is of little use in designing a data warehouse, because it is requirements-driven: it assumes the requirements are already known when detailed design begins, and that assumption simply does not hold when you set out to build a data warehouse!

The data model fares much better: it applies to both sides! (Hehe, like superglue.) When building data models, there is no need at first to worry about the differences between the existing operational systems and the data warehouse. What you do sounds simple: build an enterprise data model, then derive a data warehouse model from it, and ideally an operational data model as well, which gives you:

Enterprise model → operational model → data warehouse model

All three are important, and they differ from one another. (A bit like the chicken-and-egg relationship.)

Looking just at the data model, it is built at three levels: high-level modeling (the entity relationship diagram, ERD), mid-level modeling (the data item set, DIS), and low-level modeling (the physical model). Construction runs from the top down: it is as if we first sit together and discuss the overall structure, then begin designing the middle tier (because the data the ERD calls for is not easy to extract directly and needs some combining work), and finally design the low-level model on the basis of the middle tier (the data for the low-level model can be obtained from the operational data).

I will not go into more depth here; this leaves you something to ponder on your own (and this book is not a dedicated text on modeling anyway).

Feeling a little dizzy? Data modeling this, three levels that... Don't worry: once you read on with these questions in mind, they will soon resolve themselves. My suggestion is to write your own questions down, so that you don't forget them by the time you reach the part that answers them. :)

Data modeling is also a building-block process: each design result is its own block, and only after you have gathered all the blocks can you complete the puzzle (one task at a time).

That, then, is the design method for the data warehouse: data modeling. Next come a few details of data warehouse design. (This part can be a bit dry.)
Normalization/denormalization
The purpose of this step is to reduce the system's I/O time. The approach can be summed up briefly: keep the design normalized where that serves you, but where I/O cost dominates, merge tables or introduce redundant data (denormalization).
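A tiny sketch of the trade-off, with invented tables and fields: the denormalized order row carries the customer name redundantly so that a common report query needs one read instead of a join.

```python
# Sketch: denormalization as pre-joining.  The normalized design needs two
# lookups (order row, then customer row); the denormalized order row stores
# the customer name redundantly so one read answers the common query.
customers = {1: {"name": "Ma Lei", "city": "Beijing"}}

orders_normalized = [
    {"order_id": 101, "customer_id": 1, "amount": 250.0},
]

orders_denormalized = [
    # customer_name is stored redundantly to avoid the join at read time
    {"order_id": 101, "customer_id": 1, "customer_name": "Ma Lei", "amount": 250.0},
]

# Normalized: two "I/O operations" (order row + customer row).
o = orders_normalized[0]
print(o["amount"], customers[o["customer_id"]]["name"])

# Denormalized: one "I/O operation" per order, at the cost of redundancy.
d = orders_denormalized[0]
print(d["amount"], d["customer_name"])
```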


Snapshots of the Data Warehouse
A snapshot is a detailed record written when some event occurs. For example: you spend a lot of money on something you have long wanted, and suddenly discover that the living expenses for the rest of the month are gone. That is the event, and the resulting snapshot looks like this:

(1) Time | (2) Key code | (3) Place, goods, amount, ... mood when buying | (4) Account balance, ... mood after buying

It is not hard to see that the third part holds the primary raw data, while the fourth holds secondary data captured as a consequence of the event (loosely related, and optional). A snapshot, then, is a faithful record of an event, and it should include the following (a small sketch follows the list):

• The key code.

• The unit of time.

• Only the primary data directly related to the key code.

• Secondary data, captured at the moment of the snapshot, with no direct relationship to the foregoing.
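Here is a minimal sketch of that snapshot structure in Python; the field names and the example values are assumptions for illustration.

```python
# Sketch of the snapshot structure described above: key, unit of time,
# primary data tied to the key, and secondary (circumstantial) data.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Snapshot:
    key: str                      # key code identifying the event's subject
    unit_of_time: date            # when the snapshot was taken
    primary: dict = field(default_factory=dict)    # data directly related to the key
    secondary: dict = field(default_factory=dict)  # incidental data, no direct relation

snap = Snapshot(
    key="purchase-0001",
    unit_of_time=date(2024, 5, 20),
    primary={"place": "electronics market", "item": "camera", "amount": 3200.0,
             "mood_when_buying": "thrilled"},
    secondary={"account_balance": 87.5, "mood_after_buying": "worried"},
)
print(snap)
```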


Metadata
Metadata is data about the data: for example, when the warehouse was loaded the first time and the second time, where the source data came from, what its structure is, the history of the extractions, and so on.
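As a small sketch (the field names are assumptions for illustration), a metadata record for one extraction might simply look like this:

```python
# Sketch: one metadata record describing a single extraction into the
# warehouse -- source, structure, mapping, and extraction history.
extraction_metadata = {
    "target_table": "dw_accounts",
    "source_system": "billing_oltp",            # where the data came from
    "source_table": "accounts",
    "structure": ["account_id", "balance", "snapshot_month"],
    "mapping": {"acct_no": "account_id", "bal": "balance"},  # source -> warehouse fields
    "extraction_history": [
        {"run": 1, "loaded_at": "2024-04-01T02:00:00", "rows": 10500},
        {"run": 2, "loaded_at": "2024-05-01T02:00:00", "rows": 10620},
    ],
}

# A DSS analyst can ask the metadata, rather than the data, where a column came from.
print(extraction_metadata["mapping"])
print(extraction_metadata["extraction_history"][-1])
```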






Managing reference tables in the data warehouse
Reference data in the data warehouse plays the role of a yearbook. Since the purpose of the warehouse is to provide a basis for reference, generating reference data regularly can reduce the total volume of data held in the warehouse. This is not hard to understand: once you have the reference data, there is naturally no need to keep all of the old ledgers.

There are two ways to build such a reference table:

1. Take a complete snapshot of the reference table at regular points in time.

2. Take one base snapshot of the reference table, and from then on record every change to it.
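Here is a minimal sketch of the second approach, a base snapshot plus a change log from which the reference table can be rebuilt for any date; the table contents and field layout are invented for illustration.

```python
# Sketch: one base snapshot of a reference table plus a change log, from
# which the table can be reconstructed as of any date.
base_snapshot = {            # reference table as of 2024-01-01
    "BJ": "Beijing branch",
    "SH": "Shanghai branch",
}
changes = [                  # (date, key, new_value); new_value None means deleted
    ("2024-03-01", "SZ", "Shenzhen branch"),
    ("2024-06-15", "SH", "Shanghai east branch"),
]

def reference_as_of(day):
    """Rebuild the reference table as it stood on the given date."""
    table = dict(base_snapshot)
    for change_day, key, value in changes:
        if change_day <= day:
            if value is None:
                table.pop(key, None)
            else:
                table[key] = value
    return table

print(reference_as_of("2024-04-01"))   # includes Shenzhen, old Shanghai name
print(reference_as_of("2024-07-01"))   # includes the renamed Shanghai branch
```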


Data cycle
The data cycle is the length of time between a change occurring in the operational environment and that change being reflected in the data warehouse. For example, when a bank customer moves house, the new address goes into the operational data; some time later the data warehouse picks up the change and records it. That interval is the data cycle.

The question is how long this cycle should be. In principle it should be at least 24 hours, for the sake of data stability and to keep costs down.


Complexity of transformations and integrations
There is a lot of material here, but it is very fragmented, more like a recounting of experience, so I will leave a little of it for you to research on your own. (Yes, I am being lazy.) Such is the business of building the warehouse.


Triggering Data Warehouse Records
Writing a record into the data warehouse requires an event, and the event should be a significant activity: significant enough that its occurrence cannot be ignored. Put simply, it is like a button that, when pressed, pops up a dialog box. When such an event is captured, a snapshot of it is added to the data warehouse. Simple, isn't it? Perhaps you want to know what kind of event, and how it is triggered. Say an important customer calls you to change the delivery address. OK! Your response is probably to find the shipping record and the customer record (this becomes the snapshot), record the changed delivery address (the secondary data), and write it all into the data warehouse. See?
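As a small sketch of such a trigger (the record layouts and the handler name are assumptions for illustration), the event handler both updates the operational record and appends a snapshot to the warehouse:

```python
# Sketch: an event (a customer changing the delivery address) triggers a
# snapshot that is appended to the warehouse.
from datetime import datetime

warehouse = []   # append-only store of snapshots

def on_delivery_address_changed(order, customer, new_address):
    """Capture the event as a snapshot and write it to the warehouse."""
    snapshot = {
        "key": order["order_id"],
        "unit_of_time": datetime.now().isoformat(timespec="seconds"),
        "primary": {"customer_id": customer["customer_id"],
                    "old_address": order["ship_to"],
                    "new_address": new_address},
        "secondary": {"order_amount": order["amount"]},
    }
    warehouse.append(snapshot)          # never updated, only appended
    order["ship_to"] = new_address      # the operational side still updates in place

order = {"order_id": "SO-77", "ship_to": "old street 1", "amount": 980.0}
customer = {"customer_id": "C-12"}
on_delivery_address_changed(order, customer, "new road 99")
print(warehouse[-1])
```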


Managing Data warehouses
The purpose of management is simple: data that should go, goes; data that should stay, stays; what should be summarized gets summarized; and expired data is not allowed to occupy precious space. Hehe, easier said than done: everyone knows that some day a user will suddenly insist on digging up the old accounts, and if anything is missing by then you are in trouble. So the correct way to handle it is: #¥%...!#. Didn't catch that? Aha, sorry, that was a foreign language, hee hee. Summed up, there are two points:

1. Use simple rolling summarization to condense the data. There is a question of how far to go: do not collapse the data all the way in one pass, and do not discard all of the detail at once; let the first, light pass of summarization provide the basis for the next one. (A small sketch follows this list.)

2. Keep backups of the data as well. The safest way is to find a CD-ROM or a tape, put it in the safe, and you are done. What? Expensive and slow to retrieve? That sounds fine to me: when a user comes asking, you can charge her for the lookup, and I make a small fortune. :)
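Here is a minimal sketch of the first point, rolling summarization; the record layout, the monthly roll-up, and the cutoff date are assumptions for illustration.

```python
# Sketch of rolling summarization: daily detail older than a cutoff date is
# condensed into monthly totals, and the old detail rows become candidates
# for archiving to tape.
from collections import defaultdict

daily_detail = [  # (account_id, "YYYY-MM-DD", amount)
    ("A-1", "2023-11-02", 10.0),
    ("A-1", "2023-11-20", 5.0),
    ("A-1", "2024-05-01", 7.5),
]

def roll_up(detail, cutoff_day):
    """Summarize detail older than cutoff_day by month; keep recent detail."""
    monthly = defaultdict(float)
    recent, archived = [], []
    for account_id, day, amount in detail:
        if day < cutoff_day:
            monthly[(account_id, day[:7])] += amount
            archived.append((account_id, day, amount))   # candidate for tape
        else:
            recent.append((account_id, day, amount))
    summary = [{"account_id": a, "month": m, "total": t}
               for (a, m), t in sorted(monthly.items())]
    return summary, recent, archived

summary, recent, archived = roll_up(daily_detail, cutoff_day="2024-02-01")
print(summary)    # [{'account_id': 'A-1', 'month': '2023-11', 'total': 15.0}]
print(recent)     # detail kept online
```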



Based on all of the above, have you formed a rough framework? Do you know what a data warehouse is, and what kind of table structure counts as fitting one? To tell you the truth, I am still not entirely sure what the data model really is. Is it something like an object in C++, or more like a structure from a data-structures course? What I have learned is that the data warehouse has to be thought about at design time, not only in terms of how to implement it. So figuring all of this out completely is not going to happen any time soon; it comes only through continuous practice and accumulated experience. It is fair to say there is no fully worked-out, copy-and-paste recipe for designing a data warehouse. :) Disappointed? Don't be; this is a process that has to be repeated, and a 50% success rate is already good, so there is no need to worry. :P

Well, let's say we have built a perfect data warehouse (a bit brazen, hee hee) after considering every circumstance. You still have to keep one fact in mind: the data warehouse must contain the data you need, or you will be back patching it up. Then you start doing statistics, extraction, calculation and so on; nothing is impossible, only unimagined!

Imagine you are a bank employee who receives a loan request from a customer. You have to decide whether to grant the loan by assessing the customer's credit, personal assets, and employment status. Behind the scenes, a very complex program does this work, and the data warehouse prepares the corresponding data for the request, so the review is both comprehensive and very fast. At this point you must consider:

1. Repayment of history.

2. Personal property.

3. Financial management.

4. Net Worth.

5. Total income.

6. Total expenses.

7. Other intangible assets.

......

After a complex calculation you get the final result of the review, and much of the data this process needs is exactly what the data warehouse has already sorted out. See how useful the data warehouse is?

Now think about how this data is laid out... Have you noticed that the final result is a synthesis of data drawn from many places? Lots and lots of content, like a big pot of porridge whose ingredients come from different sources. Hee hee, this is in fact an unavoidable phenomenon in a data warehouse, and it is called the star join. These parts have names: the synthesized table in the middle is the "fact table", and the tables around it are the dimension tables. There is one more thing to notice: the fact table carries the primary keys of the dimension tables. It may not have registered yet, but that is how it works.
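A tiny sketch of a star join, with invented dimension and fact tables: the fact rows carry the dimension primary keys, and a query resolves them through the dimensions.

```python
# Sketch of a star join: a central fact table whose rows carry the primary
# keys of the surrounding dimension tables.
customer_dim = {"C-12": {"name": "Ma Lei", "city": "Beijing"}}
product_dim  = {"P-7":  {"name": "camera", "category": "electronics"}}
time_dim     = {"2024-05": {"year": 2024, "month": 5}}

fact_sales = [
    # each fact row references the dimensions by their primary keys
    {"customer_id": "C-12", "product_id": "P-7", "month_id": "2024-05",
     "quantity": 1, "amount": 3200.0},
]

# A typical star query: total amount per city, resolved via the dimension keys.
totals_by_city = {}
for row in fact_sales:
    city = customer_dim[row["customer_id"]]["city"]
    totals_by_city[city] = totals_by_city.get(city, 0.0) + row["amount"]
print(totals_by_city)   # {'Beijing': 3200.0}
```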

This is where the techniques for accessing the data warehouse come in.

Think it over; if you come up with something better, teach me. :)

Now that we understand the major elements of the data warehouse, OK! Let's go on. The following topics dig deeper into design details and management details. They need some careful thought after reading before the author's intention becomes clear, partly, it must be said, because of translation issues.

Let's take a look at the first question:
Granularity of Data Warehouse
Granularity in the data warehouse refers to the level of detail of the data. It also describes a choice: I can keep a great deal of data, or I can keep only what is strictly necessary, and the decision comes down to storage. If the disks were big enough, there would be nothing we could not keep. So the designer's biggest problem is to estimate the maximum and minimum number of rows the tables will hold in a year, which brings in the idea of estimating with upper and lower bounds. (Don't ask me for more; I don't fully understand it either.)

A simple calculation then tells us roughly how big the database will be, and we can adjust our strategy accordingly. If we want to be more careful, we can adopt either a dual-granularity or a single-granularity approach.
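As a sketch of that simple calculation (every number here, including the bytes-per-row figure and the transaction volumes, is a made-up assumption), rough upper and lower bounds for one year might be estimated like this:

```python
# Sketch of the "simple calculation": rough upper and lower bounds on rows
# and space for one year, from guessed business volumes.
def estimate_year(rows_per_day_low, rows_per_day_high, bytes_per_row):
    days = 365
    low_rows, high_rows = rows_per_day_low * days, rows_per_day_high * days
    return {
        "rows_low": low_rows,
        "rows_high": high_rows,
        "gb_low": low_rows * bytes_per_row / 1024**3,
        "gb_high": high_rows * bytes_per_row / 1024**3,
    }

# e.g. 50k-200k transactions per day at roughly 200 bytes of detail each
print(estimate_year(50_000, 200_000, bytes_per_row=200))
# rows: 18,250,000 to 73,000,000; space: roughly 3.4 to 13.6 GB
```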

Dual granularity is the best way to reduce the amount of data, and most companies adopt it. Here is a brief analysis:

Dual granularity means keeping two levels: a lightly summarized level and a level of true detail. Bear in mind that there is no point in lightly summarizing data that is already at a very low level of detail, and, conversely, no use building summaries at too high a level of detail. So the granularity of the data has to be evaluated before you can arrive at the best summarization plan. The funny thing is that it is all estimation, with no guarantee of correctness; there is no way around it, since we are solving an equation whose conditions are unknown and only the desired result is known. What you can do is show your results to the end users and let them judge whether the plan is good or bad. Don't expect 100% approval; 50% is very nice. :)

The book also gives some feedback techniques and an example, on page 90, which you can refer to.

If granularity is about how to build the data warehouse, the next topic is about how to manage it!
Data Warehousing and Technology
There is a long list of management capabilities that I frankly find hard to read, things like "managing by addressing, by retrieval, by data extension, by effective overflow handling...". Management here covers both the ability to handle very large volumes of data and the ability to administer the warehouse itself. Any technology used to build a data warehouse must support these capabilities if it is to meet the efficiency requirements.

You want to be able to manage multiple storage media: main memory, expanded memory, cache, DASD (direct access storage devices such as hard disks), optical disc, tape...

The soul of the data warehouse is flexible, unpredictable access to the data. Not clear? Put simply, it should be able to look across all of the historical data and provide a basis for analysis. If the data warehouse cannot be used conveniently and efficiently through its indexes, then building it has failed. So make good use of secondary indexes, dynamic indexes, temporary indexes, and so on.

Interfaces to a variety of technologies; I don't need to explain this one, you understand it already.

Control over where data is physically stored: as was said at the start, the data warehouse needs a complete set of storage-placement mechanisms, preferably automatic ones.

Parallel storage and management of data: assuming all data is accessed with equal probability, the performance gain is proportional to the number of physical devices the data is spread across.

Management of metadata. Remember this: however fine the house, you cannot get in without the key. Managing the metadata is therefore even more important than managing the data in the warehouse itself. It covers table structures, table attributes, the source data (the system of record), the mapping relationships, the data model description, the extraction log, and the common routines.

A language interface, typically a SQL interface: if you want to write a front-end control program, you can insert, delete, and so on.
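As a minimal sketch of a front-end program working through a SQL interface (sqlite3 stands in for the warehouse DBMS, and the table and columns are invented), insertion and retrieval look like this:

```python
# Sketch of a front-end program talking to the warehouse through SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dw_monthly_balance (account_id TEXT, month TEXT, balance REAL)"
)
conn.execute(
    "INSERT INTO dw_monthly_balance VALUES (?, ?, ?)", ("A-1", "2024-05", 300.0)
)

# The DSS analyst's query side of the same interface.
for row in conn.execute(
    "SELECT month, balance FROM dw_monthly_balance WHERE account_id = ?", ("A-1",)
):
    print(row)          # ('2024-05', 300.0)
conn.close()
```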

Efficient loading of data. I think (what, the teacher was lazy, so I can be lazy too) there is not much to say here: you have to do different things in different environments.

Efficient use of indexes, data compression, compound keys, variable-length data, lock management, fast recovery... I will not say more; you probably know these better than I do.
DBMS types and data warehouses
A multidimensional database management system (commonly known as a "data mart") provides an information-system structure that makes access to the data very flexible. If I have not misread it, the data mart sits on top of the data warehouse as a place to manage and explore the data, so the data warehouse is the data mart's main data source. The difference between the two lies in the volume and granularity of the data: the data warehouse is very large and holds fine-grained data, while the multidimensional DBMS holds far less data at a coarser granularity. This is done on purpose: the warehouse keeps data over a much longer time horizon, and it keeps that data in a concentrated, integrated form.

It offers a number of other capabilities as well, for example:

• It supports dynamic joining of data.

• It supports general-purpose update processing.

• The relationships in its structure are clear.

So is it perfect? In fact, there are still plenty of drawbacks to overcome:

• It cannot hold as much data as a relational database can.

• It does not support general-purpose update techniques.

• Loading takes a long time.

• Its structure is not flexible.

• Dynamic support is still a problem.



These are a few impressions from my reading on the data warehouse, put out here so we can all study them together, haha...

