Steps to build a data warehouse

Last Update:2015-09-02 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Data warehouses are subject-oriented, integrated, non-updatable, and changing over time, which determines that the system design of a data warehouse cannot be designed with the same design method as the traditional OLTP database.
The original requirements of the Data Warehouse system are unclear, and constantly changing and increasing, the developer initially can not understand the user's clear and detailed requirements, the user can provide nothing but the large direction of the demand and part of the demand, but not more accurately foresee the future needs. Therefore, it is more appropriate to use the prototype method to develop the data warehouse, because the idea of prototype is to start from the simple basic frame of constructing system and enrich and perfect the whole system continuously. However, the design and development of data Warehouse is different from the general prototype method, and the Data warehouse design is data-driven. This is because the Data warehouse is developed on the basis of the existing database system, it aims at extracting, synthesizing, integrating and mining the data resources of the existing database effectively, and serving the needs of the senior management decision analysis of the enterprise. However, it is necessary to explain that the development of data Warehouse system is a process of continuous circulation and feedback, which makes the system grow and perfect, which is the main characteristic of the prototype method which is different from the system life cycle method. As a result, throughout the development of the data Warehouse, the decision makers and developers are required to participate and work closely together, requiring a flexible mind, not doing or minimizing ineffective or repetitive tasks.

The design of a data warehouse can be broadly divided into the following steps:

L Conceptual model design;

l Technical preparation work;

l Logical model design;

L Physical model design;

L Data Warehouse generation;

L Data Warehouse operation and maintenance.

Below we have six main design steps as the mainline, which introduces the basic content designed in each design step.

First section conceptual model design
The work to be done with the conceptual model design is:

<1> Defining system boundaries

<2> identify the main subject area and its content

The result of conceptual model design is that a relatively stable conceptual model is established based on the original database. Because the Data warehouse is the data collection which is formed by the integration and reorganization of the data in the original database system, the conceptual model design of the data warehouse must first analyze and understand the original database system, see "What", "How to Organize" and "How to distribute" in the original database system, etc. Then consider how to establish a conceptual model of the Data Warehouse system. On the one hand, through the original database design document and the database relational model in the data dictionary, it is possible to have a complete and clear understanding of the contents of the enterprise's existing database; On the other hand, the conceptual model of the Data Warehouse is set up for the enterprise global, It provides a unified conceptual view of the integration of data from various application-oriented databases.

Conceptual models are designed at a higher level of abstraction, so there is no need to consider the limitations of specific technical conditions when building a conceptual model.

1. Defining the boundaries of the system

Data Warehouse is a database for decision-making analysis, we can not get the detailed and definite requirement at the beginning of Data Warehouse design, but some basic directional requirements are put in front of designers:

What are the types of decisions to make?

What are the issues that policymakers are interested in?

What kind of information do these questions require?

What parts of the original database system do I need to get this information?

In this way, we can draw a current approximate system boundary and focus on the development of the part that is most needed. Therefore, in a sense, the work of defining the boundary of the system can also be regarded as the requirement analysis of Data Warehouse system design, because it reflects the requirement of decision-maker's data analysis with the definition of system boundary.

2. Identify the main subject area

In this step, you determine the subject fields that your system contains, and then you have a more specific description of the content of each topic field, including:

L The public code key of the subject domain;

• links between subject domains;

L fully represent the subject's attribute group.

Section II Technical preparation work
The work of this phase includes: technical assessment, technical environment preparation.

The results of this phase are as follows: Technical evaluation report, software and hardware configuration scheme, system (software and hardware) overall design scheme. The technical requirements for managing data warehouses differ greatly from the technical requirements for managing data and processing in an operational environment, and they are also different in terms of consideration. The reason why we always separate analytical data from operational data in general is to store analytical data separately, that is, storage with Data Warehouse, the difference of technical requirements is an important reason.

1. Technical assessment

The technical evaluation is to determine the Data warehouse performance indicators. In general, the performance indicators that need to be identified in this step include:

L ability to manage large data volumes;

• Ability to perform flexible data access;

• Ability to reorganize data based on data models;

L Transparent data transmission and reception capability;

• Ability to periodically load data in batches;

L can set the completion time of the job management ability.

2. Technical Environment Preparation

Once the model of the architecture of the Data warehouse is generally built, the next step is to determine how we should assemble the architecture model, mainly to determine the requirements of the hardware and software configuration, we mainly consider the relevant issues:

How much data is expected to be processed on the Data warehouse?

How to reduce or mitigate the conflict of competitive access procedures?

How much data is in the Data Warehouse?

How much data traffic is in and out of the warehouse?

Based on these considerations, we can determine the requirements for hardware and software, and at the end of this step the technical preparations should be ready to load the data. These are equipped with:

L Direct Access Equipment (DASD);

L Network;

L manage the operating system of the direct access device (DASD);

L interface to and from the Data Warehouse (mainly data query and analysis tools);

The software that manages the Data warehouse, currently selects the database management system and the related option, the purchase DBMS product cannot satisfy the management Data warehouse need, also should consider oneself or the Software Integrator development related module and so on.

The third section logical model design
The main tasks in this step are:

L ANALYZE the Subject field to determine the current topic to be loaded;

L Determine the level of granularity;

L Determine the data segmentation strategy;

l relationship pattern definition;

L Record System Definition

The result of the logical model design is to define the logical implementation of each topic that is currently being loaded, and to record the relevant content in the metadata of the data warehouse, including:

L appropriate particle size division;

L Reasonable data segmentation strategy;

• appropriate table partitioning;

L Define the appropriate data sources and so on.

1. Analyze Topic fields

In the conceptual model design, we have identified a few basic subject fields, but the design method of data Warehouse is a process of refinement, which is usually done in a topic or a number of topics at a time when designing. Therefore, we must analyze several basic subject domains identified in the conceptual model design step and select the subject domain to be implemented first. The first topic field to consider is that it is large enough to make the subject domain A system that can be applied, and it should be small enough for development and faster implementation. If the chosen subject area is large and complex, we can even develop it for a meaningful subset of it. In each of the feedback process, the subject area of the analysis.

2. Granularity Level Division

One of the important problems to be solved in the logical design of data Warehouse is to decide the granularity of data warehouse, and whether the granularity is suitable or not directly affects the amount of data in the Data warehouse and the type of query. To determine the granularity of the Data Warehouse, you can use the method described in the Granularity Division section to determine whether a single or multiple granularity is used, and the level of granularity divided by estimating the number of rows and the number of DASD required.

3. Determine the Data segmentation strategy

In this step, to choose the appropriate data segmentation criteria, generally consider the following factors: data volume (not the number of records), data analysis and processing of the actual situation, simple and granular classification strategy. The size of data is the main factor to decide whether to divide the data and how to divide it; the requirement of data analysis and processing is a main basis of selecting data segmentation Standard, because data segmentation is closely related to the object of data analysis and processing, and we should also consider that the selected data segmentation standard should be natural, Easy to implement: At the same time, it is also considered that the standard of data segmentation is adaptable to the granularity division level.

4. Relationship Pattern definition

Each topic of the data Warehouse is implemented by multiple tables, which are linked together by the common code keys of the subject, forming a complete theme. In the conceptual model design, we determine the basic theme of the Data Warehouse, and the common Code key, the basic content of each topic is described in this step, we will be selected in the current implementation of the theme of the Pattern division, form a number of tables, and determine the relationship between the tables of the pattern.

Fourth section physical model design
The job of this step is to determine the storage structure of the data, determine the indexing strategy, determine where the data is stored, and determine the storage allocation.

Determining the physical model of a data warehouse implementation requires the designer to do the following:

L to fully understand the selected database management system, especially the storage structure and access methods.

Understanding data environments, frequency of usage, usage, data size, and response time requirements are important criteria for balancing and optimizing time and space efficiency.

L understand the features of external storage devices, such as the chunking principle, block size rules, device I/O characteristics, etc.

1. Determine the storage structure of the data

A database management system often provides a variety of storage structure for designers to choose, different storage structure has different implementation methods, each has its own scope and advantages and disadvantages, designers should choose the appropriate storage structure should weigh three aspects of the main factors: access time, storage space utilization and maintenance costs.

2. Determine the indexing policy

Data Warehouse has a large amount of data, so the data access path needs careful design and selection. Since data in the Data warehouse is not frequently updated, it is possible to design a variety of index structures to improve data access efficiency.

In a data warehouse, designers can consider establishing dedicated, complex indexes on individual data stores for maximum access efficiency, because the data in the Data warehouse is infrequently updated, that is, every data store is stable, so while there is a cost to establishing a dedicated, complex index, But once established, there is little need to maintain the cost of indexing.

3. Determine where the data is stored

As we have said, data on the same topic is not required to be stored on the same media. In physical design, we often classify data in terms of importance, frequency, and response time requirements, and store different classes of data separately in different storage devices. High-importance, frequently accessed, and response-time-critical data is stored on high-speed storage devices such as hard drives, low-frequency access or low-latency data that can be placed on low-speed storage devices such as disks or tapes.

The determination of the location of data storage also takes into account other methods, such as deciding whether to merge tables, whether to set up data sequences for some recurring applications, and whether to store redundant tables or properties that are commonly used and infrequently modified. If you use these techniques, you should write the metadata.

4. Determine storage allocation

Many database management systems provide storage allocation parameters for the designer to physically optimize, such as the size of the block, the size and number of buffers, and so on, which are determined at the physical design time. This is the same as when creating a database system.

Section fifth generation of data warehouses
The work to do in this step is interface programming, data loading.

The result of this step is that the data is already loaded into the data warehouse where it can be used to build the application of the Data Warehouse, the DSS application.

1. Design the interface

Loading data into a data warehouse environment in an operational environment requires an interface between the recording systems in two different environments. At first glance, the establishment and design of this interface, it seems as long as the preparation of a extraction program can be, in fact, in this phase of the work, it is true that the data are extracted, but the extraction is not all the work, this interface should also have the following functions:

• Generate complete data from an application-and operational-oriented environment;

• Time-based conversion of data;

• Aggregation of data;

• Effective scanning of existing recording systems for later appending.

Of course, considering these factors, but also to take into account the physical design of some factors and technical constraints, according to these content, strictly develop specifications, and then according to specifications, interface programming. From the operational environment to the data Warehouse environment of the data interface programming process and the general programming process is not different, it also includes pseudo-code development, coding, compiling, error detection, testing and other steps.

In interface programming, be aware that:

L maintain high efficiency, which is also required by the general programming;

L to save the complete document record;

L be flexible and easy to change;

L to complete and accurate data extraction, transformation and integration from the operational environment to the data Warehouse environment.

2. Data loading

What you do in this step is to run the interface program and load the data into the Data warehouse. The main tasks are:

L determine the order in which the data is loaded;

L Clear invalid or incorrect data;

L Data "aging";

L granularity management of data;

L Data refresh and so on.

Initially, only a subset of the data was used to generate the first topic domain, allowing designers to easily and quickly adjust what had been done, and to commit to the next step as early as possible, namely, the use and maintenance of the Data warehouse. This allows for the fastest return on the economy, as well as the ability to use the end user, identify problems early and propose new requirements, then feed back to designers, who continue to improve and extend the system.

Sixth. Use and maintenance of data Warehouse
The work to be done in this step is to establish a DSS application that uses the data warehouse to understand the requirements, adjust and refine the system, and maintain the data warehouse.

The establishment of the enterprise's systematic environment, not only includes the establishment of the operational and analytical data environment, but also should be included in this data environment to establish a variety of enterprise applications. After the data warehouse is loaded with data, the next step is to use the data in the data warehouse to serve the purpose of decision analysis, that is, to establish DSS applications in the Data Warehouse, on the other hand, based on user usage and feedback, developers further refine the system and manage some of the daily activities of the data Warehouse. , such as refreshing the current details of the data warehouse, transforming obsolete data into historical data, purging data that is no longer in use, and adjusting granularity levels. We refer to this step as the use and maintenance of the Data warehouse.

1. Building DSS Applications

The use of data warehouses, i.e. the development of DSS applications, is fundamentally different from application development in an operational environment, and the salient features of developing DSS applications that differ from the development of online transaction processing applications are:

L DSS application development is based on data;

L DSS application needs can not be clearly understood at the beginning of development;

L DSS Application development is a continuous cycle process, is the development of heuristic.

The application of DSS can be divided into two kinds: routine analysis and heuristic analysis processing. Routine analytical processing refers to the repeated analysis of the processing, it is usually a departmental application, such as departmental statistical analysis, report analysis and so on; While personal-level analysis applications are often random, the business operators are inspired by some kind of information to carry out some of the ad hoc analysis, so we call it heuristic analysis and processing.

The general steps for DSS application development are as follows:

Step l--determine the required data. To meet the requirements of DSS applications, we must identify a possible range of data from the Data warehouse. This is a process of temptation.

Step 2--Program to extract the data. Based on the range of data obtained above, write a decimation program to obtain this data. In order to adapt to the characteristic of the analysis requirement, the extraction program should be universal and easy to modify.

Step 3--Merge the data. If there are multiple sources of data extraction, the extracted data will be merged, refined, so that the data meet the requirements of analysis processing.

Step 4--Analyze the data. On the basis of the preparation of the data based on the analysis processing, and see whether the results meet the original requirements, if not satisfied, then return to step 1, start a new cycle, or prepare the final analysis results report.

Step 5--answer the question. Generate the final analysis results report. -In general, the final results of the analysis are reported after many cycles, because one analysis process is rarely done after a single cycle.

Step 6--, at the end of an analysis process, we decide whether or not to routinely process the analytical processing that has been established above. If the established analytical processing is a repetitive departmental-level DSS application, it is best to routinely make it routine so that the next six-step cycle is not repeated when the same analysis is processed. Moreover, by accumulating this routine and forming a set, we can create a new complex processing by combining these existing processes, or complete a part of a complex process.

2. Understand the requirements, improve and improve the system, maintain the Data Warehouse

Data Warehouse development is a gradual improvement of the development of the prototype method, it requires: to get the system to run as soon as possible to produce benefits, in the system operation or use, to constantly understand the needs, improve the system, constantly consider the new requirements, improve the system.

The job of maintaining the data warehouse is to manage the daily data loading, including refreshing the current details of the data warehouse and transforming outdated data into historical data. Purge data that is no longer in use, manage metadata, and so on, and how to use interfaces to periodically append data from an operational environment to a data warehouse to determine data refresh rates for data warehouses, and so on.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Steps to build a data warehouse

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Steps to build a data warehouse

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support