Implementation Methods and Steps of Data Warehouse Construction


A data warehouse is subject-oriented, integrated, non-updatable (non-volatile), and time-variant. These characteristics determine that the system design of a data warehouse cannot be the same as that of a traditional OLTP database.

The original requirements for a data warehouse system are unclear and keep changing and growing, so developers cannot precisely understand users' specific, detailed needs at the outset. What users can provide is little more than a general direction and some partial requirements, and future requirements cannot be predicted accurately. It is therefore more appropriate to develop a data warehouse with the prototype method, whose idea is to start from a simple basic framework of the system and continuously enrich and improve it. However, data warehouse design and development differ from the usual notion of the prototype method: data warehouse design is data-driven. This is because the data warehouse is built on top of existing database systems; its aim is to effectively extract, integrate, and mine the data resources of those databases in order to serve the decision-making analysis needs of enterprise management. It should still be noted that developing a data warehouse system is a process of continuously growing and improving the system through repeated cycles and feedback, which is also the main feature distinguishing the prototype method from the system life-cycle method. Therefore, the whole process of data warehouse development requires the participation and close collaboration of decision makers and developers from beginning to end, and it requires a flexible approach that does as little ineffective or duplicated work as possible.

The data warehouse design can be divided into the following steps:

Conceptual model design;
Technical preparation;
Logical model design;
Physical model design;
Data warehouse generation;
Data warehouse use and maintenance.
 
Next, taking these six design steps as the main line, we introduce the basic design content of each step.

Section 1 Conceptual Model Design

The tasks of conceptual model design are:

<1> Define the system boundaries;
<2> Determine the major subject domains and their contents.

The result of the conceptual model design is a relatively stable conceptual model built on the basis of the original databases. Because the data warehouse is a data collection formed by integrating and reorganizing the data in the original database systems, the conceptual model design of the data warehouse must start by analyzing and understanding those original systems: what data they contain, how it is organized, and how it is distributed. Only then can we consider how to establish the conceptual model of the data warehouse system. On the one hand, the design documents of the original databases and the relational models in the data dictionary give us a complete and clear picture of the existing database contents. On the other hand, the conceptual model of the data warehouse is created globally for the enterprise; it provides a unified conceptual view for integrating data from the application-oriented databases.
The conceptual model is designed at a high level of abstraction, so you do not need to consider the restrictions of specific technical conditions when creating it.

1. Define system boundaries

A data warehouse is a database oriented toward decision analysis. We cannot obtain detailed and clear requirements at the very beginning of data warehouse design, but some basic directional questions are still placed before the designers:

· What types of decisions are to be made?
· What issues are the decision makers interested in?
· What information is needed to address these issues?
· Which parts of the original database systems' data are required to obtain that information?

In this way we can define an approximate current system boundary and concentrate on the most necessary development. In a sense, therefore, defining the system boundaries can also be seen as the requirements analysis of data warehouse system design: it expresses the data analysis requirements of the decision makers in the form of a system boundary.

2. Determine the major subject domains
 
In this step, you determine the subject domains contained in the system and then clearly describe the content of each subject domain. For each subject domain, the description includes (a small metadata sketch follows the list):

· The public key of the subject domain;
· The relationships between subject domains;
· The attribute groups that fully describe the subject.
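
To make this concrete, the following is a minimal sketch (in Python) of how such a subject-domain description might be recorded as metadata. The subject names, keys, and attribute groups are invented purely for illustration and are not prescribed by the method.

    from dataclasses import dataclass, field

    @dataclass
    class SubjectDomain:
        """Description of one subject domain in the conceptual model."""
        name: str
        public_key: list[str]                    # public key shared by the subject's tables
        attribute_groups: dict[str, list[str]]   # attribute groups that describe the subject
        related_subjects: list[str] = field(default_factory=list)  # links to other subject domains

    # Hypothetical subject domains for a retail enterprise.
    sales = SubjectDomain(
        name="Sales",
        public_key=["customer_id", "product_id", "sale_date"],
        attribute_groups={
            "transaction": ["quantity", "amount", "discount"],
            "channel": ["store_id", "salesperson_id"],
        },
        related_subjects=["Customer", "Product"],
    )
    customer = SubjectDomain(
        name="Customer",
        public_key=["customer_id"],
        attribute_groups={"profile": ["name", "segment", "region"]},
        related_subjects=["Sales"],
    )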

Section 2 Technical Preparation

Work at this stage includes technical evaluation and technical environment preparation.
The results of this phase are a technical evaluation report, a software and hardware configuration scheme, and an overall (software and hardware) system design scheme. The technical requirements for managing data and processing in the data warehouse differ greatly from those in the operational environment, and the two call for different considerations. In general, we always separate analytical data from operational data and store the analytical data centrally, that is, in the data warehouse; the difference in technical requirements is an important reason for this.

1. Technical Evaluation

The technical evaluation is to determine the performance indicators of the data warehouse. In general, the performance indicators that need to be determined in this step include:

· Ability to manage large data volumes;
· Flexible data access capability;
· Ability to reorganize data based on the data model;
· Transparent data sending and receiving capabilities;
· Ability to periodically load data in batches;
· Job management capabilities that can set the completion time.

2. Prepare the technical environment

Once the architectural model of the data warehouse has been established, the next step is to determine how to realize it, mainly by determining the software and hardware configuration requirements. We mainly consider the following issues:

· What is the expected volume of data to be analyzed and processed in the data warehouse?
· How can contention among competing access programs be reduced or avoided?
· How much data will the data warehouse hold?
· What is the volume of data communication into and out of the data warehouse? And so on.

Based on these considerations, we can determine the hardware and software requirements; by the end of this step, the technical environment should be ready for loading data. The configuration includes:

· Direct access storage devices (DASD);
· The network;
· The operating system that manages the direct access storage devices (DASD);
· The data warehouse access interfaces (mainly data query and analysis tools).
 
Currently, data warehouses are generally managed with database management systems and related optional components. If the purchased DBMS products cannot fully meet the needs of the data warehouse, you should also consider developing the related modules yourself or through software integrators.

Section 3 Logical Model Design

The main tasks in this step include:

Analyze the subject domains and determine the subjects to be implemented;
Determine the granularity levels;
Determine the data segmentation policy;
Define the relation schemas;
Define the record systems (data sources).

The logical model design defines the logical implementation of each subject to be loaded and records the relevant content in the data warehouse's metadata, including:

· An appropriate granularity division;
· A reasonable data segmentation policy;
· An appropriate division into tables;
· Definitions of the appropriate data sources.

1. Analyze the subject domains

In the conceptual model design we identified several basic subject domains. However, data warehouse design is a process of gradual refinement: during design, generally one subject, or a few subjects, are completed at a time. Therefore we must analyze the basic subject domains identified in the conceptual model design step and select the subject domain to implement first. When selecting the first subject domain, it should be large enough that it can be built into a usable application system, and small enough to be developed and implemented quickly. If the selected subject domain is large and complex, we can even develop only a meaningful subset of it first. The subject domains must be analyzed again in each feedback cycle.

2. Determine the granularity levels

An important problem to be solved in the logical design of a data warehouse is determining its granularity levels. Whether the granularity levels are appropriate directly affects the data volume in the data warehouse and the types of queries it can serve well. To determine the granularity of the data warehouse, you can use the method described in the section on granularity division: by estimating the number of data rows and the amount of DASD space required, decide whether to adopt a single granularity or multiple granularities, and what the granularity levels should be.
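
As a rough illustration of that estimation (not part of the original method), the sketch below compares the DASD space needed for a single detailed granularity with a dual-granularity scheme; the row counts, record sizes, and space budget are hypothetical.

    # Minimal sketch: estimate row counts and DASD space to choose a granularity scheme.
    BYTES_PER_GB = 1024 ** 3

    def space_gb(rows: int, bytes_per_row: int) -> float:
        """Estimated storage for a table, in gigabytes."""
        return rows * bytes_per_row / BYTES_PER_GB

    # Detailed granularity: one row per sales transaction (hypothetical numbers).
    detail_gb = space_gb(rows=2_000_000_000, bytes_per_row=200)

    # Lightly summarized granularity: one row per customer per month.
    summary_gb = space_gb(rows=1_000_000 * 36, bytes_per_row=120)

    DASD_BUDGET_GB = 50    # hypothetical space budget

    if detail_gb <= DASD_BUDGET_GB:
        plan = "single granularity: keep all detailed data online"
    else:
        plan = "dual granularity: keep summaries online, move old detail to slower storage"

    print(f"detail: {detail_gb:.1f} GB, summary: {summary_gb:.1f} GB -> {plan}")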

3. Determine the data segmentation policy

In this step you select an appropriate data segmentation standard. In general, consider the following factors: the volume of data (rather than the number of records), the actual requirements of data analysis and processing, simplicity and ease of use, and the granularity division policy. The size of the data volume is the main factor in deciding whether and how to segment the data. The requirements of data analysis and processing are a major basis for choosing the segmentation standard, because segmentation is closely related to the objects of analysis and processing. The chosen segmentation criteria should also be natural and easy to implement, and they should be considered together with the granularity levels.
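
As one possible illustration of such a standard (the criteria here are invented, not prescribed by the method), the sketch below routes each record to a segment determined by the year of the transaction and the sales region.

    from collections import defaultdict

    def segment_key(row: dict) -> str:
        """Hypothetical segmentation standard: by year of the sale date and by region."""
        year = row["sale_date"][:4]
        return f"sales_{year}_{row['region'].lower()}"

    def segment(rows: list[dict]) -> dict[str, list[dict]]:
        """Route each row to the physical segment (table or file) it belongs to."""
        segments: dict[str, list[dict]] = defaultdict(list)
        for row in rows:
            segments[segment_key(row)].append(row)
        return segments

    rows = [
        {"sale_date": "2023-05-01", "region": "EAST", "amount": 120.0},
        {"sale_date": "2024-02-17", "region": "WEST", "amount": 80.5},
    ]
    for name, part in segment(rows).items():
        print(name, len(part), "rows")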

4. Relation schema definition

Each subject of a data warehouse is implemented by multiple tables, which are linked through the subject's public key to form a complete subject. In the conceptual model design we determined the basic subjects of the data warehouse and described the public key and basic content of each subject. In this step we divide each subject selected for the current implementation into multiple tables and determine the relation schema of each table.
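
For illustration only, the sketch below records the relation schemas of one subject as a set of table definitions that share the subject's public key; the table and column names are hypothetical.

    # The "Customer" subject implemented as several tables joined through the
    # subject's public key (customer_id). All names are invented for illustration.
    CUSTOMER_SUBJECT = {
        "customer_base": {
            "key": ["customer_id"],
            "columns": ["name", "segment", "region", "effective_date"],
        },
        "customer_activity_month": {      # lightly summarized granularity
            "key": ["customer_id", "month"],
            "columns": ["order_count", "total_amount"],
        },
        "customer_activity_detail": {     # detailed granularity
            "key": ["customer_id", "order_id"],
            "columns": ["order_date", "amount", "channel"],
        },
    }

    def to_ddl(table: str, schema: dict) -> str:
        """Render one relation schema as a CREATE TABLE statement (column types omitted)."""
        cols = ", ".join(schema["key"] + schema["columns"])
        return f"CREATE TABLE {table} ({cols});"

    for table, schema in CUSTOMER_SUBJECT.items():
        print(to_ddl(table, schema))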

Section 4 Physical Model Design

This step determines the data storage structure, the index policy, the data storage locations, and the storage allocation.
To determine the physical model of the data warehouse, the designer needs to:

Have a comprehensive understanding of the selected database management system, especially its storage structures and access methods;
Understand the data environment, including data usage frequency, usage patterns, data volume, and response-time requirements, which are the important bases for balancing and optimizing time and space efficiency;
Understand the characteristics of the external storage devices, such as how blocks are organized, the rules for block sizes, and the I/O characteristics of the devices.

1. Determine the Data Storage Structure

A database management system usually provides a variety of storage structures for designers to choose from. Different storage structures are implemented differently and have different areas of applicability, advantages, and disadvantages. When selecting an appropriate storage structure, the designer should weigh three main factors: access time, storage space utilization, and maintenance cost.

2. Determine the index policy

The data warehouse has a large amount of data, so you need to carefully design and select the data access path. Because the data in the data warehouse is not updated frequently, you can design a variety of index structures to improve data access efficiency.

In a data warehouse, designers can consider building dedicated, elaborate indexes for each data store to achieve the highest access efficiency. Because the data in the data warehouse is seldom updated, each data store is stable; so although building a dedicated, elaborate index has a certain cost, once built it requires almost no maintenance.
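
Because the data is loaded once and then read many times, an index can be precomputed at load time and left untouched afterwards. The sketch below builds a simple in-memory secondary index over a stable batch of rows; the field names are hypothetical, and the dictionary stands in for whatever index structures the chosen DBMS actually provides.

    from collections import defaultdict

    def build_index(rows: list[dict], column: str) -> dict:
        """Precompute a secondary index (value -> row positions) for a stable data set."""
        index: dict = defaultdict(list)
        for pos, row in enumerate(rows):
            index[row[column]].append(pos)
        return index

    rows = [
        {"customer_id": 1, "region": "EAST", "amount": 120.0},
        {"customer_id": 2, "region": "WEST", "amount": 80.5},
        {"customer_id": 3, "region": "EAST", "amount": 40.0},
    ]

    # Built once after loading; no maintenance cost because the rows never change.
    region_index = build_index(rows, "region")
    print([rows[i] for i in region_index["EAST"]])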
 
3. Determine the data storage location

We have said that data of the same subject is not necessarily stored on the same medium. In the physical design we often need to classify data by importance, access frequency, and response-time requirements, and store the different classes of data on different storage devices. Data that is important, frequently accessed, and subject to strict response-time requirements is stored on high-speed storage devices such as hard disks; data with a low access frequency or relaxed response-time requirements can be stored on slower storage devices such as lower-speed disks or tapes.
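
A minimal sketch of such a classification rule is shown below; the thresholds and tier names are invented for illustration.

    def storage_tier(importance: str, accesses_per_day: int, max_response_s: float) -> str:
        """Hypothetical rule for assigning a data set to a class of storage device."""
        if importance == "high" or accesses_per_day > 1000 or max_response_s < 1.0:
            return "high-speed disk"
        if accesses_per_day > 10:
            return "standard disk"
        return "tape or archive storage"

    print(storage_tier("high", 5000, 0.5))   # frequently used current detailed data
    print(storage_tier("low", 1, 3600.0))    # rarely used historical data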

Other techniques should also be considered when determining data storage locations, such as whether to merge tables, whether to build data sequences for certain regular applications, and whether to store frequently used but rarely modified tables or attributes redundantly. If these techniques are used, they must be recorded in the metadata.

4. Determine Storage Allocation

Many database management systems provide storage allocation parameters for physical optimization, such as the block size and the size and number of buffers; these must be determined during physical design, just as when creating a general database system.

Section 5 Data Warehouse Generation

The tasks in this step are interface programming and data loading.

The result of this step is that the data has been loaded into the data warehouse, and data warehouse applications, that is, DSS applications, can then be built on it.

1. Design Interface

To load data from the operational environment into the data warehouse, an interface must be established between the record systems of the two different environments. At first glance it seems that establishing and designing this interface only requires writing an extraction program. Data is indeed extracted in this phase of the work, but extraction is not all there is to it. The interface should also provide the following capabilities:

· Generation of integrated, complete data from the application-oriented operational environment;
· Conversion of the data to the warehouse's time-based form;
· Condensation (aggregation) of the data;
· Efficient scanning of the existing record systems so that data can be appended later.

Of course, while considering these capabilities, we also need to take into account physical design factors and technical constraints. Based on these, we should strictly define the specifications and then program the interfaces according to them. The process of programming the data interface from the operational environment to the data warehouse environment is no different from general programming: it also includes steps such as pseudocode development, coding, compilation, debugging, and testing.

In interface programming, note the following:

· Maintain high efficiency, as required of programming in general;
· Keep complete documentation;
· Keep the programs flexible and easy to modify;
· Extract, convert, and integrate the data completely and accurately from the operational environment into the data warehouse environment.
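
Purely as an illustration of the capabilities and notes above (not the article's own interface), here is a sketch of an extraction routine that reads rows from a hypothetical operational source, keeps only complete records, stamps them with the snapshot date, and condenses them before loading.

    import datetime as dt

    def extract_orders(source_rows: list[dict], snapshot_date: dt.date) -> list[dict]:
        """Hypothetical interface program: extract, convert to time-based form, condense."""
        # 1. Keep only complete records (drop rows missing required fields).
        required = {"customer_id", "amount", "order_date"}
        complete = [r for r in source_rows if required <= r.keys()]

        # 2. Time-based conversion: stamp every record with the snapshot date.
        for r in complete:
            r["snapshot_date"] = snapshot_date.isoformat()

        # 3. Condense: aggregate to one row per customer per snapshot.
        totals: dict = {}
        for r in complete:
            key = (r["customer_id"], r["snapshot_date"])
            acc = totals.setdefault(key, {"customer_id": r["customer_id"],
                                          "snapshot_date": r["snapshot_date"],
                                          "order_count": 0, "total_amount": 0.0})
            acc["order_count"] += 1
            acc["total_amount"] += r["amount"]
        return list(totals.values())

    operational_rows = [
        {"customer_id": 1, "amount": 120.0, "order_date": "2024-02-01"},
        {"customer_id": 1, "amount": 80.0, "order_date": "2024-02-02"},
        {"customer_id": 2, "order_date": "2024-02-02"},   # incomplete record, dropped
    ]
    print(extract_orders(operational_rows, dt.date(2024, 2, 3)))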

2. Data Loading

In this step, the interface program is run to load the data into the data warehouse. The main tasks are:

· Determine the order of data loading;
· Purge invalid or erroneous data;
· Data "aging" (converting outdated current data into historical data);
· Manage the data granularity;
· Refresh the data.

Initially, only a portion of the data is loaded to build the first subject domain, so that the designers can quickly adjust the work done and move on to the next step, the use and maintenance of the data warehouse, as soon as possible. In this way we obtain the fastest economic return, and end users discover problems and raise new requirements early through actual use; these are fed back to the designers, who continue to improve and expand the system.
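
A minimal sketch of one such loading run for a first subject is shown below; the data, the cleaning rule, and the derived summary level are all hypothetical.

    def purge_invalid(rows: list[dict]) -> list[dict]:
        """Drop erroneous records before loading (hypothetical validity rule)."""
        return [r for r in rows if "customer_id" in r and r.get("amount", -1) >= 0]

    def summarize(rows: list[dict]) -> list[dict]:
        """Granularity management: derive the lightly summarized level from the detail."""
        totals: dict = {}
        for r in rows:
            acc = totals.setdefault(r["customer_id"],
                                    {"customer_id": r["customer_id"], "total_amount": 0.0})
            acc["total_amount"] += r["amount"]
        return list(totals.values())

    warehouse = {"sales_detail": [], "sales_summary": []}

    # Load in a fixed order: detailed data first, then the summaries derived from it.
    extracted = [
        {"customer_id": 1, "amount": 120.0},
        {"customer_id": 1, "amount": -5.0},   # erroneous row, purged
        {"customer_id": 2, "amount": 80.0},
    ]
    warehouse["sales_detail"].extend(purge_invalid(extracted))
    warehouse["sales_summary"] = summarize(warehouse["sales_detail"])
    print(warehouse["sales_summary"])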

Section 6 Use and Maintenance of Data Warehouses

The work in this step involves establishing DSS applications, that is, using the data warehouse, as well as understanding requirements, adjusting and improving the system, and maintaining the data warehouse. Establishing an enterprise-wide information environment means building not only the operational and analytical data environments but also the various enterprise applications on top of them. Once the data has been loaded into the data warehouse, the next step is, on the one hand, to use the data in the warehouse for decision analysis, that is, to build DSS applications on the data warehouse; on the other hand, based on new requirements learned from users' feedback, the developers further improve the system and manage the daily activities of the data warehouse, for example refreshing the warehouse's current detailed data, converting outdated data into historical data, purging data that is no longer used, and adjusting the granularity levels. We call this step the use and maintenance of the data warehouse.

1. Establish a DSS Application

Using the data warehouse, that is, developing DSS applications, is essentially different from developing applications in the operational environment. Developing a DSS application differs from developing an online transaction processing application in that:

· DSS application development starts from data;
· The requirements of DSS applications cannot be clearly understood at the initial stage of development;
· DSS application development is a continuous process and a heuristic development process.

DSS applications can be divided into routine analysis and heuristic analysis. Routine analysis refers to analysis processes that are repeated regularly; it is usually a department-level application, such as departmental statistical analysis and report analysis. Personal-level analysis applications, by contrast, are often ad hoc: enterprise decision makers, prompted by some piece of information, carry out impromptu analysis and processing, which is why we call it heuristic analysis and processing.

The general steps for DSS application development are as follows:

Step 1: Determine the data needed. To meet the requirements of the DSS application, determine a range of data in the data warehouse that may be used. This is a trial-and-error process.
Step 2: Program the data extraction. Based on the data range obtained above, write an extraction program to obtain the data. To adapt to changing analysis requirements, the extraction program should be general and easy to modify.
Step 3: Merge the data. If data is extracted from several sources, it must be merged into a form suitable for analysis and processing.
Step 4: Analyze the data. Analyze and process the data prepared in the previous steps and check whether the results meet the original requirements. If not, return to Step 1 and start a new cycle; otherwise, prepare the final analysis report.
Step 5: Answer the question. Produce the final analysis report. In general, the final report is obtained only after many cycles, because an analysis is rarely finished in a single pass.
Step 6: Turn the analysis into routine processing. At the end of an analysis, decide whether to institutionalize it. If the analysis process just built is a department-level DSS application that will be repeated, it is best to save it as a routine, so that the next time the same analysis is needed, the preceding steps do not have to be repeated. Moreover, by accumulating such routine processing into a library, existing routines can be combined to produce a larger, more complex processing task, or to complete part of one.
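
As an illustration of this loop (not a prescribed implementation), the sketch below reduces each step to a stub function, with a hypothetical test for whether the result answers the question; in practice each stub would be a real extraction or analysis program.

    def determine_data_range(iteration: int) -> list[str]:
        """Step 1: choose which warehouse tables may be needed (widened each cycle)."""
        candidates = ["sales_summary", "sales_detail", "customer_base"]
        return candidates[: iteration + 1]

    def extract(tables: list[str]) -> list[dict]:
        """Step 2: stand-in for a general, easy-to-modify extraction program."""
        return [{"table": t, "rows": 100} for t in tables]

    def merge(datasets: list[dict]) -> dict:
        """Step 3: merge the data drawn from several sources."""
        return {"total_rows": sum(d["rows"] for d in datasets)}

    def analyse(merged: dict) -> tuple[bool, str]:
        """Step 4: hypothetical check of whether the result meets the requirement."""
        ok = merged["total_rows"] >= 200
        return ok, f"analysed {merged['total_rows']} rows"

    report = ""
    for iteration in range(5):          # Steps 1-4 repeat until the result is satisfactory
        ok, report = analyse(merge(extract(determine_data_range(iteration))))
        if ok:
            break
    print("Step 5, final report:", report)   # Step 6 would save the process as a routine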

2. Understand requirements, improve the system, and maintain the data warehouse

Data warehouse development follows the step-by-step approach of the prototype method, which requires the system to be put into operation and produce benefits as soon as possible, and then, while the system is running and being used, to continuously understand requirements, take new requirements into account, and improve the system.

Data warehouse maintenance mainly involves managing the daily data loading, including refreshing the warehouse's current detailed data, converting outdated data into historical data, purging data that is no longer used, and managing the metadata. In addition, data is periodically appended to the data warehouse from the operational environment through the interfaces, and the refresh frequency of the data warehouse must be determined.
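
The daily maintenance cycle described above might look roughly like the following sketch; the store layout, retention rule, and purge rule are hypothetical.

    import datetime as dt

    def maintain(warehouse: dict, today: dt.date, keep_days: int = 365) -> dict:
        """One maintenance cycle: age out old current data and purge unused history."""
        cutoff = (today - dt.timedelta(days=keep_days)).isoformat()

        # Convert outdated current detailed data into historical data.
        aged = [r for r in warehouse["current_detail"] if r["snapshot_date"] < cutoff]
        warehouse["current_detail"] = [r for r in warehouse["current_detail"]
                                       if r["snapshot_date"] >= cutoff]
        warehouse["history"].extend(aged)

        # Purge historical data that no application still uses (hypothetical rule).
        warehouse["history"] = [r for r in warehouse["history"] if r.get("referenced", True)]
        return warehouse

    warehouse = {
        "current_detail": [{"snapshot_date": "2022-01-15", "amount": 10.0},
                           {"snapshot_date": "2024-02-01", "amount": 25.0}],
        "history": [],
    }
    maintain(warehouse, today=dt.date(2024, 3, 1))
    print(len(warehouse["current_detail"]), "current rows,", len(warehouse["history"]), "historical rows")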

Source: forestconsult
