3.4 define the object of the Repository: Subject and metadata
Most business data is multidimensional. Therefore, the method used to collect and represent more than three dimensions cannot be used in business database design. A new method must be provided to express multidimensional data. There are currently two popular methods. One is the object-oriented method, which abstracts business data into objects and expresses these objects using object modeling tools such as Rational Rose; another method is to use the information package diagram, which is a simple and efficient method with a high penetration rate in projects.
The information packet graph is actually a good tool for top-down data modeling. The top-down modeling technology is designed from the user's point of view. The user's opinion is obtained through communication with the user, and the user's information needs can be further clarified. The top-down approach considers almost all sources of information and the ways in which these sources affect business activities. It allows designers to develop information packages around a common subject or business field.
The following describes how to create an information package chart through information packaging technology to determine the topic and metadata in the data warehouse.
3.4.1 information packaging technology
1. Basic use of information packaging technology
The information packaging method is a top-down design method, which focuses on one or more topics of an enterprise from the perspective of managers, focus on the multidimensional features of the data involved in the topic. This method is divided into four phases:
(1) Use the top-down method to analyze the multidimensional features of business data, use an information packaging diagram to represent the transfer and ing relationships between dimensions and categories, and establish a conceptual model. Categories are classified by certain standards. For example, products can be classified by different standards, such as color, texture, origin, and place of sale.
(2) filter a large number of indicators and Entity Data of an enterprise and extract available central indicators. Indicators are also called key performance indicators and key business measurement values. They are a way to measure business information in a dimension space. For example, the product income amount, raw material consumption, and replenishment of new employees or equipment running time can all be called indicators.
(3) create a star chart based on the information packaging diagram, analyze the detailed category entities, expand it to a snowflake chart, and establish a logical model.
(4) based on the star chart and snowflake chart, the entity, key mark, non-key mark, data capacity, update frequency, and entity feature are defined according to the defined data standards, design the physical data model.
The information package diagram helps you complete the following tasks:
L define the scope of common topics involved in a business, such as time, customer, geographical location, and product.
Design key business indicators that can be tracked to determine how a business event is run and completed.
L determine how the data is transmitted to the user of the data warehouse.
L determine how users aggregate data and move data hierarchically.
L determines how much data is actually included in a given user analysis or query.
L defines how to access data and what the entry point is. Where do users want to access and how to guide them into the information package.
L estimate the Data Warehouse size.
Determine the data update frequency in a data warehouse.
L determine how information can be packaged to better provide to users.
Figure 3-24 shows a blank information package. Pay attention to the horizontal line shown in the information package diagram. here we need to describe the information package. You can selectively enter a general description and a detailed description or a description of the information packet diagram. The Shadow part refers to the metric indicators under certain dimensions and categories. This part reflects the main task of data analysis. It must be completed with the user when creating an information package diagram.
In the future analysis of the AdventureWorksDW data warehouse, we will mainly analyze the sales situation of Adventure Works Cycles. Based on the analysis of the previous needs, we will combine the four stages of the information packaging method, you can use the following method to create an information package diagram.
(1) obtain the multidimensional feature analysis results of various commercial departments and determine the dimensions that affect sales, the following five dimensions can be extracted: date, region, product, customer age, and customer status.
(2) analyze each dimension and determine the transmission and ing relationships between the dimension and the category. For example, in the AdventureWorks business database, the date has a level of year, quarter, month, or even smaller, regions are generally divided into countries, regions, cities, and specific stores.
(3) determine the indicator system required by the user. Here, relevant sales indicators are determined based on the sales situation, such as actual sales, planned sales, predicted sales, planned deviations, and predicted deviations.
With the above analysis, you can draw a sales analysis information package diagram, as shown in 3-25, other analysis requirements information package diagram can be expressed in a similar way.
(4) This step can be used to construct a star chart based on the information packaging graph, as shown in 3-26. Then, based on the actual situation, connect the detailed category object to the Star Graph to obtain the snowflake model of the enterprise data warehouse. For example, in the AdventureWorks Business Database, products have been hierarchically classified through the table "ProductCategory", "ProductSubcategory", and "Product, attach them to the star chart 3-26 to form the snowflake architecture diagram 3-27.
Note: according to the design conventions, indicator entities, dimension entities, and detailed category entities are represented by rectangles, diamond, and hexagonal entities, respectively.
Through the above technology, we have actually established the concept model and logic model of the data warehouse. The information package diagram shown in 3-25 is completed by end users and technicians, and its data structure is transformed from the objective world to the subjective world. Figure 3-26 is a logical model because it converts information into a relational model based on the information packet diagram. By comparing the final data warehouse architecture (described in section 3.2.2), it is very close to building a complete data warehouse database.
2. Dynamic Information Packaging
The dimensions involved in the information packing diagram and their corresponding categories are fixed in advance. The most direct problem caused by fixed dimensions and categories is that the designed data warehouse not only has poor adaptability to some specific query and analysis operations, in addition, when the query or analysis requirements change, it cannot be adapted. To solve this problem, allow the dimension and category to be changed freely. This is the dynamic information packaging method.
Dynamic information packaging includes two aspects: dynamic combination of dimensions corresponding to the indicator analysis and dynamic combination of categories associated with dimensions. According to the article "Design Method of Data Warehouse Based on Dynamic information packaging" by Li xuemei and others from Nanjing University, seven steps of dynamic information packaging method can be obtained.
(1) Use a top-down approach to discover as many topics as possible through conversations with company leaders and managers, and then find the corresponding indicator entities based on these topics, the most obvious Dimension Entity contained in each indicator entity is further analyzed using the basic information packaging method.
Figure 3-28 and Figure 3-29 are two star charts, namely, sales analysis and customer demographic analysis. The former consists of three dimensional entities: time, region, and product, the latter includes three dimensional entities: time, region, and customer.
(2) considering all topics comprehensively, use the indicator entity matrix to unify and standardize the defined information packages and dimension entities. The unified object matrix shown in Figure 3-30 is used to eliminate the differences and inconsistencies in entity definitions, so as to ensure the consistency of entity definitions in the data warehouse. 'X' of the intersection in the matrix indicates correlation.
(3) for a single indicator entity (Information Package), find all dimension entities related to the indicator entity but belonging to other information packages, sort the information package based on its relevance to obtain an ordered set of all relevant dimension indicators of the indicator entity. It should be noted that, due to the relativity of dimension definitions, a single category in some detail category entities is closely related to the query or analysis of indicator entities. It can also be used as a separate Dimension Entity. For example, the customer details entity includes age groups, gender, income group, occupation, education and marital status, among which age groups, gender, income group and occupation are closely related to sales analysis, therefore, they can be used as different dimensional entities for sales respectively. In this way, we can obtain the Dimension Entity set dim Sales Related to the sales analysis = {period, region, product, age group, gender, income group, occupation }. In this case, we define the relevance of the first three as 1, and the relevance of other dimensional entities as 0.5.
(4) classify each dimension object to identify all feasible categories. Then, the classification conditions of these types are sorted from large to small based on their granularity to obtain an ordered set of category indicators of the dimension object.
(5) create a dynamic dimension for the indicator entity. Dimension entities can be divided into two types. One type refers to the dimension entities that are essential for the analysis of the indicator entities, and the other type can be freely selected as needed, it is called an optional dimension. For example, in DIM sales set, period, region, and product are required dimensions, while others are optional dimensions.
(6) create a dynamic category object corresponding to the dimension object. Different from dimension entities, category entities are optional. Category entities can be determined based on specific conditions.
(7) create a conceptual model (information packaging chart) and a logical model (star chart or snowflake chart) for each indicator in the data warehouse ).
The data warehouse design method for dynamic information packaging uses the dynamic Reorganization technology of dimensions and categories to provide modifiable data storage methods, so that the designed data warehouse has a truly adaptive data structure, it can better meet the enterprise's future query and analysis needs.
3.4.2 understand the topics in the data warehouse
The information packet graph is used to determine the topic and most of the metadata of the data warehouse. This section describes the relationship between a data packet graph and a topic.
1. Concepts of topics
A topic is an abstract concept that combines, classifies, and analyzes data in an enterprise information system at a higher level. Each topic corresponds to a macro analysis field. Logically, it is the analysis object involved in a certain macro analysis field of an enterprise. For example, in the preceding example, "sales analysis" is an analysis field. Therefore, the topic of this data warehouse application is "sales analysis ".
The topic-oriented data organization method is a complete and consistent description of the analysis object data at a higher level, which can portray the enterprise data involved in each analysis object, and data connections. The higher level refers to the application-oriented data organization mode, which means that the data organization mode based on the topic has a higher data abstraction level. In contrast to the characteristics of traditional databases for application-oriented data organization, data in data warehouses is subject-oriented. For example, the subject of a data warehouse of a production enterprise may include product order analysis and goods shipment analysis. However, organizations by application may be financial sub-systems, sales sub-systems, supply sub-systems, human resources sub-systems, and production scheduling sub-systems.
The subject is determined according to the analysis requirements. This is different from organizing data according to data processing or application requirements. For example, in production enterprises, materials are also supplied. In operational database systems, people are concerned about how to conveniently and quickly process the material supply business; during analysis and processing, people should be concerned about the timely supply of materials and the quality of materials through different procurement channels.
Data Warehouse is oriented to the main theme areas of companies that have been defined in data models. Typical theme areas include customers, products, orders, finance, or other transactions or activities.
2. Obtain the topic domain
A topic domain is the border of a topic after analysis. Analyze the topic domain and determine the topic to be loaded to the data warehouse as the first step in information packaging technology. In data warehouse design, a topic or a part of the entire enterprise topic is usually created at a time. Therefore, a topic domain is selected in most data warehouse design processes. The topic domain must be determined by end users and data warehouse designers.
For example, for a type of company management such as adventure works cycle, the topics to be analyzed generally include supplier topics, product topics, customer topics, and warehouse topics. The topic content includes recording the purchase status, sales status, and storage status of supermarket products. The topic content may include information about the products purchased by customers; the topic of a repository includes the storage of items in the repository and the management of the repository, as shown in Figure 3-31.
Determining the theme boundary actually requires further understanding of the business relationship. Therefore, after determining the entire analysis topic, you need to make a preliminary refinement of these theme to obtain the boundary that each topic should have. For the four themes in Figure 3-31 and their business relationships in the enterprise, the boundary 3-32 can be determined.
3. Determine the topic content
Although the topic occupies only the title in the information package diagram, it is the most important part of the information packaging method. After the topic is defined, the logic model in the data warehouse is basically formed. In this case, all attributes and system-related behaviors must be included in the Logical Relationship Mode of the topic. The Data Storage Structure in the data warehouse also needs to be defined in the design phase of the logical model. The required information and attribute groups that fully represent the topic must be added to the data warehouse. Take the company data warehouse such as Adventure Works Cycle as an example. As shown in table 3-7, you can add an Attribute Group on the "product", "sales", and "customer" topics respectively.
Table 3-7 topic details
4. Use of themes
Since the data warehouse design is a spiral development process, at the beginning, there is no need to reflect all themes in the database of the data warehouse, it is necessary to select the most important theme as the touchstone of data warehouse design. Therefore, to use a topic, you must first find the topic domain to be analyzed.
For example, in the conceptual model design of the AdventureWorksDW data warehouse, after analyzing the requirements, we realized that the "commodity" topic is the most basic business object of a sales enterprise, this is also the main area of decision analysis. Therefore, the "sales analysis" subject domain is defined as the first topic to be established. Through the establishment of the "commodities" topic, the operator can have a comprehensive understanding of the overall business situation. The first implementation of the "commodity" topic can meet the initial requirements of enterprise management personnel to establish a data warehouse as soon as possible, so the first selection of "commodity" topic for implementation.
By applying the topic boundary division to the obtained relational model, the original conceptual model can also be formed. This model combines topic domain division with tables in the transaction processing database. For example, in the preceding example, the relationship tables that may be covered by a commodity subject include the commodity table, supply relationship table, purchase relationship table, and warehouse relationship table; A warehouse topic may cover warehouse relationship tables, warehouse tables, warehouse management relationship tables, and administrator tables. Associate the keys and fields of these tables to form the original conceptual model diagram shown in Figure 3-33.
3.4.3 understand metadata in a data warehouse
The information package diagram also contains most of the metadata in the data warehouse. The most common definition of metadata is "data about data ". It is with metadata that the end users of the data warehouse can use the data warehouse as they wish and use the data warehouse to explore various management and decision-making modes. Metadata is the soul of the data warehouse application. It can be said that there is no data warehouse without metadata.
1. Metadata type
Metadata is usually divided into technical metadata (technical metadata) and business metadata (business metadata ).
Technical metadata describes the technical details of a data warehouse. It is used for development, management, and maintenance of data warehouses. It mainly includes the following information.
L description of the data warehouse structure, including the definition of the warehouse mode, view, dimension, hierarchy, and exported data, as well as the location and content of the data mart;
L architecture and model of business systems, data warehouses, and data markets;
L summary algorithms, including measurement and dimension definition algorithms, data granularity, topic fields, aggregation, summary, and pre-defined queries and reports;
L ing from the operating environment to the data warehouse environment, including source data and their content, data segmentation, data extraction, cleaning, conversion rules and data refresh rules and Security (user authorization and access control ).
Business metadata describes data in a data warehouse from a business perspective. It provides a semantic layer between the user and the actual system, this allows business personnel who do not understand computer technology to "read" data in the data warehouse. Business metadata mainly includes the following information: Data Model, Object Name, and attribute name expressed by the user's business terms; Principles for accessing data and data sources; the analysis method and formula and report information provided by the system.
In the information packaging process, you need to use a package diagram to indicate the transfer and ing relationships between dimensions and categories. In fact, this operation creates metadata on the basis of the original business system. The dimensions, categories, and hierarchies are typical technical metadata, and the corresponding terms in the business system are business metadata. For example, the dimensions such as date, region, product, customer age, and customer status extracted from the previous example are as follows, metrics such as actual sales, planned sales, predicted sales, planned deviations, and predicted deviations are all metadata. This data plays an extremely important role in future analysis. These functions are summarized below.
2. Functions of metadata
From the perspective of the type and function of metadata, metadata is actually to solve the problem of who, where and why, and how to use the data warehouse. Specifically, the metadata in the eyes of the Data Warehouse administrator is a complete knowledge base and document that contains all the content and processes in the data warehouse, and in the eyes of end users (that is, data analysts, metadata is the information map of the data warehouse.
To effectively use the data warehouse environment, data analysts often need the help of metadata. Especially when data analysts perform information analysis and processing, they need to view the metadata first. Metadata also involves data ing from an operational environment to a data warehouse environment. When data enters the data warehouse environment from an operating environment, the data must undergo a series of major changes, including data conversion, filtering, aggregation, and structure change. The metadata of the data warehouse must be able to track these changes in a timely manner. When data analysts need to trace data changes from the data warehouse environment to the operational environment, they need to use metadata to track such changes. In addition, because the data in the data warehouse may exist for a long time, the data warehouse may change the data structure. As time passes by to track changes in data structures, metadata is another common usage function.
Metadata describes the data structure, content, chain, index, and other items. In traditional databases, metadata is the description of each object in the database, and data dictionary in the database is a kind of metadata. In a relational database, this description defines databases, tables, columns, views, and other objects. However, in a data warehouse, metadata defines many objects in the data warehouse-tables, columns, queries, business rules, and data transfer within the data warehouse. Metadata is an important component of a data warehouse and an indicator chart of a data warehouse. Metadata plays an important role in the process of data source extraction, data warehouse development, business analysis, data warehouse service, data refinement and reconstruction engineering, figure 3-34 shows the huge impact of metadata in the entire data warehouse development and application process. Therefore, it is of decisive significance to design a metadata with strong descriptive ability and complete content for effective development and management of data warehouses.
The huge role of metadata will be gradually realized in the later analysis of data warehouses. In this section, the concept model of Data Warehouse is established through the information packaging technology. The star structure or snowflake structure obtained through the information packet graph actually creates a logic model for the data warehouse. It can be said that through the analysis of themes and metadata, we should be able to have a deep understanding of the process from the real world to the subjective world (that is, the construction of the conceptual model, the logic model also needs to be further studied from the perspective of facts and dimensions.