Defining objects for warehouses: Themes and metadata
Most business data are multidimensional, so collecting and representing more than three-dimensional data can not completely borrow the method of business database design, there must be a new way to express multidimensional data. At this stage, there are 2 methods, one is the object-oriented approach, that is, to abstract business data into objects, and then use rational rose and other object modeling tools to express these objects, the other way is to use the information package diagram, which is a simple and efficient method, the use of the project is very high prevalence.
Packet diagrams are actually a great tool for top-down data modeling methods. Top-down modeling techniques are designed from the user's point of view. The user's point of view is through the communication with the user, can further clarify the user's information needs. The top-down approach takes into account almost all sources of information and how these sources affect business activities, enabling designers to develop information packages around a common topic or business area.
The following is a detailed description of how information packaging techniques can be created to identify topics and metadata in a data warehouse.
3.4.1 Information Packaging Technology
1. Basic use of information packaging technology
The information packing method is a top-down design method, which focuses on one or several topics of the enterprise from the point of view of managers, and focuses on the multidimensional characteristics of the data involved in the subject. This law is divided into 4 phases:
(1) A top-down method is used to analyze the multidimensional characteristics of business data, and a conceptual model is established by using information package diagram to represent the transfer and mapping relationship between dimension and category. Categories are classified according to a certain standard of a dimension, such as products can be color, texture, origin and land and other different standards classification.
(2) The company's large number of indicators of the Entity data screening, extraction of the central indicators can be used. The indicators, also known as key performance indicators and key business measurement values, are a way to measure business information in dimension space. For example, product income amount, raw material consumption, supplementary new employees or equipment running time can be called indicators.
(3) construct the star graph on the basis of information packing diagram, analyze the detailed category entity, further expand it to snowflake diagram, and establish the logical model.
(4) based on the star and snowflake charts, the physical data model is designed according to the data standard defined by the entity, key, non-key, data capacity, update frequency and entity characteristics.
A packet diagram can help users do the following:
L Define the scope of common topics involved in a business, such as time, customer, location, and product.
L Design key business metrics that can be tracked to determine how a business event is run and completed.
L Determine how the data is passed to the Data Warehouse user.
L Determine how users can aggregate data and move data by hierarchy.
L Decide how much data is actually included in a given user analysis or query.
L defines how the data is accessed and what its entry point is. Where the user wants to visit, and how to boot into the message pack.
L estimate Data Warehouse size.
L Determine how often data is updated in a data warehouse.
L How information is packaged in order to be better available to users.
Figure 3-24 is a blank information package diagram. Note the horizontal line above the packet diagram, where you will write a description of the message packet. You can optionally fill in a summary description and a detailed description or describe what information the package diagram describes. The shadow part is represented in a certain dimension and category of the metric, which is reflected in the data analysis of the main task, in the production of information package diagram needs to be completed with the user.
In the later analysis of ADVENTUREWORKSDW Data Warehouse, mainly to adventure Works Cycles Company's sales situation analysis, according to the analysis of pre-facing demand, combined with the information packaging method of 4 stages, can be established by the following methods of information package map.
(1) To obtain the multi-dimensional analysis results of commercial data, determine the dimensions affecting sales, which can be extracted from the date, region, product, customer age and customer status, such as 5 dimensions.
(2) Analyze each dimension to determine the transfer and mapping between it and the category, such as the year, Quarter, and month, or even smaller, in the AdventureWorks business database, where the region is generally divided into countries, regions, cities, and specific stores.
(3) Determine the user needs of the indicator system, where sales as a factual basis to determine the relevant sales indicators, such as actual sales, planned sales, forecast sales, planning deviations and forecast deviations.
With the above analysis, you can draw a sales analysis of the packet diagram, 3-25, other analysis needs of the packet diagram can be expressed in a similar way.
(4) This step can be constructed on the basis of the information packaging diagram, the star chart, 3-26 shows. Then, according to the actual situation, we can get the snowflake model of enterprise Data Warehouse by connecting the detailed category entity to the star graph. As in the AdventureWorks business database here, the products have been hierarchically categorized by the table "ProductCategory", "ProductSubCategory" and "Product", Hang them in the star diagram in Figure 3-26 to form the snowflake schema shown in Figure 3-27.
Note that, by design convention, the indicator entities, dimension entities, and detail category entities are represented by rectangles, diamonds, and hexagonal shapes, respectively.
Through the above technology, the concept model and logical model of data Warehouse are actually established. The information shown in 3-25 is package diagrams by the end user and the technician, and the data is transformed from the objective world to the subjective world. Figure 3-26 is a logical model because it transforms the information into a relational model based on the packet graph. Comparing the architecture of the final Data Warehouse (described in section 3.2.2), it is close to building a complete data Warehouse database.
2. Dynamic Packaging of information
The dimensions involved in the information packaging diagram and their corresponding categories are fixed in advance. The most straightforward problem with pinning dimensions and categories is that the data warehouse you design not only has poor adaptability to some specific query analysis operations, but is simply not adaptable when the requirements of the query or analysis are changed. The solution to this problem is to allow the dimensions and categories to be freely changed, which is the way information is packaged dynamically.
Information dynamic packaging includes 2 aspects: Dynamic combination of the dimensions corresponding to the indicator analysis and the dynamic combination of the categories associated with the dimension. Referring to the "design method of data warehouse based on information dynamic packaging", Li Xuemei, Nanjing University, can get the 7 step Dafa of Information dynamic packaging method.
(1) using a top-down approach, by talking with the leaders and managers of the enterprise to dig out as many topics as possible, and then based on these topics to find the corresponding indicator entities, further to each indicator entity using the basic information packaging method to analyze the most obvious dimension entities contained therein.
Figures 3-28 and 3-29 respectively are two star graphs of sales analysis and customer demographic analysis, including time, region, and product 3 dimension entities, including time, region, and customer 3 dimension entities.
(2) Comprehensively consider all the topics, using the indicator entity matrix for the unified and standardized processing of defined packets and dimension entities. The uniform entity matrix shown in Figure 3-30 is used to eliminate discrepancies and inconsistencies in the entity definition, thus guaranteeing consistency of entity definitions in the Data warehouse. The ' X ' representation of the intersection in the matrix is related.
(3) for a single indicator entity (packet) to find all the dimension entities that are related to the indicator entity but belong to other packets, and then sort by their relevance to the packet, get an ordered set of all relevant dimension indicators for that indicator entity. It should be noted that, because of the relativity of dimension definitions, it can also be used as a separate dimension entity when a single category in some detail category entities is closely related to the query or analysis of the indicator entity. such as the customer detail entity includes age group, gender, income group, occupation, education and marital status, and where age group, gender, income group and occupation are closely related to sales analysis, they can be used as different dimension entities for sales. So we can get the dimension entity set related to Sales analysis Dim sales ={Period, region, product, age group, gender, income group, occupation}. Here we define the correlation of the first 3 to 1, the correlation of the other dimension entities is 0.5.
(4) For each dimension entity, classify the categories to find all feasible categories. Then the classification criteria of these categories are sorted according to their granularity from large to small, and an ordered set of the category indicators of the dimension entity is obtained.
(5) Create a dynamic dimension of the indicator entity. Dimension entities can be divided into 2 categories, a class refers to the entity of the indicator is an essential dimension of the entity, called the necessary dimension, the other can be freely selected as required, called optional dimension. such as the dim sales collection, the period, region, and product are required dimensions, and the rest are optional dimensions.
(6) Create a dynamic category entity corresponding to the dimension entity. Unlike dimension entities, category entities are set to optional, and category entities can be determined by their own circumstances.
(7) Establish the conceptual model (information package diagram) and logical model (star or snowflake) of each index in the Data warehouse.
The Data Warehouse design method of information dynamic packaging adopts the dimension and category dynamic recombination technology, and provides the data storage method which can be modified, so that the designed data Warehouse has a truly adaptive data structure, and satisfies the needs of enterprise's future query and analysis.
3.4.2 Understanding the topics in the Data Warehouse
The information package diagram actually determines the subject and most of the metadata for the Data warehouse. This section first tells the relationship of the packet diagram and the theme.
1. The concept of a theme
The topic (Subject) is an abstract concept that synthesizes, classifies and analyzes the data in enterprise information systems at a higher level, each of which corresponds to a macroscopic analytical field. In the logical sense, it is the analysis object which is involved in a macro analysis field in the corresponding enterprise. For example, in the example used in the previous packet diagram, "Sales Analysis" is an area of analysis, so the subject of this data Warehouse application is "Sales analytics."
The topic-oriented data organization method is a complete and consistent description of the analytical object data at a higher level, which can describe the enterprise data involved in each analysis object, as well as the connection between the data. The so-called higher level is relative to the application-oriented approach to data organization, refers to the way the data organized by the topic has a higher level of data abstraction. The data in the Data Warehouse is organized by the topic, which corresponds with the characteristic of the traditional database oriented to the application. For example, a production enterprise's data Warehouse organizes topics that may include product ordering analysis and shipment analysis. The organization by application may be financial subsystem, sales subsystem, supply subsystem, human resource subsystem and production scheduling subsystem.
The subject is determined according to the requirements of the analysis. This is different from organizing data according to data processing or application requirements. As in the production enterprise, is also the material supply, in the operational database system, people are concerned about how to more convenient and faster processing of the material supply business, and in the analysis process, people should be concerned about the different procurement channels and materials supply timely, as well as material quality status.
The data warehouse targets the main subject areas of a company that has been defined in the data model. Typical subject areas include customers, products, orders and finances, or some other business or activity. 2. Getting the subject Domain
A subject field is the boundary of a topic that is determined after an analysis of a topic. The first step in information packaging technology is to analyze the topic domain and determine the topic to mount to the Data warehouse. When designing a data warehouse, it is generally a matter of creating a topic or a part of the enterprise's entire topic at a time, so that in most data warehouse design process There is a topic domain selection process. The determination of the subject domain must be done jointly by the end user and the designer of the Data Warehouse.
For example, for adventure Works cycle This type of company management needs to analyze topics that typically include vendor topics, product topics, customer topics, and warehouse topics. Among them, the content of the subject includes recording the purchasing situation of the supermarket goods, the sale of the goods and the storage of the goods; Customer topics include content that customers may purchase, and warehouse topics include storage of goods in warehouses and management of warehouses, as shown in 3-31.
Determining the subject boundary actually requires a further understanding of the business relationship, so after you determine the entire analysis topic, you need to make a preliminary elaboration of these topics to facilitate access to the boundaries that each topic should have. For the 4 topics in Figure 3-31 and their business relationships in the enterprise, it is possible to identify boundary 3-32 as shown.
3. Determine the content of a topic
Although the topic only occupies the title position in the packet diagram, but it is the most important part of the information packaging method, when the topic is defined, the logical model in the Data warehouse is basically formed. At this point, you need to include all of the attributes and system-related behaviors in the topic's logical relationship pattern. The data storage structure in the data warehouse also needs to be defined in the design phase of the logical model, with the need to add the required information and attribute groups that adequately represent the subject. As an example of a company data warehouse such as adventure Works cycle, as shown in table 3-7, you can add attribute groups that further describe the topic on the items, sales, and customers topics.
Table 3-7 Detailed description of the topics
4. Use of themes
Since the design of Data Warehouse is a spiral development process, at the beginning, it is not necessary to embody all the topics in the database of data Warehouse, it is necessary to choose the most important topic as the touchstone of Data Warehouse design. So using the topic first is to find the subject field that needs to be analyzed.
For example, in the conceptual model design of ADVENTUREWORKSDW Data Warehouse, after analyzing the requirement, it is recognized that the subject of "commodity" is not only the basic business object of a sales enterprise, but also the most important area of decision analysis, so the "Sales Analysis" subject field is defined as the topic to be established first. Through the establishment of the "commodity" theme, the operator can have a more comprehensive understanding of the business situation of the whole enterprise. The first implementation of the "product" theme can be as soon as possible to meet the enterprise management personnel to establish the initial requirements of the Data Warehouse, so select the "product" theme to implement.
The original conceptual model can also be formed by applying the division of the topic boundary to the already obtained relational model. This model is a model that combines the partitioning of the subject domain and the tables in the transactional database, for example, in the above example, the relational table that the commodity subject may cover includes a commodity table, a supply relationship table, a Purchase relationship table, and a warehousing relationship table; a warehouse topic may cover relational tables with warehouse relations tables, warehouse tables, warehouse management relationships tables, Staff table. By linking the keys and fields of these tables, the original conceptual model diagram shown in Figure 3-33 can be formed.
3.4.3 Understanding metadata in a data warehouse
The packet diagram also contains most of the metadata in the Data warehouse. The most common definition of metadata is "data about data". It is the metadata that makes it possible for the end users of the data warehouse to use the Data Warehouse at will, and to make use of the data warehouse to explore various management decision models. Metadata is the application soul of data Warehouse, it can be said that there is no data warehouse without meta-data.
1. Types of meta data
Metadata is typically divided into technical metadata (Technical Metadata) and business Metadata.
Technical metadata is data that describes the technical details of the data Warehouse, which is used to develop, manage, and maintain the data warehouse, which mainly contains the following information.
L Description of the data warehouse structure, including the definition of warehouse mode, view, dimension, hierarchy and export data, and the location and content of the data mart;
• Architecture and patterns of business systems, data warehouses, and data marts;
L summarize algorithms, including metric and dimension definition algorithms, data granularity, subject area, aggregation, summary, and predefined queries and reports;
Mapping of the operational environment to the data Warehouse environment, including source data and their content, data segmentation, data extraction, cleanup, transformation rules, and data refresh rules and security (user authorization and access control).
Business metadata describes the data in the data warehouse from a business perspective, which provides a semantic layer between the user and the actual system, making it possible for business people who do not understand computer technology to "read" data in the Data Warehouse. Business metadata mainly includes the following information: The data model, object name and attribute name expressed by the consumer's business terms, the principles and sources of data access, the analysis methods provided by the system, and the information of the formulas and reports.
In the process of information packaging, it is necessary to use package diagrams to represent the dimension and category as well as the transfer and mapping between them, in fact, the operation is to create the metadata based on the original business system. The dimension, category and hierarchical relationship are typical technical metadata, while the corresponding terminology in business system belongs to business metadata. For example, the date, region, product, customer age, and customer status are extracted from the previous example, and metrics such as actual sales, planned sales, forecast sales, plan deviations, and forecast deviations are all metadata. These data have played a very important role in the future analysis. These effects are summarized below.
2. The role of Meta data
In terms of the type and role of metadata, metadata is really about who is going to solve the problem of WHO and where and how to use the Data warehouse. To materialize, metadata in the eyes of the Data Warehouse administrator is a complete knowledge base and document containing all the content and processes in the data warehouse, and in the eyes of the end user (i.e., the data analyst), the metadata is the information map of the Data Warehouse.
Data analysts often need the help of metadata in order to be able to use the data Warehouse environment effectively. Especially when data analysts are dealing with information analysis, they first need to look at the metadata. Metadata also involves mapping data from an operational environment to a data warehouse environment. When the data from the operational environment into the data Warehouse environment, the data to undergo a series of major changes, including the transformation of data, filtering, aggregation and structural changes and other processes. The metadata of the data warehouse should be able to track these changes in a timely manner, and when the data analyst needs to trace the changes in the data from the Data Warehouse environment to the operational environment, the metadata will be used to track the change. Also, because data in the data warehouse can exist for a long time, data warehouses can often change the structure of the data. Tracking the changes in data structures over time is another common use of metadata.
Metadata describes the structure, content, chain, and index of the data. In the traditional database, metadata is the description of each object in the database, and the data dictionary in the database is a kind of meta data. In a relational database, this description is the definition of databases, tables, columns, views, and other objects, but in a data warehouse, metadata defines many objects in the Data Warehouse-tables, columns, queries, business rules, and data transfer within the Data Warehouse. Metadata is an important component of data warehouse and an indicator diagram of Data Warehouse. Metadata in the data source extraction, data Warehouse development, business analysis, data warehousing services and data refinement and reconstruction of the process are important, in figure 3-34 can see the metadata in the entire data warehouse development and application process of great impact. Therefore, designing a descriptive and well-documented meta-data is crucial to the effective development and management of data warehouses.
Metadata has a huge role to play in the post-facing Data Warehouse analysis gradually realized. This section actually establishes the conceptual model of the Data warehouse through the information packing technology, and the star structure or snowflake structure obtained by the information packet graph actually establishes the logical model for the Data warehouse. It can be said that through the analysis of the subject and meta-data, we should be able to have a deep understanding of the process from the real world to the subjective world (that is, the construction of the conceptual model), while the logical model needs to be further studied from the perspective of fact and dimension.
Design of Data Warehouse theme and meta-data