Basic Bi concepts (full)

Source: Internet
Author: User

I. multidimensional data models and related concepts


Data models generally have two layers: The concept layer (logical layer) and the physical layer. The Logical Data Model abstracts the internal laws of the real world from the conceptual perspective, such as business processes and data architecture. The physical data model focuses on the specific implementation in a specific environment, such as efficiency and security.

Multi-dimensional data model is a logical concept. It mainly solves the problem of fast query and multi-angle display of a large amount of data, so as to obtain information and knowledge conducive to management decision-making. The application fields of multidimensional data models mainly include data warehouse, OLAP and data mining. Among them, multidimensional structure is the core of OLAP.

By introducing dimensions, dimension layers, and measurements, a multidimensional data model regards information as a cube.

 

1. Cube: Describe an object with three or more dimensions. Each dimension is perpendicular to each other. The measurement value of data occurs at the intersection of dimensions. Each part of the data space has the same dimension attribute.

2. dimension: it is a specific angle for people to observe data. It is a type of attribute in consideration of the problem. The set of attributes forms a dimension (such as time dimension and organization dimension ).

3. dimension hierarchy: the same dimension can also have different descriptions (such as year, quarter, month, day, and date ).

4. Dimension attributes: A value of a dimension, which is the description of the position of a data item in a dimension (for example, "a month or a day of a year" is the description of the position in the time dimension ).

5. Measurement: a cell in a cube used to store data.

OLAP basic multidimensional analysis operations include drilling (roll up, drill down), slice, dice, and rotation.

Drilling includes drill-down and drill-up (roll-up) operations. The drill depth corresponds to the level of the dimension. The volume-up operation aggregates data cubes through the dimension conventions. The drill-down operation is the inverse operation of the volume-up operation, from less detailed data to more detailed data.

Slice and slice are the distribution of measurement data on the remaining dimension after a set value is selected on some dimensions. In the multi-dimensional data structure, you can obtain the required data by two-dimensional and two-dimensional chunking, such as cutting and slicing in the three-dimensional cube of "organization, product, and time, you can obtain the sales information of various cities and products.

Rotation (rotation axis) is to change the direction of the dimension, that is, to reschedule the placement of the dimension in the table (such as row-column swaps), and to obtain data from different angles through rotation.


Ii. physical implementation of Multidimensional Data Models

There are multiple ways to implement OLAP multi-dimensional data models, including multi-dimensional databases using arrays, relational databases, and the combination of the two. They are usually called molap, ROLAP, and holap. However, molap is easy to misunderstand. After all, according to the multidimensional concept of OLAP, ROLAP is also a way to organize multidimensional data.

1. Multidimensional online analysis and processing (multidimensional database management system)

Multi-dimensional online analysis is handled in strict accordance with the definition of kudd. You can create a multi-dimensional database to store the data of the online analysis system. It takes the multi-dimensional data organization as the core, that is, multidimensional Online Analytical Processing uses multidimensional arrays to store data.

When using a multi-dimensional database to store OLAP data, you do not need to convert the dimensions, layers, and cubes in the multi-dimensional data model into other physical models, because the multi-dimensional array (matrix) it can well reflect the characteristics of multidimensional data models.

For Figure 1, a three-dimensional array matrix (7, 6, 3) can be defined to reflect the dimension, attribute, and dimension measurement of the cube. The number of dimensions in the array corresponds to the number of dimensions in the cube. Each dimension value in the array corresponds to the number of attributes in each dimension in the cube, and the 126 intersections in the array correspond to cells in the cube to store data.

The advantage of using arrays to implement multi-dimensional data models lies in the rapid access to data, but it also brings about storage space redundancy, that is, the sparse matrix problem, which leads to a great demand for storage space. For example, when an array is used to define a cube structure defined in Figure 2, there may be 10446.3 billion values. But in fact, not every day, every business organization in different regions and different special merchants will generate transactions with different currencies and different types of cards, compared with the relational database management system, only when a transaction does occur will a record be left in the corresponding table.


 

To solve the problem of sparse matrix, some products have proposed sparse and dense strategies. Index blocks are generated by sparse dimensions, and data blocks are formed by density dimensions. An index block is created only when a combination of sparse dimensions occurs for the first time in a transaction event, and then a data block is created.

Figure 3 shows the relationship between data blocks and index blocks. The introduction of sparse dimension and density dimension reduces the storage redundancy of cubes to a certain extent. In addition, data compression technology can reduce the storage space of data blocks.

 

2. OLAP (relational database management system)


ROLAP uses relational databases as the core to represent and store multidimensional data in a relational structure. It divides multidimensional databases into two types of tables: fact tables, data and dimension keywords are stored. dimension tables use at least one table for each dimension to store dimension descriptions such as dimension levels and member categories. A dimension table is associated with a fact table by the primary keyword and the external keyword to form a "star mode ". For complex hierarchical dimensions, to avoid occupying too much storage space for redundant data, you can use multiple table descriptions. This star mode extension is called the snowflake mode ". This multi-dimensional data presentation method allows users to understand the data in a simple way, increase query efficiency, and have less requirements on massive data storage space.

 

Fact tables have the following features: ① a large number of data columns can be stored in TBT; ② It is mainly numerical information with only a few texts or multimedia information; ③ There are external keywords connected to the dimension table; ④ static data and clustered data.

The information in the dimension table describes the fact table, such as product features, sales time, and customer account. A dimension table divides complex descriptions into several small parts, such as the sales volume at a certain point in time. This reduces the number of scans on fact tables and optimizes queries. It has the following features: ① the number of records is small, and there may be only thousands or tens of thousands of records; ② most of them are text materials; ③ information has a hierarchical structure; ④ only one primary key (primary key or dimension key); ⑤ information can be modified.

The snowflake architecture is a deformation of the star architecture. It normalizes the dimension tables under the StAR architecture, so that it can provide richer information and make information processing more flexible.

3. Hybrid Online Analytical Processing

Hybrid Online Analytical Processing uses multidimensional Online Analytical Processing Technology to store upper-layer summary data and relational Online Analytical Processing to process stored detailed data. That is, the lower layer is relational and the higher layer is multidimensional matrix. This method provides better flexibility.

There are other ways to implement OLAP, such as providing a dedicated SQL Server and providing special support for SQL queries in some storage modes (such as star and snowflake.

Iii. Comparison and Selection of storage modes

The advantage of multidimensional online analysis processing is not only that it can clearly express the multidimensional concept, but more importantly, it has an extremely high overall speed. In the relational database management system, if you want to obtain the total sales volume of a region, you can only retrieve records one by one, find the records that meet the conditions, and then add the data. In multidimensional databases, data can be accumulated directly by row or column, and the statistical speed far exceeds that of the relational database management system. The more records in the database, the more obvious the effect. However, with the increase of dimension and dimension members, the storage space of multidimensional online analysis may burst.

There is no size limit on the storage space for OLAP. The existing relational database technology can be used. You can use SQL to store detailed data and summary data. Existing relational databases have made a lot of Optimizations to OLAP, including parallel storage, parallel query, parallel data management, cost-based query optimization, bitmap indexes, and SQL OLAP extensions, this greatly improves the access efficiency of OLAP. In comparison, the online link analysis and processing technology has greater scalability.

Bi, Bi software, Bi tools

Basic Bi concepts (full)

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.