"Reading notes-data mining concepts and techniques" data cube technology


Basic concepts:

    • Base cell: a cell of the base cuboid
    • Aggregate cell: a cell of a non-base cuboid
    • Iceberg cube: a partially materialized cube
    • Minimum support threshold (min_sup): the threshold that controls partial materialization, i.e., only cells passing the threshold are materialized

∵ An iceberg cube can still contain a large number of cells to compute, many of them redundant (a cell and its descendants may carry exactly the same measure)

∴ Introduce the closed cube (closed coverage): a cell is closed if it has no descendant with the same measure value

Method 2: compute only a cube shell, i.e., materialize only the cuboids that involve a small number of dimensions

    • General strategies for cube computation: four optimization techniques

1. Sorting, hashing, and grouping

2. Simultaneous aggregation and caching of intermediate results

3. When there are multiple child cuboids to aggregate from, use the smallest previously computed one

4. Apriori pruning: if a cell does not satisfy the iceberg condition, none of its descendants can, so the branch is pruned in advance
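
As a minimal illustration of technique 3, the Python sketch below (toy data and a hypothetical `rollup` helper, not the book's code) materializes cuboid (A) from the smaller of two previously computed child cuboids:

```python
from collections import defaultdict

def rollup(cells, child_dims, target_dims):
    """Aggregate a cuboid from a materialized child cuboid by summing
    out the dimensions that are not kept in the target."""
    keep = [child_dims.index(d) for d in target_dims]
    out = defaultdict(int)
    for key, measure in cells.items():
        out[tuple(key[i] for i in keep)] += measure
    return dict(out)

# Two materialized child cuboids of (A,), built from the same facts:
# B has 1 distinct value and C has 10, so (A, C) is the larger child.
ab = {("a1", "b1"): 10, ("a2", "b1"): 10}                        # 2 cells
ac = {(a, f"c{i}"): 1 for a in ("a1", "a2") for i in range(10)}  # 20 cells

# Technique 3: aggregate from the smallest previously computed child,
# since fewer cells have to be scanned.
children = {("A", "B"): ab, ("A", "C"): ac}
dims, cells = min(children.items(), key=lambda kv: len(kv[1]))
print(dims)                              # ('A', 'B') -- the smaller child
print(rollup(cells, list(dims), ["A"]))  # {('a1',): 10, ('a2',): 10}
```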

—————————————————————————————————————————————————————————————————————————

Computation methods for data cubes

    • Multiway array aggregation

Example: computing the 2-D planes (AB, AC, BC) of a 3-D array.

Dimension cardinalities: A = 40, B = 400, C = 4000; each dimension is split into four partitions, giving 4 × 4 × 4 = 64 chunks.

1. Scan order: chunks 1 through 64 in sequence

2. Alternative scan order: chunks 1, 17, 33, 49, 5, 21, 37, 53, ...

The chosen scan order determines how much of each 2-D plane must be held in memory at once.
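
For scan order 1 through 64, assuming (as in the book's example) that each dimension is split into four equal partitions (of 10, 100, and 1000 values for A, B, and C), the minimum memory must hold the whole AB plane, one chunk row of the AC plane, and one chunk of the BC plane:

$$
40 \times 400 + 40 \times 1000 + 100 \times 1000 = 16{,}000 + 40{,}000 + 100{,}000 = 156{,}000 \ \text{memory units}
$$

Under this order the largest plane (BC) needs only one chunk in memory at a time while the smallest plane (AB) is kept whole; a poor scan order would invert this relationship.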

Characteristics:

1. Array cells are addressed directly by index, so no key-based lookup is needed

2. The table is first converted to an array, the cube is computed over the array, and the results are converted back to tables; the conversions do not make the method slow

3. Likely effective only for cubes with a relatively small number of dimensions, because the number of cuboids to compute grows exponentially with the number of dimensions
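
A minimal NumPy sketch of the direct-addressing idea on toy data (the chunking that makes real multiway aggregation memory-efficient is omitted): the full cuboid lattice is obtained by summing out axes of a dense array.

```python
import numpy as np
from itertools import combinations

# Dense 3-D array for dimensions (A, B, C); each cell holds the measure.
# Direct addressing: a cell is reached by its index, with no key lookup.
rng = np.random.default_rng(0)
cube = rng.integers(0, 5, size=(4, 6, 8))   # |A| = 4, |B| = 6, |C| = 8

dims = ("A", "B", "C")
cuboids = {dims: cube}                       # the base cuboid
for k in range(len(dims)):                   # every aggregate cuboid
    for keep in combinations(range(len(dims)), k):
        drop = tuple(i for i in range(len(dims)) if i not in keep)
        name = tuple(dims[i] for i in keep)
        # Each sum here is a separate pass; the real algorithm visits
        # each chunk once and updates several cuboids simultaneously.
        cuboids[name] = cube.sum(axis=drop)

print(cuboids[("A", "B")].shape)   # (4, 6): the AB plane
print(int(cuboids[()]))            # apex cuboid: grand total
```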

    • BUC (bottom-up construction): computes sparse iceberg cubes, starting from the apex cuboid
Main idea
    • The aggregate measure of the whole data set (the apex cuboid) is computed first; the data is then partitioned along each dimension, the iceberg condition is checked on each partition, branches that fail the condition are pruned, and the remaining dimensions are explored recursively (a minimal sketch follows the characteristics below).
    • Computation process: (the book's BUC processing-tree figure is omitted here)

Characteristics:

1. Adopts a divide-and-conquer strategy; the advantage is that partitioning costs are shared, reducing unnecessary computation;

2. Performance is sensitive to the dimension ordering and to skewed data; dimensions should be processed in decreasing order of cardinality (optimization: sorting, hashing, grouping);

3. Unlike multiway aggregation, it cannot share aggregation computation between parent and child cuboids;
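
A minimal BUC sketch in Python, using a toy count-based iceberg condition (this is a simplification, not the optimized published implementation):

```python
def buc(tuples, dims, min_sup, cell=(), out=None):
    """Bottom-up construction of a count-based iceberg cube.

    tuples  : list of tuples of dimension values
    dims    : indices of the dimensions still available for partitioning
    min_sup : iceberg condition -- keep cells with COUNT >= min_sup
    cell    : the (dim_index, value) pairs fixed so far
    """
    if out is None:
        out = {}
    if len(tuples) < min_sup:        # Apriori pruning: no descendant of
        return out                   # this cell can reach min_sup either
    out[cell] = len(tuples)          # materialize the aggregate cell
    for j, d in enumerate(dims):     # partition on each remaining dim
        parts = {}
        for t in tuples:
            parts.setdefault(t[d], []).append(t)
        for v, part in parts.items():
            buc(part, dims[j + 1:], min_sup, cell + ((d, v),), out)
    return out

data = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a1", "b2", "c1"),
        ("a2", "b1", "c1")]
cube = buc(data, dims=(0, 1, 2), min_sup=2)
# e.g. cell (A=a1, B=b1) survives with count 2; the a2 branch is pruned
print(cube)
```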


    • Star-Cubing: computing iceberg cubes using a dynamic star-tree structure

Note: the cardinality of a dimension is the number of distinct values of that attribute

Minimum support min_sup (threshold): the minimum number of times a value must appear

Concepts

Star node: an attribute value p whose single-dimension aggregate does not satisfy the iceberg condition; such values can be pruned (collapsed into the star "*")

Main idea

    • Integrates top-down and bottom-up computation, combining the simultaneous aggregation of multiway array aggregation with the Apriori pruning used in BUC (a star-node sketch follows this list)
    • Uses a star-tree data structure for storage; the core idea is the shared dimension: if the aggregate value on a shared dimension does not satisfy the iceberg condition, then no cell descending along that shared dimension can satisfy it either
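
A minimal sketch of star-node reduction in Python (toy code; the actual algorithm goes on to build a compressed star-tree from the reduced table):

```python
from collections import Counter

def star_reduce(tuples, min_sup):
    """Replace, per dimension, every value whose 1-D aggregate count is
    below min_sup with the star '*'. Such a value can never occur in a
    cell satisfying the iceberg condition, so collapsing it shrinks the
    tree Star-Cubing builds without losing any iceberg cell."""
    n_dims = len(tuples[0])
    counts = [Counter(t[d] for t in tuples) for d in range(n_dims)]
    return [tuple(v if counts[d][v] >= min_sup else "*"
                  for d, v in enumerate(t))
            for t in tuples]

data = [("a1", "b1", "c3"), ("a1", "b1", "c4"),
        ("a1", "b2", "c5"), ("a2", "b3", "c6")]
print(star_reduce(data, min_sup=2))
# [('a1', 'b1', '*'), ('a1', 'b1', '*'), ('a1', '*', '*'), ('*', '*', '*')]
```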

Precomputing shell fragments for fast high-dimensional OLAP

A question:

Why are we interested in precomputing data cubes?

Because the data cube enables fast OLAP in a multidimensional data space.

Although the iceberg cube lets us obtain results in less time, it is not the final solution.

So one possible solution is to compute only a very thin cube shell.

But the cube shell has two drawbacks:

    1. It does not support high-dimensional OLAP (queries involving many dimensions)
    2. It does not support drilling down along dimensions that were not materialized

So we compute only parts, or fragments, of the cube shell.

The shell fragment approach involves two algorithms: one computes the shell fragment cubes, and the other processes queries using the cube fragments. The approach can handle databases of very high dimensionality and can quickly compute small local cubes online. It exploits the inverted index structure that is popular in information retrieval and Web-based information systems; a minimal sketch of that idea follows.
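
A minimal Python sketch of that inverted-index idea (a hypothetical toy structure; the full shell-fragment algorithm additionally precomputes local cuboids over small groups of dimensions):

```python
from collections import defaultdict

# Tiny fact table: tuple ID (TID) -> (A, B, C) dimension values.
facts = {1: ("a1", "b1", "c1"), 2: ("a1", "b2", "c1"),
         3: ("a2", "b1", "c2"), 4: ("a1", "b1", "c2")}

# Inverted index: for every dimension, value -> set of TIDs containing it.
index = defaultdict(set)
for tid, values in facts.items():
    for dim, value in enumerate(values):
        index[(dim, value)].add(tid)

def point_query(*bindings):
    """Answer COUNT for a point query given (dim, value) bindings by
    intersecting TID lists online -- no cuboid is fully materialized."""
    tids = set(facts)
    for binding in bindings:
        tids &= index[binding]
    return len(tids)

print(point_query((0, "a1"), (1, "b1")))   # 2 tuples: TIDs 1 and 4
```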

———————————————————————————————————————————————————————————————————————

Processing advanced queries using data cube technology (http://blog.csdn.net/mamianskyma/article/details/15494471)

The basic data cube has been extended to many complex data types and new applications. For example: the spatial data cube for the design and implementation of geographic data warehouses; the multimedia cube for multidimensional analysis of multimedia data; the RFID cube for compression and multidimensional analysis of radio-frequency identification (RFID) data; and the text cube and topic cube, which apply the vector-space model and generative language models, respectively, to multidimensional text databases (containing both structured attributes and narrative text attributes).

    • Sampling cubes: OLAP-based mining on sample data
      • When collecting data, we often capture only a subset of the data we would like to collect; the result is called sample data.
      • Applying traditional OLAP tools to sample data runs into two challenges. First, sample data are often too sparse in the multidimensional sense: when the user drills down, it is easy to reach a cell with few or no samples, and inference from such a small sample can be misleading, since a single outlier or a small bias in the sample can badly distort the answer. Second, with sample data, statistical methods should be used to provide a reliability measure (such as a confidence interval) indicating the quality of a query answer with respect to the population; traditional OLAP is not equipped with such tools.
      • A sampling cube is a data cube structure that stores the sample data and their multidimensional aggregates, and supports OLAP on sample data. It computes confidence intervals as a quality measure for multidimensional queries. A confidence interval indicates the reliability of an estimate: for example, if the mean age of the audience in the sample data is 35, how confident can we be that 35 is also close to the mean age of the population? We therefore need a way to bound our estimate and indicate the general magnitude of the error. A confidence interval is a range of values that covers the true population value with a given high probability, and it is always qualified by a confidence level. Example: "with 95% confidence, the true mean will not deviate by more than +/- two standard errors"; here the confidence level is 95%. The computation of confidence intervals is given on page 142 of "Data Mining: Concepts and Techniques"; a minimal sketch appears after this list.
      • If the confidence interval is large, reliability becomes an issue. Two factors affect the size of the confidence interval: the variance of the sample data and the sample size. First, a large variance within a cell may indicate that the query cell is a poor choice, and a better option may be to drill down to more detailed cells beneath it. Second, a small sample can produce a large confidence interval. The ideal cure for a small sample is to get more data; there is usually plenty of data elsewhere in the cube that does not exactly match the query cell, so we can consider data from "nearby" cells. There are two ways to use such data to boost the reliability of query answers: 1) intracuboid query expansion considers nearby cells within the same cuboid; 2) intercuboid query expansion considers more general versions of the query cell (from parent cuboids).
        • Intracuboid query expansion: enlarge the sample by including nearby cells in the same cuboid as the query cell; the enlarged sample should increase the confidence of the answer without changing the semantics of the query. 1) Which dimensions should be expanded: those uncorrelated or only weakly correlated with the measure (the value to be predicted). To measure the correlation between a dimension and the cube value accurately, the Pearson correlation coefficient is commonly used for numeric data and the chi-square (χ²) correlation test for nominal data, though other measures such as covariance may also be used. 2) Having chosen the dimensions to expand, which values of those dimensions to use: choose values that are semantically similar, to minimize the risk of changing the final result.
        • Intercuboid query expansion: expand by examining more general cells in parent cuboids.
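
A minimal sketch of the confidence-interval computation mentioned above, using the normal approximation with z = 1.96 (the book's page-142 formulas use a t-statistic for small samples, so treat this as an assumption-laden simplification):

```python
import math

def confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the population mean,
    using the normal approximation: mean +/- z * s / sqrt(n)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)
    return mean - half_width, mean + half_width

# Ages of viewers in one sample-data cell; the small n widens the
# interval, which is what intracuboid/intercuboid expansion counteracts.
ages = [31, 35, 29, 41, 38, 36]
print(confidence_interval(ages))
```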
    • Ranking cubes: efficient computation of top-k queries
      • The data cube facilitates not only online analysis of multidimensional queries but also search and data mining. A top-k query (ranking query) returns only the best k results, chosen according to a user-specified preference, rather than a large set of undifferentiated results; the results come back in ranked order, best first. Typically, the user-specified preference has two parts: a selection condition and a ranking function. Top-k queries arise in many applications, such as searching Web databases, k-nearest-neighbor search with approximate matching, and similarity queries over multimedia databases.
      • OLAP relies on offline precomputation so that multidimensional analysis can be performed online, but fully materializing results for every ad hoc ranking function is infeasible. A natural trade-off is a semi-offline materialization, semi-online computation model.
      • The general principle of the ranking cube is to materialize a cube over the selection attributes; interval-based partitioning over the ranking dimensions lets the ranking cube support ad hoc user queries efficiently and flexibly (a minimal sketch follows this list).
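
A minimal Python sketch of the semi-offline/semi-online idea (toy interval partitioning; in the real ranking cube these partitions live inside a cube materialized over the selection attributes):

```python
from bisect import bisect_right

# Offline: partition tuples into rank intervals over the ranking
# attribute (here: price, ascending -- "best" means cheapest).
boundaries = [100, 200, 300]                 # interval upper bounds
tuples = [("t1", "red", 120), ("t2", "blue", 80), ("t3", "red", 250),
          ("t4", "red", 90),  ("t5", "blue", 310), ("t6", "red", 180)]
buckets = {}
for t in tuples:
    buckets.setdefault(bisect_right(boundaries, t[2]), []).append(t)

def top_k(selection, k):
    """Online: scan buckets from best rank interval to worst and stop
    once k tuples satisfying the selection condition are found."""
    result = []
    for b in sorted(buckets):                # best interval first
        hits = [t for t in buckets[b] if selection(t)]
        result.extend(sorted(hits, key=lambda t: t[2]))
        if len(result) >= k:
            return result[:k]
    return result

print(top_k(lambda t: t[1] == "red", k=2))
# [('t4', 'red', 90), ('t1', 'red', 120)]
```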
  1. Multidimensional data analysis in cube space
    • Prediction cubes: prediction mining in cube space
      • Multidimensional data mining discovers knowledge while varying dimension combinations and granularities; it is also known as exploratory multidimensional data mining or online analytical mining (OLAM).
      • A prediction cube is a cube structure that stores prediction models over a multidimensional data space and supports prediction in an OLAP manner; it is an example of multidimensional data mining. In an ordinary data cube, each cell value is an aggregate computed over the subset of data in that cell; in a prediction cube, each cell value is obtained by evaluating a prediction model built on that cell's data subset, and so represents the predicted behavior of that subset (a toy sketch follows this list).
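
A toy Python sketch of a prediction cube; for brevity, a majority-class score stands in for a real trained classifier and its cross-validated accuracy (an assumption, not the book's construction):

```python
from collections import Counter, defaultdict

# Toy records: (city, year, label). In a prediction cube, the value of
# cell (city, year) is a model quality score computed on that subset.
records = [("NY", 2011, "yes"), ("NY", 2011, "yes"), ("NY", 2011, "no"),
           ("NY", 2012, "no"),  ("LA", 2011, "yes"), ("LA", 2011, "no"),
           ("LA", 2012, "yes"), ("LA", 2012, "yes")]

subsets = defaultdict(list)
for city, year, label in records:
    subsets[(city, year)].append(label)

prediction_cube = {}
for cell, labels in subsets.items():
    # "Build" a model on the cell's subset and evaluate it; a real
    # prediction cube would train a classifier and cross-validate it.
    majority = Counter(labels).most_common(1)[0][1]
    prediction_cube[cell] = majority / len(labels)

for cell, score in sorted(prediction_cube.items()):
    print(cell, round(score, 2))
```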
    • Multifeature cubes: complex aggregation at multiple granularities
      • The traditional data cube is built on commonly used dimensions with simple measures. Multifeature cubes compute more complex queries, whose answers depend on groupings of multiple aggregates at varying granularity levels (see the sketch after this list).
      • Multifeature cubes let users flexibly define complex, task-oriented cubes on which multidimensional aggregation and OLAP-based mining are possible.
      • The cost of computing a multifeature cube depends on the types of aggregate functions used; aggregate functions are classified as distributive, algebraic, or holistic.
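
A minimal Python sketch of a multifeature-style query on toy data: the second aggregate (sales) is restricted by the result of the first (maximum price), a dependency that a single-aggregate cube cannot express:

```python
from collections import defaultdict

# Toy sales rows: (item, region, month, price, sales).
rows = [("tv", "east", 1, 500, 3), ("tv", "east", 2, 700, 1),
        ("tv", "west", 1, 700, 2), ("pc", "east", 1, 900, 4)]

# Multifeature query at granularity (item,): for each item, find the
# maximum price R, then aggregate sales only over tuples with price = R.
groups = defaultdict(list)
for row in rows:
    groups[row[0]].append(row)

for item, g in sorted(groups.items()):
    max_price = max(r[3] for r in g)                       # aggregate 1
    sales_at_max = sum(r[4] for r in g if r[3] == max_price)  # aggregate 2
    print(item, max_price, sales_at_max)
# pc 900 4
# tv 700 3
```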
    • Exception-based, discovery-driven exploration of cube space
      • A data cube may contain a huge number of cuboids, and each cuboid a huge number of (aggregate) cells, so for the user even merely browsing the cube becomes a burden. Tools are needed to help users intelligently explore the enormous aggregate space of a data cube.
      • Exception indicators are precomputed measures of data anomalies, used at all levels of aggregation to guide the user's analysis. An exception is a data cube cell value that, under a statistical model, differs significantly from its expected value. The model accounts for variations and patterns of the measure across all dimensions to which a cell belongs, and thus covers exceptions hidden in any group-by aggregate of the data cube (a simplified sketch follows).
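
A simplified Python sketch of exception detection, with an additive row/column-effects model standing in for the fuller statistical model described above:

```python
import numpy as np

# Toy 2-D aggregate (region x month). Discovery-driven exploration
# flags cells whose value deviates strongly from what a statistical
# model expects; here the model is overall mean + row and column effects.
sales = np.array([[20., 22., 21., 80.],    # the 80 is the anomaly
                  [19., 21., 20., 22.],
                  [21., 20., 22., 21.]])

overall = sales.mean()
row_eff = sales.mean(axis=1, keepdims=True) - overall
col_eff = sales.mean(axis=0, keepdims=True) - overall
expected = overall + row_eff + col_eff
residual = sales - expected

# Standardize residuals and flag exceptions beyond 2 standard deviations.
z = residual / residual.std()
print(np.argwhere(np.abs(z) > 2))   # -> [[0 3]]: the cell holding 80
```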

"Reading notes-data mining concepts and techniques" data cube technology
