3) Data Analysis: OLAP and Data Mining
OLAP and data mining form an organic whole: data warehouses built around different subjects must be paired with appropriate data mining algorithms for analysis. If the data warehouse is the chef's ingredients in a BI system, then OLAP and data mining are the kitchen utensils.
The concept of Online Analytical Processing (OLAP) was first proposed in 1993 by E. F. Codd, the father of relational databases. Its aim is to let managers browse and analyze massive amounts of data flexibly. Codd argued that online transaction processing (OLTP) could not satisfy end users' needs for database query and analysis, and that simple SQL queries over large databases were inadequate for analytical work: decision analysis may require a large amount of computation over the relational database, and the query results still fail to meet decision makers' requirements. Codd therefore proposed the concepts of the multi-dimensional database and multi-dimensional analysis, that is, OLAP, together with 12 rules that an OLAP system should satisfy:
Rule 1: The OLAP model must provide a multi-dimensional conceptual view
Rule 2: Transparency
Rule 3: Accessibility
Rule 4: Consistent reporting performance
Rule 5: Client/server architecture
Rule 6: Generic dimensionality (all dimensions are treated equally)
Rule 7: Dynamic sparse matrix handling
Rule 8: Multi-user support
Rule 9: Unrestricted cross-dimensional operations
Rule 10: Intuitive data manipulation
Rule 11: Flexible reporting
Rule 12: Unlimited dimensions and aggregation levels
OLAP differs greatly from traditional online transaction processing (OLTP), as the following table shows:
| | OLTP | OLAP |
| --- | --- | --- |
| User | Operators and lower-level managers | Decision makers and senior managers |
| Function | Routine day-to-day operations | Analysis and decision-making |
| DB design | Application-oriented | Subject-oriented |
| Data | Current, up-to-date, two-dimensional, discrete | Historical, aggregated, multi-dimensional, integrated |
| Access | Reads/writes dozens of records | Reads millions of records |
| Unit of work | Simple transactions | Complex queries |
| Number of users | Thousands | Hundreds |
| DB size | 100 MB to GB | 100 GB to TB |
Built on the concept of multi-dimensionality, OLAP provides multi-dimensional and cross-dimensional analysis operations such as slicing, dicing, drill-down, roll-up, and pivoting. Compared with ordinary static reports, OLAP is far better suited to the data warehouse analysis needs of decision makers and analysts. OLAP architectures fall into three types: ROLAP (Relational OLAP), based on relational databases; MOLAP (Multidimensional OLAP), based on multi-dimensional databases; and HOLAP (Hybrid OLAP), based on a mixture of the two. The first two are the most common. ROLAP implements OLAP on top of a relational database: it uses relational structures to represent and store multi-dimensional data, splitting the multi-dimensional structure into two kinds of tables, fact tables that store the measures together with the dimension keys, and dimension tables, at least one per dimension, that store the descriptive information about dimension levels and member categories. MOLAP implements OLAP on a multi-dimensional data organization: multi-dimensional arrays are the core storage structure, and queries combine index lookup with direct addressing, which is much faster than ROLAP's table index searches and joins.
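To make the ROLAP idea more concrete, here is a minimal sketch, assuming Python with pandas installed; the table contents and column names are invented for illustration. It joins a small fact table with a dimension table and then performs a roll-up and a slice, two of the OLAP operations mentioned above.

```python
import pandas as pd

# A tiny ROLAP-style star schema: one fact table plus one dimension table.
# All data and column names here are made up for illustration only.
fact_sales = pd.DataFrame({
    "date_key":    ["2005-01", "2005-01", "2005-02", "2005-02"],
    "product_key": [1, 2, 1, 2],
    "amount":      [1200, 800, 1500, 600],
})
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "product":     ["Beer", "Diapers"],
    "category":    ["Drinks", "Baby"],
})

# Join the facts to the dimension table (the ROLAP equivalent of a star join).
cube = fact_sales.merge(dim_product, on="product_key")

# Roll-up: aggregate the amount measure up to the category level.
rollup = cube.pivot_table(values="amount", index="category",
                          columns="date_key", aggfunc="sum")
print(rollup)

# Slice: fix one member of the date dimension and look at the rest.
print(cube[cube["date_key"] == "2005-01"])
```

A MOLAP engine would hold the same figures in a pre-computed multi-dimensional array and answer such queries by direct addressing rather than by joining tables at query time.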
Data Mining (DM) refers to the process of extracting useful information and knowledge hidden within large volumes of incomplete, noisy, fuzzy, and random data, in the form of concepts, rules, patterns, and so on.
From the business perspective, I personally think the goals of data mining in a business intelligence system fall roughly into two categories:
① Discover potentially useful information hidden in accumulated business data that management is not yet aware of, and create new business opportunities from it. There are already plenty of commercial sales examples of this kind, such as the long-circulated "beer and diapers" story in the BI industry and the examples I mentioned at the beginning of this article.
② Seek optimal resource-planning solutions from accumulated business data in order to reduce costs and increase profits. Let's start with an example we may all have thought about: a postman delivering mail. Suppose I am a postman in a city and I have to deliver many letters in one trip, with the recipients' addresses scattered across the city's streets. How should I plan a route so that the total travel is minimized? Business activities are full of similar problems. When there is not much data to analyze, we can find the optimal solution by hand with paper and pen; but when the raw data volume is extremely large, we have to turn to computers.
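The postman's problem is essentially the travelling salesman problem, which quickly becomes intractable by hand. The sketch below, in plain Python with made-up coordinates, shows the simple nearest-neighbour heuristic: it does not guarantee the optimal route, but it illustrates how a computer can produce a reasonable tour once the number of addresses grows.

```python
import math

# Hypothetical delivery addresses as (x, y) coordinates on a city map.
addresses = {
    "post_office": (0, 0),
    "addr_a": (2, 3),
    "addr_b": (5, 1),
    "addr_c": (6, 4),
    "addr_d": (1, 6),
}

def distance(p, q):
    """Straight-line distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def nearest_neighbour_tour(points, start):
    """Greedy heuristic: always visit the closest unvisited address next."""
    unvisited = set(points) - {start}
    tour, current = [start], start
    while unvisited:
        nxt = min(unvisited, key=lambda name: distance(points[current], points[name]))
        tour.append(nxt)
        unvisited.remove(nxt)
        current = nxt
    return tour

route = nearest_neighbour_tour(addresses, "post_office")
print(" -> ".join(route))
```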
At present, the industry has several mature data mining methodologies that provide a sound guiding framework for practical applications. CRISP-DM (Cross-Industry Standard Process for Data Mining) is one of the most widely recognized and influential. CRISP-DM emphasizes that DM is not merely data organization and presentation, nor only data analysis and statistical modeling, but a complete process that runs from understanding the business need, through finding a solution, to acceptance and testing in practice. CRISP-DM divides the whole mining process into six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
Business understanding means understanding the enterprise's operations, business processes, and industry background; data understanding means understanding the enterprise's existing application systems; data preparation means extracting from the mass of enterprise data the subset relevant to the problem being explored; modeling means selecting a suitably practical mining model, based on the business understanding and the prepared data, and drawing mining conclusions from it. Evaluation tests the mining conclusions in practice; if the expected results are achieved, the conclusions can be published.
In actual projects, ordinary transaction processing systems, and even simple business intelligence systems that only provide report analysis, need little engineering maintenance once completed. Business intelligence systems that use data mining are very different. Data mining is a cycle of business understanding, data understanding, modeling, and evaluation that is adjusted, revised, and improved over and over again, and the application of a model is never static: it must be updated and rebuilt as appropriate. Business intelligence projects therefore do not pursue one-off construction; they call for a consulting-style service closely tied to the enterprise's business that can strengthen its competitiveness, and analysts familiar with both the business and the analysis methods play a vital role in applying a business intelligence system.
From the technical perspective, data mining techniques can be divided into descriptive and predictive data mining. Descriptive data mining includes data summarization, clustering, and association analysis; predictive data mining includes classification, regression, and time series analysis.
1. Data summarization: inherited from statistical analysis. Its purpose is to condense the data and give a compact description. Traditional statistical measures such as sums, averages, and variances are all effective, and the values can also be presented graphically with histograms, pie charts, and other chart types. In a broad sense, multi-dimensional analysis can also be placed in this category.
2. Clustering: divides the whole database into different groups, aiming to make the differences between groups obvious while keeping the data within each group as similar as possible. This method is typically used for customer segmentation: before segmenting, you do not know how many categories the customers should be split into, so cluster analysis is used to find groups with similar characteristics, such as similar consumption patterns or similar ages, and marketing plans can then be developed for each customer group (a small code sketch follows this list).
3. Association analysis: finds correlations between values in the database. Two common techniques are association rules and sequential patterns. Association rules look for correlations between different items within the same event, while sequential patterns look for correlations between events over time, for example in analyzing stock rises and falls.
4. Classification: the goal is to construct a classification function or model (also known as a classifier) that maps data items in the database onto specific categories. Building a classifier requires a training sample set as input. A training set consists of database records or tuples, each of which is a feature vector made up of the values of the relevant fields (also called attributes or features), together with a category label. A training sample can therefore be written as (v1, v2, ..., vn; c), where vi is a field value and c is the category.
5. Regression: predicts the values of other variables from variables whose values are known. Regression generally uses standard statistical techniques such as linear and nonlinear regression, and the same model can often be used for both regression and classification; common algorithms include logistic regression, decision trees, and neural networks.
6. Time series analysis: uses the past values of a variable to predict its future values.
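As a concrete illustration of the clustering case in item 2 and of the (v1, v2, ..., vn; c) classifier notation in item 4, here is a minimal sketch, assuming Python with NumPy and scikit-learn installed; the customer features, labels, and number of segments are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical customer feature vectors (v1, v2): monthly spend and age.
customers = np.array([
    [120, 23], [150, 25], [900, 41], [950, 45],
    [80, 19],  [100, 22], [870, 39], [920, 44],
])

# Clustering: segment customers without any predefined labels.
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print("Customer segments:", segments)

# Classification: each training sample is (v1, v2; c), where c is a known label,
# e.g. whether the customer responded to a past campaign (made-up labels).
responded = np.array([0, 0, 1, 1, 0, 0, 1, 1])
classifier = LogisticRegression(max_iter=1000).fit(customers, responded)
print("Predicted response for a new customer:", classifier.predict([[500, 30]]))
```

The clustering step needs no labels and simply groups similar customers, while the classification step learns from labeled history and can then score new customers, which mirrors the descriptive/predictive split described above.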
In the early days, because data mining theory and technology were immature, software vendors did not develop data mining tools for their database products. A small number of large enterprises nevertheless had such technical requirements, so independent data mining tools appeared on the market, such as SAS Enterprise Miner, IBM Intelligent Miner, and SPSS Clementine, covering techniques such as anomaly detection, sequence analysis, and deviation analysis (thanks to Yi Hong Kong for the addition). Today, as the related technologies mature and more and more enterprises raise such requirements, software vendors have become aware of the potential. It is estimated that within the next three to five years, complete data mining tools will be integrated into data warehouse products.
Finally, a reminder: although business intelligence applications have a bright future, the BI industry has not yet formed unified standards. Moreover, because implementing a BI system is a long, iterative process, an enterprise will inevitably go through a short-term decline in returns along the way, which has greatly dampened enterprises' confidence and enthusiasm for adoption. Most enterprises therefore currently take a wait-and-see attitude, or implement BI only in a limited department, which I personally think is a wise approach. Even with such partial implementations, there are still opportunities: as a technician, you can pursue breakthroughs in R&D of the related technologies; as a software vendor, you can look for opportunities in technical upgrades for existing customers and existing products.
[References]
I. Bibliography
1. W. H. Inmon, Building the Data Warehouse, Third Edition
2. David Marco, Building and Managing the Meta Data Repository: A Full Lifecycle Guide
3. David Hand, Heikki Mannila, Padhraic Smyth, Principles of Data Mining
4. Olivia Parr Rud, Data Mining Cookbook: Modeling Data for Marketing, Risk, and Customer Relationship Management
II. Articles
1. Wang Yu, "What Business Intelligence Is, and Is Not"
2. Yang Lin, "A Correct Understanding of Business Intelligence"
3. Dr. Wu Bin, "Technology and Practice of Business Intelligence"
4. Author unknown, various technical articles found via Google
III. Data Sources
1. Bookshops
2. Technical magazines
3. http://www.google.com/
4. http://www.chinabi.net/
5. http://www.dmresearch.net/
6. http://www.amteam.org/