A recent project demand side proposed the application of the congons product. Although I understand from my business needs that this is a tool platform for data analysis and statistics, I do not know the specific application. In the past, data warehouse and data mining technologies were just esoteric and remote things, avoiding fear. There is an inexplicable confusion that suddenly comes into being. I searched Google and Baidu for a while and obtained some Bi (Business Intelligence) knowledge. I got a new understanding of data warehouse and data mining. The past confusions and Technical Development confusions seem to be all at once solved.
- OLAP tutorial
- Technical Comparison of nine Data Warehouses
- Blog: What is Bi and OLAP?
In the previous blog, the author explained the concepts of Bi from a simple perspective, and gradually introduced the concepts of OLAP, multidimensional cube, data mining, and data warehouse, the usage of congnos is also mentioned in specific applications, and there is an epiphany. This is a brand new concept and field.
1) What is bi?The databases we usually access, that is, record-oriented databases, store detailed business data, which is fragmented and abstract. Therefore, they are also called "operational" databases. For the decision makers of enterprises, it is a pity that historical data is tasteless. How to build a bridge between data and decision makers is born from the concept of business intelligence (BI. Business Intelligence (BI) is a new technology that uses data warehousing, online analysis, data mining, and other technologies to process and analyze data. It aims to provide enterprise decision makers
Decision support. Bi is a factory:
> The raw materials of Bi are massive.Data;
> Bi products are produced by data processing.InformationAndKnowledge;
> Bi pushes these products to EnterprisesDecision makers;
> Enterprise decision makers use Bi factory products to make the right decisionsDecision MakingTo promote the development of enterprises;
2) Data QueryCompared with traditional SQL statement-based queries, Bi tools generally provide interface-based queries. You can simply query data by dragging the mouse.
3) data reportsA report is used to display the queried data in a specified format. It is a visual display of SQL queries.
4) OLAP)
Online Analytical ProcessingBi brings a brand new way of data observation and is one of the core technologies of Bi. In traditional tables, data is stored in the database as records. If we want to obtain the summary and statistical information of data under a specific condition, such as comparison and proportions of data of different categories and periods, you must manually write complex and error-prone SQL sum statements. Here, I can't help but think back to my previous project called "big cash management. In this project, there is such a data statistics module that requires data comparison and statistical proportions for different periods, cross-year and cross-month. At that time, a large number of SQL statistical functions and union were used to write such SQL statements. To debug the written statements in Java and DB2 environments, you have to add the/R/N line break to the line feed of each SQL statement, which is really ugly.
OLAP systems can be divided into Relational OLAP (relationalolap), multi-dimensional OLAP (multidimensionalolap, molap), and Hybrid OLAP (hybridolap, holap) based on their storage formats.
1. ROLAP
ROLAP stores multidimensional data used for analysis in relational databases and defines a batch of real views as tables based on application requirements. You do not need to save each SQL query as a real view. You only need to define those queries with high application frequency and computing workload as a real view. For each query on the OLAP Server, the computed real view is preferentially used to generate the query results to improve the query efficiency. The RDBMS used as the ROLAP memory is also optimized for OLAP, such as parallel storage, parallel query, parallel data management, cost-based query optimization, bitmap index, SQL OLAP extension (cube, rollup), and so on.
2. molap
Molap physically stores the multidimensional data used for OLAP analysis in the form of multi-dimensional arrays to form a "cube" structure. The attribute values of a dimension are mapped to the value or range of a multi-dimensional array, and the summary data is stored as the value of a multi-dimensional array in the array unit. Since molap adopts a new storage structure and is implemented from the physical layer, it is also known as physical OLAP (physicalolap). ROLAP is mainly implemented through some software tools or intermediate software, the physical layer still uses the storage structure of relational databases, so it is called virtual OLAP (OLAP ).
3. holap
Because molap and ROLAP have their own advantages and disadvantages (as shown in the following table) and their structures are quite different, this poses a challenge for analysts to design OLAP structures. Therefore, a new OLAP structure-Hybrid OLAP (holap) was proposed, which combines the advantages of molap and ROLAP. So far, there is no formal definition of holap. However, it is obvious that the holap structure should not be a simple combination of the molap and ROLAP structures, but an organic combination of the advantages of these two structures and technologies, which can satisfy users' various complex analysis requests.
OLAP introduces the concept of cube, which abstracts data into "dimensions" and "measurements ". A dimension can be imagined as the coordinate axis of a multi-dimensional coordinate. The value corresponding to the intersection of multiple coordinates is "measurement ".
Dimensions can be further classified and summarized. For example, the dimension is classified by day as month, the dimension is classified by month as year, And the dimension is classified by location as region ......, No matter how we classify them, we can always find a focal point in the coordinates. This point is the value we pay attention. This is a very intuitive concept. By drilling down the cube and rolling up the cube, We can get any data we want. Cubes can also be displayed through a variety of charts. You only need to select the fields corresponding to the dimension and measurement to immediately obtain a vivid Chart display from the Bi platform:
5) Data MiningCompared with the above technology, data mining embodies Bi I, that is, intelligence. The purpose of data mining is to use computers to analyze a large amount of data, find hidden patterns and knowledge between data, and present it to users in an understandable way. The three main elements of data mining are:
>Technologies and algorithms:Currently, common data mining technologies include --
Auto Cluster Detection)
Decision tree (demo-trees)
Neural Networks)
>Data:Because data mining is an unknown process in known data mining, it requires a large amount of data accumulation as a data source. The larger the data accumulation volume, the data mining tool will have more reference points.
>Prediction Model:That is, it simulates the business logic that requires data mining by a computer, which is also the main task of data mining.
6) Data Warehouse
As mentioned above, "operational database" is an extension of the concept of operational database.
"Data Warehouse" is used for decision support and is oriented to analytical data processing, unlike operational databases. In addition, data warehouse effectively integrates multiple heterogeneous data sources. After integration, the Data Warehouse is reorganized according to the topic, and contains historical data, and the data stored in the data warehouse is generally not modified.
The relationship between operational databases, data warehouses, and databases is like the relationship between C:, D: And hard disk. The database is a hard disk, and the operational database is C :, data Warehouse is D: The operation-type database and data warehouse are stored in the database, but the design mode and usage of the table structure are different.
7) Why is there a data warehouse?
So why should we add such a "data warehouse" between operational databases and bi?
First, the operational database is busy day and night. It aims to respond quickly to the business and has no energy to meet the data requirements of Bi. In addition, the data requirements of Bi are usually summarized, A select sum (XX) group by XX can make the operation-type database consume a lot of resources, and the business processing can't keep up with it. The trouble is huge. For example, if you save 5000 yuan, what do you think if the money hasn't been paid in ten minutes? The Bank's leaders must be looking at the pie chart?
Second, enterprises generally have multiple applications that correspond to multiple operational databases, such as the human resource library, financial database, sales document library, and inventory database, to provide a panoramic data view, Bi must combine these scattered data. For example, to achieve OLAP analysis that integrates sales and inventory information, bi tools must be able to efficiently obtain data from two databases. The most efficient method is to integrate the data into the data warehouse first, while Bi applications retrieve data from the data warehouse in a unified manner.
Integrating data from distributed operational databases into a data warehouse is a university question that gave birth to the market of data integration software. This kind of integration does not simply overlay tables together, but must extract the dimensions of each operational database and set the common dimensions to the shared dimensions, then, the database tables containing specific metric values are unified into several large tables (the term "fact table" and fact tables) according to the topic, and the Data Warehouse table structure is established according to the dimension-measurement model, then extract and convert the data. Subsequent extraction is generally performed to incrementally extract new data when the operational database load is relatively small (such as in the early morning), so that data in the data warehouse will accumulate. Most Bi applications do not require real-time data. For example, decision makers only need to see the weekly report of last week every Monday. 95% of Bi applications
NoTo achieve real-time performance, data may lag from one hour to one month. This is a feature of the Decision Support System. This lagging interval is the time when data extraction tools work. Of course, Bi applications usually have very few requirements on real-time data. In this case, you only need to directly connect the Bi querying software to the Business Database to address these special requirements, however, you must restrict the load and prohibit complex queries.
Currently, database products provide special optimizations for Data Warehouses. For example, when installing a high version of MySQL, you will be asked in sequence to make the database instance transaction-oriented, or demo-support. The former is an operational database, and the latter is a data warehouse (Decision supportAgain), for these two forms, the database will provide targeted optimization.