BI: opening the door to Business Intelligence

To meet the requirements of enterprise managers, a BI system must take the following steps: 1) integrate data in various formats and clean erroneous records out of the original data (the data preprocessing requirement); 2) store the preprocessed data centrally and in a unified manner (the metadata and data warehouse requirements); 3) finally, perform professional statistical analysis on the large, centralized dataset to discover new opportunities valuable to enterprise decision-making (the OLAP, i.e. online analytical processing, and data mining requirements). A typical BI architecture should therefore cover the requirements involved in these three steps.
Figure 3: Architecture of BI
The overall system architecture includes: end-user query and reporting tools, OLAP tools, data mining (Data Mining) software, data warehouse (Data Warehouse) and data mart (Data Mart) products, online analytical processing (OLAP), and other tools.

1) End-user query and reporting tools. These are designed to support simple access to raw data by non-specialist users; they do not include the full-featured report generation tools aimed at professionals.

2) Data preprocessing (ETL: data extraction, transformation, and loading). Useful data is extracted and cleansed from the many data sets produced by the enterprise's different operational systems to ensure correctness, then transformed and loaded (the ETL process) and merged into an enterprise-level data warehouse, giving a global view of the enterprise's data.

3) OLAP tools. These provide a multidimensional data management environment; their typical use is modeling business problems and analyzing business data. OLAP is also called multidimensional analysis.

4) Data mining (Data Mining) software. It uses techniques such as neural networks and rule induction to discover relationships between data and make data-based inferences.

5) Data warehouse (Data Warehouse) and data mart (Data Mart) products. These are pre-configured software packages covering data conversion, management, and access, and they usually include some business models, such as financial analysis models.

6) OLAP (online analytical processing). OLAP is a software technology that enables analysts, managers, or executives to access information quickly, consistently, and interactively from multiple perspectives in order to gain a deeper understanding of the data.

The core technologies are data preprocessing, data warehousing (DW), data mining (DM), and online analytical processing (OLAP). Each of these core parts is described in detail below.

Data preprocessing: Shortly after the advent of the early large-scale online transaction processing (OLTP) systems, a simple form of "extraction" processing emerged: a program scanned entire files and databases, selected the required data according to certain criteria, and copied it out for separate analysis. This did not affect the online transaction processing systems in use or degrade their performance, and the extracted data could be managed independently. The situation has since changed dramatically. Enterprises now run multiple online transaction processing systems at the same time, and the data definition formats of these systems differ; even different software products from the same vendor, or merely different versions of the same product, can define data in slightly different formats. A unified data format must therefore be defined first, the data from the various sources converted to this new unified format, and only then loaded into the data warehouse in a centralized manner. Note that not all data in different formats from different sources can be accommodated by the new unified format, and we should not insist on forcing together all data from all sources. Why? There are many reasons. Some of the originally entered records may simply contain wrong data; if such data cannot be corrected, it should be removed. Some records are unstructured and hard to convert into the new unified format. In other cases the whole file must be read just to extract the information, which is extremely inefficient, as with large binary data files, multimedia files, and so on. If such data is of little use for enterprise decision-making, it can simply be removed.
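As a concrete illustration of this preprocessing step, the following is a minimal ETL sketch in Python. The file names, column mappings, and cleaning rules are hypothetical assumptions made for illustration only, not part of any particular BI product.

```python
import pandas as pd

# Hypothetical source files exported from two different operational systems,
# each with its own column names; the maps translate them to a unified format.
SOURCES = {
    "crm_orders.csv": {"OrderID": "order_id", "Amt": "amount", "Dt": "order_date"},
    "erp_sales.csv":  {"id": "order_id", "total": "amount", "sale_date": "order_date"},
}

def extract_and_transform(path, column_map):
    """Extract one source and convert it to the unified format."""
    df = pd.read_csv(path)
    df = df.rename(columns=column_map)[list(column_map.values())]
    # Unify the date representation; unparseable dates become NaT.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    # Cleaning: drop records that cannot be corrected (bad dates, non-positive amounts).
    df = df.dropna(subset=["order_date"])
    return df[df["amount"] > 0]

def load(frames, target="warehouse_orders.csv"):
    """Load the unified data into a central store (a flat file stands in for the warehouse)."""
    pd.concat(frames, ignore_index=True).to_csv(target, index=False)

if __name__ == "__main__":
    load([extract_and_transform(path, cols) for path, cols in SOURCES.items()])
```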
Some software vendors have developed specialized ETL tools, including: Ardent DataStage; Evolutionary Technologies, Inc. (ETI) Extract; Informatica PowerMart; Sagent Solution; SAS Institute; Oracle Warehouse Builder; MS SQL Server 2000 DTS.

Data warehouse: The concept of the data warehouse was first proposed by William H. Inmon, known as "the father of the data warehouse", in his book "Building the Data Warehouse" written in the early 1990s: "A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management's decision-making." Subject orientation is the first notable feature of a data warehouse: data is organized around different subjects, the data for each subject is extracted from the various operational databases, and all the historical data related to a subject forms the corresponding subject domain. The second notable feature is integration: the data comes from different data sources, is made consistent through the corresponding transformation rules, and is finally integrated. The third feature is non-volatility: once data has been loaded into the data warehouse, its values no longer change; even though the running systems keep adding, deleting, and modifying data, those operations are recorded as new snapshots in the data warehouse without affecting the data that has already entered it. The last feature is time variance: every record in a data warehouse is captured at a specific point in time, and each record carries a corresponding timestamp.
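The non-volatile, time-variant behavior can be pictured with a minimal sketch: instead of updating a record in place, the warehouse only appends new timestamped snapshots. The table layout and field names below are invented for illustration.

```python
from datetime import datetime, timezone

# The "warehouse" is an append-only list of snapshots; nothing is ever overwritten.
warehouse = []

def load_snapshot(record: dict) -> None:
    """Append a record together with the time it entered the warehouse."""
    warehouse.append({**record, "load_ts": datetime.now(timezone.utc)})

# The operational system updates a customer's balance in place, but the
# warehouse keeps both states as separate timestamped rows.
load_snapshot({"customer_id": 42, "balance": 100.0})
load_snapshot({"customer_id": 42, "balance": 250.0})

for row in warehouse:
    print(row)
```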
Figure 4: Data warehouse architecture
A data warehouse classifies the metadata of its external data sources and operational data sources according to the design requirements of the data warehouse model, builds a metadata repository, and loads the corresponding data into the data warehouse after ETL. When an information consumer needs to query data, the consumer first learns about the metadata through the information display system or by browsing the metadata repository directly, and then issues a data query request to obtain the required data. A typical enterprise data warehouse system consists of three parts: data sources, data storage and management, and data access.
Figure 5: Data warehouse system
Data sources: the various production and operation data, office management data, and other internal data in the enterprise's operational databases, as well as survey data, market information, and other data from the external environment. These data are the foundation of the data warehouse system and the data source of the entire system.

Data storage and management: data warehouse storage consists of metadata storage and data storage. Metadata is data about data; it mainly covers the data dictionary, data definitions, data extraction rules, data transformation rules, and data loading frequency. The data in each operational database is extracted, cleansed, transformed, and integrated according to the rules defined in the metadata repository, reorganized by subject, and stored in the corresponding storage structure. Data marts can also be created for specific applications. A data mart can be seen as a subset of a data warehouse: it contains fewer subject domains, covers a shorter history, and holds less data, and it generally serves only one part of the management staff, so it is also called a department-level data warehouse.

Data access: this consists of OLAP (online analytical processing), data mining, statistical reports, and ad hoc queries. Take OLAP as an example: for a specific analysis subject, the various possible ways of observing the data are designed together with the corresponding analysis structure (that is, the fact tables and dimension tables are designed), enabling management decision makers to access the data quickly, consistently, and interactively on the basis of a multidimensional data model and to carry out complex analysis and prediction tasks. By storage method, OLAP can be divided into MOLAP, ROLAP, and other variants. MOLAP (multidimensional OLAP) stores the data required for OLAP analysis in a multidimensional database, where the data of an analysis subject forms one or more multidimensional data cubes. ROLAP (relational OLAP) stores the data required for OLAP analysis in a relational database, where the data of the analysis subject is organized in the "fact table - dimension table" star schema.

Data mining: The definition of data mining is rather loose; it depends on the viewpoint and background of whoever defines it. Some definitions found in the DM literature are: data mining is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data; data mining is the process of extracting previously unknown, comprehensible, and actionable information from large databases and using it for key business decisions; data mining, as used in the knowledge discovery process, identifies unknown relationships and patterns in data; data mining is the process of discovering useful patterns in data; data mining is a decision support process that studies large datasets for unknown information patterns. Although these definitions of data mining remain somewhat intangible, it has become a business. As in the gold rushes of the past, the goal is to "mine the miners": the biggest profit comes from selling tools to the miners rather than from doing the actual digging. The industry now has a number of mature data mining methodologies that provide an ideal guiding model for practical applications. Among them, three main standards stand out: CRISP-DM, PMML, and OLE DB for DM. CRISP-DM (Cross-Industry Standard Process for Data Mining) is one of the most widely recognized and influential of these methodologies.
CRISP-DM emphasizes that DM is not merely data organization or presentation, nor only data analysis and statistical modeling, but a complete process that runs from understanding the business need, through seeking a solution, to accepting the results in practical tests. CRISP-DM divides the whole mining process into the following six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The architecture diagram is as follows:
Figure 6: Framework of the CRISP-DM model
At the technical level, data mining can be divided into descriptive data mining and predictive data mining. Descriptive data mining includes data summarization, clustering, and association analysis; predictive data mining includes classification, regression, and time series analysis.

1. Data summarization: inherited from statistical analysis within data analysis. The purpose of data summarization is to condense the data and provide a compact description. Traditional statistical measures such as sums, averages, and variances are all effective methods, and the values can also be presented graphically with histograms, pie charts, and so on. In a broad sense, multidimensional analysis can also be placed in this category.

2. Clustering: divides the entire database into different groups, with the aim of making the differences between groups obvious while keeping the data within the same group as similar as possible. This method is often used for customer segmentation: before starting the segmentation you do not know how many categories the users should be divided into, so cluster analysis is used to identify groups of customers with similar characteristics, such as similar spending habits or similar ages; on this basis, marketing plans can then be developed for the different customer groups (see the sketch after this list).

3. Association analysis: looks for correlations between values in the database. Two common techniques are association rules and sequential patterns. Association rules look for correlations between different items within the same event; sequential patterns are similar, but look for temporal correlations between events, such as analyzing stock rises and falls.

4. Classification: the purpose is to construct a classification function or classification model (also called a classifier) that can map data items in the database to a specific category. To construct a classifier, a training sample dataset is needed as input. A training set consists of a group of database records or tuples; each is a feature vector made up of the values of the relevant fields (also called attributes or features), and each training sample additionally carries a category label. A specific sample can be written as (v1, v2, ..., vn; c), where vi is a field value and c is the category.

5. Regression: predicts the values of other variables from variables whose values are known. In general, regression uses standard statistical techniques such as linear and nonlinear regression. Often the same model can be used for both regression and classification; common algorithms include logistic regression, decision trees, and neural networks.

6. Time series: uses past values of a variable to predict its future values.
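To make the clustering idea in item 2 concrete, here is a minimal customer segmentation sketch in Python. The feature names and the choice of three clusters are assumptions made for illustration, and scikit-learn's k-means simply stands in for whatever algorithm a real mining tool would apply.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [age, average monthly spend].
customers = np.array([
    [22, 150], [25, 180], [23, 160],   # young, low spend
    [45, 900], [50, 950], [48, 880],   # middle-aged, high spend
    [63, 300], [60, 280], [66, 320],   # older, moderate spend
])

# Ask k-means for three groups; in practice the number of clusters is not
# known in advance and several values would be tried and compared.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(customers)

for segment in range(3):
    members = customers[labels == segment]
    print(f"segment {segment}: {len(members)} customers, "
          f"mean age {members[:, 0].mean():.0f}, mean spend {members[:, 1].mean():.0f}")
```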
Figure 7: Data mining system
The following are some current data mining products:
IBM: Intelligent Miner
Tandem: Relational Data Miner
Angoss Software: KnowledgeSeeker
Thinking Machines Corporation: Darwin
NeoVista Software: ASIC
ISL Decision Systems, Inc.: Clementine
DataMind Corporation: DataMind DataCruncher
Silicon Graphics: MineSet
California Scientific Software: BrainMaker
WizSoft Corporation: WizWhy
Lockheed Corporation: Recon
SAS Corporation: SAS Enterprise Miner

OLAP (online analytical processing): The concept of OLAP was first proposed in 1993 by E. F. Codd, the father of relational databases, who also proposed 12 principles for OLAP. The proposal of OLAP attracted a great deal of attention, and as a class of products OLAP is clearly distinguished from online transaction processing (OLTP). Today's data processing can be roughly divided into two categories: online transaction processing (OLTP) and online analytical processing (OLAP). OLTP is the main application of traditional relational databases and handles basic, day-to-day transactions such as bank transactions. OLAP is the main application of data warehouse systems; it supports complex analytical operations, focuses on decision support, and provides intuitive, easy-to-understand query results. OLAP is a software technology that enables analysts, managers, or executives to access information quickly, consistently, and interactively from multiple perspectives in order to gain a deeper understanding of the data. OLAP is designed to meet decision-support and specific query and reporting requirements in a multidimensional environment, and its core concept is the "dimension". A dimension is a high-level classification from one perspective on the objective world; dimensions generally contain hierarchical relationships, which can sometimes be quite complex. By defining several important attributes of an entity as separate dimensions, users can compare measures across different dimensions, so OLAP can also be seen as a collection of multidimensional data analysis tools. The basic multidimensional analysis operations of OLAP include drilling (roll up and drill down), slicing and dicing, rotation (pivot), as well as drill across and drill through. Drilling changes the level of a dimension, and hence the granularity of the analysis; it includes roll up and drill down. Roll up summarizes low-level detail data into higher-level summary data along a dimension, or reduces the number of dimensions; drill down, conversely, moves from summarized data to more detailed data, or adds new dimensions. Slicing and dicing look at how the measure data is distributed over the remaining dimensions after values have been fixed on some dimensions: if two dimensions remain, it is a slice; if three remain, it is a dice. Rotation changes the orientation of the dimensions, i.e. it rearranges how the dimensions are laid out in the table (for example, swapping rows and columns). OLAP can be implemented in several ways; by data storage method it can be divided into ROLAP, MOLAP, and HOLAP. ROLAP is the relational-database-based implementation of OLAP (relational OLAP): with a relational database at its core, multidimensional data is represented and stored in relational structures.
ROLAP splits the multidimensional structure of a multidimensional database into two kinds of tables: fact tables, which store the measure data and the dimension keys, and dimension tables, at least one per dimension, which store the descriptive information about the dimension's levels, member categories, and so on. Dimension tables are joined to the fact table by primary key and foreign key, forming the "star schema". For complex dimensions with deep hierarchies, several tables can be used to describe one dimension in order to avoid wasting storage on redundant data; this extension of the star schema is called the "snowflake schema". MOLAP is the OLAP implementation based on multidimensional data organization (multidimensional OLAP). With multidimensional data at its core, MOLAP stores data in multidimensional arrays, and the data forms a "cube" structure in storage; rotating, dicing, and slicing the cube are the main techniques used to produce multidimensional data reports. HOLAP is the hybrid OLAP implementation (hybrid OLAP), for example with a relational lower layer and a multidimensional-array upper layer, which provides greater flexibility. There are other ways to implement OLAP as well, such as providing a dedicated SQL server with special support for SQL queries over certain storage schemas (such as the star and snowflake schemas). OLAP is a tool for online data access and analysis aimed at specific problems; it analyzes, queries, and reports on data along multiple dimensions. A dimension is a specific angle from which people observe data. For example, when an enterprise considers product sales, it usually looks at them from the perspectives of time, region, and product; time, region, and product here are the dimensions. The different combinations of these dimensions, together with the measured indicators, form the multidimensional array that is the basis of OLAP analysis, which can be expressed formally as (dimension 1, dimension 2, ..., dimension n, measure), for example (region, time, product, sales). Multidimensional analysis means applying analysis actions such as slicing, dicing, drilling down, rolling up, and rotating to such data, so that users can observe the data in the database from many angles and sides and thereby gain a deep understanding of the information it contains. According to how the aggregated data is organized, the common OLAP variants today are the multidimensional-database MOLAP and the relational-database-based ROLAP: MOLAP organizes and stores data multidimensionally, while ROLAP simulates multidimensional data using existing relational database technology. In data warehouse applications, OLAP is generally the front-end tool of the data warehouse application, and OLAP tools can also be used together with data mining tools and statistical analysis tools to enhance the decision analysis function.
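As a rough illustration of these multidimensional operations, the following Python sketch uses pandas on a tiny hypothetical (region, time, product, sales) dataset. The data and column names are invented, and pandas merely stands in for what a MOLAP or ROLAP engine would do internally.

```python
import pandas as pd

# A tiny hypothetical fact set: (region, time, product, sales).
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "quarter": ["Q1",    "Q2",    "Q1",    "Q2",    "Q1",    "Q1"],
    "product": ["A",     "A",     "A",     "B",     "B",     "B"],
    "sales":   [100,     120,     90,      80,      60,      70],
})

# Roll up: summarize detail data to a higher level (total sales per region).
print(sales.groupby("region")["sales"].sum())

# Drill down: return to finer granularity (region broken down by quarter).
print(sales.groupby(["region", "quarter"])["sales"].sum())

# Slice: fix a value on one dimension (only Q1) and examine the rest.
print(sales[sales["quarter"] == "Q1"].pivot_table(
    index="region", columns="product", values="sales", aggfunc="sum"))

# Rotation (pivot): swap the roles of rows and columns in the report.
print(sales.pivot_table(index="product", columns="region",
                        values="sales", aggfunc="sum"))
```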