Some basic concepts of data warehouse and data mining

Last Update:2014-08-18 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

The following content is taken from the Internet and sorted out.

Terms:

Bi (Business Intelligence): Business Intelligence,

DW (data warehouse): Data Warehouse. For details, see section Q1.

OLTP (on-line transaction processing): Online Transaction Processing

It is also known as a transaction-oriented processing system. Its basic feature is that the customer's raw data can be immediately transmitted to the Computing Center for processing and the processing results can be given in a very short time. The biggest advantage of this is the ability to process input data in real time and answer in a timely manner. It is also called a real time system ). An important performance indicator used to measure the online transaction processing system is system performance, which is embodied in the response time (Response Time). After a user sends data to a terminal, the time required by the computer to give a response to this request.

The OLTP database is designed to allow transaction applications to write only the required data to process a single transaction as soon as possible.

OLAP (on-line analytical processing): Online Analytical Processing

OLAP was proposed by E. F. codd in 1993.
Definition by the OLAP Council: OLAP is a software technology that enables analysts to quickly, consistently, and interactively observe information from various aspects to gain an in-depth understanding of data, this information is directly converted from raw data. They reflect the real situation of the enterprise in a way that is easy to understand.
Most of OLAP policies store relational or common data in multiple dimensions for analysis and online analysis. This multi-dimensional dB is also seen as a supercube that stores data along each dimension. It allows users to easily analyze data along the axis of the transaction, analysis forms related to mainstream business users generally include slicing, chunking, drill-down, mining, and other operations.

Datamart: A part of data that is independent from a data warehouse for a specific purpose or scope of application. It can also be called Department data or topic data (subjectarea ). In the implementation process of the data warehouse, we can often start from the data mart of a department, and then use several data marketplaces to form a complete data warehouse. Note that field definitions of the same meaning must be compatible when different data marketplaces are implemented, so that subsequent implementation of data warehouses will not cause great trouble.

Data Mining: Data Mining. For details, see section Q5.

ETL: ETL is the abbreviation of each word "extract", "transform", and "LOAD", that is, "extraction", "Conversion", and "loading ", however, we often call it Data Extraction for short. ETL is the core and soul of Bi/DW (Business Intelligence/data warehouse). It integrates and improves the value of data according to unified rules, it is responsible for the process of converting data from the data source to the target data warehouse, and is an important step for implementing the data warehouse.

Metadata: metadata that describes the data structure and creation methods in a data warehouse. It can be divided into two categories by purpose, technical metadata and commercial metadata. Technical metadata is the data used by the design and management personnel of the data warehouse for development and daily management. Includes: Data Source Information; description of data conversion; definition of objects and data structures in the data warehouse; rules used for data cleaning and data updating; ing between source data and target data; the user has the permission to ask questions, including data backup history, data import history, and Information Publication history.

Commercial metadata describes the data in a data warehouse from the business perspective. Including: Business topic description, including data, queries, reports;

Metadata provides an information folder (informationdirectory) for the data warehouse ), this folder fully describes what data exists in the data warehouse, how the data is obtained, and how to submit the data. It is the center for data warehouse execution and maintenance. The data warehouse server uses it to store and update data. Users can use it to understand and ask for data.

Q1: What is a data warehouse?

A data warehouse is a topic-oriented (subject oriented), integrated (integrate), nonvolatile, and time variant data set, used to support management decisions. The concept of Data Warehouse can be managed at two levels: ① Data Warehouse is used to support decision-making and oriented to analytical data processing. It is different from the existing operational database of an enterprise; ② Data Warehouse is an effective integration of multiple heterogeneous data sources.

Questions are reorganized, including historical data, and data stored in the data warehouse is generally not changed. The construction of enterprise data warehouses is based on the accumulation of existing enterprise business systems and a large amount of business data. A data warehouse is not a static concept. It only gives information to users who need it in a timely manner for them to make decisions to improve their business operations. Information can only play a role, so that information can be meaningful. Organizing, summarizing, and reorganizing information and providing it to corresponding management decision-makers in a timely manner is the fundamental task of data warehouses.

Q2: Why do we need to build a data warehouse?

An enterprise establishes a data warehouse to fill in the existing data storage format and cannot meet the needs of information analysis. A core concept in the data warehouse theory is that the processing performance of transactional and Decision-Making data is different.

Enterprises collect data in their transaction operations. In the process of enterprise operation: with the ordering and sales records, these transaction data is also continuously generated. To import data, we must optimize the transaction database.

When processing decision-supporting data, some questions are often raised: which type of customers will buy what kind of products? How much will the sales change after the promotion? After the price changes or the store address changes, how much will the sales change? In a certain period of time, which products are relatively easy to sell compared with other products? Which customers have added their purchases? Which customers have cut their purchases?

Transactional databases can answer these questions, but the answers they give are often not very pleasant. Competition often exists when using limited computer resources. When adding new information, we need to be overwhelmed by transactional databases. When answering a series of detailed information analysis questions, the effectiveness of the system processing new data will be greatly reduced. Another problem is that transactional data is always changing dynamically. Decision-supporting processing requires relatively stable data so that the problem can be answered in a consistent and continuous manner.

Data Warehouse solutions include separating decision-supporting data processing from transaction data processing. Data is imported from a transaction-type database every night or every weekend according to a certain period of time (usually every night)-"Data Warehouse ". A data warehouse organizes data based on a topic to answer a question from an enterprise. This is the most effective way to organize data.

In addition, the information system in daily operation of an enterprise is a complex data set composed of multiple traditional systems, incompatible data sources, and databases and applications. Each part cannot communicate with each other. From this perspective: the application system currently executed is an irreplaceable system built by users, especially system data. The purpose of establishing a data warehouse is to integrate and organize the data from different sources for unified management, so as to ensure data consistency and integration and provide a comprehensive and single-entry solution. This reminds me of the concept of SOA. The former is the Integration Optimization at the data layer, and the latter is the Integration Optimization at the application service layer.

Q3: What is the general structure of the data warehouse?

1. Architecture:

(1) data sources are the basis of the data warehouse system and the data source of the entire system, usually including internal and external information of the enterprise.

(2) data storage and management is the core of the entire data warehouse system. The data warehouse can be divided into enterprise-level data warehouse and department-level data warehouse (usually called data mart) according to the data coverage scope ).

(3) OLAP (on line analytical processing) server effectively integrates the data required for analysis and organizes the data according to multi-dimensional models for multi-angle and multi-level analysis and trend discovery.

(4) Front-end tools include various report tools, query tools, data analysis tools, data mining tools, and various application development tools based on data warehouses or data marketplaces.

2. fact tables and dimension tables

Fact tables and dimension tables are two basic concepts in a multi-dimensional model.

Fact tables are the primary data items for data analysis. They are usually a business or event in an enterprise. Fact tables generally have data characteristics and can be added. fact tables store data of different granularities. Data of different granularities in the same topic is generally stored in different fact tables.

A dimension table usually contains descriptive text information, which becomes a search condition for a fact table. The dimension attributes in a dimension table should be detailed and clear, reflecting the division of dimension layers, which can be a constraint for analyticdb queries. This is a difference between data warehouse and operational applications in data model design. The number of dimension table levels depends on the query granularity. In the actual business environment, the multi-dimensional data model generally contains 4 ~ 15 dimensions, many other dimensions or fewer dimensions are generally rare. In the detailed work, the designer must determine the corresponding dimensions based on the actual situation of the enterprise.

In a multidimensional model, the master code of a fact table is a combination code, the master code of a dimension table is a simple code, and the components corresponding to the master code of a fact table and a dimension table are external codes. The fact table is associated with the external code values corresponding to each dimension with the dimension table. When querying, the relationship between a fact table and a dimension table is used.

3. Data Organization Structure:

Star Model

Multidimensional Data Modeling organizes data in an intuitive way and supports high-performance data profiling. Each Multidimensional Data Model is represented by multiple multidimensional data modes. Each multidimensional data mode consists of a fact table and a group of dimension tables. The most common multi-dimensional model is the star mode. In star mode, the fact table is centered, and multiple dimension tables are radiating around it and connected to the fact table.

A star-centered entity is an indicator entity. It is the center of basic entities and query activities that users are most concerned about. It provides quantitative data for the query activities of data warehouses. Each indicator entity represents a series of related facts and completes a specified function. Objects located on the star angle of a star chart are dimension objects. They are used to restrict the user's query results and filter data so that fewer rows are returned from the index object query, thus narrowing down the limit. Each dimension table has its own attributes. The dimension table is associated with the fact table through keyword.

Snowflake Model

The snowflake model is an extension of the star model. Each dimension can be connected to multiple specific category tables. In this case

Mode. In addition to the dimension table function in the star model, dimension tables are connected to fact tables.

Detailed description of the foreign category table. By describing the dimension of a fact table

To narrow down fact tables and improve query efficiency.

Q4: how to design and establish a data warehouse?

Nine steps for designing a data warehouse

1) select an appropriate topic (the domain to be resolved)

2) Clearly define the fact table

3) confirm and confirm dimensions

4) Choosing the facts

5) Calculate and store derivative data segments in the fact table

6) Rounding out the dimension tables

7) Choosing the duration of the database

8) the need to track slowly changing dimensions

9) determine the query priority and query mode.

Technically

Hardware Platform: the hard disk capacity of the Data Warehouse is usually 2-3 times of the hard disk capacity of the operating database. Generally, the mainframe has more reliable performance and stability, and is also easy to combine with legacy systems; while the pcserver or unixserver is more flexible, easy operations and the ability to dynamically generate query requests for queries. When selecting a hardware platform, do you have to consider: Do you want to provide parallel I/O throughput? What is the multi-CPU support capability?

Data Warehouse DBMS: its ability to store large data volumes, query performance, and support for parallel processing.

Network Structure: The implementation of the data warehouse will produce a large amount of data communication in that part of the network segment, and there is no need to improve the network structure.

Implementation

　　Procedure for creating a data warehouse

1) Collect and analyze business requirements

2) establish a data model and physical design of the Data Warehouse

3) define the data source

4) Select the Data Warehouse Technology and Platform

5) extract, convert, and load data from operational databases to Data Warehouses

6) Select the submit and report tools.

7) select database connection Software

8) Select data analysis and data presentation software

9) update the data warehouse

Data Extraction, cleaning, conversion, and migration

1) Data Conversion tools should be able to read data from various data sources.

2) supports flat files, index files, and legacydbms.

3) data can be integrated with different types of data sources.

4) standardized data renewal Interface

5) it is best to have the ability to read data from the data dictionary.

6) The code generated by the tool must be maintained in the development environment.

7) Only the specified data that meets the specified conditions and the specified part of the source data can be extracted.

8) data type conversion and Character Set conversion during extraction

9) generate derivative fields during extraction

10) allows the Data Warehouse Management System to actively call data to regularly extract data, or generate flat files based on the results.

11) the vitality and product support capabilities of software vendors must be carefully evaluated

Q5: What is data mining?

Data mining refers to massive, incomplete, noisy, fuzzy, and random practical application data, the process of extracting hidden information and knowledge that people do not know in advance but is potentially useful.

Data Mining is a cross-discipline that promotes the application of data from low-level simple queries to Data Mining knowledge and provides decision support. With such demands, researchers in different fields are gathered, especially database technology, artificial intelligence technology, mathematical statistics, visualization technology, parallel computing and other scholars and Project technicians, he devoted himself to the emerging research field of data mining to form a new technical hot spot.

Q6: how to mine data?

1. Determine the Business Object

Clearly defining business problems and recognizing the purpose of data mining is an important step in data mining. The final structure of data mining cannot be previewed, but the problems to be explored should be foreseen. for data mining, data mining is blind and will not succeed.

2. Data Preparation

1) Data Selection

Search for all internal and external data information related to business objects, and select the data suitable for data mining applications.

2) data preprocessing

Study the quality of data to prepare for further analysis. Determine the type of the mining operation to be performed.

3) Data Conversion

Converts data into an analysis model. This analysis model is created for mining algorithms. Creating an analysis model that is truly suitable for mining algorithms is the key to successful data mining.

3. Data Mining

Mine the converted data. In addition to selecting an appropriate mining algorithm, you can complete all the work on your own initiative.

4. Result Analysis

Interpret and evaluate the results. The analysis method used is generally determined by data mining, and visualization technology is generally used.

5. Knowledge assimilation

Integrate the knowledge obtained from the analysis into the organizational structure of the business information system.

Q7: What is the relationship between data warehouses and data mining?

Relationship between data warehouse and data mining are both important components of the data warehouse system. They have both links and differences.

Contact:

(1) Data Warehouses provide better and wider data sources for data mining.

(2) the data warehouse provides a new support platform for data mining.

(3) Data Warehouses provide convenience for better use of the data mining tool.

(4) Data Mining provides better decision support for data warehouses.

(5) Data Mining puts forward higher requirements on the data organization of data warehouses.

(6) Data Mining also provides extensive technical support for data warehouses.

The difference is:

(1) data warehouse is a data storage and data organization technology that provides data sources.

(2) Data Mining is a data analysis technology that can analyze data in a data warehouse.

Q8: Application and practical significance of data warehouse and data mining in some commercial fields

1) goods sales. The Department of Commerce regards data as a kind of competitive Fuke, which is more important than any other department. Therefore, it is necessary to turn a large marketing database into a data mining system. Kraft is one of the companies that use the marketing database. kgf collects a list of 30 million users who buy its products) obtained through various promotion means. Kgf regularly sends coupons for famous brand products to these users to introduce the performance and usage of new products. The company understands that the more users it has, the more opportunities it has to buy and use the products, and the better the company's business conditions.
2) manufacturing. Many companies not only use decision-making support systems to support marketing activities, but also use decision-making support systems to monitor manufacturing processes as market competition grows, some manufacturers claim to have directed their respective offices to reduce manufacturing costs by 25% each year within three years. It is self-evident that the manufacturer often collects information about suppliers of various components. Because they must also follow the manufacturer's cost reduction strategy. To deal with challenges from various parties, the manufacturer already has a "cost" decision support system that can monitor the component costs provided by various suppliers to achieve the set price targets, such applications need to collect product cost information from various vendors for one consecutive year, so as to determine whether such an organization can meet the previously formulated strategic objectives for price reduction.
3) Financial Services/credit cards. General Motors has used its credit card, the GM card, and has 12 million customers with credit cards in its database. Through observation, the company can learn what kind of cars they are driving, what kind of cars they plan to buy next, and what kind of vehicles they like. For example, if a customer with a credit card expresses interest in a truck, the company can send an email to the truck department and inform the relevant department of the customer's information.
4) remote communication. Many large remote communication companies have recently discovered that they are facing great competition pressure, which did not exist a few years ago. In the past, the business did not require them to closely watch the market trend. Due to the limited choice of customers, this situation has changed a lot recently. Companies are actively collecting a large amount of customer information, providing new services to their existing customers, and opening up new business projects to expand their market scale. From these new services, the company can achieve greater benefits in the short term.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Some basic concepts of data warehouse and data mining

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Some basic concepts of data warehouse and data mining

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support