Some basic concepts of data warehouse and data mining

Source: Internet
Author: User

The following is excerpted and collated from Internet sources.

Terms:

BI (Business Intelligence): business intelligence.

DW (Data Warehouse): data warehouse; see section Q1 below.

OLTP (On-Line Transaction Processing): online transaction processing

Also known as a transaction-oriented processing system. Its basic characteristic is that the customer's raw data is sent to the computing center immediately and the results are returned within a very short time. Its greatest advantage is that input data is processed at once and answered promptly, which is why it is also called a real-time system. An important performance metric for an OLTP system is its real-time response time: the time the computer takes to respond to a request after the user has entered it at a terminal.

The OLTP database is designed to allow transactional applications to write only the data they need to handle a single transaction as quickly as possible.

OLAP (On-Line Analytical Processing): online analytical processing

OLAP was proposed by E. F. Codd in 1993.
OLAP Council definition: OLAP is a software technology that lets analysts observe information quickly, consistently, and interactively from many angles in order to understand the data in depth. That information is derived directly from the raw data and reflects the real situation of the enterprise in a form users can readily understand.
Most OLAP strategies take relational or ordinary data and store it in multidimensional form so that it can be analyzed, which is how online analytical processing is achieved. Such a multidimensional database is also regarded as a hypercube that stores data along its dimensions and lets users analyze the data conveniently along each axis. The forms of analysis most used by mainstream business users are slicing, dicing, drilling, and similar operations.
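The operations named above can be sketched on a tiny in-memory cube. This is only an illustration: the cube contents, the dimension names, and the helper functions `slice_cube`, `dice_cube`, and `roll_up` are all hypothetical, and real OLAP servers use far more elaborate storage. Rolling up is shown as the inverse of drilling down.

```python
from collections import defaultdict

# Hypothetical sales cube: each cell is keyed by (year, region, product).
cube = {
    (2023, "East", "Tea"): 120,
    (2023, "East", "Coffee"): 80,
    (2023, "West", "Tea"): 60,
    (2024, "East", "Tea"): 150,
    (2024, "West", "Coffee"): 90,
}
DIMS = ("year", "region", "product")


def slice_cube(cells, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}


def dice_cube(cells, **allowed):
    """Dice: keep cells whose coordinates fall in the allowed value sets."""
    return {
        k: v
        for k, v in cells.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


def roll_up(cells, keep):
    """Roll up: aggregate away every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for k, v in cells.items():
        totals[tuple(k[i] for i in idx)] += v
    return dict(totals)


by_year = roll_up(cube, ["year"])                    # coarse view
by_year_region = roll_up(cube, ["year", "region"])   # drill down one level
east_only = slice_cube(cube, "region", "East")
tea_2023 = dice_cube(cube, year={2023}, product={"Tea"})
```

Drilling down here simply means rolling up to a finer set of dimensions, so `by_year_region` breaks each yearly total in `by_year` into its regional parts.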

Data Mart: a portion of the data separated out of the data warehouse for a specific application purpose or scope, also called departmental data or subject data (subject area). When implementing a data warehouse, it is often possible to start with one department's data mart and later combine several data marts into a complete data warehouse. Note that when different data marts are implemented, fields with the same meaning must be defined compatibly, so that building the data warehouse later does not cause major trouble.

Data Mining: see section Q5 below.

ETL: ETL stands for the initials of "Extract", "Transform", and "Load", although in everyday use it is often simply called data extraction. ETL is the core and soul of BI/DW (Business Intelligence/Data Warehouse): it integrates data and raises its value according to unified rules, and it is responsible for moving data from the data sources into the target data warehouse. It is an important step in implementing a data warehouse.
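The three ETL stages can be sketched as three small functions. This is a minimal illustration, not a production pipeline: the source rows, the `orders_clean` table, and the cleaning rules are assumptions made up for the example, and SQLite stands in for the target warehouse.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
raw_orders = [
    {"order_id": "1", "customer": " alice ", "amount": "19.90", "date": "2024-01-05"},
    {"order_id": "2", "customer": "BOB", "amount": "5.00", "date": "2024-01-05"},
    {"order_id": "3", "customer": "Carol", "amount": "n/a", "date": "2024-01-06"},
]


def extract(rows):
    """Extract: read rows from the source (here, an in-memory list)."""
    yield from rows


def transform(rows):
    """Transform: apply uniform rules (trim names, parse amounts, drop bad rows)."""
    for row in rows:
        try:
            amount_cents = round(float(row["amount"]) * 100)
        except ValueError:
            continue  # a real pipeline would log or quarantine the row
        yield (int(row["order_id"]), row["customer"].strip().title(),
               amount_cents, row["date"])


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders_clean ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, "
        "amount_cents INTEGER, order_date TEXT)"
    )
    conn.executemany("INSERT INTO orders_clean VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(raw_orders)))
loaded = conn.execute(
    "SELECT customer, amount_cents FROM orders_clean ORDER BY order_id"
).fetchall()
```

The third source row has an unparseable amount, so the transform step drops it and only two rows reach the target table.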

Metadata: data that describes the structure of the data in the data warehouse and how it was built. Depending on its purpose it falls into two categories: technical metadata and business metadata. Technical metadata is what data warehouse designers and administrators use to develop and manage the warehouse day to day. It includes: data source information; descriptions of data transformations; definitions of the objects and data structures in the warehouse; rules for data cleansing and updating; mappings from source data to target data; user access rights; and the history of data backups, data imports, and information publication.

Business metadata describes the data in the data warehouse from a business perspective. It includes descriptions of the business subjects and of the data, queries, and reports each one contains.

Metadata provides an information directory for access to the data warehouse. The directory comprehensively describes what data the warehouse holds, how the data was obtained, and how to access it. It is the center of data warehouse operation and maintenance: the data warehouse server uses it to store and update data, and users rely on it to understand and access data.

Q1: What is a data warehouse?

A data warehouse is a subject-oriented, integrated, relatively stable (nonvolatile) collection of data that reflects historical change (time-variant) and is used to support management decisions. The concept can be understood at two levels: ① the data warehouse supports decision-making and is oriented toward analytical data processing, which distinguishes it from the enterprise's existing operational databases; ② the data warehouse is an effective integration of multiple heterogeneous data sources, reorganized by subject after integration, and it includes historical data; data stored in the warehouse is generally not modified afterwards. An enterprise data warehouse is built on the basis of the existing business systems and the large volume of business data they have accumulated. The data warehouse is not a static concept: information only has meaning when it is delivered in time to the users who need it, so that they can make decisions that improve their business operations. The basic task of data warehouse construction is to organize, summarize, and restructure information and provide it promptly to the relevant management decision-makers.

Q2: Why build a data warehouse?

Enterprises build data warehouses because existing forms of data storage can no longer meet the needs of information analysis. One of the core ideas of data warehouse theory is that transactional data and decision-support data have different processing-performance characteristics.

Enterprises collect data in the course of their transactional operations. As the business runs, orders, sales records, and other transactional data are generated continuously. To take this data in, the transactional database must be optimized for it.

When working with decision-support data, questions like these come up: Which types of products do customers buy? How much do sales change after a promotion? How much do sales change after a price change or after a store moves? Within a given period, which products sell especially well relative to others? Which customers have increased their purchases? Which customers have cut back?

Transactional databases can answer these questions, but the answers are often unsatisfactory. The two workloads compete for limited computing resources: the transactional database needs spare capacity when new information is being added, and when it is busy answering a series of detailed analytical questions, its efficiency in processing new data drops sharply. Another problem is that transactional data is constantly changing, whereas decision-support processing needs relatively stable data so that questions can be answered consistently over time.

The data warehouse solution is to separate decision-support data processing from transactional data processing. Data is imported from the transactional database into the decision-support database, that is, the data warehouse, on a fixed schedule (usually nightly or weekly). The data warehouse organizes data by "subject" according to the questions the enterprise needs to answer, which is the most effective way to organize it.

In addition, an enterprise's day-to-day information systems often consist of many legacy systems. Incompatible data sources, databases, and applications together form a complex collection of data whose parts cannot communicate with one another. Yet the application systems in use today are systems that users spent a great deal of effort and money to build, and they cannot simply be replaced, especially their data. The purpose of building a data warehouse is to organize and unify data from these different sources so as to achieve data consistency and integration and to provide a comprehensive, single point of access. This reminds me of SOA: the former is integration and optimization at the data level, and the latter is integration and optimization at the application-service level.

Q3: What is the general structure of a data warehouse?

1. Architecture:

(1) Data sources are the foundation of the data warehouse system and the origin of all its data; they usually include both internal and external information about the enterprise.

(2) Data storage and management is the core of the whole data warehouse system. In terms of data coverage, data warehouses can be divided into enterprise-level data warehouses and departmental data warehouses (usually called data marts).

(3) The OLAP (On-Line Analytical Processing) server effectively integrates the data needed for analysis and organizes it in a multidimensional model, so that it can be analyzed from multiple angles and at multiple levels and trends can be discovered.

(4) Front-end tools include various reporting tools, query tools, data analysis tools, data mining tools, and application development tools built on the data warehouse or data marts.

2. Fact tables and dimension tables

Fact tables and dimension tables are the two basic concepts in multidimensional models.

A fact table holds the main data items for analysis, usually a business transaction or event within the enterprise. The facts in a fact table are generally numeric and additive. Fact tables can store data at different granularities; data of different granularity on the same subject is generally stored in separate fact tables.

Dimension tables usually consist of descriptive text that serves as the query conditions on the fact table. Dimension attributes should be described in detail, and the levels of a dimension hierarchy can be used as constraints in analytical queries; this is one way in which data model design differs between a data warehouse and an operational application. The number of levels in a dimension hierarchy depends on the granularity of the queries. In real business environments a multidimensional data model typically has 4 to 15 dimensions; models with more or fewer are rare. In practice, the designer must determine the dimensions according to the actual situation of the enterprise.

In a multidimensional model, the primary key of the fact table is a composite key and the primary key of each dimension table is a simple key; the components of the fact table's key that correspond to dimension-table primary keys are foreign keys. The fact table is linked to each dimension table through the corresponding foreign key value, and this is the relationship used to join the fact table to the dimension tables in a query.
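The key relationships described above can be made concrete with a small sketch: a fact table whose primary key is composed of foreign keys pointing at two dimension tables, each of which has a simple primary key. The table names, columns, and rows are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_id    INTEGER NOT NULL REFERENCES dim_date(date_id),
    product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
    quantity   INTEGER NOT NULL,            -- additive fact
    revenue    REAL NOT NULL,               -- additive fact
    PRIMARY KEY (date_id, product_id)       -- composite key built from foreign keys
);
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO dim_product VALUES (1, 'Green tea', 'Tea'), (2, 'Espresso', 'Coffee');
INSERT INTO fact_sales VALUES
    (20240101, 1, 10, 50.0),
    (20240101, 2, 4, 12.0),
    (20240201, 1, 6, 30.0);
""")

# Dimension attributes act as the query conditions; facts are summed.
rows = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    WHERE d.year = 2024
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

# A fact row that points at a missing dimension row is rejected.
try:
    conn.execute("INSERT INTO fact_sales VALUES (20240301, 1, 1, 5.0)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```

The query filters on a dimension attribute (`year`) and groups by another (`category`), while the numbers being summed come only from the fact table.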

3. Data organization structure:

Star model

Multidimensional data modeling organizes data in an intuitive way and supports high-performance data access. Each multidimensional data model is represented by several multidimensional schemas, and each schema consists of one fact table and a set of dimension tables. The most common form of multidimensional model is the star schema, in which the fact table sits at the center and the dimension tables are arranged around it, each connected to the fact table.

The entity at the center of the star is the indicator (fact) entity. It is the basic entity users care about most and the center of query activity, and it supplies the quantitative data for data warehouse queries. Each indicator entity represents a series of related facts and serves a specified function. The entities at the points of the star are dimension entities, which constrain the user's query and filter the data so that fewer rows are returned from the indicator entity, narrowing the range of access. Each dimension table has its own attributes, and dimension tables are associated with the fact table by key.

Snowflake model

The snowflake model is an extension of the star model in which each dimension can connect outward to several detail category tables. In this schema, a dimension table not only serves the same function as in the star model but also connects to detail category tables that describe the fact table in more detail. By describing the fact table in detail along the relevant dimensions, the detail category tables reduce the size of the fact table and improve query efficiency.
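As a rough illustration of the snowflake layout, the sketch below moves the category attribute of a product dimension into its own detail table, so a query reaches it through one extra join. All table and column names are hypothetical, and SQLite is used only for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Detail category table hanging off the product dimension.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    revenue    REAL
);
INSERT INTO dim_category VALUES (1, 'Tea'), (2, 'Coffee');
INSERT INTO dim_product VALUES
    (10, 'Green tea', 1), (11, 'Black tea', 1), (20, 'Espresso', 2);
INSERT INTO fact_sales VALUES (10, 50.0), (11, 20.0), (20, 12.0), (10, 30.0);
""")

# Fact -> dimension -> detail category table: one more hop than a star schema.
revenue_by_category = dict(conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category_name
""").fetchall())
```

The trade-off in this sketch is that category names are stored once in `dim_category` instead of being repeated on every product row, at the cost of an additional join.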

Q4: How to design and build a data warehouse?

Nine steps to designing a data warehouse

1) Select the appropriate subject (the problem area to be addressed)

2) Understand the definition of the fact table

3) Identify and confirm the dimensions

4) Choose the facts

5) Calculate and store derived data fields in the fact table

6) Round out the dimension tables

7) Choose the duration of the database

8) Track slowly changing dimensions where needed

9) Determine the query priorities and query modes

On the technical side:

Hardware platform: the disk capacity of a data warehouse is typically 2-3 times that of the operational database. Mainframes generally offer more reliable performance and stability and integrate easily with legacy systems, whereas PC servers and UNIX servers are more flexible, easier to operate, and able to serve dynamically generated query requests. Things to consider when choosing a hardware platform: Does it provide parallel I/O throughput? How well does it support multiple CPUs?

Data warehouse DBMS: consider its ability to store large volumes of data, its query performance, and its support for parallel processing.

Network structure: determine which network segments will carry heavy data traffic once the data warehouse is in place, and whether the network structure needs to be improved.

On the implementation side:

Steps to build a data warehouse

1) Collect and analyze business requirements

2) Establish the data model and the physical design of the data warehouse

3) Define the data sources

4) Select the data warehouse technology and platform

5) Extract, transform, and load data from the operational databases into the data warehouse

6) Select access and reporting tools

7) Select database connectivity software

8) Select data analysis and data presentation software

9) Update the data warehouse

Data extraction, cleansing, transformation, and migration

1) The data conversion tool must be able to read data from a variety of different data sources.

2) It should support flat files, indexed files, and legacy DBMSs.

3) It should be able to take data from different types of data sources as input and integrate it.

4) It should have a standardized data access interface.

5) It should preferably be able to read data from a data dictionary.

6) The code the tool generates must be maintainable in the development environment.

7) It should be able to extract only the data that meets specified conditions, and only specified parts of the source data.

8) It should be able to perform data type and character set conversions during extraction.

9) It should be able to compute derived fields during extraction.

10) It should allow the data warehouse management system to invoke it automatically on a schedule to extract data, or to write the results to flat files.

11) The viability and product support capabilities of the software vendor must be assessed carefully.
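Several of the requirements above (conditional extraction, extracting only part of the source, type conversion, and derived fields) can be illustrated with a short sketch. The flat-file layout, the column names, and the `extract_orders` function are assumptions invented for this example; they do not describe any particular ETL tool.

```python
import csv
import io
from datetime import date

# Hypothetical flat-file export from a legacy system.
FLAT_FILE = """order_id,order_date,unit_price,quantity,status
1,2024-01-05,2.50,4,SHIPPED
2,2024-01-06,10.00,1,CANCELLED
3,2024-02-01,1.25,8,SHIPPED
"""


def extract_orders(text, since, columns=("order_id", "order_date", "total")):
    """Yield shipped orders on or after `since`, restricted to `columns`."""
    for row in csv.DictReader(io.StringIO(text)):
        order_date = date.fromisoformat(row["order_date"])    # type conversion
        if row["status"] != "SHIPPED" or order_date < since:  # selection condition
            continue
        record = {
            "order_id": int(row["order_id"]),
            "order_date": order_date,
            # derived field computed during extraction
            "total": float(row["unit_price"]) * int(row["quantity"]),
        }
        yield {name: record[name] for name in columns}        # keep chosen fields only


january_onward = list(extract_orders(FLAT_FILE, since=date(2024, 1, 1)))
february_totals = list(extract_orders(FLAT_FILE, since=date(2024, 2, 1),
                                      columns=("order_id", "total")))
```

Calling the function with a later `since` date and a shorter `columns` tuple shows how the same extraction step can be narrowed both by condition and by field.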

Q5: What is data mining?

Data mining is the process of extracting hidden, previously unknown, yet potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random real-world application data.

Data mining is an interdisciplinary field. It raises the use of data from simple low-level queries to mining knowledge from data in order to support decision-making. Driven by this demand, researchers and engineers from different fields, especially database technology, artificial intelligence, mathematical statistics, visualization, and parallel computing, have come together in the new research area of data mining and made it a technology hot spot.

Q6: How to do data mining?

1. Identify the business object

Clearly defining the business problem and recognizing the purpose of the mining is an important step in data mining. The final result of mining cannot be predicted, but the problem to be explored should be foreseeable; mining data for its own sake is blind and will not succeed.

2. Data preparation

1) Data selection

Search for all internal and external data related to the business object, and select from it the data suitable for the data mining application.

2) Data preprocessing

Study the quality of the data to prepare for further analysis, and determine the type of mining operation to be performed.

3) Data transformation

Transform the data into an analytical model built for the chosen mining algorithm. Building an analytical model that truly suits the mining algorithm is the key to successful data mining.

3. Data mining

Mine the transformed data. Apart from choosing a suitable mining algorithm, the rest of the work can be completed automatically.

4. Results analysis

Interpret and evaluate the results. The analysis method used generally depends on the mining operation, and visualization techniques are usually applied.

5. Assimilation of knowledge

Integrate the knowledge gained from the analysis into the organizational structure of the business information system.
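The preparation, mining, and analysis steps above can be walked through on a toy example. The sketch below assumes the business question is "which items are bought together" and uses simple pair counting with support and confidence thresholds as the mining algorithm; the purchase records, thresholds, and the `rules` function are hypothetical choices for illustration only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw purchase records; None marks a missing value.
raw_baskets = [
    ["bread", "milk", None],
    ["bread", "butter", "milk"],
    ["milk", "Milk ", "tea"],
    ["bread", "butter"],
    [],
]

# Data preparation: select usable records, clean items, convert to sets.
baskets = []
for basket in raw_baskets:
    items = {item.strip().lower() for item in basket if item}
    if items:
        baskets.append(items)

# Data mining: count how often each item and each item pair occurs.
item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)


def rules(min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) tuples."""
    found = []
    n = len(baskets)
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                found.append((x, y, support, round(confidence, 2)))
    return sorted(found)


# Results analysis: inspect the rules that pass both thresholds.
mined_rules = rules()
```

In this toy data every basket containing butter also contains bread, so the rule from butter to bread comes out with confidence 1.0; whether such a rule is useful is a judgment for the results-analysis step.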

Q7: What is the relationship between data warehousing and data mining?

Data warehousing and data mining are both important parts of a data warehouse system. The two are connected but also distinct.

The connections are:

(1) The data warehouse provides a better and broader data source for data mining.

(2) The data warehouse provides a new support platform for data mining.

(3) The data warehouse makes it easier to use data mining tools well.

(4) Data mining provides better decision support for the data warehouse.

(5) Data mining places higher demands on how the data warehouse organizes its data.

(6) Data mining also provides broad technical support for the data warehouse.

The differences are:

(1) The data warehouse is a data storage and data organization technology; it provides the data source.

(2) Data mining is a data analysis technology; it can analyze the data in the data warehouse.

Q8: The application and practical significance of data warehouse and data mining in some commercial fields

1) Merchandise sales. The commercial sector may value data as a competitive asset more than any other sector, and for that its large marketing databases need to evolve into data mining systems. Kraft General Foods (KGF) is one company that has applied a marketing database: it has collected a list of 30 million users of its products, obtained through various promotional programs. KGF regularly sends these users coupons for its branded products along with descriptions of the performance and use of new products. The company understands that the more users know about its products, the more opportunities they have to buy and use them, and the better the company's business will be.
2) Manufacturing. Many companies use decision support systems not only to support marketing campaigns; as market competition intensifies, they have also begun using them to monitor the manufacturing process. Some manufacturers say they have instructed their offices to cut manufacturing costs by 25% a year over a three-year period. Naturally, manufacturers often press their parts suppliers as well, and the suppliers must in turn follow the manufacturer's cost-reduction strategy. To meet challenges from all sides, such a manufacturer maintains a "cost" decision support system that monitors the cost of the components supplied by each vendor against the price targets that have been set. An application like this needs to collect each vendor's product cost information continuously over a full year in order to judge whether the vendor is meeting the strategic goal of reducing prices.
3) Financial services/credit cards. General Motors has issued a credit card, the GM Card, and the company's database already holds 12 million cardholders. By observing them, the company can learn what kind of car they drive, what kind of car they plan to buy next, and what kind of vehicles they like. For example, if a cardholder indicates an interest in trucks, the company can send a message to the truck division with that customer's information.
4) Telecommunications. Many large telecommunications companies have suddenly found themselves under heavy competitive pressure that did not exist a few years ago. In the past they had no need to watch market movements closely because customers had few choices, but recently the situation has changed a great deal. Companies are now actively collecting large amounts of customer information in order to offer new services to existing customers and to open new lines of business that expand their market. These new services are expected to bring the companies greater returns in the short term.
