Analysis of Data Mining Technology

Source: Internet
Author: User
Tags ibm db2

1. Keywords

DM (data mining), DW (data warehouse), OLAP (online analytical processing), BI (business intelligence)

2. Databases have become the foundation of systems that collect and distribute information. The purpose of collecting data is to make correct decisions based on the database content. Hidden deep within these massive volumes of data are many business patterns and rules. This hidden "business knowledge" is of great significance to the data owners, because it can help them predict future business strategies, market development plans, and new sources of profit for the entire organization. Mining the "knowledge" behind large amounts of seemingly unrelated data requires special statistical or measurement methods.

3. What is data mining?

The following are some definitions of data mining from earlier authors:

1. "Mining" the hidden patterns, trends, and relationships in data (Groth)

2. Discovering useful patterns and rules in massive data through automatic or semi-automatic methods (Berry & Linoff)

3. Analyzing data sets (usually massive) to discover stable relationships in the data, and summarizing the data in an easy-to-understand manner that gives the data owners valuable decision support (Hand, Mannila & Smyth)

4. Analyzing data with computationally feasible techniques and little or no human intervention (Wegman)

5. Extracting valid, useful, previously undiscovered information from large databases, and then using this information to support key business decisions (Cabena et al.)

4. What do we need to do before doing data mining?

1. A large amount of prepared data (cases, that is, data mining cases), generally 10^8 to 10^12 bytes, that is, hundreds of MB up to a TB

- 10^3 = 1 K

- 10^6 = 1 M

- 10^9 = 1 G

- 10^12 = 1 T

2. Multi-dimensional data (data mining variables)

Generally, each record should have 10 to 10^4 attributes, so that the same data can be viewed from several different perspectives.

5. Why data mining?

1. Currently, only a small portion of the data (usually 5% to 10%) is ever analyzed and used. The rest is rarely looked at after it has been inserted into the database.

2. Some data may never have been analyzed, but the data managers worry that it might be needed in the future, so they keep collecting it just in case. As a result, the database only grows larger, and searching it for useful information becomes less and less efficient.

6. Huber-Wegman dataset size classification

Description        Data size (bytes)    Storage mode
Extremely small    10^2                 Paper
Small              10^4                 Stacked paper
Medium             10^6                 One floppy disk
Large              10^8                 Hard disk
Very large         10^10                Multiple hard disks
Extremely large    10^12                Tape
Massive            10^15                Distributed storage
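The size classes listed above can be turned into a small lookup. The following is only a sketch: the function name classify_dataset_size is hypothetical, and the rule "pick the largest class whose reference size does not exceed the dataset size" is an assumption made for illustration, since the classification itself only gives one reference size per class.

```python
# Reference sizes (in bytes) and labels as listed in the classification above.
SIZE_CLASSES = [
    (10**2, "Extremely small"),
    (10**4, "Small"),
    (10**6, "Medium"),
    (10**8, "Large"),
    (10**10, "Very large"),
    (10**12, "Extremely large"),
    (10**15, "Massive"),
]


def classify_dataset_size(size_bytes):
    """Return the label of the largest class whose reference size fits.

    Assumption: sizes below the smallest reference size still count as
    "Extremely small".
    """
    label = SIZE_CLASSES[0][1]
    for reference_size, name in SIZE_CLASSES:
        if size_bytes >= reference_size:
            label = name
    return label


if __name__ == "__main__":
    for size in (50, 10**6, 5 * 10**8, 10**15):
        print(size, classify_dataset_size(size))
```

Under this rule, the 10^8 to 10^12 byte range mentioned in section 4 spans the "Large" through "Extremely large" classes.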

 

7. Status quo

A) In recent decades, many companies have spent considerable resources building and maintaining information databases, including large-scale data warehouses.

B) Often, the existing data cannot be analyzed with conventional methods, either because many records are missing or incomplete, or because the data is qualitative rather than quantitative.

C) In most cases, the information in these databases is not properly evaluated or used because it is not easy to access and analyze.

D) Some databases grow so fast that even the system administrator often does not know which information in the system is relevant to the problem at hand, or how the data in the system relates to it.

E) An organization that has a way to "mine" important information and business patterns from these large databases stands to benefit directly.

8. Why has data mining become so popular recently?

A) The main reason is the growing sophistication of computer technology, especially database management.

B) Because the data in databases grows rapidly, searching for information manually is very difficult. Data mining is useful for discovering and describing hidden patterns in relational tables, and its algorithms allow patterns to be found automatically.

9. KDD (knowledge discovery in databases)

A) "Knowledge discovery" is a term that originated in the field of AI (artificial intelligence).

B) KDD consists of the following steps (including "data mining" itself, of course):

i. "Data cleansing" (remove noisy and inconsistent data)

ii. "Data integration" (combine data from multiple data sources)

iii. "Data selection" (retrieve from the database the data relevant to the topic being analyzed)

iv. "Data transformation" (organize and convert the data, for example through summarization and aggregation, into a form that mining algorithms can use easily)

v. "Data mining" (the core step, which uses intelligent methods to extract implicit patterns and rules)

vi. "Pattern evaluation" (verify and evaluate the newly discovered "knowledge" to check whether each pattern is sound)

vii. "Knowledge representation" (present the mined patterns to users in visual form)
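The seven steps above can be sketched as a chain of small functions. This is a toy illustration only, not a real KDD system: every function name, the record layout, and the sample data are hypothetical, and the "mining" step is a deliberately trivial stand-in (flagging customers whose total spending is above the mean) for a real mining algorithm.

```python
from collections import defaultdict


def cleanse(records):
    """Data cleansing: drop records with a missing or negative amount."""
    return [r for r in records
            if r.get("amount") is not None and r["amount"] >= 0]


def integrate(*sources):
    """Data integration: combine records from several sources."""
    return [r for source in sources for r in source]


def select(records, category):
    """Data selection: keep only records relevant to the topic."""
    return [r for r in records if r["category"] == category]


def transform(records):
    """Data transformation: aggregate amounts per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


def mine(totals):
    """Data mining (toy stand-in): customers who spend above the mean."""
    if not totals:
        return {}
    mean = sum(totals.values()) / len(totals)
    return {c: t for c, t in totals.items() if t > mean}


def evaluate(patterns, min_total):
    """Pattern evaluation: keep only patterns that pass a threshold."""
    return {c: t for c, t in patterns.items() if t >= min_total}


def present(patterns):
    """Knowledge representation: render patterns as readable lines."""
    return [f"{c}: high spender (total {t:.2f})"
            for c, t in sorted(patterns.items())]


# Hypothetical sample data from two sources.
store = [
    {"customer": "ann", "category": "food", "amount": 30.0},
    {"customer": "bob", "category": "food", "amount": None},  # noisy record
    {"customer": "bob", "category": "food", "amount": 5.0},
]
web = [
    {"customer": "ann", "category": "food", "amount": 20.0},
    {"customer": "cat", "category": "toys", "amount": 99.0},  # off-topic
    {"customer": "dan", "category": "food", "amount": 10.0},
]

# Each source is cleansed first, then the sources are integrated.
data = integrate(cleanse(store), cleanse(web))
totals = transform(select(data, "food"))
report = present(evaluate(mine(totals), min_total=40.0))
print(report)
```

Keeping each step as its own function mirrors the list: any single stage, most obviously mine, can be swapped for a real algorithm without touching the others.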

10. Databases used for data mining

The following kinds of databases can be used for data mining:

- Relational databases

- Data warehouses

- Transaction processing databases

- Object-relational databases

- Object-oriented databases

 

11. Data warehouse

A data warehouse is a collection of information gathered from multiple data sources. The collection changes over time, but the information itself is relatively stable.

The data warehouse isolates reporting data from the operational database system. Because query work moves to a system better suited to it, this isolation improves the performance of the operational system. It also improves security, since sensitive information stays in the operational database and is not exposed to queries. The abstraction layer that the data warehouse provides simplifies access to the statistical tables generated by decision support applications.

Data from the OLTP system is loaded into the data warehouse at regular intervals. The structure and security of the data warehouse are simpler than those of the OLTP system, because the main purpose of a data warehouse is efficient analysis and querying rather than online transaction processing.

The basic unit of a conventional OLTP database is a two-dimensional table consisting of rows and columns. The basic unit of a data warehouse is a multidimensional cube, a data entity that can be observed and analyzed from multiple perspectives and that serves as an integrated information repository built from existing data sources. These units are usually related to one another through a star schema or a snowflake schema.
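To make the idea of a cube built on a star schema more concrete, here is a minimal sketch in plain Python. The table contents, the dimension names, and the roll_up helper are all hypothetical; a real warehouse would do this in the database engine, not in application code.

```python
from collections import defaultdict

# Dimension tables (hypothetical example data).
products = {
    1: {"name": "pen", "category": "stationery"},
    2: {"name": "ink", "category": "stationery"},
    3: {"name": "tea", "category": "grocery"},
}
stores = {10: {"city": "Austin"}, 20: {"city": "Boston"}}

# Fact table: one row per sale, pointing at the dimension tables by key.
# This hub-and-spoke layout is what a star schema looks like.
sales = [
    {"product_id": 1, "store_id": 10, "quarter": "Q1", "amount": 5},
    {"product_id": 2, "store_id": 10, "quarter": "Q1", "amount": 7},
    {"product_id": 3, "store_id": 20, "quarter": "Q1", "amount": 4},
    {"product_id": 1, "store_id": 20, "quarter": "Q2", "amount": 6},
]

# How to read each dimension attribute off a fact row.
DIMENSIONS = {
    "category": lambda fact: products[fact["product_id"]]["category"],
    "city": lambda fact: stores[fact["store_id"]]["city"],
    "quarter": lambda fact: fact["quarter"],
}


def roll_up(facts, *dimensions):
    """Sum the amount measure over the chosen dimension attributes."""
    cube = defaultdict(int)
    for fact in facts:
        key = tuple(DIMENSIONS[d](fact) for d in dimensions)
        cube[key] += fact["amount"]
    return dict(cube)


# The same facts viewed from two different perspectives.
by_category = roll_up(sales, "category")
by_city_quarter = roll_up(sales, "city", "quarter")
print(by_category)
print(by_city_quarter)
```

The same fact rows answer both questions; only the set of dimensions passed to roll_up changes, which is the sense in which a cube can be viewed from multiple perspectives.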

12. OLAP (online analytical processing)

A) OLAP is one kind of decision support tool (DST).

B) It uses traditional queries and reports to describe the information currently in the database.

C) OLAP is mainly used to show why a business pattern holds, that is, to verify a piece of "knowledge". Data mining does the opposite: it discovers new "knowledge".

D) It verifies or refutes a series of "assumptions" and "associations" by querying the database.

E) OLAP technology analyzes, queries, and reports on data along multiple dimensions, which sets it apart from traditional OLTP applications. OLTP applications mainly handle user transactions, as in an airline ticket booking system or a bank savings system; they usually involve many update operations and strict response-time requirements. OLAP applications mainly analyze a user's current and historical data to assist management in decision-making; typical examples are analyzing and predicting bank credit card risk and formulating a company's marketing strategy. They mainly involve many query operations, with less stringent response-time requirements.

F) OLAP is essentially a deductive process.

13. Comparison between OLAP and data mining

A) OLAP is mainly used to verify a pattern.

B) Data mining mainly "discovers" a pattern from the data.

C) Data mining is essentially an inductive process.

14. Data mining is an interdisciplinary technology

It draws mainly on the following fields:

A) Computer science

i. Database technology

ii. Machine learning

B) Information technology

C) Statistics

D) Visualization

E) Pattern recognition

15. Commercial applications of data mining

A) CRM (customer relationship management)

B) Customer behavior analysis

C) Market basket analysis

D) Retailing

E) Market segmentation

F) Credit scoring

G) Fraud detection

H) Taxpayer noncompliance

I) Churn prediction

J) E-business

K) Web mining

16. Other applications of data mining

A) Stock market trend analysis

B) Text and multimedia analysis

C) Sports scouting

D) Medical outcomes analysis

E) Scientific research

F) Web surfing behavior analysis

17. Data mining tasks

A) Predictive modeling (for example, "prediction")

B) Descriptive modeling (for example, "classification analysis")

C) Patterns and association rules
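As a rough illustration of item C) above, the rules in question are commonly called association rules and are usually scored by support and confidence. The sketch below handles only rules between pairs of items; the function name, the thresholds, and the basket data are hypothetical, and it is not a full algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations


def association_rules(baskets, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for item pairs.

    support    = share of baskets that contain both items
    confidence = share of baskets with the antecedent that also
                 contain the consequent
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
]
rules = association_rules(baskets, min_support=0.5, min_confidence=0.6)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

This is the kind of computation behind market basket analysis, one of the commercial applications listed in section 15.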

18. Predictive model

A) The model is built on a "training dataset" made up of "cases".

B) The model is then run against a "test dataset" to verify that it is usable and works well.

C) Each "case" has two parts:

i. "Input variables" (input data, "independent" variables)

ii. A "target variable" (also called the "response" or "output")

D) Types of analysis, depending on the "output variable":

i. Supervised classification

ii. Linear regression

iii. Survival analysis (analysis of the time until an event)

E) Examples of continuous outputs:

i. Health care outcomes (medical expenses)

ii. Liquidity management (the amount of cash remaining in an ATM or in a branch vault)

iii. Merchandise returns (the time between the purchase and the return of goods)
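The train-then-verify workflow described in this section can be sketched with the simplest possible model, a one-variable least-squares line. The data values, the function names, and the fixed split (first four cases for training, last two for checking) are all assumptions made for illustration; in practice the split is usually random and the model far richer.

```python
def fit_line(cases):
    """Least-squares fit of target = slope * x + intercept."""
    n = len(cases)
    mean_x = sum(x for x, _ in cases) / n
    mean_y = sum(y for _, y in cases) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in cases)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in cases)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, cases):
    """Average absolute gap between predicted and actual targets."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in cases) / len(cases)


# Each case is (input variable, target variable); the values are made up.
cases = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8), (5, 11.1), (6, 12.9)]

# Build the model on one part of the data, then check it on the rest.
training, held_out = cases[:4], cases[4:]
model = fit_line(training)
error = mean_abs_error(model, held_out)
print("slope, intercept:", model)
print("mean absolute error on held-out cases:", error)
```

The point of the held-out cases is that the error is measured on data the model never saw while it was being fitted.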

19. Target marketing

A) Cases: customers and households

B) Inputs: geographic information, financial information

C) Target: the response to a solicitation

D) Action: target the customer segments most likely to respond in future campaigns

20. CRM

A) Cases: existing customers

B) Inputs: purchase history, goods and service usage records, and demographic data

C) Target: brand switching, cancellation, or defection

D) Action: increase customer loyalty

21. Credit scoring

A) Cases: past applicants

B) Inputs: information from the application and credit reports

C) Target: charge-off, serious delinquency, or credit revocation

D) Action: accept or reject future credit applicants

22. Difficulties encountered when processing data for data mining

A) Data errors

i. Incorrect values

ii. Irrelevant data

iii. Missing data

1. Use only the complete records for analysis

2. Fill in the missing values with reasonable values

iv. Incomplete data
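The two options listed above for records with missing values can be sketched as follows. The function names and sample records are hypothetical, and the mean is only one possible choice of "reasonable value"; a median or a model-based estimate would fit the same pattern.

```python
def complete_cases(records, fields):
    """Option 1: keep only records in which every field is present."""
    return [r for r in records
            if all(r.get(f) is not None for f in fields)]


def impute_mean(records, field):
    """Option 2: fill missing values of one field with the observed mean."""
    observed = [r[field] for r in records if r.get(field) is not None]
    if not observed:
        # Nothing to average; return unchanged copies.
        return [dict(r) for r in records]
    mean = sum(observed) / len(observed)
    return [{**r, field: mean if r.get(field) is None else r[field]}
            for r in records]


# Hypothetical records; None marks a missing value.
records = [
    {"age": 30, "income": 50.0},
    {"age": 40, "income": None},
    {"age": 50, "income": 70.0},
]

kept = complete_cases(records, ["age", "income"])
filled = impute_mean(records, "income")
print(len(kept), "complete records")
print(filled)
```

The trade-off is visible even in this tiny example: the first option discards a third of the records, while the second keeps them all at the cost of introducing an estimated value.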

23. Main software used for data mining

A) SAS Enterprise Miner

B) SPSS Clementine

C) IBM Intelligent Miner

D) Various other third-party packages

24. MS Analysis Services

A) MS Analysis Services includes both OLAP and data mining.

B) Analysis Services organizes the data in a data warehouse into multidimensional datasets that contain precomputed aggregates, so that complex analytical queries can be answered quickly. Analysis Services also lets you create data mining models from both multidimensional (OLAP) and relational data sources, and apply those models to both types of data.

Common OLAP software on the market

OLAP servers:

Epoch (version 4.0.1 or higher)

Microsoft Analysis Services SQL 2000 (Service Pack 1 or higher, Service Pack 3 recommended)

IBM DB2 OLAP Server cshowcase AS/400 OLAP Server (version 3.5 or higher)

Cognos PowerPlay (version 7.3 or higher)
