Introduction to Data Mining Technology


1. Introduction

Data mining extracts information and knowledge that is hidden in large amounts of incomplete, noisy, fuzzy, and random data, that people do not know beforehand, but that is potentially useful. With the rapid development of information technology, the amount of data people accumulate is growing quickly, and extracting useful knowledge from terabyte-scale data has become imperative. Data mining is a data processing technology developed to meet this need, and it is a key step in Knowledge Discovery in Databases (KDD).

2. Data Mining Tasks

Data mining tasks include association analysis, cluster analysis, classification, prediction, time-series pattern analysis, and deviation analysis.

(1) Association Analysis

Association rule mining was first proposed by Rakesh Agrawal and others. An association is a regularity between the values of two or more variables. Data associations are important, discoverable knowledge in databases. Associations are classified into simple associations, time-series associations, and causal associations. The purpose of association analysis is to find the hidden association networks in a database. The strength of an association rule is generally measured by two thresholds, support and confidence; parameters such as interestingness and correlation are continually being introduced to make the mined rules better match requirements.
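As a toy illustration of these two measures, the following Python sketch (with invented transaction data) computes the support and confidence of a single candidate rule:

```python
transactions = [                    # invented market-basket data
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Support of the whole rule divided by support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

lhs, rhs = {"milk"}, {"bread"}      # candidate rule: milk -> bread
print(f"support    = {support(lhs | rhs):.2f}")     # 0.60
print(f"confidence = {confidence(lhs, rhs):.2f}")   # 0.75
```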

(2) Cluster Analysis (clustering)

Clustering divides data into several categories according to similarity: data within a category resemble one another, while data in different categories differ. Cluster analysis can establish macro-level concepts, discover the distribution patterns of data, and reveal possible relationships between data attributes.
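A minimal sketch of one common clustering technique, k-means, on invented two-dimensional points (an illustration of grouping by similarity, not a production implementation):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)          # initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign to nearest center
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):       # recompute centers as means
            if cl:
                centers[i] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return centers, clusters

points = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
centers, clusters = kmeans(points, k=2)
print(centers)
```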

(3) Classification

Classification finds a concept description that represents the overall information of a class of data, i.e., the class's intension, and uses that description to construct a model, generally expressed as rules or a decision tree. Classification derives classification rules from a training dataset by means of some algorithm, and it can be used both to describe rules and to predict.
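For illustration, the following sketch trains a small decision tree with scikit-learn (an assumed dependency; the toy features and labels are invented) and prints it as readable rules before classifying an unseen record:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: [age, income]; labels: 0 = "no", 1 = "yes".
X = [[25, 30], [35, 60], [45, 80], [20, 20], [50, 90], [30, 40]]
y = [0, 1, 1, 0, 1, 0]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(clf, feature_names=["age", "income"]))  # rule-style description
print(clf.predict([[40, 70]]))                            # classify unseen data
```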

(4) Prediction

Prediction uses historical data to find patterns of change, builds a model, and uses the model to forecast the types and characteristics of future data. Prediction cares about accuracy and uncertainty, which are usually measured by the prediction variance.
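A minimal sketch of this idea: fit a linear trend to invented historical data, forecast the next value, and report the residual variance as a rough measure of uncertainty:

```python
import statistics

history = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0), (5, 9.9)]   # (time, value)
xs = [t for t, _ in history]
ys = [v for _, v in history]
mx, my = statistics.mean(xs), statistics.mean(ys)

# Least-squares slope and intercept of the trend line.
slope = sum((x - mx) * (y - my) for x, y in history) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

residuals = [y - (slope * x + intercept) for x, y in history]
print(f"forecast for t=6: {slope * 6 + intercept:.2f}")
print(f"prediction variance: {statistics.pvariance(residuals):.4f}")
```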

(5) Time-Series Patterns

A time-series pattern is a pattern, found by searching time-series data, that recurs with high probability. Like regression, it uses known data to predict future values; the difference is that the variable distinguishing these data points is time.
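One simple way to surface recurring patterns, sketched on an invented event series, is to count how often each fixed-length window repeats:

```python
from collections import Counter

events = ["a", "b", "c", "a", "b", "c", "a", "b", "d", "a", "b", "c"]
window = 3
windows = [tuple(events[i:i + window]) for i in range(len(events) - window + 1)]
counts = Counter(windows)
for pattern, n in counts.most_common(2):    # most frequently repeated windows
    print(pattern, f"occurs in {n}/{len(windows)} windows")
```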

(6) Deviation Analysis

Deviations contain much useful knowledge. Data in a database often include anomalies, and finding them is very important. The basic method of deviation detection is to measure the difference between the observed results and a reference.
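A minimal sketch of a deviation test on invented data: flag observations that differ from a reference (here, the sample mean) by more than two standard deviations:

```python
import statistics

data = [10.1, 9.8, 10.3, 9.9, 10.0, 17.5, 10.2]   # invented measurements
mean = statistics.mean(data)                       # the reference
stdev = statistics.stdev(data)
outliers = [x for x in data if abs(x - mean) > 2 * stdev]
print(outliers)                                    # -> [17.5]
```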

3. Data Mining Objects

Classified by information storage format, the objects used for mining include relational databases, object-oriented databases, data warehouses, text data sources, multimedia databases, spatial databases, temporal databases, heterogeneous databases, and the Internet.

4. Data Mining Process

(1) Problem definition: clearly define the business problem and determine the purpose of data mining.

(2) Data preparation: this includes data selection (extracting the target dataset for mining from a large database or data warehouse) and data preprocessing (re-processing the data, which includes checking data integrity and consistency, removing noise, filling in missing fields, and deleting invalid data); a small preprocessing sketch follows this list.

(3) Data mining: select algorithms according to the kind of mining function and the characteristics of the data, and mine the cleaned and transformed dataset.

(4) Result analysis: interpret and evaluate the mining results and convert them into knowledge that users can understand.

(5) Knowledge application: integrate the knowledge obtained from analysis into the organizational structure of the business information system.
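As a toy illustration of the data preparation step above, the following sketch uses pandas (an assumed dependency; the column names and values are invented) to remove duplicates, delete invalid records, and fill missing fields:

```python
import pandas as pd

raw = pd.DataFrame({
    "age":    [25, None, 41, 25, 300],   # None = missing field, 300 = invalid
    "income": [30, 45, None, 30, 52],
})

clean = raw.drop_duplicates()                        # remove duplicate records
keep = clean["age"].between(0, 120) | clean["age"].isna()
clean = clean[keep]                                  # delete invalid data
clean = clean.fillna(clean.mean(numeric_only=True))  # fill missing fields with means
print(clean)
```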

5. Data Mining Methods

(1) Neural Network Method

Owing to their robustness, self-organizing adaptability, parallel processing, distributed storage, and high fault tolerance, neural networks have attracted increasing attention in recent years. Typical neural network models fall into three main categories: feed-forward networks, represented by the perceptron, the back-propagation (BP) model, and functional networks, used for classification, prediction, and pattern recognition; feedback networks, represented by Hopfield's discrete and continuous models, used for associative memory and optimization computation respectively; and self-organizing networks, represented by the ART and Kohonen models, used for clustering. The main disadvantage of the neural network method is its "black box" nature, which makes it hard for people to understand the network's learning and decision-making processes.
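As a loose illustration of the feed-forward family (not any particular model from the literature), the following sketch trains a single perceptron-style neuron to learn the logical AND function:

```python
import random

random.seed(0)
w = [random.uniform(-1, 1) for _ in range(2)]   # two input weights
b = 0.0                                         # bias
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]  # AND truth table

for _ in range(20):                             # training epochs
    for x, target in data:
        out = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
        err = target - out                      # the error drives the update
        w = [wi + 0.1 * err * xi for wi, xi in zip(w, x)]
        b += 0.1 * err

print([(x, 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0) for x, _ in data])
```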

(2) Genetic Algorithm

A genetic algorithm is a random search algorithm based on biological natural selection and genetic mechanisms; it is a bionic global optimization method. Genetic algorithms possess implicit parallelism and combine readily with other models, which suits them to data mining.
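A toy sketch of the basic genetic-algorithm loop, tournament selection, one-point crossover, and mutation, evolving bitstrings toward an all-ones optimum (an invented fitness function for illustration):

```python
import random

random.seed(1)
LENGTH, POP, GENS = 16, 30, 40

def fitness(bits):                 # invented objective: count of 1-bits
    return sum(bits)

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    def pick():                    # tournament selection of size 2
        a, b = random.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b
    nxt = []
    while len(nxt) < POP:
        p1, p2 = pick(), pick()
        cut = random.randrange(1, LENGTH)       # one-point crossover
        child = p1[:cut] + p2[cut:]
        if random.random() < 0.05:              # occasional mutation
            i = random.randrange(LENGTH)
            child[i] ^= 1                       # flip one bit
        nxt.append(child)
    pop = nxt

best = max(pop, key=fitness)
print(fitness(best), best)
```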

Sunil successfully developed a data mining tool based on genetic algorithms and used it to run data mining experiments on real databases of two aircraft crashes; the results showed that genetic algorithms are among the most effective methods for data mining [4]. Genetic algorithms are also applied in combination with neural networks, rough sets, and other techniques. For example, a genetic algorithm can optimize a neural network's structure, deleting redundant connections and hidden-layer units without increasing the error rate; or a genetic algorithm and the BP algorithm can jointly train a neural network, after which rules are extracted from the network. However, genetic algorithms are complex, and the problem of premature convergence to local optima has not yet been solved.

(3) Decision Tree Method

The decision tree is an algorithm commonly used in predictive models; it classifies large amounts of data to find valuable, potential information. Its main advantages are simple descriptions and fast classification, making it especially suitable for large-scale data processing. The most influential and earliest decision tree method is Quinlan's well-known ID3 algorithm, based on information entropy. Its main problems are: ID3 is a non-incremental learning algorithm; an ID3 decision tree is a univariate decision tree, which makes complex concepts difficult to express; relationships between attributes are not adequately emphasized; and noise resistance is poor. Many improved algorithms have emerged in response, such as the ID4 incremental learning algorithm designed by Schlimmer and Fisher, and improved algorithms proposed by Zhong Ming, Chen Wenwei, and others.
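The following sketch illustrates the information-entropy computation at the heart of ID3, choosing between two invented attributes by information gain:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Invented records: (outlook, windy) -> play
records = [("sunny", True, "no"), ("sunny", False, "no"),
           ("rain", True, "no"), ("rain", False, "yes"),
           ("cloudy", True, "yes"), ("cloudy", False, "yes")]
labels = [r[-1] for r in records]

def info_gain(attr):
    n = len(records)
    gain = entropy(labels)
    for value in set(r[attr] for r in records):
        subset = [r[-1] for r in records if r[attr] == value]
        gain -= len(subset) / n * entropy(subset)   # weighted child entropy
    return gain

print("gain(outlook) =", round(info_gain(0), 3))    # ID3 would split here
print("gain(windy)   =", round(info_gain(1), 3))
```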

(4) Rough Set Method

Rough set theory is a mathematical tool for studying imprecise and uncertain knowledge. The rough set method has several advantages: it requires no additional information; it simplifies the expression space of the input information; and its algorithms are simple and easy to implement. The objects of rough set processing are information tables similar to two-dimensional relational tables, so mature relational database management systems and the newly developed data warehouse management systems lay a solid foundation for rough-set data mining. However, the mathematical basis of rough sets is set theory, which makes it difficult to handle continuous attributes directly, and real information tables do contain continuous attributes; the discretization of continuous attributes therefore restricts the practical application of rough sets. Several rough-set-based tools and applications have been developed internationally, such as KDD-R, developed at the University of Regina in Canada, and LERS, developed at the University of Kansas in the United States.
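A minimal sketch of the core rough-set construction on invented data: the lower and upper approximations of a target set under the indiscernibility relation induced by an attribute:

```python
# object -> attribute value; objects with equal values are indiscernible
objects = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c"}
X = {1, 2, 3}                      # invented target concept

classes = {}                       # equivalence classes of indiscernibility
for obj, value in objects.items():
    classes.setdefault(value, set()).add(obj)

lower = set().union(*(c for c in classes.values() if c <= X))
upper = set().union(*(c for c in classes.values() if c & X))
print("lower approximation:", sorted(lower))   # certainly in X
print("upper approximation:", sorted(upper))   # possibly in X
```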

(5) Covering Positive Examples and Excluding Negative Examples

This method finds rules by covering all positive examples while rejecting all negative examples. First, a seed is chosen from the set of positive examples and compared with the negative examples one by one: any selector (attribute-value test) that is also compatible with a negative example is discarded, and the rest are retained. Cycling through all the positive-example seeds in this way yields the rules for the positive class (conjunctions of the retained selectors). Typical algorithms include Michalski's AQ11 method, Hong Jiarong's improved AQ15 method, and the AE5 method.
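The following is a loose, simplified sketch of the covering idea on invented data; it grows a rule from one seed by discarding any attribute test that a negative example also satisfies (real AQ-family algorithms are considerably more involved):

```python
positives = [{"color": "red", "size": "big"}, {"color": "red", "size": "small"}]
negatives = [{"color": "blue", "size": "big"}, {"color": "green", "size": "small"}]

seed = positives[0]                 # pick a seed from the positive examples
rule = {attr: val for attr, val in seed.items()
        if not any(neg[attr] == val for neg in negatives)}
print("rule:", rule)                # tests satisfied by no negative example
```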

(6) Statistical Analysis Method

Two kinds of relationships can hold between database fields: functional relationships (deterministic relationships that can be expressed by a functional formula) and correlation relationships (which cannot be expressed by a functional formula but are still determinate in a statistical sense). Both can be analyzed with statistical methods, that is, by using statistical principles to analyze the information in the database. Such analysis includes common statistics (finding the maximum, minimum, sum, and average of large amounts of data), regression analysis (using a regression equation to express the quantitative relationship between variables), correlation analysis (using the correlation coefficient to measure the degree of correlation between variables), difference analysis (determining from differences in sample statistics whether the population parameters differ), and so on.
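As a small example of correlation analysis, this sketch computes Pearson's correlation coefficient for two invented variables:

```python
import math

x = [1, 2, 3, 4, 5]                 # invented paired observations
y = [2, 4, 5, 4, 6]
mx, my = sum(x) / len(x), sum(y) / len(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
print(round(r, 3))                  # near +1 means strong positive correlation
```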

(7) Fuzzy Set Method

This method uses fuzzy set theory to perform fuzzy evaluation, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis on practical problems. The higher a system's complexity, the stronger its fuzziness. Fuzzy set theory generally describes fuzzy things by degrees of membership. Building on traditional fuzzy theory and probability statistics, Li Deyi and others proposed a model for converting between qualitative and quantitative uncertainty, the cloud model, and formed cloud theory.
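A toy sketch of membership degrees: a triangular membership function grading how strongly an invented temperature reading belongs to the fuzzy set "warm":

```python
def warm_membership(t, low=15.0, peak=22.0, high=30.0):
    """Degree in [0, 1]: 0 outside (low, high), 1 at the peak."""
    if t <= low or t >= high:
        return 0.0
    if t <= peak:
        return (t - low) / (peak - low)     # rising edge
    return (high - t) / (high - peak)       # falling edge

for temp in (10, 18, 22, 27, 33):
    print(temp, "->", round(warm_membership(temp), 2))
```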

6. Considerations for Evaluating Data Mining Software

More and more software vendors are joining the competition in the data mining field. Correctly evaluating a commercial package and selecting the appropriate software has become key to the success of a data mining application.

Evaluating data mining software mainly involves the following four aspects:

(1) Computing performance: whether the software runs on different commercial platforms; the software's architecture; whether it can connect to different data sources; whether performance grows linearly or exponentially when operating on large datasets; computing efficiency; whether the component structure is easy to extend; and running stability.

(2) Functionality: for example, whether the software provides enough algorithms; whether it shields the user from the details of the mining process; whether the algorithms it provides apply to various types of data; whether the user can adjust the algorithms and their parameters; whether the software can randomly extract data from a dataset to build a pre-mining model; and whether it can present mining results in different forms.

(3) Usability: whether the user interface is friendly; whether the software is easy to learn and use; which users the software targets (beginners, advanced users, or experts); whether error reports genuinely help the user debug; and the software's application scope (specialized for one professional field or applicable to many).

(4) Auxiliary functions: whether the user may correct erroneous values in the dataset or otherwise clean the data; whether values can be globally replaced; whether continuous data can be discretized; whether a subset can be extracted from a dataset according to user-defined rules; whether null values can be replaced with an appropriate mean or a user-specified value; and whether the results of one analysis can feed into another analysis.
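As a small illustration of two of these auxiliary operations, the following sketch uses pandas (an assumed dependency, with invented data) to replace nulls with the mean and discretize a continuous column:

```python
import pandas as pd

df = pd.DataFrame({"income": [30.0, None, 52.0, 45.0, 88.0]})
df["income"] = df["income"].fillna(df["income"].mean())   # null -> mean
df["bracket"] = pd.cut(df["income"], bins=3,              # discretization
                       labels=["low", "mid", "high"])
print(df)
```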

7. Conclusion

Data mining technology is a young and promising research field, and the powerful drive of commercial interest will continue to propel its development. New data mining methods and models appear every year, and people are exploring them ever more widely and deeply. Even so, data mining still faces many problems and challenges: the efficiency of data mining methods needs improvement, especially for mining ultra-large-scale datasets; mining methods that adapt to multiple data types and tolerate noise must be developed to solve problems such as mining heterogeneous datasets, mining dynamic data and knowledge, and mining in networked and distributed environments; in addition, multimedia databases have developed rapidly in recent years, so mining technologies and software for multimedia databases will become a research and development hotspot in the future.
