"Bi Thing": An Analysis of 13 Commonly Used Data Mining Techniques
I. Foreword
Data mining is the process of extracting hidden, previously unknown, and potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random data. The task of data mining is to discover patterns in a data set. Many kinds of patterns can be found; by function they fall into two broad categories: predictive patterns and descriptive patterns. In applications, patterns are often further subdivided by their practical function into classification, estimation, prediction, association analysis, sequence analysis, time-series analysis, description, and visualization.
Data mining involves many disciplines and techniques, and it can be classified in many ways.
By mining task, it can be divided into classification or prediction model discovery, data summarization, clustering, association rule discovery, sequential pattern discovery, dependency or dependency-model discovery, anomaly and trend discovery, and so on.
By mining object, the data sources include relational databases, object-oriented databases, spatial databases, temporal databases, text data sources, multimedia databases, heterogeneous databases, legacy databases, and the World Wide Web.
By mining method, it can be divided into machine learning methods, statistical methods, neural network methods, and database methods.
- Machine learning methods can be subdivided into inductive learning (decision trees, rule induction, etc.), instance-based learning, genetic algorithms, and so on.
- Statistical methods can be subdivided into regression analysis (multiple regression, autoregression, etc.), discriminant analysis (Bayesian discriminant, Fisher discriminant, non-parametric discriminant, etc.), cluster analysis (hierarchical clustering, dynamic clustering, etc.), and exploratory analysis (principal component analysis, correlation analysis, etc.).
- Neural network methods can be subdivided into feedforward neural networks (the BP algorithm, etc.) and self-organizing neural networks (self-organizing feature maps, competitive learning, etc.).
- Database methods are mainly multidimensional data analysis (OLAP) methods, along with attribute-oriented induction methods and so on.
II. A Brief Overview of Data Mining Techniques
There are many data mining techniques, and they can be classified in different ways under different schemes. The following sections focus on 13 techniques commonly used in data mining: statistical techniques, association rules, memory-based reasoning (MBR), genetic algorithms, cluster detection, link analysis, decision trees, neural networks, rough sets, fuzzy sets, regression analysis, deviation analysis, and concept description.
1. Statistical Techniques
Data mining draws on a wide range of scientific fields and techniques, and statistics is among them. The main idea of applying statistical techniques to a data set is the following: assume a distribution or probabilistic model for the given data (for example, a normal distribution), and then mine the data with the methods appropriate to that model.
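As a concrete illustration, here is a minimal Python sketch: fit a normal model by estimating the mean and standard deviation, then treat values far from the mean as unlikely under the model. The measurement values and the two-standard-deviation cutoff are assumptions for illustration.

```python
import statistics

# A minimal sketch: assume the data follow a normal distribution, estimate
# its parameters, and flag values that are unlikely under that model.
data = [4.9, 5.1, 5.0, 4.8, 5.2, 5.1, 9.7, 5.0, 4.9, 5.3]  # toy measurements

mu = statistics.mean(data)
sigma = statistics.stdev(data)

# Treat values more than 2 standard deviations from the mean as unusual.
unusual = [x for x in data if abs(x - mu) > 2 * sigma]
print(f"mean={mu:.2f}, stdev={sigma:.2f}, unusual: {unusual}")  # flags 9.7
```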
2. Association Rules
Data association is an important class of discoverable knowledge that exists in databases. If there is some regularity between the values of two or more variables, it is called an association. Associations can be divided into simple associations, temporal associations, and causal associations. The purpose of association analysis is to uncover the network of associations hidden in a database. The association functions among the data are often unknown, and even when known they are uncertain, so the rules produced by association analysis come with a confidence (credibility) measure.
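The two standard measures behind association rules are support (how often an itemset appears) and confidence (how often the rule holds when its antecedent appears). Below is a minimal Python sketch; the transactions and the rule {diapers} -> {beer} are made up for illustration.

```python
# A minimal sketch of support and confidence for an association rule,
# computed over a made-up transaction list.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Evaluate the candidate rule {diapers} -> {beer}.
antecedent, consequent = {"diapers"}, {"beer"}
confidence = support(antecedent | consequent) / support(antecedent)
print(f"support={support(antecedent | consequent):.2f}, "
      f"confidence={confidence:.2f}")  # support=0.40, confidence=0.67
```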
3. Memory-Based Reasoning (MBR)
MBR looks for similar cases in accumulated experience and then applies the information from those cases to the current example; this is the essence of memory-based reasoning. MBR first finds the neighbors most similar to a new record and then uses those neighbors to classify or estimate the new data. Using MBR raises three main problems: finding suitable historical data, determining the most effective way to represent the historical data, and choosing the distance function, the combination function, and the number of neighbors.
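The most common concrete form of MBR is the k-nearest-neighbors classifier. Here is a minimal Python sketch; the historical records, Euclidean distance, majority-vote combination, and k=3 are all illustrative assumptions.

```python
import math
from collections import Counter

# A minimal k-nearest-neighbors sketch of memory-based reasoning: classify
# a new point by majority vote among its k most similar historical records.
history = [((1.0, 1.1), "A"), ((1.2, 0.9), "A"),
           ((3.0, 3.2), "B"), ((3.1, 2.9), "B")]

def classify(point, k=3):
    # Distance function: Euclidean; combination function: majority vote.
    neighbors = sorted(history, key=lambda rec: math.dist(point, rec[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(classify((2.8, 3.0)))  # -> "B": the closest records belong to class B
```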
4. Genetic Algorithms (GA)
Genetic algorithms are optimization techniques based on evolutionary theory, using operations such as genetic crossover, genetic mutation, and natural selection. The main idea is this: following the principle of survival of the fittest, a new population is formed from the fittest rules in the current population together with offspring of those rules. Typically, the fitness of a rule is evaluated by its classification accuracy on a training sample set.
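To make the select/crossover/mutate loop concrete, here is a minimal Python sketch on the classic OneMax toy problem (evolve bit strings toward all ones, with fitness = number of one bits); the population size, generation count, and mutation rate are arbitrary choices.

```python
import random

# A minimal genetic-algorithm sketch on the OneMax toy problem:
# fitness of a bit string = the number of one bits it contains.
random.seed(0)
LENGTH, POP, GENERATIONS = 20, 30, 40

def fitness(bits):
    return sum(bits)

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    # Selection: the fitter half survives (survival of the fittest).
    population.sort(key=fitness, reverse=True)
    parents = population[:POP // 2]
    children = []
    while len(children) < POP - len(parents):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, LENGTH)  # single-point crossover
        child = a[:cut] + b[cut:]
        i = random.randrange(LENGTH)       # mutation: flip one bit 10% of the time
        child[i] ^= random.random() < 0.1
        children.append(child)
    population = parents + children

print(max(fitness(c) for c in population))  # approaches LENGTH (20)
```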
5. Cluster Detection
The process of grouping a collection of physical or abstract objects into classes made up of similar objects is called clustering. A cluster produced by clustering is a collection of data objects that are similar to the other objects in the same cluster and dissimilar to the objects in other clusters. Dissimilarity is computed from the attribute values describing the objects, and distance is a frequently used measure.
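k-means is one of the simplest clustering algorithms and illustrates the idea directly. Below is a minimal Python sketch; the points, k=2, the seeding, and the fixed number of iterations are illustrative assumptions.

```python
import math

# A minimal k-means sketch: group 2-D points into k clusters by repeatedly
# assigning each point to its nearest centroid and recomputing the centroids.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),   # one natural group
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]   # another natural group

def kmeans(points, k, steps=10):
    centroids = list(points[:k])  # seed with the first k points (fine for a sketch)
    for _ in range(steps):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest centroid by distance
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        for i, c in enumerate(clusters):  # update step: mean of each cluster
            if c:
                centroids[i] = tuple(sum(dim) / len(c) for dim in zip(*c))
    return centroids, clusters

centroids, clusters = kmeans(points, k=2)
print(centroids)  # one centroid settles near each natural group
```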
6. Link Analysis
The basic theory behind link analysis is graph theory. The guiding idea is to find an algorithm that produces good, though not necessarily perfect, results rather than to search for a perfect solution. Link analysis applies exactly this thinking: if imperfect results are usable, then such an analysis is a good analysis. With link analysis, patterns can be extracted from the behavior of some users, and the resulting concepts can be applied to a wider user population.
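One simple form of link analysis is to model interactions between users as a graph and find connected groups. Below is a minimal Python sketch; the edge list of user interactions is made up for illustration.

```python
from collections import defaultdict, deque

# A minimal link-analysis sketch: build an undirected graph from pairwise
# user interactions and find the group connected to a given user with
# breadth-first search.
edges = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]

graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def group_of(start):
    """All users reachable from `start` through the interaction graph."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen

print(group_of("alice"))  # {'alice', 'bob', 'carol'}
```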
7. Decision Trees
A decision tree provides a way to display rules of the form: under such-and-such conditions, such-and-such a value results.
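A decision tree is, in effect, a nest of condition-to-value rules. Here is a minimal Python sketch with a hand-built tree; the "grant credit?" attributes and thresholds are made up for illustration.

```python
# A minimal decision-tree sketch written as nested if/else rules: each path
# from the root to a leaf reads as "under these conditions, this value".
def decide(applicant):
    if applicant["income"] >= 50_000:
        if applicant["debt"] < 10_000:
            return "approve"
        return "review"
    # Lower income: fall back to the length of the credit history.
    return "approve" if applicant["years_on_file"] >= 5 else "decline"

print(decide({"income": 60_000, "debt": 5_000, "years_on_file": 2}))  # approve
print(decide({"income": 30_000, "debt": 2_000, "years_on_file": 1}))  # decline
```

In practice the tree is not written by hand but induced from training data, for example by choosing at each node the attribute split that best separates the classes.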
8. Neural Networks
Structurally, a neural network can be divided into an input layer, an output layer, and hidden layers. Each node of the input layer corresponds to one predictor variable; the nodes of the output layer correspond to the target variables, of which there can be more than one. Between the input layer and the output layer lie the hidden layers (not visible to users of the network); the number of hidden layers and the number of nodes per layer determine the complexity of the neural network.
Apart from the input-layer nodes, each node of the neural network is connected to many nodes in the preceding layer (called the input nodes of this node), and each connection is given a weight w_xy. The value of a node is obtained by applying a function to the weighted sum of the values of all its input nodes, weighted by the corresponding connection weights. This function is called the activation function or squashing function.
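The following minimal Python sketch shows one forward pass through such a network; the weights, biases, and the sigmoid squashing function are illustrative assumptions rather than trained values.

```python
import math

# A minimal feedforward sketch: each node applies an activation (squashing)
# function to the weighted sum of its input nodes plus a bias.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Compute one layer: activation(weighted sum + bias) for each node."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

inputs = [0.5, -1.0]                                  # one value per predictor
hidden = layer(inputs, [[0.4, 0.6], [-0.3, 0.8]], [0.1, 0.0])
output = layer(hidden, [[1.2, -0.7]], [0.05])
print(output)  # a single target-variable estimate in (0, 1)
```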
9. Rough Sets
Rough set theory is based on equivalence classes established within the given training data. All data samples forming an equivalence class are indiscernible; that is, with respect to the attributes describing the data, the samples are equivalent. In real-world data, some classes usually cannot be distinguished using the available attributes, and rough sets are used to approximately, or "roughly", define such classes.
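Concretely, such a class is bracketed by a lower approximation (samples certainly in it) and an upper approximation (samples possibly in it). Below is a minimal Python sketch; the sample records, attributes, and target class are made up for illustration.

```python
from collections import defaultdict

# A minimal rough-set sketch: partition samples into equivalence classes by
# their attribute values, then compute the lower and upper approximations
# of a target class.
samples = {
    "s1": ("high", "yes"), "s2": ("high", "yes"),
    "s3": ("low", "no"),   "s4": ("low", "no"),   # s3 and s4 indiscernible
}
target = {"s1", "s2", "s4"}  # the class we want to describe

classes = defaultdict(set)
for name, attrs in samples.items():
    classes[attrs].add(name)  # equivalence class = identical attribute values

lower = set().union(*(c for c in classes.values() if c <= target))
upper = set().union(*(c for c in classes.values() if c & target))
print("lower:", lower)  # {'s1', 's2'}: certainly in the class
print("upper:", upper)  # adds s3 and s4: s3 cannot be told apart from s4
```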
10. Fuzzy Sets
Fuzzy set theory introduces fuzzy logic into data mining classification systems, which allows "fuzzy" attribute values or boundaries to be defined. Fuzzy logic uses truth values between 0.0 and 1.0 to express the degree to which a given member belongs to a class or set, rather than imposing an exact cutoff. Fuzzy logic thus provides the convenience of working at a high level of abstraction.
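The core device is a membership function that returns a degree of belonging rather than a yes/no answer. Here is a minimal Python sketch; the "tall" concept and its 170-190 cm ramp are arbitrary choices for illustration.

```python
# A minimal fuzzy-set sketch: instead of a hard cutoff for "tall", a
# membership function returns a degree of belonging between 0.0 and 1.0.
def tall(height_cm):
    if height_cm <= 170:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 170) / 20  # linear ramp between the two bounds

for h in (165, 175, 185, 195):
    print(f"{h} cm -> membership in 'tall': {tall(h):.2f}")
```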
11. Regression Analysis
Regression analysis divides into linear regression, multiple regression, and nonlinear regression. In linear regression, the data are modeled with a straight line; multiple regression is an extension of linear regression that involves several predictor variables; nonlinear regression adds polynomial terms to the basic linear model to form a nonlinear model.
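For the linear case, the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x. Below is a minimal Python sketch; the toy data are made up to follow roughly y = 2x + 1.

```python
# A minimal linear-regression sketch: fit y = a*x + b by ordinary least
# squares using the closed-form formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]  # roughly y = 2x + 1 with a little noise

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# Slope = covariance(x, y) / variance(x); intercept from the two means.
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x
print(f"fit: y = {a:.2f}x + {b:.2f}")  # close to y = 2.00x + 1.00
```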
12. Deviation Analysis
The purpose of deviation analysis is to find anomalies in the data, such as noisy data, fraudulent data, and other abnormal records, and thereby obtain useful information.
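A common simple device is to flag records that fall far outside the interquartile range. Below is a minimal Python sketch; the transaction amounts and the conventional 1.5 * IQR fence are illustrative assumptions.

```python
import statistics

# A minimal deviation-analysis sketch: flag records far outside the
# interquartile range (IQR) as possible noise or fraud.
amounts = [102, 98, 105, 99, 101, 97, 100, 950, 103, 96]  # toy transactions

q1, _, q3 = statistics.quantiles(amounts, n=4)
iqr = q3 - q1
# Conventional rule: anything beyond 1.5 * IQR of the quartiles is suspect.
anomalies = [x for x in amounts if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(anomalies)  # [950] stands out as a possible fraud or noise record
```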
13. Concept Description
Concept description describes the connotation of a certain class of objects and summarizes the relevant features of such objects. Concept description divides into characteristic description and discriminant description: the former describes the features common to objects of the class, while the latter describes what distinguishes objects of different classes. A characteristic description of a class involves only the commonalities of all the objects in that class.
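Here is a minimal Python sketch of both kinds of description; the animal records and attributes are made up for illustration.

```python
# A minimal concept-description sketch: a characteristic description keeps
# the attribute values shared by every object of a class; a discriminant
# description keeps the values that separate it from another class.
mammals = [{"blood": "warm", "legs": 4, "fur": True},
           {"blood": "warm", "legs": 2, "fur": True}]
reptiles = [{"blood": "cold", "legs": 4, "fur": False}]

def characteristic(objects):
    """Attribute/value pairs common to every object in the class."""
    common = dict(objects[0])
    for obj in objects[1:]:
        common = {k: v for k, v in common.items() if obj.get(k) == v}
    return common

mam, rep = characteristic(mammals), characteristic(reptiles)
discriminant = {k: v for k, v in mam.items() if rep.get(k) != v}
print("characteristic:", mam)         # {'blood': 'warm', 'fur': True}
print("discriminant:", discriminant)  # here the same pairs also discriminate
```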
III. Concluding Remarks
Because of the urgent need to turn the data held in databases and other repositories into useful knowledge, data mining is regarded as a new, very important, promising, and challenging research field, and it has attracted wide attention from researchers across many disciplines (such as databases, artificial intelligence, statistics, data warehousing, online analytical processing, expert systems, data visualization, machine learning, information retrieval, neural networks, pattern recognition, and high-performance computing). As a new subject, data mining has taken shape at the intersection of these disciplines through their mutual integration. As data mining develops further, it will inevitably bring greater benefits to its users.
"Bi thing" analysis of 13 kinds of commonly used data mining technology