Contents
Chapter 1: Lifting the Veil on Data Mining
1.1 Historical mission
The highest goal of data mining is to obtain knowledge from data and support scientific decision-making.
1.2 Data mining stories
1.2.1 A discovery that shocked the industry: Wal-Mart's beer and diapers
1.2.2 Cost reduction: Hannifin's machine-parts wear analysis
1.2.3 A surprising little note: soccer penalty data
1.3 What is data mining?
Data mining is the process of extracting hidden, previously unknown, but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data.
1.4 Historical inevitability
From traditional data warehouses and online analytical processing (OLAP) to modern knowledge discovery in databases (KDD), data mining is an inevitable direction.
1.5 What can data mining do?
1.5.1 Association rule mining
The classic association rule algorithm is Apriori. Its main idea is to first find all frequent itemsets (subsets of events that occur together frequently across the recorded events), and then derive the rules with higher confidence from those frequent itemsets.
1.5.2 Clustering
Clustering divides data objects into categories so that objects in the same category are highly similar, while objects in different categories differ greatly.
The smaller the distance between two objects, the more similar they are; distance is the most natural way to measure the similarity of objects.
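As a small illustration of distance as a similarity measure, here is a minimal sketch using Euclidean distance. The choice of Euclidean distance and the three sample points are assumptions made for this example; the notes do not name a specific metric.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 2-D objects, used only for illustration.
p = (1.0, 1.0)
q = (1.5, 1.2)
r = (8.0, 9.0)

# p is much closer to q than to r, so p and q are the more similar pair.
closer_to_q = euclidean(p, q) < euclidean(p, r)
print(closer_to_q)
```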
Two classic approaches to clustering: the partitioning method and the hierarchical clustering method.
Partitioning methods: K-means and K-medoids
The basic idea of K-means clustering: data points within a cluster should be as close together as possible, and the clusters themselves should be as far apart as possible.
① Arbitrarily select K objects from the N data objects as the initial cluster centers.
② Compute the distance from each object to each cluster center and assign the object to the nearest cluster.
③ After all objects have been assigned, recompute the centers of the K clusters.
④ Compare the new centers with the previously computed K cluster centers: if any center has changed, go to ②; otherwise go to ⑤.
⑤ Stop and output the clustering result.
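The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, and the sample data are all assumptions introduced for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: arbitrarily pick k objects as the initial cluster centers.
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: if no center changed, stop; otherwise repeat from step 2.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = kmeans(data, k=2)
print(sorted(sorted(c) for c in clusters))
```

The `max_iter` cap is a safety net added here; the steps in the notes loop until the centers stop changing.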
The K-medoids algorithm represents a cluster by the object in the cluster that is closest to its center, whereas the K-means algorithm represents a cluster by its centroid. K-means is therefore very sensitive to noise and outliers, because a single outlier can have a big impact on the computed centroid. K-medoids largely removes this effect by using a central object (the medoid) instead of the centroid.
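To see the outlier effect concretely, the sketch below compares the centroid and the medoid of one cluster that contains a single extreme point. It assumes the common definition of the medoid as the member with the smallest total distance to all other members; the data are made up for illustration.

```python
import math

def centroid(points):
    """Coordinate-wise mean of the points (the K-means representative)."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def medoid(points):
    """Member with the smallest total distance to all members (assumed definition)."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Hypothetical cluster: four nearby points plus one outlier at (100, 100).
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (100, 100)]

c = centroid(cluster)  # dragged far away from the four nearby points
m = medoid(cluster)    # remains one of the nearby points
print(c, m)
```

The centroid is pulled towards the outlier, while the medoid stays inside the dense part of the cluster.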
K-means works well when the resulting clusters are dense and clearly separated from one another. On large datasets the algorithm is relatively scalable and efficient.
Hierarchical Clustering Method:
This method builds clusters by organizing the data into layers, forming a tree whose nodes are clusters. Hierarchical clustering done from the bottom up is called agglomerative hierarchical clustering; done from the top down, it is called divisive hierarchical clustering.
Agglomerative hierarchical clustering first treats each object as its own cluster, and then gradually merges clusters into larger ones until all objects are in the same cluster or a termination condition is met.
Divisive hierarchical clustering first places all objects in a single cluster, and then gradually splits it into smaller clusters until each object forms its own cluster or a termination condition is met.
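The bottom-up variant can be sketched as follows. This is a naive illustration under stated assumptions: cluster distance is taken to be single linkage (distance of the closest pair), and the termination condition is a target number of clusters. Neither choice comes from the notes, and the function names are hypothetical.

```python
import math

def single_link(a, b):
    """Distance between two clusters, taken as the distance of their closest pair."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, target_clusters):
    """Merge the two closest clusters until only target_clusters remain."""
    # Start with every object in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge cluster j into cluster i (j > i, so index i stays valid).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical data: two well-separated groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
result = agglomerative(data, target_clusters=2)
print(sorted(sorted(c) for c in result))
```

Setting `target_clusters=1` runs the merging all the way to a single cluster, which corresponds to the "all objects in the same cluster" end state.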
Hierarchical clustering can be combined with other clustering methods to form multi-stage clustering; examples include BIRCH, CURE, ROCK, and Chameleon.
Visual clustering algorithm:
The visual clustering algorithm is based on the scale-space theory that we have built up. It can be used to analyze raw satellite images, grouping things with similar attributes into the same cluster.
We use the principles of similarity, continuity, proximity, and symmetry as the basic principles of clustering.
1.5.3 Prediction
Prediction in data mining means studying the relationship between the inputs and outputs of a system to build a model, and then using that model to predict future data.
Prediction methods include:
1. Decision tree methods: ID3 and C4.5
The core idea of the ID3 algorithm is to use a feature selection strategy based on information gain while constructing the decision tree: the attribute with the highest information gain is selected as the splitting attribute of the current node, so that the amount of information needed to classify the samples in the resulting partitions is minimal. This produces a decision tree that is consistent with the training data while keeping the number of branches and the redundancy as small as possible.
Disadvantages of the ID3 algorithm:
① ID3 cannot backtrack over attributes it has already selected during the search, so it converges to a local optimum rather than the global optimum.
② The information gain measure is biased towards attributes with many distinct values, which is not reasonable.
③ ID3 can only handle discrete-valued attributes and cannot handle continuous attributes.
④ When the training sample is too small or contains noise, overfitting occurs easily.
The C4.5 algorithm was proposed to address these shortcomings. C4.5 improves on ID3 in the following ways:
① It uses the information gain ratio as the selection criterion, which compensates for ID3's bias towards attributes with many values.
② It handles continuous attributes.
③ It can handle training samples with missing attribute values.
④ It uses different pruning techniques to avoid overfitting of the decision tree.
⑤ It applies k-fold cross-validation.
Overfitting in decision trees is addressed mainly by pruning, which is a technique for overcoming noise. Pruning improves the tree's ability to classify new data accurately, and at the same time simplifies the tree, making it easier to understand and faster to classify with. Pruning can be divided into pre-pruning and post-pruning.
Pre-pruning limits the full growth of the decision tree by setting certain rules. Post-pruning lets the decision tree grow fully and then cuts off the leaf nodes or branches that are not generally representative. Although pre-pruning may seem more direct, post-pruning is more successful in practice and is therefore used more often.
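The difference between the ID3 and C4.5 selection criteria described above can be illustrated with a small sketch. It computes information gain and gain ratio from their textbook definitions on a made-up dataset in which an `id` attribute has a unique value per row. The dataset, attribute names, and function names are all hypothetical, and real C4.5 adds further heuristics not shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def partition(rows, labels, attr):
    """Group the class labels by the value each row has for attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return groups

def information_gain(rows, labels, attr):
    """ID3 criterion: entropy before the split minus weighted entropy after it."""
    total = len(labels)
    groups = partition(rows, labels, attr).values()
    return entropy(labels) - sum(len(g) / total * entropy(g) for g in groups)

def gain_ratio(rows, labels, attr):
    """C4.5 criterion: information gain divided by the split information."""
    total = len(labels)
    groups = partition(rows, labels, attr).values()
    split_info = -sum(len(g) / total * math.log2(len(g) / total) for g in groups)
    return information_gain(rows, labels, attr) / split_info if split_info > 0 else 0.0

# Hypothetical data: "id" is unique per row, "outlook" is a genuinely useful attribute.
rows = [{"id": i, "outlook": o} for i, o in enumerate(["sunny"] * 4 + ["rain"] * 4)]
labels = ["no"] * 4 + ["yes", "yes", "yes", "no"]

for attr in ("id", "outlook"):
    print(attr, round(information_gain(rows, labels, attr), 3),
          round(gain_ratio(rows, labels, attr), 3))
```

On this data, information gain prefers the useless `id` attribute because it has many distinct values, while gain ratio prefers `outlook`.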
2. Artificial neural networks
A neural network is a model-free adaptive function estimator that can approximate arbitrary functional relationships.
3. Support vector machine (SVM)
The support vector machine (SVM) is a machine learning method based on the structural risk minimization principle of statistical learning theory. It can solve both classification and regression problems. SVM modeling starts from the linearly separable binary classification problem, then extends step by step to linearly inseparable problems and nonlinear problems, and finally to linear and nonlinear regression. Kernel functions are introduced to handle curved (nonlinear) classification surfaces. The resulting quadratic programming problem can be solved with the SMO algorithm. SVM has many advantages in small-sample, nonlinear, and high-dimensional pattern recognition problems.
Disadvantages of SVM:
① On very large datasets, solving the convex quadratic program makes the algorithm inefficient, or even infeasible.
② SVM is not robust to outliers.
③ The SVM solution is not sparse: there are many redundant support vectors.
④ There is no good strategy for choosing the parameters.
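The role of kernel functions can be illustrated with a small sketch. It uses the degree-2 polynomial kernel, chosen here purely as an example (the notes do not name a particular kernel), and checks that the kernel value equals an ordinary dot product in an explicit higher-dimensional feature space. It then shows that XOR-like points, which no straight line separates in the original plane, are split by the sign of a single mapped coordinate. All names and data are hypothetical.

```python
import math

def poly_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel: k(x, y) = (x . y) ** 2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit feature map whose dot product reproduces poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# The kernel computes the feature-space dot product without building the features.
x, y = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, y), dot(feature_map(x), feature_map(y)))

# XOR-like data: not linearly separable in 2-D, but the middle mapped
# coordinate (proportional to x1 * x2) has the same sign as the label.
xor_data = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]
separable = all(feature_map(p)[1] * label > 0 for p, label in xor_data)
print(separable)
```

This only demonstrates the kernel idea; training an actual SVM still requires solving the quadratic program mentioned above.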
4. Regularization methods: Lasso and L1/L2 regularization
The Lasso method uses the absolute values of the model coefficients as a penalty to shrink the coefficients. Coefficients with small absolute values are shrunk to exactly 0, which makes the resulting model sparse and thus achieves selection of the significant variables and estimation of the corresponding parameters at the same time. Lasso is an L1 method.
1.5.4 Sequences and time series
Time series prediction is similar to regression, except that it predicts future values from historical values. It is a special kind of autoregression, expressed as a memory law that relates past observations to the random disturbances at the corresponding moments.
1.6 Data mining tools
Intelligent Miner
Unica Model 1
SAS
SPSS: includes data management, statistical analysis, chart analysis, output management, and so on.
Weka: open source
Chapter 2: The Data Mining Process
2.1 Lee's
2.2 An old revolutionary meets a new problem
2.3 Fishing leads to data mining ideas
2.4 Data mining projects
2.5 Data mining project implementation
Business understanding phase → data understanding phase → data preparation phase → modeling phase → model evaluation phase → deployment phase
2.5.1 Business understanding phase
2.5.2 Data understanding phase
2.5.3 Data preparation phase
2.5.4 Modeling phase
2.5.5 Model evaluation phase
2.5.6 Deployment phase
2.6 Lee's outlook
Chapter 3
3.1 Application prospects of data mining in the power industry
3.2 Condition-based maintenance of power equipment
3.3 Power system transient stability evaluation
3.4 Load forecasting
3.5 Electricity theft detection
3.6 Building a power data mining system
Chapter 4: Data Mining in Traffic and Aviation
4.1 Railway fare development
4.2 High-speed rail track overhaul
4.3 Traffic flow forecasting
Chapter 5: Data Mining in the Metallurgical Industry
5.1 About the process industry
5.2 Product quality control
5.3 Blast furnace temperature prediction
5.4 Grinding particle size prediction
5.5 Coking coal blending optimization
Chapter 6: Data Mining in the Tax and Finance Industry
6.1 Tax audit
6.2 Anti-money laundering
6.3 Stock index tracking
Chapter 7: Data Mining in Fault Diagnosis
7.1 Rocket engine fault diagnosis
7.2 Mechanical equipment fault diagnosis
7.3 Fault diagnosis of nuclear power equipment
7.4 Ship dynamic fault diagnosis
Chapter 8: Data Mining in the Telecommunications Industry
8.1 Market segmentation
8.2 Precision marketing
8.3 Business response
8.4 Customer churn analysis
Chapter 9: Web Data Mining
9.1 Web data mining overview
9.2 Data mining in vertical search engines
9.3 Data mining for e-commerce
9.4 Data mining in social networks
References
"Big Liar data Mining" reading