Over the last two months I was fortunate enough to spend my spare time on a rough pass through the book "Statistical Machine Learning". Combining it with material from "Pattern Recognition" and "Data Mining: Concepts and Techniques", I have sorted out and summarized the knowledge structure of machine learning as follows:
Machine learning comes down to two major questions: 1. what to learn, and 2. how to learn.
First, what to learn.
I. What to learn
1. What problem do you want to solve? Machine learning mainly addresses the following three types of problems:
a) Supervised learning problems: given a set of inputs with their outputs (that is, a collection of manually labeled samples), the data set is used to train a chosen model, and the resulting model can then predict the output for a new input. The specific prediction tasks include classification, labeling, and regression.
b) Semi-supervised learning problems: the model is trained on a sample set in which some samples are manually labeled and some are not, and the resulting model can then predict the output for a new input.
c) Unsupervised learning problems: learning from samples that have not been manually labeled in order to uncover structure in the data. Cluster analysis and association analysis both belong to this kind of problem.
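To make the supervised setting concrete, here is a minimal sketch of a k-nearest-neighbor classifier, one of the basic classification models. The function name `knn_predict` and the toy data are made up for illustration and do not come from any of the books mentioned; the point is only to show "labeled samples in, prediction for a new input out".

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest samples.

    `train` is a list of (features, label) pairs, i.e. the manually labeled set.
    """
    # Sort the labeled samples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    # Majority vote among the labels of the k closest samples.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per sample, two classes.
labeled_samples = [
    ((1.0, 1.1), "A"), ((1.2, 0.9), "A"), ((0.8, 1.0), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]

print(knn_predict(labeled_samples, (1.1, 1.0)))  # a point near the "A" group
print(knn_predict(labeled_samples, (3.9, 4.1)))  # a point near the "B" group
```

There is no separate training step here: k-nearest neighbor simply stores the labeled samples, which is why it is a convenient first example.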
2. Which model to study: choose a practical model and solution method for the specific problem
The following lists the basic models for each kind of learning problem; the models used in practice are refinements of these basic models for specific business requirements.
a) Supervised learning problems. Classification: generative models (naive Bayes) and discriminative models (k-nearest neighbor, perceptron, decision tree, logistic regression, SVM, boosting, and, in their usual feed-forward form, neural networks). Labeling: hidden Markov models and conditional random fields. Regression: neural networks, regression trees, logistic regression, and ordinary linear regression.
c) Unsupervised learning problems: these include clustering models and association analysis models. In association analysis, frequent pattern mining (discovering substructures that appear frequently in a data set) and association rule mining (often used in market basket analysis) are common. Clustering is approached mainly in four ways:
(1) Partition-based clustering models: k-means and k-medoids, which group samples mainly by the similarity of their attributes.
(2) Hierarchical clustering models: mainly agglomerative clustering and its inverse process, divisive clustering; these are mainly used to describe how groups form and split.
(3) Density-based methods: methods (1) and (2) have difficulty finding clusters of arbitrary shape. Density-based methods overcome this by using connected high-density regions to identify the cluster structure (this can be used, for example, to preprocess character images in OCR).
(4) Grid-based methods.
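As one illustration of the partition-based clustering mentioned above, here is a rough sketch of k-means (Lloyd's algorithm) using only the standard library. The function name `kmeans`, the choice to pass the starting centroids explicitly, and the toy points are my own assumptions to keep the example deterministic; real implementations usually pick the starting centroids at random or with a smarter seeding scheme.

```python
import math


def kmeans(points, centroids, max_iters=100):
    """Alternate between assigning points and moving centroids until stable."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # nothing moved, so the assignment can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical toy data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Because every point is assigned to the nearest centroid, the clusters this produces are roughly round, which is exactly the limitation that the density-based methods in (3) are meant to address.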
With that basic understanding of the problem in place and a model selected, the next question is how the model learns:
II. How to learn
1. Collect data, preprocess data, extract features: preprocessing usually means filling in or removing missing values and outliers. It also includes appropriate transformations of the original data (e.g. PCA, ICA, wavelet transform, FFT), as well as conversion of data format and size (for example, in image processing, compressing a high-definition image to a fixed size and a specified format).
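The missing-value and outlier handling described above can be sketched in a few lines. The specific choices here are assumptions on my part, not something the books prescribe: missing entries are filled with the median, outliers are dropped with Tukey's 1.5 x IQR rule, and the final transformation is a plain standardization rather than PCA or FFT. The function names and the sample values are made up for illustration.

```python
import statistics


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def drop_outliers(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical raw feature column with one missing value and one outlier.
raw = [2.0, 2.5, None, 3.0, 2.8, 100.0, 2.2]
filled = fill_missing(raw)
cleaned = drop_outliers(filled)
scaled = standardize(cleaned)
print(cleaned)
print(scaled)
```

The median is used for filling because, unlike the mean, it is barely affected by the outlier that has not yet been removed at that point.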
2. Which algorithm is used to solve and optimize the model: the choice of model and algorithm determines the cost and speed of learning. Common optimization algorithms include gradient descent, Newton's method, quasi-Newton methods, the Levenberg-Marquardt (LM) algorithm, and constrained solvers based on Lagrange duality. When building a model, the optimization criterion and the corresponding method depend on the model: distribution parameters are estimated by maximum likelihood, latent variables by the EM algorithm, decision trees are grown using information gain, and so on. Different models have different optimization criteria, and this is worth studying in depth. In addition, to avoid overfitting as far as possible, a regularization term is usually added to the model's objective.
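As a small illustration of two of the ideas above, gradient descent and regularization, here is a sketch that fits a one-variable linear model. The loss (mean squared error plus an L2 penalty `lam * w**2` on the slope), the learning rate, and the toy data are all assumptions chosen for the example, and `fit_linear_gd` is a made-up name.

```python
def fit_linear_gd(xs, ys, lr=0.05, lam=0.0, steps=2000):
    """Fit y = w*x + b by gradient descent on MSE + lam * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on every sample.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the regularized loss with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs)) + 2 * lam * w
        grad_b = 2 / n * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_linear_gd(xs, ys)                   # no regularization
w_reg, b_reg = fit_linear_gd(xs, ys, lam=1.0)  # L2 penalty on the slope
print(w, b)
print(w_reg, b_reg)
```

Without the penalty the fit recovers the generating slope and intercept; with the penalty the slope is pulled toward zero, which is the shrinking effect regularization relies on to limit overfitting. On clean data like this the penalty only hurts, so the example shows the mechanism, not a case where it helps.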
3. Model evaluation: once the model has been fitted, some criterion is needed to measure its quality. Commonly used evaluation measures include accuracy, recall, the counts TP, FN, FP, and TN, and the ROC curve and the area under it (AUC), together with procedures such as cross-validation; regression problems are also assessed with fitting residuals and goodness of fit. Not every metric is effective for every problem; the key is to measure with the metrics that fit your business problem.
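The counts TP, FN, FP, and TN and the rates built from them can be computed directly, as in the sketch below. The function names and the label vectors are made up for illustration, and precision is included alongside accuracy and recall as my own addition. The second data set is a deliberately imbalanced case that shows why a single metric can mislead.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def basic_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall (0.0 when a ratio is undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical balanced case: 3 hits, 1 miss, 1 false alarm, 5 correct rejections.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(basic_metrics(y_true, y_pred))

# Hypothetical imbalanced case: a model that always predicts the negative class.
rare_true = [1] * 5 + [0] * 95
rare_pred = [0] * 100
print(basic_metrics(rare_true, rare_pred))
```

In the imbalanced case the accuracy is 0.95 even though the model never finds a single positive sample (recall 0.0), which is one concrete reason to choose the metric according to the business problem.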
Machine Learning and Pattern Recognition Learning Summary (Part 1)