Machine Learning Algorithms


Several distribution functions:

PMF (probability mass function): gives the probability that a discrete random variable takes each specific value.

PDF (probability density function): defined for continuous random variables; the density itself is not a probability, and the probability that a continuous random variable falls in an interval is obtained only by integrating the density over that interval.

CDF (cumulative distribution function): fully describes the probability distribution of a real-valued random variable X; for a continuous variable it is the integral of the PDF.
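
A minimal sketch of the three functions using scipy.stats; the binomial and standard normal distributions below are just illustrative choices, not from the original text:

    from scipy import stats

    # PMF: probability that a Binomial(10, 0.3) variable equals exactly 2
    print(stats.binom.pmf(2, n=10, p=0.3))

    # PDF: density of a standard normal at x = 0 (not itself a probability)
    print(stats.norm.pdf(0.0))

    # CDF: P(X <= 1.96) for a standard normal, i.e. the integral of the PDF up to 1.96
    print(stats.norm.cdf(1.96))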

Supervised learning: adjust the parameters of a classifier using samples whose classes are known, until the required performance is reached; examples include SVM, maximum entropy models, and CRF.

CRF (conditional random field) compared with HMM (hidden Markov model) and MEMM (maximum-entropy Markov model):

Its features are flexible, it can incorporate more contextual information, and it optimizes globally over the whole label sequence; the disadvantages are a high training cost and high complexity.

Unsupervised learning: the process of learning from training samples that carry no class labels in order to discover structural knowledge in the training set.

Kernel-based machine learning algorithms: RBF (radial basis function), LDA, SVM.

Feature selection methods: chi-square, information gain, average mutual information, expected cross entropy.

Feature dimensionality reduction methods: PCA, LDA, sparse autoencoders (deep learning), singular value decomposition (SVD), LASSO, wavelet analysis, Laplacian eigenmaps.

LDA (linear discriminant analysis) is a supervised learning method. It takes labeled data points and projects them into a lower-dimensional space so that the projected points form clusters by class: points of the same class end up closer together, and points of different classes end up farther apart. It is a linear classifier, and its objective is exactly that the points within a class be as close (concentrated) as possible while the points between classes be as far apart as possible.

PCA (principal component analysis) is unsupervised. LDA usually exists as a standalone algorithm: given training data it produces a set of discriminant functions, which can then be used to predict the class of new inputs. PCA is more of a preprocessing method: its goal is to map high-dimensional data into a low-dimensional space by a linear projection so that the data has the largest possible variance along the projected dimensions, using fewer dimensions while preserving as much of the original data as possible. PCA thus tries to retain the maximum amount of intrinsic information after dimensionality reduction, and it measures the importance of a direction by the variance of the data projected onto it. However, a large-variance projection says nothing about class separation: the projected points from different classes may become mixed together and indistinguishable. This is one of the biggest problems with PCA, and it is why PCA often works poorly as a preprocessing step for classification. The projection matrix of PCA is built from the eigenvectors of the data's covariance matrix.
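
A minimal sketch contrasting the two methods with scikit-learn; the Iris dataset and the two-component setting are illustrative assumptions:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)

    # PCA: unsupervised, keeps the directions of largest variance (ignores y)
    X_pca = PCA(n_components=2).fit_transform(X)

    # LDA: supervised, keeps the directions that best separate the classes
    X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

    print(X_pca.shape, X_lda.shape)   # both (150, 2)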

Multiplying an a×b matrix by a b×c matrix takes on the order of a·b·c scalar multiplications.

Linearly separable and non-separable problems:

Pseudo-inverse method: a training algorithm for RBF neural networks; the radial basis functions handle the linearly non-separable case.

HK (Ho-Kashyap) algorithm: obtains the weight vector under the minimum mean-square-error criterion and applies to both the linearly separable and the linearly non-separable case. For linearly separable data it yields the optimal weight vector; for linearly non-separable data it can detect this and exit the iterative process.

Potential function method: handles the nonlinear (non-separable) case.

Time series models:

AR: autoregressive model, linear prediction;

MA: moving average model, one of the parametric (model-based) methods of spectral analysis;

ARMA: autoregressive moving average model, one of the high-resolution parametric spectral analysis methods; it gives more accurate spectral estimates and better spectral resolution than the previous two, but its parameter estimation is more complicated.

GARCH: generalized ARCH model, especially suited to analyzing and forecasting volatility.
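
A minimal sketch of fitting an ARMA model with statsmodels; the synthetic series and the order (2, 0, 1) are arbitrary illustrative choices:

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # A toy AR(2)-like series, just for illustration
    rng = np.random.default_rng(0)
    y = np.zeros(200)
    for t in range(2, 200):
        y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + rng.normal()

    # ARMA(2, 1) is ARIMA with no differencing: order = (p, d, q) = (2, 0, 1)
    model = ARIMA(y, order=(2, 0, 1)).fit()
    print(model.params)             # estimated AR/MA coefficients
    print(model.forecast(steps=5))  # 5-step-ahead prediction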

Discriminative models: logistic regression, SVM, traditional neural networks, nearest neighbor, CRF, LDA, boosting, linear regression.

Generative models: Gaussian models, naive Bayes, HMMs, sigmoid belief networks, MRF, latent Dirichlet allocation.

EM algorithm: learns the model parameters when only the observation sequence, and not the state sequence, is available;
Viterbi algorithm: uses dynamic programming to solve the HMM prediction (decoding) problem; it is not a parameter-estimation method;
Forward-backward algorithm: computes the probability of an observation sequence;
Maximum likelihood estimation: a supervised learning approach used to estimate the parameters when both the observation sequence and the corresponding state sequence are available.
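
A minimal Viterbi sketch for a small HMM; the transition, emission, and initial probabilities are made-up toy values, not from the original text:

    import numpy as np

    def viterbi(obs, pi, A, B):
        """Most likely state path for observation indices `obs`.
        pi: initial state probs, A: transition matrix, B: emission matrix."""
        n_states, T = A.shape[0], len(obs)
        delta = np.zeros((T, n_states))            # best path probability so far
        psi = np.zeros((T, n_states), dtype=int)   # back-pointers
        delta[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            for j in range(n_states):
                scores = delta[t - 1] * A[:, j]
                psi[t, j] = np.argmax(scores)
                delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
        # Backtrack from the most probable final state
        path = [int(np.argmax(delta[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1]

    # Toy 2-state, 2-symbol HMM
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(viterbi([0, 1, 1, 0], pi, A, B))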

The MATCH function in Excel returns the position of the specified content, and INDEX retrieves data based on a position.

MATCH(lookup_value, lookup_array, match_type)

INDEX(array, row_num, column_num)
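
For example, on a hypothetical sheet where column A holds product names and column B holds prices, INDEX(B1:B10, MATCH("apple", A1:A10, 0)) finds the row of "apple" in column A and returns the corresponding value from column B.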

Clustering algorithm

Clustering is an important branch of machine learning. It is a form of unsupervised learning, generally used for data exploration, such as group discovery and outlier detection, and also as a preprocessing step for other algorithms. Common clustering algorithms include K-means, K-medoids, GMM, spectral clustering, NCut, and so on.

Categories:

1. Partitioning approach:

Construct different partitions of the data, then evaluate them by some criterion (for example, minimizing the sum of squared errors). Goal: find the partition that minimizes the sum of squared distances.

Typical algorithms: K-means, K-medoids

K-means algorithm:

1. Select k points as the initial centroids (randomly generated or chosen from D)

2. Repeat

3. Assign each point to its nearest centroid, forming k clusters

4. Recompute the centroid of each cluster

5. Until the clusters no longer change or the maximum number of iterations is reached
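
A minimal NumPy sketch of these steps; the toy data, k = 3, and the iteration cap are illustrative assumptions:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1. pick k initial centroids from the data
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # 3. assign each point to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 4. recompute each cluster's centroid
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # 5. stop when the centroids no longer change
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
    labels, centroids = kmeans(X, k=3)
    print(centroids)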

2. Model-based:

Assume a distribution model for each class and try to find the best-fitting model for each class.

Typical algorithm: GMM (Gaussian mixture model)

GMM: mixes k Gaussian models together; the probability of each point is the result of blending the several Gaussians. The EM algorithm is applied to GMM to estimate its parameters.
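
A minimal scikit-learn sketch of fitting a GMM by EM; the toy data and n_components=2 are assumptions for illustration:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Two well-separated Gaussian blobs as toy data
    X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 6])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM under the hood
    print(gmm.means_)                # estimated component means
    print(gmm.predict(X[:5]))        # hard cluster assignments
    print(gmm.predict_proba(X[:5]))  # soft (probabilistic) assignments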

3. Dimensionality-reduction approach:

First reduce the dimensionality, then cluster.

Typical algorithms: spectral clustering, NCut
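
A minimal scikit-learn sketch of spectral clustering; the two-moons data and the parameter choices are illustrative assumptions:

    from sklearn.cluster import SpectralClustering
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

    # Builds a nearest-neighbor affinity graph, embeds it via the graph
    # Laplacian (the dimensionality-reduction step), then runs k-means.
    labels = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                                random_state=0).fit_predict(X)
    print(labels[:10])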

Classifier

A classifier is the general term for methods that classify samples in data mining, including decision trees, logistic regression, naive Bayes, neural networks, and other algorithms.

The steps to construct and apply a classifier:
    • Divide the selected samples (positive and negative) into two parts: a training set and a test set.
    • Run the classifier algorithm on the training set to generate a classification model.
    • Apply the classification model to the test set to generate predictions.
    • Compute the necessary evaluation metrics from the predictions and evaluate the performance of the classification model.
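
A minimal end-to-end sketch of these four steps with scikit-learn; the dataset and the choice of a logistic-regression classifier are illustrative assumptions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X, y = load_breast_cancer(return_X_y=True)

    # Step 1: split into training and test samples
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Step 2: train the classifier on the training set
    clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

    # Step 3: apply the model to the test set
    y_pred = clf.predict(X_test)

    # Step 4: evaluate the predictions
    print(accuracy_score(y_test, y_pred))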

(1) Decision trees: based on the known probabilities of various situations, a decision tree is constructed to estimate the probability that the net present value is greater than or equal to zero, evaluate project risk, and judge feasibility. It is a graphical method that applies probability analysis intuitively, and it is a form of supervised learning. Its advantages are good readability and reusability, and the number of computations per prediction never exceeds the depth of the decision tree.

In machine learning, a random forest is a classifier that contains multiple decision trees; its output class is decided by the majority vote of the classes output by the individual trees.

Constructing a random forest involves two kinds of randomness: random selection of the data and random selection of the candidate features.

1. Random selection of data: first, sample from the original dataset with replacement to build a sub-dataset whose size equals that of the original dataset. Elements may differ between sub-datasets, and elements within the same sub-dataset may be duplicated. Second, use each sub-dataset to build a sub-decision tree; pass the data through each sub-decision tree, and each tree outputs a result. Finally, when new data needs a classification from the random forest, the forest's result is obtained by a vote over the sub-decision trees.

2. Random selection of candidate features: similar to the random selection of data, each sub-tree in a random forest does not use all of the candidate features; instead, it randomly selects a subset of the features and then chooses the optimal feature from that random subset. This makes the decision trees in the forest differ from one another, increasing the diversity of the system and thus improving classification performance.
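
A minimal scikit-learn sketch showing both kinds of randomness (bootstrap resampling of the data and random feature subsets per split); the dataset and parameter values are illustrative assumptions:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    # bootstrap=True resamples the data with replacement for each tree;
    # max_features='sqrt' makes each split consider a random subset of features.
    forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                    max_features='sqrt', random_state=0)
    forest.fit(X, y)
    print(forest.predict(X[:5]))   # class decided by the trees' majority vote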

(2) Logistic regression: despite the name, it is not a regression model but a classification model.

Model characteristics:
1. Advantages: fast to train, easy to implement;
2. Disadvantages: prone to underfitting; not effective enough for complex tasks;

The computation is simple and consists of two steps: 1) compute the gradient; 2) update the weights.

The goal of logistic regression is to find the best-fitting weights w of the nonlinear sigmoid function; the value of w is learned by gradient ascent. Stochastic gradient ascent processes only a small number of samples at a time, which saves computational resources and makes the algorithm usable for online learning.
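
A minimal NumPy sketch of those two steps, training logistic regression by stochastic gradient ascent on the log-likelihood; the toy data, learning rate, and epoch count are assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sgd_ascent_logreg(X, y, lr=0.1, epochs=50, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(X)):                 # one sample at a time
                grad = (y[i] - sigmoid(X[i] @ w)) * X[i]      # 1) compute the gradient
                w += lr * grad                                # 2) update the weights
        return w

    # Toy linearly separable data with a bias column
    X = np.array([[1, 0.5], [1, 1.5], [1, 3.0], [1, 4.0]], dtype=float)
    y = np.array([0, 0, 1, 1], dtype=float)
    w = sgd_ascent_logreg(X, y)
    print(sigmoid(X @ w))   # predicted probabilities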

(3) Bayesian classification: a family of classification algorithms that are all based on Bayes' theorem, hence collectively referred to as Bayesian classification.

Bayes' theorem: P(y|x) = P(x|y) P(y) / P(x).

The simplest of the Bayesian classifiers is naive Bayes. The idea is: for an item to be classified, compute the probability of each class given the item, and assign the item to the class with the largest probability.

For the prior probability P(y):

When P(y) is known, Bayes' formula is used to compute the posterior probability.

When P(y) is unknown, the decision surface is computed using the Neyman-Pearson (N-P) decision rule.

The minimax (maximum-minimum loss) rule addresses the case where the prior probability is unknown or hard to compute when applying the minimum-loss rule.
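
A minimal scikit-learn sketch of a naive Bayes classifier; the Gaussian variant and the dataset are illustrative assumptions:

    from sklearn.datasets import load_iris
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)

    # Fits class priors P(y) and per-feature Gaussian likelihoods P(x|y),
    # then classifies by the largest posterior P(y|x) via Bayes' theorem.
    nb = GaussianNB().fit(X, y)
    print(nb.class_prior_)     # estimated P(y)
    print(nb.predict(X[:5]))   # class with the largest posterior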

Three main categories of linear classifiers:

perceptron criterion function, SVM, Fisher criterion

SVM:

Support vector machines (SVM) are supervised learning models commonly used for pattern recognition, classification, and regression analysis.

The main idea can be summed up in two points:

⑴ For the linearly non-separable case, SVM uses a nonlinear mapping to transform the linearly non-separable samples of the low-dimensional input space into a high-dimensional feature space where they become linearly separable, making it possible to analyze the nonlinear features of the samples with a linear algorithm in that high-dimensional feature space;

⑵ Based on structural risk minimization theory, SVM constructs the optimal separating hyperplane in the feature space, so that the learner obtains a global optimum and the expected risk over the whole sample space satisfies a certain upper bound with a certain probability.

General characteristics:

⑴ The SVM learning problem can be expressed as a convex optimization problem, so known efficient algorithms can be used to find the global minimum of the objective function. Other classification methods (such as rule-based classifiers and artificial neural networks) use a greedy strategy to search the hypothesis space and generally obtain only a local optimum.

⑵ SVM controls model capacity by maximizing the margin of the decision boundary. However, the user must supply additional parameters, such as the type of kernel function and the slack (relaxation) variables. SVM kernel functions include the linear, polynomial, radial basis function (Gaussian), power exponential, Laplacian, ANOVA, rational quadratic, multiquadric, inverse multiquadric, and sigmoid kernels.

⑶ By introducing a dummy variable for each categorical attribute in the data, SVM can be applied to categorical data.

⑷ SVM is fundamentally a two-class method; for multi-class problems its direct effect is not as good.
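
A minimal scikit-learn sketch of an SVM with an RBF kernel on a nonlinearly separable toy problem; the dataset and the C/gamma values are illustrative assumptions:

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Concentric circles: not linearly separable in the input space
    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

    # The RBF kernel implicitly maps the data into a high-dimensional feature
    # space where a maximum-margin separating hyperplane can be found.
    clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)
    print(clf.score(X, y))             # training accuracy
    print(clf.support_vectors_.shape)  # support vectors x features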
