Common machine learning algorithms and their core ideas


Naive Bayes:

A few points to note:

1. If the given feature vectors can have different lengths, they need to be normalized to a common length (taking text classification as an example): if the features are the words of a sentence, the vector length is the size of the whole vocabulary, and each position holds the number of times the corresponding word appears.

2. The calculation formula is as follows:
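
(The original formula image is not reproduced here; the standard naive Bayes decision rule being described is, as far as it can be reconstructed:)

    P(c_i \mid w) = \frac{P(w \mid c_i)\,P(c_i)}{P(w)} \ \propto\ P(c_i) \prod_j P(w_j \mid c_i)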

The conditional probability can be expanded using the naive Bayes (conditional independence) assumption, i.e. P(w | c_i) = \prod_j P(w_j | c_i). Pay attention to how P(w_j | c_i) is estimated; there are generally two methods: the first takes the samples of class c_i, counts the total number of occurrences of w_j, and divides by the total number of samples of that class; the second takes the samples of class c_i, counts the total number of occurrences of w_j, and divides by the total number of occurrences of all features in those samples.

3. If one of the factors is 0, the product (the joint probability) will also be 0, i.e. the numerator of the formula in point 2 becomes 0. To avoid this, the counts in the numerator are generally initialized to 1; to keep the probabilities consistent, the denominator is then initialized to 2 (because there are 2 classes — with K classes you would add K). This technique is called Laplace smoothing; adding K to the denominator keeps the total probability equal to 1.
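
A minimal sketch of this counting scheme with Laplace smoothing, following the initialization described above (the function and variable names are illustrative, not from the original text):

```python
import numpy as np

def train_naive_bayes(doc_vectors, labels, num_classes=2):
    """Estimate log P(w_j | c_i) and log P(c_i) from word-count vectors.

    doc_vectors: (num_docs, vocab_size) array of word counts.
    labels:      (num_docs,) array of class indices 0..num_classes-1.
    """
    doc_vectors = np.asarray(doc_vectors, dtype=float)
    labels = np.asarray(labels)
    vocab_size = doc_vectors.shape[1]

    class_priors = np.zeros(num_classes)
    # Laplace smoothing: start numerator counts at 1 and denominators at num_classes.
    word_counts = np.ones((num_classes, vocab_size))
    total_counts = np.full(num_classes, float(num_classes))

    for c in range(num_classes):
        docs_c = doc_vectors[labels == c]
        class_priors[c] = len(docs_c) / len(doc_vectors)
        word_counts[c] += docs_c.sum(axis=0)
        total_counts[c] += docs_c.sum()   # "second method": divide by all feature occurrences

    log_cond_prob = np.log(word_counts / total_counts[:, None])
    return log_cond_prob, np.log(class_priors)

def classify(doc_vector, log_cond_prob, log_priors):
    # Work in log space so products of small probabilities do not underflow.
    scores = log_cond_prob @ np.asarray(doc_vector, dtype=float) + log_priors
    return int(np.argmax(scores))
```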

The advantages of Naive Bayes:

Works well on small-scale data, handles multi-class tasks, and is suitable for incremental (online) training.

Disadvantages:

It is sensitive to how the input data is represented.

Decision Tree:

One of the important points of a decision tree is selecting an attribute to branch on, so pay attention to the information-gain formula and understand it in depth.

The entropy of information is calculated as follows:
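
(The original formula image is not reproduced; the standard definition of entropy it refers to is:)

    H = -\sum_{i=1}^{n} p_i \log_2 p_i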

where n is the number of classes (for a binary problem, n = 2). Compute the proportions p1 and p2 of the two classes among all samples; this gives the information entropy before branching on any attribute.

Now select an attribute x_i for branching. The branching rule is: if x_i = v, the sample goes into one branch of the tree, otherwise it goes into the other branch. Each branch will likely still contain samples from both classes, so compute the entropies H1 and H2 of the two branches and the total entropy after branching H' = p1*H1 + p2*H2; the information gain is then ΔH = H − H'. Using information gain as the criterion, test all attributes one by one and select the attribute that maximizes the gain as the branching attribute.
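
A small sketch of this entropy and information-gain computation (the function names and the toy data are illustrative):

```python
import numpy as np

def entropy(labels):
    """H = -sum_i p_i * log2(p_i), over the classes present in `labels`."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature_values, labels, split_value):
    """Gain of the split `feature == split_value` vs. everything else."""
    labels = np.asarray(labels)
    mask = np.asarray(feature_values) == split_value
    h_after = 0.0
    for branch in (mask, ~mask):
        if branch.any():
            h_after += branch.mean() * entropy(labels[branch])
    return entropy(labels) - h_after

# Example: gain of splitting on x == 1 for a tiny 2-class data set.
print(information_gain([1, 1, 0, 0, 1], ['A', 'A', 'B', 'B', 'A'], 1))
```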

The advantages of the decision tree:

The computation is simple, the result is highly interpretable, it can handle samples with missing attribute values, and it can deal with irrelevant features.

Disadvantages:

Prone to overfitting (random forests, introduced later, reduce overfitting considerably);

Logistic regression:

Logistic regression is used for classification and is a kind of linear classifier. Points to note:

1. The logistic function expression is:

Its derivative form is:
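
(The original images are not reproduced; the standard forms are:)

    \sigma(z) = \frac{1}{1 + e^{-z}}

    \sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)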

2. Logistic regression is learned mainly by maximum likelihood estimation: write down the posterior probability of a single sample, take the product over the whole training set to get the likelihood of the entire sample, and then simplify it by taking the logarithm. The standard forms are:
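
(The original formula images are not reproduced; these are the textbook expressions, with h_\theta(x) = \sigma(\theta^T x).)

    P(y \mid x; \theta) = h_\theta(x)^{\,y} \bigl(1 - h_\theta(x)\bigr)^{1-y}

    L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{\,y^{(i)}} \bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}}

    \ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]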

3. Its loss function is in fact −ℓ(θ), so we need to minimize this loss function, which can be done with gradient descent. The gradient descent update takes the following form:
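
(Reconstructed from the standard derivation: for the log-likelihood above, gradient descent on −ℓ(θ) gives the batch update)

    \theta_j := \theta_j - \alpha \, \frac{\partial \bigl(-\ell(\theta)\bigr)}{\partial \theta_j} \;=\; \theta_j + \alpha \sum_{i=1}^{m} \bigl( y^{(i)} - h_\theta(x^{(i)}) \bigr) x_j^{(i)}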

Logistic regression Benefits:

1. The implementation is simple;

2. Classification requires very little computation; it is fast and uses little storage;

Disadvantages:

1. Prone to underfitting; the classification accuracy is generally not very high;

2. It can only handle binary classification problems directly (Softmax, derived on this basis, can be used for multi-class classification), and the data must be linearly separable;

Linear regression:

Linear regression is genuinely used for regression, unlike logistic regression, which is used for classification. Its basic idea is to optimize the least-squares error with gradient descent; of course, the parameters can also be obtained directly in closed form with the normal equation, giving:
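
(The original image is missing; the closed-form solution referred to is the normal equation:)

    \hat{\theta} = (X^T X)^{-1} X^T y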

In LWLR (locally weighted linear regression), the parameters are computed from a weighted version of the normal equation, because at this point the quantity being optimized is a locally weighted squared error (both are written out below):
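
(Reconstructed standard LWLR expressions, with W the diagonal matrix of sample weights and k the kernel bandwidth:)

    \hat{\theta} = (X^T W X)^{-1} X^T W y

    J(\theta) = \sum_{i=1}^{m} w^{(i)} \bigl( y^{(i)} - \theta^T x^{(i)} \bigr)^2, \qquad w^{(i)} = \exp\!\left( -\frac{\| x^{(i)} - x \|^2}{2 k^2} \right)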

This shows that, unlike ordinary linear regression, LWLR is a non-parametric model: every time a prediction is made, the training samples must be traversed at least once.

Advantages of linear regression:

Simple implementation and simple calculation;

Disadvantages:

Can not fit nonlinear data;

KNN algorithm:

KNN is the k-nearest-neighbor algorithm, and its main procedure is:

1. Compute the distance between the test sample and every sample point in the training set (common distance metrics include Euclidean distance, Mahalanobis distance, etc.);

2. Sort all of the above distance values;

3. Select the k samples with the smallest distances;

4. Vote according to the labels of these k samples to obtain the final classification, as in the sketch below;
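
A minimal sketch of this procedure using Euclidean distance and a majority vote (the names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(test_point, train_data, train_labels, k=3):
    """Classify `test_point` by majority vote among its k nearest training samples."""
    train_data = np.asarray(train_data, dtype=float)
    test_point = np.asarray(test_point, dtype=float)

    # 1. Euclidean distance from the test point to every training sample.
    distances = np.sqrt(((train_data - test_point) ** 2).sum(axis=1))
    # 2-3. Sort and keep the k closest samples.
    nearest = np.argsort(distances)[:k]
    # 4. Majority vote over their labels.
    votes = Counter(np.asarray(train_labels)[nearest])
    return votes.most_common(1)[0][0]
```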

How to choose an optimal value of k depends on the data. In general, a larger k reduces the effect of noise during classification, but it also blurs the boundaries between classes. A good value of k can be obtained with various heuristic techniques, such as cross-validation. In addition, noise and irrelevant feature dimensions reduce the accuracy of the k-nearest-neighbor algorithm.

The nearest-neighbor algorithm has a strong consistency result: as the amount of data tends to infinity, the algorithm is guaranteed to have an error rate no worse than twice the Bayes error rate. For some good values of k, the k-nearest-neighbor error rate is guaranteed not to exceed the Bayes error rate.

Note: the Mahalanobis distance requires the statistical properties of the sample set, such as the mean vector and the covariance matrix. The Mahalanobis distance is defined as follows:
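
(Reconstructed standard definition, for a point x relative to a distribution with mean μ and covariance Σ, and for two points x, y from the same distribution:)

    D_M(x) = \sqrt{ (x - \mu)^T \Sigma^{-1} (x - \mu) }

    D(x, y) = \sqrt{ (x - y)^T \Sigma^{-1} (x - y) }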

The advantages of KNN algorithm:

1. The idea is simple and the theory is mature; it can be used for both classification and regression;

2. Can be used for non-linear classification;

3. The training time complexity is O(n);

4. High accuracy, no assumptions about the data, and not sensitive to outliers;

Disadvantages:

1. Heavy computation at prediction time;

2. Sensitive to sample imbalance (i.e., some classes have a large number of samples while others have very few);

3. Requires a lot of memory;

SVM:

Learn how to use LIBSVM and pick up some parameter-tuning experience, but it is also necessary to understand the ideas behind the SVM algorithm:

1. The optimal separating hyperplane in SVM is the one with the maximum geometric margin over all samples. (Why choose the maximum-margin classifier, from a mathematical point of view? This was asked during a NetEase deep learning position interview.) The answer is that there is a relationship between the geometric margin and the number of mistakes made on the samples: the denominator is the distance from the samples to the separating hyperplane, and R in the numerator is the largest norm among all sample vectors, i.e.:
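
(The original image is missing. The relationship being described is most likely the classic margin-based mistake bound, e.g. for the perceptron, where γ is the geometric margin and R the radius of the data:)

    \text{number of mistakes} \ \leq\ \left( \frac{R}{\gamma} \right)^2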

After a series of derivations, the following primal optimization objective is obtained:
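
(Reconstructed standard hard-margin primal:)

    \min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y^{(i)} \bigl( w^T x^{(i)} + b \bigr) \geq 1, \ \ i = 1, \dots, m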

2. Here's a look at the Lagrange theory:

The optimization objective in point 1 can be converted into Lagrangian form (via dual optimization and the KKT conditions), and the objective function becomes:
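
(Reconstructed standard Lagrangian:)

    L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \Bigl[ y^{(i)} \bigl( w^T x^{(i)} + b \bigr) - 1 \Bigr], \qquad \alpha_i \geq 0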

We only need to minimize the above objective function, where α are the Lagrange multipliers of the inequality constraints in the original optimization problem.

3. Take the derivatives of the Lagrangian in point 2 with respect to w and b and set them to zero:
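
(Setting the derivatives to zero gives the standard conditions:)

    \frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}

    \frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{m} \alpha_i y^{(i)} = 0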

From the first equation above, once we have optimized and obtained α we can directly compute w, i.e. the model parameters are determined. The second equation serves as a constraint in the subsequent optimization.

4. Using duality theory, the last objective function in point 2 can be converted into optimizing the following dual objective:
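
(Reconstructed standard hard-margin dual:)

    \max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

    \text{s.t.} \quad \alpha_i \geq 0, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0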

This function can be solved with the usual optimization methods to obtain α, and then w and b.

5. In principle, the simple theory of SVM would end here. However, one more point should be added: at prediction time we have:
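
(The standard prediction function, written in terms of inner products:)

    f(x) = w^T x + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b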

The inner product (the angle-bracket term) can be replaced with a kernel function, which is why SVM is so often mentioned together with kernel functions.

6. Finally, slack variables are introduced, so the primal optimization objective becomes:
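
(Reconstructed standard soft-margin primal with slack variables ξ and penalty C:)

    \min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y^{(i)} \bigl( w^T x^{(i)} + b \bigr) \geq 1 - \xi_i, \ \ \xi_i \geq 0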

At this point the corresponding dual optimization formula is:
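
(The corresponding soft-margin dual; the only change is the box constraint on α:)

    \max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

    \text{s.t.} \quad 0 \leq \alpha_i \leq C, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0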

Compared with the previous dual, each α_i now also has an upper bound C.

Advantages of SVM algorithm:

It can be used for linear and nonlinear classification, and also for regression;

Low generalization error;

Easy to interpret;

Low computational complexity;

Disadvantages:

It is sensitive to the choice of parameters and of the kernel function;

The original SVM is only good at handling binary classification problems;

Boosting:

Mainly take AdaBoost as an example, first look at the flow chart of AdaBoost, as follows:

As the figure shows, during training we need to train several weak classifiers (3 in the figure). Each weak classifier is trained on samples with different weights (5 training samples in the figure); the first weak classifier's sample weights are simply the initial input weights. Each weak classifier also contributes differently to the final classification result, which is a weighted combination of their outputs, with the weights shown inside the triangles in the figure. So how are these weak classifiers and their corresponding weights trained?

The following is an example to illustrate briefly.

In the book (Machine Learning in Action) there are assumed to be 5 training samples, each of dimension 2, and when training the first classifier the 5 sample weights are each 0.2. Note that the sample weights D are different from the weights α of the weak classifiers in the final ensemble: the sample weights are used only during training, while α is used both during training and at test time.

Now suppose the weak classifier is a simple single-node decision tree (a decision stump): it selects one of the 2 attributes (assuming there are only 2 attributes) and finds the best threshold on that attribute for classification.

The simple version of the AdaBoost training process is as follows:

1. Train the first classifier. The sample weights D are all equal. With this weak classifier, obtain the predicted labels of the 5 samples (refer to the example in the book, still Machine Learning in Action) and compare them with the samples' true labels. If a sample is predicted incorrectly, its error value is its sample weight; if it is classified correctly, its error value is 0. The sum over the 5 samples is the weighted error rate, denoted ε.

2. Use ε to calculate the weight α of this weak classifier (the formula is given in the block after step 4):

3. Use α to update the sample weights for the next weak classifier. If a sample was classified correctly, its weight is decreased (formula after step 4):

If the sample was classified incorrectly, its weight is increased (formula after step 4):

4. Repeat the above steps to train multiple classifiers; only their weight vectors D differ.
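
(The original formula images are missing; the standard AdaBoost formulas they refer to, as in Machine Learning in Action, are:)

    \alpha = \frac{1}{2} \ln\!\left( \frac{1 - \varepsilon}{\varepsilon} \right)

    D_i^{(t+1)} = \frac{D_i^{(t)} e^{-\alpha}}{\text{Sum}(D)} \quad \text{(sample } i \text{ classified correctly)}

    D_i^{(t+1)} = \frac{D_i^{(t)} e^{\alpha}}{\text{Sum}(D)} \quad \text{(sample } i \text{ classified incorrectly)}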

The test process is as follows:

Feed the sample into each of the trained weak classifiers; each weak classifier outputs a label, that label is multiplied by the corresponding α, and the sign of the final sum is the predicted label.

Advantages of the boosting algorithm:

Low generalization error;

Easy to implement, high classification accuracy, and not many parameters to tune;

Disadvantages:

More sensitive to outliers;

Clustering:

Organized by the underlying clustering idea:

1. Partition-based clustering:

K-means, k-medoids (each cluster is represented by one of its sample points), CLARANS.

K-means minimizes the value of the following expression:
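
(Reconstructed: the K-means objective, where μ_i is the centroid of cluster C_i.)

    J = \sum_{i=1}^{K} \sum_{x \in C_i} \| x - \mu_i \|^2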

Advantages of the K-means algorithm:

(1) K-means is a classic algorithm for clustering problems; it is simple and fast.

(2) For large data sets, the algorithm is relatively scalable and efficient, because its complexity is about O(nKt), where n is the number of objects, K is the number of clusters, and t is the number of iterations. Usually K << n. The algorithm usually converges only to a local optimum.

(3) The algorithm tries to find the K partitions that minimize the squared-error function. The clustering result is better when the clusters are dense and roughly spherical, and when clusters are clearly separated from each other.

Disadvantages:

(1) The k-means method can only be used when the mean of a cluster is defined; it is not suitable for some data with categorical attributes.

(2) It requires the user to specify the number of clusters K in advance.

(3) It is sensitive to the initial values; different initializations may lead to different clustering results.

(4) It is not suitable for discovering non-convex clusters or clusters of very different sizes.

(5) It is sensitive to noise and outliers; a small amount of such data can heavily influence the means.

2. Hierarchy-based clustering:

Bottom-up agglomerative methods, such as AGNES.

Top-down divisive methods, such as DIANA.

3. Density-based clustering:

DBSCAN, OPTICS, BIRCH (CF-tree), CURE.

4. Grid-Based approach:

STING, WaveCluster.

5. Model-based clustering:

EM, SOM, COBWEB.

These algorithms can be found in the introduction to clustering (Baidu Encyclopedia).

Recommended system:

The implementation of recommendation systems is mainly divided into two approaches: content-based implementations and collaborative filtering implementations.

Content-based implementations:

Different people's ratings of different movies can be viewed as an ordinary regression problem, so a feature vector (the x value) needs to be extracted for each movie in advance; then a model is built for each user, with that user's ratings as the y values. The existing ratings y and the movie features x are used to train a regression model (most commonly linear regression), which can then predict the scores of the movies the user has not rated. (Note that a separate regression model is built for every user.)

From another point of view, one can instead fix each user's degree of preference for a particular kind of film (i.e. the weights), then learn the features of each film, and finally use regression to predict ratings for the films that have not been rated.

Of course, it is also possible to jointly optimize each user's preference weights and each film's features. For details, see Ng's ML course on Coursera: https://www.coursera.org/course/ml

Based on the implementation of collaborative filtering:

Collaborative filtering (CF) can be viewed as a classification problem or as a matrix-factorization problem. Collaborative filtering relies mainly on the preferences expressed by each user and does not depend on personal profile information. For example, in the movie-rating case just mentioned, predicting the ratings of unrated movies depends only on the ratings that already exist, and does not require learning the features of those movies.

SVD decomposes a matrix into the product of three matrices, as follows:
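
(Reconstructed standard SVD factorization for a data matrix A:)

    A_{m \times n} = U_{m \times m} \, \Sigma_{m \times n} \, V^T_{n \times n}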

The middle matrix Σ is a diagonal matrix whose diagonal entries are the singular values of the data matrix (note that singular values are not the same as eigenvalues), arranged from largest to smallest. Even if the components corresponding to the small singular values are removed, the original matrix can still be reconstructed well, as shown below:

The darker color marks the parts of the three matrices that are dropped when reconstructing with only the large singular values.

If m represents the number of items and n the number of users, then each row of the U matrix represents the attributes of an item, and after dimensionality reduction each item's attributes can be represented in a lower dimension (say k dimensions, the darker part of U). Then, given a new user's rating vector x over the items, it can be projected by the formula x' * U_k * inv(S_k) into a k-dimensional vector, and the most similar users can then be found in V' (similarity can be measured with the cosine formula, etc.); recommendations are made based on those users' ratings (mainly recommending to the new user the items they have not rated). A concrete example can be found on the web page: the application of SVD in recommender systems.
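
A minimal numpy sketch of this projection-and-similarity step, under the item-by-user layout assumed above (the names and the toy matrix are illustrative, not from the original article):

```python
import numpy as np

# Toy item-by-user rating matrix A (m items x n users); 0 means "not rated".
A = np.array([[5, 5, 0, 1],
              [4, 5, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)

k = 2                                    # number of singular values to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]                           # item factors  (m x k)
S_k_inv = np.diag(1.0 / s[:k])           # inverse of the truncated Sigma
V_k = Vt[:k, :].T                        # user factors  (n x k)

# Project a new user's rating vector (length m) into the k-dimensional user space:
x = np.array([5, 4, 0, 0, 1], dtype=float)
x_k = x @ U_k @ S_k_inv                  # the x' * U_k * inv(S_k) formula above

# Cosine similarity between the new user and each existing user.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

similarities = [cosine(x_k, V_k[j]) for j in range(V_k.shape[0])]
most_similar_user = int(np.argmax(similarities))
print("most similar existing user:", most_similar_user)
```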

For the practical meaning of each matrix in the SVD decomposition, see Wu Jun's book "The Beauty of Mathematics" (personally, it seems the roles of the U and V matrices are swapped in Wu's explanation, but opinions may differ), or the SVD section of Machine Learning in Action.

pLSA:

pLSA was developed from LSA; early LSA implementations are mainly based on SVD decomposition. The pLSA graphical model relates documents d, latent topics z, and words w.
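
(The original diagram and formula are not reproduced; the pLSA mixture decomposition they describe is:)

    p(d, w) = p(d) \sum_{z} p(z \mid d) \, p(w \mid z)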

where p(d) is the probability of picking a document, p(z | d) is the topic distribution of a document, and p(w | z) is the word distribution of a topic.

See the corresponding topic-model talk in the 2010 Dragon Star Program machine learning course.

LDA:

LDA is also a topic model; its probabilistic graphical model is similar to pLSA's but with prior distributions added.

Unlike pLSA, LDA assumes prior distributions for many quantities; the priors of the parameters are generally assumed to be Dirichlet distributions, the reason being that the Dirichlet is conjugate to the multinomial, so the prior and the posterior have the same functional form.

GBDT:

GBDT (Gradient Boosting Decision Tree), also known as MART (Multiple Additive Regression Trees), seems to be used a lot inside Alibaba (so an Alibaba algorithm-position interview may ask about it). It is an iterative decision-tree algorithm consisting of multiple decision trees, and the sum of the outputs of all the trees is the final answer. When it was first proposed it was regarded, together with SVM, as an algorithm with strong generalization ability. In recent years it has also drawn attention as a machine learning model used in search ranking.

GBDT builds regression trees, not classification trees. The core idea is that each tree learns the residuals of all the previous trees combined. To prevent overfitting, boosting is also incorporated, as in AdaBoost.
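
A minimal sketch of the "each tree fits the residuals of the previous trees" idea, using shallow regression trees from scikit-learn and a learning rate as the usual shrinkage step (this is an illustration, not code from the original text):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared error: each tree fits the current residuals."""
    y = np.asarray(y, dtype=float)
    base = float(np.mean(y))
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                  # negative gradient for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_gbdt(X, base, trees, learning_rate=0.1):
    out = np.full(len(X), float(base))
    for tree in trees:
        out += learning_rate * tree.predict(X)      # sum of all trees' outputs
    return out
```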

An introduction to GBDT can be found in: GBDT (MART) Iterative Decision Tree Tutorial | Introduction.

Regularization:

Its purposes are (this was asked during a NetEase phone interview):

1. It makes the problem easier to solve numerically;

2. It is more stable when the number of features is very large;

3. It controls the complexity and smoothness of the model. The lower the complexity and the smoother the objective function, the better its generalization ability; adding the regularization term reduces the complexity of the objective function and makes it smoother.

4. It shrinks the parameter space; the smaller the parameter space, the lower the complexity.

5. The smaller the coefficients, the simpler the model, and the simpler the model, the better the generalization ability (Ng's intuitive explanation).

6. It can be viewed as a Gaussian prior on the weights.
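
As an illustration (not from the original text), the typical L2-regularized objective, assuming a squared-error loss, is:

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

Point 6 corresponds to MAP estimation: a zero-mean Gaussian prior \theta_j \sim \mathcal{N}(0, \tau^2) on the weights produces exactly this squared-norm penalty in the log posterior.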

Anomaly Detection:

Estimate the density function of the samples; for a new sample, compute its density directly, and if the density value is below a certain threshold the sample is flagged as anomalous. The density function typically uses a multivariate Gaussian distribution: if a sample has n dimensions, each dimension's feature is assumed to follow a Gaussian distribution; even if a feature does not look Gaussian, it can be mathematically transformed to look more Gaussian, for example x = log(x + c), x = x^(1/c), and so on. The anomaly-detection algorithm flow is as follows:
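
A minimal sketch of this flow under the independent-per-dimension Gaussian assumption described above (the names are illustrative):

```python
import numpy as np

def fit_gaussian_params(X):
    """Per-dimension mean and variance (features treated as independent Gaussians)."""
    X = np.asarray(X, dtype=float)
    return X.mean(axis=0), X.var(axis=0) + 1e-12   # small constant avoids zero variance

def density(x, mu, var):
    """p(x) = product over dimensions of the 1-D Gaussian densities."""
    x = np.asarray(x, dtype=float)
    coef = 1.0 / np.sqrt(2.0 * np.pi * var)
    return float(np.prod(coef * np.exp(-(x - mu) ** 2 / (2.0 * var))))

def is_anomaly(x, mu, var, epsilon):
    # Flag the sample as anomalous when its density falls below the threshold.
    return density(x, mu, var) < epsilon

# epsilon is chosen on a labelled cross-validation set, as described below.
```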

The threshold ε is chosen by cross-validation; that is, in anomaly detection the earlier step of learning p(x) is unsupervised, while learning the parameter ε is supervised. So why not use an ordinary supervised approach throughout (i.e. treat it as an ordinary binary classification problem)? Mainly because in anomaly detection the number of anomalous samples is very small while the number of normal samples is very large, so there is not enough data to learn a good model of anomalous behavior; moreover, new anomalies may look completely unlike any anomaly pattern in the training set.

Also, the above treats each dimension's feature as an independent Gaussian distribution. This approximation is not the best, but its computational cost is low, so it is used frequently. A better approach is to model the features jointly with a multivariate Gaussian distribution, which captures correlations between features, but the computation becomes more expensive and the sample covariance matrix may be singular (mainly when the number of samples is smaller than the number of features, or when some feature dimensions are linearly dependent).

The above content is covered in Ng's course: https://www.coursera.org/course/ml

EM algorithm:

Sometimes the generation of the samples involves hidden (latent) variables that cannot be observed. Model parameters are generally fit by maximum likelihood estimation, but because of the hidden variables the likelihood cannot be maximized by simply setting derivatives to zero, so the EM algorithm can be used to find the model parameters (of which there may be several). The EM algorithm generally consists of 2 steps:

E step: for the current parameters, compute the conditional (posterior) probability of the hidden variables under those parameters;

M step: using the conditional probabilities of the hidden variables obtained in the E step, maximize the expected complete-data log-likelihood (essentially an expectation, which lower-bounds the likelihood).

Repeat the 2 steps above until convergence.

The formula is as follows:
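
(The original images are missing; the standard EM lower bound they refer to, obtained with Jensen's inequality, is:)

    \ell(\theta) = \sum_i \log \sum_{z^{(i)}} p\bigl(x^{(i)}, z^{(i)}; \theta\bigr) \ \geq\ \sum_i \sum_{z^{(i)}} Q_i\bigl(z^{(i)}\bigr) \log \frac{p\bigl(x^{(i)}, z^{(i)}; \theta\bigr)}{Q_i\bigl(z^{(i)}\bigr)}

    \text{E step: } Q_i\bigl(z^{(i)}\bigr) := p\bigl(z^{(i)} \mid x^{(i)}; \theta\bigr)

    \text{M step: } \theta := \arg\max_{\theta} \sum_i \sum_{z^{(i)}} Q_i\bigl(z^{(i)}\bigr) \log \frac{p\bigl(x^{(i)}, z^{(i)}; \theta\bigr)}{Q_i\bigl(z^{(i)}\bigr)}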

The lower-bound function maximized in the M step is derived via Jensen's inequality, as in the inequality above.

A common application of the EM algorithm is the GMM model: each sample may be generated by any one of the K Gaussians, with each Gaussian generating it with a different probability, so each sample has a corresponding Gaussian component (one of the K); the hidden variable here is the Gaussian component that each sample corresponds to.

GMM's E-step formula computes the probability (responsibility) that each sample belongs to each Gaussian, and the M-step formula re-estimates the three parameters of each Gaussian: its mixing weight, mean, and covariance. The standard forms are:
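
(The original images are missing; these are the textbook GMM update equations, with γ denoting the responsibilities.)

    \text{E step: } \gamma_{ik} = \frac{\pi_k \, \mathcal{N}\bigl(x^{(i)} \mid \mu_k, \Sigma_k\bigr)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}\bigl(x^{(i)} \mid \mu_j, \Sigma_j\bigr)}

    \text{M step: } \pi_k = \frac{1}{m} \sum_{i=1}^{m} \gamma_{ik}, \qquad \mu_k = \frac{\sum_i \gamma_{ik} \, x^{(i)}}{\sum_i \gamma_{ik}}, \qquad \Sigma_k = \frac{\sum_i \gamma_{ik} \bigl(x^{(i)} - \mu_k\bigr)\bigl(x^{(i)} - \mu_k\bigr)^T}{\sum_i \gamma_{ik}}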

For the EM algorithm, refer to Ng's cs229 course notes or the NetEase open course: Stanford University Open Course: Machine Learning.

Apriori:

Apriori is one of the earlier methods in association analysis; it is mainly used to mine frequent itemsets. Its ideas are:

1. If an itemset is not frequent, then any itemset that contains it cannot be frequent either;

2. If an itemset is frequent, then every non-empty subset of it is also frequent;

Apriori needs to scan the transaction table many times. Starting from single items, it scans and removes the items that are not frequent; the resulting collection is called L. Then the elements of L are combined with each other to generate candidate itemsets with one more item than in the previous scan; this collection is called C. Another scan then removes the infrequent itemsets, and the process repeats (a sketch of this loop is given below).
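
A compact sketch of this scan-combine-prune loop, with minimum support given as a count (the function names and the toy transactions are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Return all frequent itemsets (as frozensets) with support >= min_support."""
    transactions = [set(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent single items.
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(current)

    k = 2
    while current:
        # Candidate generation C_k: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune with the Apriori principle, then check support with one more scan.
        candidates = [c for c in candidates
                      if all(frozenset(s) in set(current) for s in combinations(c, k - 1))]
        current = [c for c in candidates if support(c) >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

print(apriori([[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]], min_support=2))
```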

Look at the following example:

The transaction table:

If no infrequent itemsets were removed at each step, the tree structure of the scanning process would look as follows:

Removing the infrequent itemsets at one of the steps (shown shaded) gives:

The above content mainly follows the book Machine Learning in Action.

FP Growth:

FP-growth is a more efficient method than Apriori for mining frequent itemsets; it only needs to scan the transaction table twice. The first scan obtains the frequency of each item, removes items that do not meet the support threshold, and sorts the remaining items. The second scan builds an FP-tree (frequent-pattern tree).

The remaining work is then to mine the FP-tree.

For example, given the following transaction table:

It corresponds to the following FP-tree:

Then, starting from the lowest-frequency single item p, find p's conditional pattern base, build the FP-tree of p's conditional pattern base with the same method used to build the original FP-tree, and find the frequent itemsets containing p in this tree.

Mining frequent itemsets from the conditional pattern bases of m, b, a, c, f works in the same way; some items (such as the m node) need recursive mining, which is more involved. The specific process is described in detail in the blog post: Frequent Pattern Mining II (the FP-growth algorithm).

Resources:

    1. Harrington, P. (2012). Machine Learning in Action. Manning Publications Co.
    2. Nearest neighbor algorithm (Wikipedia)
    3. Mahalanobis distance (Wikipedia)
    4. Clustering (Baidu Encyclopedia)
    5. https://www.coursera.org/course/ml
    6. The application of SVD in recommender systems
    7. Wu, J. (2012). The Beauty of Mathematics. People's Posts and Telecommunications Press.
    8. 2010 Dragon Star Program machine learning video tutorials
    9. GBDT (MART) Iterative Decision Tree Tutorial | Introduction
    10. Ng's cs229 course materials
    11. Stanford University Open Course: Machine Learning
    12. Frequent Pattern Mining II (the FP-growth algorithm)

Transferred from: http://blog.csdn.net/puluotianyi/article/details/42395835
