A Brief Introduction to the Principles of Common Machine Learning Algorithms (LDA, CNN, LR)

1.LDA

LDA is a three-layer Bayesian model; the three layers are the document layer, the topic layer, and the word layer. The model is based on the following assumptions:
1) The entire document collection contains K independent topics;
2) Each topic is a multinomial distribution over words;
3) Each document is a random mixture of the K topics;
4) Each document is a multinomial distribution over the K topics;
5) The prior of each document's topic distribution is a Dirichlet distribution;
6) The prior of each topic's word distribution is a Dirichlet distribution.
A document is generated as follows:
1) For each topic, sample its word distribution parameter φ from the Dirichlet distribution with parameter β;
2) For each document m in the corpus, sample its topic distribution parameter θ from the Dirichlet distribution with parameter α;
3) For the n-th word w_mn in document m, first sample a latent topic z_mn according to the distribution θ of document m, then sample the word w_mn from the word distribution φ of topic z_mn.

Thus the joint distribution of the entire model is as follows:
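In standard LDA notation, with w the words, z the topic assignments, θ the document-topic distributions and φ the topic-word distributions, it takes the form

$$ p(\mathbf{w}, \mathbf{z}, \theta, \varphi \mid \alpha, \beta) = \prod_{k=1}^{K} p(\varphi_k \mid \beta) \prod_{m=1}^{M} p(\theta_m \mid \alpha) \prod_{n=1}^{N_m} p(z_{m,n} \mid \theta_m)\, p(w_{m,n} \mid \varphi_{z_{m,n}}) $$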

Integrating the joint distribution to remove the intermediate hidden variables θ and φ gives:
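With Δ(·) denoting the Dirichlet normalizing constant, n_m the vector of topic counts in document m and n_k the vector of word counts for topic k, the standard collapsed form is

$$ p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = \prod_{m=1}^{M} \frac{\Delta(\mathbf{n}_m + \alpha)}{\Delta(\alpha)} \prod_{k=1}^{K} \frac{\Delta(\mathbf{n}_k + \beta)}{\Delta(\beta)} $$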

The intermediate parameters θ and φ can thus be eliminated, and the sampler works directly with the transition probability of the topic assignments, which is as follows:
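In the usual counting notation, where token i is word t in document m and the superscript ¬i means that the counts exclude token i, this full conditional is

$$ p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\; \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{v} \big(n_{k,\neg i}^{(v)} + \beta_v\big)} \cdot \big(n_{m,\neg i}^{(k)} + \alpha_k\big) $$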

We can therefore iterate with Gibbs sampling. Each iteration proceeds as follows: draw a random number from a uniform distribution, compute the transition probability to each topic according to the formula above, use the cumulative probabilities to determine which new topic the random number falls into, and update the count matrices. This is repeated until convergence.
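The sampling loop can be sketched as follows. This is a minimal illustration in Python with NumPy, assuming the corpus is already tokenized into lists of word ids; the array names (n_mk, n_kt, n_k) and the hyperparameter defaults are choices made here purely for illustration.

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, n_iter=100):
    """Collapsed Gibbs sampling for LDA.
    docs: list of documents, each a list of word ids in [0, V).
    V: vocabulary size, K: number of topics.
    """
    M = len(docs)
    n_mk = np.zeros((M, K))          # topic counts per document
    n_kt = np.zeros((K, V))          # word counts per topic
    n_k = np.zeros(K)                # total words assigned to each topic
    z = []                           # topic assignment of every token

    # random initialization of topic assignments and counts
    for m, doc in enumerate(docs):
        z_m = []
        for t in doc:
            k = np.random.randint(K)
            z_m.append(k)
            n_mk[m, k] += 1
            n_kt[k, t] += 1
            n_k[k] += 1
        z.append(z_m)

    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for n, t in enumerate(doc):
                k = z[m][n]
                # remove the current token from the counts
                n_mk[m, k] -= 1; n_kt[k, t] -= 1; n_k[k] -= 1
                # full conditional over topics (the transition probability above)
                p = (n_kt[:, t] + beta) / (n_k + V * beta) * (n_mk[m] + alpha)
                p /= p.sum()
                # draw the new topic from the cumulative distribution
                k = min(np.searchsorted(np.cumsum(p), np.random.rand()), K - 1)
                z[m][n] = k
                n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

    phi = (n_kt + beta) / (n_kt.sum(axis=1, keepdims=True) + V * beta)
    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
    return theta, phi
```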

2.CNN

2.1 Multi-layer Perceptron Basics

An example of the structure of a single perceptron is as follows:


where f is an activation function, typically the sigmoid function.
When multiple such units are combined in a layered structure, a multilayer perceptron (neural network) is formed. The figure shows a neural network with one hidden layer (3 nodes) and a single-node output layer.
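As a concrete illustration of the forward pass, here is a minimal NumPy sketch with one hidden layer of 3 nodes and a single-node output; the input size and the random weights are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one hidden layer (3 nodes) and a single-node output layer
W1 = np.random.randn(3, 4)   # hidden weights: 3 hidden units, 4 inputs
b1 = np.zeros(3)
W2 = np.random.randn(1, 3)   # output weights: 1 output unit
b2 = np.zeros(1)

def forward(x):
    h = sigmoid(W1 @ x + b1)   # hidden layer activations
    y = sigmoid(W2 @ h + b2)   # output
    return y

print(forward(np.array([0.5, -1.2, 3.0, 0.1])))
```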

2.2 Convolutional Neural Networks

2.2.1 Structural features

In image processing, an image is often represented as a vector of pixels; a 1000x1000 image, for example, becomes a vector of 10^6 elements. In the neural network above, if the hidden layer has the same number of units as the input layer, that is also 10^6, then the input-to-hidden weights alone amount to 10^6 x 10^6 = 10^12 parameters, which is far too many to train. The number of network parameters therefore needs to be reduced.
A convolutional network is a kind of multilayer perceptron specially designed to recognize two-dimensional shapes; it is highly invariant to translation, scaling, tilting and other forms of deformation. This good performance is learned by the network in a supervised manner. The structure of the network features sparse connectivity and weight sharing, and includes the following forms of constraints:
1) Feature extraction. Each neuron takes its input from a local receptive field in the previous layer, which forces it to extract local features. Once a feature has been extracted, its exact position becomes less important, as long as its position relative to other features is approximately preserved.
2) Feature mapping. Each computational layer of the network is composed of multiple feature maps, each of which is a plane, and all neurons within a plane are constrained to share the same set of weights.
3) Sub-sampling. Each convolutional layer is followed by a computational layer that performs local averaging and sub-sampling, reducing the resolution of the feature map. This operation reduces the sensitivity of the feature map's output to translation and other forms of deformation.
All weights in all layers of a convolutional network are learned through supervised training, and the network automatically extracts features during learning.
A convolutional neural network usually consists of convolutional layers alternating with sub-sampling layers. The figure shows an example:

The input image passes through a convolutional layer, a sub-sampling layer, another convolutional layer and another sub-sampling layer, and the output is then produced by a fully connected layer.

2.2.2 Convolutional layer

Convolutional layers are implemented through weight sharing. The units that share weights form a feature map, as shown in the figure.

In the figure there are 3 hidden-layer nodes belonging to the same feature map, and links of the same color share the same weight. These weights can still be learned with gradient descent; the only small change to the original algorithm is that the gradient of a shared weight is the sum of the gradients of all the parameters that share it.
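A minimal NumPy sketch of this weight sharing, assuming a single input channel and a single feature map with a "valid" (no padding) convolution; the image size and kernel values are arbitrary choices for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: the same kernel (shared weights) is applied at every position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])   # one shared 3x3 weight set -> one feature map
feature_map = conv2d(image, kernel)
print(feature_map.shape)             # (6, 6)
```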

2.2.3 Sub-sampling layer

The sub-sampling layer is implemented through local perception. It is generally believed that human perception of the outside world proceeds from the local to the global, and the spatial relations in an image are likewise local: nearby pixels are strongly correlated, while distant pixels are only weakly correlated. It is therefore unnecessary for each neuron to perceive the entire image; each neuron perceives only a local region, and the local information is combined at higher levels to obtain global information. As shown in the figure: the network on the left is fully connected, and the one on the right is locally connected.
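A minimal NumPy sketch of the local averaging and sub-sampling operation described in the constraints above (2x2 average pooling; the window size is an arbitrary choice here).

```python
import numpy as np

def avg_pool2d(feature_map, size=2):
    """Local averaging and sub-sampling: each size x size block is replaced by its mean."""
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    out = feature_map[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    return out.mean(axis=(1, 3))

fm = np.arange(36, dtype=float).reshape(6, 6)
print(avg_pool2d(fm))   # resolution reduced from 6x6 to 3x3
```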

3.LR

A linear regression model is generally written as h_θ(x) = θ^T x. Its output ranges over the entire real line, so it can in principle be used for binary classification, but in practice a binary classification problem usually calls for a probability in [0, 1], for example a probability of illness of 0.9 or 0.1. The sigmoid function g(z) satisfies this requirement by mapping the output of the linear regression into [0, 1].
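The sigmoid function has the standard form

$$ g(z) = \frac{1}{1 + e^{-z}} $$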

With g(z), we obtain the probability p(y = 1 | x, θ) that sample x belongs to class 1 and the probability p(y = 0 | x, θ) that it belongs to class 0, which gives the logistic regression form:
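$$ p(y = 1 \mid x, \theta) = g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}}, \qquad p(y = 0 \mid x, \theta) = 1 - g(\theta^{T} x) $$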

The classification threshold is 0.5, and the corresponding decision function is:
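$$ y = \begin{cases} 1, & p(y = 1 \mid x, \theta) \ge 0.5 \\ 0, & \text{otherwise} \end{cases} $$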

Different classification thresholds yield different classification results. If high precision on the positive class is required, a larger threshold such as 0.6 can be chosen; if high recall on the positive class is required, a smaller threshold such as 0.3 can be chosen.
After the transformation, the classification surface (decision boundary) is the same as that of the original linear regression, namely θ^T x = 0.

3.1 Solving for the Parameters

Once the mathematical form of the model is determined, what remains is solving for its parameters. One of the most common methods in statistics is maximum likelihood estimation: find the set of parameters under which the likelihood (probability) of the observed data is largest. In the logistic regression model, the likelihood can be expressed as:
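For N training samples (x_i, y_i) with y_i ∈ {0, 1}:

$$ L(\theta) = \prod_{i=1}^{N} p(y_i \mid x_i, \theta) = \prod_{i=1}^{N} g(\theta^{T} x_i)^{y_i} \big(1 - g(\theta^{T} x_i)\big)^{1 - y_i} $$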

Taking the logarithm gives the log-likelihood:
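$$ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{N} \Big[ y_i \log g(\theta^{T} x_i) + (1 - y_i) \log\big(1 - g(\theta^{T} x_i)\big) \Big] $$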

On the other hand, in machine learning we often encounter the concept of a loss function, which measures the error of the model's predictions; the smaller its value, the better the model predicts. Commonly used loss functions include the 0-1 loss, log loss, hinge loss, and so on. The log loss on a single sample point is defined as:
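Writing p = g(θ^T x) for the predicted probability of the positive class:

$$ \mathrm{logloss}(y, p) = -\,y \log p - (1 - y) \log(1 - p) $$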

Defining the average log loss over the entire data set, we get:
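$$ J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{logloss}\big(y_i, g(\theta^{T} x_i)\big) = -\frac{1}{N}\, \ell(\theta) $$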

That is, in the logistic regression model, maximizing the likelihood and minimizing the log loss are equivalent. There are many methods for solving this optimization problem; gradient descent is used as an example here. Gradient descent, also known as steepest descent, is an iterative method that approaches the optimum by adjusting the parameters at each step in the direction in which the objective function changes fastest. The basic steps are as follows (a code sketch follows the list):
1) Select the descent direction (the negative gradient direction);
2) Select a step size and update the parameters;
3) Repeat these two steps until the termination condition is met.
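A minimal NumPy sketch of logistic regression trained by gradient descent; the learning rate, iteration count and toy data are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iter=1000):
    """X: (N, d) feature matrix, y: (N,) labels in {0, 1}."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ theta)          # predicted probabilities
        grad = X.T @ (p - y) / N        # gradient of the average log loss J(theta)
        theta -= lr * grad              # step along the negative gradient
    return theta

# toy usage
X = np.array([[1., 0.5], [1., 1.5], [1., 3.0], [1., 4.5]])  # first column acts as a bias term
y = np.array([0., 0., 1., 1.])
theta = train_logistic_regression(X, y)
print(sigmoid(X @ theta))   # predicted probabilities for the training points
```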

3.2 Classification boundaries

Once we know how to solve for the parameters, let us look at what the model ultimately gives us. From the sigmoid function it is easy to see that, taking 0.5 as the classification threshold, y = 1 when θ^T x > 0 and y = 0 otherwise; θ^T x = 0 is the classification surface implied by the model (in high-dimensional space it is generally called a hyperplane). Logistic regression is therefore essentially a linear model, but this does not mean that only linearly separable data can be handled by LR: a low-dimensional space can be mapped into a high-dimensional space by feature transformation, and data that is not linearly separable in the low-dimensional space is more likely to be linearly separable in the high-dimensional space. The two figures below compare a linear classification boundary and a nonlinear classification boundary (obtained through feature mapping).

The left figure shows a linearly separable data set. The right figure is not linearly separable in the original space, but after the feature transformation [x1, x2] => [x1, x2, x1^2, x2^2, x1*x2] it becomes linearly separable; the corresponding classification boundary in the original space is roughly elliptical.
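A minimal sketch of this feature transformation; the toy data and the idea of feeding the mapped features to the gradient-descent sketch from section 3.1 are assumptions made here for illustration.

```python
import numpy as np

def map_features(x1, x2):
    """Map [x1, x2] => [x1, x2, x1^2, x2^2, x1*x2], with a bias column prepended."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

# toy data: points inside a circle of radius 1 are positive, the rest negative
x1 = np.random.uniform(-2, 2, 200)
x2 = np.random.uniform(-2, 2, 200)
y = (x1**2 + x2**2 < 1.0).astype(float)

X = map_features(x1, x2)
# X can now be fed to an ordinary (linear) logistic regression,
# e.g. the train_logistic_regression sketch from section 3.1:
# theta = train_logistic_regression(X, y)
```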

3.3 Word2vec

Word2vec has two network models: the CBOW model (continuous bag-of-words model) and the Skip-gram model (continuous skip-gram model).

Both models consist of three layers: an input layer, a projection layer, and an output layer. The CBOW model predicts the current word w(t) given its context w(t-2), w(t-1), w(t+1), w(t+2); the Skip-gram model does the opposite, predicting the context w(t-2), w(t-1), w(t+1), w(t+2) given the current word w(t). Take, for example, the sentence "Today / weather / good / sunny" with "weather" as the current word. The CBOW model predicts the probability of "weather" given "Today", "good" and "sunny", while the Skip-gram model predicts the probabilities of the three words "Today", "good" and "sunny" appearing around "weather".
The CBOW model is solved by optimizing the objective function below, which is a log-likelihood function.
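Writing C for the corpus and Context(w) for the context of word w, the objective is

$$ \mathcal{L} = \sum_{w \in \mathcal{C}} \log p\big(w \mid \mathrm{Context}(w)\big) $$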

The input of CBOW consists of the word vectors v(w) of the 2c words in Context(w); these 2c word vectors are summed in the projection layer, and the result is denoted x_w. The output layer uses the hierarchical softmax technique: all words in the training corpus are organized into a Huffman tree according to their frequencies, with the actual words as the leaf nodes. The word w is reached by following a path from the root, and this path can be represented as a string of 0s and 1s. Each intermediate node of the Huffman tree acts like the discriminant of a logistic regression, with its own parameter vector. So, for the CBOW model, we have:
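In a common notation for hierarchical softmax (l^w is the length of the path to w, d_j^w ∈ {0, 1} is the j-th code bit on that path, θ_{j-1}^w is the parameter vector of the corresponding intermediate node, and σ is the sigmoid function), the probability is built up along the path:

$$ p\big(w \mid \mathrm{Context}(w)\big) = \prod_{j=2}^{l^w} p\big(d_j^w \mid x_w, \theta_{j-1}^w\big), \qquad p\big(d_j^w \mid x_w, \theta_{j-1}^w\big) = \big[\sigma(x_w^{T}\theta_{j-1}^w)\big]^{1 - d_j^w}\big[1 - \sigma(x_w^{T}\theta_{j-1}^w)\big]^{d_j^w} $$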
Then, the target function is:
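In the same notation:

$$ \mathcal{L} = \sum_{w \in \mathcal{C}} \sum_{j=2}^{l^w} \Big\{ (1 - d_j^w) \log \sigma\big(x_w^{T}\theta_{j-1}^w\big) + d_j^w \log\big(1 - \sigma(x_w^{T}\theta_{j-1}^w)\big) \Big\} $$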

The parameters θ and x_w of the objective function are then updated with stochastic gradient methods so that the value of the objective function is maximized.
Similarly to the CBOW model, the Skip-gram model is solved by optimizing the following objective function.
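In the same notation, the objective is

$$ \mathcal{L} = \sum_{w \in \mathcal{C}} \log p\big(\mathrm{Context}(w) \mid w\big) $$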

where
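$$ p\big(\mathrm{Context}(w) \mid w\big) = \prod_{u \in \mathrm{Context}(w)} p(u \mid w), \qquad p(u \mid w) = \prod_{j=2}^{l^u} \big[\sigma\big(v(w)^{T}\theta_{j-1}^u\big)\big]^{1 - d_j^u} \big[1 - \sigma\big(v(w)^{T}\theta_{j-1}^u\big)\big]^{d_j^u} $$

with each context word u scored along its own Huffman path, in the same notation as for CBOW, except that the projection-layer vector is simply v(w).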

So, the target function of Skip-gram is:
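$$ \mathcal{L} = \sum_{w \in \mathcal{C}} \sum_{u \in \mathrm{Context}(w)} \sum_{j=2}^{l^u} \Big\{ (1 - d_j^u) \log \sigma\big(v(w)^{T}\theta_{j-1}^u\big) + d_j^u \log\big(1 - \sigma(v(w)^{T}\theta_{j-1}^u)\big) \Big\} $$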

The parameters θ and v(w) of the objective function are updated with stochastic gradient methods so that the value of the objective function is maximized.
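A highly simplified CBOW training sketch follows. To keep the example short it uses a plain softmax output layer rather than the hierarchical softmax described above, and the vector dimension, window size and learning rate are arbitrary choices made here for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_cbow(corpus, V, dim=50, window=2, lr=0.05, n_iter=5):
    """corpus: list of sentences, each a list of word ids in [0, V)."""
    W_in = np.random.randn(V, dim) * 0.01    # input word vectors v(w)
    W_out = np.random.randn(V, dim) * 0.01   # output-layer parameters
    for _ in range(n_iter):
        for sent in corpus:
            for t, w in enumerate(sent):
                context = sent[max(0, t - window):t] + sent[t + 1:t + 1 + window]
                if not context:
                    continue
                x_w = W_in[context].sum(axis=0)      # projection layer: sum of context vectors
                p = softmax(W_out @ x_w)             # predicted distribution over the vocabulary
                grad_out = np.outer(p, x_w)
                grad_out[w] -= x_w                   # gradient of -log p(w | context) w.r.t. W_out
                grad_in = W_out.T @ p - W_out[w]     # gradient w.r.t. the projection vector
                W_out -= lr * grad_out
                W_in[context] -= lr * grad_in        # same gradient applied to every context vector
    return W_in
```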
