A summary of common machine learning algorithms and principles (practical notes)

Source: Internet
Author: User
Tags: ID3, SVM

Naive Bayesian

Reference [1]

The probability that events A and B occur together can be expressed through A occurring given B, or B occurring given A:

P(A∩B) = P(A)·P(B|A) = P(B)·P(A|B)

So there are:

P(A|B) = P(B|A)·P(A) / P(B)

For a given item to be classified, compute the probability of each target category conditioned on that item; the category with the largest probability is the one the item is assigned to.

Working principle

1. Suppose there is a sample x = (a1, a2, a3, ..., an) to be classified (and assume the features of x are independent of each other).

2. Assume the set of target categories is y = {y1, y2, y3, ..., yn}.

3. Then max(P(y1|x), P(y2|x), P(y3|x), ..., P(yn|x)) gives the final category.

4. By Bayes' rule, P(yi|x) = P(x|yi)·P(yi) / P(x).

5. Because P(x) is the same for every target category, it suffices to maximize P(x|yi)·P(yi).

6. With the independence assumption, P(x|yi)·P(yi) = P(yi)·∏i P(ai|yi).

7. The specific P(ai|yi) and P(yi) can be estimated by counting over the training samples:

    • P(ai|yi) is the probability of feature value ai occurring within category yi
    • P(yi) is the probability of category yi among all categories

8. And that is all there is to it ^_^
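As a concrete illustration of steps 1–8, here is a minimal counting-based sketch in Python (the function names and the discrete-feature representation are illustrative assumptions, not part of the original article):

```python
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Estimate P(y) and P(a_i | y) by counting over the training set (discrete features)."""
    n = len(labels)
    class_counts = Counter(labels)                      # N(y)
    feat_counts = defaultdict(Counter)                  # N(feature i takes value a, class y)
    for x, y in zip(samples, labels):
        for i, a in enumerate(x):
            feat_counts[(i, y)][a] += 1
    priors = {y: c / n for y, c in class_counts.items()}            # P(y)
    def likelihood(i, a, y):                                         # P(a_i | y)
        return feat_counts[(i, y)][a] / class_counts[y]
    return priors, likelihood

def predict_nb(x, priors, likelihood):
    """Return argmax_y P(y) * prod_i P(a_i | y); P(x) is ignored, as in step 5."""
    best_y, best_score = None, -1.0
    for y, p in priors.items():
        score = p
        for i, a in enumerate(x):
            score *= likelihood(i, a, y)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```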

Work flow

1. Preparation Stage

Determine the feature attributes, partition each feature attribute appropriately, then manually label a portion of the items to be classified to form the training samples.

2. Training Stage

Compute the frequency of each category in the training samples and the conditional probability estimate of each feature-attribute partition for each category.

3. Application Stage

Classify with the trained classifier: the input is the classifier and the sample to be classified, the output is the category the sample belongs to.

Attribute characteristics

1. When a feature takes discrete values, the probabilities can be estimated directly by counting frequencies.

2. When a feature takes continuous values, assume the feature follows a Gaussian distribution g(x, σ, μ), so that

P(ak|yi) = g(xk, σi, μi)

where μi and σi are the mean and standard deviation of the feature within class yi.
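For the continuous case, a minimal sketch of the Gaussian class-conditional density (mu and sigma stand for the per-class mean and standard deviation, which would be estimated from the training samples of that class):

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P(a_k | y_i) for a continuous feature, assuming it is Gaussian within the class."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
```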

Laplace calibration (Laplace smoothing)

When some feature partition never appears together with a category, we get P(a|y) = 0, which degrades the quality of the classifier. Laplace smoothing is therefore introduced: add 1 to the count of every partition under every category.

Encountering features that are not independent of each other

Refer to improved Bayesian networks, which use DAGs to describe the probabilistic graph.

Advantages and Disadvantages

Advantages of naive Bayes:

Works well on small-scale data, suitable for multi-class tasks, and suitable for incremental training.

Disadvantages:

Very sensitive to how the input data is represented (discrete vs. continuous features, extreme values, etc.).

Logistic regression and linear regression

LR (logistic regression) is a linear binary classification model that computes the probability of an event given the sample features, for example whether a user will buy or click on a product based on their browsing and purchase history. The output of LR is a sigmoid function applied to a linear combination: the feature values are weighted, summed, and a bias is added. Training LR therefore means training the weights w of this linear function.

h_w(x) = 1 / (1 + e^(-(w^T·x + b)))

The weights w are usually estimated by maximum likelihood. Suppose there are samples {xi, yi}, where xi are the features of a sample and yi ∈ {0, 1} is its true class; let the probability of yi = 1 be pi, so the probability of yi = 0 is 1 - pi. Then the probability of the observation is:

P(yi) = pi^yi · (1 - pi)^(1 - yi)

The maximum likelihood function is:

∏i ( h_w(xi)^yi · (1 - h_w(xi))^(1 - yi) )

Taking the logarithm of the likelihood function gives:

L(w) = Σi [ yi·log h_w(xi) + (1 - yi)·log(1 - h_w(xi)) ] = Σi [ yi·(w^T·xi) - log(1 + e^(w^T·xi)) ]

The maximum likelihood estimate of w is the value that maximizes L(w).

In practice a negative sign is usually added and the function is minimized instead.

So solving for w becomes an optimization problem on the (negative) log-likelihood, which is usually solved by stochastic gradient descent or quasi-Newton methods.

Gradient Descent method

The loss function for LR is:

J(w) = -(1/n)·Σ_{i=1..n} [ yi·log(h_w(xi)) + (1 - yi)·log(1 - h_w(xi)) ]

The problem then becomes min J(w).

The process of updating W is

w := w - α·∇J(w), i.e. w := w - α·(1/n)·Σ_{i=1..n} (h_w(xi) - yi)·xi

where α is the step size; iterate until J(w) no longer decreases.

The biggest problems with batch gradient descent are that it can fall into a local optimum and that computing the cost at the current parameters requires traversing all the samples, so it is slow (although the computation can be expressed as matrix multiplication to update the whole w at once).

So many frameworks (e.g. Mahout) now use stochastic gradient descent instead, which computes the cost on the current sample only rather than summing over the entire training set: when updating the parameters w it does not iterate through all samples but randomly selects one sample for the update. This method converges faster (usually a maximum number of iterations is set), helps avoid local optima, and is easy to parallelize (e.g. with a parameter server).

w := w - α·(h_w(xj) - yj)·xj,  where j ∈ {1, ..., n} is chosen randomly

One way SGD can be improved is by using a dynamic step size:

α = 0.04 / (1.0 + n + i) + r
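A minimal sketch of SGD for LR with such a shrinking step size (the function sgd_logistic, the epoch loop, and the exact schedule are illustrative assumptions, not a prescribed implementation):

```python
import math
import random

def sigmoid(z):
    z = max(-35.0, min(35.0, z))                    # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(xs, ys, epochs=100, r=0.01):
    """Stochastic gradient descent for LR with a dynamic step size.
    xs: list of feature lists, ys: labels in {0, 1}. The bias is folded in by appending 1."""
    xs = [x + [1.0] for x in xs]
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for epoch in range(epochs):
        for i in range(n):
            j = random.randrange(n)                 # pick one sample at random
            alpha = 0.04 / (1.0 + epoch + i) + r    # shrinking step size (assumed schedule)
            pred = sigmoid(sum(wk * xk for wk, xk in zip(w, xs[j])))
            err = pred - ys[j]                      # h_w(x_j) - y_j
            w = [wk - alpha * err * xk for wk, xk in zip(w, xs[j])]
    return w
```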

Other optimization methods

    • Quasi-Newton method (as I recall, it uses the Hessian matrix and Cholesky decomposition)
    • BFGS
    • L-BFGS

Pros and cons: no need to choose a learning rate α and convergence is faster, but these methods are more complex.

Overfitting problems with LR:

If we have many features and fit the training set very well, we may not achieve the same performance on the prediction set.

1. Reduce the number of features (manually decide which features to keep, or let an algorithm select them).

2. Regularization (L2 is used more often, for ease of solving).

    • After adding regularization, the loss function is: J(w) = -(1/n)·Σ_{i=1..n} [ yi·log(h_w(xi)) + (1 - yi)·log(1 - h_w(xi)) ] + λ·||w||²
    • And the update of w becomes: w := w - α·(h_w(xj) - yj)·xj - 2α·λ·wj
    • Note: the bias w0 here is not affected by regularization

Multi-classification of LR: Softmax

Assuming the set of values of the discrete random variable y is {1, 2, ..., k}, the multi-class LR is

P(y = a | x) = exp(w_a·x) / Σ_{i=1..k} exp(w_i·x),  1 ≤ a ≤ k

This outputs the probability that the current sample belongs to each class, and the probabilities sum to 1.
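A small sketch of the Softmax probability computation (the max-subtraction for numerical stability is an added detail, not from the original text):

```python
import math

def softmax_probs(ws, x):
    """P(y = a | x) = exp(w_a . x) / sum_i exp(w_i . x) for k weight vectors ws."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in ws]
    m = max(scores)                                # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]                   # the probabilities sum to 1
```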

On choosing between Softmax and k binary LRs

If the categories are mutually exclusive (for example, a piece of music can only belong to one of classical, country, or rock), use Softmax.

Otherwise, if the categories can overlap (for example, a song may have an instrumental soundtrack, contain vocals, or be dance music), it is more appropriate to use k separate LRs.

Advantages and Disadvantages

Advantages of logistic regression:

1. Simple to implement;

2. Classification requires very little computation, is fast, and uses little storage;

Disadvantages:

1. Prone to underfitting, so accuracy is generally not very high;

2. Can only handle binary classification directly (Softmax, derived from it, can be used for multi-class), and the classes must be linearly separable;


KNN algorithm

Given a training data set and a new instance, find the k training instances closest to the new instance in the training set; the class that appears most often among these k instances is the class of the new instance.

Three Elements:

1. The choice of the value of k

2. The distance metric (common metrics include Euclidean distance, Mahalanobis distance, etc.)

3. The classification decision rule (majority voting)

Selection of the value of k

1. The smaller the value of k, the more complex the model and the easier it is to overfit.

2. The larger the value of k, the simpler the model; if k = n, every point is simply assigned to the most frequent class in the training set.

So k is generally taken to be a smaller value, and cross-validation is then used to choose it.

Cross-validation here means splitting the samples into a training part and a validation part, for example 95% for training and 5% for validation, then trying k = 1, 2, 3, 4, 5, and so on, measuring the classification error for each, and selecting the k with the smallest error.

KNN for regression

After finding the nearest k instances, take the average of their values as the prediction. Alternatively, give the k instances weights and take the weighted average, with a weight inversely proportional to the distance (the closer the instance, the larger the weight).
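A minimal sketch of KNN prediction covering both the voting and the averaging case (knn_predict and the unweighted average are illustrative choices):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_x, train_y, x, k, regression=False):
    """Majority vote (classification) or average (regression) over the k nearest samples."""
    neighbours = sorted(zip(train_x, train_y), key=lambda p: euclidean(p[0], x))[:k]
    if regression:
        return sum(y for _, y in neighbours) / k
    votes = Counter(y for _, y in neighbours)
    return votes.most_common(1)[0][0]
```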

Advantages and Disadvantages

Advantages of the KNN algorithm:

1. Simple idea and mature theory; can be used for both classification and regression;

2. Can be used for non-linear classification;

3. Training time complexity is O(n);

4. High accuracy; makes no assumptions about the data; not sensitive to outliers;

Disadvantages:

1. Large amount of computation;

2. Sample imbalance problem (some categories have many samples while others have very few);

3. Requires a lot of memory;

KD Tree

A KD tree is a binary tree that represents a partition of a k-dimensional space, allowing fast retrieval (so that KNN does not need to compute the distance to every sample).

Constructing KD Tree

It is built by recursively splitting the k-dimensional space into sub-regions at the median.

Suppose there is a data set in k-dimensional space, T = {x1, x2, x3, ..., xn}, with xi = (a1, a2, a3, ..., ak).

1. First construct the root node: take the median b of coordinate a1 as the split point and divide the rectangular region of the root node into two regions, region 1 with a1 < b and region 2 with a1 > b.

2. Construct the child nodes: for each of the two regions above, take the median of a2 as the split point and divide it in two again, producing nodes at depth 1 (if a2 equals the median, that instance falls on the splitting surface).

3. Repeat step 2: a node at depth j is split on coordinate ai with i = j % k + 1, and the recursion stops when a sub-region contains no instances.
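A minimal sketch of the construction procedure just described (KDNode and build_kd_tree are hypothetical names; the split axis cycles through the coordinates by depth):

```python
class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kd_tree(points, depth=0):
    """Split on axis depth % k at the median, recursing until a region is empty."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                          # median along the current axis
    return KDNode(points[mid], axis,
                  build_kd_tree(points[:mid], depth + 1),
                  build_kd_tree(points[mid + 1:], depth + 1))
```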

KD Tree Search

1. Starting from the root node, recursively descend to the leaf node that contains x, comparing the corresponding coordinate xi at each level.

2. Take this leaf node as the current "approximate nearest point".

3. Recursively back up: if the ball centered at x with radius equal to the distance to the "approximate nearest point" intersects the boundary of the other child region of a node on the path, then that other region may contain a point closer to x, so enter it, look for such a point, and update the approximate nearest point.

4. Repeat step 3 until the other region no longer intersects the ball or the root node has been reached.

5. The last updated "approximate nearest point" is the true nearest point to x.

KD tree for KNN search

By searching the KD tree for the points closest to the query target, the KNN search can be confined to a local region of the space, which greatly improves efficiency.

The complexity of KD tree search

When the instances are randomly distributed, the search complexity is O(log n), where n is the number of instances. A KD tree is better suited to KNN search when the number of instances is much larger than the space dimension; if the space dimension is about the same as the number of instances, its efficiency drops to roughly that of a linear scan.

I later implemented a KD tree; see the follow-up article on applying the KNN algorithm with a KD tree.

SVM, SMO

For a sample point (xi, yi) and the SVM hyperplane w^T·x + b = 0:

    • Functional margin: yi·(w^T·xi + b)
    • Geometric margin: yi·(w^T·xi + b) / ||w||, where ||w|| is the L2 norm of w; the geometric margin does not change when the parameters are rescaled

The basic idea of the SVM is to find the hyperplane that correctly separates the training samples and maximizes the geometric margin.

Linear SVM problem

First look at the SVM problem:

argmax_{w,b} γ  s.t. yi·(w^T·xi + b) / ||w|| ≥ γ

Now define the functional margin γ̂ = γ·||w||.

Then the problem becomes:

argmax_{w,b} γ̂ / ||w||  s.t. yi·(w^T·xi + b) ≥ γ̂

Because scaling γ̂ proportionally does not affect the actual margin, we can set γ̂ = 1 here; and because max(1/||w||) = min(½·||w||²),

the final problem becomes

argmin_{w,b} ½·||w||²  s.t. yi·(w^T·xi + b) ≥ 1

This is a convex quadratic programming problem, which can be converted into a Lagrangian function and then solved with the dual algorithm.

Dual Solution

Introduce Lagrange multipliers α = {α1, α2, ..., αn} and define the Lagrangian function:

L(w, b, α) = ½·||w||² - Σ_{i=1..n} αi·yi·(w^T·xi + b) + Σ_{i=1..n} αi

The dual of the primal problem is the max-min problem:

max_α min_{w,b} L(w, b, α)

First minimize L over w, b, then maximize over α.

To compute min_{w,b} L(w, b, α), set the derivatives with respect to w and b to zero:

∇_w L(w, b, α) = w - Σ_{i=1..n} αi·yi·xi = 0,  ∇_b L(w, b, α) = Σ_{i=1..n} αi·yi = 0

Substituting back gives:

min_{w,b} L(w, b, α) = -½·Σ_{i=1..n} Σ_{j=1..n} αi·αj·yi·yj·(xi·xj) + Σ_{i=1..n} αi

Maximizing min_{w,b} L(w, b, α) over α gives the dual problem:

max_α  -½·Σ_{i=1..n} Σ_{j=1..n} αi·αj·yi·yj·(xi·xj) + Σ_{i=1..n} αi
s.t. Σ_{i=1..n} αi·yi = 0,  αi ≥ 0, i = 1, 2, 3, ..., n

Converting the maximization into a minimization gives the equivalent formulation:

min_α  ½·Σ_{i=1..n} Σ_{j=1..n} αi·αj·yi·yj·(xi·xj) - Σ_{i=1..n} αi
s.t. Σ_{i=1..n} αi·yi = 0,  αi ≥ 0, i = 1, 2, 3, ..., n

If the solution is α* = (α1*, α2*, ..., αn*),

the optimal w, b are

w* = Σ_{i=1..n} αi*·yi·xi,  b* = yj - Σ_{i=1..n} αi*·yi·(xi·xj)  (for any index j with αj* > 0)

Therefore the final decision function is

f(x) = sign( Σ_{i=1..n} αi*·yi·(x·xi) + b* )

In other words, the classification decision function depends only on inner products between the input x and the training samples.

PS: the above is hard-margin maximization for the SVM. There is also soft-margin maximization, which introduces slack variables ζ, and the SVM problem becomes:

argmin_{w,b} ½·||w||² + C·Σ_{i=1..n} ζi  s.t. yi·(w^T·xi + b) ≥ 1 - ζi,  ζi ≥ 0, i = 1, 2, ..., n

The rest of the solution proceeds in the same way as in the hard-margin case.

Also: the sample points closest to the separating hyperplane are called support vectors.

Loss function

The loss function (optimization objective) is:

Σ_{i=1..n} [1 - yi·(w^T·xi + b)]_+ + λ·||w||²

where [1 - yi·(w^T·xi + b)]_+ is called the hinge loss, because:

[1 - yi·(w^T·xi + b)]_+ = 0 if 1 - yi·(w^T·xi + b) ≤ 0, and 1 - yi·(w^T·xi + b) otherwise
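A small sketch of evaluating this objective, assuming labels in {-1, +1} (hinge_loss and the parameter lam are illustrative names):

```python
def hinge_loss(w, b, xs, ys, lam):
    """sum_i [1 - y_i (w.x_i + b)]_+ + lam * ||w||^2, with labels y_i in {-1, +1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        total += max(0.0, 1.0 - margin)            # the hinge (folded) part
    total += lam * sum(wi * wi for wi in w)        # L2 penalty
    return total
```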

Why to introduce dual algorithm

1. The dual problem is often easier to solve (combining the Lagrangian with the KKT conditions);

2. It naturally introduces the kernel function (the Lagrangian expression contains inner products, and kernels are defined through inner products).

Kernel function

The input features x (not linearly separable) are mapped to a high-dimensional feature space R, in which the data fed to the SVM can become linearly separable; that is the role of the kernel function.

    • Polynomial kernel: K(x, z) = (x·z + 1)^p
    • Gaussian kernel: K(x, z) = exp(-||x - z||² / (2σ²))
    • String kernel: apparently used for string processing, etc.
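Minimal sketches of the first two kernels (the default p and sigma values are arbitrary choices for the example):

```python
import math

def polynomial_kernel(x, z, p=2):
    """K(x, z) = (x . z + 1)^p"""
    return (sum(xi * zi for xi, zi in zip(x, z)) + 1) ** p

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))
```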

Advantages and disadvantages of SVM

Advantages:

1. Can map to high-dimensional spaces using kernel functions;

2. Can solve non-linear classification using kernel functions;

3. The classification idea is very simple: maximize the margin between the samples and the decision surface;

4. Good classification performance;

Disadvantages:

1. Training on large-scale data is difficult;

2. Cannot directly support multi-class classification, but indirect methods can be used.

SMO

SMO is used to solve the SVM optimization quickly.

It selects two variables of the convex quadratic program and keeps the others fixed, then constructs a quadratic programming sub-problem in these two variables. The solution of this sub-problem is closer to the solution of the original quadratic program, and dividing the work into such sub-problems greatly speeds up the overall algorithm. Regarding the two variables:

1. One of them is the variable that most seriously violates the KKT conditions;

2. The other is determined by the equality constraint, apparently chosen to maximize the resulting change in the remaining variable.

Multi-classification of SVM

1. Direct method

By directly modifying the objective function to solve for the parameters of multiple classification surfaces, multi-class classification can be achieved by solving that optimization problem (the computational complexity is very high and it is difficult to implement).

2. Indirect method

    • One-to-many

Treat one class as the positive class and the remaining n-1 classes as the other. For example, with four classes A, B, C, D: the first time, A is one class and {B, C, D} the other, training one classifier; the second time, B is one class and {A, C, D} the other; in this way 4 classifiers need to be trained in total. At test time, the sample passes through the 4 classifiers f1(x), f2(x), f3(x), f4(x), and the one with the maximum output gives the class. (Because each classifier is one class against many, the training sets are unbalanced, which makes this rather impractical.)

    • One-to-one (the method implemented in LIBSVM)

Train a classifier for every pair of classes, so n classes require n(n-1)/2 SVM classifiers.

Taking A, B, C, D as an example, classifiers are needed for {A,B}, {A,C}, {A,D}, {B,C}, {B,D}, {C,D}, 6 in total; at prediction time the test sample passes through the 6 classifiers and the final result is decided by voting. (This method works well, but requiring n(n-1)/2 classifiers is expensive; apparently it can be improved with a directed acyclic graph.)

Decision Tree

A decision tree is a tree built on Decision-making.

ID3

1. For the current set, compute the information gain of each feature;

2. Select the feature with the largest information gain as the decision feature of the current node;

3. Split into child nodes according to the values of that feature (for example, if the age feature has values youth, middle-aged, and old, split into 3 subtrees);

4. Continue recursively on the child nodes until all features have been used for splitting.

S(C, ai) = -Σ_i pi·log(pi)

is the entropy of the classes within one attribute value, where pi = P(yi|ai) is the probability of class yi given attribute value ai, estimated from counts.

S(C, A) = Σ_i P(A = ai)·S(ai)

is the entropy of the whole attribute: the weighted sum of the entropies of each value, weighted by the proportion of samples taking that value.

Gain(C, A) = S(C) - S(C, A)

The gain is the entropy of the classification target minus the conditional entropy of the current attribute; the greater the gain, the stronger the attribute's classification ability.

(The former is called the empirical entropy, indicating the uncertainty of the classification C of the data set; the latter is the empirical conditional entropy, indicating the uncertainty of C given A; their difference is called the mutual information, and the information gain used in decision trees equals the mutual information.)

For example, suppose the current attribute is whether a person owns property, and the classification is whether they can repay a debt.

Right now:

    • 7 people own property: 4 can repay the debt, 3 cannot;
    • 3 people do not own property: 1 can repay the debt, 2 cannot.

Then

the entropy with property: S(have_house) = -(4/7·log(4/7) + 3/7·log(3/7))

the entropy without property: S(no_house) = -(1/3·log(1/3) + 2/3·log(2/3))

the entropy of the classification: S(classifier) = -(5/10·log(5/10) + 5/10·log(5/10))

The final gain = S(classifier) - (7/10·S(have_house) + 3/10·S(no_house)); the larger, the better.
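The same worked example in code (a small sketch; log base 2 is assumed here, while the original text does not fix the base):

```python
import math

def entropy(counts):
    """S = -sum_i p_i log2 p_i over a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Loan example: 7 samples with property (4 repay, 3 default),
# 3 samples without property (1 repays, 2 default), 10 samples overall (5 repay, 5 default).
s_have_house = entropy([4, 3])
s_no_house = entropy([1, 2])
s_classifier = entropy([5, 5])
gain = s_classifier - (7 / 10 * s_have_house + 3 / 10 * s_no_house)
print(round(gain, 3))                              # larger gain -> better split
```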

About loss functions

Let the tree have |T| leaf nodes, let t be a leaf node with Nt samples, of which ntk belong to class k, and let H(t) be the empirical entropy at the leaf node. The loss function is defined as

C_λ(T) = Σ_t Nt·H(t) + λ·|T|

where

H(t) = -Σ_k (ntk/Nt)·log(ntk/Nt)

Substituting gives

C_λ(T) = -Σ_t Σ_k ntk·log(ntk/Nt) + λ·|T|

λ·|T| is a regularization term, and λ adjusts the trade-off between fit and tree size.

Decision tree generation only considers the information gain.

C4.5

It is an improved version of ID3 that uses the information gain ratio to select attributes.

SplitInformation(S, A) = -Σ_i (|Si|/|S|)·log2(|Si|/|S|),  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

Advantages and Disadvantages

The accuracy is high, but the subtrees need to be scanned and sorted several times during construction, so it is inefficient.

CART

The classification and regression tree (CART) is a binary decision tree built recursively; at each split the goal is to divide the remaining samples into two groups in the best way. The splitting criteria are:

1. Classification tree: minimize the Gini index;

2. Regression tree: minimization of squared errors

Classification Tree:

1. For each feature, compute its Gini gain;

2. Select the feature with the smallest Gini gain as the splitting feature;

3. Within that feature, find the value whose split gives the smallest Gini index and use it as the optimal split point;

4. Divide the current samples into two groups: those whose value of this feature equals the optimal split point, and those whose value does not;

5. Recurse on these two groups until all leaves contain samples of the same class or the number of samples in a leaf falls below some threshold.

The Gini index measures the impurity of a distribution: the more mixed the classes, the larger the Gini index (similar in spirit to entropy).

Gini(ai) = 1 - Σ_i pi²

where pi is the proportion of class-i samples in the current data set.

The smaller the Gini index, the purer the sample distribution (0 means only one class is present); the larger it is, the more mixed the classes.

Gini gain

Gini_gain = Σ_i (ni/n)·Gini(ai)

is the weighted impurity after splitting on the current attribute, where ni/n is the proportion of samples taking value ai.

CART finally selects the feature with the smallest Gini_gain as the splitting feature.

Using the same loan example as for ID3:

Gini index with property: Gini(have_house) = 1 - ((3/7)² + (4/7)²)

Gini index without property: Gini(no_house) = 1 - ((1/3)² + (2/3)²)

Gini gain: Gini_gain = 7/10·Gini(have_house) + 3/10·Gini(no_house)
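The corresponding Gini computation for the same example (a small sketch):

```python
def gini(counts):
    """Gini = 1 - sum_i p_i^2 over a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Weighted Gini after splitting the loan data on "owns property".
gini_gain = 7 / 10 * gini([4, 3]) + 3 / 10 * gini([1, 2])
print(round(gini_gain, 3))                         # CART picks the feature minimizing this
```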

Regression tree:

A regression tree divides the input into two regions by minimizing the squared error.

1. For each feature, traverse the possible split points s to find the best one.

The squared error to minimize is: min_{j,s} { min_{c1} Σ_{xi∈R1} (yi - c1)² + min_{c2} Σ_{xi∈R2} (yi - c2)² }

That is, for the left and right subtrees produced by splitting at s, compute the sum of squared differences between the targets and the predicted values, where the predicted value is the mean of the yi of the samples xi falling in each subtree.

2. Find the splitting feature j and split point s with the smallest error, and divide the samples into two regions accordingly: one with feature j less than or equal to s, the other with feature j greater than s.

R1(j, s) = {x | x(j) ≤ s},  R2(j, s) = {x | x(j) > s}

3. Enter the two sub-regions and continue dividing in the same way until the stopping condition is reached.

The minimizing constants above can, as I recall, be found by least squares (they are simply the means of each region).

About pruning: post-pruning uses a separate validation data set to prune the tree grown on the training set.

Stop condition

1. Stop when each leaf node contains only one class of records (this method overfits easily);

2. Alternatively, stop when the number of records in a leaf node falls below a threshold, or when the information gain of a node falls below a threshold.

About features and target values

1. Discrete features, discrete target values: ID3 and CART can be used;

2. Continuous features, discrete target values: after discretizing the continuous features, ID3 and CART can be used;

3. Discrete features, continuous target values: a regression tree (CART) can be used.

Classification and regression of decision trees

    • Classification Tree

Output the category that appears most often in the leaf node

    • Regression tree

Output the average of the sample values in the leaf node

The ideal decision tree

1. As few leaf nodes as possible;

2. The depth of the leaf nodes as small as possible (too deep may lead to overfitting).

Solving the overfitting of decision Trees

1. pruning

    • Pre-pruning: impose stricter conditions when splitting a node, and stop splitting directly if they are not met (the decision tree may then not be optimal and may not achieve a better result);
    • Post-pruning: after the tree has been built, replace a subtree with a single node labelled with the majority class of the subtree (this method wastes part of the earlier building process).

2. cross-validation

3. Random Forest

Advantages and Disadvantages

Advantages:

1. Simple to compute, highly interpretable, relatively good at handling samples with missing attribute values, and able to handle irrelevant features;

Disadvantages:

1. A single decision tree has weak classification ability and handles continuous-valued variables poorly;

2. Prone to overfitting (random forests, which appeared later, reduce this);

Random forest (RF)

A random forest is made up of many random decision trees with no correlation between them. Once the RF is built, each decision tree makes its own judgment at prediction time, and the final output is obtained by the bagging idea (i.e. voting).

Learning process

1. Suppose there are N training samples, each with M features, and K trees are to be built;

2. Sample N examples with replacement from the N training samples as the training set of one tree (the remaining out-of-bag samples are used to evaluate its error);

3. Randomly select a subset of m of the M features as candidates at each split (m << M);

4. Build a decision tree on the sampled data with full splits, so that every node either cannot be split further or contains samples of a single class;

5. Repeat steps 2–4 K times to build the forest.

Predictive process

1. Feed the prediction sample to each of the K trees and predict separately;

2. For a classification problem, use voting to select the most frequent category;

3. For a regression problem, use the mean of the K outputs as the result.

Parameter issues

1. Generally take m = sqrt(M);

2. The number of trees K generally needs to be in the hundreds or thousands, and also depends on the specific data (e.g. the number of features);

3. The maximum depth of a tree (too deep may lead to overfitting);

4. The minimum number of samples on a node and the minimum information gain.

Generalization error estimation

OOB (out-of-bag) samples are used to estimate the generalization error: the samples not drawn for a given tree (about 36.8% of them) are used as its prediction samples; the built forest predicts each of these prediction samples, and the proportion of errors over all prediction samples gives the OOB error rate of the RF.

Learning Algorithms

ID3 algorithm: handles discrete values;

C4.5 algorithm: handles continuous values;

CART algorithm: handles both discrete and continuous values.

About Cart

CART builds a classification tree by iteratively selecting features so that each split divides the remaining data into two groups as well as possible.

Gini = 1 - Σ pi², i.e. one minus the sum of the squared probabilities of each class;

classification problem: argmax(Gini - Gini_left - Gini_right)

regression problem: argmax(Var - Var_left - Var_right)

Find the best feature f and the best threshold th; samples with values less than or equal to th go to the left subtree and those greater than th go to the right subtree.

Advantages and Disadvantages

1. Can handle classification with a large number of features, without needing feature selection;

2. After training, it can report which features are more important;

3. Fast to train;

4. Easy to parallelize;

5. Relatively simple to implement.

GBDT

The essence of GBDT is that each tree is trained on the residual of the trees before it, i.e. the difference between the previous predicted value and the true value.

For example, if the current sample's age is 18, the first tree is trained with a target of 18 but predicts, say, 12, leaving a difference of 6;

so the second tree is trained with a target of 6. If its prediction is 6, then the two trees added together give the true age;

but if the second tree predicts 5, the remaining residual of 1 is handed to the third tree to train on.

The advantage of this kind of boosting is that at each step it implicitly increases the weight of the wrongly predicted instances, while instances already predicted correctly tend towards a residual of 0, so the later trees pay more attention to the instances that were previously wrong.

Shrinkage

Shrinkage holds that approaching the result step by step, a little at a time, avoids overfitting more easily than approaching it in one big step.

y(1∼i) = y(1∼i-1) + step·yi

Just like building an internet product: first meet the needs of 60% of users, then the needs of another 35%, and finally pay attention to the remaining 5%, so the product improves gradually.
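A minimal 1-D sketch of this residual-fitting loop with shrinkage under squared loss; fit_stump is a stand-in weak learner (real GBDT fits a full regression tree at each step):

```python
def fit_stump(xs, residuals):
    """A stand-in weak learner: a one-split regression stump on a 1-D feature."""
    best = None
    for s in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= s]
        right = [r for x, r in zip(xs, residuals) if x > s]
        if not left or not right:
            continue
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - c1) ** 2 for r in left) + sum((r - c2) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, s, c1, c2)
    if best is None:                                  # all x equal: predict the mean residual
        c = sum(residuals) / len(residuals)
        return lambda x: c
    _, s, c1, c2 = best
    return lambda x: c1 if x <= s else c2

def gbdt_1d(xs, ys, n_trees=50, step=0.1):
    """Each new stump is fit to the residual of the current prediction, then added scaled by step."""
    trees, pred = [], [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, pred)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        pred = [p + step * tree(x) for p, x in zip(pred, xs)]
    return lambda x: sum(step * t(x) for t in trees)
```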

Parameter adjustment

1. Number of trees: 100 to 10000;

2. Tree depth: 3 to 8;

3. Learning rate: 0.01 to 1;

4. Maximum number of leaf nodes: about 20;

5. Training sampling ratio: 0.5 to 1;

6. Feature sampling ratio for training: sqrt(number of features).

Advantages and Disadvantages

Advantages:

1. High accuracy;

2. Can handle non-linear data;

3. Can handle multiple feature types;

4. Suitable for low-dimensional dense data.

Disadvantages:

1. Hard to parallelize (because the trees are sequentially dependent);

2. High complexity for multi-class classification.

BP

Least squares

The least squares method is a mathematical optimization technique that finds the best function fit by minimizing the squared error.

Suppose there are two-dimensional observations (x1, y1), (x2, y2), ..., (xn, yn), and we want to fit y = a + b·x.

Write yi = a + b·xi + ki; if there are a, b that make Σ_{i=1..n} |ki| smallest, the line fits well.

So the problem is first min Σ_{i=1..n} |ki|, which in practice is replaced by min Σ_{i=1..n} ki²,

where ki = yi - (a + b·xi).

So we now minimize F = Σ_{i=1..n} (yi - (a + b·xi))².

This is the principle of least squares, and the method of estimating a, b this way is called the least squares method.

First, take the partial derivatives of F with respect to a and b and set them to zero:

∂F/∂a = -2·Σ_{i=1..n} ( yi - (a + b·xi) ) = 0

∂F/∂b = -2·Σ_{i=1..n} xi·( yi - (a + b·xi) ) = 0

Now set:

X̄ = Σ_{i=1..n} xi / n,  Ȳ = Σ_{i=1..n} yi / n

Substituting into the equations above gives:

a·n + b·n·X̄ = n·Ȳ,  a·n·X̄ + b·Σ_{i=1..n} xi² = Σ_{i=1..n} xi·yi

The determinant of the coefficient matrix is:

n·Σ_{i=1..n} xi² - (n·X̄)² = n·Σ_{i=1..n} (xi - X̄)² ≠ 0

so there is a unique solution.

Finally, write:

l(xx) = Σ_{i=1..n} (xi - X̄)²,  l(yy) = Σ_{i=1..n} (yi - Ȳ)²,  l(xy) = Σ_{i=1..n} (xi - X̄)·(yi - Ȳ)

Then:

b = l(xy) / l(xx),  a = Ȳ - b·X̄
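A small sketch of the closed-form solution above (least_squares_line is an illustrative name):

```python
def least_squares_line(xs, ys):
    """Fit y = a + b x with b = l(xy)/l(xx), a = Y_bar - b X_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    l_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = l_xy / l_xx
    a = y_bar - b * x_bar
    return a, b

# Example: points on y = 1 + 2x recover a = 1, b = 2.
print(least_squares_line([0, 1, 2, 3], [1, 3, 5, 7]))
```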

EM

EM is a maximum likelihood estimation method for probabilistic models with hidden (latent) variables. It is generally divided into two steps: the expectation step (E) and the maximization step (M).

If all variables of the probabilistic model are observed, then given the data the model parameters can be estimated directly by maximum likelihood or Bayesian estimation.

However, when the model contains hidden variables, these methods cannot be applied so simply; EM is a maximum likelihood estimation method for model parameters in the presence of hidden variables.

Where it is applied: Gaussian mixture models, mixtures of naive Bayes models, factor analysis models.

Bagging

1. Draw n samples with replacement from the n available samples;

2. Build a classifier (e.g. CART or SVM) on all attributes of these n samples;

3. Repeat the above steps to build m classifiers;

4. At prediction time, use voting to obtain the result.
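A minimal sketch of this procedure; base_fit is a hypothetical callable that trains any base learner (e.g. a CART or an SVM) and returns a predict function:

```python
import random
from collections import Counter

def bagging_predict(train_x, train_y, x, base_fit, n_models=25):
    """Train each base model on a bootstrap sample (drawn with replacement), then vote."""
    n = len(train_x)
    votes = []
    for _ in range(n_models):
        idx = [random.randrange(n) for _ in range(n)]          # sample n with replacement
        model = base_fit([train_x[i] for i in idx], [train_y[i] for i in idx])
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]                 # majority vote
```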

Boosting

Boosting assigns a weight to each training sample and makes the loss function pay more attention to the misclassified samples (for example, by increasing the weights of the misclassified samples).

Convex optimization

Machine learning often involves finding the optimal value of a function. In general, the optimum of an arbitrary function is hard to find, but for convex functions the global optimum can be found efficiently.

Convex set

A set C is convex if and only if for any x, y ∈ C and any 0 ≤ θ ≤ 1, θ·x + (1-θ)·y ∈ C.

In layman's terms, the segment between any two points of C also lies in C.

Convex function

A function f is convex if its domain D(f) is a convex set and, for any x, y ∈ D(f) and 0 ≤ θ ≤ 1,

f(θ·x + (1-θ)·y) ≤ θ·f(x) + (1-θ)·f(y)

In layman's terms, any chord between two points on the curve lies above the curve.

The common convex functions are:

    • Exponential function f(x) = a^x, a > 1
    • Negative logarithm function -log_a(x), a > 1, x > 0
    • Quadratic function opening upward

Tests for convexity:

1. If f is once differentiable, f is convex if and only if, for any x, y in the domain, f(y) ≥ f(x) + f′(x)·(y - x);

2. If f is twice differentiable, f is convex if and only if its second derivative (Hessian) is positive semidefinite.

Examples of convex optimization applications

    • SVM: converting max 1/||w|| into min(½·||w||²)
    • Least squares?
    • The LR loss function -Σi [ yi·log(h_w(xi)) + (1 - yi)·log(1 - h_w(xi)) ]

Reference

[1]. http://www.cnblogs.com/leoo2sk/archive/2010/09/17/naive-bayesian-classifier.html

[2]. http://www.cnblogs.com/biyeymyhjob/archive/2012/07/18/2595410.html

[3]. http://blog.csdn.net/abcjennifer/article/details/7716281

[4]. http://ufldl.stanford.edu/wiki/index.php/Softmax%E5%9B%9E%E5%BD%92

[5]. Statistical Learning Methods, Li Hang.
