Machine Learning Common Algorithms: A Personal Summary (for Interviews) [Reprint]

Source: Internet
Author: User
Tags id3 svm

Naive Bayes

Reference [1]

The probability that events A and B occur together can be written in terms of B occurring given A, or of A occurring given B:
P(A∩B) = P(A) * P(B|A) = P(B) * P(A|B)

So there are:
P(A|B) = P(B|A) * P(A) / P(B)

For a given item to be classified, compute the probability of each target category conditioned on that item; the category with the largest probability is taken to be the item's class.

Working principle
    1. Suppose there is a sample x = (a1, a2, a3, ..., an) to be classified (and assume the features in x are mutually independent)
    2. Suppose the set of target categories is y = {y1, y2, y3, ..., yn}
    3. Then the final category is the one with the largest posterior: max(P(y1|x), P(y2|x), P(y3|x), ..., P(yn|x))
    4. By Bayes' theorem, P(yi|x) = P(x|yi) * P(yi) / P(x)
    5. Because P(x) is the same for every target category, it suffices to maximize P(x|yi) * P(yi)
    6. Under the independence assumption, P(x|yi) * P(yi) = P(yi) * ∏_j P(aj|yi)
    7. The specific P(aj|yi) and P(yi) can be estimated by counting over the training sample
      P(aj|yi) is the probability of feature value aj occurring within category yi
      P(yi) is the proportion of category yi among all training samples
    8. Okay, that's the way it works. ^_^
Work flow
    1. Preparation phase
      Determine the feature attributes and partition each of them appropriately, then manually label a portion of the items to be classified to form the training set.
    2. Training phase
      Compute the frequency of each category in the training set and the conditional probability estimate of each feature-attribute partition for each category.
    3. Application phase
      Classify with the classifier: the input is the classifier and the sample to be classified, and the output is the category the sample belongs to.
Attribute characteristics
    1. When a feature is discrete, use direct counting (i.e. the empirical probability)
    2. When a feature is continuous, assume it follows a Gaussian distribution g(x, μ, σ):
      then P(ak|yi) = g(xk, μ_yi, σ_yi)
Laplace smoothing (Laplace calibration)

When a feature partition never appears within some category, we get P(a|y) = 0, which degrades the quality of the classifier. To address this, Laplace smoothing is introduced: add 1 to the count of every partition under every category.
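
To make the working principle and the Laplace correction concrete, here is a minimal sketch of a discrete naive Bayes classifier in Python; the class name, helper structure, and toy data are made up for illustration:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal discrete naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)               # counts for P(yi)
        self.n = len(y)
        self.feature_counts = defaultdict(Counter)   # (class, feature index) -> value counts, for P(aj|yi)
        self.feature_values = defaultdict(set)       # distinct values seen per feature
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.feature_counts[(yi, j)][v] += 1
                self.feature_values[j].add(v)
        return self

    def predict(self, x):
        best_class, best_score = None, float("-inf")
        for c in self.classes:
            # use log probabilities to avoid underflow when multiplying many terms
            score = math.log(self.class_counts[c] / self.n)
            for j, v in enumerate(x):
                counts = self.feature_counts[(c, j)]
                num = counts[v] + 1                                  # Laplace: add 1 to every count
                den = self.class_counts[c] + len(self.feature_values[j])
                score += math.log(num / den)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# toy example: features = (owns_house, has_job), target = repays debt
X = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "no")]
y = ["repay", "repay", "repay", "default"]
print(NaiveBayes().fit(X, y).predict(("no", "no")))
```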

Features that are not mutually independent

Use a Bayesian network instead, which describes the joint probability with a DAG (directed acyclic graph).

Advantages and Disadvantages

The advantages of Naive Bayes:

    1. Works well on small-scale data, handles multi-class tasks, and supports incremental training.

Disadvantages:

    1. Sensitive to the form of the input data (discrete vs. continuous, very small values, etc.).
Logistic regression and linear regression

reference [2,3,4]

LR (logistic regression) is a linear binary classification model. It mainly computes the probability that an event occurs given a sample's features, for example using a user's browsing and purchase history as features to predict whether the user will buy (or click on) a product. LR's output is the sigmoid of a linear function: the weighted sum of the feature values plus a bias. So training LR means learning the weight w of each feature in that linear function.

h_w(x) = 1 / (1 + e^-(w^T x + b))

The weights w are generally estimated by maximum likelihood. Suppose we have samples {x_i, y_i}, where x_i are the sample's features and y_i ∈ {0, 1} is the sample's true class. If the probability that y_i = 1 is p_i, then the probability that y_i = 0 is 1 - p_i, and the probability of the observation is:
P(y_i) = p_i^y_i * (1 - p_i)^(1 - y_i)

The maximum likelihood function is:
∏_i [ h_w(x_i)^y_i * (1 - h_w(x_i))^(1 - y_i) ]

Taking the logarithm of the likelihood gives:
L(w) = Σ_i [ y_i * log h_w(x_i) + (1 - y_i) * log(1 - h_w(x_i)) ] = Σ_i [ y_i * (w^T x_i) - log(1 + e^(w^T x_i)) ]

w is then estimated by maximizing L(w).

In practice, a negative sign is usually added and the function is minimized instead.

So fitting LR becomes an optimization problem over the (negative) log-likelihood, which is usually solved with stochastic gradient descent or quasi-Newton methods.
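
To make the model and its objective concrete, here is a small NumPy sketch of the hypothesis h_w(x) and the (negative) log-likelihood described above; the data shapes and variable names are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(w, b, X):
    """h_w(x) = sigmoid(w^T x + b), computed for every row of X."""
    return sigmoid(X @ w + b)

def neg_log_likelihood(w, b, X, y):
    """J(w) = -(1/n) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    p = h(w, b, X)
    eps = 1e-12                      # avoid log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# toy check with random data (shapes are illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
w, b = np.zeros(3), 0.0
print(neg_log_likelihood(w, b, X, y))   # about log(2) ≈ 0.693 for zero weights
```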

Gradient Descent method

The LR loss function is:
J(w) = -(1/n) * Σ_{i=1..n} [ y_i * log h_w(x_i) + (1 - y_i) * log(1 - h_w(x_i)) ]

The goal becomes min J(w).
The process of updating W is

w := w - α * ∇J(w)
w := w - α * (1/n) * Σ_{i=1..n} (h_w(x_i) - y_i) * x_i

where α is the step size; iterate until J(w) no longer decreases, then stop.

The biggest problem with (batch) gradient descent is that it can fall into a local optimum, and each update of the cost requires traversing all samples, so computation is slow (although the update of the whole w can be written as a matrix multiplication).
So many frameworks (e.g. mahout) use stochastic gradient descent instead: the cost is computed only for the current sample (the total cost is still accumulated over a pass through the whole sample set), and w is updated not by traversing the samples in order but by randomly selecting one sample at a time. This converges faster (usually a maximum number of iterations is used), helps avoid local optima, and is easy to parallelize (e.g. with a parameter server).
w := w - α * (h_w(x_j) - y_j) * x_j ;  j chosen randomly from 1..n

One improvement to SGD is to use a dynamic step size:
α = 0.04 * (1.0 + n + i) + r
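
A minimal sketch of stochastic gradient descent for LR, following the single-sample update w := w - α * (h_w(x_j) - y_j) * x_j above; the decaying step size, toy data, and hyper-parameter values are illustrative choices, not taken from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, epochs=50, alpha0=0.1, seed=0):
    """Train LR with SGD: pick a random sample, take one gradient step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for epoch in range(epochs):
        for _ in range(n):
            j = rng.integers(n)                   # randomly selected sample
            pred = sigmoid(X[j] @ w + b)
            err = pred - y[j]                     # h_w(x_j) - y_j
            alpha = alpha0 / (1.0 + epoch)        # one simple decaying step size
            w -= alpha * err * X[j]
            b -= alpha * err
    return w, b

# toy data: the label depends on the first feature
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)
w, b = sgd_logistic(X, y)
print("weights:", w, "train accuracy:", np.mean((sigmoid(X @ w + b) > 0.5) == y))
```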

Other optimization methods
    • Quasi-Newton method (remember to use Hessian matrix and Cholesky decomposition)
    • BFGS
    • L-BFGS

Pros and Cons: No need to choose the learning rate α, faster, but more complex

Over-fitting problems with LR:

If there are many features, the model may fit the training set well but fail to achieve the same effect on the prediction set.

    1. Reduce the number of features (either manually choose which features to keep, or let an algorithm select them)
    2. Regularization (L2 is used more often because it is easier to solve)
      After adding regularization, the loss function is: J(w) = -(1/n) * Σ_{i=1..n} [ y_i * log h_w(x_i) + (1 - y_i) * log(1 - h_w(x_i)) ] + λ||w||^2
      The update of w then becomes w := w - α * (h_w(x_j) - y_j) * x_j - 2αλ * w_j
      Note: the bias term w0 is not affected by the regularization
Multi-Classification of LR: Softmax

Assume the discrete random variable y takes values in {1, ..., k}. The multi-class LR is:
P(y = a | x) = exp(w_a · x) / Σ_{i=1..k} exp(w_i · x) ;  1 ≤ a ≤ k

This outputs the probability that the current sample belongs to each class, and the probabilities sum to 1.
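
A small sketch of the softmax probabilities above, with one weight vector w_a per class; the shapes and the max-subtraction stability trick are illustrative additions:

```python
import numpy as np

def softmax_probs(W, X):
    """P(y = a | x) = exp(w_a . x) / sum_i exp(w_i . x) for each row of X.
    W has shape (k, d), one weight vector per class; X has shape (n, d)."""
    scores = X @ W.T                              # (n, k) matrix of w_a . x
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)       # each row sums to 1

# toy check: 3 classes, 2 features
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
X = np.array([[2.0, 0.5], [-1.0, -2.0]])
P = softmax_probs(W, X)
print(P, P.sum(axis=1))    # each row of probabilities sums to 1
```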

About the selection of Softmax and K-LR

If the categories are mutually exclusive (for example, a piece of music can only be one of classical, country, or rock), use softmax.
If the categories are not mutually exclusive (for example, a song may be a film soundtrack, contain vocals, and be dance music at the same time), using K separate binary LRs is more appropriate.

Advantages and Disadvantages
Logistic regression Benefits:

    1. Simple to implement;
    2. Classification requires very little computation; it is fast and has low storage requirements;

Disadvantages:

    1. Prone to underfitting; accuracy is generally not very high
    2. Can only handle binary classification directly (the softmax derived from it handles multi-class), and the data must be linearly separable;

PS: for more on LR, and on its multi-class extension, see references [2,3,4].

KNN algorithm

Given a training dataset and a new instance, find the K training instances closest to the new instance, and take the class that appears most often among those K neighbors as the class of the new instance.

Three elements:
    1. Selection of K-values
    2. Distance metric (common choices are Euclidean distance, Mahalanobis distance, etc.)
    3. Classification decision rules (majority voting rules)
Selection of K-values
    1. A smaller K value means a more complex model that is easier to overfit
    2. A larger K value means a simpler model; if K = N, the prediction is simply the majority class of the whole training set

So K is generally taken small and then determined by cross-validation.
Cross-validation here means splitting the samples into a training part and a validation part, e.g. 95% training and 5% validation, then trying K = 1, 2, 3, 4, 5, and so on, computing the classification error for each, and choosing the K with the smallest error.

KNN for regression

After finding the K nearest instances, take the average of their values as the prediction. Alternatively, take a weighted average, with each weight inversely proportional to the distance (the closer the neighbor, the larger its weight).
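
A minimal KNN sketch covering the classification rule (majority vote) and the regression variant (plain or distance-weighted average); the helper names and toy data are made up for illustration:

```python
import numpy as np
from collections import Counter

def knn_neighbors(X_train, x, k):
    """Indices of the k training points closest to x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return np.argsort(dists)[:k], dists

def knn_classify(X_train, y_train, x, k=3):
    idx, _ = knn_neighbors(X_train, x, k)
    return Counter(y_train[i] for i in idx).most_common(1)[0][0]   # majority vote

def knn_regress(X_train, y_train, x, k=3, weighted=False):
    idx, dists = knn_neighbors(X_train, x, k)
    if not weighted:
        return np.mean([y_train[i] for i in idx])
    w = 1.0 / (dists[idx] + 1e-8)          # weight inversely proportional to distance
    return np.sum(w * np.array([y_train[i] for i in idx])) / np.sum(w)

X = np.array([[0.0], [1.0], [2.0], [10.0]])
print(knn_classify(X, np.array(["a", "a", "a", "b"]), np.array([1.5]), k=3))
print(knn_regress(X, np.array([0.0, 1.0, 2.0, 10.0]), np.array([1.5]), k=3, weighted=True))
```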

Advantages and Disadvantages

The advantages of KNN algorithm:

    1. Simple idea and mature theory; can be used for both classification and regression;
    2. Can be used for nonlinear classification;
    3. The training time complexity is O(n);
    4. High accuracy; makes no assumptions about the data; not sensitive to outliers;

Disadvantages:

    1. Large computational capacity;
    2. Sample imbalance problem (that is, there are a large number of samples in some categories, while the number of other samples is very small);
    3. Requires a lot of memory;
KD Tree

A KD tree is a binary tree that represents a partition of k-dimensional space and supports fast retrieval (so that KNN does not need to compute distances to every sample).

Constructing KD Tree

The tree is built by recursively splitting the k-dimensional space at the median of one coordinate at a time.
Suppose there is a dataset in k-dimensional space, T = {x1, x2, x3, ..., xn}, with xi = (a1, a2, a3, ..., ak).

    1. First construct the root node: split coordinate a1 at its median b into two regions, region 1 with a1 < b and region 2 with a1 > b
    2. Then construct the child nodes: within each of the two regions, split again at the median of a2, producing the nodes of depth 1 (if a2 equals the median, the instance falls on the splitting hyperplane)
    3. Repeat step 2; for a node at depth j, split on coordinate a_i with i = j % k + 1, until the two sub-regions contain no instances (a minimal construction sketch follows this list)
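
A minimal sketch of the construction procedure, splitting on coordinate depth % k at the median point (a common variant of the median-split rule above); the class and data are illustrative:

```python
class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def build_kd_tree(points, depth=0):
    """Recursively split on coordinate depth % k at the median point."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k                               # cycle through the k coordinates
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                         # the median point becomes this node
    return KDNode(points[mid], axis,
                  left=build_kd_tree(points[:mid], depth + 1),
                  right=build_kd_tree(points[mid + 1:], depth + 1))

def print_tree(node, indent=""):
    if node is None:
        return
    print(indent + f"axis={node.axis} point={node.point}")
    print_tree(node.left, indent + "  ")
    print_tree(node.right, indent + "  ")

# classic 2-D example
pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
print_tree(build_kd_tree(pts))
```
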
KD Tree Search
    1. Starting from the root node, recursively descend to the leaf node that contains x, comparing the corresponding coordinate x_i at each level
    2. Treat this leaf node as the current "approximate nearest point"
    3. Recursively back up: if the sphere centered at x with radius equal to the distance to the approximate nearest point intersects the boundary of the sibling sub-region, a closer point may exist there, so search that sub-region and update the approximate nearest point if a closer one is found
    4. Repeat step 3 until the sibling sub-region does not intersect the sphere or the root node has been reached
    5. The last updated "approximate nearest point" is the true nearest point to x
kd Tree for KNN search

By searching the KD tree to find the closest point to the search target, the KNN search can be confined to the local area of the space, which can greatly increase the efficiency.

The complexity of KD tree search

When instances are randomly distributed, the search complexity is O(log n), where n is the number of instances. KD trees are better suited to KNN search when the number of instances is much larger than the space dimension; if the dimension is comparable to the number of instances, the efficiency degrades to roughly that of a linear scan.

I later implemented a KD tree myself; see my post on applying the KNN algorithm with a KD tree.

SVM, SMO

For a sample point (x_i, y_i) and the SVM hyperplane w^T x + b = 0:

    • Functional margin: y_i (w^T x_i + b)
    • Geometric margin: y_i (w^T x_i + b) / ||w||, where ||w|| is the L2 norm of w; the geometric margin does not change when the parameters are rescaled

The basic idea of SVM is to find the hyperplane that correctly separates the training samples and maximizes the geometric margin.

Linear SVM problem

First look at the SVM problem:
argmax_{w,b} γ    s.t.  y_i (w^T x_i + b) / ||w|| ≥ γ

Now let γ̂ = γ * ||w||.
Then the problem becomes:

argmax_{w,b} γ̂ / ||w||    s.t.  y_i (w^T x_i + b) ≥ γ̂

Because rescaling γ̂ does not affect the actual margin, we can set γ̂ = 1; and since maximizing 1/||w|| is equivalent to minimizing (1/2)*||w||^2,
the final problem becomes
argmin_{w,b} (1/2)*||w||^2    s.t.  y_i (w^T x_i + b) ≥ 1

This is a convex quadratic program, which can be converted to a Lagrangian and then solved with the dual algorithm.

Dual Solution

Introduce Lagrange multipliers α = {α1, α2, ..., αn} and define the Lagrangian:
L(w, b, α) = (1/2)*||w||^2 - Σ_{i=1..n} α_i * y_i (w^T x_i + b) + Σ_{i=1..n} α_i

The dual of the primal problem is the max-min problem:
max_α min_{w,b} L(w, b, α)

First minimize over w and b, then maximize over α.
For min_{w,b} L(w, b, α), set the derivatives with respect to w and b to 0:
∇_w L(w, b, α) = w - Σ_{i=1..n} α_i y_i x_i = 0
∇_b L(w, b, α) = Σ_{i=1..n} α_i y_i = 0

Substituting back gives:
min_{w,b} L(w, b, α) = -(1/2) * Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j (x_i · x_j) + Σ_{i=1..n} α_i

Maximizing min_{w,b} L(w, b, α) over α gives the dual problem:
max_α  -(1/2) * Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j (x_i · x_j) + Σ_{i=1..n} α_i
s.t.  Σ_{i=1..n} α_i y_i = 0,  α_i ≥ 0, i = 1, 2, 3, ..., n

Converting the maximization into a minimization gives the equivalent form:
min_α  (1/2) * Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j (x_i · x_j) - Σ_{i=1..n} α_i
s.t.  Σ_{i=1..n} α_i y_i = 0,  α_i ≥ 0, i = 1, 2, 3, ..., n

If the solution is α* = (α*_1, α*_2, ..., α*_n),
the optimal w and b are
w* = Σ_{i=1..n} α*_i y_i x_i
b* = y_j - Σ_{i=1..n} α*_i y_i (x_i · x_j)    (for any j with α*_j > 0)

Therefore, the final decision function is
f(x) = sign( Σ_{i=1..n} α*_i y_i (x · x_i) + b* )

In other words, the categorical decision function relies only on the inner product of input x and the training sample.

PS: the above is the SVM with hard-margin maximization. There is also soft-margin maximization, which introduces slack variables ζ_i; the SVM problem then becomes:
argmin_{w,b} (1/2)*||w||^2 + C * Σ_{i=1..n} ζ_i    s.t.  y_i (w^T x_i + b) ≥ 1 - ζ_i,  ζ_i ≥ 0, i = 1, 2, ..., n


The rest of the derivation is the same as in the hard-margin case.

Also: the sample points closest to the separating hyperplane are called support vectors.

Loss function

The loss function (optimization objective) is:
Σ_{i=1..n} [1 - y_i (w^T x_i + b)]_+  +  λ||w||^2

where [1 - y_i (w^T x_i + b)]_+ is called the hinge loss, because:
[1 - y_i (w^T x_i + b)]_+ = 0  if 1 - y_i (w^T x_i + b) ≤ 0;  otherwise it equals 1 - y_i (w^T x_i + b)
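
The hinge-loss objective above can also be minimized directly by (sub)gradient descent, which is a different route from the dual/SMO derivation in the text; the following is only a rough sketch with made-up data and hyper-parameters:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.001):
    """Minimize sum_i [1 - y_i (w.x_i + b)]_+ + lam * ||w||^2 by subgradient descent.
    Labels y must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                              # samples with non-zero hinge loss
        grad_w = -(y[active][:, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy linearly separable data (for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2, size=(50, 2)), rng.normal(loc=-2, size=(50, 2))])
y = np.array([1.0] * 50 + [-1.0] * 50)
w, b = train_linear_svm(X, y)
print("train accuracy:", np.mean(np.sign(X @ w + b) == y))
```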

Why to introduce dual algorithm
    1. The dual problem is often easier to solve (using the Lagrangian and KKT conditions)
    2. It naturally introduces the kernel function (the dual expression contains inner products, and the kernel acts through those inner products)
Kernel function

By mapping the input features x (which are not linearly separable) into a high-dimensional feature space R, the SVM problem can become linearly separable in R; this is what the kernel function accomplishes.

    • Polynomial kernel: K(x, z) = (x · z + 1)^p
    • Gaussian kernel: K(x, z) = exp( -||x - z||^2 / (2σ^2) )
    • String kernel: apparently used for string processing, etc. (sketches of the first two follow)
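
Minimal sketches of the polynomial and Gaussian (RBF) kernels listed above; the degree p and bandwidth σ values are illustrative:

```python
import numpy as np

def polynomial_kernel(x, z, p=2):
    """K(x, z) = (x . z + 1)^p"""
    return (np.dot(x, z) + 1.0) ** p

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2*sigma^2))"""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(x, z), gaussian_kernel(x, z))
```
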
Advantages and disadvantages of SVM

Advantages:

    1. Use kernel functions to map to high-dimensional spaces
    2. Use kernel functions to solve non-linear classifications
    3. The idea of classification is simple, which is to maximize the interval between the sample and the decision surface.
    4. Good classification effect

Disadvantages:

    1. Training is difficult for large-scale data
    2. Does not directly support multi-class classification, but indirect methods can be used
SMO

SMO is used to solve the SVM quickly.
It picks two variables of the convex quadratic program and holds the others fixed, then builds a quadratic sub-problem in just those two variables. The sub-problem's solution is closer to the original quadratic program's solution, and decomposing the problem this way greatly speeds up the overall algorithm. Regarding the two chosen variables:

    1. One of them is a variable that seriously violates the KKT condition.
    2. The other variable is determined by the (equality) constraint, apparently chosen so as to maximize the step for that remaining variable.
Multi-Classification of SVM
    1. Direct method
      Modify the objective function directly so that the parameters of multiple classification surfaces are solved together; multi-class classification is then obtained by solving that single optimization problem (the computational complexity is very high and it is difficult to implement).
    2. Indirect method
      1. One-to-many
        Take one class as the positive class and the remaining N-1 classes as the negative class. For example, with four classes A, B, C, D: first train a classifier with A as one class and {B, C, D} as the other, then B versus {A, C, D}, and so on, so 4 classifiers are trained in total. At test time the sample is passed through the 4 classifiers f1(x), f2(x), f3(x), f4(x) and the class whose classifier gives the largest value is taken. (Because each classifier is 1-vs-many, the classes are imbalanced, so this is rather impractical.)
      2. One-to-one (LIBSVM implementation method)
        Train a classifier for every pair of classes, so n classes require n(n-1)/2 SVM classifiers.
        Taking A, B, C, D as an example, classifiers are needed for {A,B}, {A,C}, {A,D}, {B,C}, {B,D}, {C,D}, 6 in total; the test sample is passed through all 6 classifiers and the final result is decided by voting. (This method works well, but needing n(n-1)/2 classifiers is expensive; it seems this can be improved with a graph-based arrangement of the classifiers.)
Decision Tree

A decision tree is a tree built on decision-making.

ID3
    1. First compute the information gain of each feature on the current data set
    2. Then select the feature with the largest information gain as the decision feature of the current node
    3. Split into child nodes according to the different values of that feature (for example, if the age feature has values youth, middle-aged, and old, split into 3 subtrees)
    4. Then recurse on the child nodes until all features have been used for splitting

S(c, a_i) = -Σ_i p_i * log(p_i)

This is the entropy of the classes within one attribute value, with p_i = P(y_i | a_i), the empirical probability of class y_i given attribute value a_i.

S(c, A) = Σ_i P(A = a_i) * S(c, a_i)

The entropy of the whole attribute is the weighted sum of the entropies of its values, weighted by the proportion of samples taking each value.

Gain(c, A) = S(c) - S(c, A)

The gain is the entropy of the classification target minus the conditional entropy given the current attribute; the larger the gain, the stronger the attribute's classification ability
(the former is called the empirical entropy and measures the uncertainty of classifying the data set c; the latter is the empirical conditional entropy and measures the uncertainty of classifying c given A; their difference is called the mutual information, and the information gain used in decision trees equals this mutual information).
For example, suppose the current attribute is whether a person owns a house, and the classification target is whether they can repay their debt.
Right now:

    • 7 people own a house: 4 can repay their debt, 3 cannot
    • 3 people do not own a house: 1 can repay the debt, 2 cannot

Then:
Entropy with a house: S(have_house) = -(4/7 * log(4/7) + 3/7 * log(3/7))
Entropy without a house: S(no_house) = -(1/3 * log(1/3) + 2/3 * log(2/3))
Entropy of the classification: S(classifier) = -(5/10 * log(5/10) + 5/10 * log(5/10))
The final gain = S(classifier) - (7/10 * S(have_house) + 3/10 * S(no_house)); the larger the gain, the better
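
The house/debt numbers above, reproduced in a short sketch (base-2 logarithms are assumed; the text does not state the base):

```python
import math

def entropy(probs):
    """S = -sum(p * log2(p)), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

s_have_house = entropy([4/7, 3/7])      # 7 samples own a house: 4 repay, 3 default
s_no_house   = entropy([1/3, 2/3])      # 3 samples don't: 1 repays, 2 default
s_classifier = entropy([5/10, 5/10])    # overall: 5 repay, 5 default

gain = s_classifier - (7/10 * s_have_house + 3/10 * s_no_house)
print(round(s_classifier, 4), round(gain, 4))
```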

About loss functions
Let the tree have |T| leaf nodes; t is one leaf node with N_t samples, of which N_tk belong to class k, and H(t) is the empirical entropy at leaf t. The loss function is defined as
C_λ(T) = Σ_t N_t * H(t) + λ|T|

where
H(t) = -Σ_k (N_tk / N_t) * log(N_tk / N_t)

Substituting gives
C_λ(T) = -Σ_t Σ_k N_tk * log(N_tk / N_t) + λ|T|

λ|T| is a regularization term; λ adjusts the trade-off between fit and tree size.
Decision-tree generation itself only considers the information gain.

C4.5

It is an improvement on ID3 that uses the information gain ratio to select attributes.
SplitInformation(S, A) = -Σ_i (|S_i| / |S|) * log2(|S_i| / |S|)
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

Advantages and Disadvantages
The accuracy is high, but the data must be scanned and sorted several times while building the tree, so it is inefficient.

CART

CART (classification and regression tree) is a binary decision tree built recursively; each split aims to divide the remaining samples into two groups in the best way. The splitting criteria are:

    1. Classification Tree: Gini index minimization (gini_index)
    2. Regression tree: Minimization of squared errors

Classification Tree:

    1. First compute the Gini gain of each current feature
    2. Select the feature with the smallest Gini gain as the splitting feature
    3. Within that feature, find the value with the lowest Gini index as the optimal split point
    4. Split the current samples into two groups: those whose value on that feature equals the optimal split point, and those whose value does not
    5. Recurse on the two groups with the same splitting procedure until all samples in a leaf share the same target or the number of samples in a leaf falls below a threshold

Gini measures the impurity (heterogeneity) of a distribution: the more mixed the classes, the larger the Gini index (similar in spirit to entropy)
Gini(a_i) = 1 - Σ_i p_i^2

p_i is the proportion of class i samples in the current data set
The smaller the Gini, the purer the samples (0 means only one class); the larger, the more mixed
Gini gain: Gini_gain = Σ_i (N_i / N) * Gini(a_i)

This represents the impurity of the current attribute, where N_i / N is the proportion of the i-th partition among all samples
CART finally selects the feature with the minimum Gini_gain as the splitting feature

Using the loan example from the ID3 section:
Gini index with a house: Gini(have_house) = 1 - ((3/7)^2 + (4/7)^2)
Gini index without a house: Gini(no_house) = 1 - ((1/3)^2 + (2/3)^2)
Gini gain: Gini_gain = 7/10 * Gini(have_house) + 3/10 * Gini(no_house)
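
The same loan example computed for the Gini criterion, as a minimal sketch:

```python
def gini(probs):
    """Gini = 1 - sum(p_i^2)"""
    return 1.0 - sum(p * p for p in probs)

g_have_house = gini([4/7, 3/7])                      # samples owning a house
g_no_house   = gini([1/3, 2/3])                      # samples without a house
gini_gain = 7/10 * g_have_house + 3/10 * g_no_house  # weighted impurity after the split
print(round(g_have_house, 4), round(g_no_house, 4), round(gini_gain, 4))
```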

Regression tree:

A regression tree splits the data into two regions by minimizing the squared error

    1. Traverse each feature and compute its optimal split point s.
      The squared error to minimize is: min_{j,s} { min_{c1} Σ_{x_i ∈ R1} (y_i - c1)^2 + min_{c2} Σ_{x_i ∈ R2} (y_i - c2)^2 }
      That is, compute the sum of squared differences between targets and predictions in the left and right subtrees produced by splitting at s, where the prediction on each subtree is the mean of the y_i of the samples x_i falling in it
    2. Find the feature j and split point s with the smallest error, and divide the samples into two regions according to j and s: one with values less than or equal to s on feature j, the other with values greater than s
      R1(j, s) = {x | x^(j) ≤ s},  R2(j, s) = {x | x^(j) > s}
    3. Continue splitting the two sub-regions as above until the stopping condition is reached

If I remember correctly, the minimum here can be found with the least squares method (a split-finding sketch follows).
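
A minimal sketch of finding the best (feature, split point) pair by minimizing the squared error, as in steps 1-2 above; the data and function names are illustrative:

```python
import numpy as np

def best_split(X, y):
    """Return (feature j, threshold s, error) minimizing the sum of squared errors,
    where each side is predicted by the mean of its y values."""
    n, d = X.shape
    best = (None, None, float("inf"))
    for j in range(d):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.9])
print(best_split(X, y))    # expect a split around x <= 3 on feature 0
```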

About pruning: use a separate validation data set to prune (post-prune) the tree grown on the training set.

Stop condition
    1. Stop when each leaf node contains only one type of record (this method easily overfits)
    2. Alternatively, stop when the number of samples in a leaf node falls below a threshold, or when the node's information gain falls below a threshold
About features and target values
    1. Discrete features, discrete target values: ID3 and CART can be used
    2. Continuous features, discrete target values: discretize the continuous features, then ID3 and CART can be used
    3. Discrete features, continuous target values
Classification and regression of decision trees
    • Classification Tree
      Output the majority class among the samples in the leaf node
    • Regression tree
      Output the average of the sample values in the leaf node
The ideal decision tree
    1. As few leaf nodes as possible
    2. The depth of the leaf nodes should be as small as possible (too deep may overfit)
Solving the overfitting of decision Trees
    1. Pruning
      1. Pre-pruning: impose stricter conditions when splitting a node, and stop splitting as soon as they are not met (the resulting tree may not be optimal and may not perform as well)
      2. Post-pruning: after the tree is built, replace a subtree with a single node labeled with the subtree's majority class (this method wastes some of the earlier building work)
    2. Cross-validation
    3. Random Forest
Advantages and Disadvantages

Advantages:

    1. Computation is simple, the result is highly interpretable, it handles samples with missing attribute values, and it can deal with irrelevant features.

Disadvantages:

    1. A single decision tree has weak classification ability and has difficulty handling continuous-valued variables;
    2. Prone to overfitting (random forests, introduced later, reduce overfitting);
Random Forest RF

A random forest is made up of many randomized decision trees with no correlation between them. Once the RF is built, each decision tree makes its own judgment at prediction time, and the results are combined with the bagging idea (i.e. by voting).

Learning process
    1. There are N training samples, each with M features, and K trees are to be built
    2. Sample N samples with replacement from the N training samples as one tree's training set (the samples not drawn are used as a prediction set to assess its error)
    3. Randomly take a subset of about m features from the M features (m << M)
    4. Build a decision tree on the sampled data by splitting fully, so that each node either cannot be split further or contains samples that all belong to the same category
    5. Repeat the above process K times to build the forest
Predictive process
    1. Feed the sample to be predicted into each of the K trees
    2. If it is a classification problem, take the most frequent class by direct voting
    3. If it is a regression problem, use the mean of the trees' outputs as the result (a minimal sketch follows this list)
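
A minimal sketch of the learning and prediction loops above; scikit-learn's DecisionTreeClassifier is assumed as the base tree here (its per-split random feature subset plays the role of step 3), and all data and parameter values are illustrative:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier   # assumed base learner for this sketch

def train_forest(X, y, n_trees=25, seed=0):
    """Bootstrap-sample the data and grow one fully-split tree per sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                    # sample n points with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, x):
    """Each tree votes; the majority class wins."""
    votes = [t.predict(x.reshape(1, -1))[0] for t in trees]
    return Counter(votes).most_common(1)[0][0]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
trees = train_forest(X, y)
print(forest_predict(trees, np.array([1.0, 1.0, 0.0, 0.0, 0.0])))   # expect 1
```
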
Parameter issues
    1. Generally take m = sqrt(M)
    2. For the number of trees K, a few hundred to a few thousand are generally needed, also depending on the specific data (e.g. the number of features)
    3. The maximum depth of each tree (too deep may lead to overfitting?)
    4. The minimum sample count and minimum information gain on a node
Generalization error estimation

OOB (out-of-bag) samples are used to estimate the generalization error: the samples not drawn for each tree (about 36.8%) serve as its prediction samples; they are predicted with the built forest, and the ratio of mispredicted samples to the total number of prediction samples is the OOB error rate of the RF.

Learning Algorithms
    1. ID3 algorithm: handles discrete values
    2. C4.5 algorithm: handles continuous values
    3. CART algorithm: suitable for both discrete and continuous values?
About Cart

CART builds a classification tree by iteratively selecting features, so that each split divides the remaining data into two groups in the best possible way

Gini = 1 - Σ_i p_i^2, i.e. one minus the sum of the squared class probabilities
Classification problem: argmax(Gini - Gini_left - Gini_right)
Regression problem: argmax(Var - Var_left - Var_right)

Find the best feature f and the best threshold th: samples with value less than th go to the left subtree, those greater than th go to the right subtree

Advantages and Disadvantages
    1. Can handle a large number of features without needing feature selection
    2. After training, it can report which features are more important
    3. Training is fast
    4. Easy to parallelize
    5. Relatively simple to implement
GBDT

The essence of GBDT is that each tree is trained on the residual of the trees before it, i.e. the difference between the previous prediction and the true value.

The advantage of boosting here is that each step effectively increases the weight of wrongly predicted instances, while instances that are already predicted correctly have residuals tending to 0, so later trees can focus more on training against the instances that were predicted wrong.

Shrinkage

Shrinkage holds that approaching the result in many small steps is more likely to avoid overfitting than approaching it in one big step.
y(1~i) = y(1~i-1) + step * y_i

Just like we do the Internet, always first solve the needs of 60% of users, and then solve the needs of 35% of users, and finally pay attention to the needs of 5% people, so that the product can be gradually done.
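
A minimal sketch of residual fitting with shrinkage for regression, following y(1~i) = y(1~i-1) + step * y_i above; scikit-learn's DecisionTreeRegressor is assumed as the base tree, and all parameter values are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor   # assumed base learner for this sketch

def gbdt_fit(X, y, n_trees=50, step=0.1, max_depth=3):
    """Each tree fits the residual of the current prediction; the prediction
    advances by only `step` times the new tree's output (shrinkage)."""
    pred = np.zeros(len(y))
    trees = []
    for _ in range(n_trees):
        residual = y - pred                        # what the previous trees got wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += step * tree.predict(X)             # y(1~i) = y(1~i-1) + step * y_i
        trees.append(tree)
    return trees

def gbdt_predict(trees, X, step=0.1):
    return step * sum(t.predict(X) for t in trees)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)
trees = gbdt_fit(X, y)
print("train MSE:", np.mean((gbdt_predict(trees, X) - y) ** 2))
```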

Parameter adjustment
    1. Number of Trees 100~10000
    2. The depth of the leaves 3~8
    3. Learning Rate 0.01~1
    4. Max node tree on leaf 20
    5. Training Sample Ratio 0.5~1
    6. Training feature sampling ratio sqrt (num)
Advantages and Disadvantages

Advantages:

    1. High accuracy
    2. Can handle non-linear data
    3. Can handle multiple feature types
    4. Suitable for low-dimensional dense data

Disadvantages:

    1. Hard to parallelize (because consecutive trees depend on each other)
    2. More complicated when used for classification
Least squares method

The least squares method is a mathematical optimization technique that finds the best-fitting function by minimizing the squared error.
Suppose there are two-dimensional observations (x1, y1), (x2, y2), ..., (xn, yn), and we want to fit a line y = a + b*x.

Now set y_i = a + b*x_i + k_i. If there are a, b that make Σ_{i=1..n} |k_i| small, the line is a good fit.
So the problem is min Σ_{i=1..n} |k_i|, which in practice is replaced by min Σ_{i=1..n} k_i^2,
with k_i = y_i - (a + b*x_i).
So set F = Σ_{i=1..n} (y_i - (a + b*x_i))^2 and minimize it.

This is the least squares principle, and the method of estimating A, B is called least squares.

First take the partial derivatives of F with respect to a and b and set them to 0:
∂F/∂a = -2 * Σ_{i=1..n} (y_i - (a + b*x_i)) = 0

∂F/∂b = -2 * Σ_{i=1..n} x_i * (y_i - (a + b*x_i)) = 0

Now set:
x̄ = (Σ_{i=1..n} x_i) / n,  ȳ = (Σ_{i=1..n} y_i) / n

Substituting into the equations above:
a*n + b*n*x̄ = n*ȳ
a*n*x̄ + b*Σ_{i=1..n} x_i^2 = Σ_{i=1..n} x_i*y_i

The determinant of the coefficient matrix is:
| n     n*x̄             |
| n*x̄  Σ_{i=1..n} x_i^2 |  = n * Σ_{i=1..n} (x_i - x̄)^2 ≠ 0

So there is a unique solution.

Finally, writing:
l_xx = Σ_{i=1..n} (x_i - x̄)^2,  l_yy = Σ_{i=1..n} (y_i - ȳ)^2,  l_xy = Σ_{i=1..n} (x_i - x̄)(y_i - ȳ)

we get:

b = l_xy / l_xx,  a = ȳ - b*x̄

Baidu Library-Least squares
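
A quick numeric check of the closed-form solution b = l_xy / l_xx, a = ȳ - b*x̄, as a minimal sketch with made-up data:

```python
import numpy as np

def least_squares_fit(x, y):
    """Fit y = a + b*x using b = l_xy / l_xx and a = mean(y) - b*mean(x)."""
    x_bar, y_bar = x.mean(), y.mean()
    l_xx = ((x - x_bar) ** 2).sum()
    l_xy = ((x - x_bar) * (y - y_bar)).sum()
    b = l_xy / l_xx
    a = y_bar - b * x_bar
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x + np.array([0.1, -0.1, 0.05, 0.0, -0.05])   # roughly y = 2 + 3x
print(least_squares_fit(x, y))   # expect approximately (2, 3)
```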

EM

EM is maximum likelihood estimation for probabilistic models with latent (hidden) variables, generally divided into two steps: the first step computes an expectation (E), the second step maximizes it (M).
If all variables of the probabilistic model are observed, then given the data the model parameters can be estimated directly by maximum likelihood or Bayesian estimation.
However, when the model contains latent variables it cannot be estimated so simply; EM is a maximum likelihood estimation method for the parameters of probabilistic models containing latent variables.

Where applied: Mixed Gaussian model, mixed naive Bayesian model, factor analysis model

Bagging
    1. Sample N samples with replacement from the N samples
    2. Build a classifier on all attributes of these N samples (CART, SVM, ...)
    3. Repeat the above steps to create M classifiers
    4. Use voting methods to get results when predicting
Boosting

Boosting assigns a weight to each sample during training and makes the loss function pay more attention to the misclassified samples (for example, by increasing the weights of the misclassified samples).

Convex optimization

Machine learning often requires finding the optimum of a function, but in general the global optimum of an arbitrary function is hard to find; for convex functions, however, the global optimum can be found efficiently.

Convex set

A set C is convex if and only if, for any x, y in C and 0 ≤ θ ≤ 1, the point θ*x + (1-θ)*y also belongs to C.
In plain terms, the line segment between any two points of C lies entirely inside C.

Convex function

A function f is convex if its domain D(f) is a convex set and, for any x, y in D(f) and 0 ≤ θ ≤ 1,
f(θ*x + (1-θ)*y) ≤ θ*f(x) + (1-θ)*f(y)

In plain terms, the chord between any two points on the curve lies above the curve.

The common convex functions are:

    • Exponential function: f(x) = a^x, a > 1
    • Negative logarithm function: -log_a(x), a > 1, x > 0
    • Quadratic function opening upward

Tests for convexity:

    1. If f is once differentiable: for any x, y in the domain, f(y) ≥ f(x) + f'(x)(y - x)
    2. If f is twice differentiable: f''(x) ≥ 0 (in higher dimensions, the Hessian is positive semi-definite)
Examples of convex optimization applications
    • SVM: where maximizing the margin is converted to min (1/2)*||w||^2
    • Least squares?
    • The loss function of LR: -Σ_i [ y_i*log h_w(x_i) + (1-y_i)*log(1-h_w(x_i)) ]
Reference

[1] http://www.cnblogs.com/leoo2sk/archive/2010/09/17/naive-bayesian-classifier.html
[2] http://www.cnblogs.com/biyeymyhjob/archive/2012/07/18/2595410.html
[3] http://blog.csdn.net/abcjennifer/article/details/7716281
[4] http://ufldl.stanford.edu/wiki/index.php/softmax%e5%9b%9e%e5%bd%92
[5] Hang Li, "Statistical Learning Methods" (统计学习方法)

Reprinted from: http://www.kuqin.com/shuoit/20160419/351618.html
