A summary of 9 basic concepts and 10 basic algorithms for machine learning

Source: Internet
Author: User
Tags: svm

1. Basic Concepts:

  (1) 10-fold cross-validation: the English name is 10-fold cross-validation; it is a commonly used method for testing the accuracy of an algorithm. The data set is divided into 10 parts; in turn, 9 of them are used as training data and 1 as test data. Each run yields an accuracy (or error rate), and the average of the 10 results is an estimate of the algorithm's accuracy. It is usually necessary to run 10-fold cross-validation several times and average the results to obtain a reliable estimate.
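
A minimal sketch of 10-fold cross-validation, assuming scikit-learn is available; the synthetic data set and the LogisticRegression model are illustrative placeholders, not part of the original text:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# cv=10 splits the data into 10 parts: train on 9, test on 1, rotating 10 times.
scores = cross_val_score(model, X, y, cv=10)
print("per-fold accuracy:", scores)
print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))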

 (2) Maximum likelihood estimation: maximum likelihood estimation is a method from probability theory applied in statistics; it is one of the methods of parameter estimation. We know that a random sample follows a certain family of probability distributions, but the specific parameters are unknown; the parameters are estimated by running several experiments, observing the results, and inferring approximate parameter values from those results. The idea behind maximum likelihood estimation is to choose the parameter value under which the observed sample is most likely to appear. Since we would not pick a parameter that makes the sample a low-probability event, we simply take this parameter as an estimate of the true value.
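
A minimal sketch of maximum likelihood estimation, assuming the sample comes from a Gaussian distribution with unknown mean and variance; the simulated data are purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=1000)   # true parameters are "unknown" to the estimator

# For a Gaussian, the parameters that maximize the likelihood of the observed
# sample are the sample mean and the (biased) sample variance.
mu_hat = sample.mean()
sigma2_hat = ((sample - mu_hat) ** 2).mean()
print("MLE of mean:", mu_hat, "MLE of variance:", sigma2_hat)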

(3) In information theory, entropy is a measure of uncertainty. Shannon, the founder of information theory, proposed a probability-based statistical measure of information in his paper "A Mathematical Theory of Communication". He defined information as "something that eliminates uncertainty", and entropy is defined as the expected value of information.

PS: Entropy refers to the degree of disorder of a system. It has important applications in cybernetics, probability theory, number theory, astrophysics, the life sciences, and other fields, with more specific definitions in different disciplines; it is a very important parameter in many fields. The concept of entropy was proposed by Rudolf Clausius and applied in thermodynamics; later, Claude Shannon introduced it into information theory.

(4) The posterior probability is one of the basic concepts of information theory. In a communication system, after a message is received, the probability that a particular message was the one sent is called the posterior probability. More generally, the posterior probability is the probability re-estimated after the "result" information has been obtained, as in Bayes' formula: it answers the question of inferring the cause from the observed effect. The posterior probability is inseparably linked to the prior probability: the posterior must be computed from the prior, and the posterior probability is in fact a conditional probability.
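
A small numeric illustration of Bayes' rule with hypothetical numbers (the 0.01, 0.9, and 0.05 values are made up for illustration): the posterior P(H|X) is obtained from the prior P(H), the likelihood P(X|H), and the total probability P(X).

p_h = 0.01            # prior P(H), assumed
p_x_given_h = 0.9     # likelihood P(X|H), assumed
p_x_given_not_h = 0.05

p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # total probability
p_h_given_x = p_x_given_h * p_h / p_x                    # Bayes' rule
print(p_h_given_x)    # the prior, corrected after observing the "result" X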

 (5) PCA (Principal Component Analysis):

Pros: Reduce the complexity of your data and identify the most important features.

Cons: not always necessary, and useful information may be lost.

Applicable type: numeric data.

Technique type: dimensionality reduction.

Description: in PCA, the data is transformed from the original coordinate system to a new coordinate system, and the choice of the new coordinate system is determined by the data itself. The first new axis is chosen along the direction of greatest variance in the original data; the second new axis is chosen to be orthogonal to the first and to have the largest variance among the remaining directions. This process repeats as many times as there are features in the original data. It turns out that most of the variance is contained in the first few new axes, so the remaining axes can be ignored; that is, the data has been reduced in dimension. Besides PCA, other dimensionality reduction techniques include ICA (independent component analysis), factor analysis, and so on.
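
A minimal PCA sketch in plain NumPy, following the description above: center the data, find the orthogonal directions of maximum variance, and keep the top k. The random data and the choice k=2 are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features

X_centered = X - X.mean(axis=0)          # shift to the new coordinate origin
cov = np.cov(X_centered, rowvar=False)   # feature covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: the covariance matrix is symmetric

order = np.argsort(eigvals)[::-1]        # sort the new axes by explained variance
k = 2
components = eigvecs[:, order[:k]]       # the first k new, mutually orthogonal axes
X_reduced = X_centered @ components      # project: dimensionality reduced to k
print(X_reduced.shape)                   # (100, 2)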

  (6) Combining different classifiers: this kind of combination is called an ensemble method or a meta-algorithm.

  (7) Regression algorithms are very similar to classification algorithms, but their outputs differ: a classification algorithm outputs a nominal category value, while a regression method predicts a continuous value. That is, regression predicts concrete numeric data, whereas classification can only predict a category.

 (8) SVD (Singular Value Decomposition):

Advantages: Simplifying data, removing noise, and improving the results of the algorithm.

Cons: Data conversion can be difficult to understand.

Applicable data type: numeric data.

PS: SVD is a type of matrix factorization.

Summary: SVD is a powerful dimensionality reduction tool; we can use SVD to approximate a matrix and extract its important features. By retaining 80%~90% of the matrix's energy, the important features are kept and the noise is removed. SVD has been used in many applications; one successful case is the recommendation engine. A recommendation engine recommends items to users, and collaborative filtering is a way of implementing recommendations based on users' preferences and behavioral data. The core of collaborative filtering is the similarity calculation; many similarity measures can be used to compute the similarity between items or between users. By computing similarity in the low-dimensional space, SVD improves the effectiveness of the recommendation engine.
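
A minimal sketch of SVD-based low-rank approximation in NumPy: keep just enough singular values to preserve roughly 90% of the "energy" (the sum of squared singular values). The random matrix is an illustrative stand-in for, say, a user-item rating matrix.

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((20, 10))                      # e.g. 20 users x 10 items

U, sigma, Vt = np.linalg.svd(data, full_matrices=False)

energy = np.cumsum(sigma ** 2) / np.sum(sigma ** 2)
k = int(np.searchsorted(energy, 0.90)) + 1       # smallest rank reaching 90% of the energy

approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
print("rank kept:", k)
print("relative error:", np.linalg.norm(data - approx) / np.linalg.norm(data))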

 (9) Collinearity: in a linear regression model, exact or highly correlated relationships among the explanatory variables distort the model estimates or make them difficult to estimate accurately.

2. Basic algorithms

2.1 Logistic regression:

Advantages: low computational cost; easy to understand and implement.

Disadvantages: prone to underfitting; the classification accuracy may not be high.

Applicable data types: numeric and nominal data.

Category: Classification algorithms.

Applicable scenario: solving binary (two-class) classification problems.

Summary: the logistic regression algorithm is based on the sigmoid function; put differently, the sigmoid function is the logistic function used for this kind of regression. The sigmoid function is defined as sigmoid(z) = 1/(1 + exp(-z)); its range is (0, 1), so it can be used to build a classifier.

(The S-shaped curve of the sigmoid function is omitted here.)

The logistic regression model can be decomposed as follows: (1) First, the attribute values of the different dimensions are combined with the corresponding weights in a weighted sum:

The formula is: z = w0 + w1*x1 + w2*x2 + ... + wm*xm, where x1, x2, ..., xm are the features of a sample and m is the number of dimensions.

PS: this step is simply a linear regression. The weight vector w contains the values that must be learned from training; to solve for w, the likelihood function from maximum likelihood estimation is plugged into an optimization algorithm. The most commonly used optimization algorithm is gradient ascent.

It can be seen from the above that although the logistic (sigmoid) function is nonlinear, once the sigmoid mapping is removed the remaining steps are identical to linear regression.

(2) Then the linear target value z above is plugged into the sigmoid function, which yields values in the two ranges (0, 0.5) and (0.5, 1); how to treat values exactly equal to 0.5 is up to you. The data is thereby split into two kinds, which embodies the idea of binary classification.

Conclusion: the purpose of logistic regression is to find the best-fitting parameters of the nonlinear sigmoid function, and the parameters are solved by an optimization algorithm. Among optimization algorithms, gradient ascent is the most common, and it can in turn be simplified to stochastic gradient ascent.
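
A minimal sketch of logistic regression trained by (batch) gradient ascent, in the spirit of the steps above; the synthetic data, learning rate, and iteration count are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_ascent(X, y, alpha=0.01, iters=500):
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1 so w[0] plays the role of w0
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        error = y - sigmoid(X @ w)                  # labels are 0/1
        w = w + alpha * (X.T @ error)               # ascend the log-likelihood gradient
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)           # a linearly separable toy label

w = grad_ascent(X, y)
X1 = np.hstack([np.ones((X.shape[0], 1)), X])
pred = (sigmoid(X1 @ w) > 0.5).astype(float)
print("training accuracy:", (pred == y).mean())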

 2.2 SVM (Support Vector Machine):

Advantages: low generalization error, low computational cost, and results that are easy to interpret.

Cons: sensitive to parameter tuning and to the choice of kernel function; the original classifier is only suitable for binary classification problems.

Applicable data types: numeric and nominal data.

Category: Classification algorithms.

Applicable scenario: solving binary classification problems.

Briefly speaking, SVM is a binary classification model. Its basic form is the linear classifier with the largest margin in feature space; that is, the learning strategy of a support vector machine is margin maximization, which can ultimately be transformed into solving a convex quadratic programming problem. Put simply, it finds a reasonable hyperplane in a high-dimensional space that separates the data points; for nonlinear data this involves mapping the data into a higher-dimensional space so that it becomes linearly separable.

Support Vector Concepts:

The sample figure referred to here (omitted) shows a special two-dimensional case; of course, real problems may have many dimensions. Start with a simple low-dimensional intuition of what a support vector is. Three lines can be seen, and the red line in the middle is equidistant from the other two. The red line is the hyperplane that SVM finds in the two-dimensional case to separate the two classes of data, and the points that the other two lines pass through are the so-called support vectors. As you can see, there are no samples between the hyperplane and the other two lines. Once this hyperplane is found, we use its mathematical representation to split the sample data into two classes; this is the mechanism of SVM.

PS: several concepts from the book Machine Learning in Action:

(1) If a straight line (or, in higher dimensions, a hyperplane) can be found that separates the sample points, then this data set is linearly separable. The line (or hyperplane) that separates the data set is called the separating hyperplane. Data distributed on one side of the hyperplane belongs to one category, and data distributed on the other side belongs to the other category.

(2) The support vectors are the points closest to the separating hyperplane.

(3) SVM can be used for almost all classification problems. It is worth mentioning that the SVM itself is a binary classifier; applying SVM to multi-class problems requires some changes to the code.

SVM has many implementations, but this chapter focuses on only one of the most popular ones, the Sequential Minimal Optimization (SMO) algorithm.

The optimization problem it solves (the standard SVM dual, whose formula image is omitted here) is: maximize Σi αi − ½ Σi Σj αi αj yi yj ⟨xi, xj⟩ over the αi, subject to 0 ≤ αi ≤ C and Σi αi yi = 0.

The objective of the SMO algorithm is to find this set of alphas; once the alphas are obtained, it is easy to compute the weight vector w and obtain the separating hyperplane.

The SMO algorithm works by selecting two alphas per cycle and optimizing them jointly. Once a suitable pair of alphas is found, one is increased and the other decreased. "Suitable" here means the two alphas must satisfy certain conditions: one condition is that both alphas lie outside their margin boundary, and the other is that they have not already been processed or clamped at the boundary.

  The kernel function maps data from a low dimension to a high dimension:

SVM classifies data by finding a hyperplane, but when the data is not linearly separable, a kernel function is needed to map the data from the low-dimensional space into a higher-dimensional one where it becomes linearly separable; SVM theory can then be applied.

Example: consider a two-dimensional data distribution that is not linearly separable (its equation, shown in the original figure, is omitted here). After the kernel function maps it into a higher dimension, the corresponding equation (also omitted) describes data that has become linearly separable, so SVM theory can be applied. A classic instance is data separated by a circle x1^2 + x2^2 = r^2, which becomes linearly separable once a feature such as z = x1^2 + x2^2 is added.

Summary: the support vector machine is a classifier. It is called a "machine" because it produces a binary decision result; it is a decision "machine". Kernel methods, or the kernel trick, map data (sometimes nonlinear data) from a low-dimensional space to a high-dimensional space, converting a nonlinear problem in the low-dimensional space into a linear problem in the high-dimensional space that can then be solved.
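
A minimal usage sketch with scikit-learn's SVC, illustrating the kernel idea: an RBF kernel implicitly maps the data to a higher-dimensional space where a separating hyperplane exists. The concentric-circles toy data set is an illustrative assumption.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel="linear").fit(X, y)   # struggles: the data is not linearly separable
rbf_clf = SVC(kernel="rbf", C=1.0).fit(X, y)  # the kernel trick handles the nonlinearity

print("linear kernel accuracy:", linear_clf.score(X, y))
print("rbf kernel accuracy:", rbf_clf.score(X, y))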

2.3 Decision Tree:

Advantages: the computational complexity is not high, the output is easy to understand, it is insensitive to missing values, and it can handle irrelevant features.

Disadvantage: it may overfit the training data.

Applicable data type: numeric and nominal type.

Algorithm type: Classification algorithm.

Data requirements: the tree structure applies only to nominal data, so numeric data must be discretized first.

Description: the first question to solve when constructing a decision tree is which feature of the current data set plays the decisive role in classifying the data. To find the decisive feature and obtain the best split, every feature must be evaluated. After the test, the original data is divided into subsets that are distributed over all branches of the first decision point. If the data under a branch all belongs to the same class, that subset needs no further splitting; otherwise, splitting continues.

The pseudo-code for creating a branch (createBranch) is as follows:

Check whether every item in the data set belongs to the same class:
    If so, return the class label
    Else
        find the best feature to split the data set
        split the data set
        create a branch node
        for each subset of the split
            call createBranch and add the returned result to the branch node
        return the branch node

Before we can evaluate which way of splitting the data is best, we must learn how to calculate the information gain. The measure of information of a set is called Shannon entropy, or simply entropy. In information theory, entropy is defined as the expected value of the information.

The formula for the entropy of information is:

H (information entropy) = -∑ p(xi) * log2 p(xi). PS: here p(xi) denotes the probability of choosing class xi.
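
A minimal sketch of Shannon entropy for a list of class labels, matching the formula above; the toy label lists are illustrative.

import math
from collections import Counter

def shannon_entropy(labels):
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy(["yes", "yes", "no", "no", "no"]))  # mixed set: higher entropy
print(shannon_entropy(["yes", "yes", "yes", "yes"]))      # pure set: entropy is 0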

The steps to build a decision tree are outlined below:

(1) Given the training data, split the data set along each dimension and, following the principle of maximum information gain (the greatest reduction in entropy), find the most decisive dimension.

(2) When all the data under a branch belongs to the same class, stop splitting that branch and return the class label; otherwise repeat step (1) on this branch.

(3) The class labels and splits are assembled, in turn, into a decision tree.

(4) After constructing the decision tree based on the training data, we can use it to classify the actual data.
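
A minimal ID3-style sketch of the steps above in plain Python: pick the feature with the greatest information gain, split on it, and recurse until each subset is pure or no features remain. The tiny data structures (lists of feature values plus a parallel label list) are illustrative assumptions.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def split(rows, labels, feat, value):
    # Keep only rows where feature `feat` equals `value`, and drop that feature column.
    sub_rows, sub_labels = [], []
    for row, label in zip(rows, labels):
        if row[feat] == value:
            sub_rows.append(row[:feat] + row[feat + 1:])
            sub_labels.append(label)
    return sub_rows, sub_labels

def best_feature(rows, labels):
    base, best_gain, best_feat = entropy(labels), -1.0, 0
    for feat in range(len(rows[0])):
        new_entropy = 0.0
        for value in set(row[feat] for row in rows):
            _, sub_labels = split(rows, labels, feat, value)
            new_entropy += len(sub_labels) / len(labels) * entropy(sub_labels)
        gain = base - new_entropy                      # information gain of this split
        if gain > best_gain:
            best_gain, best_feat = gain, feat
    return best_feat

def build_tree(rows, labels, feature_names):
    if labels.count(labels[0]) == len(labels):         # all samples share one class: leaf
        return labels[0]
    if len(rows[0]) == 0:                              # no features left: majority vote
        return Counter(labels).most_common(1)[0][0]
    feat = best_feature(rows, labels)
    name = feature_names[feat]
    remaining = feature_names[:feat] + feature_names[feat + 1:]
    tree = {name: {}}
    for value in set(row[feat] for row in rows):
        sub_rows, sub_labels = split(rows, labels, feat, value)
        tree[name][value] = build_tree(sub_rows, sub_labels, remaining)
    return tree

rows = [["sunny", "hot"], ["sunny", "mild"], ["rainy", "hot"], ["rainy", "mild"]]
labels = ["no", "yes", "no", "no"]
print(build_tree(rows, labels, ["outlook", "temperature"]))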

PS: of course, there is more than one algorithm for generating decision trees; other algorithms, such as C4.5 and CART, can also be used.

Summary:

A decision tree classifier is like a flowchart with terminating blocks, where the terminating blocks represent the classification results. When we start processing a data set, we first measure the inconsistency of the data in the collection, that is, its entropy, and then find the optimal scheme to split the data set, repeating the process until all the data in each subset belongs to the same class.

2.4 Naive Bayes:

Pros: still effective when there is little data, and it can handle multi-class problems.

Cons: Sensitive to the way the input data is prepared.

Applicable data type: nominal type data.

Algorithm type: Classification algorithm

Brief introduction: naive Bayes is part of Bayesian decision theory, whose core idea is to choose the decision with the highest probability. Naive Bayes is called "naive" because it makes two simplifying assumptions on top of Bayes' theorem:

(1) Each feature is independent of each other.

(2) Each feature is equally important.

Bayes' rule is built on conditional probability. PS: P(H|X) is the probability that a sample belongs to class H given the observed feature values X, called the posterior probability. P(H) is the probability of class H before any data is seen, called the prior probability. P(X|H) is the probability of observing X within class H (the class-conditional likelihood), and P(X) is the probability of observing X in the data set. Bayes' rule combines these quantities, so the posterior probability is inseparable from the prior probability and the likelihood of the observed sample.

Summary: for classification, using probabilities can be more effective than using hard rules. Bayesian probability and Bayes' rule provide an effective way to estimate unknown probabilities from known values. Assuming conditional independence between features reduces the amount of data required. Although the conditional-independence assumption is not strictly correct, naive Bayes is still an effective classifier.
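
A minimal sketch of naive Bayes on a toy text-classification task, assuming scikit-learn; the tiny corpus and its spam/non-spam labels are purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap meds buy now", "meeting schedule tomorrow",
        "buy cheap now", "project meeting notes"]
labels = [1, 0, 1, 0]            # 1 = spam, 0 = not spam (assumed labels)

vec = CountVectorizer()
X = vec.fit_transform(docs)       # word counts; each word is treated as an independent feature

clf = MultinomialNB()             # combines class priors P(H) with per-class word likelihoods P(X|H)
clf.fit(X, labels)

print(clf.predict(vec.transform(["cheap meds now"])))   # the most probable class wins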

2.5 k-Nearest neighbor algorithm (KNN):

Advantages: high accuracy, insensitive to outliers, and no assumptions about the input data.

Disadvantages: high computational complexity and high space complexity.

Applicable data range: Numerical and nominal type.

Algorithm type: Classification algorithm.

Description: the principle of the algorithm is as follows. There is a collection of sample data, also known as the training sample set, and every record in it carries a label; that is, we know which class each sample in the set belongs to. When new, unlabeled data is entered, each feature of the new data is compared with the features of the samples in the training set, and the algorithm extracts the class labels of the most similar (nearest-neighbor) samples. In general we consider only the k most similar samples, which is where the k in k-nearest neighbors comes from; k is usually an integer no greater than 20. Finally, the class that occurs most frequently among the k most similar samples is chosen as the class of the new data.
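
A minimal kNN sketch in NumPy following the description above; the toy training data and the choice k=3 are illustrative assumptions.

import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)     # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]                 # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)    # majority vote among their labels
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = ["A", "A", "B", "B"]
print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, k=3))   # -> "B"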

2.6 Linear regression (Linear Regression):

Advantages: The results are easy to understand and the calculation is not complex.

Disadvantage: it does not fit nonlinear data well.

Applicable data types: numeric and nominal data.

Algorithm type: Regression algorithm.

PS: regression differs from classification in that the target variable is a continuous numeric value.

Summary: in statistics, linear regression is a regression analysis that models the relationship between one or more independent variables and a dependent variable using a least-squares linear function called the linear regression equation. This function is a linear combination of one or more model parameters called regression coefficients (each independent variable appears only to the first power). The case of a single independent variable is called simple regression; more than one independent variable is called multiple regression.

The model function of linear regression, written in vector form, is y = w^T * x (the original formula image is omitted here).

The optimal coefficient vector is found from the training data set, which amounts to solving for the model parameters. To solve for the coefficients, the loss function can be minimized with either the "least squares" method or the "gradient descent" algorithm, as sketched below:
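
A minimal sketch of ordinary least squares in NumPy: solve the normal equations w = (X^T X)^(-1) X^T y for the regression coefficients. The synthetic data are an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=100)   # true model: y = 3 + 2x + noise

X = np.column_stack([np.ones_like(x), x])             # add an intercept column
w = np.linalg.solve(X.T @ X, X.T @ y)                 # least-squares solution of the normal equations
print(w)                                              # roughly [3, 2]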

Additional: Ridge regression (ridge regression):

Ridge regression is a biased-estimation regression method dedicated to the analysis of collinear data; it is essentially an improved least-squares estimation. By giving up the unbiasedness of least squares, at the cost of losing some information and reducing accuracy, it obtains regression coefficients that are more realistic and more reliable, and its tolerance of ill-conditioned data is much stronger than that of ordinary least squares.

Ridge regression is a statistical method that fundamentally mitigates the effects of multicollinearity. By introducing a very small ridge parameter k (0 < k < 1) and adding it to the main diagonal of the correlation matrix, ridge regression reduces the influence of collinear feature vectors on the least-squares estimate of the parameters and shrinks the estimated coefficients of the collinear variables, so that the parameter estimates are closer to the true situation. Ridge regression brings all the variables into the model, so it can provide more information than stepwise regression.
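
A minimal ridge regression sketch in NumPy: add a small ridge term k*I to X^T X before solving, which trades a little bias for much more stable coefficients when the features are nearly collinear. The value k=0.1 and the synthetic collinear data are illustrative assumptions.

import numpy as np

def ridge_fit(X, y, k=0.1):
    n_features = X.shape[1]
    # Ridge solution: w = (X^T X + k*I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + k * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)     # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=100)

print(ridge_fit(X, y, k=0.1))                  # coefficients stay moderate despite collinearity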

Summary: like classification, regression is the process of predicting target values. The difference between regression and classification is that the former predicts continuous variables while the latter predicts discrete variables. Regression is one of the most powerful tools in statistics. In a regression equation, obtaining the optimal regression coefficients for the features amounts to minimizing the sum of squared errors.

2.7 Tree regression:

Pros: You can model complex and non-linear data.

Cons: The results are difficult to understand.

Applicable data types: numeric and nominal data.

Algorithm type: Regression algorithm.

Description: linear regression methods cannot effectively fit all sample points (except for locally weighted linear regression). When the data has many features and the relationships among the features are complex, it is difficult to build a global regression model. In addition, many problems in practice are nonlinear, for example common piecewise functions, which cannot be fitted by a global linear model. Tree regression splits the data set into multiple pieces that are easy to model and then uses linear regression to model and fit each piece. The classic tree-regression algorithm is CART (Classification And Regression Trees).

A more detailed description of the CART algorithm can be found in this article: Http://box.cloud.taobao.com/file/downloadFile.htm?shareLink=1GIQrknG (to be honest, I only understand it roughly; anyone who understands it more thoroughly is welcome to share).
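
A minimal sketch using scikit-learn's DecisionTreeRegressor (an implementation of CART) to fit nonlinear data by splitting the input range into pieces; the sine-shaped toy data and max_depth=4 are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)   # a nonlinear target

# The tree partitions the input space and fits a simple model (a constant) in each piece.
tree = DecisionTreeRegressor(max_depth=4)
tree.fit(X, y)
print(tree.predict([[1.0], [2.5], [4.0]]))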

2.8 K-means (k-means clustering algorithm):

Pros: Easy to implement.

Disadvantage: it may converge to a local minimum, and convergence is slow on large data sets.

Applicable data type: numeric data.

Algorithm type: Clustering algorithm.

PS: unlike the classification and regression algorithms above, k-means is an unsupervised learning algorithm. The target variable that is given in advance in classification and regression does not exist here. Rather than the earlier question "predict the target variable Y from data X", the question an unsupervised learning algorithm answers is: "what can be found in the data X?", for example "what are the best 6 clusters that make up X?" or "which three features of X occur together most frequently?".

Basic Steps for K-means:

(1) Randomly choose k data objects as the initial centroids. Then assign every point in the data set to a cluster; specifically, each point finds the centroid nearest to it and is assigned to that centroid's cluster.

(2) Compute the mean of the sample points in each cluster, update the cluster's centroid with that mean, and then reassign the points to clusters.

(3) Repeat step (2) iteratively; when the cluster assignments no longer change, or the error of the evaluation function falls within the estimated range, stop iterating.

The time complexity of the algorithm has an upper bound of O(nkt), where t is the number of iterations.

PS: the choice of the initial k centroids and the distance formula both affect the overall performance of the algorithm.
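
A minimal k-means sketch in NumPy following the steps above; the synthetic blobs, k=3, and the iteration cap are illustrative assumptions.

import numpy as np

def kmeans(X, k=3, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # step (1): random initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Step (1): assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step (2): move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # step (3): converged, stop iterating
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
centroids, labels = kmeans(X, k=3)
print(centroids)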

Additional:

Bisecting k-means algorithm: to overcome the problem of the k-means algorithm converging to a local minimum, another algorithm called bisecting k-means was proposed. The algorithm first treats all the points as one cluster and then splits that cluster in two. It then selects one of the existing clusters to split again; the choice of which cluster to split depends on which split most reduces the SSE (sum of squared errors) of the resulting clusters.

2.9 Association analysis:

First, a few concepts:

Frequent itemsets (frequent item sets): collections of items that frequently appear together in the data.

Association rules (association rules): imply that there may be a strong relationship between two items.

Support of an itemset: the proportion of records in the data set that contain that itemset.

The goal of association analysis includes two things: discovering frequent itemsets and discovering association rules. The frequent itemsets must be found first before the association rules can be obtained.

Apriori algorithm:

Advantages: easy to implement in code.

Disadvantage: may be slower on large datasets.

Applicable data type: Numeric or nominal type data.

Principle: if an itemset is frequent, then all of its subsets are also frequent.

For a usage demo of Apriori, see this blog post: http://blog.csdn.net/lantian0802/article/details/38331463

Briefly:

The Apriori algorithm is a method for discovering frequent itemsets. Its two input parameters are the minimum support and the data set. The algorithm first generates the list of all single-item itemsets. It then scans the data to compute each itemset's support and discards the itemsets below the minimum support, combines the surviving items pairwise into larger itemsets, recomputes the support of the combined list, and compares it with the minimum support again. This process repeats until no further itemsets survive.
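
A minimal Apriori-style sketch in plain Python following the description above: start from single-item sets, keep only those meeting the minimum support, join the survivors into larger candidates, and repeat. The toy transactions and the 0.5 support threshold are illustrative assumptions.

from itertools import combinations

def apriori(transactions, min_support=0.5):
    transactions = [set(t) for t in transactions]
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    frequent = {}
    current = [frozenset([i]) for i in items]          # candidate 1-item sets
    while current:
        survivors = {}
        for c in current:                              # keep candidates that clear the threshold
            s = support(c)
            if s >= min_support:
                survivors[c] = s
        frequent.update(survivors)
        keys = list(survivors)
        # Join surviving k-item sets into (k+1)-item candidates.
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1})
    return frequent

txns = [["bread", "milk"], ["bread", "beer", "eggs"],
        ["milk", "beer", "bread"], ["bread", "milk", "beer"]]
for itemset, sup in apriori(txns, 0.5).items():
    print(set(itemset), sup)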

Summary:

Association analysis is a toolset for discovering interesting relationships among items in a large data set, and these relationships can be quantified in two ways (support and confidence). Enumerating all combinations of items is a time-consuming task that inevitably requires a great deal of expensive computation, so a smarter approach is needed to find frequent itemsets within a reasonable time. One such approach is the Apriori algorithm, which uses the Apriori principle to reduce the number of itemsets whose support must be checked against the database. The Apriori principle states that if an itemset is infrequent, then all supersets containing that itemset are also infrequent. The Apriori algorithm starts from single-item itemsets and forms larger sets by combining the itemsets that meet the minimum support requirement. Support is used to measure how often a set appears in the original data.

2.10 FP-growth algorithm:

Description: FP-growth is also an algorithm for discovering frequent itemsets. It uses an FP-tree structure to store the items and performs much better than the Apriori algorithm, typically by about two orders of magnitude. The process of discovering frequent itemsets is as follows: (1) build the FP-tree; (2) mine frequent itemsets from the FP-tree.

Advantages: generally faster than Apriori.

Cons: it is more difficult to implement, and performance degrades on some data sets.

Applicable data type: nominal type data.

Summary: the FP-growth algorithm is an effective method for discovering frequent patterns in a data set. It leverages the Apriori principle and executes faster. The Apriori algorithm generates candidate itemsets and then scans the data set to check whether they are frequent; FP-growth executes faster because it scans the data set only twice. In the FP-growth algorithm, the data set is stored in a structure called an FP-tree. After the FP-tree is built, frequent itemsets can be found by mining the conditional bases of each item and building conditional FP-trees. The process is repeated with more items added as conditions, until the conditional FP-tree contains only a single item.
