Summary of basic concepts of machine learning algorithms

Source: Internet
Author: User
Tags: svm
1. Basic concepts:

(1) Cross-validation: usually 10-fold cross-validation, a common method for estimating the accuracy of an algorithm. The dataset is divided into 10 parts; 9 parts are used in turn as training data and the remaining part as test data, and the accuracy (or error rate) is recorded for each run. The average accuracy (or error rate) over the 10 runs is used as an estimate of the algorithm's accuracy. In general, 10-fold cross-validation is repeated several times with different random splits to obtain a more reliable estimate.
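As a minimal sketch (the iris dataset, the logistic regression model, and scikit-learn itself are assumptions for illustration, not part of the original text), 10-fold cross-validation can be run as follows:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)            # example dataset (assumption)
model = LogisticRegression(max_iter=1000)    # any classifier could be evaluated here
# cv=10 splits the data into 10 folds: train on 9 folds, test on the remaining one, 10 times
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean(), scores.std())           # average accuracy and its spread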
(2) Maximum likelihood estimation: maximum likelihood estimation is an application of probability theory in statistics and is one of the standard parameter estimation methods. We know that a random sample follows a certain probability distribution, but the specific parameters are unknown; we run several experiments, observe the results, and use those results to infer approximate values of the parameters. The idea behind maximum likelihood estimation is that the chosen parameter should make the observed sample as probable as possible: since we observed this sample rather than some low-probability alternative, we simply take the parameter that maximizes the probability of the sample as the estimate of the true value. For example, if a coin lands heads 7 times in 10 tosses, the maximum likelihood estimate of the head probability is 0.7, because that value maximizes the probability of the observed outcome.

(3) In information theory, entropy is a measure of uncertainty. In his paper "A Mathematical Theory of Communication", Claude Shannon, the founder of information theory, proposed a measure of information based on a probabilistic model. He described information as "that which removes uncertainty", and entropy is defined as the expected value of information.
PS: entropy also refers to the degree of disorder in a system. It has important applications in control theory, probability theory, number theory, astrophysics, life science, and other fields, and more specialized definitions have been derived in different disciplines; it is a very important quantity in each of them. The concept of entropy was proposed by Rudolf Clausius and applied in thermodynamics; later, Claude Elwood Shannon introduced entropy into information theory for the first time.
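As a concrete illustration (added here, not part of the original text): for a discrete random variable X with probabilities p(x_i), the entropy is H(X) = -Σ p(x_i) * log2(p(x_i)). A fair coin has H = -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1 bit, while a coin that always lands heads has H = 0, reflecting zero uncertainty.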

(4) Posterior probability is one of the basic concepts of information theory. In a communication system, the probability that a particular message was sent, computed after a message has been received, is called the posterior probability. More generally, the posterior probability is the probability re-estimated after the "result" information has been obtained, as in Bayes' formula; it answers a "given the result, find the cause" type of question. The posterior probability is inseparable from the prior probability: the posterior is computed on the basis of the prior. In fact, the posterior probability is simply a conditional probability.
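A small numeric illustration (the numbers are hypothetical, not from the original text): suppose messages A and B are sent with prior probabilities 0.7 and 0.3, and a received signal r has likelihoods P(r | A) = 0.2 and P(r | B) = 0.9. Then the posterior probability P(A | r) = 0.7*0.2 / (0.7*0.2 + 0.3*0.9) = 0.14 / 0.41 ≈ 0.34, so B is the more probable source even though A has the larger prior.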

(5) PCA:
Advantage: reduces data complexity and identifies the most important features.
Disadvantage: not always necessary, and useful information may be lost.
Applicable type: numeric data.
Technical Type: Dimensionality Reduction Technology.

Description: in PCA, the data is transformed from the original coordinate system to a new coordinate system, and the choice of the new coordinate system is determined by the data itself. The first new axis is chosen along the direction of greatest variance in the original data; the second new axis is chosen along the direction of greatest variance that is orthogonal to the first axis. This process is repeated, and the number of repetitions equals the number of features in the raw data. Most of the variance is contained in the first few new axes, so the remaining axes can be ignored, that is, the dimensionality of the data is reduced. Besides PCA, other dimensionality reduction techniques include ICA (Independent Component Analysis) and factor analysis.
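A minimal PCA sketch with NumPy, following the steps above (the synthetic data and the choice of keeping 2 components are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 samples, 5 features (synthetic)

X_centered = X - X.mean(axis=0)               # center the data
cov = np.cov(X_centered, rowvar=False)        # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)        # eigen-decomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]             # sort axes by explained variance
top2 = eigvecs[:, order[:2]]                  # keep the 2 directions with most variance
X_reduced = X_centered @ top2                 # project: 5-D data becomes 2-D
print(X_reduced.shape)                        # (100, 2)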

(6) Combining different classifiers: the combined result is called an ensemble method (or meta-algorithm).

(7) Regression algorithms are similar to classification algorithms, but where a classification algorithm outputs a nominal category value, a regression algorithm predicts a continuous value. That is, regression predicts concrete numeric values, whereas classification can only predict categories.

(8) SVD (Singular Value Decomposition):
Advantage: simplifies data, removes noise, and improves algorithm results.
Disadvantage: Data conversion may be hard to understand.
Applicable data type: numeric data.
PS: SVD is a type of matrix decomposition.
Summary: SVD is a powerful dimensionality reduction tool. We can use SVD to approximate a matrix and extract its important features. By retaining 80%–90% of the matrix's energy, we obtain the important features and remove noise. SVD has been applied in many settings; one successful application is the recommendation engine, which recommends items to users. Collaborative filtering is a recommendation method based on user preference and behavior data, and its core is similarity computation: many similarity measures can be used to compute the similarity between items or between users. By computing similarity in a low-dimensional space, SVD improves the performance of a recommendation engine.
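A small NumPy sketch of truncating an SVD by retained energy (the toy rating matrix and the 90% threshold are illustrative assumptions):

import numpy as np

A = np.array([[4.0, 4.0, 0.0, 0.0],
              [3.0, 5.0, 0.0, 1.0],
              [0.0, 0.0, 5.0, 4.0],
              [0.0, 1.0, 4.0, 5.0]])          # toy user-item rating matrix (assumption)

U, sigma, Vt = np.linalg.svd(A)               # full SVD: A = U * diag(sigma) * Vt
energy = np.cumsum(sigma**2) / np.sum(sigma**2)
k = int(np.searchsorted(energy, 0.90)) + 1    # smallest k keeping >= 90% of the energy
A_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
print(k, np.round(A_approx, 2))               # low-rank approximation of A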

(9) Collinearity: in a linear regression model, the existence of exact or high correlation between the explanatory variables, which distorts the model estimates or makes them difficult to estimate accurately.

2. Basic Algorithms

2.1 logistic regression:
Advantages: low computing cost, easy to understand and implement.
Disadvantage: prone to underfitting, and the classification accuracy may not be high.
Applicable data types: numeric and nominal data.
Category: classification algorithm.
Applicable scenario: solving binary classification problems.

Description: the logistic regression algorithm is based on the sigmoid function; in other words, the sigmoid function is the logistic regression function. The sigmoid function is defined as sigmoid(z) = 1/(1 + exp(-z)). Its value range is (0, 1), so it can be used as a classifier.
[Figure from the original article: the S-shaped curve of the sigmoid function, with values between 0 and 1.]

The logistic regression model can be decomposed as follows:
(1) First, the attribute values of the different dimensions are multiplied by the corresponding weights and summed. The formula is: z = w0 + w1*x1 + w2*x2 + ... + wM*xM, where (x1, x2, ..., xM) are the features of a sample and M is the feature dimension.
PS: this step is simply linear regression. The weights w are the values that must be learned during training. To solve for the weight vector w, maximum likelihood estimation is used, and the likelihood function is handed to an optimization algorithm; the most common choice is the gradient ascent algorithm.
As noted above, although logistic regression as a whole is a non-linear function, once the sigmoid mapping is stripped away, the remaining steps are exactly the same as linear regression.
(2) Then the linear target value z is substituted into the sigmoid function, which yields values in (0, 0.5) or (0.5, 1) (how to treat a value exactly equal to 0.5 can be decided by convention). In this way the data is split into two classes, which is precisely the idea of binary classification.

Conclusion: the goal of logistic regression is to find the best-fitting parameters of the non-linear sigmoid function. The parameters can be found by an optimization algorithm; the most common is gradient ascent, which can be further simplified to stochastic gradient ascent.
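A minimal gradient-ascent sketch of logistic regression in NumPy (the synthetic data, learning rate, and iteration count are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                       # 200 samples, 2 features (synthetic)
y = (X[:, 0] + X[:, 1] > 0).astype(float)           # labels from a known linear rule
Xb = np.hstack([np.ones((200, 1)), X])              # prepend a column of 1s for w0

w = np.zeros(3)                                     # weights w0, w1, w2
alpha = 0.1                                         # learning rate (assumption)
for _ in range(500):                                # gradient ascent on the log-likelihood
    error = y - sigmoid(Xb @ w)                     # difference between labels and predictions
    w += alpha * Xb.T @ error / len(y)              # move weights in the gradient direction

pred = (sigmoid(Xb @ w) > 0.5).astype(float)        # classify: > 0.5 -> class 1
print(w, (pred == y).mean())                        # learned weights and training accuracy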

2.2 SVM (Support Vector Machines):
Advantage: Low generalization error rate, low computing overhead, and easy interpretation of results.
Disadvantage: sensitive to parameter tuning and to the choice of kernel function; without modification, the original classifier only handles binary classification.
Applicable data types: numeric and nominal data.
Category: classification algorithm.
Applicable scenario: solving binary classification problems.

Summary: in layman's terms, SVM is a binary classification model. Its basic form is the linear classifier with the largest margin in the feature space; that is, the learning strategy of SVM is margin maximization, which can ultimately be converted into solving a convex quadratic programming problem. Put simply, SVM looks for a reasonable hyperplane that separates the data points, which may involve mapping non-linear data into a higher-dimensional space so that the data becomes linearly separable.

Support vector concept:

The figure referenced in the original article shows a special two-dimensional case (real problems may be higher-dimensional): three lines, where the red line in the middle is equidistant from the other two. Starting from this low-dimensional picture, the middle line is the hyperplane that SVM looks for in the two-dimensional case; it is used to split the data into two classes. The points lying on the two outer lines are the so-called support vectors. Note that there are no samples between the hyperplane and the two outer lines. Once this hyperplane is found, its mathematical representation is used to perform binary classification of the sample data; that is the mechanism of SVM.
PS: the book Machine Learning in Action gives the following concepts:
(1) If a straight line (or, in higher dimensions, a hyperplane) can be found that separates the sample points, the data is said to be linearly separable. The line (or hyperplane) that separates the dataset is called the separating hyperplane. Data on one side of the hyperplane belongs to one class, and data on the other side belongs to the other class.
(2) The support vectors are the points closest to the separating hyperplane.
(3) SVM can be applied to almost any classification problem. It is worth noting that SVM itself is a binary classifier; applying SVM to multi-class problems requires some modifications to the code.

Formula:
SVM has many implementations, but this section focuses on one of the most popular: the Sequential Minimal Optimization (SMO) algorithm.
[The optimization formula is shown as an image in the original article and is not included here.]
The goal of the SMO algorithm is to find a set of alpha values. Once the alphas are obtained, it is easy to compute the weight vector w and obtain the separating hyperplane.
The SMO algorithm works as follows: in each loop, two alphas are selected for optimization. Once a suitable pair of alphas is found, one of them is increased and the other decreased. Here "suitable" means the pair must satisfy certain conditions: first, the two alphas must lie outside their margin boundary; second, they must not already be clamped or bounded.

The kernel function maps data from a low-dimensional space to a high-dimensional space:
SVM classifies data by finding a hyperplane. However, when the data is not linearly separable, a kernel function is used to map the data from a low-dimensional space to a higher-dimensional space so that it becomes linearly separable, and SVM theory can then be applied.
Example:
A two-dimensional data distribution that is not linearly separable (the original article shows its plot and equation, which are not included here) becomes, after the kernel function maps it to a higher dimension, a distribution described by a new equation that is linearly separable. In this way the mapped data is linearly separable, and SVM theory can be applied.
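To make the idea concrete, here is a small sketch (the circular dataset and the explicit squaring map are illustrative assumptions standing in for the missing figures, not the original article's exact example):

import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1.5, 1.5, size=(300, 2))        # 2-D points (synthetic)
y = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(int)  # class 1 inside the unit circle
# In the original 2-D space, no straight line separates the two classes.
# Map each point to a higher dimension: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2).
Z = np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])
# In the new space the class boundary z1 + z2 = 1 is a plane, so the data is
# linearly separable there.

A degree-2 polynomial kernel performs essentially this mapping implicitly, without ever computing the new coordinates explicitly.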

Summary: SVM is a classifier. It is called a "machine" because it produces a binary decision result, i.e., it is a "decision machine". The kernel method, or kernel trick, maps data (sometimes non-linear data) from a low-dimensional space to a high-dimensional space, so that a non-linear problem in the low-dimensional space can be converted into a linear problem in the high-dimensional space.

2.3 Decision Tree
Advantages: low computational complexity, output results that are easy to understand, insensitivity to missing values, and the ability to handle irrelevant features.
Disadvantage: prone to overfitting.
Applicable data types: numeric and nominal.
Algorithm type: classification algorithm.
Data requirements: tree construction only works with nominal data, so numeric data must be discretized.

Summary: when constructing a decision tree, the first problem to solve is which feature of the current dataset plays the decisive role in classifying the data. To find that decisive feature and obtain the best split, every feature must be evaluated. After the split, the raw data is divided into several subsets, which are distributed across all the branches of the first decision node. If the data under a branch all belongs to the same class, that branch needs no further splitting; otherwise it must be split again.

The pseudocode for creating a branch is as follows:

Check whether every item in the dataset belongs to the same class:
    If so, return the class label.
    Else:
        Find the best feature for splitting the dataset.
        Split the dataset into subsets.
        Create a branch node.
        For each subset, call createBranch recursively and add the result to the branch node.
    Return the branch node.

Before deciding which way of splitting the data is best, we must learn how to compute the information gain. The measure of information in a set is called Shannon entropy, or entropy for short. In information theory, entropy is defined as the expected value of information.
The calculation formula of information entropy is as follows:
H (information entropy) = -Σ p(x_i) * log2(p(x_i))    PS: p(x_i) is the probability of choosing class i.
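A short Python sketch of this entropy calculation over a set of class labels (the example labels and the function name are illustrative assumptions):

from collections import Counter
from math import log2

def shannon_entropy(labels):
    """Entropy H = -sum(p_i * log2(p_i)) over the class frequencies in labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(shannon_entropy(['yes', 'yes', 'no']))   # mixed classes: about 0.918 bits
print(shannon_entropy(['yes', 'yes', 'yes']))  # a pure subset has entropy 0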

The following describes how to generate a decision tree:
(1) Based on the given training data, split the dataset along each feature according to the entropy principle (information gain) and find the most decisive feature.
(2) When all the data under a branch belongs to the same class, stop splitting and return the class label; otherwise repeat step (1) on that branch.
(3) The class labels obtained in this way are assembled into a decision tree.
(4) Once the decision tree has been built from the training data, it can be applied to classify real data.
PS: of course, there is more than one algorithm for generating decision trees; other methods include C4.5 and CART.

Summary:
A decision tree classifier is like a flowchart whose terminating blocks represent the classification results. When processing a dataset, we first measure the inconsistency of the data in the dataset, i.e., its entropy, then find the optimal split, and repeat until all the data in each subset belongs to the same class.

2.4 Naive Bayes:
Advantage: still effective when the amount of data is small; can handle multi-class problems.
Disadvantage: sensitive to input data preparation methods.
Applicable data type: nominal data.
Algorithm type: Classification Algorithm

Summary: Naive Bayes is part of Bayesian decision theory, whose core idea is to choose the decision with the highest probability. Naive Bayes is called "naive" because it makes two assumptions on top of Bayes' theorem:
(1) Each feature is independent of each other.
(2) Each feature is equally important.
Bayes' rule is based on conditional probability. Its formula is as follows:
P(H | x) = P(x | H) * P(H) / P(x)
PS: P(H | x) is the probability that the class is H given the observed value x; it is called the posterior probability. P(H) is the probability that a sample belongs to class H, called the prior probability. P(x | H) is the probability of observing x within class H (the likelihood), and P(x) is the probability of observing x in the whole dataset. Bayes' rule is thus built on conditional probability and ties together the prior and the posterior.
Conclusion: for classification, using probabilities is more effective than using hard rules. Bayesian probability and Bayes' rule provide an effective way to estimate unknown probabilities from known values. The assumption of conditional independence between features can be used to reduce the amount of data required. Although the conditional independence assumption is not strictly correct, Naive Bayes is still an effective classifier.
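A minimal usage sketch with scikit-learn's Naive Bayes (the toy word-count data and the choice of MultinomialNB are assumptions for illustration):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy document-word-count matrix: each row is a document, each column a word count.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 1],
              [0, 0, 3, 1],
              [0, 1, 2, 2]])
y = np.array([0, 0, 1, 1])                            # two document classes

model = MultinomialNB()                               # treats features as conditionally independent
model.fit(X, y)
print(model.predict(np.array([[1, 1, 0, 0]])))        # classify a new document
print(model.predict_proba(np.array([[0, 0, 1, 2]])))  # posterior class probabilities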

2.5 K-Nearest Neighbor Algorithm (KNN):
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: high computational complexity and high space complexity.
Applicable data range: numeric and nominal.
Algorithm type: classification algorithm.

Summary: the algorithm starts from a sample dataset, also called the training sample set, in which every data point carries a label; that is, we know which class each point in the sample set belongs to. When new, unlabeled data arrives, each of its features is compared with the corresponding features of the points in the sample set, and the class labels of the most similar (nearest-neighbor) points are extracted. Usually only the K most similar points in the sample set are considered, which is where the K in "K-Nearest Neighbors" comes from; K is typically an integer no greater than 20. Finally, the class that appears most often among these K most similar points is taken as the class of the new data.
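A compact NumPy sketch of the KNN rule described above (the synthetic data and K = 5 are assumptions):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Return the majority label among the k training points closest to x_new."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to each sample
    nearest = np.argsort(distances)[:k]                   # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                      # most frequent label wins

rng = np.random.default_rng(3)
X_train = rng.normal(size=(100, 2))
y_train = (X_train[:, 0] > 0).astype(int)                 # labels from a known rule (synthetic)
print(knn_predict(X_train, y_train, np.array([0.8, -0.2])))  # most neighbors have x1 > 0, so likely 1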

2.6 Linear regression:
Advantage: the results are easy to understand and the computation is not complex.
Disadvantage: poor fitting of non-linear data.
Applicable data types: numeric and nominal data.
Algorithm type: regression algorithm.
PS: regression differs from classification in that its target variable is a continuous numeric value.

Summary: in statistics, linear regression is a form of regression analysis that models the relationship between one or more independent variables and a dependent variable using a least-squares function called the linear regression equation. This function is a linear combination of one or more model parameters, called regression coefficients (the independent variables all appear to the first power only). The case with a single independent variable is called simple regression; the case with more than one independent variable is called multiple regression.

The vector form of the linear model is h(x) = w^T * x, i.e., the weighted sum of the features.
The optimal coefficient vector w is found from the training dataset, that is, by solving for the model parameters. The coefficients can be obtained with optimization methods such as "least squares" or "gradient descent" applied to the squared-error loss function:
J(w) = Σ_i (h(x_i) - y_i)^2
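A minimal least-squares sketch in NumPy using the normal-equation solution (the synthetic data is an assumption for illustration):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))                         # 100 samples, 2 features (synthetic)
true_w = np.array([3.0, 2.0, -1.0])                   # intercept and two coefficients
Xb = np.hstack([np.ones((100, 1)), X])                # prepend 1s for the intercept term
y = Xb @ true_w + rng.normal(scale=0.1, size=100)     # noisy targets

w_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)          # solves (X^T X) w = X^T y
print(w_hat)                                          # close to [3, 2, -1]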

Additional: Ridge Regression:
Ridge regression is a biased-estimation regression method designed for the analysis of collinear data. In essence it is an improved least-squares estimation: by giving up the unbiasedness of ordinary least squares, at the cost of losing some information and reducing precision, it obtains regression coefficients that are more realistic and reliable, and its tolerance of ill-conditioned data is far stronger than that of ordinary least squares.
Ridge regression analysis is a statistical method that fundamentally removes the influence of multicollinearity. The ridge regression model introduces a small ridge parameter k (0 < k < 1) by adding it to the main diagonal elements of the correlation matrix, thereby reducing the influence of multicollinearity on the eigenvectors involved in least-squares estimation and shrinking the least-squares estimates of the coefficients of the collinear variables, so that the parameter estimates are closer to the real situation. Ridge regression analysis brings all variables into the model, so it can provide more information than stepwise regression analysis.
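A small ridge-regression sketch in NumPy (the penalty value lambda = 0.1 and the nearly collinear synthetic data are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
X = np.column_stack([np.ones(100), x1, x1 + rng.normal(scale=0.01, size=100)])  # nearly collinear columns
y = 2 * x1 + rng.normal(scale=0.1, size=100)

lam = 0.1                                              # ridge parameter (assumption)
I = np.eye(X.shape[1])
I[0, 0] = 0.0                                          # conventionally do not penalize the intercept
w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)  # (X^T X + lambda*I) w = X^T y
print(w_ridge)                                         # coefficients stay stable despite collinearity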

Conclusion: like classification, regression is the process of predicting a target value. The difference is that regression predicts continuous variables while classification predicts discrete variables. Regression is one of the most powerful tools in statistics. In the regression equation, finding the best regression coefficients for the features amounts to minimizing the sum of squared errors.

2.7 tree regression:
Advantage: you can model complex and non-linear data.
Disadvantage: The result is hard to understand.
Applicable data types: numeric and nominal data.
Algorithm type: regression algorithm.

Summary: linear regression methods fit a single model to all the sample points (locally weighted linear regression being an exception). When the data has many features and the relationships between features are very complex, building a single global regression model is difficult. Moreover, many practical problems are non-linear; for example, an ordinary piecewise function cannot be fitted by a global linear model. Tree regression splits the dataset into multiple pieces that are easy to model, and then uses linear regression to model and fit each piece. A typical tree regression algorithm is CART (Classification And Regression Trees).
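A quick sketch of fitting a piecewise, non-linear target with a regression tree via scikit-learn (the synthetic piecewise data and max_depth choice are assumptions; this shows CART-style splitting rather than the full model-tree variant):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)                   # 1-D feature (synthetic)
y = np.where(X[:, 0] < 3, 1.0, 4.0) + rng.normal(scale=0.1, size=200)   # piecewise target

tree = DecisionTreeRegressor(max_depth=2)        # shallow tree: a few piecewise segments
tree.fit(X, y)
print(tree.predict(np.array([[1.0], [5.0]])))    # roughly [1.0, 4.0]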

A detailed description of the CART algorithm can be found in this article: http://box.cloud.taobao.com/file/downloadFile.htm?Sharelink=1giqrkng

2.8 K-means (K-means algorithm):
Advantages: easy to implement.
Disadvantage: may converge to a local minimum; slow on large-scale datasets.
Applicable data type: numeric data.
Algorithm type: clustering algorithm.
PS: K-means differs from the classification and regression algorithms above: it is an unsupervised learning algorithm. The target variable used in classification and regression does not exist in advance. Instead of answering "predict variable Y from data variable X", an unsupervised learning algorithm answers the question "what can be found in data X?". Possible questions include "what are the best 6 clusters in X?" or "which three features of X occur together most frequently?".

Basic Steps for K-means:
(1) Randomly initialize K points as the initial cluster centroids. Then assign every point in the dataset to a cluster: each point finds the centroid closest to it and is assigned to that centroid's cluster.
(2) Compute the mean of the sample points in each cluster and update that cluster's centroid to the mean, then reassign the points to clusters.
(3) Iterate step (2) repeatedly. Stop iterating when the cluster assignments no longer change, or when the error is within the range accepted by the evaluation function.
The upper bound on the algorithm's time complexity is O(n*k*t), where t is the number of iterations.
PS: the initial choice of the K centroids and the quality of the distance formula both affect the overall performance of the algorithm.
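A compact NumPy sketch of the loop described above (the synthetic 2-D data, K = 3, and the fixed iteration count are assumptions):

import numpy as np

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ((0, 0), (3, 3), (0, 3))])

k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)]   # step (1): random initial centroids
for _ in range(20):                                        # step (3): iterate
    # assign each point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # step (2): move each centroid to the mean of its assigned points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(np.round(centroids, 2))                              # roughly the three true cluster centers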
Additional:
Bisecting K-means algorithm: to overcome the problem of K-means converging to a local minimum, the bisecting K-means algorithm was proposed. It first treats all points as one cluster and splits that cluster in two; it then chooses one of the existing clusters to split further, where the choice of which cluster to split depends on which division minimizes the total sum of squared errors (SSE).

2.9 Association analysis:
First, a few basic concepts:
Frequent item sets: sets of items that frequently appear together.
Association rules: suggest that a strong relationship may exist between two items.
Support: the proportion of records in the dataset that contain the item set.
Association analysis has two goals: discovering frequent item sets and discovering association rules. The frequent item sets must be found first; only then can association rules be derived.

Apriori algorithm:
Advantages: Easy coding.
Disadvantage: may be slow on large datasets.
Applicable data type: numeric or nominal data.
Principle: If an item set is frequent, all its subsets are also frequent.
See the blog: http://blog.csdn.net/lantian0802/article/details/38331463
Brief description:
The Apriori algorithm is a method for discovering frequent item sets. Its two inputs are the minimum support and the dataset. The algorithm first generates the list of all candidate item sets containing a single item. It then scans the transactions to compute the support of each candidate, discards those below the minimum support, combines the remaining items in pairs to form larger candidates, and again computes their support and compares it with the minimum support. This process is repeated until all item sets have been removed.
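A simplified Python sketch of the frequent-itemset part of Apriori (the toy transactions, the min_support value, and the function name are assumptions; the candidate-pruning step is kept minimal for brevity):

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support=0.5):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    transactions = [set(t) for t in transactions]

    def support_filter(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    # Start with single-item candidates.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = support_filter(items)
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        prev = list(frequent)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        frequent = support_filter(candidates)
        result.update(frequent)
        k += 1
    return result

data = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]   # toy transactions (assumption)
print(apriori_frequent_itemsets(data, min_support=0.5))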

Summary:
Association analysis is a toolset for discovering interesting relationships between elements in large datasets. These relationships can be quantified in two ways: support and confidence. Discovering all the different combinations of elements is very time-consuming and can require prohibitively expensive computation, so smarter methods are needed to find frequent item sets within a reasonable time. One such method is the Apriori algorithm, which uses the Apriori principle to reduce the number of sets that must be checked against the database. The Apriori principle states that if an element item set is infrequent, then its supersets are also infrequent. The Apriori algorithm starts from single-element item sets and builds larger sets by combining those that meet the minimum support requirement. Support measures how frequently a set appears in the raw data.

2.10 FP-growth algorithm:
Summary: FP-growth is another algorithm for discovering frequent item sets. It stores the elements in an FP-tree structure, and its performance is much better than that of other algorithms, generally by about two orders of magnitude. The process of discovering frequent item sets is: (1) build the FP-tree; (2) mine frequent item sets from the FP-tree.

Advantage: generally faster than Apriori.
Disadvantage: harder to implement, and performance degrades on some datasets.
Applicable data type: nominal data.

Conclusion: the FP-growth algorithm is an efficient method for discovering frequent patterns in a dataset; it works with the Apriori principle but executes much faster. The Apriori algorithm generates candidate item sets and then scans the dataset to check whether they are frequent, whereas FP-growth only scans the dataset twice, so it executes faster. In FP-growth, the dataset is stored in a structure called an FP-tree. After the FP-tree is built, frequent item sets can be found by mining the conditional bases of element items together with the FP-tree. This process is repeated, conditioning on more elements, until the FP-tree contains only a single element.
