Whenever machine learning comes up, people are often bewildered by the many algorithms and methods and feel there is no way to get started. In truth, while there are plenty of machine learning techniques, with the right path and method there is a clear way through. Here I recommend this blog post from SAS's Li Hui, which explains how to choose a machine learning algorithm.
In addition, Scikit-learn provides a clear road map for choosing an algorithm:
In fact, the basic algorithms of machine learning are quite simple. Let's look at some basic machine learning algorithms and their principles using 2D data and interactive graphics. (This is also a tribute to Bret Victor, whose Inventing on Principle deeply influenced me.)
All the code demos can be found in my Codepen collection.
First, the biggest split in machine learning is between supervised and unsupervised learning: simply put, learning from labeled data is supervised, and learning from unlabeled data is unsupervised. At the level of broad categories, dimensionality reduction and clustering fall under unsupervised learning, while regression and classification are supervised learning.
Unsupervised learning
If your data is not labeled, you can either have someone label it or use unsupervised learning.
First, consider whether you want to reduce the dimensionality of the data.
Dimensionality reduction
Dimensionality reduction, as the name suggests, turns high-dimensional data into low-dimensional data. Common dimensionality reduction methods include PCA, LDA, and SVD.
Principal component analysis PCA
The most classic dimensionality reduction method is principal component analysis (PCA), which finds the main components of the data and discards the unimportant ones.
Here we first generate 8 data points with the mouse, then draw a white line representing the principal component. This line is the one-dimensional principal component onto which the two-dimensional data is reduced, and the blue segments are the projections of the data points onto the new principal-component axis, i.e., the perpendiculars from each point to the line. Mathematically, principal component analysis can be seen as finding the white line such that the sum of the squared lengths of the blue projection segments is minimized.
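For those who prefer code to pictures, here is a minimal sketch of the same idea using scikit-learn; the eight points are made up for illustration, standing in for the ones drawn with the mouse:

```python
import numpy as np
from sklearn.decomposition import PCA

# Eight made-up 2-D points, standing in for the ones drawn with the mouse.
points = np.array([[1.0, 2.1], [1.5, 2.4], [2.2, 2.9], [3.1, 3.2],
                   [3.8, 4.5], [4.4, 4.6], [5.2, 5.8], [6.0, 6.1]])

# Reduce the 2-D data to its single main component (the "white line").
pca = PCA(n_components=1)
projected = pca.fit_transform(points)             # coordinate along the white line
reconstructed = pca.inverse_transform(projected)  # foot of each blue projection segment

# The blue segments are the residuals; PCA minimizes the sum of their squared lengths.
residuals = np.linalg.norm(points - reconstructed, axis=1)
print("principal direction:", pca.components_[0])
print("sum of squared projection distances:", np.sum(residuals ** 2))
```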
Clustering
In an unsupervised setting the data carries no labels, so besides dimensionality reduction, the best analysis we can do is to group together data points that share the same characteristics, i.e., clustering.
Hierarchical clustering
This clustering method builds clusters with a hierarchical structure.
As shown in the figure above, the algorithm for hierarchical clustering is very simple (a runnable sketch follows the steps):
Initially, every point is a cluster by itself.
Find the two closest clusters (at first these are single points) and merge them into one cluster.
The distance between two clusters is the distance between the two closest points, one from each cluster.
Repeat the second step until all points are merged into a single cluster.
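Here is a minimal sketch of the same procedure using SciPy; the six points are made up for illustration, and the "single" linkage method matches the closest-pair distance rule above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D points for illustration.
points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8],
                   [9.0, 0.5], [9.2, 0.3]])

# 'single' linkage: cluster distance = distance between the two closest points.
merges = linkage(points, method="single")

# Each row of `merges` records one round of step 2: which two clusters
# were merged and at what distance, until everything is one cluster.
print(merges)

# Cut the hierarchy into, say, 3 flat clusters.
labels = fcluster(merges, t=3, criterion="maxclust")
print(labels)
```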
KMeans
KMeans (the K-means algorithm) is the most common clustering algorithm.
1) Randomly place K (here K = 3) center seed points in the plot.
2) For every point in the plot, compute its distance to each of the K center seed points. If a point P is closest to center point S, then P belongs to S's cluster.
3) Next, move each center point to the centroid of the cluster that belongs to it.
Repeat steps 2) and 3) until the center points no longer move; the algorithm has then converged and all the clusters are found (see the sketch below).
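A short NumPy sketch of these three steps, assuming Euclidean distance and made-up 2-D data (a real implementation would also handle multiple random restarts):

```python
import numpy as np

def kmeans(points, k=3, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial center seeds.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Step 2: assign each point P to the cluster of its nearest center S.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the centroid of the points assigned to it
        # (a center that lost all its points simply stays where it is).
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Repeat 2) and 3) until the centers stop moving.
        if np.allclose(new_centers, centers):
            return labels, centers
        centers = new_centers

# Made-up 2-D data: three loose blobs.
points = np.vstack([np.random.randn(20, 2) + [0, 0],
                    np.random.randn(20, 2) + [5, 5],
                    np.random.randn(20, 2) + [0, 5]])
labels, centers = kmeans(points, k=3)
print(centers)
```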
The KMeans algorithm has several problems:
How to determine K: in the example above I knew I wanted three clusters, so I chose K = 3, but in practical applications we often don't know how many classes the data should be divided into.
Since the initial positions of the center points are random, the result may be a wrong clustering. You can try different data in my Codepen.
As shown in the figure below, if the data has a particular spatial distribution, the KMeans algorithm cannot classify it effectively. The points in the middle are colored partly orange and partly blue, when they should all be blue.
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm.
The DBSCAN algorithm is based on the fact that a cluster can be uniquely determined by any of its core objects.
The specific clustering process of the algorithm is as follows:
Scan the data set, find any core point, and expand it. To expand, find all data points density-connected to that core point (note: density-connected).
Traverse all core points in the core point's neighborhood (border points cannot be expanded) and look for points density-connected to them, until no more data points can be added. The border points of the finished cluster are all non-core data points.
Then rescan the data set (excluding any data points in clusters already found), look for core points that have not yet been clustered, and repeat the steps above to expand them, until the data set contains no new core points. Data points not included in any cluster constitute anomalies (noise).
As shown in the figure above, DBSCAN can effectively handle data sets that KMeans cannot correctly classify, and you don't need to know the value of K in advance.
Of course, DBSCAN still requires two parameters, and how they are chosen is the key factor in the quality of the clustering (a code sketch follows the two parameters):
One parameter is the radius (Eps), which defines the circular neighborhood around a given point P;
The other parameter is the minimum number of points (MinPts) required in the neighborhood centered on P. If the neighborhood of radius Eps around a point P contains at least MinPts points, then P is called a core point.
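Here is a minimal sketch with scikit-learn on a two-moons data set (the data set and the parameter values are my own choices, for illustration); eps corresponds to Eps and min_samples to MinPts:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: the kind of shape KMeans gets wrong.
points, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the neighborhood radius (Eps); min_samples is MinPts.
model = DBSCAN(eps=0.3, min_samples=5).fit(points)

# Label -1 marks noise: points not density-reachable from any core point.
n_clusters = len(set(model.labels_)) - (1 if -1 in model.labels_ else 0)
print("clusters found:", n_clusters)
print("noise points:", np.sum(model.labels_ == -1))
```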
Supervised learning
Supervised learning requires labeled data: the idea is to predict outcomes for new data based on existing results. If the quantity to be predicted is numeric, we call it regression; if it is a category or a discrete value, we call it classification.
In fact, regression and classification are essentially similar, so many algorithms can be used both for classification and for regression.
Linear regression
Linear regression is the most classic regression algorithm.
In statistics, linear regression is a form of regression analysis that models the relationship between one or more independent variables and a dependent variable using a least-squares function called a linear regression equation.
This function is a linear combination of one or more model parameters, called regression coefficients. The case of a single independent variable is called simple regression; the case of more than one is called multiple regression.
As shown in the figure above, linear regression finds a straight line that minimizes the prediction error over all points, i.e., the sum of the squared lengths of the blue vertical segments in the figure is smallest. This picture looks very similar to the PCA example at the start; look carefully and spot the difference: in PCA the blue segments are perpendicular to the line, while in linear regression they are vertical.
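A minimal sketch with NumPy, using made-up data; np.polyfit with degree 1 performs exactly this least-squares fit:

```python
import numpy as np

# Made-up 1-D data with a roughly linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

# Least squares: choose the slope and intercept that minimize the sum of
# squared vertical distances (the blue segments) from the points to the line.
slope, intercept = np.polyfit(x, y, deg=1)
predictions = slope * x + intercept
print("slope:", slope, "intercept:", intercept)
print("sum of squared errors:", np.sum((y - predictions) ** 2))
```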
If high accuracy is required, the recommended regression algorithms include random forests, neural networks, and gradient boosted trees.
If speed is the priority, consider decision trees and linear regression.
Classification
Support vector machine SVM
If high classification accuracy is required, the algorithms to consider include kernel SVM, random forest, neural network, and gradient boosted tree.
Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.
The SVM model represents the examples as points in space, mapped so that the examples of the two categories are separated by as wide a clear gap as possible. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall on.
As shown in the figure above, the SVM algorithm finds the straight line in space that best separates the two sets of data, making the sum of the distances from the closest points of the two sets to the line (the margin) as large as possible.
The figure above shows the different classification results produced by different kernel functions.
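A minimal sketch with scikit-learn, comparing kernels on a made-up concentric-circles data set (which is not linearly separable, so the kernel choice matters):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Concentric circles: not linearly separable, so the kernel choice matters.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.08, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
```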
Decision tree
If the classification results need to be explainable, consider decision trees or logistic regression.
A decision tree is a tree structure (either a binary tree or not).
Each of its non-leaf nodes represents a test on a feature attribute, each branch represents the test's outcome over a range of values, and each leaf node stores a category.
To make a decision with a decision tree, start at the root node, test the corresponding feature attribute of the item to be classified, and follow the branch matching its value until a leaf node is reached; the category stored at that leaf is the decision result.
Decision trees can be used for regression or classification, and the following figure is an example of classification.
As shown in the figure above, the decision tree divides the space into different regions.
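A minimal sketch with scikit-learn on made-up labeled clusters; printing the tree shows the feature tests that carve the plane into regions (max_depth=3 is an arbitrary choice to keep the output readable):

```python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import make_blobs

# Three made-up 2-D clusters as labeled training data.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Each printed split is a test on one feature; each leaf stores a category.
# These axis-aligned tests are what divide the plane into regions.
print(export_text(tree, feature_names=["x", "y"]))
```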
Logistic regression
Logistic regression, despite the name, is a classification algorithm. Like SVM it is a binary classifier, but its mathematical model predicts the probability of the label being 1 or 0, which is why I said regression and classification are essentially the same.
Note the difference here between logistic regression and linear SVM classification.
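A minimal sketch with scikit-learn on made-up labeled data; unlike a linear SVM, the model returns class probabilities rather than just a separating line:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs

# Two made-up, labeled 2-D clusters.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = LogisticRegression().fit(X, y)

# The model outputs the probability of class 1 for each point;
# the 0.5 probability contour is the decision boundary.
print(clf.predict_proba(X[:3]))   # per-point probabilities for classes 0 and 1
print(clf.predict(X[:3]))         # hard labels obtained by thresholding at 0.5
```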
Naive Bayes
Naive Bayes is a good choice when the amount of data is quite large.
Back in 2015 I gave my colleagues at the company a talk on the Bayes method; unfortunately the Speaker Deck slides are blocked behind the wall. If you are interested, you can find your own way to them.
As shown in the figure above, think about how the green point at the bottom left affects the overall classification result.
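A minimal sketch with scikit-learn's Gaussian naive Bayes on made-up labeled data; the naive independence assumption between features is what keeps training cheap on large data sets:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_blobs

# Two made-up, labeled 2-D clusters.
X, y = make_blobs(n_samples=200, centers=2, random_state=1)

# Gaussian naive Bayes assumes the features are independent within each class.
clf = GaussianNB().fit(X, y)
print(clf.predict(X[:5]))        # predicted categories
print(clf.predict_proba(X[:1]))  # class probabilities for one point
```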
KNN
KNN (k-nearest neighbors) classification is probably the simplest of all machine learning algorithms.
As shown in the figure above, with K = 3, when the mouse moves to any point, we find the K points closest to it. Those K points then vote, and the category with the majority of votes wins. It's that simple.
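And a minimal sketch of the same vote with scikit-learn; the training blobs and the query point are made up, standing in for the colored dots and the mouse position in the demo:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_blobs

# Labeled 2-D training points, standing in for the colored dots in the demo.
X, y = make_blobs(n_samples=60, centers=2, random_state=0)

# K = 3: a query point is labeled by a majority vote of its 3 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

query = [[0.0, 2.0]]   # stands in for the mouse position
print(knn.predict(query))
```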