Scikit-learn is a very popular open-source machine learning library written in Python, and it is free to use.
Website: http://scikit-learn.org/stable/index.html
It offers many tutorials and programming examples, as well as a useful summary: the following figure covers most of the theory and related algorithms of traditional machine learning.
As the figure shows, machine learning is divided into four areas: classification, clustering, regression, and dimensionality reduction.
Given a sample with feature vector x, we want to predict its corresponding value y. If y is discrete, this is a classification problem; conversely, if y is a continuous real number, it is a regression problem.
If we are given a set of sample features S = {x ∈ R^D} with no corresponding y, and instead want to explore how the samples are distributed in the D-dimensional space, such as analyzing which samples are close to each other and which are far apart, that is a clustering problem.
If we want to use the subspace with lower dimensionality to represent the original high-dimensional feature space, then this is the dimensionality reduction problem.
Classification & Regression
Whether it is classification or regression, the goal is to build a predictive model H: given an input x, it produces an output y:
y=H(x)
The only difference is that in a classification problem y is discrete, while in a regression problem y is continuous, so the learning algorithms for the two kinds of problems are very similar. This is why, in the figure, algorithms used for classification also appear under regression. The most common classification algorithms include SVM (support vector machines), SGD (stochastic gradient descent), Bayes (Bayesian methods), Ensemble methods, and KNN. Regression problems can likewise use SVR, SGD, Ensemble methods, and other linear regression algorithms.
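As a minimal sketch of this parallel, the same fit/predict workflow learns a model H for both problems; below, an SVM classifier handles discrete y and SVR handles continuous y. The synthetic datasets and parameter choices here are illustrative, not from the original article.

```python
# Same workflow y = H(x) for classification (discrete y) and regression
# (continuous y), using two of the scikit-learn estimators named above.
from sklearn.datasets import make_classification, make_regression
from sklearn.svm import SVC, SVR

# Classification: y is a discrete class label.
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC()                 # support vector machine classifier
clf.fit(Xc, yc)             # learn the model H from (x, y) pairs
print(clf.predict(Xc[:3]))  # predictions are discrete labels

# Regression: y is a continuous real number.
Xr, yr = make_regression(n_samples=200, n_features=5, random_state=0)
reg = SVR()                 # support vector regression
reg.fit(Xr, yr)
print(reg.predict(Xr[:3]))  # predictions are continuous values
```

Swapping `SVC`/`SVR` for other estimators from the figure (e.g. `SGDClassifier`/`SGDRegressor`) leaves the workflow unchanged, which is exactly the similarity the text describes.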
Clustering
Clustering also analyzes an attribute of the samples and is somewhat similar to classification. The difference is that in classification the range of y is known beforehand, that is, we know exactly how many categories there are, while in clustering the range of the attribute is unknown. For this reason, classification is often called supervised learning, and clustering is called unsupervised learning.
Because clustering does not know the attribute range of the samples beforehand, it can only analyze them based on how the samples are distributed in feature space, which generally makes the problem more complex. Commonly used algorithms include K-means and GMM (Gaussian mixture models).
Dimensionality Reduction
Dimensionality reduction is another important field of machine learning with many applications. When the feature dimension is too high, it increases the training burden and the storage cost; dimensionality reduction aims to remove redundancy in the features and represent them with fewer dimensions. The most fundamental dimensionality reduction algorithm is PCA, and many other algorithms are based on it.
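The idea above can be sketched with PCA, the algorithm the text names: project the features onto a lower-dimensional subspace while keeping most of the variance. The Iris dataset and the choice of 2 components are illustrative assumptions.

```python
# PCA: represent a high-dimensional feature space with fewer dimensions.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data             # 150 samples, 4 features
pca = PCA(n_components=2)        # keep a 2-D subspace
X2 = pca.fit_transform(X)
print(X2.shape)                  # (150, 2): same samples, fewer dimensions
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

The explained-variance ratio quantifies how much information the lower-dimensional representation preserves, which is the trade-off dimensionality reduction is managing.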
Scikit-learn Machine Learning Map