In data science, you often face the problem of choosing the most appropriate algorithm for a specific task. Although there are many articles that describe machine learning algorithms in detail, making the right choice can still be difficult.
In this article, I will introduce some basic concepts and offer advice on using different types of machine learning algorithms for different tasks. At the end of the article, I will summarize these algorithms.
First, you should be able to distinguish between the following four machine learning tasks:
Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
Supervised learning
Supervised learning is the inference of a function from labeled training data. By fitting the labeled training set, we find the optimal model parameters and use them to predict the unknown labels of other objects (the test set). If the label is a real number, the task is called regression. If the label comes from a finite, unordered set of values, it is called classification.
Unsupervised learning
In unsupervised learning, we know less about the objects; in particular, the training set is unlabeled. What is the goal then? To observe similarities between objects and divide them into groups. Some objects may differ greatly from every group, and we treat these objects as anomalies.
Semi-supervised learning
Semi-supervised learning combines the two problems described above: it uses both labeled and unlabeled data. This is a great option when labeling all the data is impractical. The approach can significantly improve accuracy, because a small amount of labeled data can be combined with a large amount of unlabeled data in the training set.
Reinforcement learning
Reinforcement learning differs from the methods above, because here there is no labeled or unlabeled dataset at all. Reinforcement learning studies how software agents should act in an environment to maximize a cumulative reward.
Imagine you are a robot in an unfamiliar environment: you can perform actions and receive rewards for them. After each action, your behavior becomes more refined; that is, you train yourself to act more effectively with every step. In biology, this is called adaptation to the natural environment.
Common machine learning algorithms
Now that we have some understanding of the types of machine learning, let's look at the most popular algorithms and their applications in real life.
Linear regression and linear classifier
These are probably the simplest algorithms in machine learning. Suppose the objects (matrix A) have features x1, ..., xn and labels (vector B). Our goal is to find the optimal weights w1, ..., wn and the bias for these features according to some loss function, e.g. MSE or MAE. In the case of MSE, the least-squares method gives a closed-form solution.
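Concretely, with the MSE loss the least-squares solution can be written in matrix form (a standard result, stated here for completeness, using the matrix A and label vector B defined above):

```latex
w^{*} = \arg\min_{w} \|Aw - B\|^{2} = (A^{\top} A)^{-1} A^{\top} B
```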
In practice, it is easier and more computationally efficient to optimize with gradient descent. Although this algorithm is simple, it still works well even with thousands of features. More complex algorithms may overfit when there are many features but the dataset is not large enough; in that situation, linear regression is a good choice.
To prevent overfitting, regularization techniques such as lasso and ridge can be used. The main idea is to add the sum of the absolute values of the weights (lasso, L1) or the sum of the squared weights (ridge, L2) to the loss function.
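As a minimal sketch (function names and data are mine, not from any library), here is one-feature linear regression fitted by gradient descent in plain Python, with an optional ridge-style L2 penalty on the weight:

```python
def fit_linear(xs, ys, lr=0.01, epochs=2000, l2=0.0):
    """Fit y = w*x + b by gradient descent on the MSE loss.

    l2 > 0 adds a ridge penalty l2 * w**2 to the loss.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error (plus the L2 term) w.r.t. w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n + 2 * l2 * w
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Noise-free data generated from y = 2x + 1: gradient descent recovers the line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear(xs, ys)
print(round(w, 2), round(b, 2))  # → 2.0 1.0
```

Passing l2 > 0 shrinks the weight toward zero, which is exactly the ridge idea described above.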
Logistic regression
Logistic regression performs binary classification, so the output labels are binary. Given an input feature vector x, we define P(y = 1|x) as the conditional probability that the output y equals one. The coefficients w are the weights the model has to learn.
Since the algorithm computes the probability of belonging to each class, the loss should account for how far each predicted probability is from 0 or 1, averaged over all objects as in linear regression. This loss function is the average cross entropy.
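Written out, the averaged cross-entropy loss takes its standard form, with $\hat{y}_i = \sigma(w^{\top} x_i)$ the predicted probability for object $i$:

```latex
J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]
```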
What are the benefits of logistic regression? It takes a linear combination of the features and applies a nonlinear (sigmoid) function to it, so it is a very small example of a neural network!
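That structure, a linear combination followed by a sigmoid, fits in a few lines of plain Python. The weights here are hypothetical hand-picked values for illustration, not learned ones:

```python
import math

def sigmoid(z):
    # Squashes any real number into (0, 1), giving a probability.
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    # Linear combination of features followed by the sigmoid:
    # exactly the "very small neural network" described above.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical learned weights for two features.
w, b = [1.5, -2.0], 0.5
print(predict_proba(w, b, [2.0, 1.0]))  # P(y = 1 | x), ≈ 0.82 here
```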
Decision tree
Another popular and easy-to-understand algorithm is the decision tree. Its graphical structure lets you see exactly how a decision is reached, making its reasoning systematic and easy to document.
The algorithm itself is simple. At each node, we select the best split among all features and all possible split points, choosing each split to optimize some criterion. Cross entropy and the Gini index are used in classification trees. In regression trees, we minimize the sum of squared errors between the target values of the points in a region and the prediction assigned to that region.
The algorithm repeats this process recursively at each node until a stopping condition is met.
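To illustrate the split criterion, here is a minimal sketch (names and toy data are mine) of choosing the best threshold on a single feature by minimizing the weighted Gini impurity of the two resulting halves:

```python
def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    # Try a threshold between every pair of adjacent feature values and
    # keep the one with the lowest weighted Gini impurity of the two halves.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]
    best_t, best_score = None, float("inf")
    for i in range(1, len(xs)):
        t = (xs[i - 1] + xs[i]) / 2
        left, right = ys[:i], ys[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Perfectly separable toy data: the best threshold lies between 2 and 10,
# and the resulting impurity is 0.
t, score = best_split([1, 2, 10, 11], [0, 0, 1, 1])
print(t, score)  # → 6.0 0.0
```

A real decision tree applies this search at every node, over every feature, and recurses on each half.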
K-means
Sometimes you do not know any labels, and the goal is to assign labels based on the features of the objects. This is called a clustering task.
Suppose all data objects are to be divided into k clusters. You randomly select k points from the data and call them cluster centers. The remaining objects are assigned to the nearest cluster center. Then the cluster centers are recomputed, and the process repeats until convergence.
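The assign-then-update loop above can be sketched in plain Python. This is a toy one-dimensional version with made-up data, not a production implementation:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    # Pick k random points as the initial cluster centers.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return sorted(centers)

# Two obvious groups, around 1 and around 10.
centers = kmeans([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], k=2)
print([round(c, 6) for c in centers])  # → [1.0, 10.0]
```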
Although this technique works well, it has some drawbacks. First, we do not know the number of clusters in advance. Second, the result depends on the randomly chosen initial centers, and the algorithm cannot guarantee reaching the global minimum of the objective function.
Principal Component Analysis (PCA)
Have you ever had to prepare for an exam on the last night, or even in the last few hours? You cannot memorize all the information, so you focus on what matters most, for example, first studying the theorems that appear most often in exams.
Principal component analysis is based on a similar idea. The algorithm performs dimensionality reduction. Sometimes you have many features that are strongly correlated with each other, and models can easily overfit on such data. Then you can apply PCA.
The idea is to project the data onto vectors that maximize its variance, losing as little information as possible. These vectors are the eigenvectors of the correlation matrix of the dataset's features.
The algorithm is now clear:
Compute the correlation matrix of the feature columns and find its eigenvectors.
Take the top eigenvectors and compute the projections of all the data onto them.
The new features are the coordinates of these projections; their number depends on how many eigenvectors you project onto.
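The steps above can be sketched in NumPy. This version uses the covariance matrix of the centered data, a common variant of the correlation-matrix approach, and the data values are made up for illustration:

```python
import numpy as np

# Toy data: two strongly correlated features, 10 objects.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
              [1.5, 1.6], [1.1, 0.9]])

# Step 1: center the data and compute its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Step 2: find the eigenvectors (the principal directions).
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # sort by explained variance, descending
eigvecs = eigvecs[:, order]

# Step 3: project onto the top k eigenvectors -> the new coordinates.
k = 1
Z = Xc @ eigvecs[:, :k]
print(Z.shape)  # → (10, 1)
```

Here two correlated features are compressed into one new feature while keeping most of the variance.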
Neural Networks
Neural networks were already mentioned above, when we talked about logistic regression. There are many different architectures, each valuable for specific tasks. More often than not, a neural network is a sequence of layers or components, combining linear transformations with nonlinearities.
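That "linear step followed by a nonlinearity, repeated" structure can be shown in a few lines of plain Python. The weights below are hypothetical hand-picked values, purely for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def layer(weights, biases, inputs, activation):
    # One layer: a linear combination of the inputs, then a nonlinearity.
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x):
    # Two layers: 2 inputs -> 2 hidden ReLU units -> 1 sigmoid output.
    h = layer([[1.0, -1.0], [0.5, 0.5]], [0.0, -0.5], x, relu)
    return layer([[1.0, 2.0]], [0.0], h, sigmoid)

print(forward([1.0, 2.0]))  # a probability-like output in (0, 1)
```

Stacking more such layers, or swapping the linear step for a convolution, gives the deeper architectures discussed next.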
If you are working with images, convolutional deep neural networks show great results. Such networks use convolutional layers and pooling layers, which capture the characteristic features of images.
To process text and other sequences, recurrent neural networks are the best choice. RNNs contain LSTM or GRU modules and are designed to work with sequential data. Perhaps the most famous RNN application is machine translation.
Conclusion
I hope this article has explained the most commonly used machine learning algorithms and offered advice on how to choose one for a specific problem. To make these ideas easier to master, I have prepared the following summary.
Linear regression and linear classifiers. Although they seem simple, their advantage shows when other algorithms overfit on a very large number of features.
Logistic regression is the simplest nonlinear classifier, combining a linear combination of parameters with a nonlinear (sigmoid) function for binary classification.
Decision trees often resemble human decision-making and are easy to interpret. But they are most commonly used in ensembles such as random forests or gradient boosting.
K-means is a more basic, but very easy to understand, algorithm.
PCA is an excellent choice for reducing the feature space dimension with the least information loss.
Neural networks are the new weapon of machine learning, applicable to many tasks, but training them is computationally expensive.