Random forests are supervised ensemble learning models used for classification and regression. An ensemble model aggregates several machine learning models to improve overall performance: each individual model may perform poorly on its own, but combined into a single unit they become powerful. In a random forest, a large number of "weak" decision trees are trained and their outputs are aggregated, and the result represents a "strong" ensemble.
Balancing bias and variance
In any machine learning model there are two sources of error: bias and variance. To illustrate the two concepts, suppose we have built a machine learning model and know the true outputs for the data, then train the model on different subsets of the same data; the model produces a different set of predictions for each subset. Comparing those predictions with the true values lets us measure both quantities: bias is the difference between the model's predicted values and the actual values, and variance is the spread of those predicted values.
Bias is the error introduced when an algorithm makes too many simplifying assumptions, causing the model's predicted values to deviate from the actual values.
Variance arises from the algorithm's sensitivity to small changes in the training data; the larger the variance, the more strongly the model reacts to changes in the data.
Ideally, both bias and variance are small, which means the model's predictions on different subsets of the same data set stay close to the truth. When this happens, the model has accurately learned the underlying patterns in the data.
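To make the two quantities concrete, here is a minimal sketch of how one might measure them empirically; it assumes NumPy, scikit-learn, and a toy sine-curve regression task, none of which come from the original post. The same decision tree model is refit on many fresh training samples, and its predictions at a fixed test point are compared with the true value.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(x)

x_test = np.array([[1.5]])  # fixed point where we measure bias and variance
preds = []

for _ in range(200):
    # Each run draws a fresh training sample from the same distribution.
    X = rng.uniform(0, 5, size=(50, 1))
    y = true_fn(X.ravel()) + rng.normal(0, 0.3, size=50)
    model = DecisionTreeRegressor().fit(X, y)
    preds.append(model.predict(x_test)[0])

preds = np.array(preds)
bias = preds.mean() - true_fn(x_test.ravel())[0]  # systematic offset from truth
variance = preds.var()                            # spread across training samples

print(f"bias: {bias:.3f}, variance: {variance:.3f}")
```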
Random forests are an algorithm for reducing variance
Decision trees are known for high variance and low bias, mainly because they can model complex relationships, even to the point of overfitting the noise in the data. Put simply: a trained decision tree is usually accurate, but its predictions often vary widely across different samples drawn from the same data set.
Random forests reduce the variance that leads to decision tree errors by aggregating the outputs of many individual trees. Through majority voting, we take the output that most of the single trees agree on, which smooths out the variance and makes the model less likely to produce results far from the true values.
The idea of a random forest, then, is to take a set of high-variance, low-bias decision trees and convert them into a new model with both low variance and low bias.
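The post shows no code, but a minimal scikit-learn sketch (the library choice and the use of its bundled Iris data are assumptions) illustrates the effect of this conversion: a single high-variance tree versus a forest of 100 trees whose majority vote smooths out individual errors.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One high-variance, low-bias tree versus an ensemble that
# aggregates the votes of 100 such trees.
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("single tree accuracy :", cross_val_score(tree, X, y, cv=5).mean())
print("random forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```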
Why are random forests random?
The randomness in a random forest comes from two sources: the algorithm trains each individual decision tree on a different subset of the training data, and it splits each node of each tree using a randomly selected subset of the features. By introducing this randomness, the algorithm creates trees that are decorrelated from one another. As a result, the possible errors are spread more evenly across the models, and they largely cancel out under the random forest's majority-voting strategy.
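A short NumPy sketch of these two sources of randomness (the library, the stand-in data, and the common sqrt heuristic for the feature-subset size are all assumptions on my part):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 150, 4
X = rng.normal(size=(n_samples, n_features))  # stand-in training data

# Source 1: each tree is trained on a bootstrap sample,
# i.e. rows drawn with replacement from the training set.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
X_tree = X[bootstrap_idx]

# Source 2: each node split considers only a random subset of
# the features, often sqrt(n_features) for classification.
max_features = int(np.sqrt(n_features))
split_features = rng.choice(n_features, size=max_features, replace=False)

print("distinct rows seen by this tree:", np.unique(bootstrap_idx).size, "of", n_samples)
print("features considered at this split:", split_features)
```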
How does a random forest actually work?
Imagine you are tired of listening to the same electronic music over and over and badly want to find something new you might like, so you go online looking for recommendations and find a website where real people give you music recommendations based on your preferences.
So how does it work? First, to keep the recommendations from being completely random, you fill out a questionnaire about your basic musical preferences, which sets the criteria for the kinds of music you might like. The members of the site then use this information to start recommending songs based on the criteria (features) you provided. At this point, each person is essentially acting as a decision tree.
Individually, the people making suggestions won't capture your musical preferences very well. For example, someone might assume that you dislike any song from before the '80s and therefore never recommend one. That assumption may be wrong, and it could mean you never receive music you would have loved.
Why does this happen? Each recommender has only a limited understanding of your preferences, and each is biased by their own musical taste. To solve this problem, we collect recommendations from many individuals (each playing the role of a decision tree) and take a majority vote over their suggestions (essentially creating a random forest).
There is still one problem, however: because everyone works from the same answers to the same questionnaire, the recommendations will be similar and may be highly biased and correlated. To broaden the recommendations, each recommender is given a random subset of your questionnaire answers rather than all of them, so each works with less information. Finally, after majority voting eliminates the extreme outliers, you are left with an accurate and diverse list of recommended songs.
To sum up
The advantages of random forests:
- No feature normalization is required;
- Parallelizable: the individual decision trees can be trained in parallel;
- Widely applicable;
- Reduced overfitting.
Disadvantages of random forests:
- Not easy to interpret;
- Not a state-of-the-art method.
Logistic regression is a supervised statistical model that predicts a categorical dependent variable. The values of the categorical variable are names or labels, for example: win/loss, healthy/sick, or success/failure. The model can also handle dependent variables with more than two categories, in which case it is known as multinomial logistic regression.
Logistic regression builds a classification rule from a data set of historical (labeled) observations that are divided into different categories. The model formula is:

P(Y=c) = \frac{\exp\left(\beta_{0c} + \sum_{i=1}^{I} \beta_{ic} X_{i}\right)}{\sum_{c'=1}^{C} \exp\left(\beta_{0c'} + \sum_{i=1}^{I} \beta_{ic'} X_{i}\right)}
Related terms are defined as follows:
c=1,...,C are the possible categories of the dependent variable Y;
P(Y=c) is the probability that the dependent variable takes category c;
\beta_{0c} is the intercept for category c, and \beta_{ic}, i=1,...,I are the regression coefficients, which, when exponentiated, indicate the importance of each variable in explaining the probability;
X_{i}, i=1,...,I are the independent variables.
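To see the formula in action, here is a minimal sketch that computes the category probabilities for a single observation. The coefficient values are hypothetical placeholders, not the estimates from Table 1.

```python
import numpy as np

# Hypothetical coefficients: beta0[c] is the intercept for category c,
# beta[c, i] the coefficient of variable i for category c (C=3, I=4).
beta0 = np.array([0.5, 0.1, -0.6])
beta = np.array([
    [ 0.4,  1.2, -2.3, -1.0],
    [ 0.1, -0.8,  0.3, -0.5],
    [-0.5, -0.4,  2.0,  1.5],
])
x = np.array([5.1, 3.5, 1.4, 0.2])  # one observation's four predictors

# Linear score per category, then the softmax from the model formula.
scores = beta0 + beta @ x
probs = np.exp(scores) / np.exp(scores).sum()

print(probs)        # P(Y=c) for each category c
print(probs.sum())  # the probabilities sum to 1
```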
We will use the Iris data set from the previous blog post to illustrate how logistic regression works. The data consist of 150 iris samples, each described by its species (three different species appear in this data set) and by the lengths and widths of its sepals and petals. We use only the sepal and petal measurements to describe each iris, and we will build a classification rule to determine the species of new plants added to the data set. Figure 1 shows the sepal and petal measurements of an iris.
First, we must divide the data set into two subsets: training and testing. The training set holds 60% of the data and is used to fit the model; the test set holds the remaining 40% and is used to check how well the fitted model generalizes to data it has not seen.
Using the formula above, we fit a logistic regression model to the data. In this case the dependent variable is the plant species, the number of categories is three, and the independent variables X_{i}, i=1,...,4 are the lengths and widths of the sepals and petals. Figure 2 shows a subset of the data.
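The post does not include its code, but the split-and-fit steps might look like the following scikit-learn sketch (the library, and the use of its bundled copy of the Iris data, are my assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 60% training / 40% test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

# Multinomial logistic regression on the four sepal/petal measurements.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("coefficient matrix shape:", model.coef_.shape)  # (3 classes, 4 features)
```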
Table 1 gives the coefficient estimates for each independent variable in the three species. Clearly, petal length and width are the most important variables in characterizing the species, so these two variables stand out in the feature-importance plot for each species (Figure 3).
Next, we create a confusion matrix (error matrix) to test the model's performance. This matrix compares the known iris species in the test set with the species predicted by the fitted model; ideally the two agree. In Table 2 we see that the model performs fairly well, with only two plants misclassified.
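Continuing the sketch above, the confusion matrix for the held-out test set can be computed like this (an illustration only; the post's own Table 2 was produced with its own code):

```python
from sklearn.metrics import confusion_matrix

# Rows: true species in the test set; columns: predicted species.
# Off-diagonal entries are misclassifications.
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
```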
Based on these results, we can properly classify the various iris species in the data set. However, as mentioned earlier, we now need a classification rule for new observations. The probability that a new iris belongs to a given category is computed by plugging the new iris's independent-variable values into the formula together with the coefficient estimates from Table 1. The values for the new iris are shown in Table 3 below:
Then, using the formula above, we calculate the probability that this iris belongs to each category. The results confirm that it most likely belongs to Iris virginica.
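Continuing the same sketch, the per-category probabilities for a new iris can be read off with predict_proba; the four measurements below are hypothetical, not the values from Table 3.

```python
import numpy as np

# A hypothetical new iris: sepal length, sepal width,
# petal length, petal width (cm).
new_iris = np.array([[6.3, 2.8, 5.6, 2.1]])

probs = model.predict_proba(new_iris)[0]
print(dict(zip(model.classes_, probs.round(3))))
# The category with the highest probability is the predicted species.
```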
To sum up
The advantages of logistic regression:
1. Interpretable;
2. The model is simple;
3. Scalable;
The disadvantages of logistic regression:
1. Assumes relative independence between the features.