1. Linear regression
Linear regression is usually used to estimate real values (house prices, number of calls, total sales, etc.) based on continuous variables. We establish a relationship between the independent and dependent variables by fitting a best-fit straight line. This best-fit line is known as the regression line and is represented by the linear equation y = a*x + b.
The best way to understand linear regression is to relive a childhood experience. Suppose you ask a five-year-old to arrange the people in a class in order of increasing weight, without being allowed to ask their weights. What do you think the child will do? He or she would likely look at people's height and build and arrange them by combining these visible parameters. This is linear regression in real life: the child has figured out that height and build are related to weight by a relationship, which looks like the equation above.
In this equation:
y: dependent variable
a: slope
x: independent variable
b: intercept
The coefficients a and b are obtained using the least squares method.
See the following example. Here we have identified the best-fit line y = 0.2811x + 13.9. Given a person's height, we can use this equation to estimate their weight.
The two main types of linear regression are simple linear regression and multiple linear regression. Simple linear regression is characterized by a single independent variable, while multiple linear regression, as its name suggests, has several independent variables. When looking for the best-fit line, you can also fit a polynomial or curved line; these are known as polynomial or curvilinear regressions.
Python Code
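Since the original code listing is not reproduced here, below is a minimal sketch of what it might look like, assuming scikit-learn and a small made-up height/weight dataset:

```python
# Minimal linear regression sketch (illustrative only).
# The height/weight values below are invented for this example.
import numpy as np
from sklearn.linear_model import LinearRegression

heights = np.array([[150], [160], [165], [170], [180], [185]])  # independent variable x (cm)
weights = np.array([55, 58, 63, 66, 72, 77])                    # dependent variable y (kg)

model = LinearRegression()
model.fit(heights, weights)   # least squares estimates of slope a and intercept b

print("slope a:", model.coef_[0])
print("intercept b:", model.intercept_)
print("predicted weight at 175 cm:", model.predict([[175]])[0])
```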
2. Logistic regression
Don't be fooled by its name: this is a classification algorithm rather than a regression algorithm. The algorithm estimates discrete values (for example, binary values 0 or 1, yes or no, true or false) based on a given set of independent variables. In short, it estimates the probability of an event occurring by fitting the data to a logit function, which is why it is also known as logit regression. Since it estimates probabilities, its output values lie between 0 and 1 (as expected).
Let's go through a simple example to understand this algorithm.
Suppose your friend gives you a puzzle to solve. There are only two possible outcomes: either you solve it or you don't. Now imagine that you are being given a wide range of puzzles to figure out which subjects you are good at. The result of this study would be something like this: if the problem is a tenth-grade trigonometry question, you have a 70% chance of solving it, whereas if it is a fifth-grade history question, you only have a 30% chance of answering it correctly. This is the kind of information logistic regression provides.
Mathematically, the log odds of the outcome are modeled as a linear combination of the predictor variables:

odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1*x1 + b2*x2 + ... + bk*xk

In the equations above, p is the probability of the characteristic of interest being present. The parameters are chosen to maximize the likelihood of observing the sample values, rather than by minimizing the sum of squared errors (as in ordinary regression).
Now you may ask, why take a logarithm? In short, this is one of the best ways to replicate a step function. I could go into more detail, but that would defeat the purpose of this guide.
Python code
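The original listing is not shown here either; a minimal sketch with scikit-learn and invented binary-outcome data could look like this:

```python
# Minimal logistic regression sketch (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # a single made-up feature
y = np.array([0, 0, 0, 1, 1, 1])                          # binary outcome (solved / not solved)

clf = LogisticRegression()
clf.fit(X, y)                        # parameters chosen by maximum likelihood

print(clf.predict_proba([[3.5]]))    # estimated probabilities for classes 0 and 1
print(clf.predict([[3.5]]))          # predicted class
```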
3. KNN (k-nearest neighbors)
This algorithm can be used for both classification and regression problems. In industry, however, KNN is more widely used for classification. KNN is a simple algorithm: it stores all available cases and classifies new cases by a majority vote of their k nearest neighbors. Based on a distance function, the new case is assigned to the class that is most common among its k nearest neighbors.
The distance function can be the Euclidean, Manhattan, Minkowski, or Hamming distance. The first three are used for continuous variables, and the fourth (Hamming distance) is used for categorical variables. If k = 1, the new case is simply assigned to the class of its nearest neighbor. Choosing k can sometimes be a challenge when modeling with KNN.
More information: Introduction to the k-nearest neighbors algorithm (simplified version)
KNN maps easily to real life. If you want to learn about a complete stranger, you might find out about his close friends and the circles he moves in to gather that information.
Things to consider before selecting KNN:
KNN is computationally expensive.
Variables should be normalized first, otherwise variables with a larger range will bias the result.
KNN also requires more work at the preprocessing stage: remove outliers and noise before using it.
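As a rough sketch of how this looks in practice (scikit-learn assumed, with a small invented two-feature dataset):

```python
# Minimal k-nearest-neighbors sketch (illustrative only).
# Features are normalized first so that no variable dominates the distance.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X = np.array([[160, 20], [170, 35], [180, 5], [155, 25], [175, 10], [165, 30]])  # two made-up features
y = np.array([1, 0, 0, 1, 0, 1])                                                 # two classes

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, Euclidean distance by default
knn.fit(X_scaled, y)

new_case = scaler.transform([[168, 15]])
print(knn.predict(new_case))                # majority vote among the 3 nearest neighbors
```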
4. Support Vector Machine
This is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate.
For example, if we only had two features, height and hair length, we would plot these two variables in two-dimensional space, with each point having two coordinates (these coordinates are known as support vectors).
Now we find a line that splits the data between the two differently classified groups. It is the line for which the distances from the closest point in each of the two groups to the line are both as large as possible.
In the example above, the black line splits the data into two groups, because the distances from the nearest points in the two groups (A and B) to the black line satisfy this optimality condition. This line is our classifier. Whichever side of the line a test data point falls on, that is the class we assign it to.
See more: Simplification of support vector machines
Think of this algorithm as playing JezzBall in n-dimensional space, with a few tweaks to the game:
You can draw lines or planes at any angle, rather than only horizontally or vertically as in the original game.
The objective of the game is to segregate balls of different colors into different rooms.
The balls do not move.
Python code
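A minimal sketch (scikit-learn assumed; the height/hair-length numbers are invented to mirror the example above):

```python
# Minimal support vector machine sketch (illustrative only).
import numpy as np
from sklearn.svm import SVC

X = np.array([[150, 40], [160, 35], [175, 5], [180, 8], [155, 45], [178, 6]])  # height, hair length (made up)
y = np.array([1, 1, 0, 0, 1, 0])                                               # two classes

clf = SVC(kernel="linear")       # a linear kernel looks for a separating line/hyperplane
clf.fit(X, y)

print(clf.support_vectors_)      # the points closest to the dividing line
print(clf.predict([[170, 10]]))  # which side of the line does a new point fall on?
```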
5. Naive Bayes
Naive Bayes is a classification technique based on Bayes' theorem, with the assumption that the predictor variables are independent of one another. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other, or on the existence of other features, a Naive Bayes classifier treats all of these properties as independently contributing to the probability that the fruit is an apple.
Naive Bayes models are easy to build and particularly useful for very large datasets. Despite their simplicity, Naive Bayes classifiers are known to outperform even highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c). See the following equation:

P(c|x) = P(x|c) * P(c) / P(x)

Here:
P(c|x) is the posterior probability of the class (target) given the predictor variable (attribute).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, i.e. the probability of the predictor variable given the class.
P(x) is the prior probability of the predictor variable.
Example: let's use an example to understand this concept. Below, I have a training dataset of weather conditions and the corresponding target variable "Play". We now need to classify whether players will play or not based on the weather conditions. Let's follow the steps below.
Step 1: Convert the dataset into a frequency table.
Step 2: Create a likelihood table by finding probabilities such as "the probability of overcast is 0.29 and the probability of playing is 0.64".
Step 3: Now use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
Question: Players will play if the weather is sunny. Is this statement correct?
We can solve this using the method discussed above: P(play | sunny) = P(sunny | play) * P(play) / P(sunny).
We have P(sunny | play) = 3/9 = 0.33, P(sunny) = 5/14 = 0.36, and P(play) = 9/14 = 0.64.
Now, P(play | sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.
Naive Bayes uses a similar method to predict the probability of different classes based on multiple attributes. This algorithm is mostly used for text classification and for problems involving multiple classes.
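A small sketch of this calculation in code (assuming scikit-learn's CategoricalNB; the outlook/play rows are a toy set built to match the counts quoted above, i.e. 5 sunny days out of 14, 3 of them "play", and 9 "play" days in total):

```python
# Minimal Naive Bayes sketch (illustrative only).
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# outlook encoded as 0 = sunny, 1 = overcast, 2 = rainy
outlook = np.array([[0], [0], [0], [0], [0],
                    [1], [1], [1], [1],
                    [2], [2], [2], [2], [2]])
# play encoded as 1 = yes, 0 = no
play = np.array([1, 1, 1, 0, 0,
                 1, 1, 1, 1,
                 1, 1, 0, 0, 0])

clf = CategoricalNB(alpha=1e-10)   # near-zero smoothing so the result matches the hand calculation
clf.fit(outlook, play)

# Columns are [P(no), P(yes)]; P(play | sunny) should come out close to the 0.60 above.
print(clf.predict_proba([[0]]))
```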
6. Decision Tree
This is one of my favorite and most frequently used algorithms. It is a supervised learning algorithm mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets, based on the most significant attributes or independent variables, so as to make the groups as distinct as possible. To learn more, read: Simplified version of decision trees.
Source: Statsexchange
As you can see in the image above, the population is split into four different groups based on multiple attributes, to determine whether they will play or not. Splitting the population into distinct heterogeneous groups uses various techniques such as Gini, information gain, chi-square, and entropy.
The best way to understand how a decision tree works is to play JezzBall, a classic Microsoft game (see image below). Essentially, you have a room with moving walls, and you need to build walls so that as much area as possible is cleared of balls.
So every time you split the room with a wall, you are trying to create two different populations within the same room. Decision trees work in a very similar fashion, by splitting a population into groups that are as different as possible.
For more information, see: Simplification of decision tree algorithms
Python code
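A minimal sketch (scikit-learn assumed; the two features and labels are invented for illustration):

```python
# Minimal decision tree sketch (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[25, 1], [30, 0], [12, 1], [14, 1], [40, 0], [9, 1]])  # made-up attributes
y = np.array([1, 0, 1, 1, 0, 1])                                     # 1 = will play, 0 = will not

tree = DecisionTreeClassifier(criterion="gini", max_depth=3)  # Gini impurity guides the splits
tree.fit(X, y)

print(tree.predict([[20, 1]]))
```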
7. K-means
K-means is an unsupervised learning algorithm that solves clustering problems. Its procedure is a simple and easy way to classify a dataset into a certain number of clusters (assume k clusters). Data points within a cluster are homogeneous, and heterogeneous with respect to other clusters.
Remember finding shapes in ink blots? K-means is somewhat similar to that activity. You look at the shape and spread to figure out how many different clusters or populations are present.
How k-means forms clusters:
K-means picks k points, known as centroids, one for each cluster.
Each data point forms a cluster with the closest centroid, giving k clusters.
The centroid of each cluster is recomputed based on the existing cluster members. Now we have new centroids.
With the new centroids, repeat steps 2 and 3: find the closest centroid for each data point and associate it with the new k clusters. Repeat this process until convergence, i.e. until the centroids stop changing.
How to determine the value of k:
In k-means, we have clusters, and each cluster has its own centroid. The sum of the squared distances between the centroid and the data points within a cluster constitutes the within-cluster sum of squares. When the within-cluster sums of squares of all the clusters are added together, we get the total sum of squares for the cluster solution.
We know that as the number of clusters increases, this value keeps decreasing. However, if you plot the result, you will see that the sum of squared distances decreases sharply up to some value of k, and much more slowly after that. There, we can find the optimal number of clusters.
Python code
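A minimal sketch (scikit-learn assumed; the 2-D points are invented, and in practice you would scale the features first):

```python
# Minimal k-means sketch (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])   # two obvious groups of points

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # final centroids
print(kmeans.inertia_)          # total within-cluster sum of squares
```

Plotting inertia_ for a range of k values produces the "elbow" curve described above, from which the optimal number of clusters can be read off.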
8. Random Forest
Random Forest is the name given to an ensemble of decision trees. In Random Forest, we have a collection of decision trees (hence the name "forest"). To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification that receives the most votes (over all the trees in the forest).
Each tree is grown as follows:
If the number of cases in the training set is N, a sample of N cases is drawn at random, but with replacement. This sample is the training set for growing the tree.
If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m variables is used to split the node. The value of m is held constant while the forest is grown.
Each tree is grown to the largest extent possible. There is no pruning.
Python Code
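A minimal sketch (scikit-learn assumed; X and y are made-up data):

```python
# Minimal random forest sketch (illustrative only).
# Each tree is trained on a bootstrap sample of the rows and considers
# a random subset of the features at every split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [6.7, 3.1], [5.0, 3.4], [6.3, 2.5]])
y = np.array([0, 0, 1, 1, 0, 1])

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

print(forest.predict([[6.0, 3.0]]))   # majority vote over all 100 trees
```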
9. Gradient Boosting and AdaBoost
GBM and AdaBoost are boosting algorithms used when we have to handle a large amount of data and want predictions with high predictive power. Boosting is an ensemble learning algorithm that combines the predictions of several base estimators in order to improve robustness over a single estimator. These boosting algorithms always do well in data science competitions such as Kaggle, AV Hackathon, and CrowdAnalytix.
Python Code
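A minimal sketch (scikit-learn assumed; the data is invented, and AdaBoostClassifier could be swapped in the same way):

```python
# Minimal gradient boosting sketch (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbm.fit(X, y)

print(gbm.predict([[4.5, 4.5]]))
```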
GradientBoostingClassifier and Random Forest are two different tree-ensemble classifiers (boosting and bagging, respectively), and people often ask about the difference between these two algorithms.
10. Dimensionality Reduction Algorithms
Over the last 4-5 years, data capture has grown exponentially at every possible stage. Companies, government agencies, and research organizations are not only coming up with new data sources but are also capturing data in great detail.
For example, e-commerce companies capture ever more detail about customers: demographics, web browsing history, likes and dislikes, purchase history, feedback, and much more, far beyond what the grocery shopkeeper around the corner ever could.
As data scientists, the data we are offered also consists of a great many features. This sounds like good material for building a robust model, but there is a challenge: how do you identify the most significant variables out of 1000 or 2000? In such cases, dimensionality reduction algorithms, along with various other techniques (such as decision trees, random forests, PCA, and factor analysis), help us find these important variables based on the correlation matrix, the missing-value ratio, and other factors.
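As a minimal sketch of one such technique (PCA, with scikit-learn and randomly generated stand-in features):

```python
# Minimal PCA sketch (illustrative only): reduce 50 made-up features
# to 5 components that retain most of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))          # 100 samples, 50 hypothetical features

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 5)
print(pca.explained_variance_ratio_)    # share of variance kept by each component
```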