The top 10 machine learning algorithms every beginner needs to know!

Source: Internet
Author: User

Brief introduction

We introduce the top ten machine learning (ML) algorithms to beginners, with figures and examples for easy understanding.

A Harvard Business Review article (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century) named "data scientist" the "sexiest job of the 21st century", and research on machine learning algorithms has accordingly received a great deal of attention. So, for beginners in the field of machine learning, we decided to refresh a popular 2016 post, The 10 Algorithms Machine Learning Engineers Need to Know (https://www.kdnuggets.com/2016/08/10-algorithms-machine-learning-engineers.html).

Machine learning algorithms are algorithms that can learn from data and improve from experience without human intervention. Learning tasks include learning the function that maps inputs to outputs, learning the hidden structure in unlabeled data, or "instance-based learning", where a class label is produced for a new instance by comparing it with instances of the training data stored in memory. Instance-based learning does not create an abstraction from the specific instances.

Types of machine learning algorithms

There are three types of machine learning algorithms:

Supervised learning: Supervised learning can be explained as follows:

Learn the mapping function from the input variable (x) to the output variable (y) using the labeled training data.

y = f(x)

There are two types of supervised learning problems:

a. Classification: Predict the outcome of a given sample where the output variable is a category, for example labels such as male or female, sick or healthy.

b. Regression: Predict the outcome of a given sample where the output variable is a real value, for example labels representing the amount of rainfall or a person's height.

The first 5 algorithms discussed in this post, linear regression, logistic regression, CART (classification and regression trees), naive Bayes, and KNN (K-nearest neighbours), are examples of supervised learning.

Ensembling is also a type of supervised learning: it means predicting a new sample by combining the predictions of several different weak machine learning models.

Unsupervised Learning:

Unsupervised learning problems have only input variables (x) and no corresponding output variables. They use unlabeled training data to model the underlying structure of the data.

There are three types of unsupervised learning problems:

a. Association: Discover the probability of items co-occurring in a collection. It is widely used in market basket analysis. Example: if a customer buys bread, there is an 80% chance that they will also buy eggs.

b. Clustering: Group samples so that objects in the same cluster are more similar to each other than to objects from another cluster.

c. Dimensionality reduction: As the name suggests, dimensionality reduction means reducing the number of variables in a dataset while ensuring that important information is still conveyed. Dimensionality can be reduced using feature extraction methods and feature selection methods. Feature selection selects a subset of the original variables. Feature extraction performs a transformation of the data from a high-dimensional space to a low-dimensional space. Example: PCA (principal component analysis) is a feature extraction method.

Algorithms 6-8 covered here, Apriori, K-means, and PCA (principal component analysis), are examples of unsupervised learning.

Reinforcement learning:

Reinforcement learning is a type of machine learning algorithm that allows an agent to decide the best next action based on its current state, by learning behaviours that maximize a reward.

Reinforcement learning algorithms usually learn optimal actions through trial and error. They are typically used in robotics, where a robot can learn to avoid collisions by receiving negative feedback after bumping into obstacles, and in video games, where trial and error reveals the specific actions that yield rewards. The agent can then use these rewards to understand the best states of the game and choose the next action.

Quantifying the prevalence of machine learning algorithms

Some studies (http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf) have attempted to quantify the 10 most popular data mining algorithms. However, such lists are subjective; in the case of the cited paper, the sample of surveyed participants was quite narrow and consisted of senior practitioners of data mining. Respondents included winners of the ACM KDD Innovation Award and the IEEE ICDM Research Contributions Award, program committee members of KDD '06, ICDM '06, and SDM '06, and 145 attendees of ICDM '06.

The top ten algorithms in this post are aimed at beginners and are mainly drawn from the "Data Warehousing and Mining" (DWM) course I took during my bachelor's degree in computer engineering. The DWM course is a good introduction to the field of machine learning algorithms. The last 2 algorithms (ensemble methods) are covered specifically because of how often they are used to win Kaggle competitions. Hope you enjoy this article!

Supervised learning algorithms

    1. Linear regression

In machine learning, we have a set of input variables (x) that determine the output variable (y). There is a relationship between the input variable and the output variable. The goal of machine learning is to quantify this relationship.

Figure 1: Linear regression represented as a line of the form y = ax + b

In linear regression, the relationship between the input variable (x) and the output variable (y) is expressed as an equation of the form y = ax + b. Thus, the goal of linear regression is to find the values of the coefficients a and b. Here, a is the slope of the line and b is the intercept.

Figure 1 shows the x and y values for a dataset. The goal is to fit a line that is closest to most of the points, which minimizes the distance (error) between each point's y value and the line.
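To make this concrete, here is a minimal Python sketch (not from the original post; the sample x/y values are made up) that fits the coefficients a and b by ordinary least squares with NumPy:

```python
import numpy as np

# Hypothetical data: x = years of experience, y = salary in thousands
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([35.0, 42.0, 50.0, 55.0, 63.0])

# np.polyfit with degree 1 returns the least-squares slope (a) and intercept (b)
a, b = np.polyfit(x, y, 1)
print(f"fitted line: y = {a:.2f}x + {b:.2f}")

# Use the fitted line to predict y for a new x
x_new = 6.0
print("prediction:", a * x_new + b)
```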

    2. Logistic regression

Linear regression predictions are continuous values (for example, rainfall in cm), while logistic regression predictions are discrete values (for example, whether a student passed or failed) obtained after applying a transformation function.

Logistic regression is best suited for binary classification (datasets where y = 0 or 1, and 1 denotes the default class). Example: when predicting whether a patient is sick, sick patients are labelled 1. Logistic regression is named after the transformation function it uses, the logistic function h(x) = 1 / (1 + e^-x), which is an S-shaped curve.

In logistic regression, the output takes the form of probabilities of the default class (unlike linear regression, where the output is produced directly). Because it is a probability, the output lies in the range 0-1. The output (y value) is generated by log-transforming the x value using the logistic function h(x) = 1 / (1 + e^-x). A threshold is then applied to force this probability into a binary classification.

Figure 2: A logistic regression used to determine whether a tumour is malignant or benign. If the probability h(x) > 0.5, the tumour is classified as malignant.

In Figure 2, to determine whether a tumour is malignant, the default variable is y = 1 (tumour = malignant). The x variable could be a measurement of the tumour, such as its size. As shown, the logistic function transforms the x values of the various instances of the dataset into the range 0 to 1. If the probability crosses the threshold of 0.5 (shown by the horizontal line), the tumour is classified as malignant.

The logistic regression equation p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) can be transformed into ln(p(x) / (1 - p(x))) = b0 + b1*x.

The goal of logistic regression is to use the training data to find the values of the coefficients b0 and b1 such that the error between the predicted and actual outcomes is minimized. These coefficients are estimated using the technique of maximum likelihood estimation.
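As an illustrative sketch (assuming scikit-learn and made-up tumour-size data, not the article's own code), the fitted model exposes the probability h(x) through predict_proba, while predict applies the 0.5 threshold to give a binary class:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: tumour size (cm) -> 0 = benign, 1 = malignant
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(benign), P(malignant)]; the second entry is h(x) for the default class
print(model.predict_proba([[2.2]]))
# predict applies the 0.5 threshold and outputs the 0/1 label
print(model.predict([[2.2]]))
```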

    3. CART (classification and regression trees)

Classification and regression trees (CART) are one implementation of decision trees.

The non-terminal nodes are the root node and the internal nodes; the terminal nodes are the leaf nodes. Each non-terminal node represents a single input variable (x) and a split point on that variable, while the leaf nodes represent the output variable (y). The model is used to make predictions as follows: walk along the splits of the tree to reach a leaf node, and output the value present at that leaf node.

The decision tree in Figure 3 classifies whether a person will buy a sports car or a minivan based on their age and marital status. If the person is over 30 years old and not married, we walk the tree as follows: "over 30 years?" -> yes -> "married?" -> no. Hence, the model outputs a sports car.
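Here is a hedged scikit-learn sketch of the same idea, with hypothetical age/marital-status data chosen to roughly match the rules in Figure 3 (0 = minivan, 1 = sports car):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: [age, is_married] -> 0 = minivan, 1 = sports car
X = [[22, 0], [25, 1], [28, 0], [35, 1], [40, 1], [45, 1], [33, 0], [50, 0]]
y = [1, 1, 1, 0, 0, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# Walk a new person through the tree: 32 years old, not married
print(tree.predict([[32, 0]]))  # expected for this toy data: sports car (1)

# Print the learned splitting rules as text
print(export_text(tree, feature_names=["age", "is_married"]))
```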

    4. Naive Bayes

To calculate the probability that an event will occur given that another event has already occurred, we use Bayes' theorem. To calculate the probability of an outcome given the value of some variable, that is, to calculate the probability that a hypothesis (h) is true given our prior knowledge (d), we use Bayes' theorem as follows:

P(h|d) = (P(d|h) * P(h)) / P(d)

where:
P(h|d) = posterior probability. The probability of hypothesis h being true given the data d, where P(h|d) = P(d1|h) * P(d2|h) * ... * P(dn|h) * P(h)
P(d|h) = likelihood. The probability of the data d given that hypothesis h is true.
P(h) = class prior probability. The probability of hypothesis h being true (irrespective of the data).
P(d) = predictor prior probability. The probability of the data (irrespective of the hypothesis).

This algorithm is called "naive" because it assumes that all the variables are independent of each other, which is a naive assumption to make in the real world.

Figure 4: Using naive Bayes to predict the status of "play" given the variable "weather".

Taking Figure 4 as an example, what is the outcome if weather = 'sunny'?

To determine the outcome play = 'yes' or 'no' given the value of the variable weather = 'sunny', calculate P(yes|sunny) and P(no|sunny), and choose the outcome with the higher probability.

    • P(yes|sunny) = (P(sunny|yes) * P(yes)) / P(sunny) = (3/9 * 9/14) / (5/14) = 0.60

    • P(no|sunny) = (P(sunny|no) * P(no)) / P(sunny) = (2/5 * 5/14) / (5/14) = 0.40

Therefore, if the weather = "Sunny", the result is play = "yes".
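The arithmetic above can be checked with a few lines of Python; the counts 3/9, 9/14, 2/5, and 5/14 come straight from the weather/play frequency table behind Figure 4:

```python
# Counts taken from the weather/play frequency table (14 observations)
p_sunny_given_yes = 3 / 9   # P(sunny | yes)
p_yes = 9 / 14              # P(yes)
p_sunny_given_no = 2 / 5    # P(sunny | no)
p_no = 5 / 14               # P(no)
p_sunny = 5 / 14            # P(sunny)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

print(round(p_yes_given_sunny, 2))  # 0.6
print(round(p_no_given_sunny, 2))   # 0.4
print("play =", "yes" if p_yes_given_sunny > p_no_given_sunny else "no")
```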

    5. KNN

Instead of dividing the dataset into training and test sets, the K nearest neighbor algorithm uses the entire data set as the training set.

When an outcome is required for a new data instance, the KNN algorithm goes through the entire dataset to find the k instances nearest to the new instance, that is, the k instances most similar to the new record, and then outputs the mean of the outcomes for a regression problem, or the mode (the most frequent class) for a classification problem. The value of k is user-specified.

The similarity between instances is calculated using measures such as Euclidean distance and Hamming distance.
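A minimal scikit-learn sketch (the 2-D points are invented for illustration); k corresponds to the n_neighbors parameter, and Euclidean distance is the default similarity measure:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D feature vectors and their class labels
X = np.array([[1, 1], [1, 2], [2, 2], [6, 6], [7, 6], [6, 7]])
y = np.array([0, 0, 0, 1, 1, 1])

# The whole dataset acts as the training set; distances default to Euclidean
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# The 3 nearest neighbours of [2, 1] all have label 0, so the majority vote is 0
print(knn.predict([[2, 1]]))
```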

    6. Apriori

The Apriori algorithm is used on transactional databases to mine frequent itemsets and then generate association rules. It is popularly used in market basket analysis to check for combinations of products that frequently co-occur in the database. In general, the association rule "if a person purchases item X, then he purchases item Y" is written as X -> Y.

Example: if a person purchases milk and sugar, then he is very likely to purchase coffee powder. This can be written as the association rule {milk, sugar} -> coffee powder. Association rules are generated after crossing the thresholds for support and confidence.

Figure 5: Formulas for support, confidence, and lift for the association rule X -> Y

The support measure helps prune the number of candidate itemsets that need to be considered during frequent itemset generation. This support measure is guided by the Apriori principle, which states that if an itemset is frequent, then all of its subsets must also be frequent.
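To make the support, confidence, and lift formulas concrete, here is a small self-contained Python sketch over a hypothetical transaction database. It only scores the rule {milk, sugar} -> coffee powder; it is not a full Apriori implementation:

```python
# Hypothetical transaction database
transactions = [
    {"milk", "sugar", "coffee powder"},
    {"milk", "sugar", "coffee powder", "bread"},
    {"milk", "bread"},
    {"sugar", "coffee powder"},
    {"milk", "sugar"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {"milk", "sugar"}   # X
consequent = {"coffee powder"}   # Y

supp = support(antecedent | consequent)   # support(X -> Y)
conf = supp / support(antecedent)         # confidence(X -> Y)
lift = conf / support(consequent)         # lift(X -> Y)
print(f"support={supp:.2f} confidence={conf:.2f} lift={lift:.2f}")
```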

    7. K-means

K-means is an iterative algorithm that groups similar data into clusters. It computes the centroids of k clusters and assigns each data point to the cluster whose centroid has the minimum distance to that point.

Figure 6: Steps of the K-means algorithm

Step 1: K-means initialization:

Choose a value of k. Here, we take k = 3.
Randomly assign each data point to any of the 3 clusters.
Compute the cluster centroid for each of the clusters. The red, blue, and green stars denote the centroids of the 3 clusters.

Step 2: Associate each observation with a cluster:

Reassign each point to the nearest cluster centroid. Here, the upper 5 points are assigned to the cluster with the blue centroid. Follow the same procedure to assign points to the clusters containing the red and green centroids.

Step 3: Recalculate the centroids:

Compute the centroids of the new clusters. The old centroids are shown as gray stars, while the new centroids are the red, green, and blue stars.

Step 4: Iterate, then exit if unchanged.

Repeat steps 2-3 until no points switch from one cluster to another. Once no points switch between two consecutive iterations, the K-means algorithm terminates.
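The assign/recompute loop described in these steps is what scikit-learn's KMeans runs internally; here is a hedged sketch on made-up 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming three loose groups
X = np.array([[1, 1], [1.5, 2], [1, 0.5],
              [8, 8], [8.5, 9], [9, 8],
              [1, 8], [1.5, 9], [0.5, 8.5]])

# n_clusters is the value of k; fit() repeats the assign/recompute steps until no point switches
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.cluster_centers_)     # final centroids
print(kmeans.labels_)              # cluster assigned to each point
print(kmeans.predict([[1, 1.2]]))  # cluster for a new point
```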

    8. PCA

Principal component analysis (PCA) is used to make data easier to explore and visualize by reducing the number of variables. This is done by capturing the maximum variance of the data in a new coordinate system whose axes are called "principal components". Each component is a linear combination of the original variables, and the components are orthogonal to one another. Orthogonality between components indicates that the correlation between them is zero.

The first principal component captures the direction of maximum variability in the data. The second principal component captures the remaining variance in the data but is uncorrelated with the first component. Similarly, all subsequent principal components (PC3, PC4, and so on) capture the remaining variance while being uncorrelated with the previous components.

Figure 7: 3 original variables (genes) reduced to 2 new variables called principal components (PCs)
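A brief illustrative sketch with scikit-learn, using random made-up measurements in place of the 3 "genes" of Figure 7:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 10 samples measured on 3 correlated variables ("genes")
rng = np.random.default_rng(0)
g1 = rng.normal(size=10)
X = np.column_stack([g1,
                     2 * g1 + rng.normal(scale=0.1, size=10),
                     rng.normal(size=10)])

# Keep 2 orthogonal principal components, ordered by the variance they capture
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (10, 2)
print(pca.explained_variance_ratio_)  # PC1 captures the largest share of variance
```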

Ensemble learning techniques

Ensembling means combining the results of multiple learners (classifiers) to improve the final result, by voting or averaging. Voting is used for classification and averaging is used for regression. The idea is that an ensemble of learners performs better than a single learner.

There are three types of ensembling algorithms: bagging, boosting, and stacking. We will not cover stacking here, but if you would like a detailed explanation of it, I can write a separate blog post on it.

    9. Random forest bagging

Random forest is an improvement over bagged decision trees (a single learner).

The first step in bagging is to create multiple models using datasets generated with the bootstrap sampling method. In bootstrap sampling, each generated training set is composed of random subsamples from the original dataset. Each of these training sets is the same size as the original dataset, but some records repeat multiple times and some records do not appear at all. Then, the entire original dataset is used as the test set. Thus, if the size of the original dataset is N, then the size of each generated training set is also N, the number of unique records is about (2N/3), and the size of the test set is also N.

The second step in bagging is to create multiple models by using the same algorithm on the different generated training sets. In this case, let us discuss random forest. Unlike a decision tree, where each node is split on the best feature that minimizes the error, in random forests we choose a random selection of features for constructing the best split. The reason for the randomness is that, even with bagging, when decision trees choose the best feature to split on, they end up with similar structure and correlated predictions. But bagging after splitting on a random subset of features means less correlation among the predictions of the subtrees.

The number of features to be searched at each split point is specified as a parameter to the random forest algorithm.

Thus, in bagging with random forest, each tree is constructed using a random sample of records, and each split is constructed using a random sample of predictors.
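A hedged scikit-learn sketch of this two-step recipe, bootstrap sampling of records plus a random subset of features at each split, on invented data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 200 samples, 6 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# bootstrap=True resamples the records for each tree;
# max_features controls how many features are searched at each split point
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))
```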

    10. AdaBoost

a) Bagging is a parallel ensemble, because each model is built independently. Boosting, on the other hand, is a sequential ensemble, where each model is built to correct the misclassifications of the previous model.

b) Bagging mostly involves "simple voting", where each classifier votes for a final outcome that is determined by the majority of the parallel models; boosting involves "weighted voting", where each classifier again votes on a final outcome determined by the majority, but the sequential models are built by assigning greater weights to the instances misclassified by the previous models.

AdaBoost stands for adaptive boosting.

Figure 8: AdaBoost for decision trees

In Figure 8, steps 1, 2, and 3 involve a weak learner called a decision stump (a one-level decision tree that makes a prediction based on the value of only one input feature; a decision tree whose root is immediately connected to its leaves). The process of constructing weak learners continues until a user-defined number of weak learners has been built or until there is no further improvement from training. Step 4 combines the 3 decision stumps of the previous models (so the combined model has 3 splitting rules).

Step 1: Start with one decision stump and make a decision on one input variable:

The size of the data points shows that we have applied equal weights in classifying them as circles or triangles. The decision stump generates a horizontal line in the top half to classify these points. We can see that two circles are incorrectly predicted as triangles. Hence, we will assign higher weights to these two circles and apply another decision stump.

Step 2: Move to another decision stump and make a decision on another input variable:

We observe that the two misclassified circles from the previous step are drawn larger than the remaining points. Now, the second decision stump will try to predict these two circles correctly.

As a result of the higher weights, these two circles have been correctly classified by the vertical line on the left. But this has now caused the three circles at the top to be misclassified. Hence, we will assign higher weights to these three circles at the top and apply another decision stump.

Step 3: Train another decision stump and make a decision on another input variable:

The three misclassified circles from the previous step are drawn larger than the rest of the data points. Now, a vertical line on the right has been generated to classify the circles and triangles.

Step 4: Combine the decision stumps:

We have combined the separators from the 3 previous models and observe that the complex rule of this combined model classifies the data points correctly, compared to any of the individual weak learners.
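A short illustrative sketch with scikit-learn's AdaBoostClassifier, whose default weak learner is exactly the decision stump described above; the data is made up:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical data: 200 samples, 2 features, binary labels ("circles" vs "triangles")
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The default weak learner is a decision stump (a depth-1 tree);
# each new stump up-weights the samples the previous stumps misclassified
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)

print(ada.score(X, y))            # training accuracy of the combined model
print(ada.predict([[1.0, 0.5]]))  # class label for a new point
```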

Conclusion

To summarize, we have learned:

5 supervised learning techniques: linear regression, logistic regression, CART, naive Bayes, KNN
3 unsupervised learning techniques: Apriori, K-means, PCA
2 ensemble learning techniques: random forest bagging, AdaBoost
