Ten algorithms for Machine Learning (ii)

Source: Internet
Author: User
Tags: svm

Article Source: https://www.dezyre.com/article/top-10-machine-learning-algorithms/202

If you spot any errors, please point them out along with your suggested correction. Examples and code implementations will continue to be added in follow-up updates.

3. Machine Learning Algorithm Overview

3.1 Naive Bayes Classifier Algorithm

Manually classifying web pages, documents, e-mails, or any other lengthy text is laborious and, at scale, virtually impossible. This is the problem the naive Bayes classifier machine learning algorithm solves. A classifier is a function that assigns an element of a population to one of the available categories. For example, spam filtering is a popular application of the naive Bayes classifier algorithm: the spam filter here is a classifier that assigns a label of "spam" or "not spam" to every e-mail message.

The naive Bayes classifier algorithm is one of the most popular learning methods. It groups items by similarity and uses the well-known Bayes theorem of probability to build machine learning models, and it is particularly suited to disease prediction and document classification. It is a simple classifier based on Bayes' theorem that classifies a document by analyzing its word content.

When should you use the naive Bayes classifier machine learning algorithm?

(1) If you have a medium or large training data set.

(2) If the instances have several attributes.

(3) Given the class label, the attributes describing the instances should be conditionally independent.

A. Applications of the naive Bayes classifier

(1) Sentiment analysis - Facebook uses it to analyze status updates and classify them as expressing positive or negative emotion.

(2) Document classification - Google uses document classification to index documents and compute relevance scores, i.e., PageRank. The PageRank mechanism considers pages marked as important in databases that were parsed and categorized using document classification techniques.

(3) Naive Bayesian algorithms are also used to classify news articles about technology, entertainment, sports, politics, etc.

(4) E-mail spam filtering - Gmail uses the naive Bayes algorithm to classify your e-mail messages as spam or not spam.

B. Advantages of the naive Bayes classifier machine learning algorithm

(1) The naive Bayes classifier algorithm performs well when the input variables are categorical.

(2) When the naive Bayes conditional independence assumption holds, the naive Bayes classifier converges faster and requires relatively little training data compared with discriminative models such as logistic regression.

(3) With the naive Bayes classifier algorithm, it is easy to predict the class of a test data set. It is a good bet for multi-class prediction.

(4) Although it requires the conditional independence assumption, the naive Bayes classifier performs well in a variety of application domains.

Data science library implementing naive Bayes in Python - scikit-learn

Data science library implementing naive Bayes in R - e1071
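To make the spam-filtering example concrete, here is a minimal sketch using scikit-learn (not part of the original article); the four messages and their labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training messages with "spam"/"ham" labels.
messages = [
    "win a free prize now",
    "limited offer click here to win",
    "meeting rescheduled to monday",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn each message into word-count features, then fit the classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
clf = MultinomialNB().fit(X, labels)

# Classify a new, unseen message.
new = vectorizer.transform(["click here to win a prize"])
print(clf.predict(new)[0])
```

The classifier scores each word's likelihood under each class, exactly the word-level analysis described above, and picks the most probable label.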

3.2 K-Means Clustering Algorithm

K-means is a widely used unsupervised machine learning algorithm for cluster analysis. K-means is a non-deterministic, iterative method. The algorithm operates on a given data set with a predetermined number of clusters, K. The output of the K-means algorithm is K clusters, with the input data partitioned among them.

For example, consider K-means clustering of Wikipedia search results. The search term "Jaguar" on Wikipedia will return all pages containing the word Jaguar, which can refer to the Jaguar car, the Jaguar version of Mac OS, or the jaguar as an animal. The K-means clustering algorithm can be used to group pages that describe similar concepts: the algorithm will put all pages that talk about the jaguar as an animal into one cluster, group pages about the Jaguar car into another cluster, and so on.

A. Advantages of using the K-means clustering learning algorithm

(1) In the case of globular (spherical) clusters, K-means produces tighter clusters than hierarchical clustering.

(2) For a small value of K, K-means clustering computes faster than hierarchical clustering on a large number of variables.

B. Applications of K-means clustering

The K-means clustering algorithm is used by most search engines (such as Yahoo and Google) to cluster web pages by similarity and to identify the "relevance rate" of search results. This helps search engines reduce computation time for the user.

Data science libraries implementing K-means clustering in Python - SciPy, scikit-learn, Python wrappers

Data science library implementing K-means clustering in R - stats
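As a rough illustration (the six 2-D points below are made up), scikit-learn's KMeans can be asked for K = 2 clusters on two obvious blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of invented points.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

# K is chosen in advance; the algorithm iterates to assign each point
# to the nearest of K cluster centers.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(km.labels_)
```

Points within the same blob receive the same cluster label, which mirrors the web-page-grouping behavior described above.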

3.3 Support Vector Machine Learning Algorithm

A support vector machine (SVM) is a supervised machine learning algorithm for classification or regression problems, in which the data set teaches the SVM about the classes so that the SVM can classify new data. It works by finding a line (hyperplane) that divides the training data set into classes. Because many such linear hyperplanes exist, the SVM algorithm tries to maximize the distance between the classes involved; this is known as margin maximization. If the line that maximizes the distance between the classes is identified, the probability of generalizing well to unseen data is increased.

A. SVMs fall into two categories:

Linear SVM - in a linear SVM, the training data can be separated by a hyperplane, which acts as the classifier.

Nonlinear SVM - in a nonlinear SVM, it is impossible to separate the training data with a hyperplane. For example, training data for face detection consists of one set of images that are faces and another set of images that are not faces (in other words, all images other than faces). Under these conditions, the training data is too complex for a representation to be found for every feature vector: separating the set of faces from the set of non-faces linearly is a complex task.

B. Advantages of using SVM

(1) SVM provides the best classification performance (accuracy) for training data.

(2) SVM is more effective at correctly classifying future data.

(3) The best thing about SVM is that it does not make any strong assumptions about the data.

(4) It does not over-fit the data.

C. Applications of support vector machines

SVM is commonly used by various financial institutions for stock market forecasting. For example, it can be used to compare the relative performance of a stock against other stocks in the same sector. Relative comparisons of stocks help inform investment decisions based on the classifications made by the SVM learning algorithm.

Data science libraries implementing support vector machines in Python - scikit-learn, PyML, SVMstruct Python, LIBSVM

Data science libraries implementing support vector machines in R - klaR, e1071
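A minimal linear-SVM sketch with scikit-learn, on an invented, linearly separable toy set (not from the original article):

```python
from sklearn import svm

# Two invented, linearly separable classes in 2-D.
X = [[0, 0], [1, 1], [0, 1], [9, 9], [10, 10], [9, 10]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel means the classifier is a separating hyperplane
# with maximum margin between the two classes.
clf = svm.SVC(kernel="linear")
clf.fit(X, y)

# A point near the second group falls on the class-1 side of the hyperplane.
print(clf.predict([[8, 9]])[0])
```

Swapping the kernel (e.g. `kernel="rbf"`) gives the nonlinear SVM described above, for data no hyperplane can separate.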

3.4 Apriori Machine Learning algorithm

The Apriori algorithm is an unsupervised machine learning algorithm that generates association rules from a given data set. An association rule says that if item A occurs, then item B also occurs with a certain probability. Most of the association rules generated are in if-then format. For example, if people buy an iPad, they also buy an iPad case. To reach that conclusion, the algorithm first looks at how many people bought an iPad: of, say, 100 people who bought an iPad, 85 also bought an iPad case.

A. Principles underlying the Apriori machine learning algorithm:

If an itemset occurs frequently, then all subsets of that itemset also occur frequently.

If an itemset does not occur frequently, then no superset of that itemset occurs frequently.

B. Advantages of the Apriori algorithm

(1) It is easy to implement and can be easily parallelized.

(2) The Apriori implementation exploits the large-itemset property.

C. Applications of the Apriori algorithm

Detection of adverse drug reactions

The Apriori algorithm is used for association analysis of medical data, such as the drugs each patient takes, the characteristics of each patient, the adverse effects each patient experiences, and the initial diagnosis. The analysis produces association rules that help identify which combinations of patient characteristics and medications lead to adverse side effects of a drug.

Market Basket Analysis

Many e-commerce giants such as Amazon use Apriori to mine data for insights into which products are likely to be purchased together and which respond best to promotions. For example, a retailer might use Apriori to predict that people who buy sugar and flour are likely to buy eggs to bake a cake.

Auto-Complete Application

Google autocomplete is another popular application of Apriori: as a user types a word, the search engine looks for other associated words that people typically type after that particular word.

Data science library implementing the Apriori machine learning algorithm in Python - a Python implementation is available on PyPI (apriori)

Data science library implementing the Apriori machine learning algorithm in R - arules
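Since the PyPI implementations vary, here is a minimal pure-Python sketch of the frequent-itemset stage of Apriori, illustrating both principles above. The transactions and the `frequent_itemsets` helper are invented for illustration:

```python
from itertools import combinations

# Invented market baskets (the iPad / iPad-case example).
transactions = [
    {"ipad", "case"},
    {"ipad", "case", "pen"},
    {"ipad"},
    {"case"},
    {"ipad", "case"},
]
min_support = 3  # an itemset is "frequent" if it appears in >= 3 baskets

def frequent_itemsets(transactions, min_support):
    """Grow candidate itemsets level by level, pruning any candidate
    that has an infrequent subset (the downward-closure principle)."""
    items = {i for t in transactions for i in t}
    frequent = {}
    level = [frozenset([i]) for i in items]
    k = 1
    while level:
        # Count how many baskets contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Build size-(k+1) candidates from unions of surviving sets,
        # keeping only those whose k-subsets are all frequent.
        keys = list(survivors)
        candidates = {a | b for a in keys for b in keys if len(a | b) == k + 1}
        level = [c for c in candidates
                 if all(frozenset(s) in survivors for s in combinations(c, k))]
        k += 1
    return frequent

print(frequent_itemsets(transactions, min_support))
```

Association rules such as "ipad => case" are then read off the frequent itemsets by comparing their support counts.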

3.5 Linear regression machine learning algorithm

The linear regression algorithm shows the relationship between two variables and how a change in one variable affects the other. The algorithm shows the effect on the dependent variable when the independent variable is changed. The independent variables are called explanatory variables, because they explain the effects on the dependent variable; the dependent variable is often referred to as the factor of interest or the predicted variable.

A. Advantages of linear regression machine learning algorithms

(1) It is one of the most understandable machine learning algorithms, making it easy to explain to others.

(2) It is easy to use because it requires minimal tuning.

(3) It is the most widely used machine learning technique and runs fast.

B. Applications of the linear regression algorithm

Estimated Sales

Linear regression is heavily used in business for sales forecasting based on trends. If a company's monthly sales grow steadily, a linear regression analysis of the monthly sales data helps the company forecast sales for the coming months.

Risk assessment

Linear regression helps assess risk in the insurance or financial sector. For example, a health insurance company can run a linear regression analysis of the number of claims per customer against age. Such an analysis helps the insurer discover that older customers tend to make more insurance claims. Results like these play a critical role in important business decisions made to account for risk.

Data science libraries implementing linear regression in Python - statsmodels and scikit-learn

Data science library implementing linear regression in R - stats
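The sales-forecasting idea can be sketched with scikit-learn; the monthly sales figures below are invented and deliberately follow a perfect linear trend so the forecast is exact:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented monthly sales with a steady upward trend of 10 per month.
months = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([100, 110, 120, 130, 140, 150])

# Fit sales = intercept + slope * month.
model = LinearRegression().fit(months, sales)

# Forecast month 7; with this perfectly linear trend the answer is 160.
print(round(float(model.predict([[7]])[0])))
```

In practice real sales data is noisy, and the fitted line gives the trend estimate rather than an exact value.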

3.6 Decision Tree Machine learning algorithm

You are making weekend plans to visit the best restaurant in town, since your parents are visiting, but you are hesitant about which restaurant to choose. Whenever you want to try a restaurant, you ask your friend Tyrion whether he thinks you would like a particular place. To answer your question, Tyrion first has to find out what kind of restaurants you like. You give him a list of restaurants you have been to and tell him whether you liked each one (a labeled training data set). When you ask Tyrion whether you would like a specific restaurant R, he asks you various questions, such as "Is R a rooftop restaurant?", "Does restaurant R serve Italian food?", "Does R have live music?", "Is restaurant R open until midnight?", and so on. Tyrion asks the most informative questions first, to maximize the information gained from your answers, and then gives you a yes or no answer based on your responses to the questionnaire. Here Tyrion is a decision tree for your restaurant preferences.

A decision tree is a graphical representation that uses a branching method to exemplify all possible outcomes of a decision under certain conditions. In a decision tree, each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label, i.e., the decision made after computing all the attributes. Classification rules are represented by the paths from the root to the leaf nodes.

A. Types of decision trees

(1) Classification trees - decision trees used to divide a data set into predefined classes based on the response variable. They are typically used when the response variable is categorical in nature.

(2) Regression trees - used when the response or target variable is continuous or numeric. They are typically used for prediction-type problems rather than classification.

Decision trees can thus be divided into two types according to the type of target variable: continuous-variable decision trees and binary-variable decision trees. It is the target variable that determines which kind of decision tree is needed for a particular problem.

B. Why choose a decision tree algorithm?

(1) These machine learning algorithms help to make decisions under uncertainty and help you improve communication as they provide a visual representation of the decision-making situation.

(2) Decision tree machine learning algorithms help data scientists capture how the behavior of a situation or model would change dramatically if different decisions were taken.

(3) Decision tree algorithms help make the best decisions by allowing data scientists to traverse forward and back computing paths.

C. When to use decision tree machine learning algorithms

(1) Decision trees are robust to errors, and if the training data contains errors, the decision tree algorithm is best suited to solve such problems.

(2) Decision trees are best suited to problems whose instances are represented by attribute-value pairs.

(3) If the training data has missing values, you can use a decision tree because they can handle the missing values well by looking at the data in the other columns.

(4) When the target function has discrete output values, the decision tree is the most suitable.

D. The advantages of decision trees

(1) Decision trees are very intuitive and can easily be explained to anyone. People from non-technical backgrounds can also interpret the hypotheses drawn from a decision tree, because decision trees are self-explanatory.

(2) When using decision tree machine learning algorithms, data types are not constraints because they can handle categorical and numeric variables.

(3) Decision tree machine learning algorithms do not require any assumptions about linearity in the data, so they can be used where parameters are related nonlinearly. These algorithms make no assumptions about the structure or spatial distribution of the classifier.

(4) These algorithms are useful in data exploration. A decision tree implicitly performs feature selection, which is very important in predictive analytics: when a decision tree is fit to a training data set, the nodes at the top of the tree, where it splits, are the most important variables in the data set, so feature selection happens by default.

(5) Decision trees save data preparation time because they are insensitive to missing values and outliers. Missing values do not prevent the data from being split when building the tree, and outliers do not affect it either, because splits are based on the ordering of samples within the split range rather than on exact absolute values.

E. Disadvantages of decision Trees

(1) The greater the number of decisions in a tree, the lower the accuracy of any expected outcome.

(2) A major disadvantage of decision tree machine learning algorithms is that the results are based on expectations. When decisions are made in real time, the payoffs and outcomes may differ from what was expected or planned, which can lead to an unrealistic decision tree and erroneous decisions. Any unreasonable expectation can introduce significant errors and flaws into the decision tree analysis, since it is not always possible to plan for every possibility that may arise from a decision.

(3) Decision trees are not well suited to continuous variables and can produce instability and classification plateaus.

(4) Decision trees are easy to use compared to other decision models, but creating large decision trees with several branches is a complex and time-consuming task.

(5) The decision Tree Machine learning algorithm considers only one attribute at a time and may not be the most appropriate for the actual data in the decision space.

(6) Large decision trees with multiple branches are hard to understand and present a number of display difficulties.

F. Applications of decision tree machine learning algorithms

(1) The decision tree is one of the most popular machine learning algorithms and is very useful for option pricing in finance.

(2) Remote sensing is an application area for pattern recognition based on decision trees.

(3) Banks use decision tree algorithms to classify loan applicants by their probability of defaulting on payments.

(4) Gerber Products Inc., a popular baby products company, uses decision tree machine learning algorithms to determine whether they should continue to use plastic PVC (polyvinyl chloride) in their products.

(5) Rush University Medical Center has developed a tool called Guardian, which uses decision tree machine learning algorithms to identify risky patients and disease trends.

Data science libraries implementing the decision tree machine learning algorithm in Python - SciPy and scikit-learn.

Data science library implementing the decision tree machine learning algorithm in R - caret.
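The Tyrion example above can be sketched with scikit-learn's DecisionTreeClassifier. The restaurant attributes and labels are invented, and in this toy data the rooftop attribute happens to decide the outcome perfectly:

```python
from sklearn.tree import DecisionTreeClassifier

# Invented labeled history. Each row answers Tyrion's questions:
# [rooftop, serves_italian, live_music, open_past_midnight]
restaurants = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]
liked = [1, 1, 0, 0, 1]  # 1 = you liked the restaurant

# The tree learns which attribute tests best separate liked from disliked.
clf = DecisionTreeClassifier(random_state=0).fit(restaurants, liked)

# Ask about a new rooftop restaurant with live music, open late.
print(clf.predict([[1, 0, 1, 1]])[0])
```

Inspecting the fitted tree (e.g. with `sklearn.tree.export_text`) shows the learned question order, mirroring how Tyrion asks the most informative question first.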

3.7 Random Forest machine learning algorithm

Let's continue with the example we used for decision trees to explain how random forest machine learning algorithms work. Tyrion is the decision tree for your restaurant preferences. However, Tyrion, being human, is not always accurate in predicting your restaurant preferences. To get a more accurate recommendation, you ask several friends and decide to visit restaurant R if most of them say you will like it. Instead of asking only Tyrion, you also ask Jon Snow, Sandor, Bronn, and Bran, who vote on whether you will like restaurant R. This means you have built an ensemble classifier of decision trees - also known as a forest.

You don't want all your friends to give you the same answer, so you give each friend slightly different data. You are also unsure of your own restaurant preferences and face a dilemma. You told Tyrion that you like open rooftop restaurants, but perhaps you only liked them because you visited in summer; in a cold winter you might not be a fan of such restaurants. Therefore, not all of your friends should use the "likes open rooftop restaurants" data point when making their recommendations.

By giving your friends slightly different restaurant preference data, you effectively let them ask you different questions at different times. By slightly varying your restaurant preferences, you are injecting randomness at the model level (unlike a decision tree, where randomness is at the data level). Your group of friends now forms a random forest of your restaurant preferences.

Random forest is a machine learning algorithm that uses a bagging approach to build decision trees on random subsets of the data. The model is trained multiple times on random samples of the data set to achieve good predictive performance. In this ensemble learning approach, the outputs of all decision trees in the random forest are combined to make the final prediction, which is derived by polling the result of each decision tree or simply by taking the most frequent prediction among the trees.

For example, in the scenario above, if 5 friends decide that you will like restaurant R but only 2 decide that you won't, then the final prediction is that you will like restaurant R, because the majority always wins.

A. Why use a random forest machine learning algorithm?

(1) There are a lot of good open source, free implementations of the algorithms available in Python and R.

(2) It maintains accuracy when data is missing and is also resistant to outliers.

(3) It is simple to use: the basic random forest algorithm can be implemented in only a few lines of code.

(4) Random forest machine learning algorithms help data scientists save time on data preparation, because they require no input preparation and can handle numeric, binary, and categorical features without scaling, transformation, or modification.

(5) It performs implicit feature selection, giving estimates of which variables are important in the classification.

B. Advantages of using random forest machine learning algorithms

(1) Unlike decision tree machine learning algorithms, overfitting is not a problem for random forests. There is no need to prune a random forest.

(2) These algorithms are fast, though not in all cases. A random forest algorithm run on an 800 MHz machine, with a data set of 100 variables and 50,000 cases, generated 100 decision trees in 11 minutes.

(3) Random forests are one of the most effective and versatile machine learning algorithms for various classification and regression tasks because they are more robust to noise.

(4) It is difficult to build a bad random forest. In implementations of random forest machine learning algorithms, it is easy to decide which parameters to use, because the algorithm is not sensitive to the parameters used to run it. One can easily build a decent model without much tuning.

(5) Random forest machine learning algorithms can be grown in parallel.

(6) This algorithm runs efficiently on large databases.

(7) It has high classification accuracy.

C. Disadvantages of using random forest machine learning algorithms

(1) They may be easy to use, but analyzing them theoretically is difficult.

(2) A large number of decision trees in a random forest can slow down the algorithm for real-time prediction.

(3) If the data consists of categorical variables with different numbers of levels, the algorithm favors the attributes with more levels. In that case, the variable importance scores are not reliable.

(4) When the random forest algorithm is used for regression tasks, its predictions cannot exceed the range of response values seen in the training data.

D. Applications of random forest machine learning algorithms

(1) The random forest algorithm is used by banks to predict whether a loan applicant is likely to be high risk.

(2) They are used in the automotive industry to predict the breakdown or failure of mechanical parts.

(3) These algorithms are used in the healthcare industry to predict whether a patient is likely to develop a chronic disease.

(4) They can also be used for regression tasks, such as predicting the average number of social media shares and performance scores.

(5) Recently, the algorithm has also been used to predict patterns in speech recognition software and to classify images and text.

The data science library implementing the random forest machine learning algorithm in Python is scikit-learn.

The data science library implementing the random forest machine learning algorithm in R is randomForest.
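Continuing the toy restaurant data (invented for illustration), scikit-learn's RandomForestClassifier plays the role of the voting friends:

```python
from sklearn.ensemble import RandomForestClassifier

# Invented labeled history, as in the decision tree section:
# [rooftop, serves_italian, live_music, open_past_midnight]
restaurants = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
]
liked = [1, 1, 0, 0, 1, 0]

# 25 trees, each fit on a bootstrap sample of the data (bagging):
# each tree is one "friend" with a slightly different view of your history.
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(restaurants, liked)

# The majority vote of the trees decides the recommendation.
print(forest.predict([[1, 0, 1, 1]])[0])
```

Each tree also sees a random subset of attributes at each split, which is the model-level randomness described above.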

3.8 Logistic Regression

The name of this algorithm may be a bit confusing: the logistic regression machine learning algorithm is for classification tasks, not regression problems. The name "regression" refers to the linear model that is fit in the feature space. The algorithm applies a logistic function to a linear combination of features to predict the outcome of a categorical dependent variable based on predictor variables.

The probability that a single trial produces a particular outcome is modeled as a function of the explanatory variables. The logistic regression algorithm helps estimate the probability of falling into a specific level of the categorical dependent variable, given a set of predictor variables.

Suppose you want to predict whether it will snow in New York tomorrow. Here the prediction is not a continuous number, because there will either be snowfall or no snowfall, so linear regression cannot be applied. The outcome variable is one of several categories, and logistic regression is helpful here.
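A minimal sketch of the snowfall example with scikit-learn; the temperature/snowfall pairs are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: daily low temperature (deg C) vs. whether it snowed (1/0).
temps = np.array([[-10], [-5], [-3], [2], [8], [15]])
snow = np.array([1, 1, 1, 0, 0, 0])

# The model fits a linear score of temperature and squashes it through
# the logistic function to get a probability of snow.
model = LogisticRegression().fit(temps, snow)

print(model.predict([[-8]])[0], model.predict([[12]])[0])
```

`model.predict_proba` exposes the estimated probability itself, which is the quantity the paragraph above describes.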

(The remaining content will be supplemented in a future update.)
