Summary of machine learning algorithms


    1. Bayesian Classifier

(How to create a document classification system for spam filtering, or to divide a set of documents based on a fuzzy search for keywords)

A Bayesian classifier is typically used for document processing, but it can be applied to any other kind of dataset, as long as the data can be transformed into lists of features. A feature is something that is either present or absent in a given item. (In a document, the features are the words it contains.)

* Training

Training with samples (this is a supervised algorithm).

The classifier records every feature and the numerical probability that it is associated with each particular classification, i.e., it keeps a list of features with their corresponding probabilities.

* Classification

After training, new items can be classified automatically;

A method is needed to combine the probabilities of all the individual features into an overall probability for each classification.
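A minimal sketch of such a classifier (an assumed implementation; naive Bayes with add-one smoothing is one standard way to combine the feature probabilities, done here in log space to avoid underflow):

```python
# Naive Bayes sketch: record feature counts per category during training,
# then combine the per-feature probabilities to classify a new item.
from collections import defaultdict
import math

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))  # category -> word -> count
        self.doc_counts = defaultdict(int)                        # category -> document count

    def train(self, words, category):
        self.doc_counts[category] += 1
        for word in set(words):
            self.word_counts[category][word] += 1

    def classify(self, words):
        total = sum(self.doc_counts.values())
        scores = {}
        for category, docs in self.doc_counts.items():
            score = math.log(docs / total)        # prior probability of the category
            for word in set(words):               # combine the feature probabilities
                p = (self.word_counts[category][word] + 1) / (docs + 2)  # add-one smoothing
                score += math.log(p)
            scores[category] = score
        return max(scores, key=scores.get)

nb = NaiveBayes()
nb.train(["cheap", "offer", "now"], "spam")
nb.train(["meeting", "agenda", "notes"], "not spam")
print(nb.classify(["cheap", "meeting", "offer"]))  # -> 'spam'
```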

    2. Decision Tree Classifier

(How to model user behavior based on server logs)

The algorithm builds a decision tree from the root down: at each step it selects an attribute and uses that attribute to split the data in the best possible way.

To create the root node, the first step is to try each variable in turn and pick the one that gives the best split (with a larger dataset there will not always be a clean split; to measure how good a split is, we need the concept of entropy).

Entropy (the entropy of each subset is used to compute information gain): low entropy means most of the elements in the set are homogeneous; an entropy of 0 means all the elements are of the same type.
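A short sketch of entropy and the information gain of a candidate split (the formulas are the standard ones; the text describes them only qualitatively):

```python
# Entropy of a set of labels, and the information gain of splitting the set
# into a left and right part.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probs)

def information_gain(parent, left, right):
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

print(entropy(["a", "a", "a"]))       # 0.0: all elements are of the same type
print(entropy(["a", "a", "b", "b"]))  # 1.0: maximally mixed
# A split that separates the two types perfectly gains a full bit:
print(information_gain(["a", "a", "b", "b"], ["a", "a"], ["b", "b"]))  # 1.0
```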

    3. Neural Networks

(How to build a neural network from the links users have previously clicked in order to adjust the ranking of search results. The network can identify which combinations of words are most important for a query and which words are unimportant. Neural networks can be used not only for classification but also for numerical prediction problems.)

Neural networks here means the multilayer perceptron network (Multilayer Perceptron Network).

A multilayer perceptron network consists of one layer of input neurons, one or more layers of hidden neurons, and one layer of output neurons.

Adjacent layers are connected by synapses; each synapse has an associated weight, and the greater the weight, the greater its effect on the neuron's output.

* Simple example: Junk e-mail filtering problem

First of all, the synapse weights need to be set (how they are set is covered under Training below). Neurons in the first layer respond to the words used as input: if a word is present in the message, the neurons most strongly associated with that word are activated. Neurons in the second layer receive input from the first layer (every neuron in one layer is connected to every neuron in the next), so they respond to combinations of words. Finally, these neurons feed their results to the outputs, forming strong or weak associations, and the final decision is made by determining which output is strongest.

Example summary: a multilayer neural network can easily handle combinations of features that represent different things.
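A minimal sketch of a forward pass through such a network (the layer sizes and random weights are made up for illustration):

```python
# Forward pass through a multilayer perceptron: words in, one output
# strength per class out.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([1.0, 0.0, 1.0])        # which of 3 known words appear in the message
w_hidden = np.random.randn(3, 4)     # synapse weights: input layer -> 4 hidden neurons
w_output = np.random.randn(4, 2)     # synapse weights: hidden layer -> [spam, not spam]

hidden = sigmoid(x @ w_hidden)       # hidden neurons respond to combinations of words
output = sigmoid(hidden @ w_output)  # one strength per class
print("spam" if output[0] > output[1] else "not spam")  # the strongest output wins
```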

* Training

The real power of neural networks is that they can start with random weights and then learn from samples through training. The most common way to train a neural network is backpropagation:

First, start with a sample together with its correct answer (spam / not spam).

The sample is then fed into the neural network to observe its current guess.

At first the network may give the spam output a higher value than the non-spam output. If that is incorrect, it needs to be fixed: the network is told that the correct value for the spam output is close to 0 and the correct value for the non-spam output is close to 1.

The weights of the synapses pointing to the spam output are trimmed down in proportion to each hidden-layer node's contribution, while the weights pointing to the non-spam output are turned up. The synapse weights between the input and hidden layers are adjusted according to their contribution to the important nodes in the output layer (for one common adjustment formula, see the sketch below).
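The text leaves the adjustment formula open; one common choice is gradient descent on a squared-error loss, sketched here under those assumptions (sigmoid activations, learning rate 0.5, a single training sample):

```python
# One backpropagation loop on a tiny network: start with random weights,
# then repeatedly nudge every synapse in proportion to its share of the error.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([1.0, 0.0, 1.0])        # sample: which words appear
target = np.array([0.0, 1.0])        # correct answer: spam output 0, non-spam output 1
w_hidden = np.random.randn(3, 4)     # start with random weights
w_output = np.random.randn(4, 2)

for _ in range(200):
    hidden = sigmoid(x @ w_hidden)                                # current guess
    output = sigmoid(hidden @ w_output)
    delta_out = (output - target) * output * (1 - output)         # output-layer error
    delta_hid = (delta_out @ w_output.T) * hidden * (1 - hidden)  # each hidden node's share
    w_output -= 0.5 * np.outer(hidden, delta_out)                 # trim/boost output synapses
    w_hidden -= 0.5 * np.outer(x, delta_hid)                      # adjust input-hidden synapses

print(output)  # spam output near 0, non-spam output near 1
```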

    4. Support Vector Machines (SVM)

The SVM accepts datasets as numeric input and tries to predict which category each item belongs to.

SVM constructs a predictive model by looking for a dividing line between two categories

The dividing line found by the support vector machine separates the data as cleanly as possible, which means the line lies at the maximum possible distance from the points near it.

The best line is the one with the most room on either side: the only points needed to determine where the line lies are the points closest to it, and these are called the support vectors.
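A small sketch using scikit-learn (an assumed dependency, not mentioned in the text): a linear SVM recovers the dividing line, and the fitted model exposes the support vectors that determine it:

```python
# Fit a linear SVM on two small clusters and inspect the support vectors.
from sklearn import svm

X = [[1, 1], [2, 1], [7, 8], [8, 7]]   # two clusters of points
y = [0, 0, 1, 1]                       # their categories

clf = svm.SVC(kernel="linear")
clf.fit(X, y)
print(clf.support_vectors_)            # the points closest to the dividing line
print(clf.predict([[3, 2], [6, 6]]))   # -> [0 1]
```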

The Kernel Trick

If no straight line in the plot can separate the data, a "linear classifier" cannot find an effective division until the data is transformed in some way.

Method one: transform the data into a different space, possibly one with more than two dimensions, by applying a different function to each axis, for example a "polynomial transformation". In the new plot the data may be distributed so that a linear classifier can find the dividing line.

Method two: in reality, finding a dividing line usually means transforming the coordinate points into a much more complex space, sometimes with thousands of dimensions or even infinitely many, so a polynomial transformation is not always feasible. The "kernel trick" does not transform the space at all: it replaces the original dot-product function with a new function, one that returns what the dot product would have been if the data had first been transformed into the other space.
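A sketch of the kernel trick on assumed example data: a ring of points around a central cluster has no straight dividing line, but an RBF-kernel SVM separates the two classes without explicitly transforming the points:

```python
# The kernel replaces the dot product, so no explicit transformation is needed.
import numpy as np
from sklearn import svm

angles = np.linspace(0, 2 * np.pi, 20)
ring = np.c_[np.cos(angles), np.sin(angles)]   # outer ring, category 1
center = np.random.randn(20, 2) * 0.1          # inner cluster, category 0
X = np.vstack([center, ring])
y = [0] * 20 + [1] * 20

clf = svm.SVC(kernel="rbf")                    # kernel function instead of a dot product
clf.fit(X, y)
print(clf.predict([[0.0, 0.0], [1.0, 0.0]]))   # -> [0 1]
```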

    5. k-Nearest Neighbors (kNN)

(How to construct a price forecast model for a given set of samples)

* Working principle:

The algorithm accepts a new data item for numerical prediction and compares it to a set of data items that have already been assigned values. It finds the several items closest to the item being evaluated and averages them to produce the final result.

k: the number of items, i.e., the number of best matches used for averaging.

* Extension

Weight the average according to distance to each neighbor: nearer neighbors are given a higher weight than more distant ones.
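A minimal sketch of kNN prediction with inverse-distance weighting (an assumed implementation; the +0.1 in the weight avoids division by zero for exact matches):

```python
# kNN numerical prediction: find the k closest items, then take a weighted
# average where closer neighbors count more.
import math

def knn_predict(data, point, k=2):
    """data: list of (features, value) pairs; point: features to predict for."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(data, key=lambda item: dist(item[0], point))[:k]
    weights = [1.0 / (dist(feats, point) + 0.1) for feats, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

prices = [((1.0,), 299), ((2.0,), 349), ((3.0,), 399)]  # (features, price) samples
print(knn_predict(prices, (2.1,), k=2))  # near 349, pulled slightly toward 399
```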

* Variable scaling and superfluous variables

The big problem with the kNN algorithm: it considers all variables, important or not.

By adjusting the data before distances are computed, the values of some variables are amplified and others scaled down; a completely useless variable can be multiplied by 0, and a meaningful variable with a very wide range of values can be scaled into a range comparable to the other variables.

Scaling: the prediction algorithm must be cross-validated to judge how good a given set of scaling factors is, i.e., to decide which factors should be used when predicting new data.

Cross-validation: first remove part of the data from the dataset, then use the remaining data to infer the removed part; the algorithm evaluates the quality of that prediction. By cross-validating different scaling factors you obtain an error rate for each, which tells you which scaling factors should be used when guessing new data.

Example: remove an item whose value is 349; with k=2 the inferred value is (399+299)/2 = 349, so the error, (inferred value - actual value) squared, is 0.
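A sketch of that procedure, reusing knn_predict and prices from the kNN sketch above (the scale argument is a hypothetical list of per-variable factors, applied before distances are computed):

```python
# Leave-one-out cross-validation of a set of scaling factors: scale the data,
# remove each item in turn, and predict it from the rest.
def cross_validate(data, scale, k=2):
    scaled = [([f * s for f, s in zip(feats, scale)], v) for feats, v in data]
    total = 0.0
    for i, (feats, actual) in enumerate(scaled):
        rest = scaled[:i] + scaled[i + 1:]     # remove one item from the dataset
        guess = knn_predict(rest, feats, k)    # infer it from the remaining data
        total += (guess - actual) ** 2         # squared error, as in the example
    return total / len(data)

# Lower error means better scaling factors; a factor of 0 drops a useless variable.
print(cross_validate(prices, scale=[1.0]))
```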

## Hierarchical clustering and K-means clustering are unsupervised learning techniques: no data samples are required for training, because these methods are not meant to be used for prediction

(How to take a group of popular blogs and cluster them automatically, revealing which blogs naturally group together and which have similar topics or use similar words)

    6. Clustering

Hierarchical clustering

* How it works

Find the two closest data items and merge them; the position of the new cluster equals the average of the two original items' positions. Continue until every data item is contained in one large cluster.

* Hierarchy

Because the process eventually forms a hierarchy, and the hierarchy can be displayed as a tree (a dendrogram), we can pick any node in the tree and decide whether it represents a meaningful group.
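A short sketch using scipy and matplotlib (assumed dependencies) that builds the hierarchy and draws the dendrogram:

```python
# Merge the closest items until one large cluster remains, then display the
# hierarchy as a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

blogs = np.random.rand(10, 5)             # 10 blogs described by 5 word frequencies
tree = linkage(blogs, method="centroid")  # each merge sits at the merged items' average
dendrogram(tree)                          # the tree view of the hierarchy
plt.show()
```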

K-Means clustering

K-means clustering simply splits the data into a set number of groups; you must specify how many groups you want before the operation begins.

* Operation process (a sketch follows these steps)

    1. Randomly generate two center points
    2. Assign each data item to the nearest center point
    3. (The center points move) Compute the average position of the items previously assigned to each center and move that center to the new position
    4. When redistribution occurs, i.e., a data item is found to be closer to the other group's center point, the item is moved into that group; steps 2 and 3 repeat until assignments stop changing
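A minimal sketch of those steps (an assumed implementation; the rare case of a center losing all its items is ignored for brevity):

```python
# k-means: random centers, assign to nearest, move centers to the average,
# repeat.
import numpy as np

def kmeans(data, k=2, iterations=10):
    centers = data[np.random.choice(len(data), k, replace=False)]  # step 1
    for _ in range(iterations):
        # step 2: assign each item to the nearest center
        labels = np.argmin(((data[:, None] - centers) ** 2).sum(axis=2), axis=1)
        # step 3: move each center to the average of its assigned items
        centers = np.array([data[labels == i].mean(axis=0) for i in range(k)])
    return centers, labels  # step 4 happens implicitly on the next pass

data = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
centers, labels = kmeans(data, k=2)
print(centers)  # one center near (0, 0), the other near (5, 5)
```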

## Multidimensional scaling is also an unsupervised technique: it is not used to make predictions, but to make it easier to understand how closely different data items are related

    7. Multidimensional Scaling

Multidimensional scaling builds a low-dimensional representation of the dataset in which the distances between items stay as close as possible to those in the original dataset. For output on screen or paper, multidimensional scaling usually means moving the data from many dimensions down to two.

Example: take a 4-dimensional dataset (every item has 4 associated values) and use the Euclidean distance formula to compute the distance between every pair of items. All the items are then placed on a 2-dimensional chart and the pairwise distances are recomputed there; because of the resulting errors, the nodes must be moved again and again, until moving them can no longer reduce the total error.
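A sketch of that procedure (an assumed implementation: each node is nudged against the distance error until further movement stops helping):

```python
# Multidimensional scaling: start at random 2-D positions, then keep moving
# each node to shrink the total error between 2-D and original distances.
import numpy as np

def mds(data, rate=0.01, iterations=1000):
    n = len(data)
    target = np.sqrt(((data[:, None] - data[None, :]) ** 2).sum(axis=2))  # real distances
    pos = np.random.rand(n, 2)                                            # random 2-D layout
    for _ in range(iterations):
        d = np.sqrt(((pos[:, None] - pos[None, :]) ** 2).sum(axis=2))     # current distances
        np.fill_diagonal(d, 1.0)                                          # avoid divide-by-zero
        # move every node along each pairwise direction, proportional to the error
        grad = ((d - target) / d)[:, :, None] * (pos[:, None] - pos[None, :])
        pos -= rate * grad.sum(axis=1)
    return pos

items = np.random.rand(5, 4)  # a 4-dimensional dataset: 5 items, 4 values each
print(mds(items))             # 2-D coordinates suitable for plotting
```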

## Non-negative matrix factorization is also an unsupervised algorithm: it does not predict classifications or numerical values, but helps us identify the characteristics of the data

    8. Non-Negative Matrix Factorization (NMF)

(How to observe the different themes that make up news stories, and how to break stock trading volume down into a series of news events that immediately affect one or more stocks)

* The role of NMF

(Precondition: the weights and features behind the data, the observations, are not known)

NMF finds possible values for the features and weights. Its goal is to automatically find a "feature matrix" and a "weight matrix" whose matrix product reproduces the "dataset matrix". It starts with random matrices and applies a series of update rules to them, until the product of the feature matrix and the weight matrix is close enough to the data matrix.
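One well-known set of update rules is the multiplicative update of Lee and Seung, assumed here as the otherwise unspecified rules:

```python
# Start with random matrices and update until weights @ features
# approximates the data matrix.
import numpy as np

def nmf(data, features=2, iterations=200):
    n, m = data.shape
    w = np.random.rand(n, features)      # weight matrix
    h = np.random.rand(features, m)      # feature matrix
    for _ in range(iterations):
        h *= (w.T @ data) / (w.T @ w @ h + 1e-9)
        w *= (data @ h.T) / (w @ h @ h.T + 1e-9)
    return w, h

articles = np.random.rand(10, 6)         # dataset matrix: 10 articles x 6 word counts
w, h = nmf(articles)
print(np.abs(articles - w @ h).mean())   # reconstruction error, small after updates
```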

* Meaning of the result

The feature matrix can tell us about the many factors that lie behind the data.

## Optimization is not about processing a dataset: it tries to find the values that minimize the output of a cost function

For such cost problems, once the cost function has been designed, an algorithm (simulated annealing, a genetic algorithm) can be used to solve it.

    9. Optimization

Cost function:

(A cost function for optimization has many variables to consider, and it is sometimes unclear exactly which of them should be modified to make the final result better.)

(Here only a function of one variable is considered.)

First, with one variable it is easy to plot the function and read off its lowest point (this does not work for complex functions of many variables).

Note: there are usually many local minima. If we simply pick a random solution and walk down the slope, we will not always find the optimal solution; it is very likely we will fall into a region containing a local minimum and never find the global minimum.

a) Simulated annealing

(Inspired by alloy cooling in physics)

Start with a random guess, then pick a direction at random, find a nearby approximate solution, and compute its cost;

If the cost becomes smaller, the new solution replaces the original;

If the cost becomes larger, the probability that the new solution replaces the old one depends on the current temperature. The temperature starts at a relatively high value and decreases slowly, which is why the algorithm is more willing to accept worse-performing solutions early in its execution; this lets it effectively avoid falling into a local minimum. When the temperature reaches 0, the algorithm returns the current solution.
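A sketch of the procedure (the cost function, cooling rate, and step size are assumed for illustration):

```python
# Simulated annealing over one variable: worse solutions are accepted with a
# temperature-dependent probability, which shrinks as the temperature cools.
import math
import random

def cost(x):
    return x ** 2 + 10 * math.cos(x)    # a function with several local minima

def anneal(temperature=10000.0, cool=0.95, step=1.0):
    x = random.uniform(-10, 10)         # start with a random guess
    while temperature > 0.1:
        new_x = x + random.uniform(-step, step)   # a nearby approximate solution
        delta = cost(new_x) - cost(x)
        # accept if better; if worse, accept with probability based on temperature
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            x = new_x
        temperature *= cool             # the temperature slowly decreases
    return x

best = anneal()
print(best, "with cost", cost(best))
```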

b) Genetic algorithms

(Inspired by the theory of evolution)

Start with a set of random solutions, called the population;

The best members of the population, those with the lowest cost, are selected and modified, either by small random changes (mutation) or by combining traits (crossover);

This yields a new group, called the next generation;

Repeat until a threshold is reached, or the population goes several generations without any improvement, or the maximum number of generations is reached; then terminate.
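A minimal sketch (assumed representation: each member of the population is a single number; averaging two parents stands in for crossover, a random nudge for mutation):

```python
# A genetic algorithm: keep the lowest-cost members, breed replacements by
# mutation or crossover, and repeat for a fixed number of generations.
import random

def evolve(cost, size=50, generations=100, elite=0.2, mutate_prob=0.3):
    population = [random.uniform(-10, 10) for _ in range(size)]   # random population
    n_elite = int(size * elite)
    for _ in range(generations):
        ranked = sorted(population, key=cost)[:n_elite]           # lowest-cost members
        children = []
        while n_elite + len(children) < size:
            if random.random() < mutate_prob:
                children.append(random.choice(ranked) + random.uniform(-1, 1))  # mutation
            else:
                a, b = random.sample(ranked, 2)
                children.append((a + b) / 2)                                    # crossover
        population = ranked + children                            # the next generation
    return min(population, key=cost)

print(evolve(lambda x: x ** 2))  # -> a value near 0
```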
