Machine Learning Summary (1)
Intelligence:
The word "intelligence" can be defined in many ways. Here we define it as being able to make the right decision based on certain situations. Knowledge is required to make a good decision, and this knowledge must be operable, for example, interpreting sensor data and using it for decision making.
Artificial Intelligence:
Computers do what the programs humans write tell them to do, and these programs can be very useful; in that sense the computer has gained a certain degree of intelligence. At the beginning of the 21st century, however, there were still many tasks that humans and animals could do very easily but computers could not. Many of these tasks fall under the label of artificial intelligence, including many perception and control tasks. Why can't we just write programs to do these things? I believe it is because we do not explicitly know how to do these tasks, even though our brains can; the knowledge involved is implicit. But we can obtain this knowledge from data and examples, for instance by observing what a human does for a given input. How can we give machines that kind of knowledge? Using data and examples to build operational knowledge is machine learning.
Machine Learning:
Machine learning has a long history, and many textbooks explain its most useful principles. Here we focus on a few of the most relevant topics.
Formalizing learning:
First, let's formalize the most general machine learning framework. We are given a set of samples:
D = {z_1, z_2, ..., z_n}
Each z_i is a sample drawn from an unknown process P(z). We are also given a loss function L that takes two arguments, a decision function f and a sample z, and returns a real number. The goal is to find an f that minimizes the expected value of L(f, z) under P(z).
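As a rough illustration of this framework (my own sketch, not from the original text), the snippet below approximates the expected loss by the average loss over the observed samples, which is what most learning algorithms actually minimize; the names empirical_risk, decision_fn, loss, and samples are placeholders.

```python
import numpy as np

# Hypothetical sketch: approximate E[L(f, z)] by the average loss over D = {z_1, ..., z_n}.
def empirical_risk(decision_fn, loss, samples):
    """Average of L(f, z_i) over the observed samples."""
    return np.mean([loss(decision_fn, z) for z in samples])

# Toy example: z = (x, y), f(x) = 2x, squared-error loss.
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
decision_fn = lambda x: 2.0 * x
loss = lambda f, z: (f(z[0]) - z[1]) ** 2

print(empirical_risk(decision_fn, loss, samples))  # small value: f fits the data well
```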
Supervised Learning:
In supervised learning, every sample has the form z = (x, y), and the decision function f takes x as input. The two most common settings are listed below (a small loss-function sketch follows the list):
- Regression: y is a real number or vector, and the output of f lies in the same set as y. A common loss function is the squared error:
L(f, (x, y)) = ||f(x) - y||^2
- Classification: y is a finite positive integer corresponding to a class index. A common loss function is the negative log-likelihood, with f_i(x) = P(Y = i | X = x), i.e. the probability of class i given x:
L(f, (x, y)) = -log f_y(x), under the constraints f_i(x) >= 0 and sum_i f_i(x) = 1
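As a minimal sketch (my own illustration, not from the text), the code below evaluates the two losses above on toy values; probs stands for the vector (f_1(x), ..., f_k(x)).

```python
import numpy as np

# Squared-error loss for regression: L(f, (x, y)) = ||f(x) - y||^2
def regression_loss(prediction, y):
    return float(np.sum((np.asarray(prediction) - np.asarray(y)) ** 2))

# Negative log-likelihood for classification: L(f, (x, y)) = -log f_y(x),
# where probs[i] = P(Y = i | X = x), probs >= 0 and probs sums to 1.
def classification_loss(probs, y):
    return float(-np.log(probs[y]))

print(regression_loss([1.5, 0.0], [1.0, 0.5]))            # 0.5
print(classification_loss(np.array([0.1, 0.7, 0.2]), 1))  # -log(0.7) ≈ 0.357
```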
Unsupervised learning:
In unsupervised learning, we learn a function f that helps describe an unknown probability distribution P(z). Some functions directly estimate P(z) itself (this is called density estimation). In other cases, f tries to describe where the density is mainly concentrated. Clustering algorithms divide the input space into regions (usually centered on a prototype). Some clustering algorithms create a hard partition (such as k-means), while others create a soft partition (such as the Gaussian mixture model), which gives the probability that z belongs to each cluster. Other unsupervised learning algorithms build a new representation of z; many deep learning algorithms belong to this class, and so does PCA.
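To make the hard vs. soft partition distinction concrete, here is a small sketch using scikit-learn (my choice of library, not mentioned in the text): KMeans assigns each point to exactly one cluster, while GaussianMixture returns a probability for each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Two well-separated blobs plus one ambiguous point halfway between them.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(3.0, 0.3, (50, 2)),
               [[1.5, 1.5]]])

# Hard partition: each sample gets exactly one cluster index.
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard[-1])    # the ambiguous point is forced into one cluster

# Soft partition: each sample gets a probability for each cluster.
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)
print(soft[-1])    # roughly [0.5, 0.5] for the ambiguous point
```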
Direct generalization:
Most generic learning algorithms rely on a single principle for direct (local) generalization: if sample a is close to sample b, then the outputs f(a) and f(b) should also be close. This is the basic principle of generalization by local interpolation. The principle is very powerful, but it has limitations: what if the function to be learned has many more variations than there are training samples? Then local generalization cannot work, because we would need at least as many samples as the target function has variations in order to cover all of them with this principle alone. In other words, given the data D, the functions the human brain learns are not limited to a single simple one; it learns functions with many variations, and in such cases direct generalization no longer holds.
This problem is deeply related to the so-called curse of dimensionality, for the following reason.
When the input space is high-dimensional, the number of relevant configurations grows exponentially, and it is very likely that the function to be learned has that many variations. For example, suppose we want to distinguish 10 different values for each of n input variables, and we care about all 10^n configurations of these n variables. Using direct generalization alone, we would need at least one sample per configuration, i.e., on the order of 10^n samples, to generalize over all of them.
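A quick back-of-the-envelope check of this counting argument (my own illustration):

```python
# Number of configurations to cover if each of n variables takes 10 distinct values.
for n in (2, 5, 10):
    print(n, 10 ** n)   # 100, 100000, 10000000000 samples needed with purely local generalization
```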
Distributed representations vs. local representations, and indirect generalization
A simple local representation of an integer N < B is a sequence of B bits that are all 0 except for the N-th bit (a one-hot code). A simple distributed representation of the same integer N is a sequence of about log2(B) bits, the ordinary binary encoding of N. This example shows that a distributed representation can be exponentially more compact than a local one: with the same number of free parameters, it can distinguish exponentially many more configurations. Distributed representations also offer the potential for better generalization, because learning theory suggests that the number of examples needed scales with the number of free parameters to tune, O(B).
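A small sketch (not from the text) contrasting the two encodings of an integer N < B:

```python
import math

def local_one_hot(n, B):
    """Local representation: B bits, all zero except bit n."""
    return [1 if i == n else 0 for i in range(B)]

def distributed_binary(n, B):
    """Distributed representation: ceil(log2(B)) bits, the ordinary binary code of n."""
    width = math.ceil(math.log2(B))
    return [int(b) for b in format(n, f"0{width}b")]

B = 8
print(local_one_hot(5, B))       # [0, 0, 0, 0, 0, 1, 0, 0]  -> 8 bits
print(distributed_binary(5, B))  # [1, 0, 1]                  -> only log2(8) = 3 bits
```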
Another illustration of the difference is clustering versus principal component analysis (PCA) or restricted Boltzmann machines (RBMs): the former is local, the latter two are distributed.
With k-means clustering, we maintain one parameter vector per prototype, i.e., one per region. With PCA, we represent the distribution by keeping track of the directions of largest variance. Now imagine a simplified, binarized version of PCA: for each direction, we record whether the projection onto that direction is above or below a threshold. With d directions, we can then distinguish 2^d regions. RBMs are similar: they define d hyperplanes and associate one bit with each, indicating which side of the hyperplane the input lies on. An RBM thus associates each input region with a configuration of these d representation bits. (In neural-network terms, these bits are the hidden units.)
The number of parameters of an RBM is roughly the number of hidden units times the input dimension. We can see that the number of regions an RBM (or the binarized PCA above) can represent grows exponentially with the number of parameters, whereas the number of regions representable by traditional k-means clustering grows only linearly with the number of parameters. In other words, an RBM can generalize to a new region corresponding to a hidden-unit configuration that was never seen in the training data, something a clustering algorithm cannot do.
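To illustrate this counting argument (my own sketch, with random hyperplanes standing in for learned RBM weights): d sign bits can carve the input space into up to 2^d regions, while k prototypes give only k regions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, d, k = 10, 8, 8    # input dimension, number of hyperplanes (hidden units), number of prototypes

X = rng.normal(size=(5000, dim))   # random input points
W = rng.normal(size=(dim, d))      # d random hyperplanes (stand-ins for RBM weight vectors)

# Distributed code: one bit per hyperplane -> up to 2^d = 256 distinct regions.
codes = (X @ W > 0).astype(int)
print(len({tuple(c) for c in codes}))   # typically far more than k distinct codes

# Local code: nearest of k prototypes -> exactly k regions, one per prototype.
prototypes = rng.normal(size=(k, dim))
nearest = np.argmin(((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1), axis=1)
print(len(set(nearest.tolist())))       # at most k = 8 regions
```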