Keywords: machine learning, basic terminology, hypothesis space, inductive bias, uses of machine learning
I. Overview of machine learning
Machine learning is the process of computing a model from data: the resulting model should not only reflect the regularities contained in the training data set, but also apply to data outside the training set. The central question of machine learning is "What algorithm should we adopt in order to obtain such a model?"
If the training set is the "experience" in our life, then the model is our "empirical solution", and data outside the training set are the "new problems" in life.
II. Basic Terminology
To illustrate the basic terminology, we will use the everyday example of "watermelon selection" throughout.
Three watermelon examples (instances):
1. (color = green, root = curled up, knock = muffled)
2. (color = black, root = slightly curled, knock = dull)
3. (color = plain, root = stiff, knock = crisp)
Dataset (data set): a dataset consists of multiple records; each record is a set of key-value pairs describing an event or object. The form is as follows:
A record: (key1 = value1, key2 = value2, ..., keyn = valuen)
Sample and example (instance): each key-value pair describes one property of the object, namely an attribute (also called a feature) and its attribute value. A whole record of this form is called an "example (instance)" or a "sample". Note, however, that the entire dataset is also sometimes called a "sample", because it can be viewed as a sampling of the sample space; the context tells us whether "sample" refers to a single example or to the whole dataset.
Sample space: the space spanned by the attributes is called the "attribute space", "sample space" or "input space", in contrast to the label space. For example, if we take the watermelon's color, root and knock as three coordinate axes, they span a three-dimensional space describing watermelons, and every watermelon can be located at a coordinate point in this space. For instance, (color = green, root = curled up, knock = dull) gives the coordinates of a green, curled-up, dull-sounding watermelon in this three-dimensional space.
Feature vector: since each point in the attribute space corresponds to a coordinate vector, an example is also called a "feature vector". The feature vector is the example's coordinate vector in the attribute space.
Notation for datasets: in general, D = {x1, x2, ..., xn} denotes a dataset containing n samples. If each example is described by d attributes, then each example xi = (xi1, xi2, ..., xid) is a vector in the d-dimensional sample space X, and d is called the "dimensionality" of the sample xi. Here xij is the value of the i-th example on the j-th attribute.
For example, in "Watermelon selection", we use (color, heel, knock) three properties of the watermelon, d=3, sample dimension =3,x12= "curl up."
Label: the outcome information attached to a training sample, as in "((color = green, root = curled up, knock = muffled), good melon)", where "good melon" is the label. The word "label" is sometimes also used as a verb, meaning to mark a sample.
Label space: in general, (xi, yi) denotes the i-th labeled example, where yi ∈ Y is the label of example xi, and Y is the set of all labels, also called the "label space" or "output space".
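Continuing the same sketch (the labels are hypothetical, added by me for illustration), each example xi is paired with its label yi, and the label space Y is simply the set of all labels that occur:

    # Labeled examples (x_i, y_i): feature vector plus label.
    labeled_D = [
        (["green", "curled up",       "muffled"], "good melon"),
        (["black", "slightly curled", "dull"],    "good melon"),
        (["plain", "stiff",           "crisp"],   "bad melon"),
    ]

    # The label space Y: the set of all possible labels.
    Y = {y for _, y in labeled_D}
    print(Y)   # {'good melon', 'bad melon'}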
Learning (training): the process of obtaining a model from data is called learning or training, and is usually carried out by executing some learning algorithm. The data used during training are called "training data"; each sample in them is a "training sample" (also "training example" or "training instance"), and the set of training samples is the "training set". The learned model corresponds to some underlying regularity of the data, called a "hypothesis"; the regularity actually inherent in the data is called the "ground truth". The goal of learning is to find or approximate the ground truth. The instantiation of a learning algorithm on given data and a given parameter space is sometimes referred to as a "learner".
discrimination: "hypothesis hypothesis" and "truth Ground-truth" are the reflection of the law of the data, but the hypothesis is that the model obtained from the data learning is not necessarily our concern, want to respond to the real rules. And the truth is the data inherent, we expect to get the real law.
Classification of machine learning tasks:
1. By the type of value to be predicted:
If the value to be predicted is discrete, the task is called classification; if it is continuous, the task is called regression. Classification tasks can be further divided into "binary classification" and "multi-class classification" according to the number of categories. In a binary classification task, one class is usually called the "positive class" and the other the "negative class".
The usual prediction task is to learn, from the training set {(x1, y1), (x2, y2), ..., (xn, yn)}, a mapping from the input space to the output space, f: X → Y. For binary classification, usually Y = {0, 1} or Y = {-1, +1}; for multi-class classification, |Y| > 2; for regression, Y = R, the set of real numbers.
2. By whether the training data carry labels:
If the training data are labeled, the learning belongs to "supervised learning"; if not, it belongs to "unsupervised learning". Classification and regression are representatives of the former, and clustering of the latter.
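As a minimal sketch of the contrast (this uses scikit-learn purely for illustration, which is my own choice, not something from the text): a classifier learns from feature vectors together with their labels, while a clustering algorithm works on the same feature vectors without any labels.

    from sklearn.cluster import KMeans                 # unsupervised: uses X only
    from sklearn.tree import DecisionTreeClassifier    # supervised: uses X and y

    # Toy numeric encoding of watermelon attributes (hypothetical values).
    X = [[0, 0, 0],
         [1, 1, 1],
         [2, 2, 2],
         [0, 1, 0]]
    y = [1, 1, 0, 1]   # 1 = good melon, 0 = bad melon

    clf = DecisionTreeClassifier().fit(X, y)       # supervised learning: (X, y)
    print(clf.predict([[0, 0, 1]]))                # predicted label for a new melon

    km = KMeans(n_clusters=2, n_init=10).fit(X)    # clustering: X only, no labels
    print(km.labels_)                              # group assignment for each melon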
Generalization: the ability of a model to handle new samples is called its generalization ability.
i.i.d.: independent and identically distributed, abbreviated i.i.d. (see any probability theory textbook for the precise meaning).
Overfitting: the situation where the model fits the training set too closely and deviates from the ground truth, much like someone in life who learns past experiences by rote and mistakes their peculiarities for general rules.
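A small numeric sketch of overfitting (my own example, not from the text): a degree-9 polynomial can fit ten noisy training points almost perfectly, yet on new points drawn from the same underlying rule it is usually worse than the simple straight-line fit.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 1, 10)
    y_train = x_train + 0.1 * rng.standard_normal(10)   # ground truth y = x, plus noise

    simple = np.polyfit(x_train, y_train, deg=1)     # same form as the ground truth
    wiggly = np.polyfit(x_train, y_train, deg=9)     # fits the training noise exactly

    x_new = np.linspace(0, 1, 100)                   # the "new problems"
    y_new = x_new                                    # noise-free ground truth
    err_simple = np.mean((np.polyval(simple, x_new) - y_new) ** 2)
    err_wiggly = np.mean((np.polyval(wiggly, x_new) - y_new) ** 2)
    print(err_simple, err_wiggly)   # the degree-9 fit typically has the larger error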
III. Hypothesis Space and Version Space
1. Induction and deduction: induction is the process of "generalization" from the particular to the general, while deduction is the process of "specialization" from the general to the particular. "Learning from examples" is clearly an inductive process, hence the name "inductive learning".
2. Inductive learning in the narrow and broad senses: in the narrow sense, inductive learning requires learning a "concept" from the training data, and is therefore also called "concept learning". The most basic form of concept learning is learning Boolean concepts, i.e. yes/no judgments. However, because well-defined, semantically clear concepts are hard to learn, most real systems produce "black-box models": the concept itself may not be well understood, as long as the model predicts well on new data. For example, the model we obtain can pick out good watermelons, but what combination of color, root and knock actually makes a melon good remains unknown to anyone outside the black box.
3. Inductive learning and the hypothesis space: the process of inductive learning can be viewed as a search through the space of all possible hypotheses for a hypothesis that fits the training set. Once the representation of hypotheses is fixed, the hypothesis space and its size are determined. For example, suppose the hypothesis space consists of hypotheses of the form "(color = ?) ∧ (root = ?) ∧ (knock = ?)", where each "?" is filled with a possible value of that attribute. The color, for instance, can be "green", "black" or "plain"; we also need to consider the case where any value of an attribute will do, which we denote with the wildcard "*". There is also the extreme case that the concept "good melon" may not exist at all, which we denote by φ. If "color", "root" and "knock" have 3, 2 and 2 possible values respectively, then the size of our hypothesis space is (3 + 1) × (2 + 1) × (2 + 1) + 1 = 37.
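A small enumeration sketch of that count (my own code; the concrete attribute values are just placeholders): each attribute slot takes one of its values or the wildcard "*", and the extra "+1" is the empty hypothesis φ saying no good melon exists at all.

    from itertools import product

    # Possible values per attribute in this example: color 3, root 2, knock 2.
    values = {
        "color": ["green", "black", "plain"],
        "root":  ["curled up", "stiff"],
        "knock": ["muffled", "crisp"],
    }

    # Each slot is either a concrete value or the wildcard "*" (any value works).
    slots = [vals + ["*"] for vals in values.values()]
    hypotheses = list(product(*slots))

    size = len(hypotheses) + 1   # +1 for the empty hypothesis (no good melon exists)
    print(size)                  # (3 + 1) * (2 + 1) * (2 + 1) + 1 = 37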
The hypothesis space can be searched either top-down or bottom-up. Top-down search goes from the general to the specific, while bottom-up search goes from the specific to the general. During the search, hypotheses inconsistent with the positive or negative examples are continually discarded, and the result of learning is a hypothesis that judges every example in the training set correctly.
However, in the real world we often face very large hypothesis spaces, so there may be several hypotheses that are all consistent with the training set. The set of all hypotheses consistent with the training set is called the "version space".
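A brute-force sketch of the version space (my own code with a hypothetical two-example training set): enumerate every hypothesis from the space above and keep only those that classify all training examples correctly.

    from itertools import product

    slots = [
        ["green", "black", "plain", "*"],   # color
        ["curled up", "stiff", "*"],        # root
        ["muffled", "crisp", "*"],          # knock
    ]

    def matches(hypothesis, example):
        # A hypothesis covers an example if every slot is "*" or equal to the value.
        return all(h == "*" or h == e for h, e in zip(hypothesis, example))

    # Tiny training set: (attribute values, is it a good melon?) -- hypothetical labels.
    train = [
        (("green", "curled up", "muffled"), True),
        (("plain", "stiff", "crisp"), False),
    ]

    # Version space: all hypotheses consistent with every training example.
    version_space = [h for h in product(*slots)
                     if all(matches(h, x) == y for x, y in train)]
    print(len(version_space), version_space)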
Occam's razor: among multiple hypotheses consistent with the observations, choose the simplest one. This principle is widely adopted in the natural sciences such as physics and astronomy. One reason Copernicus insisted on the heliocentric model, for example, was that it is simpler than Ptolemy's geocentric model while still agreeing with astronomical observations.
IV. Inductive Bias and the NFL Theorem (No Free Lunch Theorem)
Inductive bias: every algorithm needs some specific inductive bias, or "bias" for short. The algorithm must commit to one model, yet people easily disagree about what counts as "better"; for a given dataset, the inductive bias therefore plays the role of the algorithm's "values". When a training set admits multiple consistent hypotheses, the inductive bias is what determines which hypothesis the algorithm actually outputs; without it, the learner could not produce a definite result.
NFL theorem: the "No Free Lunch" theorem. It proves that no matter how clever or clumsy the chosen algorithm is, its expected performance, averaged over all possible problems, is the same. Note, however, the premise of this theorem: all "problems" occur with equal probability, that is, all problems are equally important. In reality we only care about the specific problem we are currently solving, and on that specific problem different algorithms do perform differently.
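A tiny simulation sketch of that premise (my own illustration with binary labels and a six-point input space): when every possible target function is weighted equally, a "clumsy" learner and a "clever" learner end up with exactly the same average error on the points outside the training set.

    from itertools import product

    points = list(range(6))    # the whole (tiny) input space
    train = points[:3]         # inputs seen during training
    off = points[3:]           # off-training-set inputs

    def learner_all_zero(train_labels):
        # "Clumsy" learner: always predicts 0 outside the training set.
        return lambda x: 0

    def learner_majority(train_labels):
        # "Clever" learner: predicts the majority training label everywhere.
        majority = int(sum(train_labels) * 2 > len(train_labels))
        return lambda x: majority

    def average_ote_error(learner):
        errors = []
        # Enumerate every possible target function f over the whole input space.
        for f in product([0, 1], repeat=len(points)):
            h = learner([f[x] for x in train])
            errors.append(sum(h(x) != f[x] for x in off) / len(off))
        return sum(errors) / len(errors)

    print(average_ote_error(learner_all_zero))   # 0.5
    print(average_ote_error(learner_majority))   # 0.5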
In conclusion, the significance of the NFL theorem is the realization that it is meaningless to discuss "which learning algorithm is better" divorced from the specific problem: if all potential problems are considered, every learning algorithm has the same expected performance. An algorithm that performs well on some problems will fall short on others, where different algorithms shine instead. That is the real lesson of the NFL theorem.
That's it for the Chapter 1 summary. Quite tiring! That's all for now, see you next time~ More to come as I learn more~
"Machine Learning" (chapter I) preface chapter