By RaySaint 2011/06/17
Concept Learning and Inductive Bias
Concept learning is rarely mentioned nowadays, probably because it is seldom used directly in practical machine learning applications, but it is the easiest route to the notion of inductive bias. Inductive bias is a very important concept, so this time we will briefly cover concept learning and then focus on inductive bias, and along the way we will see how important inductive bias is for machine learning.
Concept Learning
Given a set of examples, each labeled as belonging or not belonging to some concept, how can we automatically infer a general definition of that concept that decides whether any given example belongs to it? This problem is called concept learning.
A more precise definition:
Concept learning: inferring a Boolean-valued function from training examples of its input and output. (Note: as mentioned in the previous article, "Design of basic concepts and learning systems of machine learning", the kind of knowledge to be learned in machine learning is usually a function; in concept learning this function is restricted to be Boolean-valued, that is, its output is in {0, 1}, where 0 means false and 1 means true.) In other words, the target function has the form:
f: X -> {0, 1}
According to this definition, concept learning clearly belongs to supervised learning.
Here is an example from Machine Learning by Mitchell to help understand concept learning.
Target concept: the days on which Aldo (a person's name) will go swimming by the sea. Note that this wording is easy to misread: it sounds as if the function representing the target concept should output a list of dates, which would not fit the definition above, since the goal of concept learning is to infer a Boolean function. In fact the input is a day: based on the attributes of that day, we infer whether Aldo will go swimming on it. Table 1 below describes a series of example days, each represented by a set of attributes. The attribute EnjoySport indicates whether Aldo is willing to do water sports on that day; it is the attribute to be predicted. Sky, AirTemp, Humidity, Wind, Water, and Forecast are the known attributes, and it is from these attributes that we determine whether Aldo will go swimming by the sea on that day.
Table 1: Positive and negative examples of the target concept EnjoySport
| Example | Sky | AirTemp | Humidity | Wind | Water | Forecast | EnjoySport |
|---------|-------|---------|----------|--------|-------|----------|------------|
| 1 | Sunny | Warm | Normal | Strong | Warm | Same | Yes |
| 2 | Sunny | Warm | High | Strong | Warm | Same | Yes |
| 3 | Rainy | Cold | High | Strong | Warm | Change | No |
| 4 | Sunny | Warm | High | Strong | Cool | Change | Yes |
Next we need to determine how hypotheses (candidate target functions) are represented. We can start with a relatively simple form: each hypothesis is a conjunction of constraints on the instance attributes, i.e., a vector of six constraints specifying the values of Sky, AirTemp, Humidity, Wind, Water, and Forecast. For each attribute, a hypothesis may:
- use "?" to indicate that any value is acceptable for this attribute;
- specify a single required value (e.g., Warm); or
- use "∅" to indicate that no value is acceptable.
If an instance x satisfies all the constraints of a hypothesis h, then h classifies x as a positive example (h(x) = 1). For example, the hypothesis that Aldo does water sports only on cold days with high humidity, regardless of the other attributes, is written:
<?, Cold, High, ?, ?, ?>
That is to say:
If AirTemp = Cold and Humidity = High then EnjoySport = Yes
The most general hypothesis, that every day is a positive example, is written:
<?, ?, ?, ?, ?, ?>
The most specific hypothesis, that every day is a negative example, is written:
<∅, ∅, ∅, ∅, ∅, ∅>
Note the following points:
1. <average, minimum, maximum, minimum, maximum, maximum> and <average ,?, ?, ?, ?, ?> , <Strong, strong ,?, ?, ?, ?> The assumption is the same, that is, as long as there is an attribute of "negative" in the assumption, then this assumption indicates that every day is an inverse example.
2. You may wonder why hypotheses are restricted to conjunctions of attribute constraints. Suppose Humidity had three values: High, Normal, and Low. With this representation it would be impossible to express "Aldo will go swimming by the sea when the humidity is Normal or High", because that is a disjunction:
If Humidity = Normal or Humidity = High then EnjoySport = Yes
The assumption that hypotheses are conjunctions of attribute constraints is itself an inductive bias; it makes our learner biased. If the learner were unbiased, it could not classify unseen instances at all. This point is very important and will be explained later.
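To make this representation concrete, here is a minimal Python sketch of a hypothesis as a vector of six constraints and of how h(x) is evaluated. The encoding is my own choice, not from the book: "?" is kept as a string and None stands in for the ∅ symbol.

```python
# A hypothesis is a 6-tuple of constraints over
# (Sky, AirTemp, Humidity, Wind, Water, Forecast):
#   "?"   -> any value is acceptable
#   value -> exactly that value is required
#   None  -> stands in for the "no value acceptable" symbol ∅

def h_of_x(h, x):
    """Return 1 if instance x satisfies every constraint of hypothesis h, else 0."""
    return int(all(c == "?" or c == v for c, v in zip(h, x)))

h = ("?", "Cold", "High", "?", "?", "?")            # "cold, humid days"
day = ("Sunny", "Cold", "High", "Strong", "Warm", "Same")
print(h_of_x(h, day))                               # 1 (positive)

reject_all = (None,) * 6                            # the most specific hypothesis
print(h_of_x(reject_all, day))                      # 0 (classifies every day as negative)
```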
Let us now introduce some terminology to make the later presentation easier:
- (1) The concept or function to be learned is called the target concept, denoted c.
- (2) The concept is defined over a set of instances, denoted X. To learn the target concept, a set of training examples must be provided, denoted D. Each training example consists of an instance x from X together with its target concept value c(x) (as in Table 1 above). When c(x) = 1, x is called a positive example; when c(x) = 0, x is called a negative example. A training example can be written as the ordered pair <x, c(x)>; the sketch after this list shows the Table 1 training set written in this form.
- (3) Given a set of training examples of the target concept c, the problem faced by the learner is to hypothesize, or estimate, c. The symbol H denotes the set of all possible hypotheses, also called the hypothesis space; for our problem, H is the set of all conjunctions of attribute constraints. Each h in H is a Boolean-valued function defined over X, that is, h: X -> {0, 1}. The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X.
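As a small illustration, under the same encoding as the earlier sketch, the training set D of Table 1 can be written as a list of <x, c(x)> pairs (the tuple layout is my own choice):

```python
# Each training example is a pair <x, c(x)>: an instance (a 6-tuple of
# attribute values) together with its target concept value (1 = Yes, 0 = No).
D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   1),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   1),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), 1),
]
```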
Inductive Learning Hypothesis
The task of machine learning is to find a hypothesis h identical to the target concept c over the entire instance set X. However, the only information we have about c is its value on the training examples. Therefore an inductive learning algorithm can at best guarantee that its output hypothesis fits the training examples. Lacking any further information, we can only assume that the hypothesis that best fits the training data is also the best hypothesis for unseen instances. This is the fundamental assumption of inductive learning. A precise statement is given below:
The inductive learning hypothesis: any hypothesis that approximates the target function well over a sufficiently large set of training examples will also approximate the target function well over unobserved instances.
Now let us discuss how to approximate the target function, that is, the search strategy mentioned previously in "Design of basic concepts and learning systems of machine learning": how to search the hypothesis space H to find a hypothesis h consistent with the target concept c.
The discussion covers the general-to-specific ordering of hypotheses, version spaces, and the candidate elimination algorithm.
General-to-Specific Ordering of Hypotheses
Many concept learning algorithms, including the candidate elimination algorithm described here, search the hypothesis space by relying on a structure that exists for any concept learning problem: a general-to-specific ordering of hypotheses.
Consider the following two hypotheses:
h1 = <Sunny, ?, ?, Strong, ?, ?>
h2 = <Sunny, ?, ?, ?, ?, ?>
Obviously, any instance classified as positive by h1 is also classified as positive by h2. Therefore h2 is more general than h1, and h1 is more specific than h2. We now need to define this "more general than" relation precisely.
Definition: Let hj and hk be Boolean-valued functions defined over X. Then hj is more_general_than_or_equal_to hk (written hj ≥g hk) if and only if
(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
If (hj ≥g hk) and not (hk ≥g hj), then hj is strictly more_general_than hk (written hj >g hk). Some simple properties of ≥g follow:
(1) Reflexivity: hj ≥g hj.
(2) Antisymmetry: if hj >g hk, then hk is not >g hj.
(3) Transitivity: if hi ≥g hj and hj ≥g hk, then hi ≥g hk.
Clearly, ≥g defines a partial order over the hypothesis space H.
The relation ≥g is important because it provides a useful structure over the hypothesis space H for any concept learning problem, allowing the space to be searched far more efficiently.
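For the conjunctive hypotheses used here, hj ≥g hk can be checked constraint by constraint rather than by quantifying over all of X. A minimal sketch follows (function and variable names are mine; it treats an ∅ constraint in hk as covered, which is adequate here because ∅ only ever appears in the all-∅ hypothesis):

```python
def more_general_or_equal(hj, hk):
    """hj >=g hk for conjunctive hypotheses, checked constraint by constraint.
    A '?' in hj covers anything; a None (∅) constraint in hk accepts nothing,
    so it is covered by any constraint of hj."""
    return all(cj == "?" or ck is None or cj == ck for cj, ck in zip(hj, hk))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
print(more_general_or_equal(h2, h1))   # True:  h2 >=g h1 (h2 is more general)
print(more_general_or_equal(h1, h2))   # False: h1 is strictly more specific
```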
Version Space
To describe the version space, we first need a definition:
A hypothesis h is consistent with a set of training examples D if and only if h(x) = c(x) for every example <x, c(x)> in D.
Written formally:
Consistent(h, D) ≡ (∀<x, c(x)> ∈ D) h(x) = c(x)
The candidate elimination algorithm, described below, represents the set of all hypotheses consistent with the training examples. This subset of the hypothesis space is called the version space with respect to the hypothesis space H and the training examples D, because it contains all plausible versions of the target concept.
Definition: The version space with respect to hypothesis space H and training examples D, denoted VS_{H,D}, is the subset of hypotheses from H that are consistent with the training examples D.
VS_{H,D} ≡ {h ∈ H | Consistent(h, D)}
The general-to-specific ordering described above can be used to represent the version space in a more compact form: the version space can be represented by its maximally general and maximally specific members.
Consider the following hypothesis:
h = <Sunny, Warm, ?, Strong, ?, ?>
This h is consistent with the four training examples in Table 1. In fact, it is only one of six hypotheses consistent with the training examples. Figure 1 below shows all six:
Figure 1: the version space for the training examples of Table 1, with S = {<Sunny, Warm, ?, Strong, ?, ?>} at the bottom, G = {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>} at the top, and the remaining consistent hypotheses in between (image omitted).
The six hypotheses in Figure 1 constitute the version space (all six are consistent with the training examples). The arrows indicate the more_general_than relation between hypotheses. S is the set of maximally specific hypotheses and G is the set of maximally general hypotheses. It is easy to see from the figure that, given G and S, all hypotheses lying between them in the general-to-specific partial order can be generated. Therefore, specifying just the set of maximally specific hypotheses and the set of maximally general hypotheses completely determines the version space.
Precise definitions follow:
General boundary: the general boundary G, with respect to hypothesis space H and training data D, is the set of maximally general members of H consistent with D.
G ≡ {g ∈ H | Consistent(g, D) ∧ ¬(∃g' ∈ H)[(g' >g g) ∧ Consistent(g', D)]}
Specific boundary: the specific boundary S, with respect to hypothesis space H and training data D, is the set of maximally specific members of H consistent with D.
S ≡ {s ∈ H | Consistent(s, D) ∧ ¬(∃s' ∈ H)[(s >g s') ∧ Consistent(s', D)]}
Version space representation theorem: Let X be an arbitrary set of instances and H a set of Boolean-valued hypotheses defined over X. Let c: X -> {0, 1} be an arbitrary target concept defined over X, and let D be an arbitrary set of training examples {<x, c(x)>}. For all X, H, c, and D such that S and G are well defined:
VS_{H,D} = {h ∈ H | (∃s ∈ S)(∃g ∈ G)(g ≥g h ≥g s)}
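One way to see the version space at work is to compute VS_{H,D} by brute force, in the spirit of a list-then-eliminate approach: enumerate every conjunctive hypothesis and keep those consistent with D. Below is a sketch; the attribute domains (e.g. "Cloudy" for Sky, "Light" for Wind) are assumed values following the standard EnjoySport setup, and the all-∅ hypothesis is skipped since it cannot be consistent with any positive example.

```python
from itertools import product

DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),   # Sky
    ("Warm", "Cold"),               # AirTemp
    ("Normal", "High"),             # Humidity
    ("Strong", "Light"),            # Wind
    ("Warm", "Cool"),               # Water
    ("Same", "Change"),             # Forecast
]

def matches(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def consistent(h, D):
    return all(matches(h, x) == bool(label) for x, label in D)

D = [  # the four training examples of Table 1
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   1),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   1),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), 1),
]

# 4 * 3 * 3 * 3 * 3 * 3 = 972 conjunctive hypotheses without the ∅ symbol
H = product(*[values + ("?",) for values in DOMAINS])
version_space = [h for h in H if consistent(h, D)]
print(len(version_space))   # 6, the hypotheses shown in Figure 1
for h in version_space:
    print(h)
```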
Candidate Elimination Algorithm
With the above preparation we can now describe the candidate elimination algorithm. The idea is as follows: to obtain the version space VS_{H,D}, first initialize the boundary G to the most general hypothesis in H and the boundary S to the most specific hypothesis in H. That is:
G0 <-{<?, ?, ?, ?, ?, ?>}
S0 <-{<average, minimum, maximum, minimum, minimum, and maximum>}
Then each training example is processed in turn: positive examples force S to be generalized and negative examples force G to be specialized, so the version space gradually shrinks as hypotheses inconsistent with the examples are eliminated.
The pseudocode is described as follows:
Candidate Elimination Algorithm. Input: training examples D. Output: the version space, represented by G and S.
G <- {<?, ?, ?, ?, ?, ?>}
S <- {<∅, ∅, ∅, ∅, ∅, ∅>}
Foreach d in D
{
    If (d is a positive example)
    {
        Foreach g in G
        {
            If (g is inconsistent with d)
            {
                Remove g from G;
            }
        }
        Foreach s in S
        {
            If (s is inconsistent with d)
            {
                Remove s from S;
                Foreach minimal generalization h of s
                {
                    If (h is consistent with d && some member of G is more general than h)
                    {
                        Add h to S;
                    }
                }
                Remove from S any hypothesis that is more general than another hypothesis in S;
            }
        }
    }
    Else  // d is a negative example
    {
        Foreach s in S
        {
            If (s is inconsistent with d)
            {
                Remove s from S;
            }
        }
        Foreach g in G
        {
            If (g is inconsistent with d)
            {
                Remove g from G;
                Foreach minimal specialization h of g
                {
                    If (h is consistent with d && some member of S is more specific than h)
                    {
                        Add h to G;
                    }
                }
                Remove from G any hypothesis that is more specific than another hypothesis in G;
            }
        }
    }
}
To briefly summarize the algorithm: positive examples gradually generalize the S boundary of the version space, while negative examples make the G boundary more specialized. With each training example, the S and G boundaries move monotonically toward each other, delimiting an ever smaller version space.
Running the candidate elimination algorithm on the examples in Table 1 yields the version space shown in Figure 1.
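Here is a runnable Python sketch of the candidate elimination algorithm applied to the Table 1 data. The encoding ("?" and None for ∅), the helper names, and the assumed attribute domains are my own choices, not from the original post; for this data the sketch recovers the S and G boundaries of Figure 1.

```python
DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),   # Sky
    ("Warm", "Cold"),               # AirTemp
    ("Normal", "High"),             # Humidity
    ("Strong", "Light"),            # Wind
    ("Warm", "Cool"),               # Water
    ("Same", "Change"),             # Forecast
]

def matches(h, x):
    """h classifies instance x as positive iff every constraint is satisfied."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    """hj >=g hk, checked constraint by constraint (None stands for ∅)."""
    return all(cj == "?" or ck is None or cj == ck for cj, ck in zip(hj, hk))

def min_generalization(s, x):
    """The minimal generalization of s that covers the positive instance x."""
    return tuple(v if c is None else (c if c == v else "?") for c, v in zip(s, x))

def min_specializations(g, x):
    """Minimal specializations of g that exclude the negative instance x."""
    out = []
    for i, c in enumerate(g):
        if c == "?":
            for v in DOMAINS[i]:
                if v != x[i]:
                    out.append(g[:i] + (v,) + g[i + 1:])
    return out

def candidate_elimination(examples):
    G = {("?",) * 6}        # most general boundary
    S = {(None,) * 6}       # most specific boundary
    for x, label in examples:
        if label:  # positive example: prune G, generalize S
            G = {g for g in G if matches(g, x)}
            for s in list(S):
                if not matches(s, x):
                    S.remove(s)
                    h = min_generalization(s, x)
                    if any(more_general_or_equal(g, h) for g in G):
                        S.add(h)
            S = {s for s in S if not any(s != t and more_general_or_equal(s, t) for t in S)}
        else:      # negative example: prune S, specialize G
            S = {s for s in S if not matches(s, x)}
            for g in list(G):
                if matches(g, x):
                    G.remove(g)
                    for h in min_specializations(g, x):
                        if any(more_general_or_equal(h, s) for s in S):
                            G.add(h)
            G = {g for g in G if not any(g != t and more_general_or_equal(t, g) for t in G)}
    return S, G

D = [  # the four training examples of Table 1
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   1),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   1),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), 1),
]

S, G = candidate_elimination(D)
print("S:", S)   # {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
print("G:", G)   # {('Sunny','?','?','?','?','?'), ('?','Warm','?','?','?','?')}
```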
Using Partially Learned Concepts to Classify New Instances
Suppose only the four training examples in Table 1 are provided and no further training examples are available, and we now need to classify instances we have not seen. Figure 1 shows that the version space still contains multiple hypotheses, that is, the target concept has not yet been fully learned, but it is still possible to classify new instances with a certain degree of confidence. To demonstrate this, Table 2 lists some new instances to be classified:
Table 2: New instances to be classified

| Instance | Sky | AirTemp | Humidity | Wind | Water | Forecast | EnjoySport |
|----------|-------|---------|----------|--------|-------|----------|------------|
| A | Sunny | Warm | Normal | Strong | Cool | Change | ? |
| B | Rainy | Cold | Normal | Light | Warm | Same | ? |
| C | Sunny | Warm | Normal | Light | Warm | Same | ? |
| D | Sunny | Cold | Normal | Strong | Warm | Same | ? |
First consider instance A. Every hypothesis in the current version space shown in Figure 1 classifies A as a positive example. Since all the hypotheses in the version space agree that A is positive, the learner can classify A as positive with as much confidence as if it had already converged to the single correct target concept.
(This assumes that the target concept is contained in the hypothesis space and that the training examples contain no errors.) In fact, as long as every member of S classifies an instance as positive, we can assert that every hypothesis in the version space classifies it as positive, because, by the definition of more_general_than, if a new instance satisfies all the members of S it must also satisfy every more general hypothesis. Similarly, every hypothesis in the version space classifies instance B as negative, so B can safely be classified as negative even though concept learning is incomplete; it suffices that the instance fails to satisfy every member of G. Instance C is different: half of the hypotheses in the version space classify C as positive and the other half classify it as negative, so the learner cannot classify C with any confidence until more training examples are available. Instance D is classified as positive by two of the hypotheses in the version space and as negative by the other four; the confidence of this classification is lower than for A and B but higher than for C. The vote leans toward classifying D as negative, so we can output the classification with the most votes, together with a confidence ratio indicating how strongly the vote leans. (Note: if every hypothesis in H has the same prior probability, this voting scheme yields the most probable classification of the new instance.) A small sketch of the voting scheme is given below; after that we can turn to a very important concept, inductive bias.
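As a sketch of the voting scheme (the six hypotheses listed are the members of the version space from Figure 1; the helper names are mine):

```python
# Classify new instances by majority vote over the version space.
VERSION_SPACE = [
    ("Sunny", "Warm", "?", "Strong", "?", "?"),   # S boundary
    ("Sunny", "?",    "?", "Strong", "?", "?"),
    ("Sunny", "Warm", "?", "?",      "?", "?"),
    ("?",     "Warm", "?", "Strong", "?", "?"),
    ("Sunny", "?",    "?", "?",      "?", "?"),   # G boundary
    ("?",     "Warm", "?", "?",      "?", "?"),   # G boundary
]

def matches(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def vote(x, hypotheses=VERSION_SPACE):
    pos = sum(matches(h, x) for h in hypotheses)
    return pos, len(hypotheses) - pos

new_instances = {
    "A": ("Sunny", "Warm", "Normal", "Strong", "Cool", "Change"),
    "B": ("Rainy", "Cold", "Normal", "Light",  "Warm", "Same"),
    "C": ("Sunny", "Warm", "Normal", "Light",  "Warm", "Same"),
    "D": ("Sunny", "Cold", "Normal", "Strong", "Warm", "Same"),
}

for name, x in new_instances.items():
    pos, neg = vote(x)
    print(f"{name}: {pos} positive votes, {neg} negative votes")
# Expected: A -> 6/0, B -> 0/6, C -> 3/3, D -> 2/4
```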
Inductive Bias
As mentioned above, if the training examples contain no errors and the initial hypothesis space contains the target concept, then, given enough training examples, the candidate elimination algorithm converges to the target concept. Also as mentioned above, restricting hypotheses to conjunctions of attribute constraints is really a way of narrowing the hypothesis space that has to be searched. One consequence is that the target concept may not be contained in such an initial hypothesis space. To guarantee that the hypothesis space contains the target concept, the obvious remedy is to enlarge the hypothesis space until every possible hypothesis is included. Consider the EnjoySport example again, where the hypothesis space was restricted to conjunctions of attribute values. Because of this restriction, the hypothesis space cannot represent even a simple disjunctive concept such as "Sky = Sunny or Sky = Cloudy". The point is that we have made the learner biased: it considers only conjunctive hypotheses.
The futility of unbiased learning
Since this bias may cause the hypothesis space to miss the target concept, we could instead provide a hypothesis space with stronger expressive power, one able to express every concept that could be taught, in other words, every possible subset of the instance set X. The set of all subsets of a set X is called the power set of X. Suppose AirTemp, Humidity, Wind, Water, and Forecast each have two possible values and Sky has three; then the instance space X contains 3 x 2 x 2 x 2 x 2 x 2 = 96 distinct instances. From elementary set theory, the size of the power set of X is 2^|X|, where |X| is the number of elements of X. Therefore 2^96, or roughly 10^28, distinct target concepts can be defined over this instance space, and a hypothesis space containing all 2^|X| of them is called an unbiased hypothesis space. Earlier, when we restricted the hypothesis space to conjunctions of attribute values, it could represent only 1 + 4 x 3 x 3 x 3 x 3 x 3 = 973 hypotheses, so our earlier space was in fact strongly biased. Intuitively, although the unbiased hypothesis space is guaranteed to contain the target concept, it contains an enormous number of hypotheses and is expensive to search. But, as we will soon see, there is a more fundamental problem: with an unbiased hypothesis space, a concept learning algorithm cannot generalize from the training examples at all; to converge to a single target concept, every instance in X would have to be provided as a training example.
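Before going further, a quick check of the counts above (a throwaway sketch; the attribute domain sizes are those stated in the text):

```python
sky_values = 3
other_values = 2  # AirTemp, Humidity, Wind, Water, Forecast each have 2 values

instances = sky_values * other_values ** 5                                # size of X
unbiased_hypotheses = 2 ** instances                                      # power set of X
conjunctive_hypotheses = 1 + (sky_values + 1) * (other_values + 1) ** 5   # values plus '?', plus one all-∅

print(instances)                # 96
print(unbiased_hypotheses)      # 79228162514264337593543950336, about 7.9e28
print(conjunctive_hypotheses)   # 973
```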
From such a learner we cannot obtain classifications for any unseen instance. Let us see why. Suppose we give the learner three positive examples (x1, x2, x3) and two negative examples (x4, x5). The S boundary of the version space then contains exactly the disjunction of the three positive examples, S: {(x1 ∨ x2 ∨ x3)}, since this is the most specific hypothesis that covers the three positive examples. Similarly, the G boundary consists of the hypothesis that rules out only the observed negative examples, G: {¬(x4 ∨ x5)}. The problem is now apparent: in this maximally expressive representation, the S boundary is always just the disjunction of the observed positive examples, and the G boundary is always just the negated disjunction of the observed negative examples. Consequently, the only instances that S and G classify unambiguously are the training examples themselves; to converge to a single target concept, every instance in X must be presented as a training example.

One might hope to avoid this by keeping the partially learned version space and, as above, classifying unseen instances by a majority vote of its members. Unfortunately, voting is useless for instances outside the training set: every unseen instance is classified as positive by exactly half of the version space and as negative by the other half. The reason is as follows. Let H be the power set of X and let x be an unseen instance. For any hypothesis h in the version space that covers x, there is another hypothesis h' that is identical to h except for its classification of x; and if h is in the version space, then so is h', because h' classifies every previously observed training example exactly as h does. This discussion illustrates a fundamental property of inductive inference:
A learner that makes no prior assumptions about the form of the target concept cannot classify any unseen instances. This is the futility of unbiased learning. In our original EnjoySport task, the only reason the candidate elimination algorithm can generalize beyond the training examples is that it is biased: it implicitly assumes that the target concept can be expressed as a conjunction of attribute values. Inductive learning therefore requires some form of prior assumption, or inductive bias, and we can use the inductive bias to characterize and compare different learning methods. There is a more formal definition of inductive bias, which we will not go into here; in short, the inductive bias is an additional set of premises B. As we will see later, such a premise set arises in two ways: the first is to restrict the hypothesis space, as the candidate elimination algorithm does; the second is to use a complete hypothesis space but search it incompletely, as many greedy algorithms do, for example the decision tree algorithm discussed later. The former kind of inductive bias is called a restriction bias, and the latter a preference bias. When studying other inductive inference methods, it is worth keeping in mind the existence and the strength of their inductive bias: the stronger an algorithm's bias, the bolder its inductive leaps and the more unseen instances it can classify. Of course, whether those classifications are correct depends on whether the inductive bias itself is correct.
References:
Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997.
This article is from the "UnderTheHood" blog, please be sure to keep this source http://underthehood.blog.51cto.com/2531780/590838