The following is excerpted from the Wikipedia article on semi-supervised learning, as a conceptual and intuitive introduction to semi-supervised learning.
Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render a fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.
In short: labeled samples have to be produced manually, which is costly, while unlabeled samples are much easier to obtain. The idea is therefore to spend the labeling cost on only part of the samples and collect the rest unlabeled, and to learn from both together; this is semi-supervised learning, which has good practical value and in fact mimics how humans learn.
As in the supervised learning framework, we are given a set of l independently and identically distributed examples x_1, ..., x_l ∈ X with corresponding labels y_1, ..., y_l ∈ Y. Additionally, we are given u unlabeled examples x_{l+1}, ..., x_{l+u} ∈ X. Semi-supervised learning attempts to make use of this combined information to surpass the classification performance that could be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning.
This formal setup describes the two parts of the sample set in semi-supervised learning, and states the goal: to do better than what could be achieved by either discarding the unlabeled samples (pure supervised learning) or discarding the labels (pure unsupervised learning).
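To make this setup concrete, here is a minimal, hypothetical sketch in Python. The toy data and the numbers l, u, d are invented for illustration; marking unlabeled points with the placeholder -1 follows scikit-learn's semi-supervised convention and is not part of the Wikipedia text.

```python
# Minimal sketch of the data layout: l labeled examples (x_i, y_i)
# and u unlabeled examples x_{l+1}, ..., x_{l+u}. Toy data only.
import numpy as np

rng = np.random.default_rng(0)
l, u, d = 20, 200, 2                             # few labeled points, many unlabeled, d features

X_labeled = rng.normal(size=(l, d))
y_labeled = (X_labeled[:, 0] > 0).astype(int)    # toy binary labels
X_unlabeled = rng.normal(size=(u, d))            # same distribution, labels unknown

# Stack everything; unlabeled points get the placeholder label -1 (scikit-learn convention).
X = np.vstack([X_labeled, X_unlabeled])
y = np.concatenate([y_labeled, -np.ones(u, dtype=int)])
```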
Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data x_{l+1}, ..., x_{l+u} only. The goal of inductive learning is to infer the correct mapping from X to Y.
That is, semi-supervised learning can be divided into transductive learning and inductive learning: transductive learning infers the correct labels only for the given unlabeled samples, while inductive learning infers the mapping from X to Y. The following paragraph gives an intuitive explanation.
Intuitively, we can think of the learning problem as an exam and labeled data as the few example problems that the teacher solved in class. The teacher also provides a set of unsolved problems. In the transductive setting, these unsolved problems are a take-home exam and you want to do well on them in particular. In the inductive setting, these are practice problems of the sort you will encounter on the in-class exam.
In this analogy, semi-supervised learning is like a student taking an exam: the labeled samples are the problems the teacher has already solved in class, and the unlabeled samples are review problems handed out without answers. In transductive learning, the student's goal is to work out exactly those review problems, as if they were the exam itself; in inductive learning, the student treats them as practice for the kind of problems that will actually appear on the exam, and learns a general way to solve them.
It is unnecessary (and, according to Vapnik's principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.
In practice, algorithms designed for transduction and for induction are often used interchangeably.
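As an illustration only (not from the article), the exam analogy can be mirrored with scikit-learn's semi-supervised estimators: LabelSpreading works transductively on the unlabeled points it was given, while SelfTrainingClassifier wrapped around an ordinary classifier yields an inductive model that can label brand-new points. The toy data here are invented for the sketch, with -1 marking unlabeled points.

```python
# Hedged sketch: one way to see the transductive / inductive split with scikit-learn.
import numpy as np
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
l, u, d = 20, 200, 2
X = rng.normal(size=(l + u, d))
y = np.concatenate([(X[:l, 0] > 0).astype(int),        # toy labels for the first l points
                    -np.ones(u, dtype=int)])            # -1 = unlabeled

# Transductive flavor: infer labels only for the unlabeled points we were given
# (the "take-home exam" in the analogy above).
ls = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
labels_for_given_unlabeled = ls.transduction_[l:]

# Inductive flavor: learn a mapping from X to Y that also applies to brand-new points
# (the "in-class exam").
st = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
labels_for_new_points = st.predict(rng.normal(size=(5, d)))
```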
Assumptions used in semi-supervised learning
In order to make any use of unlabeled data, we must assume some structure to the underlying distribution of data. Semi-supervised learning algorithms make use of at least one of the following assumptions.
Smoothness assumption
Points which are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so that there are fewer points close to each other but in different classes.
Cluster Assumption
The data tend to form discrete clusters, and points in the same cluster are more likely to share a label (although data sharing a label may spread across multiple clusters). This is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms.
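A hedged sketch of how the cluster assumption can be exploited (the two-blob toy data and the majority-vote rule are assumptions of this example, not something prescribed by the article): cluster all points, labeled and unlabeled, and let each cluster take the majority label of its labeled members.

```python
# Cluster-assumption sketch: k-means over all points, then label clusters by majority vote.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two Gaussian blobs; only ten points in each blob are labeled.
blob0 = rng.normal(-2.0, 1.0, size=(110, 2))
blob1 = rng.normal(+2.0, 1.0, size=(110, 2))
X = np.vstack([blob0[:10], blob1[:10], blob0[10:], blob1[10:]])   # labeled first, unlabeled after
y_labeled = np.array([0] * 10 + [1] * 10)
l = len(y_labeled)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Majority vote of the labeled members decides each cluster's label.
cluster_label = {}
for c in range(2):
    members = y_labeled[clusters[:l] == c]
    if members.size:
        vals, counts = np.unique(members, return_counts=True)
        cluster_label[c] = vals[np.argmax(counts)]

# Unlabeled points inherit their cluster's label (-1 if the cluster has no labeled member).
guessed = np.array([cluster_label.get(c, -1) for c in clusters[l:]])
```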
Manifold Assumption
The data lie approximately on a manifold of much lower dimension than the input space. In this case we can attempt to learn the manifold using both the labeled and unlabeled data to avoid the curse of dimensionality. Then learning can proceed using distances and densities defined on the manifold.
The manifold assumption is practical when high-dimensional data are being generated by some process that may be hard to model directly, but which has only a few degrees of freedom. For instance, human voice is controlled by a few vocal folds,[2] and images of various facial expressions are controlled by a few muscles. We would like in these cases to use distances and smoothness in the natural space of the generating problem, rather than in the space of all possible acoustic waves or images respectively.
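One possible illustration of the manifold assumption (the Swiss-roll toy data and the choice of Isomap are assumptions of this sketch, not from the article): learn a low-dimensional embedding from all points, labeled and unlabeled, then classify using distances measured in that embedded space.

```python
# Manifold-assumption sketch: embed everything, then use distances on the manifold.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsClassifier

X, t = make_swiss_roll(n_samples=400, random_state=0)   # 3-D points lying on a 2-D manifold
y_all = (t > np.median(t)).astype(int)                   # toy labels defined along the manifold
l = 30                                                    # pretend only the first 30 labels are known

# The embedding is learned from every example, labeled or not.
Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# Distances on the manifold drive the classifier, which is trained only on the labeled part.
knn = KNeighborsClassifier(n_neighbors=3).fit(Z[:l], y_all[:l])
guessed = knn.predict(Z[l:])
```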
In short, to make semi-supervised learning possible, three assumptions are used. First, the smoothness assumption: similar samples are more likely to share the same label. This resembles the idea of generalization in supervised learning, where similar inputs should produce similar outputs, otherwise the machine cannot learn; it effectively requires that the process generating the samples is smooth. Second, the cluster assumption: the data tend to form discrete clusters, and samples in the same cluster are more likely to share the same label; this connects to feature learning with clustering algorithms in unsupervised learning. Finally, the manifold assumption: the input space is redundant, and the data lie near a much lower-dimensional manifold; learning the manifold from both labeled and unlabeled samples helps avoid the curse of dimensionality. Just as a person's voice is controlled by only a few vocal folds, and facial expressions by only a few muscles, we would rather learn from these few key factors than work directly in the space of all sound waves or face images.
2015-8-28
Less art
Copyright notice: This article is an original article by the blogger and may not be reproduced without the blogger's permission.
Semi-supervised learning (Wikipedia)