Semi-Supervised Learning

Last Update:2018-10-26 Source: Internet

Author: User


Although unlabeled samples carry no label information directly, if they are drawn independently from the same distribution as the labeled samples, the information they reveal about the data distribution can greatly benefit model building. Here is an intuitive example. Suppose the figure contains only one positive sample and one negative sample, and the sample to be classified lies exactly midway between them. With the labeled samples alone we can only guess at random. If we can also observe the unlabeled samples in the figure, we can be fairly confident that the sample in question is positive.

Semi-supervised learning lets a learner automatically exploit unlabeled samples to improve its performance, without relying on external interaction.

To exploit unlabeled samples, we must make some assumption that links the data distribution information they reveal to the class labels. The most common is the cluster assumption: the data has a cluster structure, and samples in the same cluster belong to the same class. Another common assumption is the manifold assumption: the data lies on a manifold structure, and neighboring samples have similar output values. "Neighboring" is usually characterized by "similarity", so the manifold assumption can be seen as a generalization of the cluster assumption. Because it places no restriction on the output values, it applies more broadly than the cluster assumption and can be used for more types of learning tasks. In essence, both assumptions rest on the same basic premise: similar samples have similar outputs.
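The premise that similar samples have similar outputs can be illustrated with a toy sketch. The function below is not a method from this article; it is a hypothetical greedy nearest-neighbor propagation on 1-D points (the name `propagate_labels` and the data are assumptions made for illustration). It shows how a chain of unlabeled samples can decide a query point that is equidistant from the two labeled samples.

```python
def propagate_labels(labeled, unlabeled):
    """Greedy nearest-neighbor label propagation on 1-D points.

    labeled: dict mapping point -> label; unlabeled: list of points.
    Returns a dict that maps every point to a label.
    """
    known = dict(labeled)
    remaining = list(unlabeled)
    while remaining:
        # Find the unlabeled point that is closest to any already-labeled point.
        x, nearest = min(
            ((u, k) for u in remaining for k in known),
            key=lambda pair: abs(pair[0] - pair[1]),
        )
        known[x] = known[nearest]  # similar samples get similar outputs
        remaining.remove(x)
    return known


# One negative sample at 0, one positive sample at 10; the query 5.0 is midway.
labeled = {0.0: -1, 10.0: 1}
# Unlabeled samples form a dense chain that connects 5.0 to the positive side.
unlabeled = [5.0, 6.0, 7.0, 8.0, 9.0, -1.0, -2.0]
labels = propagate_labels(labeled, unlabeled)
print(labels[5.0])
```

Using the labeled samples alone, the query at 5.0 is a tie. With the unlabeled chain, the label of the positive sample spreads point by point until it reaches the query.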

Semi-supervised learning can be further divided into pure semi-supervised learning and transductive learning. The former assumes that the unlabeled samples in the training data are not the data to be predicted. The latter assumes that the unlabeled samples considered during learning are exactly the data to be predicted, so the goal is to obtain the best generalization performance on those unlabeled samples. In other words, pure semi-supervised learning rests on an "open world" assumption: we hope the learned model will apply to data not observed during training. Transductive learning rests on a "closed world" assumption: it only tries to predict the unlabeled data observed during learning.

Generative Methods

Generative methods are based directly on generative models. They assume that all data, labeled or not, is generated by the same underlying model. This assumption links the unlabeled data to the learning objective through the parameters of the underlying model: the labels of the unlabeled data can be treated as missing parameters of the model, and maximum likelihood estimates can usually be obtained with the EM algorithm. These methods differ mainly in the generative model they assume; different model assumptions yield different methods.

Common generative models include the Gaussian mixture model, the mixture of experts model, and the naive Bayes model.
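As a minimal sketch of the idea, the code below runs EM on a two-class, one-dimensional Gaussian mixture, assuming one Gaussian component per class. The function names (`gaussian_pdf`, `semi_supervised_gmm`, `predict`), the initialization from labeled samples, and the toy data are assumptions made for illustration, not details taken from the article.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def semi_supervised_gmm(labeled, unlabeled, n_iter=50):
    """EM for a two-class 1-D mixture. labeled: list of (x, y) with y in {0, 1}.

    Returns [[weight, mean, variance], ...], one entry per class.
    """
    # Initialize each class component from its labeled samples only.
    params = []
    for k in (0, 1):
        xs = [x for x, y in labeled if y == k]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs) + 1.0  # +1.0 avoids zero variance
        params.append([len(xs) / len(labeled), mean, var])

    points = [x for x, _ in labeled] + list(unlabeled)
    for _ in range(n_iter):
        # E-step: posterior class probabilities for the unlabeled samples.
        resp = []
        for x in unlabeled:
            joint = [w * gaussian_pdf(x, m, v) for w, m, v in params]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: labeled samples count fully toward their own class,
        # unlabeled samples count in proportion to their posteriors.
        for k in (0, 1):
            weights = [1.0 if y == k else 0.0 for _, y in labeled] + [r[k] for r in resp]
            n_k = sum(weights)
            mean = sum(w * x for w, x in zip(weights, points)) / n_k
            var = sum(w * (x - mean) ** 2 for w, x in zip(weights, points)) / n_k
            params[k] = [n_k / len(points), mean, max(var, 1e-6)]
    return params


def predict(params, x):
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in params]
    return 0 if joint[0] >= joint[1] else 1


labeled = [(0.0, 0), (10.0, 1)]
unlabeled = [-1.0, 0.5, 1.0, 1.5, 2.0, 8.0, 8.5, 9.0, 9.5, 11.0]
params = semi_supervised_gmm(labeled, unlabeled)
print(predict(params, 3.0), predict(params, 7.0))
```

The unlabeled samples never receive hard labels here. Their class memberships are latent quantities that the E-step re-estimates in every iteration, which is how they influence the class means and variances.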

Semi-supervised SVM

The semi-supervised support vector machine (S3VM) is the generalization of the support vector machine to semi-supervised learning. When unlabeled samples are not considered, an SVM tries to find the maximum-margin separating hyperplane. Once unlabeled samples are considered, an S3VM tries to find a separating hyperplane that both separates the two classes of labeled samples and passes through low-density regions of the data. The basic assumption here is low-density separation.

The best-known semi-supervised SVM is the transductive support vector machine (TSVM). Like the standard SVM, TSVM is a learning method for binary classification. TSVM considers every possible label assignment for the unlabeled samples, that is, it tries each unlabeled sample as either a positive or a negative sample, and then, among all these assignments, looks for the separating hyperplane that maximizes the margin over all samples (the labeled samples plus the unlabeled samples with their assigned labels). Once the hyperplane is determined, the final label assignment of the unlabeled samples is the prediction result.

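Formally, we are given $D_l = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$ and $D_u = \{x_{l+1}, x_{l+2}, \ldots, x_{l+u}\}$, where $y_i \in \{-1, +1\}$, $l \ll u$, and $l + u = m$.

The learning goal of TSVM is to provide predicted labels $\hat{y} = (\hat{y}_{l+1}, \hat{y}_{l+2}, \ldots, \hat{y}_{l+u})$, $\hat{y}_i \in \{-1, +1\}$, for the samples in $D_u$ such that

$$
\begin{aligned}
\min_{w,\, b,\, \hat{y},\, \xi} \quad & \frac{1}{2}\lVert w \rVert_2^2 + C_l \sum_{i=1}^{l} \xi_i + C_u \sum_{i=l+1}^{m} \xi_i \\
\text{s.t.} \quad & y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, l, \\
& \hat{y}_i (w^\top x_i + b) \ge 1 - \xi_i, \quad i = l+1, l+2, \ldots, m, \\
& \xi_i \ge 0, \quad i = 1, 2, \ldots, m.
\end{aligned}
$$

Here $(w, b)$ determines a separating hyperplane. $\xi$ is the slack vector: $\xi_i$ for $i = 1, \ldots, l$ corresponds to the labeled samples, and $\xi_i$ for $i = l+1, \ldots, m$ corresponds to the unlabeled samples. $C_l$ and $C_u$ are user-specified trade-off parameters that balance the model complexity against the importance of the labeled samples and of the unlabeled samples.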
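To make the roles of the trade-off parameters concrete, the sketch below evaluates a TSVM-style objective for a fixed hyperplane and a fixed label assignment. It assumes the standard soft-margin form, $\frac{1}{2}\lVert w \rVert^2 + C_l \sum_{\text{labeled}} \xi_i + C_u \sum_{\text{unlabeled}} \xi_i$, with each slack taken at its smallest feasible value $\xi_i = \max(0,\, 1 - y_i (w^\top x_i + b))$. The function name `tsvm_objective` and the example numbers are hypothetical.

```python
def tsvm_objective(w, b, labeled, unlabeled, y_hat, c_l, c_u):
    """Objective value for a fixed hyperplane (w, b) and label assignment y_hat.

    labeled: list of (x, y) pairs with y in {-1, +1}; unlabeled: list of x;
    y_hat: assigned labels for the unlabeled samples, in the same order.
    """
    def slack(x, y):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        return max(0.0, 1.0 - margin)  # smallest slack that satisfies the constraint

    regularizer = 0.5 * sum(wi * wi for wi in w)
    labeled_loss = sum(slack(x, y) for x, y in labeled)
    unlabeled_loss = sum(slack(x, y) for x, y in zip(unlabeled, y_hat))
    return regularizer + c_l * labeled_loss + c_u * unlabeled_loss


w, b = [1.0, 0.0], 0.0
labeled = [([2.0, 0.0], 1), ([-2.0, 0.0], -1)]
unlabeled = [[0.5, 0.0], [-3.0, 1.0]]

# The first unlabeled sample lies inside the margin, so it contributes slack 0.5.
print(tsvm_objective(w, b, labeled, unlabeled, [1, -1], c_l=1.0, c_u=0.1))
# Flipping its assigned label raises its slack to 1.5 and the objective with it.
print(tsvm_objective(w, b, labeled, unlabeled, [-1, -1], c_l=1.0, c_u=0.1))
```

Because the unlabeled slack is scaled by a small `c_u`, a wrong assigned label costs much less than a wrong labeled sample would, which is the point of keeping the unlabeled trade-off parameter small while the assigned labels are still unreliable.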

Clearly, trying every possible label assignment for the unlabeled samples is an exhaustive search: u unlabeled samples admit 2^u assignments. The problem can therefore be solved directly only when there are very few unlabeled samples. In general, a more efficient optimization strategy is needed.

TSVM uses local search to find an approximate solution iteratively. It first learns an SVM from the labeled samples alone, that is, it ignores the terms and constraints that involve the unlabeled samples and their labels. It then uses this SVM to assign labels to the unlabeled data: the SVM predictions serve as "pseudo-labels" for the unlabeled samples. With the pseudo-labels fixed, substituting them into the formulation gives a standard SVM problem, which can be solved for a new separating hyperplane and slack vector. Because the pseudo-labels may be inaccurate at this stage, the trade-off parameter for unlabeled samples, C_u, must be set smaller than the one for labeled samples, C_l, so that the labeled samples carry more weight. Next, TSVM finds two unlabeled samples that have been assigned opposite labels and are both likely to be wrong, swaps their labels, and solves again for the separating hyperplane and slack vector. It repeats this process while gradually increasing C_u, which raises the influence of the unlabeled samples on the optimization objective, and then moves on to the next round of label adjustment, until C_u = C_l. The SVM obtained at that point not only provides labels for the unlabeled samples but can also predict samples not seen during training.
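The local search can be sketched as follows. This is a simplified illustration, not a reference implementation: the linear SVM is trained by plain subgradient descent rather than an exact solver, the swap test (two oppositely labeled samples with positive slacks that sum to more than 2) and the doubling schedule for C_u follow the common textbook description of TSVM, and `max_swaps` is a safeguard added here because the approximate solver gives no termination guarantee. All function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(xs, ys, costs, epochs=200, lr=0.01):
    """Subgradient descent on 0.5*||w||^2 + sum_i costs[i] * hinge_i."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of the regularizer is w
        for x, y, c in zip(xs, ys, costs):
            if y * (dot(w, x) + b) < 1.0:  # margin violated, hinge loss is active
                for d in range(dim):
                    grad_w[d] -= c * y * x[d]
                grad_b -= c * y
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def find_swap_pair(w, b, unlabeled_x, y_hat):
    """Return indices of two oppositely labeled samples that both look wrong."""
    xi = [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(unlabeled_x, y_hat)]
    for i in range(len(xi)):
        for j in range(i + 1, len(xi)):
            if y_hat[i] * y_hat[j] < 0 and xi[i] > 0 and xi[j] > 0 and xi[i] + xi[j] > 2:
                return i, j
    return None


def tsvm(labeled_x, labeled_y, unlabeled_x, c_l=1.0, c_u_init=0.01, max_swaps=100):
    n_l, n_u = len(labeled_x), len(unlabeled_x)
    # Train on the labeled samples only, then assign pseudo-labels.
    w, b = train_linear_svm(labeled_x, labeled_y, [c_l] * n_l)
    y_hat = [1 if dot(w, x) + b >= 0 else -1 for x in unlabeled_x]
    xs = labeled_x + unlabeled_x
    c_u = c_u_init  # start with a small weight on the unlabeled samples
    while c_u < c_l:
        costs = [c_l] * n_l + [c_u] * n_u
        w, b = train_linear_svm(xs, labeled_y + y_hat, costs)
        for _ in range(max_swaps):
            pair = find_swap_pair(w, b, unlabeled_x, y_hat)
            if pair is None:
                break
            i, j = pair
            y_hat[i], y_hat[j] = -y_hat[i], -y_hat[j]  # swap the two labels
            w, b = train_linear_svm(xs, labeled_y + y_hat, costs)
        c_u = min(2.0 * c_u, c_l)  # let the unlabeled samples count for more
    return w, b, y_hat


labeled_x = [[2.0, 2.0], [-2.0, -2.0]]
labeled_y = [1, -1]
unlabeled_x = [[1.5, 2.5], [2.5, 1.5], [3.0, 3.0], [1.0, 1.0],
               [-1.5, -2.5], [-2.5, -1.5], [-3.0, -3.0], [-1.0, -1.0]]
w, b, y_hat = tsvm(labeled_x, labeled_y, unlabeled_x)
print(y_hat)
```

On this well-separated toy data the initial pseudo-labels are already consistent, so no swap is triggered; the swap step matters when the initial SVM places some unlabeled samples on the wrong side of the margin. The returned `(w, b)` can also be applied to samples that were not part of training.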

Clearly, searching for and adjusting every pair of unlabeled samples whose label assignment may be wrong is a large-scale optimization problem with a heavy computational cost. A key focus of semi-supervised SVM research is therefore the design of efficient optimization strategies, which has led to many methods, such as LDS, based on graph kernel functions and gradient descent, and meanS3VM, based on label mean estimation.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
