A brief introduction to semi-supervised classification algorithms: self-training and co-training


This article is an informal discussion of semi-supervised learning, focusing mainly on semi-supervised classification.


First: why use semi-supervised learning at all?

In general, when the amount of labeled training data is too small, a supervised model cannot meet the demands of the task, so semi-supervised learning is used to improve it. Too few training samples cause two problems: on the one hand, the sample distribution cannot faithfully represent the distribution of the real data; on the other hand, there is simply not enough data for the model to learn from, so it "can only memorize, not learn." Both problems lead to a model that fails to find the true classification boundary of the underlying data. Semi-supervised learning addresses them in two steps: it uses the available (including unlabeled) data to approximate the distribution of the real data in feature space, and then determines the classification boundary on that basis; that is, it estimates P(X) and then P(y|X).


On semi-supervised learning algorithms.

Given the purpose described above, many algorithms offer a corresponding solution, including semi-supervised SVM, Gaussian models, KNN-based models, and so on. Each rests on its own assumption and derives its method from it. For example, the KNN approach assumes that a sample's category is the majority category among its K nearest labeled samples; semi-supervised SVM assumes that the classification boundary passes through the sparsest region of the sample distribution. I won't go into more detail, because in practice these algorithms are hard to operate and their results are difficult to control.

There is also a label propagation algorithm based on KNN. It assumes that a sample point takes the label of its nearest labeled data point, and it works iteratively, assigning a label to only one sample point per iteration. The algorithm is strongly affected by outliers and chance, and tends to perform poorly.
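
For reference, scikit-learn ships a graph-based variant of this idea in sklearn.semi_supervised.LabelPropagation, which propagates all labels jointly over a KNN graph rather than one sample per iteration. A minimal sketch, assuming synthetic data and illustrative parameter values:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.semi_supervised import LabelPropagation

    # Illustrative synthetic data, not from the article.
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    # Pretend only ~10% of the labels are known; -1 marks unlabeled samples.
    rng = np.random.RandomState(0)
    y_semi = np.copy(y)
    y_semi[rng.rand(len(y)) > 0.1] = -1

    model = LabelPropagation(kernel="knn", n_neighbors=7)
    model.fit(X, y_semi)

    # transduction_ holds the labels inferred for every sample,
    # including the originally unlabeled ones.
    print(model.transduction_[:20])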


The main topic: the self-training and co-training algorithms.

The hypothesis behind self-training is this: if the model trained on the existing labeled data is used to predict the unlabeled data, the predictions with high confidence are more likely to be correctly labeled, so those samples can be added to the training set. The flow of the algorithm is as follows (a code sketch follows the steps):

1. Train a model on the current training data and use it to predict the unlabeled data.

2. Take the unlabeled samples whose confidence exceeds a threshold, assign them the labels predicted by the model, and add them to the training set.

3. If the training set and the model meet the requirements, output the current training set and model; otherwise, go back to step 1.
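
A minimal sketch of this loop, assuming a logistic-regression base model, a fixed confidence threshold, and hypothetical arrays X_lab/y_lab (labeled data) and X_unlab (unlabeled pool):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
        for _ in range(max_iter):
            # Step 1: train on the current labeled set, score the unlabeled pool.
            model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
            if len(X_unlab) == 0:
                break
            proba = model.predict_proba(X_unlab)
            confident = proba.max(axis=1) >= threshold

            # Step 3 (simplified stopping rule): stop when no sample qualifies.
            if not confident.any():
                break

            # Step 2: add confident samples, with the model's own labels,
            # to the training set.
            X_lab = np.vstack([X_lab, X_unlab[confident]])
            y_lab = np.concatenate(
                [y_lab, model.classes_[proba[confident].argmax(axis=1)]])
            X_unlab = X_unlab[~confident]
        return model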

Obviously this is an iterative and open-ended process: the stopping requirement in step 3 really comes down to the operator's own judgment, and the choice of model is unrestricted (SVM, random forest, logistic regression, etc.). Here are a few suggestions.

When selecting samples, consider not only the confidence itself but also the confidence gap: a sample should be selected into the training set only when its confidence for one class is significantly higher than its confidence for every other class.

The selection criteria for new samples should be tightened continuously as the iterations proceed.
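
One possible way to implement both suggestions, as a hypothetical selection rule: accept a sample only when its top-class probability beats the runner-up by a margin, and raise that margin each iteration:

    import numpy as np

    def select_confident(proba, base_margin=0.3, iteration=0, step=0.05):
        # Sort class probabilities per row (ascending), then compare
        # the best class against the second best.
        sorted_proba = np.sort(proba, axis=1)
        margin = sorted_proba[:, -1] - sorted_proba[:, -2]
        # The required margin grows with each iteration, so later rounds
        # admit only increasingly unambiguous samples.
        return margin >= base_margin + iteration * step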

This process needs to be handled very cautiously, because a careless step can inject large errors into the training set through the newly added samples, and the expected results will not be obtained. Since self-training learns iteratively from its own predictions, it is prone to "deviation."


The co-training algorithm:

Co-training is aimed at the defects of self-training: instead of a model learning from itself alone, two models learn from each other. Each iteration produces two models, and each model has its own separate training set.

The process is much the same as above, except that there are two training sets (formed by randomly dividing the original training set into two complementary parts). In each iteration each training set trains its own model, and the samples a model labels with high confidence are added to the other model's training set. Note that they are added to the other's training set, not the model's own. In the next iteration each model is retrained on its own (now enlarged) training set.

It can be seen that the two models correct each other, which to some extent prevents the semi-supervised "deviation" problem, but there is a precondition: the two models must be trained on different feature sets, otherwise the scheme is ineffective. The choice of model, on the other hand, does not matter; the two training sets can even use different algorithms.
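
A minimal sketch of the whole co-training loop under these constraints, assuming the feature columns split cleanly into two views and using two different base models; all names and thresholds are illustrative:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    def co_train(X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
        # Two different feature sets (views): here, the columns split in half.
        half = X_lab.shape[1] // 2
        views = [slice(0, half), slice(half, None)]
        models = [LogisticRegression(max_iter=1000),
                  DecisionTreeClassifier(max_depth=5)]

        # Randomly divide the labeled data into two complementary training sets.
        rng = np.random.RandomState(0)
        idx = rng.permutation(len(X_lab))
        mid = len(X_lab) // 2
        sets = [(X_lab[idx[:mid]], y_lab[idx[:mid]]),
                (X_lab[idx[mid:]], y_lab[idx[mid:]])]

        for _ in range(max_iter):
            if len(X_unlab) == 0:
                break
            keep = np.ones(len(X_unlab), dtype=bool)
            for i in (0, 1):
                Xi, yi = sets[i]
                models[i].fit(Xi[:, views[i]], yi)
                proba = models[i].predict_proba(X_unlab[:, views[i]])
                confident = proba.max(axis=1) >= threshold
                if confident.any():
                    labels = models[i].classes_[proba[confident].argmax(axis=1)]
                    # Confident samples go into the OTHER model's training set.
                    Xj, yj = sets[1 - i]
                    sets[1 - i] = (np.vstack([Xj, X_unlab[confident]]),
                                   np.concatenate([yj, labels]))
                    keep &= ~confident
            if keep.all():   # no progress this round
                break
            X_unlab = X_unlab[keep]
        return models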


Finally, one important point:

The results of semi-supervised learning can no longer be judged solely by the metrics of supervised learning.

That may sound strange, but it really isn't.

In most cases the validation set is drawn from the same original labeled data, so whatever problems the training set has, the validation set generally has too, for example, too few samples (lack of representativeness, too much chance) or an uneven distribution. In that situation, using the validation set to evaluate the results of semi-supervised learning is itself problematic. Of course, if your validation set is well constructed, then it is entirely usable.

Personally, I feel that the difficulty of evaluating semi-supervised results is also an important reason restricting its development.

I'll just leave it here.

