0 Introduction
Machine learning is a core research area of artificial intelligence and an important approach to intelligent information processing. Supervised learning is one of the most widely used learning paradigms in machine learning. In traditional supervised learning, the learning system learns from a large number of labeled training samples (labeled examples) and builds a model for predicting the labels of unseen samples. Here, a label corresponds to the output of a sample and characterizes the target concept to be learned.
With the development of data collection and storage technology, it has become quite easy to collect large amounts of unlabeled data, but it remains relatively difficult to provide labels for such data, because the labeling process often consumes considerable manpower and material resources and may even depend on a few domain experts. For example, in computer-aided diagnosis, training data can easily be obtained from routine physical examinations at a hospital, but it is unrealistic to expect authoritative medical experts to provide a diagnosis for every examination record. Indeed, in practical applications a large amount of unlabeled data typically coexists with a small amount of labeled data. Because so few labeled samples are available for supervised learning, it is usually difficult to learn a model with strong generalization ability. How to use a large amount of unlabeled data to help improve the generalization ability of models learned from a small amount of labeled data has therefore become one of the most important issues in machine learning.
Currently there are three major techniques for learning from unlabeled data: semi-supervised learning, transductive learning, and active learning. Unlike transductive learning, which focuses only on prediction performance on the given unlabeled data, and active learning, which relies on human intervention, semi-supervised learning automatically exploits unlabeled data to learn a model that generalizes well over the entire data distribution. The whole learning process requires no human intervention; the unlabeled data are used entirely by the learning system itself. Owing to these characteristics and its broad application needs, semi-supervised learning has developed into a hot topic in machine learning over the past ten years. This article therefore gives a brief introduction to the research progress of semi-supervised learning.
1 The role of unlabeled data
Why can unlabeled data, which carry no concept labels, help a learner learn the target concept? Figure 1 gives a simple example, where "+" denotes a labeled positive sample, "-" a labeled negative sample, and "." an unlabeled sample, and the task is to predict the label of a test sample. If learning is based on the labeled samples alone (Figure 1(a)), it is natural to classify the test sample as positive. However, once a large number of unlabeled samples are taken into account (Figure 1(b)), we find that the test sample and the labeled negative sample fall into the same cluster; since there is good reason to believe that samples in the same cluster share similar properties, it is more reasonable to predict the test sample as negative. This example shows that the distribution information provided by unlabeled data can help learning.
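As a toy illustration of this cluster effect (not from the original article; the sketch assumes scikit-learn's LabelSpreading and a synthetic two-moons dataset), a graph-based semi-supervised learner can recover the two clusters from a single labeled point per class, while a purely supervised SVM trained on those same two points cannot:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading
from sklearn.svm import SVC

# Two interleaving half-moon clusters; keep only one label per class.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)
i0 = np.where(y_true == 0)[0][0]   # one labeled sample of class 0
i1 = np.where(y_true == 1)[0][0]   # one labeled sample of class 1

# Supervised baseline: sees only the two labeled points.
svm = SVC(kernel="linear").fit(X[[i0, i1]], y_true[[i0, i1]])
print("SVM accuracy:           ", (svm.predict(X) == y_true).mean())

# Semi-supervised learner: -1 marks unlabeled samples.
y = np.full(len(X), -1)
y[i0], y[i1] = y_true[i0], y_true[i1]
lp = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
print("LabelSpreading accuracy:", (lp.transduction_ == y_true).mean())
```

The semi-supervised model exploits exactly the distribution information described above: labels spread along the cluster structure of the unlabeled points rather than across the gap between clusters.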
Figure 1 The role of unlabeled data
The earliest theoretical explanation of the utility of unlabeled data appeared in the work of Miller et al. [1] in 1997. They assume that the training data follow a mixture distribution with M components and, under the maximum a posteriori (MAP) rule, derive an optimal classification function formed from the products of P(y | m_j, x) and P(m_j | x), where m_j denotes the j-th mixture component. The learning goal is to estimate these two terms from the training data. Since the second term does not depend on the sample label y, a large amount of unlabeled data can help improve the estimation accuracy of P(m_j | x). Later, Zhang et al. [2] further analyzed semi-supervised learning and pointed out that if a parameterized model can be decomposed into the form p(x, y | θ) = p(y | x, θ) p(x | θ), then the role of unlabeled data is that they help estimate the model parameters better and thus improve model performance.
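For concreteness, the two decompositions above can be written out in display form (the notation is reconstructed from the prose; f denotes the resulting classifier):

```latex
% MAP-optimal classifier under an M-component mixture (Miller et al. [1])
f(x) = \arg\max_{y} \sum_{j=1}^{M} P(y \mid m_j, x)\, P(m_j \mid x)

% Decomposition analyzed by Zhang et al. [2]; unlabeled data improve the
% estimate of the label-independent factor p(x \mid \theta)
p(x, y \mid \theta) = p(y \mid x, \theta)\, p(x \mid \theta)
```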
In fact, to make full use of unlabeled data in the learning process, one must establish a connection between the distribution of the unlabeled data and the prediction model. In a generative model this connection is embodied in the data-generating process itself: the model determines how the unlabeled data are distributed. For general-purpose learners, certain assumptions are instead used to relate the prediction model to the unlabeled data. In semi-supervised learning, the cluster assumption and the manifold assumption are the two most common such assumptions. The cluster assumption requires the prediction model to assign the same class label to data in the same cluster and is usually applicable to classification problems. The manifold assumption requires the prediction model to give similar outputs to similar inputs; besides classification, it applies to regression, ranking, and other tasks, and in some cases it can be viewed as a natural extension of the cluster assumption. Most existing semi-supervised learning methods directly or indirectly embody these assumptions.
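As one concrete (and common, though not the only) way to encode the manifold assumption, output differences between similar inputs can be penalized by a graph-based regularizer:

```latex
\Omega(f) \;=\; \frac{1}{2} \sum_{i,j} w_{ij} \bigl( f(x_i) - f(x_j) \bigr)^{2}
          \;=\; \mathbf{f}^{\top} L \, \mathbf{f},
\qquad L = D - W, \quad D_{ii} = \sum\nolimits_{j} w_{ij}
```

Here w_ij measures the similarity between x_i and x_j, W = (w_ij), and L is the graph Laplacian; this regularizer underlies the implicit label propagation methods discussed in Section 2.3 [12].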
2 Semi-supervised learning methods
The work currently recognized as the earliest on semi-supervised learning is Shahshahani et al.'s [3] use of unlabeled samples in satellite remote sensing image analysis in 1994. Since then, semi-supervised learning has attracted wide attention, and many semi-supervised learning methods have been proposed. These methods can be roughly divided into four categories: generative-model-based semi-supervised learning, low-density-separation-based semi-supervised learning, graph-based semi-supervised learning, and disagreement-based semi-supervised learning.
2.1 Semi-supervised learning based on generative models
This class of methods usually treats the probabilities of unlabeled samples belonging to each class as missing parameters and then uses the EM (expectation-maximization) algorithm to perform maximum likelihood estimation of the generative model's parameters. Different methods differ in the generative model chosen as the base classifier, such as a mixture of Gaussians [3], a mixture of experts [1], or naive Bayes [4]. Generative-model-based semi-supervised learning is simple and intuitive, and when training samples, especially labeled ones, are very scarce it can outperform discriminative models. However, when the model assumption is inconsistent with the true data distribution, using a large amount of unlabeled data to estimate the model parameters can actually degrade the generalization ability of the learned model [5]. Since finding an appropriate generative model for the data requires a great deal of domain knowledge, the practical applicability of generative-model-based semi-supervised learning is limited.
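The following is a minimal sketch of this EM scheme (an illustration assuming one diagonal-covariance Gaussian per class; it is not the exact algorithm of [1], [3], or [4], and all names are this sketch's own). Labeled samples keep fixed one-hot responsibilities, while the class memberships of unlabeled samples are re-estimated in each E-step:

```python
import numpy as np

def semi_supervised_gmm(X_l, y_l, X_u, n_classes, n_iter=50):
    """EM for a class-conditional Gaussian mixture (one Gaussian per class).

    Labeled samples keep their labels fixed; for unlabeled samples the
    class membership is a hidden variable re-estimated in each E-step.
    """
    # Initialize parameters from the labeled data alone.
    pi = np.array([np.mean(y_l == c) for c in range(n_classes)])
    mu = np.array([X_l[y_l == c].mean(axis=0) for c in range(n_classes)])
    var = np.array([X_l[y_l == c].var(axis=0) + 1e-6 for c in range(n_classes)])

    def gauss(X, m, v):
        # Diagonal-covariance Gaussian density, evaluated row-wise.
        return np.exp(-0.5 * np.sum((X - m) ** 2 / v, axis=1)) / \
               np.sqrt(np.prod(2 * np.pi * v))

    # Hard (one-hot) responsibilities for the labeled part never change.
    R_l = np.eye(n_classes)[y_l]
    for _ in range(n_iter):
        # E-step: posterior class probabilities for unlabeled samples.
        dens = np.stack([pi[c] * gauss(X_u, mu[c], var[c])
                         for c in range(n_classes)], axis=1)
        R_u = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update parameters on labeled + unlabeled data together.
        R = np.vstack([R_l, R_u])
        X = np.vstack([X_l, X_u])
        Nk = R.sum(axis=0)
        pi = Nk / Nk.sum()
        mu = (R.T @ X) / Nk[:, None]
        var = np.stack([(R[:, c:c + 1] * (X - mu[c]) ** 2).sum(axis=0) / Nk[c]
                        for c in range(n_classes)]) + 1e-6
    return pi, mu, var, R_u
```

The returned R_u contains the soft labels of the unlabeled samples; a test sample is classified by the same posterior computation used in the E-step.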
2.2 Semi-supervised learning based on low-density separation
This class of methods requires the decision boundary to pass through regions where data are sparse, thereby avoiding splitting dense clusters of data points onto opposite sides of the decision boundary. Based on this idea, Joachims [6] proposed the TSVM algorithm (Figure 2; the solid line is the TSVM classification boundary, while the dashed line is the SVM boundary learned without unlabeled data). During training, TSVM first trains an SVM on the labeled data and uses it to estimate labels for the unlabeled data; it then iteratively swaps the labels of sample pairs lying on opposite sides of the classification boundary according to the maximum-margin criterion and updates the current prediction model, pushing the decision boundary toward regions of sparse data while keeping the labeled data classified as correctly as possible. However, the loss function of TSVM is non-convex, so the learning process can fall into local minima, which hurts generalization ability. For this reason, a variety of TSVM variants have been proposed to mitigate the impact of the non-convex loss on optimization; typical methods include deterministic annealing [7] and direct optimization via CCCP [8]. In addition, the idea of low-density separation has also been applied to the design of semi-supervised learning methods beyond TSVM, for example, entropy regularization, which forces the learned classification boundary to avoid dense data regions [9].
Figure 2 The TSVM algorithm [6]
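The outer label-switching loop of TSVM can be sketched as follows (a simplified illustration of the procedure described in [6], using scikit-learn's SVC with sample weights to approximate separate misclassification costs; all function and parameter names are this sketch's own):

```python
import numpy as np
from sklearn.svm import SVC

def simple_tsvm(X_l, y_l, X_u, C_l=1.0, C_u_max=1.0, n_outer=10):
    """Simplified sketch of Joachims-style TSVM label switching.

    y_l must be in {-1, +1}. The weight on unlabeled samples is annealed
    upward while pairs of oppositely labeled unlabeled samples that both
    violate the margin are swapped.
    """
    # Step 1: train on labeled data only and guess labels for X_u.
    clf = SVC(kernel="linear", C=C_l).fit(X_l, y_l)
    y_u = np.sign(clf.decision_function(X_u))
    y_u[y_u == 0] = 1

    C_u = 1e-3
    X = np.vstack([X_l, X_u])
    for _ in range(n_outer):
        w = np.concatenate([np.full(len(X_l), C_l), np.full(len(X_u), C_u)])
        while True:
            clf = SVC(kernel="linear", C=1.0).fit(
                X, np.concatenate([y_l, y_u]), sample_weight=w)
            # Margin violations (slacks) of the unlabeled samples.
            xi = np.maximum(0, 1 - y_u * clf.decision_function(X_u))
            pos = np.where((y_u > 0) & (xi > 0))[0]
            neg = np.where((y_u < 0) & (xi > 0))[0]
            # Swap the worst-violating +/- pair if that lowers the loss.
            if len(pos) and len(neg):
                i, j = pos[np.argmax(xi[pos])], neg[np.argmax(xi[neg])]
                if xi[i] + xi[j] > 2:
                    y_u[i], y_u[j] = -1, 1
                    continue
            break
        C_u = min(2 * C_u, C_u_max)   # anneal the unlabeled weight upward
    return clf, y_u
```

Each swap of an oppositely labeled pair with combined slack above 2 strictly reduces the objective, which is what drives the boundary into low-density regions.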
2.3 Graph-based semi-supervised learning
This class of methods constructs a data graph over both labeled and unlabeled data and propagates labels from labeled data points to unlabeled ones along the neighborhood relations in the graph (Figure 3; the light gray and black nodes are labeled samples of different classes, and the hollow nodes are unlabeled samples). Graph-based semi-supervised learning methods can be divided into two categories according to how labels are propagated. One category performs explicit label propagation by defining a propagation process that satisfies certain properties, for example, label propagation based on Gaussian random fields and harmonic functions [10] and label propagation based on global and local consistency [11]. The other category performs implicit label propagation by defining a regularization term on the graph; for example, a regularizer defined on the manifold forces the prediction function to give similar outputs to neighboring nodes in the graph, so that labels are implicitly propagated from labeled to unlabeled samples [12]. In fact, the choice of label propagation method affects learning performance far less than the way the data graph is constructed. If the data graph deviates from the intrinsic structure of the data, it is difficult to obtain satisfactory results no matter which propagation method is used. However, constructing a data graph that faithfully reflects the intrinsic relations of the data often requires a great deal of domain knowledge. Fortunately, in some cases the graph can still be processed on the basis of its own properties to obtain more robust results; for example, when the data graph does not satisfy the metric property, a non-metric graph can be decomposed into several metric graphs on which label propagation is then performed, overcoming the negative impact of non-metric graphs on propagation [13]. Graph-based semi-supervised learning has a solid mathematical foundation, but because the time complexity of most of these learning algorithms is O(n³), it is difficult for them to meet the demands that real applications place on semi-supervised learning from large-scale unlabeled data.
Figure 3 Label propagation
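The explicit label propagation of Zhu et al. [10] has a particularly compact closed form: partition the graph Laplacian into labeled and unlabeled blocks and solve a linear system for the unlabeled block. A minimal sketch (variable names are this sketch's own):

```python
import numpy as np

def harmonic_label_propagation(W, y_l, n_labeled):
    """Label propagation via the harmonic function of Zhu et al. [10].

    W        : (n, n) symmetric similarity matrix over all samples, with
               the first n_labeled rows/columns corresponding to labeled data.
    y_l      : (n_labeled, k) one-hot label matrix for the labeled part.
    Returns  : (n - n_labeled, k) soft labels for the unlabeled part.
    """
    D = np.diag(W.sum(axis=1))          # degree matrix
    L = D - W                           # unnormalized graph Laplacian
    l = n_labeled
    L_uu = L[l:, l:]                    # block over unlabeled nodes
    L_ul = L[l:, :l]                    # coupling block unlabeled <- labeled
    # Harmonic solution: f_u = -L_uu^{-1} L_ul f_l
    return np.linalg.solve(L_uu, -L_ul @ y_l)

# Toy usage: nodes 0,1 labeled; node 2 linked to node 0, node 3 to node 1.
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
y_l = np.array([[1.0, 0.0],            # node 0 -> class 0
                [0.0, 1.0]])           # node 1 -> class 1
print(harmonic_label_propagation(W, y_l, n_labeled=2))
```

Solving this dense linear system is also where the O(n³) cost mentioned above comes from.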
2.4 Disagreement-based semi-supervised learning
This class of methods exploits unlabeled data through the collaboration of multiple learners with significant diversity. During the iterative learning process, when the learners disagree in their predictions on an unlabeled sample, if the confidence of some learners is significantly higher than that of the others, the low-confidence learners learn from the labels given by the high-confidence learners; if the prediction confidence of all learners is low, the learners can obtain label information by interacting with the outside world. Here, the unlabeled data in effect provide an "information exchange platform" for the learners. Disagreement-based semi-supervised learning originated from the co-training algorithm proposed by Blum et al. [14] in 1998 (Figure 4). The algorithm assumes that the data have two sufficient and redundant views, that is, a learner with strong generalization ability can be learned from each view, and the views are conditionally independent given the class label; semi-supervised learning then proceeds by having the learners on different views label unlabeled samples for each other. Blum et al. proved that when these assumptions hold, co-training can use unlabeled data to improve learner performance. In most real applications, however, the data do not have sufficient and redundant views, so researchers instead use multiple different learners on a single view in place of learners on multiple views. Typical methods include co-training based on special base learners [15], tri-training, which cooperates three classifiers [16], the multi-classifier-ensemble-based method Co-Forest [17], and the semi-supervised regression method COREG, which estimates labeling confidence based on consistency [18]. Recently, Wang et al. [19] revealed theoretically that the key to the effectiveness of co-training is a large enough disagreement (diversity) among the learners, which provides a theoretical basis for replacing sufficient and redundant views with multiple diverse learners. Reference [20] surveys disagreement-based semi-supervised learning methods.
Figure 4 Co-training
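A minimal co-training sketch follows (an illustration of the scheme in [14], assuming binary labels in {0, 1} and Gaussian naive Bayes base learners; names and parameters are this sketch's own). In each round, each view's classifier labels its most confident positive and negative unlabeled samples for the other view:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, n_rounds=10, p=1, n=1):
    """Minimal sketch of co-training in the spirit of Blum & Mitchell [14].

    X1_*/X2_* are the two views of the labeled/unlabeled data; labels must
    be in {0, 1}. Each round moves p confident positives and n confident
    negatives per view from the unlabeled pool into the labeled set.
    """
    X1_l, X2_l, y_l = X1_l.copy(), X2_l.copy(), y_l.copy()
    pool = np.arange(len(X1_u))
    h1, h2 = GaussianNB(), GaussianNB()
    for _ in range(n_rounds):
        if len(pool) == 0:
            break
        h1.fit(X1_l, y_l)
        h2.fit(X2_l, y_l)
        picked = set()
        for h, X_u in ((h1, X1_u), (h2, X2_u)):
            proba = h.predict_proba(X_u[pool])
            # Most confident positive/negative candidates from this view.
            for cls, k in ((1, p), (0, n)):
                for i in pool[np.argsort(proba[:, cls])[::-1][:k]]:
                    if i not in picked:
                        picked.add(i)
                        X1_l = np.vstack([X1_l, X1_u[i:i + 1]])
                        X2_l = np.vstack([X2_l, X2_u[i:i + 1]])
                        y_l = np.append(y_l, cls)
        pool = np.array([i for i in pool if i not in picked])
    return h1.fit(X1_l, y_l), h2.fit(X2_l, y_l)
```

The two returned classifiers can be combined, for example by multiplying their predicted class probabilities, to label new samples.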
The above four categories of semi-supervised learning methods have been successfully applied in natural language processing, Internet search, software engineering, bioinformatics, medicine, and other fields, with good results. For example, Li et al. [21] designed the semi-supervised ranking method SSRank within the disagreement-based framework, effectively improving Internet search performance by using unlabeled data; Xu et al. [22] applied the Co-Forest algorithm to protein subcellular localization, where learning with unlabeled data improved prediction performance by 10% over existing supervised learning methods.
3 Conclusion
Semi-supervised learning is an important technology for exploiting unlabeled data: without external intervention, it automatically uses large amounts of unlabeled data to improve a learner's generalization ability over the entire data distribution. This article has briefly introduced the role of unlabeled data in semi-supervised learning, the categorization of semi-supervised learning methods, and representative algorithms.
Although semi-supervised learning has made great strides, some important issues remain to be studied. For example, how many labeled samples are needed for effective semi-supervised learning? This question has been given a preliminary answer in special settings [23], but in general the minimum labeled-sample requirement of semi-supervised learning remains an open problem. Another question worth studying is under what circumstances semi-supervised learning actually works: previous studies have shown that improper use of unlabeled data may significantly reduce a learner's generalization ability. Designing safe semi-supervised learning methods, whose performance is never degraded by the use of unlabeled data, would therefore help semi-supervised learning solve more real problems. In addition, applying semi-supervised learning to more practical problems will continue to be an important part of semi-supervised learning research.
References:
[1] Miller D J, Uyar H S. A mixture of experts classifier with learning based on both labelled and unlabelled data [C] // Mozer M, Jordan M I, Petsche T, et al. Advances in Neural Information Processing Systems 9. Cambridge: MIT Press, 1997: 571-577.
[2] Zhang T, Oles F J. A probability analysis on the value of unlabeled data for classification problems [C] // Proceedings of the 17th International Conference on Machine Learning. Stanford: [s.n.], 2000: 1191-1198.
[3] Shahshahani B, Landgrebe D. The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon [J]. IEEE Transactions on Geoscience and Remote Sensing, 1994, 32(5): 1087-1095.
[4] Nigam K, McCallum A K, Thrun S, et al. Text classification from labeled and unlabeled documents using EM [J]. Machine Learning, 2000, 39(2-3): 103-134.
[5] Cozman F G, Cohen I. Unlabeled data can degrade classification performance of generative classifiers [C] // Proceedings of the 15th International Conference of the Florida Artificial Intelligence Research Society. Pensacola: [s.n.], 2002: 327-331.
[6] Joachims T. Transductive inference for text classification using support vector machines [C] // Proceedings of the 16th International Conference on Machine Learning. Bled, Slovenia: [s.n.], 1999: 200-209.
[7] Sindhwani V, Keerthi S, Chapelle O. Deterministic annealing for semi-supervised kernel machines [C] // Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh: [s.n.], 2006: 123-130.
[8] Collobert R, Sinz F, Weston J, et al. Large scale transductive SVMs [J]. Journal of Machine Learning Research, 2006, 7(8): 1687-1712.
[9] Grandvalet Y, Bengio Y. Semi-supervised learning by entropy minimization [C] // Saul L K, Weiss Y, Bottou L, et al. Advances in Neural Information Processing Systems 17. Cambridge: MIT Press, 2005: 529-536.
[10] Zhu X, Ghahramani Z, Lafferty J. Semi-supervised learning using Gaussian fields and harmonic functions [C] // Proceedings of the 20th International Conference on Machine Learning. Washington: [s.n.], 2003: 912-919.
[11] Zhou D, Bousquet O, Lal T N, et al. Learning with local and global consistency [C] // Thrun S, Saul L, Schölkopf B, et al. Advances in Neural Information Processing Systems 16. Cambridge: MIT Press, 2004: 321-328.
[12] Belkin M, Niyogi P, Sindhwani V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples [J]. Journal of Machine Learning Research, 2006, 7(11): 2399-2434.
[13] Zhang Yin, Zhou Zhihua. Non-metric label propagation [C] // Proceedings of the 21st International Joint Conference on Artificial Intelligence. Pasadena: [s.n.], 2009: 1357-1362.
[14] Blum A, Mitchell T. Combining labeled and unlabeled data with co-training [C] // Proceedings of the 11th Annual Conference on Computational Learning Theory. Madison: [s.n.], 1998: 92-100.
[15] Goldman S, Zhou Y. Enhancing supervised learning with unlabeled data [C] // Proceedings of the 17th International Conference on Machine Learning. San Francisco: [s.n.], 2000: 327-334.
[16] Zhou Zhihua, Li Ming. Tri-training: Exploiting unlabeled data using three classifiers [J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(11): 1529-1541.
[17] Li Ming, Zhou Zhihua. Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples [J]. IEEE Transactions on Systems, Man and Cybernetics - Part A: Systems and Humans, 2007, 37(6): 1088-1098.
[18] Zhou Zhihua, Li Ming. Semi-supervised regression with co-training style algorithms [J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(11): 1479-1493.
[19] Wang Wei, Zhou Zhihua. Analyzing co-training style algorithms [C] // Proceedings of the 18th European Conference on Machine Learning. Warsaw: [s.n.], 2007: 454-465.
[20] Zhou Zhihua, Li Ming. Semi-supervised learning by disagreement [J]. Knowledge and Information Systems, 2010, 24(3): 415-439.
[21] Li Ming, Li Hang, Zhou Zhihua. Semi-supervised document retrieval [J]. Information Processing & Management, 2009, 45(3): 341-355.
[22] Xu Qian, Hu Derek Hao, Xue Hong, et al. Semi-supervised protein subcellular localization [J]. BMC Bioinformatics, 2009, 10(S1): S47.
[23] Zhou Zhihua, Zhan Dechuan, Yang Qiang. Semi-supervised learning with very few labeled training examples [C] // Proceedings of the 22nd AAAI Conference on Artificial Intelligence. Vancouver: [s.n.], 2007: 675-680.
About the author: Dr. Li Ming, associate professor of computer science and technology at Nanjing University and member of the Chinese Association for Artificial Intelligence. His research interests include machine learning, data mining, and information retrieval. E-mail: [email protected]
Funding: National Natural Science Foundation of China (60903103)
Source: http://caai.cn/contents/421/3585.html