Transfer Learning (reproduced)

Original address: http://blog.csdn.net/miscclp/article/details/6339456
Under the traditional machine learning framework, the task is to learn a classification model from sufficient labeled training data and then use this model to classify and predict test documents. However, current research on Web mining exposes a key problem with this approach: in many newly emerging domains, labeled training data is very scarce. Web applications develop rapidly, and new domains keep emerging, from traditional news, to web pages, to images, to blogs, podcasts, and more. Traditional machine learning would require a large amount of training data to be labeled in each of these domains, which is costly and labor intensive; without abundant labeled data, much learning-related research and many applications cannot proceed.

Second, traditional machine learning assumes that the training data follow the same distribution as the test data. In many cases, however, this identical-distribution assumption is not satisfied; what usually happens is that the training data go out of date. This often forces us to re-annotate large amounts of training data to meet our needs, and labeling new data is very expensive, requiring a great deal of manpower and material resources. Seen from another angle, if we already possess large amounts of training data under a different distribution, discarding that data entirely is very wasteful. How to use such data sensibly is the main problem that transfer learning sets out to solve. Transfer learning can carry knowledge over from existing data to help with future learning: its goal is to apply knowledge learned in one environment to the learning tasks in a new environment. Transfer learning therefore does not make the identical-distribution assumption of traditional machine learning.
Our work on transfer learning can be divided into three parts: instance-based transfer learning in a homogeneous feature space, feature-based transfer learning in a homogeneous feature space, and transfer learning across heterogeneous feature spaces. Our research indicates that instance-based transfer learning transfers knowledge more strongly, feature-based transfer learning transfers knowledge more broadly, and transfer across heterogeneous spaces greatly extends the range of problems to which learning can be applied. The three approaches thus suit different settings.
1. Instance-based transfer learning in a homogeneous space
The basic idea of instance-based transfer learning is that although the auxiliary training data differ somewhat from the source training data, some part of the auxiliary data should still be suitable for training an effective classification model that fits the test data. Our goal, then, is to identify from the auxiliary training data those instances that suit the test data, and to transfer those instances into learning on the source training data. For instance-based transfer learning, we generalized the traditional AdaBoost algorithm and proposed a boosting algorithm with transfer capability, TrAdaBoost [9], which makes maximal use of the auxiliary training data to help classification on the target. The key idea is to use boosting to filter out the auxiliary data that are most unlike the source training data. Boosting here establishes a mechanism of automatic weight adjustment: the weights of important auxiliary training instances increase, while the weights of unimportant auxiliary instances decrease. After the weights have been adjusted, the re-weighted auxiliary data serve as additional training data and, together with the source training data, improve the reliability of the classification model.
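To make the weight-adjustment mechanism concrete, here is a minimal sketch of a TrAdaBoost-style training loop, assuming binary labels in {0, 1}; the decision-stump weak learner, the function names, and the hyperparameters are illustrative choices, not taken from the paper's implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tradaboost(X_aux, y_aux, X_src, y_src, n_iters=20):
    """X_aux, y_aux: auxiliary (differently distributed) labeled data.
    X_src, y_src: the small same-distribution training set. Labels in {0, 1}."""
    y_aux, y_src = np.asarray(y_aux), np.asarray(y_src)
    n, m = len(X_aux), len(X_src)
    X = np.vstack([X_aux, X_src])
    y = np.concatenate([y_aux, y_src])
    w = np.ones(n + m) / (n + m)                  # uniform initial weights
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / n_iters))  # fixed aux factor
    learners, betas = [], []
    for _ in range(n_iters):
        stump = DecisionTreeClassifier(max_depth=1)  # illustrative weak learner
        stump.fit(X, y, sample_weight=w / w.sum())
        pred = stump.predict(X)
        # The training error is measured only on the same-distribution part.
        err = np.abs(pred[n:] - y_src)
        eps = np.clip(np.sum(w[n:] * err) / np.sum(w[n:]), 1e-10, 0.499)
        beta_t = eps / (1.0 - eps)
        # Down-weight auxiliary instances the weak learner misclassifies ...
        w[:n] *= beta ** np.abs(pred[:n] - y_aux)
        # ... and up-weight misclassified same-distribution instances.
        w[n:] *= beta_t ** (-err)
        learners.append(stump)
        betas.append(beta_t)
    return learners, betas

def tradaboost_predict(learners, betas, X_test):
    """Weighted vote over the second half of the iterations, as in [9]."""
    half = len(learners) // 2
    votes = sum(-np.log(b) * clf.predict(X_test)
                for clf, b in zip(learners[half:], betas[half:]))
    threshold = 0.5 * sum(-np.log(b) for b in betas[half:])
    return (votes >= threshold).astype(int)
```

The asymmetry between the two update rules is the whole trick: a misclassified auxiliary instance is multiplied by a fixed factor beta < 1 and fades away, while a misclassified same-distribution instance is multiplied by 1/beta_t > 1 and gains influence, which is exactly the automatic weight adjustment described above.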
Instance-based transfer learning works only when the source data and the auxiliary data are very similar. When they differ greatly, instance-based transfer learning algorithms often have difficulty finding knowledge that can be transferred. We found, however, that even when the source and target data share no common knowledge at the instance level, they may still overlap at the feature level. We therefore studied feature-based transfer learning, which examines how to use shared feature-level knowledge for learning.
2. Feature-based transfer learning in a homogeneous space
In our research on feature-based transfer learning, we have proposed a variety of learning algorithms, such as the CoCC algorithm [7], the TPLSA algorithm [4], the spectral analysis algorithm [2], and the self-taught clustering algorithm [3]. These methods use clustering to generate a common feature representation that helps the learning algorithm. The basic idea is to cluster the source data and the auxiliary data simultaneously to obtain a common feature representation, which is better than a representation based on the source data alone; transfer learning is achieved by re-representing the source data in this new space. Following this idea, we have proposed both feature-based supervised transfer learning and feature-based unsupervised transfer learning.
2.1 Feature-based supervised transfer learning
Our work on feature-based supervised transfer learning centers on cross-domain classification via co-clustering [7], which addresses the question: given a new domain, different from the original one and with scarce labeled data, how can the large amount of labeled data in the original domain be used for transfer learning? In this work, we give a unified information-theoretic formulation of the cross-domain classification problem, casting co-clustering-based classification as the optimization of an objective function. In our model, the objective function is defined as the loss of mutual information among the source data instances, the common feature space, and the auxiliary data instances.
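In paraphrased notation (our own shorthand, not the paper's exact symbols), the CoCC objective can be sketched as two mutual-information losses, one for the document/word co-clustering and one tying the shared word clusters to the class labels:

```latex
% Sketch of the CoCC objective in [7], paraphrased:
% D_o = out-of-domain documents, W = word features, C = class labels;
% hats denote clustered versions, \lambda trades off the two terms.
\Theta(\hat{D}_o, \hat{W}) =
  \bigl[\, I(D_o; W) - I(\hat{D}_o; \hat{W}) \,\bigr]
  + \lambda \, \bigl[\, I(C; W) - I(C; \hat{W}) \,\bigr]
```

Minimizing this loss preserves as much of the original co-occurrence structure as possible while keeping the shared word clusters predictive of the in-domain labels, so label information effectively flows to the out-of-domain documents through the feature clusters.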
2.2 Feature-based unsupervised transfer learning: self-taught clustering
Our proposed self-taught clustering algorithm [3] belongs to feature-based unsupervised transfer learning. The question we consider here is that the real world offers large amounts of unlabeled data, and we ask how this unlabeled auxiliary data can be used for transfer learning. The basic idea of self-taught clustering is to derive a common feature representation from the source data and the auxiliary data together; because the auxiliary data are abundant, this new representation is better than one based on the source data alone, and it therefore helps the clustering.
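The self-taught clustering objective has a closely related shape (again in our paraphrased notation): two co-clustering losses, one for the target data and one for the auxiliary data, coupled through a single shared feature clustering:

```latex
% Sketch of the self-taught clustering objective in [3], paraphrased:
% X = target instances, Y = auxiliary instances, Z = shared features;
% both terms share the same feature clustering \hat{Z}.
J(\hat{X}, \hat{Y}, \hat{Z}) =
  \bigl[\, I(X; Z) - I(\hat{X}; \hat{Z}) \,\bigr]
  + \lambda \, \bigl[\, I(Y; Z) - I(\hat{Y}; \hat{Z}) \,\bigr]
```

Because the abundant auxiliary data dominate the estimate of the shared feature clusters, the resulting representation is more reliable than one computed from the sparse target data alone, which is why the clustering of the target data improves.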
The two learning strategies presented above (feature-based supervised and unsupervised transfer learning) address feature-based transfer learning when the source data and auxiliary data share the same feature space. When the source data and auxiliary data do not lie in the same feature space, we have also studied transfer learning across feature spaces, which likewise belongs to feature-based transfer learning.
3. Transfer learning in heterogeneous spaces: translated learning
Our proposed translated learning [1][5] is dedicated to the situation where the source data and the test data belong to two different feature spaces. In [1], we use abundant, easily obtainable labeled text data to help with an image classification problem that has only a few labeled images. Our approach uses data carrying both views to construct a bridge between the two feature spaces. Although such multi-view data may not necessarily serve as training data for the classifier, they can be used to build a translator. Through this translator, we combine a nearest-neighbor algorithm with feature translation, translate the auxiliary data into the source data's feature space, and learn and classify with a unified language model.
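A minimal sketch of the translation idea follows, assuming the translator is estimated from co-occurrence counts between image features and text features (e.g. collected from captioned images); the array names and the 1-nearest-neighbor classifier are illustrative stand-ins for the risk-minimization and language-model machinery of [1], and only one direction of translation is shown.

```python
import numpy as np

def build_translator(cooccur):
    """cooccur[i, t] counts how often image feature i co-occurs with text
    feature t; row-normalizing estimates p(text feature | image feature)."""
    return cooccur / np.clip(cooccur.sum(axis=1, keepdims=True), 1e-12, None)

def translate(x_img, translator):
    """Map an image-space feature vector into the text feature space."""
    return x_img @ translator

def classify_1nn(x_img, translator, X_text, y_text):
    """Translate the image, then label it with its nearest labeled text."""
    v = translate(x_img, translator)
    dists = np.linalg.norm(X_text - v, axis=1)
    return y_text[np.argmin(dists)]

# Toy usage: 5 image features, 8 text features, 20 labeled text documents.
rng = np.random.default_rng(0)
T = build_translator(rng.random((5, 8)))
X_text, y_text = rng.random((20, 8)), rng.integers(0, 2, size=20)
print(classify_1nn(rng.random(5), T, X_text, y_text))
```

The point of the sketch is the bridge: the co-occurrence data never enter the classifier as training examples; they only define the mapping that lets labeled data from one feature space be compared with instances from the other.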
References:
[1]. Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, and Yong Yu. Translated Learning: Transfer Learning across Different Feature Spaces. In Advances in Neural Information Processing Systems (NIPS), Vancouver, British Columbia, Canada, December 8-13, 2008.
[2]. Xiao Ling, Wenyuan Dai, Gui-rong Xue, Qiang Yang, and Yong Yu. Spectral Domain-Transfer Learning. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 488-496, Las Vegas, Nevada, USA, August 24-27, 2008.
[3]. Wenyuan Dai, Qiang Yang, Gui-rong Xue, and Yong Yu. Self-taught Clustering. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), pages 200-207, Helsinki, Finland, July 5-9, 2008.
[4]. Gui-rong Xue, Wenyuan Dai, Qiang Yang, and Yong Yu. Topic-bridged PLSA for Cross-domain Text Classification. In Proceedings of the Thirty-First Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), pages 627-634, Singapore, July 20-24, 2008.
[5]. Xiao Ling, Gui-rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang, and Yong Yu. Can Chinese Web Pages Be Classified with English Data Source? In Proceedings of the Seventeenth International World Wide Web Conference (WWW 2008), pages 969-978, Beijing, China, April 21-25, 2008.
[6]. Xiao Ling, Wenyuan Dai, Gui-rong Xue, and Yong Yu. Knowledge Transferring via Implicit Link Analysis. In Proceedings of the Thirteenth International Conference on Database Systems for Advanced Applications (DASFAA), pages 520-528, New Delhi, India, March 19-22, 2008.
[7]. Wenyuan Dai, Gui-rong Xue, Qiang Yang, and Yong Yu. Co-clustering based Classification for Out-of-domain Documents. In Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 210-219, San Jose, California, USA, August 12-15, 2007.
[8]. Wenyuan Dai, Gui-rong Xue, Qiang Yang, and Yong Yu. Transferring Naive Bayes Classifiers for Text Classification. In Proceedings of the Twenty-Second National Conference on Artificial Intelligence (AAAI), pages 540-545, Vancouver, British Columbia, Canada, July 22-26, 2007.
[9]. Wenyuan Dai, Qiang Yang, Gui-rong Xue, and Yong Yu. Boosting for Transfer Learning. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML), pages 193-200, Corvallis, Oregon, USA, June 20-24, 2007.
[10]. Dikan Xing, Wenyuan Dai, Gui-rong Xue, and Yong Yu. Bridged Refinement for Transfer Learning. In Proceedings of the Eleventh European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2007), pages 324-335, Warsaw, Poland, September 17-21, 2007. (Best Student Paper Award)
[11]. Xin Zhang, Wenyuan Dai, Gui-rong Xue, and Yong Yu. Adaptive Email Spam Filtering Based on Information Theory. In Proceedings of the Eighth International Conference on Web Information Systems Engineering (WISE), pages 159-170, Nancy, France, December 3-7, 2007.