Under the traditional machine learning framework, the task is to learn a classification model from sufficient labeled training data, and then use this model to classify and predict test documents. However, current research on Web mining reveals a key problem with this approach: in many newly emerging domains, labeled training data are very scarce. Web applications are developing rapidly, and new domains keep emerging, from traditional news to web pages, images, blogs, podcasts, and more. Traditional machine learning would require a large amount of labeled training data in every one of these domains, which is costly and labor intensive; and without large amounts of labeled data, much learning-related research and many applications cannot proceed.

Second, traditional machine learning assumes that the training data follow the same distribution as the test data. In many cases, however, this identical-distribution assumption does not hold: typically, the training data become outdated. We are then forced to re-annotate large amounts of training data to meet our training needs, yet labeling new data is very expensive and consumes considerable manpower and resources. Viewed from another angle, if we already have a large amount of training data under a different distribution, discarding it entirely is very wasteful. How to use such data rationally is the main problem that transfer learning studies. Transfer learning can transfer knowledge from existing data to help future learning: its goal is to use knowledge learned in one environment to assist learning tasks in a new environment. Therefore, unlike traditional machine learning, transfer learning does not rely on the identical-distribution assumption.
Our work on transfer learning can currently be divided into three parts: instance-based transfer learning in a homogeneous space, feature-based transfer learning in a homogeneous space, and transfer learning in heterogeneous spaces. Our research indicates that instance-based transfer learning has the strongest ability to transfer knowledge, feature-based transfer learning can transfer knowledge more broadly, and transfer across heterogeneous spaces has the widest ability to generalize and extend learning. Each method suits a different setting.
1. Instance-based transfer learning in a homogeneous space
The basic idea of instance-based transfer learning is that although the auxiliary training data differ somewhat from the source training data, some portion of the auxiliary data should still be suitable for training an effective classification model that fits the test data. Our goal, then, is to identify from the auxiliary training data those instances that suit the test data, and transfer them into the learning on the source training data. For instance-based transfer learning, we generalized the traditional AdaBoost algorithm and proposed a boosting algorithm with transfer capability, TrAdaBoost [9]; the result is an algorithm that makes maximal use of the auxiliary training data to help classification in the target domain. Our key idea is to use boosting to filter out the auxiliary data that are most unlike the source training data.
Here, boosting establishes a mechanism for automatic weight adjustment: the weights of important auxiliary training instances increase, while the weights of unimportant auxiliary instances decrease. After the weights have been adjusted, the re-weighted auxiliary training data are used as additional training data, together with the source training data, to improve the reliability of the classification model.
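A minimal sketch of this weight-adjustment loop is shown below. It is only an illustration of the mechanism, not the exact TrAdaBoost of [9]: the base learner, the number of rounds, the error clamping, and the update constants are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tradaboost_sketch(X_aux, y_aux, X_src, y_src, rounds=10):
    """Boosting-for-transfer sketch: misclassified auxiliary instances
    are down-weighted, misclassified source instances are up-weighted."""
    n_aux = len(X_aux)
    X = np.vstack([X_aux, X_src])
    y = np.concatenate([y_aux, y_src])
    w = np.ones(len(y)) / len(y)
    # Fixed down-weighting factor for auxiliary data (constant across rounds).
    beta_aux = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_aux) / rounds))
    learners, betas = [], []
    for _ in range(rounds):
        clf = DecisionTreeClassifier(max_depth=1)
        clf.fit(X, y, sample_weight=w / w.sum())
        miss = clf.predict(X) != y
        # Error is measured on the source (same-distribution) part only.
        eps = w[n_aux:][miss[n_aux:]].sum() / w[n_aux:].sum()
        eps = min(max(eps, 1e-10), 0.499)    # keep the update well defined
        beta_src = eps / (1.0 - eps)
        w[:n_aux][miss[:n_aux]] *= beta_aux  # shrink weights of bad auxiliary data
        w[n_aux:][miss[n_aux:]] /= beta_src  # grow weights of hard source data
        learners.append(clf)
        betas.append(beta_src)
    return learners, betas
```

In [9], the final hypothesis aggregates the learners from the later rounds; that aggregation step is omitted from this sketch.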
Instance-based transfer learning can work only when the source data and the auxiliary data are very similar. When the source data and the auxiliary data differ greatly, instance-based transfer learning algorithms often struggle to find knowledge that can be transferred. However, we find that even when the source and target data share no common knowledge at the instance level, they may still overlap at the feature level. We therefore studied feature-based transfer learning, which examines how to use feature-level common knowledge to solve the learning problem.
2. Feature-based transfer learning in a homogeneous space
In our research on feature-based transfer learning, we propose a variety of learning algorithms, such as the CoCC algorithm [7], the TPLSA algorithm [4], a spectral-analysis algorithm [2], and a self-taught clustering algorithm [3]. These methods use clustering to generate a common feature representation that assists the learning algorithm. Our basic idea is to cluster the source data and the auxiliary data simultaneously to obtain a common feature representation; because it draws on both data sets, this representation is better than one based on the source data alone. Transfer learning is achieved by re-representing the source data in this new space. Using this idea, we propose both feature-based supervised transfer learning and feature-based unsupervised transfer learning.
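To make the shared-representation idea concrete, here is a minimal sketch that stands in plain k-means for the clustering step; the function name, the number of clusters, and the use of centroid distances as the new features are illustrative assumptions, not our actual algorithms.

```python
import numpy as np
from sklearn.cluster import KMeans

def shared_feature_space(X_src, X_aux, n_clusters=50, seed=0):
    """Cluster source and auxiliary data jointly, then re-represent the
    source data by its distance to each of the shared cluster centroids."""
    X_all = np.vstack([X_src, X_aux])   # pool both domains before clustering
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    km.fit(X_all)                       # centroids reflect both domains
    return km.transform(X_src)          # new source features: distances to centroids
```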
2.1 Feature-based supervised transfer learning
Our work on feature-based supervised transfer learning is the co-clustering-based cross-domain classification algorithm [7]. It addresses the following question: given a new domain, different from the original one, in which labeled data are scarce, how can the large amount of labeled data in the original domain be used for transfer learning? In this work on co-clustering-based cross-domain classification, we define a unified information-theoretic formulation of the cross-domain classification problem, which turns the co-clustering-based classification problem into the optimization of an objective function. In our proposed model, the objective function is defined as the loss of mutual information among the source data instances, the common feature space, and the auxiliary data instances.
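Schematically, an objective of this kind measures how much mutual information is lost when instances $X$ and features $W$ are replaced by their clusterings $\hat{X}$ and $\hat{W}$, plus a term that ties the feature clusters to the class labels $C$; the notation and the trade-off parameter $\lambda$ below are illustrative rather than the exact formulation of [7]:

$$\min_{\hat{X},\,\hat{W}} \;\bigl[I(X; W) - I(\hat{X}; \hat{W})\bigr] \;+\; \lambda\,\bigl[I(C; W) - I(C; \hat{W})\bigr]$$

Minimizing the first bracket keeps the instance and feature clusters faithful to the original co-occurrence structure, while the second bracket forces the shared feature clusters to remain predictive of the labels.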
2.2 Feature-based unsupervised transfer learning: self-taught clustering
Our proposed self-taught clustering algorithm [3] belongs to feature-based unsupervised transfer learning. The question we consider here is that the real world contains large amounts of data that cannot easily be labeled, and we ask how this large body of unlabeled data can be used for transfer learning. The basic idea of self-taught clustering is to obtain a common feature representation from the source data and the auxiliary data together; because of the large amount of auxiliary data, this new representation is better than one based on the source data alone, and it therefore helps the clustering.
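In the same schematic notation as above (again an illustration, not the exact objective of [3]), the target data $X$ and the auxiliary data $Y$ are co-clustered against a single shared clustering $\hat{Z}$ of the features $Z$, so that the abundant auxiliary data shape the feature clusters that the target clustering uses:

$$\min_{\hat{X},\,\hat{Y},\,\hat{Z}} \;\bigl[I(X; Z) - I(\hat{X}; \hat{Z})\bigr] \;+\; \lambda\,\bigl[I(Y; Z) - I(\hat{Y}; \hat{Z})\bigr]$$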
The two learning strategies presented above (feature-based supervised transfer learning and feature-based unsupervised transfer learning) address feature-based transfer learning when the source and auxiliary data lie in the same feature space. For the case where the source data and the auxiliary data are not in the same feature space, we have also studied feature-based transfer learning across feature spaces, which likewise belongs to feature-based transfer learning.
3. Transfer learning in heterogeneous spaces: translated learning
Our proposed translated learning [1][5] is dedicated to the situation where the source data and the test data belong to two different feature spaces. In [1], we use abundant, easy-to-obtain labeled text data to help an image classification problem that has only a small amount of labeled data. Our approach uses data with two views to construct a bridge between the two feature spaces. Although such multi-view data may not be usable as training data for the classifier, they can be used to build a translator. Through this translator, we combine the nearest-neighbor algorithm with feature translation, translate the auxiliary data into the source-data feature space, and use a unified language model for learning and classification.
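As a minimal sketch of the translator idea, the example below estimates a translation matrix from co-occurring image/text pairs (for example, captioned images) and then labels a translated image by its nearest labeled text example; the functions, the co-occurrence estimate, and the cosine-similarity matching are all assumptions of this sketch, not the model of [1]:

```python
import numpy as np

def build_translator(img_feats, txt_feats):
    """Estimate a translation from image features to text features as a
    row-normalized co-occurrence matrix over paired image/text data."""
    co = img_feats.T @ txt_feats            # (n_img_dims, n_txt_dims)
    return co / np.maximum(co.sum(axis=1, keepdims=True), 1e-12)

def classify_by_translation(img_x, translator, txt_train, txt_labels):
    """Translate one image into the text feature space, then label it by
    its nearest labeled text example under cosine similarity."""
    translated = img_x @ translator         # image -> text space
    sims = (txt_train @ translated) / (
        np.linalg.norm(txt_train, axis=1) * np.linalg.norm(translated) + 1e-12)
    return txt_labels[np.argmax(sims)]
```

The row-normalized co-occurrence table here simply plays the role of the bridge between the two feature spaces described above.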