AI Learning Society
Links: https://www.zhihu.com/question/57523080/answer/236301363
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
Today I'd like to introduce a rather interesting CVPR 2017 paper on medical image analysis that uses active learning and incremental learning.
The talk has three parts: first the motivation of the paper, that is, why the authors did this work; then how they did it and how it was applied to two datasets; and finally a brief summary of its strengths and what could still be improved.
In machine learning, and especially in deep learning, an important premise is that a sufficient amount of labeled data is available. Such labels generally have to be produced by hand, and the cost of annotation can be very high, especially for medical images. Medical image annotation requires domain knowledge: only doctors who are familiar with the diseases can label them, so it is hard for ordinary annotators to do. This is unlike natural images, such as the pictures in ImageNet, which are faces, scenes, and objects that anyone can label, so the cost is lower. Medical images are more expensive. Take the two common modalities in the example on the right, X-ray and CT: an X-ray is usually a single image per person and costs roughly 20 to 30 yuan per sheet to annotate, while a CT scan is cross-sectional and produces hundreds of images per person, so the cost is higher and the labeling takes longer.
For example, labeling 1,000 images, which is not a large dataset for deep learning, would cost about 20,000 to 30,000 yuan and take 3 to 4 days for X-rays; for CT the cost is higher and the time even longer, and time is itself an important cost. So how do we make deep learning work in medicine, especially on medical images? We need to train a promising classifier, that is, a reasonably good one, with as little labeled data as possible.
So we have to ask: how much training data does it take to train a promising classifier? The left plot in the figure shows one possibility: model performance grows linearly with the amount of data, that is, the more data, the better the performance. In practice this rarely happens. Usually, once the amount of data reaches a certain level, performance hits a bottleneck and stops improving as more training data is added. What we would like is to reach that saturation point earlier, at a smaller amount of data, like the red dashed part in the right plot: the same performance achieved with less data. This paper introduces active learning as the means to find a small dataset that achieves the same effect as a large one.
How does active learning pull the critical point in the right plot forward? The idea is to actively pick the samples that are harder, more easily misclassified, and more informative, and have those samples labeled first. Easy samples can be learned from just a few training examples, whereas the hard ones require a lot of data before the model can learn them, so the model should learn these hard cases first.
How is "hard" defined? "Hard", "easily misclassified", and "highly informative" all mean the same thing here, and "informative" is measured with two indicators: high entropy and high diversity. Entropy is the entropy of information theory, and diversity here reflects how well what the model learns will generalize. For a binary classification problem, for example, a predicted value near 0.5 means the entropy is high: the model cannot tell which class the sample belongs to, so it assigns it a probability of about 0.5.
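To make the entropy part concrete, here is a small sketch (my own illustration, not code from the paper) of the entropy of a binary prediction; a prediction near 0.5 scores high and counts as a hard sample.

```python
# Minimal sketch: Shannon entropy of a binary prediction.
# A probability near 0.5 gives high entropy (a "hard" sample);
# a confident prediction gives low entropy (an "easy" sample).
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli prediction with positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(binary_entropy(0.5))   # 1.0   -> maximally uncertain, hard sample
print(binary_entropy(0.95))  # ~0.29 -> confident, easy sample
```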
Active learning finds these harder samples with the following five steps:
- First, run the unlabeled images through a network pre-trained on a large set of natural images. Any of the commonly used networks can serve here, from the early LeNet to AlexNet, GoogLeNet, VGG, and ResNet. Use it to obtain predictions, and pick out the hardest, most informative samples to be labeled.
- Train the deep network on the samples that were just labeled, obtaining a network N.
- Run the remaining unlabeled images through N to get predictions, pick the hardest ones again, and have them labeled manually.
- Take the newly labeled samples together with the previously labeled ones, that is, the whole labeled set, and continue training the network.
- Repeat steps 3 and 4 until the current classifier can reliably classify the harder images that were selected.
The text above may not be very intuitive, so let's look at a figure, reading it from left to right. At first everything is gray, meaning unlabeled, and a pre-trained model is used to predict which class each image belongs to, so every image gets a probability. Based on these probabilities we decide which images are hard, which gives the middle part of the figure: the top portion is the harder one, and those images are sent for labeling. Then the CNN is continuously fine-tuned, that is, fine-tuned on top of the current model, since some labeled data is now available. The fine-tuned model again produces predictions for the unlabeled data, the hard ones are found and labeled, and these newly labeled data are combined with the previously labeled data for another round of continuous fine-tuning, giving CNN2. This continues until either all the data has been labeled, or the model already performs very well before the data runs out, because the hard samples have all been labeled. A rough code sketch of this loop is given below.
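Below is a toy sketch of that loop. It is my own illustration, not the authors' code: `predict`, `fine_tune`, and `oracle_label` are placeholders standing in for the real CNN and the human annotator.

```python
# Toy active-learning loop: score unlabeled data, label the hardest batch,
# then continue fine-tuning the current model on the whole labeled set.
import numpy as np

rng = np.random.default_rng(0)
unlabeled = list(range(1000))        # indices of unlabeled images
labeled = {}                         # index -> label
model_state = "ImageNet-pretrained"  # placeholder for the CNN weights

def predict(model_state, indices):
    # Placeholder: pretend the model outputs a positive-class probability per image.
    return {i: rng.uniform(0, 1) for i in indices}

def entropy(p, eps=1e-12):
    return -(p * np.log2(p + eps) + (1 - p) * np.log2(1 - p + eps))

def oracle_label(i):
    # Placeholder for a human (e.g., radiologist) annotation.
    return int(rng.integers(0, 2))

def fine_tune(model_state, labeled):
    # Placeholder: fine-tune the *current* model on the entire labeled set.
    return f"model fine-tuned on {len(labeled)} samples"

for round_id in range(5):
    probs = predict(model_state, unlabeled)
    # Rank remaining images by uncertainty and pick the hardest batch.
    hardest = sorted(unlabeled, key=lambda i: entropy(probs[i]), reverse=True)[:50]
    for i in hardest:
        labeled[i] = oracle_label(i)
        unlabeled.remove(i)
    model_state = fine_tune(model_state, labeled)
    print(round_id, model_state)
```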
We just mentioned two indicators for deciding whether a sample is hard to classify. Entropy is intuitive: a prediction around 0.5 means the sample is hard. Diversity, however, is not as easy to characterize directly, so the paper designs a diversity indicator through data augmentation: from one image, a series of deformations is generated by flipping, rotating, translating, and so on, turning one image into several or even more than ten, which increases its diversity. The classification results are then predicted for all of these deformations. If the results are inconsistent, the image has high diversity and is a hard sample; if they are consistent, the image is comparatively easy and no augmentation is needed for it. In principle the predictions for all the augmented versions should agree, since they represent the same thing, but there are exceptions when the augmentation is as simple as just described.
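Here is a rough sketch of that idea, using simple flips and rotations and a placeholder classifier. The paper measures disagreement with a KL-based diversity term (sketched later in the summary); this toy version just uses the spread of the predictions as a proxy.

```python
# Augmentation-based diversity: expand one image into cheap variants and
# check whether a classifier's predictions disagree across them.
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray):
    """Return a few simple deformations of one image (flips and rotations)."""
    return [
        img,
        np.fliplr(img),
        np.flipud(img),
        np.rot90(img, 1),
        np.rot90(img, 2),
        np.rot90(img, 3),
    ]

def predict_prob(img: np.ndarray) -> float:
    # Placeholder for a CNN's predicted probability of the positive class;
    # here it just returns a random number for illustration.
    return float(rng.uniform(0, 1))

def diversity_score(img: np.ndarray) -> float:
    """Spread of predictions over the augmented variants; larger = harder."""
    preds = np.array([predict_prob(v) for v in augment(img)])
    return float(preds.std())

image = rng.random((64, 64))
print(diversity_score(image))
```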
This creates a problem. Take the original image, the kitten on the left: after translation, rotation, scaling, and other operations we get 9 images, each a deformation of it. Running the CNN on these 9 images to get the probability of "cat", the three images in the top row get low probabilities, that is, they are not judged to be a cat; looking at them intuitively, they could just as well be a mouse, a dog, or a rabbit. Originally this was a simple example, easy to recognize as a cat, but augmentation has made the model uncertain. This is a situation that has to be avoided.
So at this point a majority selection is done, a minority-obeys-majority scheme: since most of the variants are recognized as a cat, we follow that tendency and keep the six variants predicted at 0.9, while the three variants predicted at 0.1 are dropped from the augmentation result. In this way the overall direction of the network's predictions stays consistent.
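A minimal sketch of that majority filter, assuming one binary prediction per augmented copy (my own illustration, not the paper's implementation):

```python
# Majority selection: drop augmented copies whose prediction contradicts
# the majority vote, so augmentation noise does not masquerade as "hardness".
import numpy as np

def majority_filter(probs, threshold: float = 0.5):
    """Keep only the augmented predictions that agree with the majority vote.

    probs: predicted probabilities of the positive class, one per augmentation.
    """
    probs = np.asarray(probs, dtype=float)
    votes = probs >= threshold
    majority_is_positive = votes.sum() >= len(votes) / 2
    keep = votes == majority_is_positive
    return probs[keep]

# The cat example: 6 variants say "cat" with p=0.9, 3 noisy variants say p=0.1.
preds = [0.9] * 6 + [0.1] * 3
print(majority_filter(preds))  # keeps the six 0.9 predictions; the outliers are discarded
```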
Besides active learning, another innovation of this paper is that it does not use batch learning from scratch but sequential learning. The model will not be particularly good at the start, because there is no labeled data yet; it is a model trained on ImageNet applied directly to the medical task, so the results should not be too good. Then, as labeled data accumulates, the effect of active learning gradually shows. Each fine-tuning step starts from the current model rather than going back to the original pre-trained model, so the parameters retain some memory of the previous rounds; this is continual learning, similar to what the literature calls sequential learning or online learning. One drawback is that the fine-tuning hyperparameters are hard to control: things like the learning rate really need to change as the model changes, and at the beginning it is relatively easy to fall into a bad local minimum, because there is little labeled data and the model may learn something poor. This is an open problem; there are several possible ways to address it, but the paper does not discuss a solution.
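To make the "continue from the current model" idea concrete, here is a hedged PyTorch-style sketch (my own, not the paper's code; the network choice, learning rate, and data loader are all placeholder assumptions): each round keeps training the model object that already exists instead of reloading the ImageNet checkpoint.

```python
# Sequential fine-tuning sketch: the same model instance is trained round after
# round, so its weights carry over ("memory") between active-learning rounds.
# Requires torchvision >= 0.13 for the `weights=` API.
import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 2)   # e.g., a binary medical task

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

def fine_tune_one_round(model, loader, epochs: int = 2):
    """Continue training the model that was passed in; no re-initialization."""
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model

# Each round: model = fine_tune_one_round(model, loader_over_current_labeled_set)
# where the loader is a hypothetical DataLoader over all labels collected so far.
```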
This general approach of finding the hard-to-classify samples and doing sequential fine-tuning on them is fairly common in machine learning; the contribution here is applying it to medical images, with results on two examples. The first is classifying colonoscopy video frames to check whether lesions, tumors, and so on are present. The conclusion is that only about 5% of the samples are needed to reach the best performance, because the data are consecutive video frames: neighboring frames are usually very similar, so not every frame needs to be labeled. The second example is similar, pulmonary embolism detection, a detection-plus-classification problem, where about 1,000 actively selected samples achieve the same effect as 2,200 randomly selected samples.
I also happen to know the author, a PhD student at ASU who is currently interning at Mayo Clinic, a very famous private hospital in the United States, and who works closely with the doctors who do the labeling. So this research topic grew out of a real need.
To sum up, the paper has several nice highlights.
- In terms of labeled data, it starts from a completely unlabeled dataset: no data needs to be labeled up front, and in the end good results are achieved with a relatively small amount of labeled data;
- It uses sequential fine-tuning instead of retraining from scratch each time;
- When selecting samples, it considers the consistency of the candidate samples, and this way of choosing samples is worth noting;
- It handles noise automatically: as in the cat example above, the noise introduced by data augmentation is removed by the minority-obeys-majority scheme;
- Entropy and the KL divergence (the indicator that describes diversity) are computed on only a small number of patches in each candidate set, which reduces the amount of computation. Traditional deep learning applies data augmentation to all the training data in advance, treating every sample equally; here, some augmentations not only fail to help but introduce noise and have to be handled specially, while some data does not need augmentation at all, which both reduces noise and saves computation. A rough sketch of this entropy-plus-KL score is given after this list.
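As a rough reading of that last point (my own sketch, not the authors' implementation), a candidate's score could combine the mean entropy of its patch predictions with pairwise KL terms between them:

```python
# Sketch: score a candidate by mean entropy plus pairwise (both-direction)
# KL divergences over the predictions of a few of its augmented patches.
import numpy as np

def entropy(probs, eps=1e-12):
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)).mean())

def kl_bernoulli(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def diversity(probs):
    """Sum of pairwise KL divergences (both directions) between patch predictions."""
    probs = np.asarray(probs, dtype=float)
    total = 0.0
    for i in range(len(probs)):
        for j in range(len(probs)):
            if i != j:
                total += kl_bernoulli(probs[i], probs[j])
    return float(total)

patch_preds = [0.45, 0.55, 0.6, 0.4]   # predictions on a few augmented patches
print(entropy(patch_preds) + diversity(patch_preds))
```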
Q&A
Q: Why is active learning not better than random selection at the very beginning?
A: Actually it is not necessarily worse; at the start there is no way to guarantee which one is better. Active learning has no labeled data at the outset, so it does not yet know which samples are hard, and the model has not been trained on the medical dataset; at that point it is on the same footing as random selection, since both are just transferring what was learned on ImageNet. Random selection may happen to pick hard samples directly, so it can be slightly better than the first round of active selection, but random selection is not always better either. At the beginning you simply cannot guarantee which one wins.
Paper: Fine-tuning Convolutional Neural Networks for Biomedical Image Analysis: Actively and Incrementally, on how to train a promising classifier with as little labeled data as possible.