One, A few methods to improve the model
- Collect more data
- Collect a more diverse training set
- Train the algorithm longer with gradient descent
- Try Adam instead of plain gradient descent
- Try a bigger network
- Try dropout
- Add $L_2$ regularization (see the sketch after this list)
- Change the network architecture
- Change the activation functions
- Change the number of hidden units
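A minimal Keras sketch (assuming TensorFlow is installed; the layer widths, dropout rate, and regularization strength are illustrative values, not recommendations) showing several of these knobs at once: a bigger network, dropout, $L_2$ regularization, and Adam in place of plain gradient descent:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Illustrative values only: two wide hidden layers ("bigger network"),
# dropout, L2 weight decay on the kernels, and the Adam optimizer.
model = tf.keras.Sequential([
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=["accuracy"])
```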
Two, Orthogonalization

Definition: make the tuning factors orthogonal, i.e. independent of each other, so that optimizing one aspect of the model does not affect its performance in other aspects.

When you find that a model does not perform well on the dev set, you may take steps to improve it, but a given step may affect the model's performance on other datasets at the same time: for example, performance on the dev set improves while performance on the test set gets worse, because changing one factor affects several aspects at once. Orthogonalization therefore tries to make each method affect only one factor as far as possible.
Three, The evaluation metric of the model

- There are many criteria for evaluating a model, such as accuracy, precision, recall, and so on, but these criteria often trade off against each other: a model may do well on precision but poorly on recall, making it hard to judge which model is better from two indicators at once. Therefore, when evaluating a model, try to adopt a single unified metric, such as the F1 score; when the model has errors along several dimensions, use the average error where possible. Of course, depending on actual needs, each criterion carries a different weight, so you can also choose the most important criterion as the primary one, for example when trading off a model's prediction accuracy against the time it needs to run.
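A minimal numpy sketch of such a single-number metric, the F1 score (the labels below are made up):

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall, for binary labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])
print(f1_score(y_true, y_pred))  # precision = recall = 0.75 -> F1 = 0.75
```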
- Suppose the classification errors of cat classifiers A and B are 3% and 5% respectively, but classifier A may classify pornographic images as cats and push them to users, while classifier B does not. In that case classifier B is the better choice. Here we can give pornographic images extra weight when computing the error term: if the input x is a pornographic image, its error penalty is made much larger, so that pornographic images are not mistakenly classified as cats and pushed to users.
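A sketch of that weighted error, assuming a boolean flag marking pornographic inputs; the weight w_porn = 10 is an arbitrary illustrative value:

```python
import numpy as np

def weighted_error(y_true, y_pred, is_porn, w_porn=10.0):
    """Misclassification error where mistakes on pornographic
    images are penalized w_porn times more heavily."""
    w = np.where(is_porn, w_porn, 1.0)           # per-example weights
    mistakes = (y_true != y_pred).astype(float)  # 1 for each error
    return np.sum(w * mistakes) / np.sum(w)
```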
- When the model does well on the defined metric and the dev/test set but performs poorly in the actual application, revise the model's evaluation metric or revise the dev/test set.
Four, Choice of train/dev/test sets

- The dev set and test set must come from the same distribution, and should preferably reflect the data distribution expected in the future. The distribution of the train set may differ from that of the dev and test sets (though of course the distributions can be kept the same where possible);
- When the dataset is small (for example 100, 1,000, or 10,000 examples), the train and test sets can be split 7:3, or train/dev/test can be split 6:2:2; when the dataset is large (for example 1,000,000 examples), train : dev : test = 98 : 1 : 1. Of course, under the above constraints, the test set should be big enough to give high confidence in the overall performance of the model (see the split sketch below).
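A sketch of the 98:1:1 split for a large dataset (the ratios and the fixed seed are illustrative):

```python
import numpy as np

def split_indices(n, ratios=(0.98, 0.01, 0.01), seed=0):
    """Shuffle n example indices and split them into train/dev/test."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_dev = int(ratios[0] * n), int(ratios[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_dev], idx[n_train + n_dev:]

train, dev, test = split_indices(1_000_000)
print(len(train), len(dev), len(test))  # 980000 10000 10000
```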
Five, Human-level performance

- When the model's performance exceeds human-level performance, progress slows down considerably, and it can never exceed the Bayes optimal error.
- Humans already achieve very good results on many tasks, such as image recognition, speech recognition, and so on. If the model performs worse than humans, you can consider the following:
  - Get labeled data from humans;
  - Gain insight from manual error analysis: why did a person get this right?
  - Do a better analysis of bias/variance.
- Suppose the training error is 8% and the dev error is 10%. When human-level performance (an approximation of the Bayes error) is 1%, the training error still has a 7% gap to human error, while the dev error is only 2% above the training error, so in this case we should focus on reducing the model's bias rather than its variance. When human-level performance is 7.5%, the training error is only 0.5% above human error while the dev error has a 2% gap, so we should then focus on reducing variance, not bias.
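The same arithmetic written out, with the numbers from the example above:

```python
train_error, dev_error = 0.08, 0.10

# Case 1: human-level error (proxy for Bayes error) is 1%.
human_error = 0.01
avoidable_bias = train_error - human_error  # 0.07 -> focus on bias
variance = dev_error - train_error          # 0.02

# Case 2: human-level error is 7.5%.
human_error = 0.075
avoidable_bias = train_error - human_error  # 0.005
variance = dev_error - train_error          # 0.02 -> focus on variance
```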
Six, Error analysis

- Check which examples in the dev set were miscategorized and why, analyze the proportion of each cause of error, and then decide which problem to focus on.
- In a dataset, some labels may themselves be wrong, so examine the dev set to determine whether a mistake is the classifier's prediction error or a label error in the dataset;
- Of course, examples the model categorizes "correctly" may also carry wrong labels, in which case the model's "correct" classification is actually an error; this case should be considered too, but checking it usually takes a lot of effort, so it is generally handled less often;
- When correcting wrong labels, apply the same corrections to the dev set and the test set, to keep their data distributions consistent;
Seven, Mismatched training and dev/test set distributions

1. Inconsistent data sources and how to partition the dataset

Example: in the cat classifier, the training data comes from the Internet, where picture quality is very good, but the classifier is actually used in a mobile app, where users' pictures of cats are of much lower quality; a classifier that performs well on the training set may therefore perform quite poorly in the real application. Moreover, the set of web images, A, is often very large (for example 1,000,000), while the set of user images, B, may contain only 10,000. For this scenario there are several possible approaches:

- Add the 10,000 examples of dataset B to the 1,000,000 of dataset A, then shuffle and re-split 98:1:1 into train/dev/test. Under this scheme, however, the dev and test sets will contain mostly examples from A and only a small part from B, which defeats the purpose of the dev set: the data distribution of the dev and test sets should match the distribution seen in the actual application, i.e. the distribution of dataset B, as closely as possible, so that the trained model performs well in practice.
- Put all of dataset A plus part of dataset B into the training set, then divide the remaining part of B between the dev set and the test set. Although the training set and the dev/test sets then have different distributions, actual performance will be better than under the first scheme.
2. How to determine why the dev set error is too high

When the distributions of the training set and the dev/test sets differ, a small portion of the training set can be carved out as a train-dev set. The full dataset then contains four parts: train, train-dev, dev, and test, with the following roles:

Train set: used for model training;

Train-dev set: has the same data distribution as the train set, but is not used for training, only for error measurement.
If the error on the train-dev set is similar to that on the dev set but much higher than the error on the train set, this is a variance problem; if the error on the train-dev set is similar to the train set error but the dev set error is much higher than the train-dev error, the problem is caused by the mismatch between the training set and dev set data distributions.
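A rough decision rule built on those gaps (a sketch only; it simply names the largest gap between consecutive error levels, and the error values below are made up):

```python
def diagnose(human_err, train_err, train_dev_err, dev_err):
    """Name the largest gap between consecutive error levels."""
    gaps = {
        "avoidable bias": train_err - human_err,
        "variance": train_dev_err - train_err,
        "data mismatch": dev_err - train_dev_err,
    }
    return max(gaps, key=gaps.get)

print(diagnose(0.005, 0.01, 0.015, 0.10))  # data mismatch
print(diagnose(0.005, 0.01, 0.09, 0.10))   # variance
```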
3. Analyzing problems through the error relationships between datasets

Suppose the training error is 1% and the dev error is 10%. If the training set and the dev/test sets share the same data distribution, this is a high-variance problem; but if, as in the example above, the distributions differ, the gap may simply be because the training set's pictures are of better quality and easy for the classifier to distinguish, while the dev set's picture quality is ordinary, so the classifier fails to classify them correctly and the dev error is large.
4. Ways to resolve data mismatch

- Perform manual error analysis to uncover the causes of the mismatch: by analyzing the differences between the train set and the dev set, try to collect more training data that matches the dev set's distribution.
- Use artificial data synthesis (see the sketch after this list). For example, in an in-car speech recognition system, the training set may be 10,000 hours of speech recorded in a quiet environment, while the speech the system actually receives contains noise: sounds from the car itself, horns from surrounding vehicles, echo inside the car, and so on. So, if you have one hour of car-noise data, you can synthesize it with the 10,000 hours of quiet recordings to make the train set and dev set as consistent as possible. This of course risks the system overfitting to that one hour of noise. Another solution is to record 10,000 hours of noise data, which naturally takes much more effort.
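A minimal sketch of the synthesis step, assuming the clean speech and the noise are 1-D sample arrays at the same sample rate; mixing at a target SNR and taking a random crop of the noise for each utterance are illustrative choices:

```python
import numpy as np

def add_car_noise(clean, noise, snr_db=10.0, rng=None):
    """Mix a clean utterance with a random crop of noise at a target SNR."""
    rng = rng or np.random.default_rng()
    start = rng.integers(0, len(noise) - len(clean))  # assumes noise is longer
    crop = noise[start:start + len(clean)]
    # Scale the noise so the clean/noise power ratio matches snr_db.
    p_clean, p_noise = np.mean(clean ** 2), np.mean(crop ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * crop
```

Randomizing the crop (and ideally the SNR) for every utterance spreads the single hour of noise across the 10,000 hours of speech, which somewhat reduces the overfitting risk noted above.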
Eight, Learning from multiple tasks

1. Transfer learning

- Pre-training and fine-tuning: in deep neural network training, using the parameters of a previously trained model to initialize the model's parameters is pre-training; updating those parameters in subsequent training is fine-tuning.
- Transfer learning applies the knowledge learned on dataset A to another task on dataset B (a minimal sketch follows). However, if dataset A is smaller in magnitude than dataset B, applying transfer learning to B is not advisable, since A can provide little information to B; in that case you should train a new model for dataset B instead.
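A minimal Keras sketch of pre-training plus fine-tuning (MobileNetV2 with ImageNet weights and a 2-class head are illustrative choices, not the course's prescription):

```python
import tensorflow as tf

# Feature extractor pre-trained on a large dataset (ImageNet), frozen so
# that only the new head is updated in the first fine-tuning phase.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False,
    weights="imagenet", pooling="avg")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),  # head for the new task
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Later, set base.trainable = True and recompile with a small learning
# rate to fine-tune the pre-trained layers as well.
```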
2. Multi-task learning

- Definition: given one input, judge multiple aspects of it at the same time. For example, in autonomous driving, given an image, determine simultaneously whether it contains sidewalks, stop signs, cyclists, traffic lights, and so on. The y label is typically [0, 1, 1, 0, ..., 1], where a 1 indicates that the image has the corresponding attribute.
- Multi-class classification is somewhat similar in concept to multi-task learning, but multi-class classification (e.g. softmax, SVM) assigns a single input to exactly one of several categories: for example, given an animal image, decide whether it is a cat/dog/pig. Under multi-class classification, the y label is usually a single value in {0, 1, 2, ..., m-1}, where m is the number of categories.
- In multi-task learning, whether an attribute is present is sometimes unknown in a training example's y label. For example, for a given image, the y label may only tell you that the image contains a traffic light and a car, while whether it contains a sidewalk or a stop sign is undetermined; those attributes are marked with a "?" in the y label. Therefore, when computing the loss function, the loss for these undetermined attributes is not counted; only the positions of the y label marked 0 or 1 are summed (see the masked-loss sketch after this list).
- Multi-task learning performs well only when the network is relatively deep, because a deep neural network can learn low-level features of the images that can be shared across the tasks.
- One option is to make simultaneous judgments about one input with multi-task learning; the other is to train a separate classifier for each category. When the sample data for each class is small, multi-task learning is the wiser choice, since the shared image features can be learned better; when the sample data for each class is large, a multi-class model can be used, and the model's prediction accuracy is then generally higher.
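A numpy sketch of the masked loss described above; using -1 to stand for the "?" entries is an arbitrary convention for the sketch:

```python
import numpy as np

def masked_multitask_loss(y_hat, y):
    """Binary cross-entropy summed over tasks, skipping '?' labels.

    y uses -1 where a label is unknown; only entries labeled 0 or 1
    contribute to the loss.
    """
    mask = (y == 0) | (y == 1)              # known labels only
    y_hat = np.clip(y_hat, 1e-7, 1 - 1e-7)  # numerical stability
    bce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.sum(np.where(mask, bce, 0.0)) / np.sum(mask)

y = np.array([[1, 0, -1, 1],      # -1 marks the '?' entries
              [0, -1, 1, 0]])
y_hat = np.array([[0.9, 0.2, 0.5, 0.7],
                  [0.1, 0.5, 0.8, 0.3]])
print(masked_multitask_loss(y_hat, y))
```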
3. End-to-end learning

- End-to-end learning is only applicable when the dataset is large enough.

Pros:

- Let the data speak. In non-end-to-end models, a pipeline of stages is often defined to extract particular characteristics of the input data, such as structure in image or voice data, but the information extracted is prescribed by humans; an end-to-end model learns the mapping from input to output directly and may discover deeper latent features inside the data;
- Less hand-designing of components is needed.

Cons:

- May need a large amount of data;
- Excludes potentially useful hand-designed components.
"DL. AI "Structuring machine learning Projects" notes