11. When to modify development sets, test sets, and metrics
When starting a new project, choose the development set and test set as soon as possible. For example, suppose classifier A outscores classifier B on the metric, but the team believes B is actually better than A in the real product; then you need to consider modifying the development set, the test set, or the evaluation metric.
There are three main reasons the evaluation could incorrectly favor classifier A:
(1) The actual data distribution you need to do well on differs from the distribution of the development and test sets.
(2) You have overfit the development set.
(3) The metric measures something other than what the project actually needs to optimize.
12. Summary: Set up development sets and test sets
(1) Choose development and test set data that reflects the data you expect to receive and want to do well on in the future; this may not be the same as the distribution of your training data.
(2) The development set and the test set should come from the same distribution whenever possible.
(3) Choose a single-number evaluation metric to optimize. If there are multiple goals to consider, you may need to combine them into a single expression (for example, by averaging several error metrics) or define satisficing and optimizing metrics (see the sketch after this list).
(4) Machine learning is a highly iterative process: you may try many ideas before settling on one you are satisfied with.
(5) Having development sets, test sets, and a single-number evaluation metric lets you evaluate an algorithm quickly, which accelerates the iteration process.
(6) When exploring a brand-new application, try to set up your development set, test set, and metric within about a week; for mature applications it is reasonable to take longer.
(7) The traditional 70%/30% train/test split does not apply to large-scale data; in practice the development and test sets can together be far less than 30% of the data.
(8) The development set should be large enough to detect meaningful differences in the accuracy of the algorithm, but need not be much larger; the test set should be large enough to give a confident estimate of the final performance of the system.
(9) When the development set and evaluation metric no longer point the team in the right direction, change them as soon as possible: (i) if you have overfit the development set, get more development set data; (ii) if the distribution of the development and test sets differs from the distribution you actually care about, get new development and test data; (iii) if the evaluation metric no longer measures what is most important to the project, change the metric.
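A minimal sketch of point (3) in code, under assumed metric names and weights (none of these numbers come from the book): one helper averages several error rates into a single number, and another applies a satisficing metric (a latency threshold) before optimizing accuracy.

```python
# Sketch: combining multiple goals into one number, and satisficing + optimizing metrics.
# Metric names, weights, and the 100 ms latency threshold are illustrative assumptions.

def combined_error(error_rates, weights=None):
    """Average (optionally weighted) several error metrics into a single number."""
    if weights is None:
        weights = [1.0 / len(error_rates)] * len(error_rates)
    return sum(w * e for w, e in zip(weights, error_rates))

def pick_model(models):
    """Among models that satisfy the latency threshold (satisficing metric),
    pick the one with the lowest error (optimizing metric)."""
    MAX_LATENCY_MS = 100  # assumed satisficing threshold
    feasible = [m for m in models if m["latency_ms"] <= MAX_LATENCY_MS]
    return min(feasible, key=lambda m: m["error"]) if feasible else None

models = [
    {"name": "A", "error": 0.12, "latency_ms": 80},
    {"name": "B", "error": 0.10, "latency_ms": 150},
]
print(combined_error([0.12, 0.08]))   # single-number metric combining two error rates
print(pick_model(models)["name"])     # -> "A" (B is more accurate but too slow)
```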
13. Quickly build and iterate your first system
Case: When building a new spam filtering system, there may be many different ideas. Rather than trying to think everything through first, build a basic system as quickly as possible, and use its results to find clues about which direction to take next.
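To make "basic system" concrete, here is a minimal sketch of such a baseline, assuming scikit-learn is available; the tiny inline dataset and feature choices are placeholders, and the point is only to get an end-to-end system running quickly, not to be accurate.

```python
# Sketch of a quick first spam-filter baseline (assumes scikit-learn; toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["win money now", "meeting at noon", "cheap pills offer", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (placeholder data)

baseline = make_pipeline(CountVectorizer(), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["free money offer"]))  # quick sanity check of the baseline
```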
14. Error Analysis: Evaluating ideas based on development set samples
Error analysis is the process of examining development set examples that the algorithm misclassified in order to find the causes of these errors. It helps you prioritize the project: identify the causes responsible for the largest share of the errors and optimize for those, rather than blindly optimizing for a single cause.
15. Evaluating multiple ideas in parallel during error analysis
Use a spreadsheet to tally, for a sample of misclassified examples (say 100), the reason each one was misclassified; a single image may be counted under several categories (for example, it is a dog and it is also very blurry). Then compute the percentage of errors attributable to each category and prioritize the largest ones.
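A minimal sketch of that "spreadsheet" done in code; the categories and counts are invented for illustration, since in practice you fill them in by hand while looking at the misclassified development set examples.

```python
# Sketch: tallying error categories from a manual error analysis pass.
from collections import Counter

# Each misclassified example may be tagged with several categories at once.
misclassified = [
    {"id": 1, "categories": ["dog"]},
    {"id": 2, "categories": ["blurry"]},
    {"id": 3, "categories": ["dog", "blurry"]},
    {"id": 4, "categories": ["mislabeled"]},
]

counts = Counter(c for ex in misclassified for c in ex["categories"])
total = len(misclassified)
for category, n in counts.most_common():
    print(f"{category}: {n}/{total} = {100 * n / total:.0f}% of errors")
```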
16. Cleaning up mislabeled development and test set examples
Following the same principle as above, check what proportion of the misclassifications are caused by mislabeled examples, and use that to decide whether correcting the labels is worthwhile. If you do correct them, apply the same correction process to the test set so that the two sets remain consistent.
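A rough back-of-the-envelope calculation for that decision, with assumed numbers: estimate how much of the overall development set error comes from bad labels versus other causes.

```python
# Sketch: estimating the share of dev set error caused by mislabeled examples.
dev_error = 0.10               # overall dev set error rate (assumed)
frac_errors_mislabeled = 0.30  # fraction of inspected errors due to bad labels (assumed)

error_from_mislabels = dev_error * frac_errors_mislabeled  # share of all examples
error_from_other = dev_error - error_from_mislabels
print(f"{error_from_mislabels:.1%} of examples fail due to labels, "
      f"{error_from_other:.1%} due to other causes")
# If label noise is a sizeable share of the remaining error, fixing it may be
# worthwhile; any fix must also be applied to the test set for consistency.
```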
17. If you have a large development set, split it into two subsets and look at only one of them
For example, suppose you have a development set of 5,000 examples with a 20% error rate, i.e., about 1,000 misclassified examples. Analyzing all of them would be very time-consuming, so you could pick 500 examples (containing roughly 100 misclassifications) as an Eyeball development set and keep the remaining 4,500 as a Blackbox development set. Because you inspect the Eyeball set, you are more likely to overfit it; the Blackbox set lets you verify whether this has happened. If it has, you should discard the Eyeball set and obtain a new one, for example by gradually moving examples from the Blackbox set into the Eyeball set.
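A minimal sketch of that split in code; the sizes follow the 500 / 4,500 example above, and the list of development examples is a placeholder.

```python
# Sketch: splitting a dev set into an Eyeball subset (inspected) and a Blackbox
# subset (never inspected). dev_examples stands in for your real dev set.
import random

random.seed(0)                    # keep the split reproducible
dev_examples = list(range(5000))  # placeholder for 5,000 dev set examples

shuffled = dev_examples[:]
random.shuffle(shuffled)
eyeball_dev = shuffled[:500]      # ~100 misclassifications at a 20% error rate
blackbox_dev = shuffled[500:]     # only used to measure, never inspected
print(len(eyeball_dev), len(blackbox_dev))
```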
18. How big should the Eyeball and Blackbox development sets be?
The Eyeball development set should be large enough to give you a sense of the algorithm's main error categories; the lower the error rate of the classifier, the larger the Eyeball development set needs to be in order to contain enough misclassified examples to meet that requirement. If you are working on a task that even humans cannot do well, examining an Eyeball development set is not very helpful, because it is hard to figure out why the algorithm failed to classify an example correctly.
The exact size of the Blackbox development set is less important than that of the Eyeball development set, and sometimes you may not have a Blackbox set at all (because the Eyeball set needs all of the available development data); the cost is simply a greater risk of overfitting the development set.
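A quick sizing calculation for the Eyeball set, assuming you want roughly 100 misclassified examples to analyze (the target of 100 is an assumption, not a rule from the book): the lower the error rate, the more examples you need.

```python
# Sketch: how many Eyeball examples are needed to collect ~100 misclassifications.
def eyeball_size_needed(error_rate, target_errors=100):
    return round(target_errors / error_rate)

for err in (0.20, 0.05, 0.01):
    print(f"error rate {err:.0%} -> ~{eyeball_size_needed(err)} Eyeball examples")
# 20% -> ~500, 5% -> ~2,000, 1% -> ~10,000
```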
19. Summary: Basic error analysis
(1) When you start a new project, especially in a domain you are not an expert in, it is difficult to correctly guess the most promising directions.
(2) So don't try to design and build the perfect system from the start. Instead, build and train a basic system as quickly as possible (perhaps within a few days), then use error analysis to help you identify the most promising directions and iteratively refine your algorithm accordingly.
(3) Perform error analysis by manually examining roughly 100 development set examples that the algorithm misclassified and counting the major error categories; use this information to decide which types of errors to prioritize.
(4) Consider splitting the development set into an Eyeball development set that you inspect by hand and a Blackbox development set that you never inspect. If performance on the Eyeball development set is much better than on the Blackbox development set, you have overfit the Eyeball set and should consider acquiring more data for it.
(5) The Eyeball development set should be large enough that the algorithm misclassifies enough examples for you to analyze. For many applications, a Blackbox development set of 1,000-10,000 examples is sufficient.
(6) If your development set is not large enough to split this way, use the entire development set as an Eyeball set for manual error analysis, model selection, and hyperparameter tuning.
20. Bias and variance: the two big sources of error
Premise: assume that your training set, development set, and test set all come from the same distribution. Should you simply go and get more training data? Understanding bias and variance helps answer this question.
Informal initial definitions of bias and variance (total error = bias + variance):
Bias: the algorithm's error rate on the training set. Variance: how much worse the algorithm does on the development set than on the training set (i.e., the development set error minus the training set error). Example: a classifier has 15% error on the training set and 16% error on the development set, so the bias is 15% and the variance is 1%. This is a high-bias case, and simply adding more training data will not help much.
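The same decomposition as a tiny helper, following the informal definitions above (bias as the training error, variance as the dev-minus-training gap); the 15%/16% numbers are the example from this chapter.

```python
# Sketch: informal bias/variance decomposition from training and dev set errors.
def bias_variance(train_error, dev_error):
    bias = train_error                 # error on the training set
    variance = dev_error - train_error # how much worse the dev set is
    return bias, variance

b, v = bias_variance(0.15, 0.16)
print(f"bias = {b:.0%}, variance = {v:.0%}")  # bias = 15%, variance = 1%
# High bias, low variance: more data alone will not help much here.
```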
Andrew Ng, "Machine Learning Yearning" summary (Chapters 11-20)