Summary:
A method, ELBlocker, is proposed for automatically detecting blocking bugs (bugs that prevent other bugs from being fixed).
The difficulty is that blocking bugs account for only a small percentage of all bugs (the class imbalance phenomenon).
Method: Given a training set, ELBlocker first divides the training data into multiple disjoint subsets. It builds a classifier on each subset, then sets a threshold (decision boundary) on the output of the combined classifier to separate blocking bugs from non-blocking bugs.
S1 Introduction
ELBlocker is evaluated with two kinds of metrics:
A. Precision and recall; F1-score
F1-measure is an evaluation metric that is often used in information retrieval and natural language processing.
It is a combined metric based on precision and recall, defined as follows:
F1 = 2RP / (R + P)
where R is recall and P is precision.
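As a quick check of the formula, here is a minimal Python sketch. The function name f1_score and the choice to return 0.0 when both inputs are zero are my own assumptions, not something taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Returning 0.0 when both inputs are zero is a convention assumed here
    to avoid dividing by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Example: precision 0.5 and recall 0.25
print(f1_score(0.5, 0.25))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it suits imbalanced data better than plain accuracy.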
B. Performance indicators; cost effectiveness
S2 Preliminaries & Motivation
2.1 Experiments have shown that blocking bugs take a long time to be fixed.
2.2 Two questions:
A. Is a classifier built on a subset better than a classifier built on the complete training set?
Experiment (10-fold cross-validation: 9 folds are used as the training set and 1 as the test set):
Ⅰ. Divide the training set into K disjoint subsets of the same size;
Ⅱ. Build a classifier on each subset, and build one classifier on the complete training set;
Ⅲ. Classify the same test set with each of the K classifiers, and also produce a random prediction for comparison;
Ⅳ. Random forest is used to build all classifiers.
Experimental results:
The general trend is that the F1-score increases as K increases.
Conclusion: building a classifier on each subset and combining their predictions gives better results than building a single classifier on the complete set.
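Step Ⅰ of the experiment above can be sketched in a few lines of Python. This is only an illustration: the function name split_disjoint, the fixed seed, and the round-robin dealing after a shuffle are my assumptions, and the paper may split the data differently.

```python
import random


def split_disjoint(items, k, seed=0):
    """Shuffle the items, then deal them into k disjoint subsets.

    Subsets have equal size when len(items) is divisible by k,
    otherwise their sizes differ by at most one.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


# 100 bug report ids split into 10 disjoint subsets of 10 reports each
subsets = split_disjoint(range(100), k=10)
print([len(s) for s in subsets])
```

Each subset would then be used to train its own classifier (random forest in the paper).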
B. Do different decision boundaries (thresholds) lead to significant differences in prediction performance?
Experiment:
Ⅰ. Build a classifier on the complete training set;
Ⅱ. Classify a new test set, compare the classifier's likelihood scores against different thresholds, and observe how the F1-score changes.
Conclusion: different thresholds give different F1-scores.
S3 ELBlocker Architecture
ELBlocker has two phases: a model building phase and a prediction phase.
Model building phase: build a combined model from bug reports labelled as blocking or non-blocking.
Prediction phase: predict whether a new bug report is a blocking bug.
S4 ELBlocker approach
4.1 Subset scores definition
For each bug report in the test set, each subset classifier classifies it and gives a likelihood value as its subset score, sub (BRI). Random forest builds a large number of decision trees from the training data; at classification time, the majority result of the trees is used as the final decision.
4.2 ELComposer Classifier
The ELComposer score is simple to compute: it is just the average of the subset scores. A bug report whose score is larger than the threshold is then predicted as a blocking bug.
How is an appropriate threshold generated automatically? With a "greedy algorithm". As I understand it, it is not really greedy: it evaluates every candidate threshold in turn and keeps the one with the best result, which is closer to enumeration.
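The averaging and the threshold search described above could look roughly like the sketch below. The function names, the 0.01 grid step, and the use of F1 on labelled data as the selection criterion are my assumptions for illustration; the paper's exact procedure may differ.

```python
def elcomposer_score(subset_scores):
    """Average the likelihood scores given by the subset classifiers."""
    return sum(subset_scores) / len(subset_scores)


def f1_at_threshold(scores, labels, threshold):
    """F1 when reports scoring above the threshold are predicted blocking (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, step=0.01):
    """Enumerate thresholds on a grid (assumed step) and keep the best F1."""
    candidates = [i * step for i in range(int(round(1 / step)) + 1)]
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Four labelled reports, each scored by three hypothetical subset classifiers
per_report = [[0.9, 0.7, 0.8], [0.2, 0.4, 0.3], [0.6, 0.5, 0.7], [0.1, 0.2, 0.0]]
labels = [1, 0, 1, 0]
scores = [elcomposer_score(s) for s in per_report]
print(best_threshold(scores, labels))
```

Since every candidate is evaluated and the maximum is kept, this is plain enumeration, which matches my reading of the "greedy" step.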
S5 Experiments and Results
5.1 For a fair comparison, the same experimental setup is used, i.e. the same data sets and so on.
6 open source software projects: FreeDesktop, Chromium, Mozilla, NetBeans, OpenOffice, Eclipse.
Among them, Mozilla, Eclipse, FreeDesktop and NetBeans use Bugzilla as the issue tracking system;
OpenOffice uses Issuetracker as the issue tracking system;
Chromium uses Google Code as the issue tracking system.
Bug reports in these issue tracking systems have a field called "Blocks", which can be used to determine whether a bug is a blocking bug.
Evaluation uses 10-fold cross-validation, repeated 100 times. In each round the data is randomly divided into 10 folds.
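The repeated cross-validation described above can be sketched as an index generator. The function names and the seed are assumptions of mine, and this plain random split does not stratify by class, which the paper may or may not do.

```python
import random


def k_fold_indices(n, k, rng):
    """Randomly split indices 0..n-1 into k folds and yield (train, test) pairs."""
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def repeated_k_fold(n, k=10, repeats=100, seed=0):
    """Yield (train, test) pairs for `repeats` independent k-fold rounds."""
    rng = random.Random(seed)
    for _ in range(repeats):
        yield from k_fold_indices(n, k, rng)


# 100 rounds of 10-fold cross-validation give 1000 train/test splits
print(sum(1 for _ in repeated_k_fold(200)))
```

The reported metric would then be the average over all splits.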
To deal with class imbalance, Garcia and Shihab use re-sampling, and re-sampling combined with random forest achieved good results; the authors do the same. As class imbalance baselines, the paper uses the SMOTE and one-sided selection (OSS) algorithms:
SMOTE is a re-sampling technique. Unlike random over-sampling, which simply copies existing samples, SMOTE adds new synthetic samples that did not exist before, so to a certain extent it helps the classifier avoid overfitting.
One-sided selection (OSS), proposed by Kubat and Matwin, attempts to intelligently under-sample the majority class by removing majority class examples that are considered either redundant or noisy.
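To make the SMOTE idea above concrete, here is a simplified sketch of how one synthetic sample can be generated: pick a minority sample, pick one of its nearest minority neighbours, and interpolate between them. The function name, the parameter k, and the 2-D toy points are my own assumptions; this is not the implementation used in the paper.

```python
import math
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point on the segment between a random minority
    sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


# Three minority-class points in a toy 2-D feature space
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(smote_sample(minority, k=2, rng=random.Random(1)))
```

The new point lies between two real minority samples instead of duplicating one, which is the difference from random over-sampling.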
In addition, the Bagging algorithm is used as a baseline. Bagging improves the accuracy of a learning algorithm by building a series of prediction functions and then combining them into one prediction function:
1. Start with a weak learning algorithm and a training set;
2. The accuracy of a single weak learner is not high;
3. Apply the learning algorithm many times to obtain a sequence of prediction functions, and let them vote;
4. The accuracy of the final result is improved.
5.2 Metrics
5.2.1 F1-score (mentioned earlier)
Blocking bug predicted as blocking: TP
Blocking bug predicted as non-blocking: FN
Non-blocking bug predicted as blocking: FP
Non-blocking bug predicted as non-blocking: TN
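The four counts above and the precision and recall derived from them can be computed as in this small sketch. The encoding 1 = blocking, 0 = non-blocking and the function names are assumptions made for the example.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FN, FP, TN) with 1 = blocking and 0 = non-blocking."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fn, fp, tn


def precision_recall(tp, fn, fp):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN); 0.0 if undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fn, fp, tn = confusion_counts(actual, predicted)
print(tp, fn, fp, tn, precision_recall(tp, fn, fp))
```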
5.2.2. Cost effectiveness
We use ER@20% as the default cost effectiveness metric.
It is the number of blocking bugs among the first 20% of all bug reports inspected in the order given by ELBlocker, divided by the number of blocking bugs among the first 20% of all bug reports inspected in the order given by a perfect technique.
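My reading of this cost effectiveness ratio can be written as the sketch below: rank reports by predicted score, count the blocking bugs in the top 20%, and divide by what a perfect ranking would find in the same budget. The function names and the rounding of the cutoff with int() are assumptions; the paper may define the cutoff differently.

```python
def blocking_in_top(labels_in_order, percent):
    """Number of blocking bugs (label 1) among the first `percent`% of reports."""
    cutoff = int(len(labels_in_order) * percent / 100)
    return sum(labels_in_order[:cutoff])


def effectiveness_ratio(scores, labels, percent=20):
    """Blocking bugs found in the top `percent`% when ranked by score,
    relative to a perfect ranking that lists all blocking bugs first."""
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    perfect = sorted(labels, reverse=True)
    best = blocking_in_top(perfect, percent)
    return blocking_in_top(ranked, percent) / best if best else 0.0


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(effectiveness_ratio(scores, labels, percent=20))
```

In this toy example the top 20% is two reports; the ranking finds one blocking bug where a perfect ranking would find two, so the ratio is 0.5.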
5.3 RQ1: What is the performance of ELBlocker? How much better is it than the state of the art?
F1-score and ER@20% are used as the metrics.
Five methods are compared: ELBlocker, Garcia and Shihab's method, SMOTE, OSS, and Bagging,
on the 6 projects.
10-fold cross-validation is used here as well.
The Wilcoxon signed-rank test is used to check whether the improvement of ELBlocker is statistically significant.
In addition, Cliff's delta is used to quantify the size of the difference between two groups.
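Cliff's delta has a simple definition: over all pairs (x, y) taken from the two groups, it is the fraction of pairs with x > y minus the fraction with x < y. A minimal sketch follows; the function name and the sample F1 values are made up for illustration and are not results from the paper.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs, in [-1, 1]."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical F1-scores of two methods over a few runs
method_a = [0.42, 0.45, 0.47, 0.50]
method_b = [0.30, 0.33, 0.41, 0.44]
print(cliffs_delta(method_a, method_b))
```

A value near 1 means the first group almost always scores higher, a value near 0 means the groups overlap heavily.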
RQ1 Experimental Results:
To summarize, on average ELBlocker improves the F1-scores over Garcia and Shihab's method, SMOTE, OSS, and Bagging by 14.69%, 23.36%, 30.98%, and 171.65%, respectively. Also, on average ELBlocker improves the ER@20% over Garcia and Shihab's method, SMOTE, OSS, and Bagging by 8.99%, 15.76%, 22.64%, and 56.82%, respectively. Using the Wilcoxon signed-rank test, we find that the improvements provided by ELBlocker are statistically significant and have a large effect size.
RQ2: How do ELBlocker and the baseline methods compare when different proportions of the bug reports are inspected, i.e. what is the effectiveness at different K?
Previously ER@20% was used; now ER@K% is used, that is, performance is computed for different values of K.
RQ2 Experimental Results:
We notice that ELBlocker is better than the baseline methods for a wide range of K values.
RQ3: How does the performance of ELBlocker change with the number of subsets?
Previously 10 subsets were used by default; now 2 to 20 subsets are used to build the corresponding classifiers.
RQ3 Experimental Results:
The performance of ELBlocker is generally stable across various numbers of subset classifiers.
RQ4: What is the time cost of ELBlocker?
The run time is compared with that of the other methods, focusing on model building time and prediction time.
RQ4 Experimental Results:
Within an acceptable range.
S6 Discussion
6.1 Effectiveness of ELBlocker
ELBlocker is compared with random prediction.
Conclusion: To summarize, in most cases (except for Chromium) ELBlocker achieves a much better performance compared to random prediction.
6.2 Elblocker vs. Bagging + Random Forest
ELBlocker improves the F1-score and ER@20% of Bagging + Random Forest by 175.05% and 12.83%, respectively.
6.3 Threats to validity
Internal validity (relates to errors in our code and experiment bias): To reduce training set selection bias, we run 10-fold cross-validation and record the average performance.
External validity (relates to the generalizability of our results): We plan to reduce the threat further by analyzing even more bug reports from additional software projects.
Construct validity (refers to the suitability of our evaluation measures): We use F1-score and cost effectiveness, which are also used by past studies to evaluate the effectiveness of various automated software engineering techniques.
S7 Related work
7.1 Blocking bug prediction
The problem was first proposed by Garcia and Shihab; this paper studies it further and confirms that its own approach performs better.
7.2 Other studies on bug report management
7.3 Imbalanced learning and ensemble learning
Imbalanced learning (class imbalance learning):
Under-sampling: OSS
Over-sampling: SMOTE
Ensemble learning:
ELBlocker is similar to Bagging, which also builds classifiers on subsets of the training data. The differences are:
To determine the label of an instance, Bagging uses a majority voting mechanism. Our ELBlocker is different from Bagging since we don't use bootstrap sampling to select the subsets; rather, we randomly divide the training set into multiple disjoint subsets, and we build a classifier on each of these subsets. Moreover, after we build multiple classifiers, we automatically detect an appropriate imbalanced decision threshold; Bagging does not consider a threshold.
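The two differences named above (how subsets are drawn, and how the final label is decided) can be illustrated with a toy sketch. All names here are hypothetical, and the 0.5 cut-off used to turn a score into a single vote is my assumption.

```python
import random


def bootstrap_sample(items, rng):
    """Bagging-style subset: draw len(items) items with replacement,
    so the same item can appear several times and samples can overlap."""
    return [rng.choice(items) for _ in items]


def majority_vote(votes):
    """Bagging-style decision: label 1 if more than half of the votes are 1."""
    return 1 if sum(votes) > len(votes) / 2 else 0


def threshold_decision(scores, threshold):
    """ELBlocker-style decision: average the scores, compare to a tuned threshold."""
    return 1 if sum(scores) / len(scores) > threshold else 0


scores = [0.45, 0.40, 0.90]                   # likelihoods from three classifiers
votes = [1 if s > 0.5 else 0 for s in scores]  # each classifier votes at 0.5
print(majority_vote(votes), threshold_decision(scores, threshold=0.3))
```

With these numbers the majority vote says non-blocking (only one of three classifiers votes blocking), while the averaged score of about 0.58 is above a tuned threshold of 0.3 and is labelled blocking. A lower threshold like this is what makes the decision rule sensitive to the rare class.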
S8 Conclusion and future work
1 Apply ELBlocker to more projects;
2 Use other methods (such as text mining and data retrieval) to improve the efficiency of ELBlocker;
3 Develop a tool that tells developers not only whether a bug is a blocking bug, but also which bugs it blocks.
Software mining: ELBlocker: Predicting blocking bugs with ensemble imbalance learning