Beyond Data Mining


The predictive modeling community applies data mining to artifacts from software projects. This work has been very successful: we know how to build predictive models for software effort and defects, as well as models for tasks such as developer programming patterns (see the extended version of this article for more details).

That said, the predictive modeling community needs to shift its focus if it is to really make a difference to practitioners in industry. We spend too much time on algorithm mining of the data, when the battleground has already shifted to what I call landscape mining. To support industry practitioners, we must then move on to what I call decision mining, and after that to discussion mining.

This article compares and contrasts the four kinds of miners shown in Figure 1:

The algorithm miner explores parameter tuning for data mining algorithms.

The landscape miner reveals the shape of the decision space.

The decision miner examines how to change a project most effectively.

The discussion miner helps a community choose among competing decisions.

Algorithm mining and landscape mining are more research-oriented activities that explore the internal details of the miners. Decision mining and discussion mining focus more on practitioners, because they concern how a community uses the conclusions.

Algorithm Mining

Although it is rarely stated, the original premise of predictive modeling is that predictions should be able to guide software management; in other words, the goal of prediction is decision making.

Sadly, that original goal seems to have been forgotten. Too many researchers in this field are stuck in a rut, publishing papers that spend little time exploring the data but a great deal of time on the data mining algorithms. Most of these papers focus on exploring algorithm configurations rather than on what those algorithms reveal about the underlying data. A recent paper points out that such algorithm mining yields very little, because the "improvements" obtained this way are marginal at best. For example, for effort estimation and defect prediction, simpler data miners achieve the same or even better results than more sophisticated ones.1,2

Landscape Mining

Algorithm mining is "leap before you look": the researcher throws an algorithm at the data and then sees what comes out. The second approach is "look before you leap": first mine the data to find the space of possible inferences, and only then leap in with a learner. That space is the "landscape" of the data.

Figure 1. Four kinds of mining; from left to right, they represent the past and the future.

Consider a case-based reasoning (CBR) system called W1 (pronounced "dub-ya").3 CBR draws conclusions by examining similar past cases. To turn W1 into a landscape miner (which we'll call W2), we can cluster the training data into a cluster tree in which each child node contains a subset of its parent's data. We then run an attribute selector over the data, rejecting attributes whose values cannot distinguish the clusters. Specifically, we check the entropy of each attribute's values across the clusters and discard those with the highest entropy. Finally, we can replace each leaf cluster with its median example. The resulting feature space and sample are very small: only a handful of dozens of features remain, and hundreds of examples reduce to one representative per leaf cluster.

Because inference is now limited to a subtree of clusters (each leaf now holds only one representative example), we can quickly build many local models for specific contexts.
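As an illustration, here is a minimal sketch, in Python, of a W2-style landscape miner along the lines described above. It assumes numeric project data in a NumPy array and uses scikit-learn's KMeans for recursive top-down clustering; the helper names, the five-bin entropy discretization, and the "keep four attributes" cutoff are illustrative assumptions, not details from the article.

import numpy as np
from sklearn.cluster import KMeans

def cluster_tree(rows, min_size=8):
    """Recursively split rows into a tree of clusters; each child holds a subset of its parent."""
    node = {"rows": rows, "children": []}
    if len(rows) <= min_size:
        return node
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(rows)
    for k in (0, 1):
        sub = rows[labels == k]
        if 0 < len(sub) < len(rows):          # skip degenerate splits
            node["children"].append(cluster_tree(sub, min_size))
    return node

def leaves(node):
    if not node["children"]:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]

def column_entropy(values, bins=5):
    """Entropy of one attribute, discretized into a few bins."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def prune_attributes(rows, keep=4):
    """Discard the highest-entropy attributes (the article's heuristic); keep the rest."""
    entropies = [column_entropy(rows[:, j]) for j in range(rows.shape[1])]
    return np.argsort(entropies)[:keep]

# Usage: cluster the data, prune attributes, keep one median exemplar per leaf.
data = np.random.rand(400, 20)                # stand-in for real project data
tree = cluster_tree(data)
cols = prune_attributes(data)
exemplars = [np.median(leaf["rows"][:, cols], axis=0) for leaf in leaves(tree)]
print(len(exemplars), "exemplars,", len(cols), "attributes each")

The point of the sketch is only that each step (clustering, entropy-based pruning, median exemplars) is an off-the-shelf tool; the novelty lies in how they are combined.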

W2 has two important characteristics. First, it is a landscape miner: it maps out the different regions of the data so that we can build different models for them. Second, although the combination is novel, every part of W2 is a tool already familiar to the predictive modeling community. In other words, the predictive modeling community can repurpose its existing tools and aim them at an interesting new goal.

Decision Mining

Recently, at a panel on software analytics at ICSE 2012, industry practitioners reviewed the state of data mining technology. "Prediction is all well and good, but what about decision making?" one speaker said.4 "Data mining is useful because it focuses an investigation on specific issues, but that mining takes place within a higher-level decision process."

To turn W2 into a decision miner (which we call W3), we add a contrast-set learner. Classifiers can characterize the different regions of the data; contrast sets report the differences between those regions. Contrast sets are much smaller than classification rules, especially when they are generated by post-processing a decision-tree learner. Contrast sets read from the upper levels of a decision tree tend to rule out most of the possibilities, selecting only a few classes, and they achieve this with very few extra constraints.
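For concreteness, here is a hedged sketch of that post-processing idea: train a shallow decision tree, then read the tests from its top levels as a tiny contrast set. The synthetic data, the "better"/"worse" labels, and the depth-2 cutoff are illustrative assumptions; the article does not prescribe a specific tree learner.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 6))                      # stand-in project features
y = (X[:, 2] < 0.4).astype(int)               # 1 = "better" outcome (synthetic)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def top_level_contrasts(model, depth=2):
    """Collect (feature, threshold) tests from the first `depth` levels of the tree.
    A handful of such tests already rules out most of the space, which is why
    contrast sets stay far smaller than full rule sets."""
    t = model.tree_
    tests, frontier = [], [(0, 0)]            # (node id, level), starting at the root
    while frontier:
        node, level = frontier.pop()
        if level >= depth or t.children_left[node] == -1:   # -1 marks a leaf
            continue
        tests.append((int(t.feature[node]), float(t.threshold[node])))
        frontier += [(t.children_left[node], level + 1),
                     (t.children_right[node], level + 1)]
    return tests

for feature, threshold in top_level_contrasts(model):
    print(f"attribute[{feature}] <= {threshold:.2f}")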

The clusters used by W3 are the same as those found by W2, but W3 applies the principle of envy. Each cluster finds the nearest neighboring cluster that it most envies; for effort estimation, for example, that is the neighboring cluster whose projects were built more cheaply. W3 then applies the contrast-set learner to that neighboring cluster to find the practices that achieved the better results there. In a recent IEEE Transactions on Software Engineering paper, I showed that this envy-based "local learning" outperforms generic models learned from all the data.5
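The envy step can be sketched in the same hedged style: each cluster searches for the nearest cluster whose projects show a better (here, lower) median effort, which is the cluster it would then learn a contrast set against. The centroid-distance measure and the "effort in the last column" convention are assumptions made for illustration.

import numpy as np

def envied_neighbor(clusters, effort_col=-1):
    """For each cluster (a 2-D array of rows), return the index of the nearest
    cluster with a lower median value in `effort_col`, or None if none is better."""
    centroids = [c.mean(axis=0) for c in clusters]
    medians = [np.median(c[:, effort_col]) for c in clusters]
    envy = []
    for i, (ci, mi) in enumerate(zip(centroids, medians)):
        best, best_dist = None, np.inf
        for j, (cj, mj) in enumerate(zip(centroids, medians)):
            if j == i or mj >= mi:            # only envy clusters that did better
                continue
            dist = float(np.linalg.norm(ci - cj))
            if dist < best_dist:
                best, best_dist = j, dist
        envy.append(best)
    return envy

# Usage with three synthetic clusters whose last column stands in for "effort".
rng = np.random.default_rng(1)
clusters = [rng.random((50, 5)) + k * 0.5 for k in range(3)]
print(envied_neighbor(clusters))              # e.g., [None, 0, 1]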

W3 teaches us the same lesson as W2: by refactoring our existing tools, we can arrive at new and innovative kinds of predictive modeling.
