From regression analysis to data mining

Source: Internet
Author: User

Regression analysis is a statistical method for studying the quantitative relationship between two or more variables, and it is widely used across industries. Business analysts in banking, insurance, telecommunications, and other service industries use it for database marketing and fraud-risk detection. Research and development staff in semiconductor, electronics, chemical, pharmaceutical, steel, and other manufacturing industries use it for new-product experiment design and analysis, process optimization, and process monitoring. More broadly, enterprises of all kinds use regression analysis in quality management and Six Sigma projects.

Regression analysis helps us determine which factors are significant and which are not, and the resulting regression equation can be used for prediction and control. However, once we demand more of a regression model's validity and accuracy, we find that the usual regression workflow has some built-in weaknesses and pitfalls:

1. There is no step that validates the model against actual data. A common complaint is that the model looks beautiful, but as soon as it is put to use its predictions turn out to be inaccurate;

2. Modeling relies on a single method, so the problem cannot be examined from several angles in order to fit the data better;

3. There is no systematic way to compare the models produced by different methods, let alone to select a relatively optimal model from many candidates.

To eliminate these hidden dangers and break through the tool bottleneck, the ideal approach is to move up from the level of "regression analysis" to the level of "data mining".

Data mining is a broader concept in data analysis. It mainly refers to the whole process of uncovering hidden, previously unknown, and potentially valuable information from large amounts of enterprise data. At the level of statistical technique, data mining has at least three major characteristics:

1. It emphasizes partitioning the data before any modeling begins. In general, all the raw data is divided into training data (used to fit the model), validation data (used to validate the model), and test data (used to test the model). This ensures from the outset that the resulting model can withstand the severe test of real, complex conditions.

2. It provides a wealth of modeling tools. In addition to traditional regression methods such as least squares, stepwise regression, and logistic regression, it includes many newer, practical modeling techniques, such as decision trees, neural networks, association rules, support vector machines, and text mining. This gives us a way to solve the problem when regression analysis fails.

3. "Model comparison" is an essential part of the data mining process. It lets us choose, scientifically and objectively, the best of the candidate models, so that predictions are as accurate as possible and prediction error is as low as possible.
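The partitioning step in point 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of any particular tool: the function name `partition`, the 60/20/20 proportions, and the fixed random seed are all assumptions chosen for the example.

```python
import random

def partition(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    if train_frac <= 0 or valid_frac <= 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains becomes the test set
    return train, valid, test

# Hypothetical data: 100 row identifiers standing in for real observations.
rows = list(range(100))
train, valid, test = partition(rows)
print(len(train), len(valid), len(test))
```

The model is fitted on the training set only. The validation set guides choices made during modeling, and the test set is held back until the end to give an honest estimate of how the final model behaves on unseen data.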

Clearly, these three characteristics of data mining effectively compensate for the shortcomings of regression analysis and lay a solid foundation for modeling and forecasting work. The following real case illustrates the move from regression analysis to data mining. For data security reasons, the core data (including variable names) has been encoded.

The research and development department of a well-known steel company was working on a project to build a prediction model for the end-quench curve of structural steel. The team first built a predictive model in JMP, the interactive visual statistical discovery software developed by SAS for engineers and scientists (see).

Judging from the analysis report, the prediction model looked good. When the model was rolled out, however, its prediction error turned out to be very large, which seriously shook the technicians' confidence in applying statistical modeling. Fortunately, with guidance from an authoritative consulting agency, the team found that the main cause of the prediction error was overfitting: the model had fitted a great deal of noise that it should not have fitted. The project members reconsidered the methodology their technical research required and decided to upgrade to JMP Pro, the advanced version of JMP. Without running a single field experiment or applying for any additional budget, they significantly improved the model's predictive performance and achieved the desired results.
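To make the overfitting problem concrete, the following Python sketch uses synthetic data, not the steel company's data, which is not available here. It assumes a linear signal with Gaussian noise and uses a nearest-neighbour predictor as a stand-in for an overly flexible model. The helper names `knn_predict`, `rmse`, and `sample` are hypothetical.

```python
import math
import random

def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def rmse(train, data, k):
    """Root mean squared error over `data` of the k-NN predictor built on `train`."""
    squared = [(knn_predict(train, x, k) - y) ** 2 for x, y in data]
    return math.sqrt(sum(squared) / len(squared))

def sample(rng, n):
    """Draw n points from an assumed process: y = 2x + Gaussian noise."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        points.append((x, 2.0 * x + rng.gauss(0, 1.0)))
    return points

rng = random.Random(1)
train = sample(rng, 60)     # data the model is built from
holdout = sample(rng, 60)   # fresh data the model has never seen

# k=1 memorizes every training point, noise included.
train_err_k1 = rmse(train, train, 1)
holdout_err_k1 = rmse(train, holdout, 1)
print("k=1  training RMSE:", train_err_k1, " held-out RMSE:", holdout_err_k1)

# k=10 averages over neighbours, so it cannot reproduce the noise exactly.
print("k=10 training RMSE:", rmse(train, train, 10),
      " held-out RMSE:", rmse(train, holdout, 10))
```

With k=1 the predictor reproduces its own training data exactly, so its training error is zero, yet its error on the held-out points is not. A report based only on the training data would therefore look perfect while saying nothing about real predictive accuracy, which is the pattern described in the case above.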

In terms of technical detail, the differences between the later and earlier stages of the project embody exactly the three characteristics of data mining described above:

First, instead of indiscriminately using all the data to build the model, the team split the data in planned proportions into three categories: training data, validation data, and test data. Each category plays its own role in ensuring that the resulting model remains effective in the production phase.

Second, the team broadened its thinking and, alongside regression analysis, made combined use of a variety of data mining modeling tools, such as decision trees, neural networks, and their derivatives (for example, random forests, called Bootstrap Forest, and boosted trees). This avoided the modeling errors caused by mechanically applying a single method.

Third, the team cast a wide net first and tightened the criteria afterwards. It pooled the candidate models obtained earlier and then, combining rigorous quantitative statistical measures with practical business experience, selected the most appropriate prediction model. This embodies a modeling philosophy of drawing on many sources and learning from the strengths of each.
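The quantitative half of this selection step can be sketched as follows. This is an illustration under stated assumptions, not the project's actual workflow or JMP Pro's implementation: the data is synthetic, and the three candidates (a mean predictor, a least-squares line, and a 1-nearest-neighbour predictor) are simple stand-ins for the regression, tree, and neural network models named above. All function names are hypothetical.

```python
import math
import random

def fit_mean(train):
    """Baseline candidate: always predict the training mean of y."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Candidate fitted by ordinary least squares with one predictor."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def fit_1nn(train):
    """Very flexible candidate: copy the y of the nearest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def rmse(model, data):
    """Root mean squared error of a fitted model over (x, y) pairs."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in data) / len(data))

# Synthetic data (assumed): y = 3 + 2x + Gaussian noise, generated in random order,
# so slicing it is equivalent to a random three-way split.
rng = random.Random(7)
data = []
for _ in range(150):
    x = rng.uniform(0, 10)
    data.append((x, 3.0 + 2.0 * x + rng.gauss(0, 1.0)))
train, valid, test = data[:90], data[90:120], data[120:]

candidates = {"mean": fit_mean, "line": fit_line, "1-NN": fit_1nn}
fitted = {name: fit(train) for name, fit in candidates.items()}

# Compare every candidate on the same validation data and keep the best one.
valid_rmse = {name: rmse(model, valid) for name, model in fitted.items()}
best = min(valid_rmse, key=valid_rmse.get)

# The test set is used once, at the end, to report the chosen model's error.
test_rmse = rmse(fitted[best], test)
print("validation RMSE:", valid_rmse)
print("selected:", best, " test RMSE:", test_rmse)
```

Scoring all candidates on the same validation set is what makes the comparison systematic. The business judgment mentioned above is not captured by the code; in practice it might, for example, favour a slightly less accurate model that is easier to explain or maintain.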

In short, the move "from regression analysis to data mining" is a problem that an enterprise will inevitably encounter once its refined management has developed to a certain stage. Admittedly, data mining looks relatively complex compared with traditional regression analysis. However, modern statistical analysis software that combines advanced algorithms with a friendly interface, such as the JMP Pro software used in this case, has greatly lowered the technical threshold for data mining. Both trained statisticians and ordinary technicians without a statistics background can get started quickly and mine from real data the information that is useful for business operations.
