Thinking in BigData (12): Big Data, guided data mining methodology, part 3

Source: Internet
Author: User

Continuing from the previous post, this one covers the remaining steps of the guided data mining methodology.

5. Fix problem data

All data is dirty. All data has problems. What counts as a problem sometimes depends on the data mining technique. For some techniques, such as decision trees, missing values and outliers do not cause much trouble. For others, such as regression and neural networks, they can cause many problems.

5.1 Categorical variables with too many values

Variables with many values must be processed in some way. One method is to group the values, that is, to put together the classes that have the same relationship with the target variable.
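As a rough illustration of grouping values, the sketch below ranks each category by its share of positive targets and cuts the ranking into a few groups. The function name, the sample data, and the choice to cut the ranking into equal-sized chunks are all assumptions made for this example, not a method taken from the text.

```python
from collections import defaultdict

def group_by_target_rate(categories, targets, n_groups=3):
    """Map each category to a group index based on its share of positive targets."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for cat, y in zip(categories, targets):
        counts[cat] += 1
        positives[cat] += y
    # Rank categories by target rate; the name breaks ties so the result is stable.
    ranked = sorted(counts, key=lambda c: (positives[c] / counts[c], c))
    # Cut the ranking into n_groups chunks of roughly equal size.
    return {cat: i * n_groups // len(ranked) for i, cat in enumerate(ranked)}

# Made-up data: a city code per customer and whether the customer responded.
cities = ["A", "A", "B", "B", "C", "C", "D", "D"]
responded = [1, 1, 0, 0, 1, 0, 0, 0]
groups = group_by_target_rate(cities, responded, n_groups=2)
print(groups)
```

With real data, categories that have very few records should probably be pooled first, because their target rates are unreliable.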

5.2 Numeric variables with skewed distributions and outliers

Outliers and skewed distributions cause problems for any technique that uses the values arithmetically, for example by multiplying each variable by a weight and summing the results. One remedy is to divide the values into bins of equal size. Another is to transform the data, for example by standardizing the values, so that their range becomes narrower.
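The two remedies can be sketched in a few lines of plain Python. This is only a minimal example under simple assumptions: the function names and the income figures are made up, bins are formed by rank so that each holds about the same number of records, and standardizing means rescaling to mean 0 and standard deviation 1.

```python
import statistics

def equal_size_bins(values, n_bins):
    """Assign a bin index to each value so bins hold about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Ties are split by position, so equal values can land in adjacent bins.
        bins[i] = rank * n_bins // len(values)
    return bins

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (fails if all values are equal)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

incomes = [20, 25, 30, 35, 40, 1000]  # one extreme outlier
bins = equal_size_bins(incomes, 3)
z = standardize(incomes)
print(bins)
```

Note that binning removes the influence of the outlier completely, while standardizing only rescales it: the value 1000 still sits far from the others afterwards.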

5.3 Missing Values

Records with missing values end up in the model set, but many models cannot process missing values themselves. Dropping these records introduces bias, because missing values are not evenly distributed across the data. The usual method is to replace the missing value with the mean or the most common value. Replacing it with an impossible value produces worse results, because the algorithm has no way of knowing that the value is impossible.

Some data mining tools provide the ability to fill in missing values. These methods are generally usable, and they rely on data mining techniques to estimate what the missing values should be.
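A minimal sketch of the simple replacement strategy, assuming missing values are represented as None. The function name and the extra "was missing" flag are additions for this example; the flag is one way to keep the information that a value was filled in rather than observed.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean or the most common observed value."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    filled = [fill if v is None else v for v in values]
    # Keep a flag per record so the model can still see that the value was missing.
    was_missing = [v is None for v in values]
    return filled, was_missing

ages, age_missing = impute([30, None, 50, 40], strategy="mean")
regions, _ = impute(["north", None, "south", "north"], strategy="mode")
print(ages, age_missing, regions)
```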

However, some values are missing for perfectly normal reasons. For example, if a model uses one year of historical data, customers who have been with the company for less than a year are a problem: for the months before they became customers, the data is empty. Some customers cannot be matched against the database, so all of their demographic data is missing. In these cases, use multiple models on different parts of the data. Build one model for customers with more than one year of history and another for recent customers. Create as many models as needed.

When building models, pay special attention to keeping track of the records that are discarded. Typically, the model set is split into subsets that contain no missing values, and a separate model is built for each subset.
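One way to picture the split is to group records by which fields they are missing, so that each subset is complete for the fields it does have. The sketch below assumes records are dictionaries with None for missing values; the field names and the helper name are hypothetical.

```python
from collections import defaultdict

def split_by_missing_pattern(records, fields):
    """Group records by which of the given fields are missing (None)."""
    subsets = defaultdict(list)
    for rec in records:
        pattern = tuple(f for f in fields if rec.get(f) is None)
        subsets[pattern].append(rec)
    return dict(subsets)

# Made-up customers: recent ones have no spending history for the prior year.
customers = [
    {"id": 1, "tenure_months": 30, "spend_prior_year": 420.0},
    {"id": 2, "tenure_months": 5, "spend_prior_year": None},
    {"id": 3, "tenure_months": 18, "spend_prior_year": 95.0},
    {"id": 4, "tenure_months": 2, "spend_prior_year": None},
]
subsets = split_by_missing_pattern(customers, ["spend_prior_year"])
for pattern, recs in subsets.items():
    # Each subset would get its own model, built only on the fields it has.
    print(f"missing={pattern} records={len(recs)}")
```

Printing the size of each subset is also a simple way to keep a record of how much data goes to which model and how much, if any, is left out.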

 

6. Transform data to reveal information

After the data has been assembled and its main problems have been fixed, it is ready for analysis. This may require adding derived fields to reveal information. It may also involve removing outliers, binning numeric variables, grouping categorical variables, and applying transformations such as taking logarithms or turning counts into proportions.
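A small sketch of what derived fields can look like. The record layout (day calls, night calls, spend) is invented for this example; the point is only that a proportion or a logarithm can expose information that the raw columns hide.

```python
import math

def add_derived_fields(rec):
    """Return a copy of the record with two illustrative derived fields."""
    out = dict(rec)
    total_calls = rec["calls_day"] + rec["calls_night"]
    # A proportion instead of raw counts (guarding against division by zero).
    out["share_night_calls"] = rec["calls_night"] / total_calls if total_calls else 0.0
    # log1p compresses a long right tail and is defined at zero.
    out["log_spend"] = math.log1p(rec["spend"])
    return out

row = add_derived_fields({"calls_day": 30, "calls_night": 10, "spend": 0.0})
print(row)
```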

7. Build a model

In guided data mining, the training set is used to produce an explanation of the dependent, or target, variable in terms of the independent, or input, variables. This explanation may take the form of a neural network, a decision tree, a link graph, or some other representation of the relationship between the target and the other fields in the database. In most cases, the data mining software does this automatically.

8. Evaluate the model

This step is left blank for now. We will discuss how to evaluate a model later.

9. Deploy the model

Data mining tools can produce scoring code as part of the model deployment process. The scoring code may be in SAS or SPSS, or in a programming language such as C, Java, or C#. However, deploying the model code solves only half of the problem, because the model usually uses input variables that do not exist in the raw data. Model scoring is a great challenge, especially when the model must be scored in real time. For example, when a customer puts an item in a shopping basket or visits a web page, the web application must score the model. Scoring must be very fast, because it must not get in the way of the customer navigating the site.
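The "other half" of the problem can be sketched as follows: before the model can be scored, its input variables have to be derived from the raw event. Everything here is hypothetical, including the logistic form of the model, the coefficients, and the event fields; a real deployment would use the scoring code exported by the mining tool.

```python
import math

# Hypothetical coefficients; a real model would come from the mining tool.
WEIGHTS = {"intercept": -2.0, "items_in_basket": 0.4, "log_pages_viewed": 0.8}

def derive_inputs(raw_event):
    """Build the model's input variables from a raw event record."""
    return {
        "items_in_basket": len(raw_event["basket"]),
        "log_pages_viewed": math.log1p(raw_event["pages_viewed"]),
    }

def score(raw_event):
    """Return a score between 0 and 1 from a simple logistic model."""
    inputs = derive_inputs(raw_event)
    z = WEIGHTS["intercept"] + sum(WEIGHTS[k] * v for k, v in inputs.items())
    return 1.0 / (1.0 + math.exp(-z))

s = score({"basket": ["book", "pen"], "pages_viewed": 0})
print(round(s, 4))
```

Keeping the derivation step cheap matters as much as the model itself when the score is needed while the page is loading.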

9.1 Deploy optimized models

To evaluate the profitability of a model, you must check that the cost and revenue figures behind it are correct. A profit chart then shows the actual profitability of a campaign for target groups of different sizes.

To evaluate the profit of a model, you need to ask the following questions:

· What is the fixed cost of setting up the campaign and the model that supports it?

· What is the cost per recipient of making the offer?

· What is the cost per respondent of fulfilling the offer?

· What is the value of a positive response?

A profit model is only as good as its inputs. Although the fixed and variable costs of a campaign are easy to obtain, the predicted value of a respondent is hard to estimate. How to determine a customer's value is beyond the scope of this discussion, but a good estimate helps to measure the real value of a data mining model.
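The four questions translate directly into a small profit calculation. The sketch below is only an illustration: the parameter names are chosen for this example and the numbers are made up.

```python
def campaign_profit(n_contacted, n_responders, fixed_cost,
                    cost_per_contact, cost_per_response, value_per_response):
    """Profit of a campaign computed from the four cost and value inputs."""
    revenue = n_responders * value_per_response
    cost = (fixed_cost
            + n_contacted * cost_per_contact
            + n_responders * cost_per_response)
    return revenue - cost

# Illustrative numbers only.
profit = campaign_profit(n_contacted=10_000, n_responders=300,
                         fixed_cost=5_000, cost_per_contact=1.0,
                         cost_per_response=20.0, value_per_response=100.0)
print(profit)
```

Repeating the calculation for target groups of different sizes, each with its own number of responders, gives the points of a profit chart.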

Finally, the most important measure is return on investment (ROI). Measuring lift on a test set helps you select the right model. Profit calculations based on lift help determine how to apply the results of the model. However, it is also very important to measure these things in the field. In a database marketing application, this means always setting aside control groups and carefully tracking customer response by model score.
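Lift on a test set can be computed in a few lines. This sketch assumes the common definition of lift, namely the response rate among the top-scored fraction of records divided by the overall response rate; the function name and the data are made up.

```python
def lift_at(scores, responded, fraction):
    """Lift of the top `fraction` of records when ranked by model score."""
    ranked = sorted(zip(scores, responded), key=lambda p: p[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top_rate = sum(r for _, r in ranked[:n_top]) / n_top
    overall_rate = sum(responded) / len(responded)
    return top_rate / overall_rate

# Made-up test set: model scores and actual responses (1 = responded).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
lift = lift_at(scores, responded, 0.2)
print(round(lift, 2))
```

A lift of 1.0 means the model does no better than picking customers at random.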

10. Evaluate results

A typical campaign requires several different test groups:

· Test group: customers with high model scores who receive the message

· Modeled holdout group: customers with high model scores who do not receive the message

· Control group: customers with low or random model scores who receive the message

· Holdout group: customers with random model scores who do not receive the message
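The four groups cross two factors: whether the model score was high, and whether the message was sent. Comparing response rates across them separates the effect of the model from the effect of the message. The sketch below uses invented outcomes and labels the groups by those two factors.

```python
def response_rate(responses):
    """Share of positive responses in a list of 0/1 outcomes."""
    return sum(responses) / len(responses)

# Invented outcomes (1 = responded), keyed by (score level, message sent or not).
groups = {
    ("high_score", "message"):    [1, 1, 0, 1, 0, 0, 0, 0, 0, 0],
    ("high_score", "no_message"): [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ("random", "message"):        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ("random", "no_message"):     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
rates = {k: response_rate(v) for k, v in groups.items()}

# Effect of the model: same message, high scores versus random scores.
model_effect = rates[("high_score", "message")] - rates[("random", "message")]
# Effect of the message: same high scores, message versus no message.
message_effect = rates[("high_score", "message")] - rates[("high_score", "no_message")]
print(rates, model_effect, message_effect)
```

Real groups would need to be far larger than ten customers for the differences to mean anything.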

11. Start again

Every data mining project raises more questions than it answers, which is a good thing. It means that relationships that were previously invisible are now visible. The newly discovered relationships suggest new hypotheses to test, and the data mining process starts again. The useful feedback is mined in turn.

 

Summary:

Guided data mining searches historical records for patterns that explain a particular outcome. The two categories of guided data mining models are profiling models and prediction models. Both use the same techniques and methodology; they differ only in how the model set is constructed.

Solutions to guided data mining problems may involve several chained models. For example, a cross-selling model may use a different prediction model for each product and apply decision rules to select the best result. A response model can be optimized for profit, so that it really calculates the expected value of a response rather than just the probability of a response. A more sophisticated approach is to use an incremental response model, whose target is the increase in response rate due to the marketing effort, not just the response rate itself.
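The ideas in this paragraph can be sketched with a simple decision rule. Assume each product has its own model that outputs a response probability for a customer; the rule then picks the product with the highest expected value, taken here as probability times the value of a response. All names and numbers are hypothetical.

```python
def best_offer(response_probs, offer_values):
    """Pick the product with the highest expected value for one customer."""
    expected = {p: response_probs[p] * offer_values[p] for p in response_probs}
    best = max(expected, key=expected.get)
    return best, expected[best]

def incremental_response(p_if_contacted, p_if_not_contacted):
    """Estimated increase in response probability caused by the contact."""
    return p_if_contacted - p_if_not_contacted

# Hypothetical per-product model outputs for one customer.
probs = {"credit_card": 0.10, "mortgage": 0.02, "savings": 0.30}
values = {"credit_card": 50.0, "mortgage": 800.0, "savings": 20.0}
product, ev = best_offer(probs, values)
print(product, ev)
```

In this example the product most likely to get a response (savings) is not the one with the highest expected value (mortgage), which is exactly why optimizing for profit differs from optimizing for response.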

In the process of creating a data mining model, the first hurdle is to translate the business problem into a data mining problem. The next challenge is to find appropriate data that can be transformed into actionable information. Once the data has been found, explore it thoroughly. Exploration may uncover data problems, and it also helps the data miner build an intuitive understanding of the data. The next step is to create a model set and divide it into a training set, a validation set, and a test set.
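Dividing the model set can be as simple as a shuffled split. The proportions below (60/20/20) and the function name are assumptions made for this example; the text does not prescribe them.

```python
import random

def split_model_set(records, train=0.6, validation=0.2, seed=0):
    """Shuffle records and split them into training, validation, and test sets."""
    shuffled = list(records)
    # A fixed seed makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, validation_set, test_set = split_model_set(range(100))
print(len(train_set), len(validation_set), len(test_set))
```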

Data transformation serves two goals: 1. to fix data problems, such as missing values and categorical variables with too many values; 2. to reveal information by creating new variables that represent trends, ratios, and other combinations. How to transform data is described later.

Once the data has been transformed, building a model is relatively easy. Each type of model has its own metrics by which it can be evaluated, but evaluation methods that are independent of the model type are also available. Some of the most important are the lift chart and the ROC chart, which show how the model increases the concentration of the desired value of the target variable; the confusion matrix, which shows the misclassification error rate of a binary response model; and the score distribution chart for numeric targets. In later posts, we will discuss how to build our own models based on this methodology.
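As a small illustration of one of these evaluation methods, the sketch below builds a confusion matrix for a binary model and derives the misclassification rate from it. Labels are assumed to be 0 and 1, and the data is made up.

```python
def confusion_matrix(actual, predicted):
    """Counts of true/false positives and negatives for 0/1 labels."""
    cm = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            cm["tp" if a == 1 else "fp"] += 1
        else:
            cm["fn" if a == 1 else "tn"] += 1
    return cm

def error_rate(cm):
    """Share of records that were classified incorrectly."""
    return (cm["fp"] + cm["fn"]) / sum(cm.values())

actual = [1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 1, 0, 0, 0]
cm = confusion_matrix(actual, predicted)
print(cm, error_rate(cm))
```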



See Data Mining Technology



Copyright BUAA

 
