The purpose of data mining is to find more high-quality customers in the data. This post continues the discussion of the data mining methodology from the previous post: what guided data mining is, and how to build a model with it. To build a guided data mining model, you must first understand and define the target variable that the model tries to estimate. A typical case is a binary response model, such as a model that selects customers for a direct mail or email marketing campaign. The model is built from historical data on customers who responded to similar campaigns in the past. The purpose of guided data mining is to find more customers like them in order to improve the response to future campaigns. Constructing a guided data mining model involves the following: 1. Define the structure and objectives of the model. 2. Add response modeling. 3. Consider the stability of the model. 4. Discuss the stability of the model through prediction models and profiling models. The rest of this post describes how to construct a data mining model step by step.
The guided data mining methodology consists of the following steps:
· Convert the business problem into a data mining problem
· Select appropriate data
· Get to know the data
· Create a model set
· Fix problems with the data
· Transform the data to reveal information
· Build models
· Evaluate models
· Deploy models
· Evaluate results
· Start again
(The guided data mining methodology and its models)
1. Convert the business problem into a data mining problem
In Alice in Wonderland, Alice says, "I don't much care where." The Cat replies, "Then it doesn't matter which way you go." Alice adds, "So long as I get somewhere." The Cat: "Oh, you're sure to do that, if you only walk long enough."
The Cat might have made another point: without a definite destination, you can never be sure that you have walked long enough.
The goal of a data mining project is to find a solution to a clearly defined business problem. The data mining goal of a specific project should not be a broad, general statement. Broad objectives should be made concrete and refined. For example, a general aim of gaining deeper insight into customer behavior might become specific goals such as:
· Identify customers who are unlikely to renew
· Design a calling plan for home-based business customers that will reduce the churn rate
· Identify which online transactions are likely to be fraudulent
· List the products whose sales are at risk if wine and beer are no longer sold
· Forecast the number of customers over the next three years based on the current marketing strategy
Guided data mining is often treated as a technical problem, that is, finding a model that explains the relationship between a set of input variables and the target variable. This is often the center of data mining, but it cannot succeed if the target variable is not correctly defined and the appropriate input variables are not identified. Those tasks in turn depend on how well the business problem to be solved is understood. Without a correct understanding of the business problem, the data cannot be converted into a mining task. Before starting the technical work, you must answer two questions: How will the results be used? How will the results be delivered?
The answers to these two questions must be based on the customer's real needs, not on what the data mining engineer thinks is useful or best for the customer. A result may appear to help the customer increase sales, but what exactly are we delivering? Do we really understand the customer's needs? Do not rush to get started before all the prerequisites are clear. First understand the real needs, and never charge ahead blindly.
1.1 How to use the results?
For example, many data mining projects aim to improve customer retention. The results might be used to:
· Proactively offer a discount to high-risk or high-value customers to retain them
· Change the mix of acquisition channels to favor those that bring in the most loyal customers
· Forecast the number of customers in the coming months
· Fix product defects that affect customer satisfaction
These goals affect the data mining process. Reaching existing customers through telephone or direct mail campaigns means that, in addition to identifying the customers at risk, you need to know why they are at risk so that you can create an attractive offer. The call must be made neither too early nor too late. Forecasting means that, in addition to determining which customers may leave, you must determine how many new customers will join and how long they are likely to stay. Forecasts of new customers are not just a problem for the prediction model to solve; they must also be incorporated into business goals and budgets.
1.2 How to deliver the results?
A guided data mining project may produce several different types of deliverables. The deliverable is often a report or a briefing filled with charts and graphs. The form of the deliverable affects the data mining results. Producing a list of customers for a marketing test is not enough when the purpose is to impress the sales team. "How to deliver the results" also means that, once the mining results are generated, we need a good way to present them to users. In practice we may find that we have no acceptable way to deliver a result, because delivering it could cause the loss of customers who would not otherwise have been lost. This, too, should be considered before the detailed work begins.
The role of the data miner is to ensure that the final statement of the business problem can be converted into a technical problem. The premise is that the business problem itself is stated correctly.
2. Select the appropriate data
2.1 What data is available?
The first place to look for customer data is the enterprise data warehouse. Data in the warehouse has already been cleaned and verified, and multiple data sources have been integrated. A single data model is expected to ensure that fields with similar names have the same meaning and compatible data types throughout the database. An enterprise data warehouse is a historical database: new data is constantly appended, but historical data remains unchanged. This makes it better suited to decision support.
The problem is that in many organizations such a data warehouse does not actually exist, or one or more data warehouses exist but do not fully meet the requirements for data mining. In that case, data miners must look for data in the databases and operational systems of different departments. Operational systems are designed to carry out a specific task, such as running a website, processing claims, completing calls, or processing bills. Their goal is to process transactions quickly and accurately, and their data may be stored in any format. In enterprises without a data warehouse, the data is often buried, and a great deal of organization and planning across the enterprise is required to sort it out. This raises the question of how important a data warehouse is to an enterprise. Establishing an enterprise-level data warehouse is not a decision a single manager can make; it may require a directive from top leadership and the cooperation of all departments.
It is often quite difficult to determine which data is available in an enterprise, because much of the documentation is missing or out of date. Usually no single person can provide all the answers. To determine what data is available, you need to go through the data dictionaries, understand the specific business, talk to each department, interview users and DBAs, review existing reports, and look at the data itself to find out whether it is useful. A further problem is that data is sometimes needed not only about customers but also about potential customers. When such data is required, external sources and operational systems, such as web logs, call detail records, call center systems, and sometimes even emails or spreadsheets, become the sources of information.
Data mining does not wait for perfect, clean data before taking the next step. Although additional clean data would be welcome, mining must make use of the data that is currently available. Start working early.
2.2 How much data is enough?
1. The usual answer is that the more data, the better. There are qualifications, however. During modeling, the model set should be balanced so that each outcome is represented by roughly the same number of examples. A small, balanced sample is preferable to a large sample in which the rare outcome makes up only a tiny proportion.
2. Once the model set is large enough to build a good, stable model, making it bigger becomes counterproductive. A larger model set takes longer to run, and because data mining is an iterative process, this wastes time. If a modeling routine takes hours instead of minutes to run, the time adds up quickly. Only after the model has been settled does more data again mean better.
2.3 How much history is required?
Data mining uses data from the past to predict the future. But how far back should the data go? There is no single answer, and many factors need to be considered. Data from too far in the past may not be useful for data mining because the market environment changes, especially when external events (such as regulatory changes) intervene. For many customer-centric applications, two to three years of history is appropriate. Even in that case, however, data from the beginning of the customer relationship often proves valuable: What was the initial channel? What was the initial offer? How did the customer initially pay?
How many variables?
Inexperienced data miners are sometimes too eager to throw out variables that seem unlikely to matter, keeping only a few carefully chosen variables that they believe are important. The data mining approach calls for letting the data itself reveal what is and is not important.
Often, variables that were previously ignored turn out to have predictive value when combined with other variables. For example, a credit card issuer that had never considered cash advance data discovered through data mining that some customers used cash advances only in November and December. Presumably these people are prudent enough to avoid borrowing at high interest rates most of the time (a prudence that makes them less likely to default than habitual users of cash advances), but they need some extra cash during the holidays and are willing to pay higher interest to get it.
2.4 What must the data contain?
The data must contain examples of all the potentially meaningful outcomes. The purpose of guided data mining is to predict the value of a specific target variable, so the model set must consist of preclassified data. To distinguish those who are likely to default from those who are not, the model set needs thousands of examples of each category. When a new application arrives, it is compared with past applications from customers, and the new application can then be classified directly. The implication is that data can describe what happened in the past and that we learn from mistakes. To do so, we must first recognize the mistakes we have made.
3. Get to know the data
The importance of spending time exploring the data before using it to build models is usually underestimated. We will devote considerable space to this topic later. A good data mining engineer seems to rely heavily on intuition, for example being able to guess to some extent what a derived variable will turn out to be good for. The only way to develop intuition for what is going on in an unfamiliar data set is to immerse yourself in the data. Along the way you will find many data quality problems, and you may be inspired to ask questions that would not otherwise have come up.
3.1 Check distributions
In the initial exploration of a database, data visualization tools are very useful. Scatter plots, bar charts, geographic maps, Excel, and other visualization tools provide powerful support for viewing data.
When you first get data files from a new source, you should profile the data to understand what is going on. This includes counts and summary statistics for each field, the number of distinct values of each categorical variable and, where appropriate, cross-tabulations such as one between products and regions. In addition to providing insight into the data, profiling is likely to raise warnings about inconsistencies or definitional problems that could cause trouble for later analysis.
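The profiling described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the rows and the field names (`region`, `product`, `monthly_spend`) are invented for the example and do not come from any real data source.

```python
from collections import Counter
from statistics import mean

# Hypothetical rows standing in for a freshly received data file.
rows = [
    {"customer_id": 1, "region": "north", "product": "basic", "monthly_spend": 20.0},
    {"customer_id": 2, "region": "north", "product": "premium", "monthly_spend": 55.0},
    {"customer_id": 3, "region": "south", "product": "basic", "monthly_spend": None},
    {"customer_id": 4, "region": "south", "product": "basic", "monthly_spend": 18.0},
]

def profile_numeric(rows, field):
    """Count, missing count, min, max, and mean for one numeric field."""
    values = [r[field] for r in rows if r[field] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }

def profile_categorical(rows, field):
    """Number of distinct values and the frequency of each value."""
    counts = Counter(r[field] for r in rows)
    return {"distinct": len(counts), "counts": dict(counts)}

def cross_tab(rows, row_field, col_field):
    """Cross-tabulation: how many rows fall in each (row, column) pair."""
    return Counter((r[row_field], r[col_field]) for r in rows)

spend = profile_numeric(rows, "monthly_spend")
regions = profile_categorical(rows, "region")
by_region_product = cross_tab(rows, "region", "product")

print(spend)
print(regions)
print(dict(by_region_product))
```

Even on toy data, the missing-value count and the cross-tabulation are the kind of output that tends to surface questions for the data provider.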
3.2 Compare values with descriptions
Look at the values of each variable and compare them with the description of that variable in the available documentation. This exercise can reveal inaccurate or incomplete data descriptions. You must first establish whether the data actually recorded matches what the field is supposed to describe. In practice, what does the data in this field mean? It is best if the business staff know. If they do not, which happens often when field definitions are unclear, you may have to work it out through experience and testing.
3.3 Ask lots of questions
If the data does not seem to make sense or is not as expected, make a note of it. An important output of data exploration is a list of questions for the people who provided the data. These questions often require further research, because few users look at the data as carefully as data mining engineers do. Questions about fields, their meanings, their usefulness, missing values, and other problems uncovered during exploration take a great deal of work to resolve. It is a process that demands attention to detail.
4. Create a model set
A model set contains all the data used during modeling. Some of the data in the model set is used to find patterns. For some techniques, some of the data is used to verify that the model is stable. Some is used to evaluate the performance of the model. Creating a model set requires assembling data from multiple sources to form customer signatures and then preparing the data for analysis.
4.1 Assemble customer signatures
A model set is a table or a collection of tables. Each row represents an item to be studied, and each field represents something about that item that is useful for modeling. When the data describes customers, the rows of the model set are usually called customer signatures. Each customer is uniquely identified by the trail of data that he or she leaves behind, and you can use that trail to understand each customer fully.
Assembling customer signatures from relational databases requires complex queries. These queries often join many tables and then augment the results with data from other sources. Part of the assembly process is bringing the data to the correct level of aggregation, so that each row contains all the information about one customer.
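As a rough illustration of bringing data to the customer level, the sketch below rolls hypothetical transaction rows up to one signature per customer. The field names and the particular summary columns are assumptions chosen for the example; a real signature would usually be built with database queries and would carry many more fields.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction-level rows, e.g. the result of joining several tables.
transactions = [
    {"customer_id": "A", "date": date(2023, 1, 5), "amount": 40.0, "channel": "web"},
    {"customer_id": "A", "date": date(2023, 3, 9), "amount": 60.0, "channel": "store"},
    {"customer_id": "B", "date": date(2023, 2, 1), "amount": 15.0, "channel": "web"},
]

def build_signatures(transactions):
    """Roll transaction rows up to one row (signature) per customer."""
    grouped = defaultdict(list)
    for t in transactions:
        grouped[t["customer_id"]].append(t)

    signatures = {}
    for customer_id, txns in grouped.items():
        txns.sort(key=lambda t: t["date"])  # oldest transaction first
        signatures[customer_id] = {
            "n_transactions": len(txns),
            "total_amount": sum(t["amount"] for t in txns),
            "first_date": txns[0]["date"],
            "last_date": txns[-1]["date"],
            "initial_channel": txns[0]["channel"],
        }
    return signatures

signatures = build_signatures(transactions)
for customer_id, signature in signatures.items():
    print(customer_id, signature)
```

Keeping the initial channel in the signature reflects the earlier point that information from the start of the customer relationship is often valuable.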
4.2 Create a balanced sample
A common practice in standard statistical analysis is to discard outliers, that is, observations far outside the normal range. In data mining, however, these outliers may be exactly what you are looking for. They may indicate fraud, some error in your business processes, or some surprising market opportunity. In such cases we do not want to throw outliers away; we want to get to know and understand them.
Knowledge discovery algorithms learn from examples. Without a sufficient number of examples of a particular class or behavior pattern, data mining tools cannot come up with a model that predicts that class or pattern. In this case, the model set is enriched with examples of the rare event to increase the probability of that event during modeling. When an outcome is rare, there are two ways to balance the sample: 1. Stratified sampling. 2. Weights.
For example, a bank wants to build a model to determine which customers are prospects for a private banking program. These programs are only for very wealthy customers, who are very rare in a sample of all the bank's customers. To build a model capable of spotting such customers, the model set might need to be 50% private banking customers, even though they represent less than 1% of all checking accounts. Alternatively, private banking customers might be given a weight of 1 and other customers a weight of 0.01, so that the total weight of the private banking customers equals the total weight of the other customers. By increasing the weight of such rare cases, the model can rank the data sensibly.
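Both balancing approaches can be sketched as follows. This is a simplified illustration, not a prescribed procedure: the sampling function keeps every rare example and draws an equal number of the others (sampling each group at a different rate), and the weighting function computes the weight for common rows that makes the two totals equal. The population of 1,000 customers with 10 private banking customers is invented, and the sketch assumes both groups are non-empty.

```python
import random

def balanced_sample(rows, is_rare, seed=0):
    """Keep every rare row plus an equally sized random sample of the rest."""
    rare = [r for r in rows if is_rare(r)]
    common = [r for r in rows if not is_rare(r)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled_common = rng.sample(common, min(len(rare), len(common)))
    return rare + sampled_common

def balancing_weights(rows, is_rare):
    """Weight 1 for rare rows; common rows get the weight that equalizes totals."""
    n_rare = sum(1 for r in rows if is_rare(r))
    n_common = len(rows) - n_rare
    common_weight = n_rare / n_common  # assumes at least one common row
    return [1.0 if is_rare(r) else common_weight for r in rows]

def is_private(customer):
    return customer["private_banking"]

# Hypothetical population: 10 private banking customers among 1,000.
customers = [{"id": i, "private_banking": i < 10} for i in range(1000)]

sample = balanced_sample(customers, is_private)
weights = balancing_weights(customers, is_private)

rare_total = sum(w for c, w in zip(customers, weights) if is_private(c))
common_total = sum(w for c, w in zip(customers, weights) if not is_private(c))
print(len(sample), rare_total, round(common_total, 6))
```

With 1% rare cases the computed weight for the other customers is 10/990, which is close to the 0.01 used in the bank example.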
4.3 Time Frame
Building a model on data from only a single period of time increases the risk that what is learned will not hold generally. Combining multiple time frames in the model set helps eliminate the influence of seasonal factors. Because seasonal effects are so important, they should be added explicitly to the customer signature. Holiday shopping patterns, for example, are very important. One approach is to segment customer information by time period or to flag the corresponding data.
4.4 Create a model set for prediction
When a model set is used for prediction, another question is how much time the model set should cover and how that time should be divided. Every customer signature should have a gap in time between the predictor variables and the target variable. Time can be divided into past, present, and future. Of course, all data comes from the past, so the past is itself divided into three periods: the distant past, the not-so-distant past, and the recent past. A prediction model finds patterns in the distant past that explain outcomes in the recent past. When the model is deployed, it can use the most recent data to predict the future. Suppose you build a model that uses data from June (the not-so-distant past) to predict July (the recent past). That model cannot be used to predict September until the August data is available. But when is the August data available? Not in August, because the data is still being generated. Probably not in the first week of September either, because the data has to be collected, cleaned, loaded, tested, and approved. The August data may not be available until mid-September or even October, by which time no one will care about a forecast for September. The solution is to skip one month in the model set.
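The idea of skipping a month between the inputs and the target can be made concrete with a small helper. This is a sketch under assumed parameters: the function names, the number of input months, and the example target month are all hypothetical.

```python
def add_months(year, month, offset):
    """Shift a (year, month) pair by a number of months, which may be negative."""
    index = year * 12 + (month - 1) + offset
    return index // 12, index % 12 + 1

def model_set_window(target, n_input_months=4, gap_months=1):
    """Return (input_months, skipped_months) for a given target month."""
    year, month = target
    # Months immediately before the target are skipped to allow for data latency.
    skipped = [add_months(year, month, -k) for k in range(gap_months, 0, -1)]
    first_input = gap_months + n_input_months
    # Input months end just before the skipped gap begins.
    inputs = [add_months(year, month, -k) for k in range(first_input, gap_months, -1)]
    return inputs, skipped

inputs, skipped = model_set_window((2024, 2), n_input_months=3, gap_months=1)
print("inputs:", inputs)
print("skipped:", skipped)
```

For a target of February 2024 with a one-month gap, the inputs run from October to December 2023 and January 2024 is skipped, so the model can be scored at deployment time before the most recent month has been loaded.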
4.5 Create a model set for profiling
A model set for profiling is similar to a model set for prediction, with one difference: the time frame of the target overlaps the time frame of the inputs. This seemingly small difference has a great impact on modeling, because the inputs may "pollute" the patterns related to the target. For example, in a bank, customers with an investment account tend to have very low balances in their savings accounts, because they get a better return from the investment account. Does this mean that the bank should target customers with low savings account balances for investment accounts? Probably not, because most of those customers have very few assets.
One way to solve this problem is to choose the inputs of the profiling model carefully. For example, combine all account balances into two groups: "savings" and "loans". The savings group includes all types of savings and investments. This approach works well, and the resulting model proves to be stable. A better solution is to model the account as it was before the investment account was opened. One complication: because the time frame for each customer depends on when that customer opened the account, such a model set is more difficult to build.
When the time frame of the target variable is the same as the time frame of the input variables, the model is a profiling model, and the inputs may introduce spurious patterns that confuse data mining techniques. You need to select the inputs very carefully or rebuild the model set so that it produces a prediction model.
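The grouping of balances mentioned above might look like the following sketch. The account type names and the mapping are invented for illustration; the point is only that investment balances are folded into the same group as savings, so that a low savings balance alone no longer stands in for owning an investment account.

```python
# Hypothetical mapping from account type to group; the type names are invented.
ACCOUNT_GROUPS = {
    "savings": "savings",
    "certificate_of_deposit": "savings",
    "investment": "savings",
    "mortgage": "loans",
    "credit_card": "loans",
}

def grouped_balances(accounts):
    """Sum a customer's balances into the two groups 'savings' and 'loans'."""
    totals = {"savings": 0.0, "loans": 0.0}
    for account_type, balance in accounts:
        totals[ACCOUNT_GROUPS[account_type]] += balance
    return totals

# One hypothetical customer: small savings balance, large investment balance.
customer_accounts = [
    ("savings", 200.0),
    ("investment", 9800.0),
    ("mortgage", 150000.0),
]
totals = grouped_balances(customer_accounts)
print(totals)
```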
4.6 Partition the model set
Once you have preclassified data from the appropriate time frames, the guided data mining methodology calls for dividing it into three parts. 1. The training set, which is used to build the initial model. 2. The validation set, which is used to adjust the initial model and make it less tied to the idiosyncrasies of the training set, and therefore more general. 3. The test set, which is used to gauge the likely effectiveness of the model when it is applied to unseen data. Three sets are necessary because once data has been used for one step in the process, the information it contains has become part of the model. It therefore cannot be used to correct or judge the model.
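A three-way partition can be sketched as a shuffle followed by slicing. The 60/20/20 proportions and the fixed seed are assumptions made for the example; the text does not prescribe particular proportions.

```python
import random

def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into disjoint training, validation, and test sets."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * validation)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_valid]
    test_set = shuffled[n_train + n_valid:]  # whatever remains
    return training_set, validation_set, test_set

# Hypothetical model set of 100 preclassified rows, represented here by integers.
model_set = list(range(100))
training_set, validation_set, test_set = partition(model_set)
print(len(training_set), len(validation_set), len(test_set))
```

Because each row lands in exactly one of the three sets, no row that shaped the model is later used to judge it.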
People often find it hard to understand why the training set and validation set become "tainted" once they have been used to build a model. An analogy may help. Imagine you have just taken a test and believe you did well. The teacher asks you to estimate your own score. You naturally think the score is very high, because you gave the answers you believed to be correct. If you took the same exam the next day without having seen any answers, your answers would not change, because no new information has entered the system. What you need is a validation set.
Now imagine that, after the test, the teacher lets you look at your classmates' papers before grading your own. If their answers differ from yours, you may mark your own answer as wrong. If the teacher gives the same test the next day, your results may be different. But if you use the same classmates' papers to judge how much you have improved, you may still be fooling yourself, because you have already changed your answers to agree with theirs. This is why the test set should be different from the validation set.
For a prediction model, it is a good idea for the test set to come from a different time period than the training and validation sets. Evidence of a model's stability is that it performs well in consecutive months. A test set from a different time period is often called an out-of-time test set. Although such a test set is not always available, it is a good way to verify the stability of the model.
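An out-of-time test set can be carved out by date rather than at random. In this sketch, each row is assumed to carry the (year, month) of its target period; the field name `month` and the cutoff are hypothetical.

```python
def out_of_time_split(rows, cutoff):
    """Rows before the cutoff month are for model building; the rest form the test set."""
    build = [r for r in rows if r["month"] < cutoff]
    out_of_time_test = [r for r in rows if r["month"] >= cutoff]
    return build, out_of_time_test

# Hypothetical signatures tagged with the (year, month) of their target period.
rows = [{"id": i, "month": (2023, 1 + i % 6)} for i in range(12)]
build, out_of_time_test = out_of_time_split(rows, cutoff=(2023, 6))
print(len(build), len(out_of_time_test))
```

The rows in `build` would then be divided into training and validation sets, while the later month is held back to check that the model still performs well in a period it has never seen.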
Next, we will discuss the next step: how to fix problems with the data.
See Data Mining Technology
Copyright BUAA