Big Data Technology Notes: Building a Guided Data Mining Model

Source: Internet
Author: User
Keywords: big data, data mining, data visualization

One purpose of data mining is to find more high-quality customers in the data. This article continues our look at the guided (directed) data mining method: what a guided data mining model is and how it is built. The first step in building such a model is to understand and define the target variable that the model attempts to estimate. A typical case is a binary response model, such as a model that selects customers for direct mail and e-mail marketing campaigns. The model is built from historical data on customers who responded to similar campaigns in the past. The goal of guided data mining is to find more customers like them in order to improve the response rate of future campaigns.

In constructing a guided data mining model, you first define the structure and objectives of the model. Second, you consider incremental response modeling. Third, you consider the stability of the model, which is discussed here in terms of prediction models and profiling models. The rest of this article walks through the concrete steps for building a guided data mining model.

Overview of the guided data mining process:

Translate the business problem into a data mining problem

Choose the right data

Understand the data

Create a model set

Fix problem data

Transform data to reveal information

Build models

Evaluate models

Deploy models

Assess results

Start again

  

Steps:

1. Translate the business problem into a data mining problem

In Alice in Wonderland, Alice says, "I don't much care where." The Cat replies, "Then it doesn't matter which way you go." Alice adds, "So long as I get somewhere." The Cat: "Oh, you're sure to do that, if you only walk long enough."

The Cat might have added another point: without a definite destination, you can never be sure you have walked long enough.

The goal of a guided data mining project is to find a solution to a well-defined business problem. The data mining objectives for a particular project should not be broad, generic statements. Broad goals, such as gaining deeper insight into customer behavior, should be refined into specific goals such as:

Identify customers who are unlikely to renew

Design a calling plan for home-based business customers that will reduce their churn rate

Determine which Internet transactions are likely to be fraudulent

List the products whose sales are at risk if wine and beer are no longer sold

Forecast the number of customers over the next three years based on the current marketing strategy

Guided data mining is often presented as a technical problem: finding a model that explains the relationship between a set of input variables and the target variable. That technical task is indeed central to data mining, but it cannot succeed if the target variable is not defined correctly and the appropriate input variables are not identified. These tasks, in turn, depend on how well the business problem is understood. Without a proper understanding of the business problem, there is no way to translate it into a data mining task. Before the technical work begins, two questions must be answered: how will the results be used, and how will the results be delivered?

Both questions turn on the customer's real needs, not on what data mining engineers consider useful or think is best for the customer. The results may look as if they will help the customer improve sales, but do we actually understand what we are delivering and what the customer needs? Do not rush to start: first understand the real needs rather than acting blindly.

1.1 How will the results be used?

For example, many data mining efforts are designed to improve customer retention. The results might be used to:

Proactively offer a discount to high-risk or high-value customers in order to retain them

Change the mix of acquisition channels to favor the most loyal customers

Predict the number of customers in the next few months

Fix product defects that affect customer satisfaction

These goals have implications for the data mining process. Contacting existing customers through telephone or direct mail campaigns means that, in addition to identifying customers at risk, you need to understand why they are at risk so that an attractive offer can be built. A phone call must come neither too early nor too late. Forecasting means that, in addition to identifying which customers may leave, you must determine how many new customers will join and how long they are likely to stay. Forecasting new customer additions is not just a problem for the predictive model to solve; it is typically embedded in business goals and budgets.

1.2 How will the results be delivered?

A guided data mining project may produce several different types of deliverables. The deliverable is often a report or a briefing full of charts and graphs. The form of the deliverable can affect the data mining results. When the goal is to alert the sales team to problems, it is not enough to produce a list of customers for a marketing test. The question of how to deliver results arises after the mining results are produced: we want to hand them to users, and the intention is good, but in practice we may find that we have no way to deliver them, because acting on them might drive away customers who would not otherwise have left. This, too, must be considered before the actual work starts.

The data miner's role is to ensure that the final statement of the business problem can be translated into a technical problem. The prerequisite is that the business problem itself is the right one.

2. Choose the right data

2.1 What data is available?

The first place to look for customer data is the enterprise data warehouse. The data in the warehouse has been cleaned and verified, and multiple data sources have been consolidated. A single data model ensures that similarly named fields have the same meaning and compatible data types throughout the database. The enterprise data warehouse is a historical database: new data is continually appended, but historical data does not change. This makes it well suited to decision support.

The problem is that in many organizations such a data warehouse does not actually exist, or there are one or more data warehouses that do not meet the requirements for data to be used directly for mining. In that case, the data miner must seek data from the databases and business systems of different departments. Business (operational) systems are designed to perform specific tasks, such as running a website, processing claims, completing calls, or processing bills. Their goal is to process transactions quickly and accurately, and the data may be stored in any format. For enterprises without a data warehouse, this data is often buried deep, and collating it takes a great deal of organizational scheduling and planning. This also shows how important a data warehouse is to a business. The decision to build an enterprise-wide data warehouse cannot be made by a single manager; it may require an order from the highest level of leadership and the cooperation of every department.

Determining which data is available in an enterprise can be quite difficult, because much of the documentation is lost or obsolete. Usually no one person can provide all the answers. To determine what data is available, work through the data dictionaries, learn the specific business, talk to each department, interview users and DBAs, review existing reports, and examine the data itself. Some problems require data not only about customers but also about potential customers. When such data is needed, external sources and operational systems, such as web logs, call detail records, call center systems, and sometimes even e-mail or spreadsheets, become the sources of information.

Data mining does not wait for perfect, clean data before moving to the next step. Although additional clean data would be welcome, the data miner must be able to work with the data currently available and get started early.

2.2 How much data is enough?

In general, the more data the better, but with two caveats. First, during modeling the model set should be balanced so that each outcome is represented by a roughly equal number of examples. A smaller, balanced sample is preferable to a large sample in which the rare outcome makes up only a very small percentage.

Second, once the model set is large enough to build a good, stable model, making it larger has the opposite effect, because everything takes longer to run on a larger model set. Data mining is an iterative process, so time spent waiting adds up. If each modeling run takes hours instead of minutes, that time is wasted. Beyond a certain point, then, more data is not better.

2.3 How much history is required?

Data mining uses past data to predict the future. But how far back should the data go? There is no simple answer; many factors must be considered. Data from too far back may not be useful for data mining because the market environment changes, especially when external events intervene, such as a change in the regulatory regime. For many customer-centric applications, two to three years of history is appropriate. Even in such cases, however, data about the beginning of the customer relationship often proves valuable: What was the original channel? What was the initial offer? How did the customer initially pay?

How many variables?

Inexperienced data miners are sometimes too quick to throw out variables that seem unlikely to matter, keeping only a few carefully chosen variables they consider important. The data mining approach calls for letting the data itself reveal what is important and what is not.

Often, variables that were previously ignored turn out to have predictive value when used in combination with other variables. For example, one credit card issuer that had never used data on cash advances discovered through data mining that some customers take cash advances only in November and December. Presumably these people are prudent: most of the time they avoid the high interest rates on cash advances, and that prudence suggests they are less likely to default than habitual users of cash advances. During the holidays, however, they need some extra cash and are willing to pay higher interest to get it.

2.4 What must the data contain?

At a minimum, the data must contain examples of all the potentially meaningful outcomes. The purpose of guided data mining is to predict the value of a particular target variable, so the model set must consist of preclassified data. To distinguish people who will default on a loan from those who will not, the model set needs many thousands of examples of each class. When a new application arrives, it is compared with the applications of past customers and can then be classified directly. The implication is that the data must describe what happened in the past: to learn from our mistakes, we first have to recognize the mistakes we have made.

3. Understand the data

The importance of spending time exploring the data before using it to build models is often underestimated, so it is worth dwelling on here. Good data mining engineers seem to rely heavily on intuition, for example, somehow guessing which derived variables are worth trying. The only way to develop intuition about an unfamiliar dataset is to immerse yourself in the data. Along the way you will find many data quality problems and be inspired to ask questions that would not otherwise come up.

3.1 Examine distributions

In the initial stage of exploring a database, data visualization tools are very useful: plots, bar charts, geographic maps, Excel, and similar tools provide strong support for examining the data.

When you start working on a data file from a new data source, you should profile the data to see what is going on, including counts and summary statistics for each field, the number of distinct values for categorical variables, and, where appropriate, cross-tabulations by product and region. Besides providing an understanding of the data, profiling is likely to raise warnings about inconsistencies or definitional problems that could cause trouble for later analysis.
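The profiling pass described above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library; the function names `profile` and `crosstab`, the field names, and the sample rows are all hypothetical, and a real project would more likely use a database or a dedicated analysis tool.

```python
from collections import Counter
from statistics import mean


def profile(rows):
    """Summarize each field: count, missing values, distinct values, basic stats."""
    fields = sorted({name for row in rows for name in row})
    report = {}
    for name in fields:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None and v != ""]
        summary = {
            "count": len(present),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        if present and len(numeric) == len(present):
            # Numeric field: report range and mean.
            summary.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
        else:
            # Categorical field: report the most frequent values.
            summary["top"] = Counter(present).most_common(3)
        report[name] = summary
    return report


def crosstab(rows, row_field, col_field):
    """Count rows for each (row_field, col_field) combination."""
    return dict(Counter((r.get(row_field), r.get(col_field)) for r in rows))


# Hypothetical sample data, for illustration only.
customers = [
    {"region": "north", "product": "checking", "balance": 1200.0},
    {"region": "north", "product": "savings", "balance": 300.0},
    {"region": "south", "product": "checking", "balance": None},
    {"region": "south", "product": "checking", "balance": 50.0},
]
report = profile(customers)
table = crosstab(customers, "region", "product")
```

Even on this toy data the report flags a missing balance, which is the kind of warning that should go onto the list of questions for the data provider.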

3.2 Compare values with descriptions

Look at the values of each variable and compare them with the description of that variable in the available documentation. This exercise can uncover inaccurate or incomplete data descriptions. The first thing to establish is whether the data actually recorded matches what it is supposed to describe. In real data mining work, you often have to infer what a field really means. If the business people know, so much the better. If they do not, you may have to rely on experience to make an educated guess, and this happens frequently because field definitions are often ambiguous.

3.3 Ask lots of questions

If the data does not seem to make sense or is not what you expected, make a note of it. An important output of the data exploration process is a list of questions for the people who supplied the data. Often these questions will require further research, because few users look at the data as closely as data mining engineers do. The preliminary work of data exploration, which means judging what each field means, whether it is useful, whether values are missing, whether there are problems, and so on, requires a great deal of work and care.

4. Create a model set

The model set contains all the data used during the modeling process. Some of the data in the model set is used to find patterns, and for some techniques, some data in the model set is used to verify that the model is stable. Model sets can also be used to evaluate the performance of a model. Creating a model set requires aggregating data from multiple data sources to form a customer signature and then preparing the data for analysis.

4.1 Assembling customer signatures

A model set is a table or series of tables in which each row represents an item to be studied, and the fields represent everything known about that item that may be useful for modeling. When the data describes customers, the rows of the model set are often called customer signatures. The idea behind the term is that each customer is uniquely identified by the trail he or she leaves, and you can use that trail to understand each customer fully.

Assembling customer signatures from a relational database usually requires complex queries that join many tables, with the results then enhanced using data from other sources. Part of the process is getting all the data to the correct level of aggregation, so that each row contains all the information relevant to one customer.
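As a rough illustration of bringing data to the customer level, the sketch below rolls transaction records up to one row per customer. The table layouts, field names, and the `build_signatures` helper are assumptions made for this example; in practice this step is usually a set of SQL joins and aggregations.

```python
from collections import defaultdict


def build_signatures(customers, transactions):
    """Return one signature row per customer, with transactions rolled up."""
    totals = defaultdict(lambda: {"n_txn": 0, "total_spend": 0.0, "channels": set()})
    for t in transactions:
        agg = totals[t["customer_id"]]
        agg["n_txn"] += 1
        agg["total_spend"] += t["amount"]
        agg["channels"].add(t["channel"])

    empty = {"n_txn": 0, "total_spend": 0.0, "channels": set()}
    signatures = []
    for c in customers:
        # Customers with no transactions still get a row, filled with zeros.
        agg = totals.get(c["customer_id"], empty)
        signatures.append({
            "customer_id": c["customer_id"],
            "region": c["region"],
            "n_txn": agg["n_txn"],
            "total_spend": agg["total_spend"],
            "n_channels": len(agg["channels"]),
        })
    return signatures


# Hypothetical source tables.
customers = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 2, "region": "south"},
    {"customer_id": 3, "region": "south"},
]
transactions = [
    {"customer_id": 1, "amount": 20.0, "channel": "web"},
    {"customer_id": 1, "amount": 30.0, "channel": "store"},
    {"customer_id": 2, "amount": 5.0, "channel": "web"},
]
signatures = build_signatures(customers, transactions)
```

The key property is that the output has exactly one row per customer, regardless of how many transaction rows that customer has in the source data.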

4.2 Create a balanced sample

A common practice in standard statistical analysis is to discard outliers, observations that fall far outside the normal range. In data mining, however, these outliers may be exactly what you are looking for. They may represent fraud, some mistake in your business processes, or some remarkable market opportunity. In such cases we do not want to throw the outliers away; we want to get to know and understand them.

Knowledge discovery algorithms learn from examples. Without a sufficient number of examples of a particular class or pattern of behavior, the data mining tool cannot produce a model that predicts that class or pattern. In this situation, enriching the model set with examples of the rare event increases the probability of the event in the model set. When an outcome is rare, there are two ways to balance the sample: stratified sampling and weighting.

For example, a bank wants to build a model to determine which customers are prospects for its private banking program. These programs are intended for very wealthy clients, who are rare even in a sizeable sample of bank customers. To build a model that can find such customers, the model set might need to consist of 50 percent private banking customers, even though they represent fewer than 1 percent of all checking accounts. Alternatively, private banking customers might be given a weight of 1 and all other customers a weight of 0.01, so that the total weight of the private banking customers roughly equals the total weight of the remaining customers. By increasing the relative weight of the rare customers, the model can separate the two groups reasonably well.
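Both balancing approaches can be sketched briefly. The example below is an assumption-laden illustration, not a prescribed method: the function names, the `private_banking` flag, and the 1 percent rate are hypothetical. The first function keeps every rare row and an equal-sized random sample of common rows; the second computes weights so that both classes carry the same total weight.

```python
import random


def balanced_sample(rows, is_rare, seed=0):
    """Stratified sample: all rare rows plus an equal number of common rows."""
    rng = random.Random(seed)
    rare = [r for r in rows if is_rare(r)]
    common = [r for r in rows if not is_rare(r)]
    return rare + rng.sample(common, min(len(rare), len(common)))


def balancing_weights(rows, is_rare):
    """Weight 1 for rare rows; common rows share the same total weight."""
    flags = [bool(is_rare(r)) for r in rows]
    n_rare = sum(flags)
    n_common = len(flags) - n_rare
    if n_rare == 0 or n_common == 0:
        raise ValueError("need at least one example of each class")
    common_weight = n_rare / n_common
    return [1.0 if flag else common_weight for flag in flags]


def is_private(customer):
    return customer["private_banking"]


# Hypothetical population: 10 private banking customers out of 1,000.
customers = [{"id": i, "private_banking": i < 10} for i in range(1000)]
sample = balanced_sample(customers, is_private)
weights = balancing_weights(customers, is_private)
```

With 10 rare customers among 1,000, the common-class weight works out to 10/990, which is close to the 0.01 used in the bank example.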

4.3 Include multiple time frames

Building a model on data from a single time period increases the risk of learning things that are not generally true. The effects of seasonal factors can be eliminated by combining multiple time frames in the model set. Because seasonal effects are so important, they should be added explicitly to the customer signature. Holiday shopping patterns are one important example. Customer information can be broken down by time period or flagged accordingly in the data.

4.4 Creating a model set for prediction

When the model set is used for prediction, other questions arise: how long a period the model set should cover and how that period should be divided. Any customer signature should have a time lag between the predictor variables and the target variable. Time can be divided into past, present, and future. Of course all data comes from the past, so the past is itself divided into three periods: the distant past, the not-so-distant past, and the recent past. A predictive model finds patterns in the distant past that explain outcomes in the recent past. When the model is deployed, it can then use data from the recent past to predict the future. Suppose you build a model that uses data from June (the not-so-distant past) to predict July (the recent past). Then the model cannot be used to predict September until August data is available. But when is August data available? Certainly not in August, because at that point the data is still being generated. Nor in the first week of September, because the data has to be collected, cleaned, loaded, tested, and approved. The August data may not be available until mid-September or even October, by which time nobody cares about predictions for September. The solution is to skip one month in the model set.
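The skipped month can be made concrete with a small helper that lays out the input months, the skipped month, and the target month. This is a sketch under assumptions: the helper names, the three-month input window, and the year 2024 are hypothetical choices for illustration.

```python
def month_add(year, month, delta):
    """Shift a (year, month) pair by delta months."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


def model_windows(target, n_input_months, gap=1):
    """Lay out input months that end `gap` months before the target month."""
    year, month = target
    last_input = month_add(year, month, -(gap + 1))
    inputs = [month_add(last_input[0], last_input[1], -k)
              for k in range(n_input_months - 1, -1, -1)]
    skipped = [month_add(year, month, -k) for k in range(gap, 0, -1)]
    return {"inputs": inputs, "skipped": skipped, "target": target}


# Training layout: inputs through June, skip July, target August.
training = model_windows((2024, 8), n_input_months=3)
# Deployment layout: inputs through July, skip August, predict September.
scoring = model_windows((2024, 9), n_input_months=3)
```

Because the training layout has the same one-month gap as the deployment layout, the model can score September using data through July, without waiting for August data to be loaded and approved.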

4.5 Creating a model set for profiling

The model set for profiling is similar to the model set for prediction, with one difference: the time frame of the target overlaps the time frame of the inputs. This seemingly small difference has a large effect on the modeling effort, because the inputs may "contaminate" the target pattern. For example, bank customers who have investment accounts often have very low balances in their savings accounts, because they can get better returns from the investment accounts. Does this mean that the bank should target customers with low savings balances for investment accounts? Probably not, because most such customers have very few assets.

One way to solve this problem is to choose the inputs of the profiling model carefully. For example, combine all account balances into "savings" and "loan" groups, where the savings group includes all types of savings and investments. This approach has proven effective and yields stable models. A better approach is to take a snapshot of each customer before the investment account was opened. A complication is that, because each customer's time frame depends on when that customer opened the account, creating such a model set is more difficult.

When the time frame of the target variable is the same as the time frame of the input variables, the model is a profiling model, and the inputs may introduce spurious patterns that confuse data mining techniques. You need either to select inputs very carefully or to rebuild the model set so that it produces a predictive model.
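The input grouping mentioned above, in which individual balances are combined into "savings" and "loan" totals, might look like the following sketch. The account type names and the `grouped_balances` helper are hypothetical; the point is only that the investment balance is folded into a broader group instead of appearing as a separate input that mirrors the target.

```python
# Hypothetical account types, for illustration only.
SAVINGS_TYPES = {"savings", "money_market", "investment"}
LOAN_TYPES = {"mortgage", "credit_card", "personal_loan"}


def grouped_balances(accounts):
    """Combine (account_type, balance) pairs into savings and loan totals."""
    totals = {"savings": 0.0, "loan": 0.0}
    for account_type, balance in accounts:
        if account_type in SAVINGS_TYPES:
            totals["savings"] += balance
        elif account_type in LOAN_TYPES:
            totals["loan"] += balance
        else:
            raise ValueError(f"unknown account type: {account_type}")
    return totals


investor = grouped_balances(
    [("savings", 200.0), ("investment", 50000.0), ("mortgage", 120000.0)]
)
non_investor = grouped_balances([("savings", 200.0)])
```

Both customers here have the same low savings account balance, but the grouped "savings" total separates the one with substantial assets from the one without.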

4.6 Partitioning the model set

Once the preclassified data has been obtained from the appropriate time frames, the guided data mining methodology calls for dividing it into three parts. The first part, the training set, is used to build the initial model. The second, the validation set, is used to adjust the initial model to make it more general and less tied to the idiosyncrasies of the training set. The third, the test set, is used to gauge the likely effectiveness of the model when applied to unseen data. Three datasets are necessary because once data has been used at one step in the process, the information it contains has already become part of the model. It therefore cannot be used to correct or judge the model.
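A minimal sketch of the three-way partition follows. The 60/20/20 proportions, the fixed seed, and the `partition` helper are assumptions chosen for illustration; the text does not prescribe particular proportions.

```python
import random


def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever remains
    return training_set, validation_set, test_set


model_set = list(range(100))  # stand-in for 100 customer signatures
training_set, validation_set, test_set = partition(model_set)
```

Each row lands in exactly one of the three sets, so no record used to build or tune the model is reused to judge it.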

It is often hard to understand why the training set and the validation set become "tainted" once they have been used to build a model. An analogy may help. Imagine taking an exam. You believe your answers are correct, so when the teacher asks you to predict your score, you naturally predict a high one. If you sat the same exam the next day without having seen the answers, your opinion would not change. No new information has entered the system; what you need is a validation set.

Now imagine that after the exam the teacher lets you look at some of your classmates' papers. If their answers differ from yours, you may decide to mark your own answer wrong. If the teacher gives the same test the next day, you may well arrive at different answers and do better. But if you judged that improvement using the same classmates' papers, you could still be fooling yourself. This is why the test set should be different from the validation set.

For predictive models, it is a good idea for the test set to come from a different time period than the training and validation sets. The proof of a model's stability is that it performs well from one month to the next. A test set from a different time period, often called an out-of-time test set, is a good way to verify model stability, although such a test set is not always available.
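An out-of-time test set can be carved off before the training and validation split by cutting on the time period. In this sketch the `month` field, its "YYYY-MM" string format, and the `out_of_time_split` helper are assumptions for illustration.

```python
def out_of_time_split(rows, cutoff):
    """Rows up to the cutoff month are for training and validation;
    later rows form the out-of-time test set."""
    # "YYYY-MM" strings sort chronologically, so plain comparison works.
    in_time = [r for r in rows if r["month"] <= cutoff]
    out_of_time = [r for r in rows if r["month"] > cutoff]
    return in_time, out_of_time


# Hypothetical signatures tagged with the month of their target outcome.
rows = [
    {"customer_id": 1, "month": "2024-05"},
    {"customer_id": 2, "month": "2024-06"},
    {"customer_id": 3, "month": "2024-07"},
    {"customer_id": 4, "month": "2024-08"},
]
in_time, out_of_time = out_of_time_split(rows, cutoff="2024-06")
```

If the model performs about as well on the later months as on the earlier ones, that is some evidence of stability over time.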
