Thinking in BigData (13): Typical Data Mining Models (Big Data 4)

Source: Internet
Author: User

The focus of this article is not a particular data mining algorithm or the details of its implementation; I have already covered such details in earlier posts. But a reader who starts with a pure technology blog full of formulas and algorithms will find confusion hard to avoid. So some overall conceptual guidance is needed first, and it is valuable for anyone who wants to go deep into data mining. In fact, our real confusion is that we do not know what data mining can do. This article is meant to explain what it can do and to prepare the ground.

The difference between data mining and statistics was described in detail in the earlier post, Thinking in BigData (5): Big Data Statistics and Data Mining. In fact, all data mining techniques rely on probability theory and statistics.

Next we will discuss how to use a model to represent simple, descriptive statistics. If we can describe what we are looking for, it becomes easy to find. This is the idea behind the similarity model: the more similar a thing is to what we are searching for, the higher its score.

Then comes the table lookup model. This model is very popular in the direct-marketing industry and is widely used in other fields. The naive Bayes model is a very useful generalization of the table lookup model: an ordinary lookup table works only in low dimensions, while naive Bayes allows more dimensions to be added. Linear regression and logistic regression are the most common predictive modeling techniques. A simple regression model represents the relationship between two variables in a scatter plot; a multiple regression model extends this to many single-valued inputs. We then introduce logistic regression, which extends multiple regression by restricting the range of the target, for example to probability estimates. There are also fixed-effects and hierarchical regression models, which apply regression to individual customers and build a bridge to many customer-centric data mining techniques.

1. Similarity Model

A similarity model compares an observed value with a prototype to obtain a similarity score: the more similar the observation is to the prototype, the higher the score. One way to measure similarity is to measure distance; the closer an observation is to the prototype, the higher its score. When each customer segment has a prototype, the model can assign each customer, based on the score, to the segment whose prototype is most similar.

A similarity model consists of a prototype and a similarity function. New data is scored by applying the similarity function to it.

1.1 Similarity and distance

Readers of a publisher's magazine are wealthier and better educated than the general public; roughly speaking, the former are several times above the latter in income and education. This lets us summarize the readership in one phrase: "high income and good education".

To turn this description of the readership into a model that can identify potential readers of the magazine, we need a precise definition of the ideal reader; then we can quantify how similar a prospect is to that ideal.

Similarity and distance are two descriptions of the same concept, measured in opposite directions. When distance is the metric, two things that are close together are very similar: the smaller the distance between them, the higher the similarity.

For example, suppose the publisher's ideal reader has 16 years of education and an annual income of $100,000. How similar to the ideal is a prospect with 14 years of education and an annual income of $75,000? And how similar is one with 12 years of education and an annual income of $150,000? We first need to choose a metric; here we use Euclidean distance. If we compute the distance between a prospect and the ideal customer (x = 16, y = 100,000) on the raw values, income dominates the calculation, because its values are much larger than the years of education. This raises another problem: measurement scale. The solution is to subtract the mean from each value and divide by the corresponding standard deviation. This converts both variables into z-scores, and the Euclidean distance is then computed on the z-scores instead of the raw values.
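The standardize-then-measure procedure above can be sketched in a few lines. The population means and standard deviations below are invented for illustration; only the ideal-reader profile and the two prospects come from the example in the text.

```python
import math

# Hypothetical population statistics (illustrative values, not from the text):
STATS = {
    "education": {"mean": 13.0, "std": 2.0},
    "income":    {"mean": 60000.0, "std": 25000.0},
}

def z_scores(customer):
    """Convert raw values to z-scores so income doesn't dominate the distance."""
    return {k: (customer[k] - STATS[k]["mean"]) / STATS[k]["std"] for k in STATS}

def euclidean_distance(a, b):
    """Euclidean distance between two customers in standardized space."""
    za, zb = z_scores(a), z_scores(b)
    return math.sqrt(sum((za[k] - zb[k]) ** 2 for k in STATS))

ideal = {"education": 16, "income": 100000}
prospect_a = {"education": 14, "income": 75000}
prospect_b = {"education": 12, "income": 150000}

# Smaller distance = more similar to the ideal reader.
print(euclidean_distance(prospect_a, ideal))
print(euclidean_distance(prospect_b, ideal))
```

With these (assumed) statistics, the first prospect comes out closer to the ideal than the second, even though the second earns more: standardization keeps the income scale from swamping education.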

Euclidean distance is only one way to compute distance. The general recipe is to combine a statistical description of the prototype with some distance function to build a similarity model. Once we have the distance between each prospect and the ideal customer, we can rank the prospects, or use the distance as input to another calculation to estimate expected revenue or probability.

1.2 Steps for building a similarity model

To build a similarity model, first describe the prototype, or obtain an ideal object to compare other objects against. These descriptions must be expressed as measurements, and the values of these variables must differ noticeably between objects close to the ideal and objects far from it.

First, we need to answer three questions.

(1) What makes a "bad" record different from a "good" one?

(2) What does the ideal "good" record look like?

(3) How do we measure the distance from the ideal object?

2. Table lookup model

A simple way to implement a data mining model is a lookup table. The idea behind the table lookup model is that similar people give similar responses. Scoring a new observation involves two steps: (1) assign the observation a specific label or key, which corresponds to a cell in the lookup table; (2) give every record assigned to that cell the cell's score, which was computed during model training.

Keys can be assigned in several ways. A decision tree applies its rule set to route an observation to a particular leaf node, and the leaf node's ID can serve as the key for looking up a score. Clustering assigns a label to each record, and the cluster label can likewise serve as the lookup key.

To build a lookup table: (1) choose the input variables for the table and assign each record in the training set to exactly one cell; (2) use statistics from the training set to characterize each cell. These statistics include the mean, the standard deviation, and the number of training instances that fall into the cell, and they are used when the model scores new data. The score can be the mean of a numeric target, the proportion of a particular category, or the dominant category in the cell.

2.1 Selecting dimensions

Each dimension should be a variable that has an effect on the target. Ideally, the input variables should be uncorrelated with each other; in practice this is hard to achieve. The practical effect of correlated variables is that some cells end up containing only a few training instances, which lowers the confidence of their estimates. The situation may be less severe than it appears, because the new data to be scored is also sparse in those cells.

For example, in the RFM model one dimension is the total number of purchases and another is lifetime spend. These two variables are highly correlated, because each additional purchase usually generates additional revenue. Few records will fall into cells with the most purchases but little revenue, or with very high revenue but few purchases.

Avoid using highly correlated variables as dimensions of a lookup table, because they produce a large number of sparse cells, and cells with too few training samples yield low-confidence target estimates.

The primary limit on the number of dimensions is the number of training records per cell. There is a trade-off between the number of dimensions and the number of training samples assigned to each cell: with fewer dimensions, each dimension can be partitioned more finely. In practice, some cells may receive no data at all. The table should therefore contain a cell with a default score, so that records matching no key can still be scored. A typical default score is the overall average.

2.2 Partitioning the dimensions

In practice, a categorical variable maps naturally onto one dimension. Continuous dimensions should be partitioned to suit the situation: one dimension might be split into high, medium, and low, while another is split by percentiles. In some cases the split points are dictated by business rules, and records divided at these specific points may be more meaningful than records divided at arbitrary points. Supervised partitioning can be used to ensure the splits are effective; this will be discussed later.

2.3 From training data to scores

Once the dimensions are partitioned, it is easy to compute the score of each cell from the training set. For a numeric target, the score is the cell mean. For a categorical target, each class gets a score equal to its proportion among the cell's class labels. This gives each class a probability estimate: the probability that a record to be scored belongs to that class.
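For the categorical case, the per-cell class proportions can be computed directly from counts. The cell keys and labels below are invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical training records: (cell_key, class_label), illustrative only.
training = [
    ("A", "buy"), ("A", "buy"), ("A", "no"),
    ("B", "no"),  ("B", "no"),  ("B", "buy"), ("B", "no"),
]

by_cell = defaultdict(Counter)
for cell, label in training:
    by_cell[cell][label] += 1

def class_probabilities(cell):
    """Proportion of each class label within a cell = probability estimate."""
    counts = by_cell[cell]
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Cell "A" saw 2 of 3 buyers, so a new record in "A" gets P(buy) = 2/3.
print(class_probabilities("A"))
print(class_probabilities("B"))
```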

2.4 Dropping dimensions to handle sparse and missing data

Some cells receive too little data, which lowers the confidence of their target estimates. What can be done about such cells? Two remedies: (1) reduce the number of partitions in each dimension; (2) reduce the number of dimensions that define the sparse cells.

For example, consider building a competitive pricing model for item listings on a shopping site. Based on familiarity with the listings, the analysis of click attractiveness considers four dimensions:

· Products

· Region

· Supplier type

· Day of the week

Using all four dimensions makes sense for popular products. For less popular products there are not enough listings to support every dimension, so some dimensions must be dropped. For some products it is fine to drop the day of the week, so comparisons are based on three dimensions rather than four. For some products, only a single dimension may remain. For such products, keep deleting dimensions and merging cells until every cell contains enough data.
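This back-off procedure can be sketched as a lookup that progressively coarsens the key. The cell counts, the support threshold, and the drop order below are invented for illustration; the idea of merging cells by dropping the least important dimension first follows the text.

```python
# Minimum training records a cell needs before we trust its estimate
# (illustrative threshold, not from the text).
MIN_SUPPORT = 30

# counts[(product, region, supplier_type, weekday)] -> training rows in cell.
# None means "this dimension has been merged away".
counts = {
    ("widget", "east", "direct", "mon"): 50,   # popular: full key works
    ("gadget", "west", "reseller", None): 40,  # weekday already dropped
    ("gizmo", None, None, None): 35,           # only the product dimension left
}

# Order in which dimensions are dropped when a cell is too sparse.
DROP_ORDER = ["weekday", "supplier_type", "region"]
POSITIONS = {"region": 1, "supplier_type": 2, "weekday": 3}

def lookup_cell(product, region, supplier_type, weekday):
    """Back off from the full key to coarser keys until support is sufficient."""
    key = [product, region, supplier_type, weekday]
    for dim in [None] + DROP_ORDER:      # first try the full key
        if dim is not None:
            key[POSITIONS[dim]] = None   # merge cells along this dimension
        k = tuple(key)
        if counts.get(k, 0) >= MIN_SUPPORT:
            return k
    return None                          # no cell with enough data

print(lookup_cell("gadget", "west", "reseller", "tue"))
print(lookup_cell("gizmo", "east", "direct", "fri"))
```

A popular product resolves at the full four-dimension key; a rare one backs off, dimension by dimension, until it reaches a cell with enough records.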

3. RFM: a widely used lookup model

RFM stands for recency, frequency, and monetary value. The logic behind RFM is simple: customers who purchased recently are likely to purchase again soon; customers with many past purchases are more likely to purchase again; and customers who spent more in the past are likely to spend more in the future. RFM is a technique for maximizing the value of existing customers, not for attracting new ones.

To assign a customer to an RFM cell, the three RFM variables must be converted into three quantitative scores. Recency: the number of days or weeks since the last purchase yields the R score.

The second variable, frequency, is usually the total number of previous orders and yields the F score. The last is the customer's total lifetime spend, which yields the M score. Each dimension is scored from 1 to 5. Because the dimensions are correlated (F and M, for example), the number of customers in each cell is not equal. What matters is that all the data is allocated to appropriate cells and that each cell holds enough records for its target estimate to have an acceptable confidence level.
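One common way to produce the 1-to-5 scores is rank-based quintiles; the customer rows below are invented for illustration, and real RFM implementations vary in how they break ties and choose cut points.

```python
# Hypothetical customers: (id, days_since_last_purchase, order_count,
# lifetime_spend). Illustrative data only.
customers = [
    ("c1", 5, 20, 2500.0), ("c2", 40, 3, 300.0), ("c3", 90, 1, 50.0),
    ("c4", 12, 10, 1200.0), ("c5", 200, 2, 80.0),
]

def quintile_scores(values, higher_is_better=True):
    """Rank-based 1-5 score: the best fifth gets 5, the worst fifth gets 1."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = 5 - (5 * rank) // len(values)
    return scores

recency = [c[1] for c in customers]    # fewer days since purchase = better
frequency = [c[2] for c in customers]  # more orders = better
monetary = [c[3] for c in customers]   # more spend = better

r = quintile_scores(recency, higher_is_better=False)
f = quintile_scores(frequency)
m = quintile_scores(monetary)

rfm = {c[0]: (r[i], f[i], m[i]) for i, c in enumerate(customers)}
print(rfm)
```

Note how the correlation the text describes shows up immediately: the customers with high F scores tend to have high M scores as well, so the (R, F, M) cells are far from uniformly populated.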

3.1 RFM cell migration

With each marketing campaign, customers move between cells. Responders increase their purchase frequency and total spend and shorten the time since their last purchase, so they migrate to new cells. Non-responders may also move to a new cell as the time since their last purchase grows. In effect, this is regular data and model updating: migration changes the original estimates, so during the process you must keep tracking customer behavior and refresh the data in a timely manner.

3.2 RFM and incremental response modeling

The goal of incremental response modeling is to identify the prospects who are most easily persuaded, i.e. those most affected by marketing. RFM can be viewed as predicting customers' responsiveness to a marketing campaign. After defining the RFM cells, assign members to each cell: either members of a test group who receive the marketing message, or members of a control group who do not. The campaign's ability to sway prospects is measured by the difference between the test group and the control group. Marketing has the greatest impact on the cells with the largest difference in response rate between test and control, and these cells are not necessarily the ones with the highest raw response rate.
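The distinction between the highest response rate and the highest incremental response can be shown with two cells. The cell keys and counts below are invented for illustration:

```python
# Per-cell campaign results (illustrative assumptions, not real data):
# cell key -> (test responders, test size, control responders, control size)
cells = {
    (5, 5, 5): (120, 1000, 110, 1000),  # highest raw response, little uplift
    (3, 3, 3): (60, 1000, 20, 1000),    # lower raw response, large uplift
}

def uplift(cell):
    """Response-rate difference between test and control for a cell."""
    tr, tn, cr, cn = cells[cell]
    return tr / tn - cr / cn

# The cell worth targeting is the one where marketing changes behavior most,
# not the one with the most responders overall.
best = max(cells, key=uplift)
print(best, uplift(best))
```

Here the (5, 5, 5) cell responds at 12% but would have responded at 11% anyway; the (3, 3, 3) cell responds at only 6%, yet the campaign accounts for most of that, so it is the better marketing target.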


Later posts will introduce the naive Bayes model, linear regression, multiple regression, logistic regression, and other models.



See: Data Mining Techniques



Copyright BUAA


 
