The Four Main Types of Problems Data Mining Solves

Source: Internet
Author: User
Keywords Data mining

Data mining has a clear definition of the types of problems it can solve. That definition is a high-level abstraction, and applying data mining in practice means mapping concrete tasks back onto these problem types. Let's look at how the four types are defined:

1. Classification problem

A classification problem is a prediction problem, but it differs from general prediction in that the predicted result is a category (such as A, B, or C) rather than a specific numeric value (such as 55, 65, 75 ...).

For example, you and a friend are walking down the road and someone approaches. If you say to your friend, "I think this person is Shanghainese," that is a classification problem. If you say, "I guess this person is about 30 years old," that belongs to the prediction problem discussed later.

In business, classification problems are the most common. Given a customer's information, predict: Will the customer churn (leave the network) in the future? Is the customer's credit good, average, or poor? Will the customer use one of your products? Will the customer be of high, medium, or low value in the future? Will the customer respond to one of your promotional campaigns?

One special case of classification is the binary ("two-class") problem, where the predicted result has only two categories: yes/no, good/bad, high/low, and so on. This type of problem is also known as the 0/1 problem. It is special mainly because, when solving it, we only need to estimate the probability of one of the classes; the probability of the other follows directly. If the predicted probability of x=1 is P(x=1), then the probability of x=0 is P(x=0) = 1 - P(x=1). This is very important.

Many readers may already be wondering how a data mining method predicts P(x=1). It is not difficult. A key step is collecting historical data for which the classification result is already known, for example 10,000 users of whom 7,000 belong to class "1" and 3,000 to class "0". Along with the classification results, a number of characteristics (indicators, variables) of these 10,000 users are collected. Such a data set is called a training set in data mining: as the name suggests, the rules for classification are trained on it. The general idea of training is this: analyze all the collected features/variables, find those associated with the 0/1 target variable, and then summarize the relationship between P(x=1) and the selected features/variables. Different methods express this relationship differently: a regression method expresses it as a function, while a decision tree expresses it as a set of rules.
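The training idea above can be sketched in a few lines. This is only a minimal illustration, assuming logistic regression fitted by plain gradient descent as the "function relation" method; the feature names, the toy training set, and the helper names (`train_logistic`, `predict_p1`) are all hypothetical and not taken from the article.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = p - y                      # gradient of log-loss w.r.t. the score
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def predict_p1(weights, x):
    """Return the estimated P(x=1) for one user."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

# Hypothetical training set: [normalized spend, normalized complaints] -> churned (1) or not (0)
train_x = [[0.9, 0.0], [0.8, 0.1], [0.7, 0.0], [0.2, 0.9], [0.1, 0.8], [0.3, 1.0]]
train_y = [0, 0, 0, 1, 1, 1]

weights = train_logistic(train_x, train_y)
p1 = predict_p1(weights, [0.15, 0.9])   # a user who resembles the churners
p0 = 1.0 - p1                           # the other class follows directly
print(f"P(x=1)={p1:.2f}, P(x=0)={p0:.2f}")
```

Only P(x=1) is modeled; P(x=0) is obtained as its complement, which is the property of the 0/1 problem described earlier.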

For details, refer to: decision trees, logistic regression, discriminant analysis, neural networks, impurity, entropy, chi-square, Gini, odds, and other related knowledge.

2. Clustering problem

A clustering problem is not a prediction problem. It deals with dividing a set of objects into several groups, and the basis for that division is the core of the problem. As the saying goes, "birds of a feather flock together"; hence the name clustering.

Clustering is easily confused with classification, mainly because of how we talk. We often say things like "based on customers' consumption behavior, we divide customers into three classes; the main characteristic of the first class is ...". That is actually a clustering problem, but it is easily mistaken for classification. The two differ in essence: classification predicts which known category a user of unknown category belongs to (like answering a multiple-choice question), whereas clustering divides a group of users according to selected indicators (like answering an open-ended question) and is not a prediction problem.

Clustering problems are also very common in business. For example, a company may select a number of indicators (such as value, cost, and products used) to segment its existing user base: users with similar characteristics are grouped into one class, and users with different characteristics fall into different classes.

New clustering methods keep emerging, but clustering users by the distance between them is still the most popular approach. The general idea is this: first decide which indicators to cluster on; then compute the distance between users on those indicators. There are many distance formulas, the most common being straight-line (Euclidean) distance: each selected indicator is treated as a dimension, each user has a value on each indicator and so can be seen as a point in a multidimensional space, and the distance between two users is the length of the line segment joining them. Finally, the clustering method groups users that are close to one another into the same class, so that the distance between classes is comparatively large.
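As a rough illustration of the distance-based idea, the sketch below computes Euclidean distance and runs a bare-bones k-means loop. k-means is chosen here only as one common distance-based method; the user data, the starting centroids, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points in multidimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Hypothetical users described by two indicators, e.g. (value, cost)
users = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = k_means(users, centroids=[users[0], users[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the indicators are usually rescaled first, since an indicator with a large numeric range would otherwise dominate the distance.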

For details, refer to: cluster analysis, hierarchical clustering, k-means clustering, Euclidean distance, Minkowski distance, Mahalanobis distance, and other related knowledge.

3. Association problem

Any discussion of association problems probably has to start with "beer and diapers." Some say beer and diapers is a classic Wal-Mart case; others say it is a fictional story made up to promote data mining and data warehouses. Either way, "beer and diapers" teaches a lesson: things in the world are linked in countless ways, and we have to be good at discovering those associations.

The main questions association analysis answers are: among a group of users who have bought many products, which products are most likely to be bought together? And given that a user has bought one product, which other product is the user most likely to buy? Perhaps because association analysis was first widely applied in supermarkets, it is also called "market basket analysis", abbreviated MBA in English. This MBA is of course not the business degree; it stands for Market Basket Analysis.

If the study assumes that all the products a user buys are purchased at the same time, the analysis focuses on associations among the products bought across all users. If instead the purchase times are taken to differ and the analysis needs to reflect their order (what is bought first, and what next), the problem is called a sequence problem, which is a special case of the association problem. In a sense, sequence problems can also be handled in the same way as association problems.

Association analysis has three very important measures: support, confidence, and lift. Suppose there are 10,000 shoppers in total, of whom 1,000 buy product A, 2,000 buy product B, and 800 buy both A and B. Support is the proportion of all shoppers who buy the associated products together (here, A together with B): 800/10,000 = 8%, meaning 8% of users bought both A and B. Confidence is the likelihood of buying another product given that one product was bought: the confidence of buying B given A is 800/1,000 = 80%, meaning 80% of users who buy A also buy B. Lift is the ratio of the likelihood of buying B given that A was bought to the likelihood of buying B with no such condition. The unconditional likelihood of buying B is 2,000/10,000 = 20%, so lift = 80%/20% = 4.
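The three measures can be checked with a short calculation. The sketch below simply reproduces the arithmetic of the example; the function name `association_metrics` is a hypothetical helper, not part of any library.

```python
def association_metrics(total, count_a, count_b, count_ab):
    """Return (support, confidence, lift) for the rule A -> B."""
    support = count_ab / total            # share of all shoppers buying both A and B
    confidence = count_ab / count_a       # share of A buyers who also buy B
    lift = confidence / (count_b / total) # confidence relative to B's baseline rate
    return support, confidence, lift

support, confidence, lift = association_metrics(
    total=10_000, count_a=1_000, count_b=2_000, count_ab=800
)
print(f"support={support:.0%} confidence={confidence:.0%} lift={lift:.1f}")
# support=8% confidence=80% lift=4.0
```

A lift above 1 means that buying A makes buying B more likely than it is on average; a lift of 1 would mean the two purchases look unrelated.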

For details, refer to: association rules, the Apriori algorithm, and other related knowledge.

4. Prediction problem

The prediction problem here is prediction in the narrow sense: it excludes the classification problem described above, even though classification is also predictive. Generally, when we speak of a prediction problem we mean the case where the variable being predicted is a continuous numeric value.

Examples include a weather forecast predicting tomorrow's temperature, a forecast of next year's GDP growth rate, and a telecom operator forecasting its revenue and subscriber numbers.

Prediction problems are generally solved with statistical techniques such as regression analysis and time series analysis. Regression analysis is a classical and far-reaching statistical method, first proposed by Darwin's cousin Galton in his biostatistical research. Its main purpose is to study the relationship between a target variable and several variables that influence it, expressed through a fitted relationship of the form y = a*x1 + b*x2 + ... Given a set of values x1, x2, ..., this relationship can then be used to predict the unknown value of y.
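To make the idea concrete, here is a minimal sketch of the simplest case: one explanatory variable plus an intercept, fitted by ordinary least squares. The intercept term, the monthly revenue figures, and the function name `fit_line` are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: month number vs. revenue
months = [1, 2, 3, 4, 5]
revenue = [12.0, 14.1, 15.9, 18.2, 19.8]

slope, intercept = fit_line(months, revenue)
forecast = slope * 6 + intercept   # predicted revenue for month 6
print(f"y = {slope:.2f}x + {intercept:.2f}, forecast for month 6: {forecast:.2f}")
# y = 1.97x + 10.09, forecast for month 6: 21.91
```

With several explanatory variables the same least-squares principle applies, but the coefficients are usually solved with matrix methods rather than these closed-form sums.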

By comparison, regression analysis for prediction is used far less in business than in medicine, psychology, and the natural sciences. The main reason is that those fields lean toward theoretical research and need empirical analysis backed by theory, whereas business analysis relies more on descriptive statistics and reports to show what happened in the past, or on classification and clustering.

For details, refer to: simple linear regression, multiple linear regression, least squares, and other related knowledge.

Application fields of data mining

Data mining has been application-oriented from the start. The four major problems above show what it can solve; mapped onto different industries, they show how widely data mining applies.

Taking the mobile communications industry, which we deal with often, let's look at applications of data mining in light of the four problems described earlier.

Classification problem:

Churn prediction: predict the risk that a user will leave the network within a given period.

Credit application scoring: assess from user data whether a user can be granted credit (for example, prepaid users may be allowed to overdraw, and postpaid users may be given an extended billing period).

Credit behavior scoring: assess a user's credit score from past consumption behavior, so that the overdraft limit or billing period can be adjusted accordingly.

Product target users (for products such as ring tones, WAP, and value-added data services): build a model to select the target user groups for product marketing.

Clustering problem:

User segmentation: select a number of indicators and divide users into groups so that characteristics are similar within a group and clearly different between groups. There are many ways to segment users, and clustering is not the only one. The advantage of clustering is that it can handle many variables together; the disadvantage is that the result is hard to interpret. A segmentation approach that is easier to interpret is to divide the user base manually according to business knowledge, usually called the pre-defined method. It is easy to explain and readily applicable, but it demands strong business knowledge, the boundaries between groups are hard to set, and it handles many variables poorly.

Association problem:

Cross-selling: recommend products and services that a user does not yet use but may be interested in. From some angles, cross-selling can also be viewed as a classification problem, similar to identifying a product's target users.

Prediction problem:

Mature applications are comparatively few; prediction is mostly used for forecasting subscriber numbers and revenue.
