On data mining: four types of problems in data mining

Last Update:2015-01-22 Source: Internet

Author: User


Data mining focuses on solving four types of problems: classification, clustering, association, and prediction (each is explained in detail below). Conventional data analysis focuses on other problems, such as descriptive statistics, cross-tabulation reports, and hypothesis testing. Data mining therefore has a very clear definition of the kinds of problems it can solve. These four types are a high-level generalization; applying data mining means working from these problem types down to a concrete case. Let's look at how the four types of problems are defined:

1. Classification problems

A classification problem is a predictive problem, but it differs from the ordinary prediction problem in that the predicted result is a category (such as A, B, or C) rather than a specific value (such as 55, 65, 75, ...).

For example, you and a friend are walking down the road and someone comes toward you. If you say to your friend, "I guess this person is from Shanghai," that is a classification problem. If you say, "I guess this person is about 30 years old," that is a prediction problem of the kind discussed later.

In business, classification problems are the most common: given a customer's information, predict whether the customer will churn within some period. Is the customer's credit good, average, or bad? Will the customer use a given product? Will the customer be of high, medium, or low value in the future? Will the customer respond to a given promotion?

One special kind of classification problem is binary classification, in which the predicted result has only two classes: yes/no, good/bad, high/low, and so on. This type is also called the 0/1 problem. It is special mainly because, when solving it, we only need the probability that the prediction belongs to one of the classes; the probability of the other class follows from it. If the predicted probability of x=1 is P(x=1), then the probability of x=0 is P(x=0) = 1 - P(x=1). This is very important.

Many readers may wonder how a data mining method predicts P(x=1). It is not difficult. A major prerequisite is that, through historical data, the classification results of some users are already known: for example, 10,000 users, of which 7,000 belong to class "1" and 3,000 to class "0". Several features (indicators, variables) of these 10,000 users are collected along with the classification results. Such a dataset is commonly called a training set because, as the name implies, the rules for predicting the category are trained from it. The general approach is: analyze all the collected features, find those associated with the 0/1 target variable, and then summarize the relationship between P(x=1) and the selected features. Different methods express this relationship differently: regression expresses it as a function, while a decision tree expresses it as a set of rules.
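As a rough illustration of how a function-style relationship for P(x=1) might be learned from a training set, here is a minimal logistic regression sketch fitted by plain gradient descent. The feature names, the toy data, and the helper functions `train_logistic` and `predict_p1` are all hypothetical and chosen only for this example; a real project would more likely use an established library.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights w and bias b so that sigmoid(w.x + b) approximates P(x=1)."""
    n_features = len(rows[0])
    n = len(rows)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log-loss with respect to the linear score
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_p1(w, b, x):
    """Return the estimated P(x=1) for one user's feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical training set: [scaled monthly usage, number of complaints]
# with label 1 = churned, 0 = stayed.
train_x = [[0.9, 0], [0.8, 0], [0.7, 1], [0.3, 2], [0.2, 3], [0.1, 2]]
train_y = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(train_x, train_y)
p1 = predict_p1(w, b, [0.15, 3])  # a low-usage user with many complaints
p0 = 1.0 - p1                     # the other class follows from P(x=1)
print(f"P(x=1) = {p1:.3f}, P(x=0) = {p0:.3f}")
```

Only P(x=1) is modeled here; P(x=0) is obtained as its complement, which is exactly why the binary case is simpler than the general one.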

For details, please refer to: decision trees, logistic regression, discriminant analysis, neural networks, impurity, entropy, chi-square, Gini, odds, odds ratio, and related topics.

Common applications in the mobile communications industry:

Churn prediction: predict the risk that a user will leave the network within a given period.

Credit application scoring: assess from user data whether the user can be trusted (for example, whether a prepaid user may overdraw, or whether a postpaid user may be given a longer billing period).

Credit behavior scoring: assess a credit score from the user's past consumption behavior, so that the overdraft limit or the billing period can be adjusted.

Locating target users for products (such as ringtones, WAP, and value-added data services): build models to select the target user groups for product marketing.

2. Clustering problems

A clustering problem is not a predictive problem; it mainly addresses how to divide a group of objects into several groups. The basis for the division is the core of the clustering problem. As the saying goes, "birds of a feather flock together", hence the name clustering.

Clustering problems are easily confused with classification problems, mainly because of how we talk about them. We often say, "Based on customers' consumption behavior, we divide customers into three classes; the main feature of the first class is ...". This is actually a clustering problem, but the wording makes it easy to mistake for classification. The two are essentially different: a classification problem predicts which of a set of known categories a user of unknown category belongs to (like answering a multiple-choice question), whereas a clustering problem divides a group of users according to selected indicators (like answering an open-ended question) and is not a predictive problem.

Clustering problems are also very common in business, for example selecting several indicators (such as value, cost, and products used) to divide an existing user base: users with similar characteristics are clustered into one group, and users with different characteristics fall into different groups.

There are many clustering methods; clustering users by the distance between them remains the most popular. The general idea is as follows. First decide which indicators will be used to cluster users, then compute the distance between users on those indicators. There are many distance formulas; the most common is the straight-line (Euclidean) distance: each selected indicator is treated as a dimension, each user has a value on each indicator and can be regarded as a point in a multidimensional space, and the distance between two users is the straight-line distance between the two points. Finally, users that are close to one another are gathered into one cluster, so that the distance between clusters is comparatively large.
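The distance-based idea can be sketched with a small K-means style loop: assign each user to the nearest cluster center by Euclidean distance, then recompute each center as the mean of its members. This is only an illustrative sketch; the function names, the two hypothetical indicators, the toy data, and the choice of starting centers are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points in multidimensional space."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_means(points, centroids, max_iter=100):
    """Group points around the given starting centroids (a basic K-means loop)."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the clusters have converged
        assignment = new_assignment
        # Step 2: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical users described by two indicators, e.g. (value, cost).
users = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = k_means(users, centroids=[users[0], users[3]])
print(labels)
```

In practice the indicators are usually standardized first, since an indicator with a large numeric range would otherwise dominate the distance.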

For details, please refer to: cluster analysis, hierarchical clustering, K-means clustering, Euclidean distance, Minkowski distance, Mahalanobis distance, and related topics.

Common applications in the communications industry:

User segmentation: select a number of indicators and cluster users into several groups, so that users within a group have similar characteristics and the differences between groups are clear. Of course, there are many methods of user segmentation, and not all of them are clustering methods. The advantage of clustering is that it can handle multidimensional variables together; the disadvantage is that the result is not easy to explain. An easier-to-interpret approach is to divide the user base according to business rules, customarily called pre-defined segmentation. Its advantages are that it is easy to explain and highly practical; its disadvantages are that it demands considerable business knowledge, that the boundaries are hard to set, and that it handles multidimensional variables poorly.

3. Association problems

Any discussion of association problems probably has to start with "beer and diapers". Some say that beer and diapers is a classic real case from Wal-Mart supermarkets, and others say that it is a fictional story invented to promote data mining and data warehousing. Either way, "beer and diapers" teaches a lesson: things in the world are linked in countless ways, and we have to be good at discovering those links.

The main questions that association analysis answers are: when a group of users buys many products, which products are more likely to be purchased together, and how likely is it? Probably because association analysis was first applied widely in supermarkets, it is also called market basket analysis, abbreviated MBA (not to be confused with the degree).
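A first step in market basket analysis is simply counting how often pairs of products appear in the same basket. The sketch below does this with the Python standard library; the baskets and product names are made-up illustrative data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each set holds the products bought by one user.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "diapers", "milk"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always stored in the same order.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

Counting every pair this way becomes expensive as the number of products grows, which is the problem that dedicated association rule algorithms are designed to address.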

If a study treats all products bought by one user as bought at the same time, the analysis focuses on the association between the products bought by all users. If the purchase times are treated as different and the analysis needs to bring out the order of purchases, such as what is bought first and what is bought next, the problem is called a sequence problem, which is a special case of the association problem. In a sense, sequence problems can also be handled with association techniques.

Association analysis has three very important measures: support, confidence, and lift. Suppose 10,000 people made purchases, of whom 1,000 bought product A, 2,000 bought product B, and 800 bought both A and B. Support is the proportion of all buyers who purchased the associated products together (here, A and B): 800/10000 = 8%, so 8% of users bought both A and B. Confidence is the likelihood of buying one product given that another was bought: the confidence of buying B given A is 800/1000 = 80%, so 80% of users who bought A also bought B. Lift is the ratio of the likelihood of buying B given that A was bought to the likelihood of buying B unconditionally. The unconditional likelihood of buying B is 2000/10000 = 20%, so lift = 80%/20% = 4.
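The arithmetic for the three measures can be checked with a few lines of code. The function name `association_metrics` is a hypothetical helper written for this illustration; the counts are the ones used in the example above.

```python
def association_metrics(n_total, n_a, n_b, n_ab):
    """Return (support, confidence, lift) for the rule 'A implies B'."""
    support = n_ab / n_total        # share of all buyers who bought both A and B
    confidence = n_ab / n_a         # share of A buyers who also bought B
    baseline = n_b / n_total        # share of all buyers who bought B
    lift = confidence / baseline    # how much buying A raises the chance of B
    return support, confidence, lift

support, confidence, lift = association_metrics(
    n_total=10_000, n_a=1_000, n_b=2_000, n_ab=800
)
print(f"support={support:.0%}, confidence={confidence:.0%}, lift={lift:.1f}")
```

A lift above 1 suggests that buying A makes buying B more likely than it is for buyers in general, while a lift near 1 suggests the two purchases are roughly independent.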

For details, please refer to: association rules, the Apriori algorithm, and related topics.

Common applications in the communications industry:

Cross-selling: based on the products and services a user already uses, recommend products that the user does not yet use but may be interested in. From a certain point of view, cross-selling can also be treated as a classification problem, similar to locating target users for a product.

4. Prediction problems

The prediction problem discussed here is prediction in the narrow sense. It excludes the classification problems described earlier, even though classification is also a form of prediction. In general, a prediction problem is one in which the value being predicted is a continuous number.

Examples include a weather forecast predicting tomorrow's temperature, a forecast of the country's GDP growth rate for next year, and a telecom operator forecasting next year's revenue and number of users.

Prediction problems are mostly solved with statistical techniques such as regression analysis and time series analysis. Regression analysis is a classical and far-reaching statistical method, first proposed by Darwin's cousin Galton in his biometric studies. Its main purpose is to study the relationship between a target variable and the variables that affect it, by fitting a relation such as y = a*x1 + b*x2 + ... to the data. Once this relation is known, the unknown value of y can be predicted from a given set of values x1, x2, ....
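For the simplest case of a single explanatory variable, the fitted relation y = a*x + c can be computed directly with the least squares formulas. The sketch below is an illustration only: the function `fit_line` is a hypothetical helper, and the yearly user counts are invented data that happen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept for one variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: year index -> number of users (in millions).
years = [1, 2, 3, 4, 5]
users_millions = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(years, users_millions)
forecast_year_6 = slope * 6 + intercept  # extrapolate one year ahead
print(slope, intercept, forecast_year_6)
```

With several explanatory variables the same least squares principle applies, but the coefficients are normally obtained with a linear algebra routine rather than by hand.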

Regression analysis for forecasting is used far less in business than in medicine, psychology, and the natural sciences. The main reason is that those fields lean toward theoretical research and need empirical analysis to support theory, whereas business analysis relies more on descriptive statistics and reports to reveal what happened in the past, or on the more practical classification and clustering techniques.

For details, see: simple (one-variable) linear regression, multiple linear regression, the least squares method, and related topics.

Common applications in the communications industry:

Mature applications are comparatively few; prediction is generally used for forecasting the number of users, revenue, and so on.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
