Data mining is not as mysterious as you might imagine!

Source: Internet
Author: User

Much of the success of a data mining project depends on close collaboration between the IT department and the business side. Data mining is tightly coupled to the business: it needs both data and professional business experience. The data is generally held by the IT department, while business people naturally know their own business best, so the two sides have to communicate and coordinate, and that is where problems often arise. There have been many painful lessons. IT staff finally produce a model, but the business unit has little trust in it. The analysis results are never put into practice, and no useful feedback reaches the IT staff for further optimization. As a result the model is neither applied to the business nor improved, and time and money are wasted.

For business people to believe that analysis results can bring value to the business, they need to be more involved in the project. Letting business people understand the analysis process, take part in it, or even carry it out themselves is the current trend in how enterprises run such projects. Only in this way can the results of data mining be applied well to the business.

We have always stressed that the greatest advantage of IBM SPSS Modeler is ease of use. It provides a graphical interface in which users build the data analysis process by dragging and dropping, which leaves more time and attention for understanding the business rather than for programming and debugging. That may sound like sales talk, so let us work through an example. A comparison makes the difference visible: we implement the same cluster analysis in R and in SPSS Modeler and look at the pros and cons of each process.

Analysis Scenarios

Perform a cluster analysis of telecom customers to understand the characteristics of each customer group.

The available fields are: Customer ID, Rateplan (rate plan), longdistance (long-distance calls), International (international calls), local (local calls), drop (number of dropped calls), Paymethod (payment method), Localbilltype (local billing type), and Longdistancebilltype (long-distance billing type).

Analysis Steps

Read the data source, select the indicators to use as input fields, build a model with a clustering algorithm, and export the analysis results.
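The four steps above can be sketched in a few lines. This is only an illustrative Python sketch, not the R listing the article refers to: the sample values are invented, the helper names (`kmeans`, `squared_distance`, `INPUT_FIELDS`) are hypothetical, and only four of the numeric fields are used. Like the R run described in this article, it applies no normalization.

```python
import csv
import io
import random

# Hypothetical sample rows: field names follow the article, values are invented.
SAMPLE_CSV = """customer_id,longdistance,international,local,drop
1,5.0,0.0,20.0,0
2,6.0,0.5,22.0,1
3,30.0,8.0,90.0,0
4,28.0,7.5,95.0,1
5,4.0,0.0,18.0,4
6,5.5,0.2,21.0,5
"""
INPUT_FIELDS = ["longdistance", "international", "local", "drop"]


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)  # a fixed seed makes the random start repeatable
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# 1. Read the data source.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
# 2. Select the input fields (raw values, no normalization).
points = [[float(row[f]) for f in INPUT_FIELDS] for row in rows]
# 3. Build the clustering model.
labels, centers = kmeans(points, k=3, seed=42)
# 4. Export: attach each customer's cluster to the customer ID.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["customer_id", "cluster"])
for row, label in zip(rows, labels):
    writer.writerow([row["customer_id"], label])
print(out.getvalue())
```

Changing `seed` changes the initial centers and can change the final clusters, which is worth keeping in mind when two K-means runs are compared.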

Implementing the analysis process with R

If implemented with R, we need the following code:

The code does not look complicated; the strength of the R language is its concision. Running it prints the clustering results:

Merging these cluster assignments with the original data shows which category each customer belongs to.

If you are a business person, can you judge the quality of this result? Can you see what it means for the business? Keep those questions in mind as we look at the same process in SPSS Modeler.

Implementing the analysis process with SPSS Modeler

Following the steps above, we drag and drop the relevant function nodes onto the SPSS Modeler interface and connect them into a data analysis stream:

Each node can be annotated to explain what it does, so that business people can understand the stream at a glance:

Next we look at the SPSS Modeler results. Double-click the generated model to see the following:

First, on the left is the model summary. The cluster quality value tells us how good the clustering is: the closer to 1, the better, which business staff can understand easily. If you want to trace back which statistical indicator this cluster quality value actually is, the bundled help documentation explains it. The value is a silhouette measure: the average over all records of (B - A) / max(A, B), where A is the distance from the record to the center of its own cluster and B is the distance from the record to the nearest cluster center it does not belong to. A business person does not need to delve into this statistic; it is enough to use the number to compare the quality of different clustering results.
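To make the cluster quality value concrete, here is a minimal Python sketch of the measure as described in the help text: for each record, A is the distance to its own cluster center, B is the distance to the nearest other cluster center, and the score is the average of (B - A) / max(A, B). The function names and the toy points are assumptions for illustration; this is not SPSS Modeler's implementation.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette(points, labels, centers):
    """Average of (B - A) / max(A, B) over all records (needs >= 2 centers)."""
    scores = []
    for p, label in zip(points, labels):
        a = euclidean(p, centers[label])  # distance to own cluster center
        b = min(  # distance to the nearest center of another cluster
            euclidean(p, c) for i, c in enumerate(centers) if i != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated groups.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]

# Grouping that matches the structure: score close to 1.
good = silhouette(points, [0, 0, 1, 1], [[0, 1], [10, 1]])
# Grouping that cuts across the structure: score close to 0.
bad = silhouette(points, [0, 1, 0, 1], [[5, 0], [5, 2]])
print(round(good, 3), round(bad, 3))
```

The sensible grouping scores about 0.9 and the poor one scores below 0.1, which is the kind of comparison the cluster quality value is meant to support.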

The pie chart on the right shows the proportion of each cluster at a glance. From it we can judge whether the clusters are evenly distributed; if they are very uneven, the result is hard to manage from a business standpoint.

We also need to understand which factors affect the clustering results, which are important and which are not. The predictor importance view shows this:

In addition, selecting "Cluster" and "cluster comparison" in the View selection box below shows the following results:

The left side of the chart shows the average of each indicator for each cluster. The clustering results produced by R earlier also listed per-cluster averages, and in theory the two should match, so why do they look different? We explain this later. First, let us look at the chart above and at how to interpret the results from a business perspective.

Here we can clearly see the average of each indicator for every cluster. Start with the numbers and look for indicators on which one cluster stands apart from the others. For example, in cluster 5 the average number of dropped calls is 3.52, much higher than in the other clusters; the box plot below (marked with a black box) shows this clearly.

With these two charts, we can quickly find what distinguishes each cluster from the others and describe that group of customers in business terms. For cluster 5, for example, we can roughly conclude: the most dropped calls, almost no international calls, and so on for local and long-distance calls. Because SPSS Modeler presents the results this way, business staff can combine them with their own business experience to describe each group. Then, based on the characteristics of each cluster and on the business objectives, they can develop a marketing or management strategy, which is exactly what business people are good at.

The final result of the analysis, namely which cluster each customer belongs to, can be seen directly in a table:
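The per-cluster averages used for this kind of profiling are simple to compute once each record has a cluster label. The following Python sketch assumes hypothetical records and labels (the numbers are invented and are not the article's data) and shows how to find the cluster that stands out on one indicator.

```python
from collections import defaultdict


def cluster_means(records, labels, fields):
    """Average of each field within each cluster."""
    grouped = defaultdict(list)
    for record, label in zip(records, labels):
        grouped[label].append(record)
    return {
        label: {f: sum(r[f] for r in recs) / len(recs) for f in fields}
        for label, recs in grouped.items()
    }


# Hypothetical records with two indicators and an assumed cluster label each.
records = [
    {"local": 20.0, "drop": 0},
    {"local": 24.0, "drop": 1},
    {"local": 10.0, "drop": 3},
    {"local": 12.0, "drop": 5},
]
labels = [0, 0, 1, 1]

means = cluster_means(records, labels, ["local", "drop"])
# The cluster with the highest average number of dropped calls.
worst = max(means, key=lambda label: means[label]["drop"])
print(means)
print("highest average drop: cluster", worst)
```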

Analysis Summary

From the above example, we draw the following conclusions:

1. From the user's point of view, the R language certainly has its strengths, but it poses obstacles for people who do not know R programming or are not yet proficient in it. Every function comes with parameters, and understanding them means consulting the help documentation, which is all in English; for readers whose English is weak, that is another obstacle. Mastering R takes a considerable investment of time, and maintaining and optimizing the final model also costs time and labor.

2. Looking at the analysis process and results, both analyses use the K-means algorithm and both produce 5 clusters, yet the results are very different. Why? There are two main reasons:

(1) K-means works on distances, so the indicators generally need to be normalized before analysis. If an indicator ranging over 0-1 and one ranging over 10000-100000 enter the distance calculation together, the result will be dominated by the 10000-100000 indicator. In the R computation above I did not normalize, and a user who is unfamiliar with this logic will end up with a misleading result. The K-means algorithm in SPSS Modeler already accounts for this problem: normalization is built into the algorithm, so even a user who does not fully understand how K-means works will not run into serious trouble.

(2) K-means chooses its initial center points at random, so different runs may produce inconsistent results.

3. SPSS Modeler encapsulates its algorithms so that people who do not know the algorithms can still use them, and it embeds some data processing in them. Take K-means again: K-means itself supports only numeric data, and on the R platform, blank values in the data must be handled first or an error is raised. In SPSS Modeler, even if the data contain categorical fields and some blank values, the analysis still runs, because the data processing is done in advance: categorical fields are converted to numeric ones and blank values are filled. That is why it is popular with business people and with others who do not have much statistical background.

4. The R language has major advantages of its own. Because it is open source, it offers a very broad range of algorithms, including innovative ones that users are often keen to try. IBM recognized this: starting with version 16, SPSS Modeler includes R nodes, in which you can write R code directly to introduce a new algorithm. You can even design your own panel to package a custom R algorithm, so that the next time you can use it directly without modifying the code. If you are interested, we can cover this in a later article.
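The normalization point in item 2 (1) can be illustrated with a small Python sketch. It uses min-max scaling, which is one common choice and is an assumption here; the article does not say which method SPSS Modeler applies internally. The four rows are invented: the first column ranges over 0-1 and the second over 10000-100000.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def nearest(rows, index):
    """Index of the row closest to rows[index]."""
    others = [i for i in range(len(rows)) if i != index]
    return min(others, key=lambda i: euclidean(rows[index], rows[i]))


# Invented rows: [indicator in 0-1, indicator in 10000-100000].
raw = [
    [0.0, 10000.0],
    [1.0, 12000.0],
    [0.05, 20000.0],
    [0.5, 100000.0],
]
scaled = min_max_scale(raw)

print(nearest(raw, 0))     # 1: the large-range column decides everything
print(nearest(scaled, 0))  # 2: after scaling, both columns contribute
```

On the raw values, row 0 looks closest to row 1 even though the two sit at opposite ends of the first indicator. After scaling, row 2 is the nearest neighbor, so a distance-based algorithm such as K-means can group the same data differently depending on whether it was normalized.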

The visualization and ease of use of SPSS Modeler show not only in its graphical interface but also in the breadth of its packaged algorithms and the readability of its analysis results. Even without a background in statistical analysis, you can use it to carry out business analysis and deliver business value.

SPSS Modeler Trial Version: http://bigdata.evget.com/product/168.html
