Brief introduction
In Data mining with WEKA, Part 1: Introduction and regression, I introduced the concept of data mining and the free, open source software Waikato Environment for Knowledge Analysis (WEKA), which lets you mine your own data for trends and patterns. I also covered the first data mining method, regression, which predicts a numeric value from a given set of input values. Regression is the easiest method to perform and the least powerful, but it served as a good introduction to WEKA and as a good example of how raw data can be turned into meaningful information.
In this article, I'll take you through two more data mining methods that are slightly more complex than a regression model but more powerful. Where a regression model can only give you a single numeric output for a specific set of inputs, these two models let you interpret your data in different ways. As I said in Part 1, the core of data mining is applying the right model to your data. You could have the best possible data about your customers (whatever that means), but if you don't apply the right model to it, the data tells you nothing. Consider it from another perspective: if you only used regression models, which produce numeric output, how could Amazon tell you that "customers who bought X also bought Y"? No numeric function can give you that kind of information. So let's delve into the two additional models you can apply to your data.
In this article, I will also refer repeatedly to the data mining method called "nearest neighbor," but I won't go into its details here; those will come in Part 3. I include it in the comparisons and descriptions in this article to make the discussion complete.
Classification vs. clustering vs. nearest neighbor
Before I get into the details of each method and run them through WEKA, I think we should first understand each model: what type of data it suits and what it tries to accomplish. Let's also include our existing model, the regression model, in the discussion so that you can see how the three new models compare with the one we already know. I'll use a real-world example to show how each model is used and how the models differ. The examples all revolve around a local BMW dealership that wants to increase its sales. The dealership has kept all of its past sales information, along with information about every person who has purchased a BMW, looked at a BMW, or visited the BMW showroom. The dealership wants to increase future sales and is turning to data mining to do so.
Regression
Question: "How much should we charge for the new BMW M5?" A regression model can only answer a question with a numeric value. It would use past sales data on BMWs and M5s to determine how much people paid for previous cars at the dealership, based on the attributes and selling features of the cars sold. The model then lets the BMW dealership plug in the new car's attributes to determine its price.
For example: Selling price = $25,000 + ($2,900 * liters in engine) + ($9,000 * isSedan) + ($11,000 * isConvertible) + (USD * inches of car) + ($22,000 * isM).
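The pricing formula above can be sketched as a small function. This is only an illustration of how such a linear model is evaluated, not WEKA output: the function and parameter names are invented here, the per-inch amount is passed in as a parameter (`per_inch_rate`) rather than fixed, and the sample car's values are made up.

```python
def selling_price(liters, is_sedan, is_convertible, length_inches, is_m,
                  per_inch_rate):
    """Evaluate the example pricing formula.

    The is_* flags are booleans and count as 1 (true) or 0 (false).
    per_inch_rate is the dollar amount added per inch of car length;
    it is an assumed parameter, not a value taken from the article.
    """
    return (25_000
            + 2_900 * liters
            + 9_000 * int(is_sedan)
            + 11_000 * int(is_convertible)
            + per_inch_rate * length_inches
            + 22_000 * int(is_m))


# Hypothetical new car: 5.0-liter sedan, 193 inches long, M model,
# with an assumed rate of 100 dollars per inch.
price = selling_price(liters=5.0, is_sedan=True, is_convertible=False,
                      length_inches=193, is_m=True, per_inch_rate=100)
print(price)
```

Each attribute simply adds its own weighted contribution to the base price, which is why the model can only ever answer with a single number.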
Classification
Question: "How likely is person X to buy the newest BMW M5?" By creating a classification tree (a decision tree), the data can be mined to determine how likely this person is to buy a new M5. Possible nodes on the tree are age, income level, number of cars currently owned, marital status, whether the person has children, and whether the person owns or rents a home. Running this person's attributes through the decision tree determines the likelihood that he or she will purchase the M5.
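To make the idea of running a person's attributes through a tree concrete, here is a minimal hand-built sketch. It is not how WEKA builds or stores its trees, and the tree itself (the income and car-count thresholds, the "likely"/"unlikely" labels) is entirely hypothetical; a real tree would be learned from the dealership's data.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    likelihood: str  # the answer stored at the bottom of the tree


@dataclass
class Node:
    test: Callable[[dict], bool]      # question asked about the customer
    if_true: Union["Node", Leaf]      # branch taken when the test passes
    if_false: Union["Node", Leaf]     # branch taken when the test fails


def classify(tree, customer):
    """Walk from the root to a leaf, following one branch per test."""
    while isinstance(tree, Node):
        tree = tree.if_true if tree.test(customer) else tree.if_false
    return tree.likelihood


# Hypothetical tree: first ask about income, then about cars owned.
tree = Node(
    test=lambda c: c["income"] >= 100_000,
    if_true=Node(
        test=lambda c: c["cars_owned"] >= 1,
        if_true=Leaf("likely"),
        if_false=Leaf("unlikely"),
    ),
    if_false=Leaf("unlikely"),
)

customer_x = {"age": 41, "income": 120_000, "cars_owned": 2}
print(classify(tree, customer_x))
```

Unlike the regression formula, the result here is a category rather than a number, which is the essential difference between the two models.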
Clustering
Question: "Which age groups like the silver BMW M5?" The data can be mined to compare the ages of past car buyers with the colors of the cars they bought. From this data, it may turn out that a certain age group (22-30 year olds, for example) has a higher propensity to order a certain color of BMW M5 (75% buy blue). Similarly, it may show that a different age group (55-62, for example) tends to order silver BMWs (65% buy silver, 20% buy grey). When mined, the data tends to cluster around certain age groups and certain colors, letting the user quickly spot patterns in the data.
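The kind of pattern described above can be illustrated with a simple tally of colors per age group. Note the assumptions: the purchase records are invented, and the age brackets are supplied by hand, whereas a clustering algorithm would discover the groups on its own. This sketch only shows what the resulting pattern looks like.

```python
from collections import Counter, defaultdict

# Hypothetical (buyer age, car color) records.
purchases = [(24, "blue"), (27, "blue"), (29, "blue"), (23, "silver"),
             (57, "silver"), (60, "silver"), (61, "grey"), (58, "silver")]

# Hand-picked age brackets (inclusive); a real clustering run would find these.
age_groups = [(22, 30), (55, 62)]


def color_shares(purchases, age_groups):
    """Return, for each age group, the fraction of buyers choosing each color."""
    counts = defaultdict(Counter)
    for age, color in purchases:
        for low, high in age_groups:
            if low <= age <= high:
                counts[(low, high)][color] += 1
    return {group: {color: n / sum(c.values()) for color, n in c.items()}
            for group, c in counts.items()}


shares = color_shares(purchases, age_groups)
for group, by_color in shares.items():
    print(group, by_color)
```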
Nearest neighbor
Question: "When people purchase the BMW M5, what other options do they tend to buy at the same time?" The data can be mined to show that when people come in and buy a BMW M5, they also tend to buy the matching luggage. (This is also known as basket analysis.) Using this data, the car dealership can move the promotions for the matching luggage to a prominent spot in the showroom, or even run a newspaper ad offering free or discounted matching luggage to anyone who buys an M5, in an effort to increase sales.
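The basket analysis idea can be sketched as a co-occurrence count: for every order that contains a given item, count what else was in the order. This is a plain tally under made-up data, not the nearest-neighbor algorithm itself; the order contents and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical orders, each a set of items bought together.
orders = [
    {"BMW M5", "matching luggage"},
    {"BMW M5", "matching luggage", "floor mats"},
    {"BMW M5"},
    {"other model", "floor mats"},
]


def bought_with(item, orders):
    """Return the fraction of orders containing `item` that also contain
    each other item."""
    together = Counter()
    containing = 0
    for order in orders:
        if item in order:
            containing += 1
            together.update(order - {item})
    if containing == 0:
        return {}
    return {other: count / containing for other, count in together.items()}


print(bought_with("BMW M5", orders))
```

Items with a high fraction, such as the matching luggage in this invented data, are the natural candidates for the promotions described above.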