Data mining with Weka, Part 3: Nearest neighbor and server-side library


Introduction

In the first two articles of the "Data mining with WEKA" series, I introduced the concept of data mining. If you have not yet read "Data mining with Weka, Part 1: Introduction and regression" and "Data mining with Weka, Part 2: Classification and clustering", read those two articles first, because they cover key concepts you must understand before continuing. More importantly, in those two articles I discussed three of the techniques commonly used in data mining, each of which can transform seemingly unintelligible, useless data into meaningful rules and trends. The first technique is regression, which predicts a numerical output (such as a house value) from other sample data. The second is classification (i.e., the classification tree or decision tree), which creates an actual branching tree to predict the output value of an unknown data point. (In our case, we predicted the response to a BMW promotional campaign.) The third technique is clustering, which you can use to create groups of data (clusters) and identify trends and other rules from them (in our case, BMW sales). What the three have in common is that they all transform data into useful information, but their implementations and the data they work with vary, and that is the most important point of data mining: the right model must be used on the right data.

This article discusses the last of the four common data mining techniques: nearest neighbor. You will see that it is rather like a combination of classification and clustering, and it provides another useful weapon in our mission to turn seemingly useless data into meaningful information.

In our previous articles, we used WEKA as a standalone application. How useful is that in practice? Obviously, it's not ideal. Because WEKA is a Java-based application, it has a Java library that you can use in your own server-side code. This is probably the more common usage for most people, because you can write code that continually analyzes your data and adjusts dynamically, without having to rely on someone to extract the data, convert it to the WEKA format, and then run it through Weka Explorer.
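As a rough sketch of what that server-side use can look like, WEKA's `IBk` class (in `weka.classifiers.lazy`) implements the nearest-neighbor algorithm discussed below. This is only an illustration under assumptions: the file name `customers.arff` is a hypothetical placeholder, and the snippet assumes the last attribute of the data set is the output column.

```java
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaServerSide {
    public static void main(String[] args) throws Exception {
        // Load a (hypothetical) ARFF file instead of opening it in Weka Explorer.
        Instances data = new DataSource("customers.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // last column is the output

        IBk knn = new IBk(3); // a 3-nearest-neighbor model
        knn.buildClassifier(data);

        // Predict the class of the last instance (our "unknown" customer).
        double prediction = knn.classifyInstance(data.instance(data.numInstances() - 1));
        System.out.println("Predicted: " + data.classAttribute().value((int) prediction));
    }
}
```

The point is simply that everything Weka Explorer does interactively can be driven from your own code against the same library.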

Nearest neighbor

Nearest neighbor (also known as collaborative filtering or instance-based learning) is a very useful data mining technique that predicts the unknown output value of a new data instance from previous data instances whose output values are known. From this description, nearest neighbor sounds very similar to regression and classification. So how does it differ from those two? First, regression can only be used for numerical outputs, which is the most direct difference from nearest neighbor. Classification, as we saw in the previous article, uses every data instance to create a tree, and we have to traverse that tree to find the answer. This is a serious problem for some kinds of data. For example, companies like Amazon commonly offer a "customers who bought X also bought Y" feature; if Amazon were to create a classification tree, how many branches and nodes would it need? It carries hundreds of thousands of products. How big would the tree be? And how accurate can a tree of that magnitude be? Even then, a single branch might surprise you by offering only three products, while Amazon's pages typically recommend 12 products. For this kind of data, a classification tree is a very poor data mining model.

Nearest neighbor solves all of these problems very effectively, especially in the Amazon example above. It is not limited by quantity: it scales just as well from a 20-customer database to a 20-million-customer database, and you can define the number of results you want to get. It sounds like a great technique! It really is great, and it is probably the most useful one for any reader of this article who runs an e-commerce store.

Let's explore the mathematical theory behind nearest neighbor so that we can better understand the process, as well as some of the limitations of the technique.

The mathematical theory behind nearest neighbor

The mathematics behind nearest neighbor is very similar to the mathematics involved in clustering. For an unknown data point, we need to compute the distance between that unknown point and each known data point. Computing these distances with a spreadsheet would be tedious, but a high-performance computer can complete the calculations almost instantly. The easiest and most common distance computation is the "normalized Euclidean distance". It sounds complicated, but it isn't. Let's work through an example and figure out what product customer 5 is likely to buy.

Listing 1. Nearest neighbor math

Customer   Age   Income   Purchased Product
1      45    46k    Book
2      39    100k   TV
3      35    38k    DVD
4      69    150k   Car Cover
5      58    51k    ???

Step 1: Determine Distance Formula
Distance = SQRT( ((58 - Age)/(69 - 35))^2 + ((51000 - Income)/(150000 - 38000))^2 )

Step 2: Calculate the Score
Customer   Score   Purchased Product
1      .385     Book
2      .710     TV
3      .686     DVD
4      .941     Car Cover
5      0.0     ???

If we use the nearest neighbor algorithm to answer the question "what product is customer 5 most likely to buy?", the answer is a book. This is because the distance between customer 5 and customer 1 is shorter (a lot shorter, actually) than the distance between customer 5 and any other customer. Based on this model, we conclude that the customer most like customer 5 can predict customer 5's behavior.
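The calculation in Listing 1 can be sketched in plain Java, with no library needed. The attribute ranges (ages 35-69, incomes $38k-$150k) and the customer values come straight from the table above; the class and method names are my own:

```java
public class NearestNeighborDemo {
    // Normalized Euclidean distance between customer 5 (age 58, income 51k)
    // and another customer, using the attribute ranges from Listing 1.
    static double score(double age, double income) {
        double ageTerm = (58.0 - age) / (69.0 - 35.0);
        double incomeTerm = (51000.0 - income) / (150000.0 - 38000.0);
        return Math.sqrt(ageTerm * ageTerm + incomeTerm * incomeTerm);
    }

    public static void main(String[] args) {
        double[][] customers = { {45, 46000}, {39, 100000}, {35, 38000}, {69, 150000} };
        String[] products = { "Book", "TV", "DVD", "Car Cover" };
        int nearest = 0;
        double best = Double.MAX_VALUE;
        for (int i = 0; i < customers.length; i++) {
            double s = score(customers[i][0], customers[i][1]);
            System.out.printf("Customer %d: %.3f%n", i + 1, s);
            if (s < best) { best = s; nearest = i; } // keep the shortest distance
        }
        System.out.println("Customer 5 will most likely buy: " + products[nearest]);
    }
}
```

Running this prints the same four scores shown in the listing and picks customer 1's product, a book.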

But the benefits of nearest neighbor don't end there. The nearest neighbor algorithm can be extended beyond the single closest match to include any number of closest matches. These are called "N-nearest neighbors" (for example, 3-nearest neighbors). Back in the example above, if we want to know the two products customer 5 is most likely to buy, the conclusion is a book and a DVD. For the Amazon example, if you wanted to know the 12 products a customer is most likely to buy, you could run a 12-nearest-neighbor algorithm (though Amazon actually runs something far more sophisticated than a simple 12-nearest-neighbor algorithm).
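Extending the Listing 1 numbers to N-nearest neighbors is just a matter of sorting by score and keeping the first N products. A small sketch (the scores are the precomputed values from Listing 1; the names are my own):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TopKDemo {
    // Scores from Listing 1 for customers 1-4, and the products they bought.
    static final double[] SCORES = { 0.385, 0.710, 0.686, 0.941 };
    static final String[] PRODUCTS = { "Book", "TV", "DVD", "Car Cover" };

    // The k products bought by the customers closest to customer 5.
    static List<String> nearestProducts(int k) {
        return IntStream.range(0, SCORES.length)
                .boxed()
                .sorted(Comparator.comparingDouble(i -> SCORES[i]))
                .limit(k)
                .map(i -> PRODUCTS[i])
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("2-nearest neighbors suggest: " + nearestProducts(2));
    }
}
```

With k = 2 this returns the book and the DVD, matching the conclusion above; with k = 12 on a real catalog it would fill an Amazon-style recommendation strip.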

Moreover, the algorithm is not limited to predicting which product a customer will buy. It can also be used to predict a yes/no output value. Consider the example above with the last column changed (for customers 1 through 4) to "Yes, No, Yes, No". A 1-nearest-neighbor model would predict that customer 5 says "yes"; a 2-nearest-neighbor model would also predict "yes" (customers 1 and 3 both say "yes"); and a 3-nearest-neighbor model would still predict "yes" (customers 1 and 3 say "yes", customer 2 says "no", so the majority vote is "yes").

The last question to consider is "how many neighbors should we use in our model?" Aha, not everything is so simple. Determining the optimal number of neighbors requires experimentation. Also, if you need to predict a column whose values are 0 and 1, you should obviously choose an odd number of neighbors in order to break ties.
