Data mining with WEKA, Part 1: Introduction and regression

Brief introduction

What is data mining? You may find yourself asking this question more and more often, because the topic is getting increasing attention in technical circles. You may have heard that companies like Google and Yahoo! are collecting billions of data points about all their users, and you wonder, "What do they want all this information for?" You may also be surprised to learn that Walmart is one of the most advanced companies at mining data and applying the results to its business. Today, almost every company in the world uses data mining, and companies that do not will find themselves at a serious disadvantage in the near future.

So how can you and your company keep up with the tide of data mining?

We hope to answer all of your initial questions about data mining. We also want to introduce you to a free, open source software package, the Waikato Environment for Knowledge Analysis (WEKA), which you can use to mine your own data and turn what you know about your users, customers, and business into useful information for increasing revenue. You'll find that mining your data is not as difficult as you might think.

In addition, this article introduces the first data mining technique: regression, which means predicting the value of future data based on existing data. It is probably the easiest way to mine data, and you may already have done this kind of basic data mining in your favorite spreadsheet software (although WEKA can do much more complex calculations). Subsequent articles in this series will cover other methods of mining data, including clustering, nearest neighbor, and classification trees. (It doesn't matter if you don't yet know what these terms mean. We will introduce them in this series.)
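As a rough illustration of what "predicting the value of future data based on existing data" looks like in practice, the Python sketch below fits a straight line to a handful of points with ordinary least squares and then uses the line to estimate a new value. The house sizes and prices are made-up numbers, and fit_line and predict are illustrative helper names chosen here; none of this is WEKA's API, and WEKA's own regression handles many more cases.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Spread of x around its mean; zero means every x is identical.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    # How x and y vary together around their means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Estimate y for a new x using a fitted (slope, intercept) pair."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical existing data: house size (square feet) and selling price (dollars).
sizes = [1000, 1500, 2000, 2500]
prices = [150000, 200000, 250000, 300000]

model = fit_line(sizes, prices)
# Estimate the price of a house that is not in the existing data.
print(predict(model, 1800))
```

A spreadsheet trend line does essentially the same calculation for a single input column; the appeal of a tool like WEKA is that it can work with many input columns at once.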

What is data mining?

Data mining, at its core, is the transformation of large amounts of data into meaningful patterns and rules. It can be divided into two types: directed and undirected. In directed data mining, you try to predict a particular data point, for example, the selling price of a house based on the prices of other houses in the neighborhood.

In undirected data mining, you try to create groups of data or find patterns within existing data, for example, to identify a "middle-class women" demographic group. In effect, every U.S. census is an exercise in data mining: the government collects data about everyone in the country and turns it into useful information.
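To make the idea of creating groups from existing data a little more concrete, here is a minimal Python sketch that sorts a list of numbers into groups around moving centers. It uses a bare-bones one-dimensional k-means, a technique this article does not name and which is chosen here only for illustration. The income figures, the starting centers, and the function name kmeans_1d are all assumptions made for the example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around centers with a very small k-means loop.

    Returns the final centers and the groups from the last assignment pass.
    """
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to the closest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical yearly incomes, in thousands of dollars.
incomes = [22, 25, 27, 58, 61, 64, 120, 130]

centers, groups = kmeans_1d(incomes, centers=[20, 60, 130])
for center, group in zip(centers, groups):
    print(round(center, 1), group)
```

Nothing in the input says which group a value belongs to; the groups emerge from the data itself, which is what separates this kind of mining from predicting a specific value.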

Modern data mining began in the 1990s, when computing power had grown enough, and the cost of computing and storage had fallen enough, that companies could process and store their data in-house without relying on outside computing services.

In addition, the term data mining is all-encompassing and can refer to many techniques and processes for examining and transforming data. This series therefore only scratches the surface of what can be done with data mining. Data mining experts often hold doctorates in statistics and have 10 to 30 years of experience in the field. That may leave you with the impression that only large companies can afford data mining.

We hope to clear up these misconceptions and make it clear that data mining is neither as simple as running a spreadsheet function on a column of data nor as difficult as many people think. It is a good example of the 80/20 rule, or perhaps even a 90/10 rule. With 10 percent of the expertise of a so-called data mining expert, you can build a model that is 90 percent effective. Closing the remaining 10 percent to reach a perfect model would take 90 percent more time, perhaps as long as 20 years. So unless you plan to make data mining your career, "good enough" is all you need. Looked at another way, a "good enough" data mining model is probably better than whatever you are doing now.

The ultimate goal of data mining is to create a model that improves the way you interpret your existing and future data. Because there are so many data mining techniques, the main step in creating a good model is deciding which technique to use. That decision relies heavily on practice and experience, as well as on good guidance. After that, the model needs to be refined to make it more useful. After reading this series of articles, you should be able to look at your data set, determine the right technique to use, and then take the steps needed to refine it. You will be able to create a model that is good enough for your own data.

WEKA

Data mining is not solely the domain of large companies and expensive software. In fact, there is a piece of software that can do much of what the expensive packages do: WEKA. WEKA was developed at the University of Waikato (New Zealand) and was first implemented in its modern form in 1997. It is released under the GNU General Public License (GPL). The software is written in the Java™ language and contains a GUI for interacting with data files and producing visual results (such as tables and curves). It also has a general API, so you can embed WEKA in your own applications, as you would any other library, for tasks such as automated server-side data mining.

Let's continue and install WEKA. Because it is based on Java, if you do not have a JRE installed on your computer, download a version of WEKA that contains the JRE.

Figure 1. The Start screen of WEKA
