Data Mining Series (4): Mining Association Rules Using Weka

Source: Internet
Author: User

The previous articles in this series described several basic concepts and two basic algorithms for association rules. In commercial applications, however, writing the algorithms yourself matters less than understanding the data, getting a firm grip on it, and using the tools well. The earlier articles were about understanding the algorithms; this article introduces association rule mining with the open-source data mining tool Weka.

Weka dataset format: ARFF

Introduction to the standard ARFF dataset

Weka data files use the suffix .arff (Attribute-Relation File Format). An ARFF file is divided into several parts: comments, the relation name, the attribute declarations, and the data section. Comments begin with a percent sign (%), the relation name is declared with @relation, attributes are declared with @attribute, and the data section begins with @data. Look at this sample dataset (after installing Weka, you can find weather.numeric.arff in the data directory under the Weka installation directory):

% weather dataset
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

When an attribute is numeric, write numeric after the attribute name; if it takes discrete (enumerated) values, list the allowed values in curly braces. The data records start on the line after @data. The data is in matrix form, that is, every record has the same number of fields; a missing value is written as a question mark (?).
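To make the file layout concrete, here is a minimal Python sketch of a reader for the dense format described above. It is an illustration only, not Weka's own parser: the function name parse_arff is made up for this example, and it assumes unquoted values and only numeric and nominal attributes. The second record in the sample has its humidity replaced by ? purely to show how a missing value is handled.

```python
def parse_arff(text):
    """Parse a small dense ARFF document into (relation, attributes, rows).

    Simplified sketch: handles % comments, @relation, @attribute, @data,
    nominal {...} and numeric attributes, and ? for missing values.
    Quoted values, string/date attributes and sparse records are not handled.
    """
    relation = None
    attributes = []  # (name, kind); kind is a list of nominal values or a type name
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if not in_data and lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip("'\"")
        elif not in_data and lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif not in_data and lower.startswith("@data"):
            in_data = True
        elif in_data:
            fields = [f.strip() for f in line.split(",")]
            if len(fields) != len(attributes):
                raise ValueError(f"expected {len(attributes)} fields: {line!r}")
            row = []
            for field, (name, kind) in zip(fields, attributes):
                if field == "?":
                    row.append(None)  # missing value
                elif isinstance(kind, list):
                    if field not in kind:
                        raise ValueError(f"{field!r} is not a declared value of {name}")
                    row.append(field)
                elif kind in ("numeric", "real", "integer"):
                    row.append(float(field))
                else:
                    row.append(field)  # unknown type: keep the raw text
            rows.append(row)
    return relation, attributes, rows


sample = """\
% weather dataset
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
overcast,83,?,FALSE,yes
"""

relation, attributes, rows = parse_arff(sample)
print(relation)  # weather
print(rows[0])   # ['sunny', 85.0, 85.0, 'FALSE', 'no']
print(rows[1])   # ['overcast', 83.0, None, 'FALSE', 'yes']
```

Because nominal values are checked against the declared list exactly, a record such as Sunny,85,85,false,no would be rejected by this sketch; the spelling and case in the data section have to match the declaration.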

ARFF sparse datasets

When we mine association rules, for example in shopping basket analysis, the shopping list data is bound to be quite sparse: a supermarket may stock 10,000 kinds of goods, while each customer buys only a few of them. Representing such data in matrix form obviously wastes a lot of storage space, so we need a sparse representation. Look at our shopping list sample (Basket.txt):

freshmeat dairy confectionery
freshmeat confectionery cannedveg frozenmeal fish
dairy wine
freshmeat wine fish
fruitveg
softdrink beer fruitveg frozenmeal
fruitveg fish
fruitveg freshmeat dairy cannedveg wine fish
fruitveg fish
dairy cannedmeat frozenmeal fish

Each row of the dataset represents one shopping list. Before mining association rules, we can first map each product name to an ID number, so that the mining process only has to deal with ID numbers; once the rules have been mined, the IDs are mapped back to product names. Retail.txt is a retail dataset that has already been converted to ID numbers. The first few lines of the dataset are as follows:

1 2 3 4 5 6 7 8 9 ... 32 ...
... 58 ...
... 69 ...
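The name-to-ID mapping described above can be sketched in Python as follows. This is a minimal illustration under my own assumptions: the function names are made up for the example, IDs are assigned in order of first appearance (Retail.txt may have used a different numbering), and the three baskets are taken from the shopping list sample.

```python
def build_item_index(baskets):
    """Assign each distinct item name an integer ID in order of first appearance."""
    name_to_id = {}
    for basket in baskets:
        for name in basket:
            if name not in name_to_id:
                name_to_id[name] = len(name_to_id)
    id_to_name = {item_id: name for name, item_id in name_to_id.items()}
    return name_to_id, id_to_name


def encode(baskets, name_to_id):
    """Replace item names with IDs; each basket becomes a sorted list of IDs."""
    return [sorted({name_to_id[name] for name in basket}) for basket in baskets]


def decode(itemset, id_to_name):
    """Map the IDs of a mined itemset back to product names."""
    return [id_to_name[item_id] for item_id in itemset]


baskets = [
    ["freshmeat", "dairy", "confectionery"],
    ["dairy", "wine"],
    ["fruitveg", "fish"],
]

name_to_id, id_to_name = build_item_index(baskets)
encoded = encode(baskets, name_to_id)
print(encoded)                      # [[0, 1, 2], [1, 3], [4, 5]]
print(decode([1, 3], id_to_name))   # ['dairy', 'wine']
```

Mining then works on the integer lists only, and decode is applied to the item sets of the resulting rules at the very end.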

There are 16,469 items in this dataset, and the number of items on a single shopping list is far smaller than that, so the data table is sparse. Weka supports a sparse data representation, but I ran into a problem when using it with the Apriori algorithm. First, look at Weka's requirements for sparse data: a sparse ARFF file is the same as a standard one in every other part; the only difference is how the records after @data are written, as in the following example (Basket.arff):

@relation 'basket'
@attribute fruitveg {F, T}
@attribute freshmeat {F, T}
@attribute dairy {F, T}
@attribute cannedveg {F, T}
@attribute cannedmeat {F, T}
@attribute frozenmeal {F, T}
@attribute beer {F, T}
@attribute wine {F, T}
@attribute softdrink {F, T}
@attribute fish {F, T}
@attribute confectionery {F, T}
@data
{1 T, 2 T, 10 T}
{1 T, 10 T}
{3 T, 5 T, 6 T, 9 T}
{2 T, 7 T}
{1 T, 7 T, 9 T}
{0 T, 8 T}
{6 T}
{0 T, 5 T}
{0 T, 9 T}
{0 T, 1 T, 2 T, 3 T, 7 T, 9 T}
{0 T, 9 T}
{2 T, 4 T, 5 T, 9 T}
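Records in this sparse form do not have to be written by hand. The following Python sketch generates them from shopping lists; the function name to_sparse_arff is an assumption of this example, not part of Weka. Each record lists only the attributes that are present, as a 0-based attribute index followed by the value. F is declared first on purpose: Weka's ARFF documentation notes that values omitted from a sparse record are 0 rather than missing, which for a nominal attribute corresponds to the first declared value.

```python
def to_sparse_arff(relation, items, baskets):
    """Render shopping baskets as a sparse ARFF string.

    items   : ordered list of item names; the position is the attribute index
    baskets : iterable of collections of item names
    """
    index = {name: i for i, name in enumerate(items)}
    lines = [f"@relation '{relation}'"]
    # One {F, T} attribute per item; F comes first so that omitted means F.
    lines += [f"@attribute {name} {{F, T}}" for name in items]
    lines.append("@data")
    for basket in baskets:
        ids = sorted({index[name] for name in basket})
        lines.append("{" + ", ".join(f"{i} T" for i in ids) + "}")
    return "\n".join(lines) + "\n"


items = [
    "fruitveg", "freshmeat", "dairy", "cannedveg", "cannedmeat", "frozenmeal",
    "beer", "wine", "softdrink", "fish", "confectionery",
]
baskets = [
    ["freshmeat", "dairy", "confectionery"],
    ["dairy", "wine"],
    ["fruitveg", "fish"],
]

arff_text = to_sparse_arff("basket", items, baskets)
print(arff_text)
# The data section printed at the end is:
# {1 T, 2 T, 10 T}
# {2 T, 7 T}
# {0 T, 9 T}
```

An unknown item name raises a KeyError here, which is a deliberate choice for the sketch: it is better to fail than to silently drop a product from a basket.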
