The previous few articles described several basic concepts and two basic algorithms for association rules. In commercial applications, however, writing the algorithm yourself matters less than understanding the data, getting a firm grasp of it, and knowing how to use the tools. The earlier articles were about understanding the algorithms; this article introduces association rule mining with the open-source data mining tool Weka.
Weka dataset format: ARFF
Introduction to standard ARFF datasets
Weka data files use the .arff suffix (Attribute-Relation File Format). An ARFF file consists of several parts: comments, the relation name, attribute declarations, and the data section. Comments start with a percent sign (%), the relation name is declared with @relation, attributes are declared with @attribute, and the data section begins with @data. Consider this sample dataset (after installing Weka, you can find weather.numeric.arff in the data directory under the Weka installation directory):
% weather dataset
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
When an attribute is numeric, write numeric after the attribute name; if it takes discrete (enumerated) values, list the allowed values in curly braces. The data records start on the line after @data. The data is in matrix form, that is, every record has the same number of elements. A missing value is written as a question mark (?).
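To make the layout described above concrete, here is a minimal sketch of a reader for small dense ARFF files, written in plain Python. It is not Weka's own parser: the function name parse_arff is hypothetical, and it assumes one declaration per line, space-separated header fields, and no quoted values containing commas or spaces.

```python
def parse_arff(text):
    """Parse a small dense ARFF document into (relation, attributes, rows).

    attributes is a list of (name, kind) pairs, where kind is either a
    type name such as "numeric" or the list of allowed nominal values.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # blank line or comment
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            if len(values) != len(attributes):    # dense data is a matrix
                raise ValueError(f"wrong number of values: {line!r}")
            row = []
            for (name, kind), value in zip(attributes, values):
                if value == "?":                  # missing value
                    row.append(None)
                elif isinstance(kind, list):      # nominal attribute
                    if value not in kind:
                        raise ValueError(f"{value!r} not allowed for {name}")
                    row.append(value)
                elif kind in ("numeric", "real", "integer"):
                    row.append(float(value))
                else:                             # other types kept as text
                    row.append(value)
            rows.append(row)
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip().strip("'\"")
        elif keyword == "@attribute":
            name, _, spec = rest.strip().partition(" ")
            spec = spec.strip()
            if spec.startswith("{"):              # enumerated values
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


SAMPLE = """% weather dataset
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute windy {TRUE, FALSE}
@data
sunny,85,FALSE
rainy,?,TRUE
"""

relation, attributes, rows = parse_arff(SAMPLE)
```

Because nominal values are compared as exact strings here, a record containing false would be rejected against the declaration {TRUE, FALSE}, so the case of the data must match the case of the declaration.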
Sparse ARFF datasets
Association rule mining, for example market basket analysis, usually works on very sparse data: a supermarket may stock 10,000 kinds of goods, while each customer buys only a few of them. Representing such data in matrix form obviously wastes a lot of storage space, so we need a sparse representation. Consider our sample shopping lists (Basket.txt):
freshmeat dairy confectionery
freshmeat confectionery
cannedveg frozenmeal beer fish
dairy wine
freshmeat wine fish
fruitveg softdrink
beer
fruitveg frozenmeal
fruitveg fish
fruitveg freshmeat dairy cannedveg wine fish
fruitveg fish
dairy cannedmeat frozenmeal fish
Each row of the dataset represents one shopping list. For association rule mining we can first map each product name to an ID number and mine using only the IDs; once the rules have been found, the IDs are mapped back to product names. Retail.txt is a retail dataset that has already been converted to ID numbers. The first few lines of the dataset are as follows:
1 2 3 4 5 6 7 8 9 ...
...
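The name-to-ID mapping described above can be sketched in a few lines of Python. The helper names encode_baskets and decode_itemset are hypothetical, and numbering the IDs from 1 in order of first appearance is an assumption made for illustration; it is not necessarily how Retail.txt was produced.

```python
def encode_baskets(baskets):
    """Replace item names with integer IDs assigned on first appearance."""
    name_to_id = {}
    encoded = []
    for basket in baskets:
        ids = []
        for name in basket:
            if name not in name_to_id:
                # Assumption: IDs start at 1 and grow by one per new item.
                name_to_id[name] = len(name_to_id) + 1
            ids.append(name_to_id[name])
        encoded.append(ids)
    # Reverse table, used to translate mined rules back to product names.
    id_to_name = {item_id: name for name, item_id in name_to_id.items()}
    return encoded, id_to_name


def decode_itemset(itemset, id_to_name):
    """Translate a mined itemset of IDs back to product names."""
    return [id_to_name[item_id] for item_id in itemset]


baskets = [line.split() for line in [
    "freshmeat dairy confectionery",
    "freshmeat confectionery",
    "dairy wine",
]]
encoded, id_to_name = encode_baskets(baskets)
names = decode_itemset([1, 3], id_to_name)
```

The mining step then only ever sees small integers, and the reverse table is consulted once at the end, when the rules are reported.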
This dataset contains 16,469 distinct items, and the number of items in a single purchase is far smaller than the total number of items, so the data table is sparse. Weka supports sparse data representations, although I ran into a problem when using one with the Apriori algorithm. First look at Weka's requirements for sparse data: a sparse data file is the same as a standard one in every other part, and the only difference is in the data records after @data, as in the following example (Basket.arff):
@relation 'basket'
@attribute fruitveg {F, T}
@attribute freshmeat {F, T}
@attribute dairy {F, T}
@attribute cannedveg {F, T}
@attribute cannedmeat {F, T}
@attribute frozenmeal {F, T}
@attribute beer {F, T}
@attribute wine {F, T}
@attribute softdrink {F, T}
@attribute fish {F, T}
@attribute confectionery {F, T}
@data
{1 T, 2 T, 10 T}
{1 T, 10 T}
{3 T, 5 T, 6 T, 9 T}
{2 T, 7 T}
{1 T, 7 T, 9 T}
{0 T, 8 T}
{6 T}
{0 T, 5 T}
{0 T, 9 T}
{0 T, 1 T, 2 T, 3 T, 7 T, 9 T}
{0 T, 9 T}
{2 T, 4 T, 5 T, 9 T}
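As a rough illustration of how shopping lists turn into sparse records like the ones above, the following Python sketch writes each basket as a list of "index value" pairs inside curly braces. The function name to_sparse_arff is hypothetical, and the two-valued declaration with T for "bought" is an assumption mirroring the example. Attribute indices start at 0 and are emitted in ascending order. As far as I know, Weka reads an entry omitted from a sparse record as the first declared nominal value, not as a missing value.

```python
ITEMS = ["fruitveg", "freshmeat", "dairy", "cannedveg", "cannedmeat",
         "frozenmeal", "beer", "wine", "softdrink", "fish", "confectionery"]


def to_sparse_arff(relation, items, baskets):
    """Build sparse ARFF text: one {index value, ...} record per basket."""
    index = {name: i for i, name in enumerate(items)}   # 0-based positions
    lines = [f"@relation '{relation}'"]
    # Assumption: every item is a two-valued nominal attribute, T = bought.
    lines += [f"@attribute {name} {{F, T}}" for name in items]
    lines.append("@data")
    for basket in baskets:
        positions = sorted({index[name] for name in basket})
        lines.append("{" + ", ".join(f"{p} T" for p in positions) + "}")
    return "\n".join(lines)


arff_text = to_sparse_arff("basket", ITEMS, [
    ["freshmeat", "dairy", "confectionery"],
    ["dairy", "wine"],
])
print(arff_text)
```

Only the items actually bought appear in a record, so the size of each line depends on the basket and not on the 11 declared attributes.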