HotSpot association rule algorithm (2) -- mining continuous and discrete data and hot spot discretization

Last Update:2015-03-16 Source: Internet

Author: User

The code for this article can be downloaded at (link to be updated).

The previous article, HotSpot association rule algorithm (1) -- mining discrete data, analyzed HotSpot association rules for discrete data. This article covers HotSpot association rule mining on data that mixes discrete and continuous attributes.

1. First, take a look at the data format (txt document):

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

This data comes from weather.arff, and the format (for example, the @attribute declarations) follows Weka's data format. The code below expects data in exactly this format. The rules are:
1) The first m lines start with @attribute and declare the m attributes, the last of which is the target attribute.
2) For a numeric attribute, @attribute is followed by a space, the attribute name, another space, and the keyword numeric. For a discrete attribute, @attribute is followed by a space, the attribute name, another space, and the discrete values enclosed in braces and separated by commas.
3) The target attribute must be discrete. This is only a restriction of my code; the general HotSpot algorithm has no such requirement. If you need a continuous target attribute, you can modify my code accordingly.

2. Data Reading

The data reading code in "HotSpot association rule algorithm (1)" handles only discrete data, so it needs to be modified. After the change, only discrete values are encoded and continuous values are kept as they are. A Boolean array is also needed to record whether each attribute column is discrete or continuous. The reading code is as follows:

while ((tempString = reader.readLine()) != null) {
    // The leading lines are the attribute declarations (header)
    if (tempString.indexOf(HSUtils.FILEFORMAT) == 0) {
        String attr = "";
        String[] attrStates = null;
        if (tempString.contains("{")) { // discrete attribute
            attr = tempString.substring(HSUtils.FILEFORMAT.length(), tempString.indexOf("{")).trim();
            attrStates = tempString.substring(tempString.indexOf("{") + 1, tempString.indexOf("}")).split(",");
            for (int i = 0; i < attrStates.length; i++) {
                attrStates[i] = attrStates[i].trim();
            }
            numericList.add(false);
            this.attributeStates.put(attr, attrStates); // only discrete attributes have states
        } else { // numeric attribute
            if (tempString.contains("numeric")) {
                attr = tempString.substring(HSUtils.FILEFORMAT.length(), tempString.indexOf("numeric")).trim();
                numericList.add(true);
            } else {
                // neither discrete nor numeric: the data format is wrong
                throw new Exception("Data Format error, please check!");
            }
        }
        attrList.add(attr);
        line++;
        continue;
    }
    // First data line: the header is complete, so fix the attribute arrays
    if (flag) {
        this.attributes = new String[line];
        this.isNumeric = new Boolean[line];
        attrList.toArray(this.attributes); // copy the values into the arrays
        numericList.toArray(this.isNumeric);
        flag = false;
    }
    String[] tempStrings = tempString.split(splitter);
    lists.add(strArr2IntArr(tempStrings));
}

Only the body of the while loop is shown here. It initializes the variables according to the data format rules described above. The converted rows are stored in a List; an array could be used instead, and converting the List to an array may make later operations faster, so this is a good place to start if you want to optimize.
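The loop above calls strArr2IntArr, which is not shown in this article. As a rough sketch of what such a conversion might look like, assuming discrete values are replaced by their index in the declared state list and numeric values are parsed as-is (the class name RowEncoderSketch and the method encodeRow are hypothetical, not the author's code):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class RowEncoderSketch {
    /**
     * Encode one data line (hypothetical stand-in for strArr2IntArr).
     * Numeric columns are parsed as floats; discrete columns are replaced by
     * the index of the value in that attribute's declared state list.
     */
    static float[] encodeRow(String line, String[] attributes, boolean[] isNumeric,
                             Map<String, String[]> attributeStates) {
        String[] parts = line.split(",");
        if (parts.length != attributes.length) {
            throw new IllegalArgumentException("Expected " + attributes.length + " values: " + line);
        }
        float[] row = new float[parts.length];
        for (int i = 0; i < parts.length; i++) {
            String value = parts[i].trim();
            if (isNumeric[i]) {
                row[i] = Float.parseFloat(value);
            } else {
                int idx = Arrays.asList(attributeStates.get(attributes[i])).indexOf(value);
                if (idx < 0) {
                    throw new IllegalArgumentException("Unknown state: " + value);
                }
                row[i] = idx;
            }
        }
        return row;
    }

    public static void main(String[] args) {
        String[] attributes = {"outlook", "temperature", "humidity", "windy", "play"};
        boolean[] isNumeric = {false, true, true, false, false};
        Map<String, String[]> states = new LinkedHashMap<>();
        states.put("outlook", new String[] {"sunny", "overcast", "rainy"});
        states.put("windy", new String[] {"TRUE", "FALSE"});
        states.put("play", new String[] {"yes", "no"});

        float[] row = encodeRow("sunny,85,85,FALSE,no", attributes, isNumeric, states);
        System.out.println(Arrays.toString(row)); // [0.0, 85.0, 85.0, 1.0, 1.0]
    }
}
```

Storing everything as float keeps one row type for both kinds of column, at the cost of comparing encoded state indexes as floats later on.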

3. Node definition of the HotSpot association rule tree:

Because continuous attributes are now supported, each node needs an additional Boolean field, lessThan, which indicates whether the node matches data less than or equal to the node's value or data greater than it. In addition, for a continuous attribute stateIndex must hold the split value of the current node itself, rather than the index of a discrete state.
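The node class itself is not listed in this article. A minimal sketch of a node with these fields might look like the following; the field names splitAttrIndex, attrStateIndex, lessThan, stateCount and allCount are borrowed from the toString method shown later, while the class name, the numeric flag, and the matches and support methods are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class HotSpotNodeSketch {
    final int splitAttrIndex;    // column this node tests
    final float attrStateIndex;  // numeric: the split value; discrete: the encoded state index
    final boolean numeric;       // assumption: stored per node instead of looked up globally
    final boolean lessThan;      // numeric only: true means "<=", false means ">"
    final int stateCount;        // rows in this branch that have the target value
    final int allCount;          // all rows in this branch
    final List<HotSpotNodeSketch> children = new ArrayList<>();

    HotSpotNodeSketch(int splitAttrIndex, float attrStateIndex, boolean numeric,
                      boolean lessThan, int stateCount, int allCount) {
        this.splitAttrIndex = splitAttrIndex;
        this.attrStateIndex = attrStateIndex;
        this.numeric = numeric;
        this.lessThan = lessThan;
        this.stateCount = stateCount;
        this.allCount = allCount;
    }

    /** Fraction of rows in this branch that have the target value. */
    double support() {
        return allCount == 0 ? 0.0 : (double) stateCount / allCount;
    }

    /** True if the encoded row satisfies this node's test. */
    boolean matches(float[] row) {
        float v = row[splitAttrIndex];
        if (!numeric) {
            return v == attrStateIndex;
        }
        return lessThan ? v <= attrStateIndex : v > attrStateIndex;
    }

    public static void main(String[] args) {
        // Columns: outlook, temperature, humidity, windy, play
        HotSpotNodeSketch humid = new HotSpotNodeSketch(2, 90f, true, false, 2, 3);
        float[] row = {2f, 70f, 96f, 1f, 0f}; // rainy,70,96,FALSE,yes
        System.out.println(humid.matches(row));                      // true
        System.out.println(humid.stateCount + "/" + humid.allCount); // 2/3
    }
}
```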

4. Algorithm pseudocode (build process)

When the algorithm computes candidate nodes, continuous attributes are handled differently from discrete ones. The Weka source code uses evaluateNumeric for this, and my code follows the Weka source closely for this part. One thing to note is that evaluateNumeric sorts the data by one column, that is, the two-dimensional array is globally ordered by that column. In the Weka source, this sort uses the quickSort method of Instances (it is recursive, but I have not looked at it closely). Here I simply convert the List into a two-dimensional array and sort that. The method is as follows:

/**
 * Sort by attrIndex. attrIndex must be numeric. This method may need to be optimized.
 * Is a List or an array faster? Consider using an array.
 * @param intData
 * @param attrIndex
 * @return
 */
private List<float[]> sortBasedOnAttr(List<float[]> intData, final int attrIndex) {
    float[][] tmpData = new float[intData.size()][];
    intData.toArray(tmpData);
    Arrays.sort(tmpData, new Comparator<float[]>() {
        @Override
        public int compare(float[] o1, float[] o2) {
            if (o1[attrIndex] == o2[attrIndex]) {
                return 0;
            }
            return o1[attrIndex] > o2[attrIndex] ? 1 : -1;
        }
    });
    List<float[]> returnList = new ArrayList<float[]>();
    for (int i = 0; i < tmpData.length; i++) {
        returnList.add(tmpData[i]);
    }
    return returnList;
}
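To illustrate why the sort matters, here is a simplified sketch of scanning a sorted column for a split point. It is not Weka's evaluateNumeric and not the author's code: the class NumericSplitSketch, the Split holder, the bestSplit method and the minCount parameter (a stand-in for a minimum-support check) are all assumptions, and the sketch simply picks the split whose branch has the highest share of target rows. The data is the weather table reduced to temperature, humidity and play (yes = 0, no = 1).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NumericSplitSketch {
    /** Chosen threshold, its direction, and the counts in the selected branch. */
    static final class Split {
        final float value;
        final boolean lessThan;
        final int targetCount;
        final int count;

        Split(float value, boolean lessThan, int targetCount, int count) {
            this.value = value;
            this.lessThan = lessThan;
            this.targetCount = targetCount;
            this.count = count;
        }

        double ratio() {
            return (double) targetCount / count;
        }

        @Override
        public String toString() {
            return (lessThan ? "<= " : "> ") + value + " [" + targetCount + "/" + count + "]";
        }
    }

    /** One pass over the sorted column; both "<= v" and "> v" are tried at each distinct value. */
    static Split bestSplit(List<float[]> data, int attrIndex, int targetIndex,
                           float targetState, int minCount) {
        List<float[]> sorted = new ArrayList<>(data);
        sorted.sort(Comparator.comparingDouble(r -> r[attrIndex]));
        int n = sorted.size();
        int totalTarget = 0;
        for (float[] r : sorted) {
            if (r[targetIndex] == targetState) totalTarget++;
        }
        Split best = null;
        int seenTarget = 0; // target rows among the first i + 1 sorted rows
        for (int i = 0; i < n - 1; i++) {
            float[] row = sorted.get(i);
            if (row[targetIndex] == targetState) seenTarget++;
            float v = row[attrIndex];
            if (v == sorted.get(i + 1)[attrIndex]) continue; // split only between distinct values
            int left = i + 1;
            best = better(best, new Split(v, true, seenTarget, left), minCount);
            best = better(best, new Split(v, false, totalTarget - seenTarget, n - left), minCount);
        }
        return best;
    }

    private static Split better(Split current, Split candidate, int minCount) {
        if (candidate.count < minCount) return current;
        if (current == null || candidate.ratio() > current.ratio()) return candidate;
        return current;
    }

    /** Weather data as {temperature, humidity, play}, with play encoded yes = 0, no = 1. */
    static List<float[]> weather() {
        float[][] rows = {
            {85, 85, 1}, {80, 90, 1}, {83, 86, 0}, {70, 96, 0}, {68, 80, 0}, {65, 70, 1}, {64, 65, 0},
            {72, 95, 1}, {69, 70, 0}, {75, 80, 0}, {75, 70, 0}, {72, 90, 0}, {81, 75, 0}, {71, 91, 1}
        };
        List<float[]> list = new ArrayList<>();
        for (float[] r : rows) list.add(r);
        return list;
    }

    public static void main(String[] args) {
        System.out.println(bestSplit(weather(), 0, 2, 1f, 1)); // > 83.0 [1/1]
        System.out.println(bestSplit(weather(), 1, 2, 1f, 1)); // > 90.0 [2/3]
    }
}
```

Because the rows are ordered, the counts on both sides of every candidate threshold come from a single running total instead of a fresh pass over the data for each threshold.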

Likewise, when a child node is built recursively and the key for the node's rule is generated, numeric and discrete attributes are handled differently, as shown below:

double[] newSplitVals = splitVals.clone();
byte[] newTests = tests.clone();
newSplitVals[attrStateSup.getAttrIndex()] = attrStateSup.getStateIndex() + 1;
// Test type: 1 for numeric "<=", 3 for numeric ">", 2 for discrete "="
newTests[attrStateSup.getAttrIndex()] = isNumeric[attrStateSup.getAttrIndex()]
        ? (attrStateSup.isLessThan() ? (byte) 1 : (byte) 3)
        : (byte) 2;
HotSpotHashKey key = new HotSpotHashKey(newSplitVals, newTests);
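HotSpotHashKey itself is not shown in this article. Assuming its purpose is to act as a lookup key so that the same combination of tests is not expanded twice when it is reached in a different order, a key of that kind could be sketched as follows. The class RuleKey and its methods are hypothetical, and treating a test byte of 0 as "attribute not used" is also an assumption:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class RuleKeySketch {
    /** Hypothetical key: one split value and one test type per attribute. */
    static final class RuleKey {
        private final double[] splitVals;
        private final byte[] tests; // assumption: 0 = unused, 1 = "<=", 2 = "=", 3 = ">"

        private RuleKey(double[] splitVals, byte[] tests) {
            this.splitVals = splitVals;
            this.tests = tests;
        }

        static RuleKey empty(int attributeCount) {
            return new RuleKey(new double[attributeCount], new byte[attributeCount]);
        }

        /** Returns a new key with one more test; this key is left unchanged. */
        RuleKey with(int attrIndex, double value, byte test) {
            double[] v = splitVals.clone();
            byte[] t = tests.clone();
            v[attrIndex] = value;
            t[attrIndex] = test;
            return new RuleKey(v, t);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof RuleKey)) return false;
            RuleKey other = (RuleKey) o;
            return Arrays.equals(splitVals, other.splitVals) && Arrays.equals(tests, other.tests);
        }

        @Override
        public int hashCode() {
            return 31 * Arrays.hashCode(splitVals) + Arrays.hashCode(tests);
        }
    }

    public static void main(String[] args) {
        RuleKey root = RuleKey.empty(5);
        // The same two tests, reached in a different order
        RuleKey a = root.with(2, 90.0, (byte) 3).with(1, 70.0, (byte) 3);
        RuleKey b = root.with(1, 70.0, (byte) 3).with(2, 90.0, (byte) 3);
        Set<RuleKey> seen = new HashSet<>();
        System.out.println(seen.add(a)); // true
        System.out.println(seen.add(b)); // false
    }
}
```

Cloning the arrays before modifying them, as the snippet above also does, keeps the parent's key intact while each child gets its own.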

When building a child node recursively, you also need to adjust the subdataset generation method as follows:

/**
 * Get all rows that satisfy the split on splitAttributeIndex: for a numeric attribute,
 * rows with value <= splitValue (lessThan) or > splitValue; for a discrete attribute,
 * rows whose state equals splitValue.
 *
 * @param intData
 * @param splitAttributeIndex
 * @param splitValue
 * @param lessThan
 * @return
 */
private List<float[]> getSubData(List<float[]> intData, int splitAttributeIndex, float splitValue, boolean lessThan) {
    List<float[]> subData = new ArrayList<float[]>();
    for (float[] d : intData) {
        if (isNumeric[splitAttributeIndex]) {
            if (lessThan) {
                if (d[splitAttributeIndex] <= splitValue) {
                    subData.add(d);
                }
            } else {
                if (d[splitAttributeIndex] > splitValue) {
                    subData.add(d);
                }
            }
        } else {
            if (d[splitAttributeIndex] == splitValue) {
                subData.add(d);
            }
        }
    }
    return subData;
}
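As a small check of how such subsets chain together, the sketch below applies the same filtering rule to the weather data, reduced to temperature, humidity and play (yes = 0, no = 1). The class SubDataSketch and its subset and count helpers are hypothetical; unlike getSubData, the numeric flag is passed as a parameter so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

class SubDataSketch {
    /** Same rule as getSubData: "<=" or ">" for numeric columns, "==" for discrete ones. */
    static List<float[]> subset(List<float[]> data, int attr, float splitValue,
                                boolean numeric, boolean lessThan) {
        List<float[]> out = new ArrayList<>();
        for (float[] row : data) {
            boolean keep = numeric
                    ? (lessThan ? row[attr] <= splitValue : row[attr] > splitValue)
                    : row[attr] == splitValue;
            if (keep) out.add(row);
        }
        return out;
    }

    /** Number of rows whose column attr equals state. */
    static int count(List<float[]> data, int attr, float state) {
        int c = 0;
        for (float[] row : data) {
            if (row[attr] == state) c++;
        }
        return c;
    }

    /** Weather data as {temperature, humidity, play}, with play encoded yes = 0, no = 1. */
    static List<float[]> weather() {
        float[][] rows = {
            {85, 85, 1}, {80, 90, 1}, {83, 86, 0}, {70, 96, 0}, {68, 80, 0}, {65, 70, 1}, {64, 65, 0},
            {72, 95, 1}, {69, 70, 0}, {75, 80, 0}, {75, 70, 0}, {72, 90, 0}, {81, 75, 0}, {71, 91, 1}
        };
        List<float[]> list = new ArrayList<>();
        for (float[] r : rows) list.add(r);
        return list;
    }

    public static void main(String[] args) {
        List<float[]> humid = subset(weather(), 1, 90f, true, false);   // humidity > 90
        System.out.println(count(humid, 2, 1f) + "/" + humid.size());   // 2/3
        List<float[]> warm = subset(humid, 0, 70f, true, false);        // then temperature > 70
        System.out.println(count(warm, 2, 1f) + "/" + warm.size());     // 2/2
        List<float[]> drier = subset(humid, 1, 95f, true, true);        // or humidity <= 95
        System.out.println(count(drier, 2, 1f) + "/" + drier.size());   // 2/2
    }
}
```

The 2/3 and 2/2 counts match the humidity and temperature rules in the program output at the end of this article.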

The toString method of the node, used to print the HotSpot association rule tree:

/**
 * Format output
 */
public String toString() {
    // Operator: "<=" or ">" for a numeric attribute, "=" for a discrete one
    String tmp = HSUtils.isNumeric(splitAttrIndex) ? (this.lessThan ? "<=" : ">") : "=";
    // Numeric: print the split value; discrete: look up the state name by its index
    String attrState = HSUtils.isNumeric(splitAttrIndex)
            ? String.valueOf(this.attrStateIndex)
            : HSUtils.getAttrState(splitAttrIndex, (int) attrStateIndex);
    return HSUtils.getAttr(this.splitAttrIndex) + tmp + attrState
            + "(" + HSUtils.formatPercent(this.support)
            + "[" + this.stateCount + "/" + this.allCount + "])";
}

When printing the association rule tree, you also need to determine whether the current attribute is discrete or continuous.

Code output:

File reading is complete, and the attributes and their states are initialized!
Attribute outlook states: [sunny --> 0, overcast --> 1, rainy --> 2,]
Attribute temperature states: [numeric]
Attribute humidity states: [numeric]
Attribute windy states: [TRUE --> 0, FALSE --> 1,]
Attribute play states: [yes --> 0, no --> 1,]
The rule tree is as follows:
play = no (35.71% [5/14])
| temperature > 83.0 (100.00% [1/1])
| humidity > 90.0 (66.67% [2/3])
| temperature > 70.0 (100.00% [2/2])
| humidity <= 95.0 (100.00% [2/2])

Share, grow, and be happy

Down-to-earth, dedicated

When reprinting, please credit the blog address: http://blog.csdn.net/fansy1990

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
