Hotspot Association Rule Algorithm (2)--Mining Continuous and Discrete Data

Source: Internet
Author: User

The code is available for download (to be updated tomorrow).

The previous article, Hotspot Association Rule Algorithm (1)--Mining Discrete Data, analyzed hotspot association rules for discrete data only. This article extends the analysis to data that mixes discrete and continuous attributes.

1. First look at the data format (TXT document):

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}
    sunny,85,85,FALSE,no
    sunny,80,90,TRUE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes
    rainy,68,80,FALSE,yes
    rainy,65,70,TRUE,no
    overcast,64,65,TRUE,yes
    sunny,72,95,FALSE,no
    sunny,69,70,FALSE,yes
    rainy,75,80,FALSE,yes
    sunny,75,70,TRUE,yes
    overcast,72,90,TRUE,yes
    overcast,81,75,FALSE,yes
    rainy,71,91,TRUE,no

This data is based on Weka's weather.arff, and the format (the @attribute lines and so on) follows the Weka data format. The format expected by the code below is:

1) The first m lines start with @attribute and declare the m attributes; the last one is the target attribute.
2) For a numeric attribute, the line is @attribute, a space, the attribute name, a space, and then numeric. For a discrete attribute, the line is @attribute, a space, the attribute name, a space, and then the discrete values enclosed in curly braces and separated by commas.
3) The target attribute must be discrete. This requirement applies only to my code; the general hotspot algorithm does not impose it. If the target attribute has to be continuous, my code can be modified accordingly.

2. Reading the data

The reading code in Hotspot Association Rule Algorithm (1) handled discrete data only, so it has to be modified: discrete values are still encoded, continuous values are kept as they are, and a Boolean array records whether each attribute column is discrete or continuous. The reading code is as follows:

    while ((tempString = reader.readLine()) != null) {
        // Header lines (attribute declarations) come first
        if (tempString.indexOf(HSUtils.FILEFORMAT) == 0) {
            String attr = "";
            String[] attrStates = null;
            if (tempString.contains("{")) {
                // Discrete attribute: name, then the states inside curly braces
                attr = tempString.substring(HSUtils.FILEFORMAT.length(),
                        tempString.indexOf("{")).trim();
                attrStates = tempString.substring(tempString.indexOf("{") + 1,
                        tempString.indexOf("}")).split(",");
                for (int i = 0; i < attrStates.length; i++) {
                    attrStates[i] = attrStates[i].trim();
                }
                numericList.add(false);
                this.attributeStates.put(attr, attrStates); // register the states here
            } else { // numeric
                if (tempString.contains("numeric")) {
                    attr = tempString.substring(HSUtils.FILEFORMAT.length(),
                            tempString.indexOf("numeric")).trim();
                    numericList.add(true);
                } else {
                    // Neither discrete nor numeric: the data format is wrong
                    throw new Exception("Data format error, please check!");
                }
            }
            attrList.add(attr);
            line++;
            continue;
        }
        // First data line: freeze the header information into arrays
        if (flag) {
            this.attributes = new String[line];
            this.isNumeric = new Boolean[line];
            attrList.toArray(this.attributes); // copy the values into the array
            numericList.toArray(this.isNumeric);
            flag = false;
        }
        String[] tempStrings = tempString.split(splitter);
        lists.add(strArr2IntArr(tempStrings));
    }

Only the code inside the while loop is shown here; it initializes the attributes according to the format rules described above. (The converted data is stored in a List. Storing it in an array instead would generally make the subsequent operations faster, so this is a good place to start if you want to optimize.)
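The loop above hands each data row to a conversion helper (strArr2IntArr) whose body is not shown. As a rough illustration of "encode discrete columns, keep continuous ones", here is a minimal sketch. The class RowEncoder, its fields, and the case-insensitive state lookup are assumptions made for this example, not the original helper.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not the original strArr2IntArr): encodes one row as a float[]. */
class RowEncoder {
    private final String[] attributes;                   // attribute names, in column order
    private final boolean[] isNumeric;                   // true if the column is continuous
    private final Map<String, String[]> attributeStates; // discrete attribute -> its states

    RowEncoder(String[] attributes, boolean[] isNumeric, Map<String, String[]> attributeStates) {
        this.attributes = attributes;
        this.isNumeric = isNumeric;
        this.attributeStates = attributeStates;
    }

    /** Numeric cells are parsed as-is; discrete cells become the index of their state. */
    float[] encode(String[] row) {
        if (row.length != attributes.length) {
            throw new IllegalArgumentException("Expected " + attributes.length + " columns");
        }
        float[] out = new float[row.length];
        for (int i = 0; i < row.length; i++) {
            String cell = row[i].trim();
            out[i] = isNumeric[i]
                    ? Float.parseFloat(cell)
                    : indexOfState(attributeStates.get(attributes[i]), cell);
        }
        return out;
    }

    // Assumption: states are matched case-insensitively.
    private static int indexOfState(String[] states, String value) {
        for (int i = 0; i < states.length; i++) {
            if (states[i].equalsIgnoreCase(value)) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unknown state: " + value);
    }

    /** Encoder for the weather header shown in section 1. */
    static RowEncoder weatherEncoder() {
        Map<String, String[]> states = new LinkedHashMap<>();
        states.put("outlook", new String[] {"sunny", "overcast", "rainy"});
        states.put("windy", new String[] {"TRUE", "FALSE"});
        states.put("play", new String[] {"yes", "no"});
        return new RowEncoder(
                new String[] {"outlook", "temperature", "humidity", "windy", "play"},
                new boolean[] {false, true, true, false, false},
                states);
    }

    public static void main(String[] args) {
        float[] row = weatherEncoder().encode("sunny,85,85,FALSE,no".split(","));
        System.out.println(Arrays.toString(row)); // [0.0, 85.0, 85.0, 1.0, 1.0]
    }
}
```

Using a float for every column keeps one row in a single primitive array, at the cost of storing state indexes as floats.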

3. Node definition for the hotspot association rule tree:

Because continuous attributes are now supported, each node gets a Boolean variable lessThan that indicates whether the node tests for values less than or equal to the node's value, or greater than it. In addition, for continuous attributes stateIndex has to hold a numeric value (the node's split value) rather than the index of a discrete state.
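To make the node description concrete, here is a minimal sketch of such a node. The class name HotspotNodeSketch, the numeric flag stored on the node, and the matches method are assumptions for illustration; the real node class also carries support and count fields.

```java
/** Hypothetical node sketch; only the idea of lessThan plus a numeric value comes from the article. */
class HotspotNodeSketch {
    final int splitAttrIndex;   // column this node tests
    final float attrStateIndex; // numeric column: split value; discrete column: state index
    final boolean numeric;      // assumption: the node remembers whether its column is continuous
    final boolean lessThan;     // numeric only: true means "<=", false means ">"

    HotspotNodeSketch(int splitAttrIndex, float attrStateIndex, boolean numeric, boolean lessThan) {
        this.splitAttrIndex = splitAttrIndex;
        this.attrStateIndex = attrStateIndex;
        this.numeric = numeric;
        this.lessThan = lessThan;
    }

    /** Returns true if the encoded row satisfies this node's test. */
    boolean matches(float[] row) {
        float v = row[splitAttrIndex];
        if (!numeric) {
            return v == attrStateIndex;
        }
        return lessThan ? v <= attrStateIndex : v > attrStateIndex;
    }

    public static void main(String[] args) {
        // humidity is column 2 in the weather data
        HotspotNodeSketch humidityAbove90 = new HotspotNodeSketch(2, 90f, true, false);
        float[] row = {0f, 72f, 95f, 1f, 1f}; // sunny,72,95,FALSE,no
        System.out.println(humidityAbove90.matches(row)); // true
    }
}
```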

4. Algorithm pseudo-code (build process)

When computing candidate nodes in the algorithm's pseudo-code, continuous attributes are handled differently: they are evaluated with the evaluateNumeric method from the Weka source code. My code follows the Weka source closely for this part. One point to note is that evaluateNumeric sorts the data by one column, that is, the whole two-dimensional array is sorted globally on that column. The Weka source does this with the Quicksort method of Instances (it uses recursion; I have not studied it in detail). Here I convert the List to a two-dimensional array and sort that directly, as follows:

    // Requires java.util.ArrayList, java.util.Arrays, java.util.Comparator, java.util.List

    /**
     * Sorts the data by attrIndex; attrIndex must refer to a numeric attribute.
     * This method may need optimizing: would an array be faster than a List?
     * Consider using an array.
     *
     * @param intData   the encoded rows
     * @param attrIndex index of the numeric column to sort by
     * @return a new list containing the rows in ascending order of that column
     */
    private List<float[]> sortBasedOnAttr(List<float[]> intData, final int attrIndex) {
        float[][] tmpData = new float[intData.size()][];
        intData.toArray(tmpData);
        Arrays.sort(tmpData, new Comparator<float[]>() {
            @Override
            public int compare(float[] o1, float[] o2) {
                if (o1[attrIndex] == o2[attrIndex]) {
                    return 0;
                }
                return o1[attrIndex] > o2[attrIndex] ? 1 : -1;
            }
        });
        List<float[]> returnList = new ArrayList<float[]>();
        for (int i = 0; i < tmpData.length; i++) {
            returnList.add(tmpData[i]);
        }
        return returnList;
    }

When child nodes are built recursively, the node rule is also generated differently for numeric and discrete attributes, as follows:
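Once a column is sorted, candidate split points can be found in a single pass. The sketch below is a simplified stand-in and not Weka's evaluateNumeric: it picks the "<= v" or "> v" test with the highest share of target rows, subject only to a minimum target count. The names NumericSplitSketch, Split, bestSplit, and minCount are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Simplified numeric split search; an illustration, not Weka's evaluateNumeric. */
class NumericSplitSketch {

    /** One candidate test on a numeric column, with the counts behind its support. */
    static final class Split {
        final float value;      // split value
        final boolean lessThan; // true: "<= value", false: "> value"
        final int targetCount;  // matching rows that have the target state
        final int total;        // all matching rows

        Split(float value, boolean lessThan, int targetCount, int total) {
            this.value = value;
            this.lessThan = lessThan;
            this.targetCount = targetCount;
            this.total = total;
        }

        double support() {
            return (double) targetCount / total;
        }
    }

    static Split bestSplit(List<float[]> data, int attrIndex, int targetIndex,
                           float targetState, int minCount) {
        List<float[]> sorted = new ArrayList<>(data);
        sorted.sort(Comparator.comparingDouble(r -> r[attrIndex]));
        int n = sorted.size();
        int totalTarget = 0;
        for (float[] r : sorted) {
            if (r[targetIndex] == targetState) totalTarget++;
        }
        Split best = null;
        int leftTarget = 0; // target rows seen so far, i.e. on the "<=" side
        for (int i = 0; i < n - 1; i++) {
            float[] row = sorted.get(i);
            if (row[targetIndex] == targetState) leftTarget++;
            // A split is only possible between two distinct values.
            if (row[attrIndex] == sorted.get(i + 1)[attrIndex]) continue;
            int leftTotal = i + 1;
            best = better(best, new Split(row[attrIndex], true, leftTarget, leftTotal), minCount);
            best = better(best, new Split(row[attrIndex], false,
                    totalTarget - leftTarget, n - leftTotal), minCount);
        }
        return best; // null if no candidate reaches minCount
    }

    // Prefers higher support; ties go to the candidate covering more target rows.
    private static Split better(Split best, Split cand, int minCount) {
        if (cand.targetCount < minCount) return best;
        if (best == null || cand.support() > best.support()
                || (cand.support() == best.support() && cand.targetCount > best.targetCount)) {
            return cand;
        }
        return best;
    }

    /** The 14 weather rows, with outlook, windy and play already encoded as state indexes. */
    static List<float[]> weather() {
        return Arrays.asList(new float[][] {
            {0, 85, 85, 1, 1}, {0, 80, 90, 0, 1}, {1, 83, 86, 1, 0}, {2, 70, 96, 1, 0},
            {2, 68, 80, 1, 0}, {2, 65, 70, 0, 1}, {1, 64, 65, 0, 0}, {0, 72, 95, 1, 1},
            {0, 69, 70, 1, 0}, {2, 75, 80, 1, 0}, {0, 75, 70, 0, 0}, {1, 72, 90, 0, 0},
            {1, 81, 75, 1, 0}, {2, 71, 91, 0, 1}
        });
    }

    public static void main(String[] args) {
        // Column 2 is humidity, column 4 is play, state 1 is "no".
        Split s = bestSplit(weather(), 2, 4, 1f, 1);
        System.out.println("humidity " + (s.lessThan ? "<= " : "> ") + s.value
                + " [" + s.targetCount + "/" + s.total + "]"); // humidity > 90.0 [2/3]
    }
}
```

On the weather data with play = no as the target, this simplified scan selects temperature > 83 (1 of 1 rows) and humidity > 90 (2 of 3 rows). The full algorithm applies further thresholds that this sketch leaves out.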

    double[] newSplitVals = splitVals.clone();
    byte[] newTests = tests.clone();
    newSplitVals[attrStateSup.getAttrIndex()] = attrStateSup.getStateIndex() + 1;
    // Test type: numeric attributes use 1 ("<=") or 3 (">"); discrete attributes use 2 ("=")
    newTests[attrStateSup.getAttrIndex()] = isNumeric[attrStateSup.getAttrIndex()]
            ? attrStateSup.isLessThan() ? (byte) 1 : (byte) 3
            : (byte) 2;
    HotspotHashKey key = new HotspotHashKey(newSplitVals, newTests);

The way the child data set is generated during recursive construction also needs to be adjusted, as follows:
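The article does not show the key class or say how the key is used. One plausible use, and the assumption behind the sketch below, is recognizing that two branches apply the same set of tests in a different order, so the same rule is not expanded twice. Only the (splitVals, tests) pair and the codes 1, 2, 3 come from the snippet above; the class RuleKeySketch, the NO_TEST code, and the with method are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical key class comparing rules by their per-attribute split values and test types. */
final class RuleKeySketch {
    static final byte NO_TEST = 0;       // assumption: default value means "attribute not tested"
    static final byte LESS_OR_EQUAL = 1; // numeric "<="
    static final byte EQUAL = 2;         // discrete "="
    static final byte GREATER = 3;       // numeric ">"

    private final double[] splitVals;
    private final byte[] tests;

    RuleKeySketch(double[] splitVals, byte[] tests) {
        this.splitVals = splitVals.clone();
        this.tests = tests.clone();
    }

    /** Returns a copy of this key with one more test recorded at attrIndex. */
    RuleKeySketch with(int attrIndex, double value, byte test) {
        double[] v = splitVals.clone();
        byte[] t = tests.clone();
        v[attrIndex] = value;
        t[attrIndex] = test;
        return new RuleKeySketch(v, t);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof RuleKeySketch)) return false;
        RuleKeySketch other = (RuleKeySketch) o;
        return Arrays.equals(splitVals, other.splitVals) && Arrays.equals(tests, other.tests);
    }

    @Override
    public int hashCode() {
        return 31 * Arrays.hashCode(splitVals) + Arrays.hashCode(tests);
    }

    public static void main(String[] args) {
        RuleKeySketch empty = new RuleKeySketch(new double[5], new byte[5]);
        Set<RuleKeySketch> seen = new HashSet<>();
        // The same two tests reached in a different order produce an equal key.
        System.out.println(seen.add(empty.with(1, 83, GREATER).with(2, 90, GREATER))); // true
        System.out.println(seen.add(empty.with(2, 90, GREATER).with(1, 83, GREATER))); // false
    }
}
```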

    /**
     * Gets all rows that satisfy the split on the attribute at splitAttributeIndex:
     * equal to the state index for a discrete attribute, or "<=" / ">" the split
     * value for a numeric attribute.
     *
     * @param intData             the encoded rows
     * @param splitAttributeIndex index of the attribute being split on
     * @param splitValue          state index (discrete) or split value (numeric)
     * @param lessThan            numeric only: true selects "<=", false selects ">"
     * @return the rows that satisfy the split
     */
    private List<float[]> getSubData(List<float[]> intData, int splitAttributeIndex,
            float splitValue, boolean lessThan) {
        List<float[]> subData = new ArrayList<float[]>();
        for (float[] d : intData) {
            if (isNumeric[splitAttributeIndex]) {
                if (lessThan) {
                    if (d[splitAttributeIndex] <= splitValue) {
                        subData.add(d);
                    }
                } else {
                    if (d[splitAttributeIndex] > splitValue) {
                        subData.add(d);
                    }
                }
            } else {
                if (d[splitAttributeIndex] == splitValue) {
                    subData.add(d);
                }
            }
        }
        return subData;
    }

The toString method of the node, used to print the hotspot association rule tree:

    /**
     * Formatted output
     */
    public String toString() {
        String tmp = HSUtils.isNumeric(splitAttrIndex)
                ? this.lessThan ? " <= " : " > "
                : " = ";
        String attrState = HSUtils.isNumeric(splitAttrIndex)
                ? String.valueOf(this.attrStateIndex)
                : HSUtils.getAttrState(splitAttrIndex, (int) attrStateIndex);
        return HSUtils.getAttr(this.splitAttrIndex) + tmp + attrState
                + "  (" + HSUtils.formatPercent(this.support)
                + " [" + this.stateCount + "/" + this.allCount + "])";
    }

When printing an association rule tree, it is also necessary to determine whether the current attribute is discrete or continuous.


The code output is:

    File reading is complete; the attributes and their states have been initialized!
    Attribute outlook states: [sunny-->0,overcast-->1,rainy-->2,]
    Attribute temperature states: [numeric]
    Attribute humidity states: [numeric]
    Attribute windy states: [TRUE-->0,FALSE-->1,]
    Attribute play states: [yes-->0,no-->1,]
    The rule tree is as follows:
    play = no  (35.71% [5/14])
    | temperature > 83.0  (100.00% [1/1])
    | humidity > 90.0  (66.67% [2/3])
    | | temperature > 70.0  (100% [2/2])
    | | humidity <= 95.0  (100% [2/2])



Share, grow, be happy

Down-to-earth, focus

When reposting, please cite the blog address: http://blog.csdn.net/fansy1990


