HotSpot Association Rule Algorithm (2): Mining Continuous and Discrete Data

Source: Internet
Author: User

This code can be downloaded in

The previous article, HotSpot Association Rule Algorithm (1): Mining Discrete Data, analyzed HotSpot association rules for discrete data. This article covers mining HotSpot association rules from data that mixes discrete and continuous attributes.

1. First look at the data format (TXT document):

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {true, false}
    @attribute play {yes, no}
    sunny,85,85,false,no
    sunny,80,90,true,no
    overcast,83,86,false,yes
    rainy,70,96,false,yes
    rainy,68,80,false,yes
    rainy,65,70,true,no
    overcast,64,65,true,yes
    sunny,72,95,false,no
    sunny,69,70,false,yes
    rainy,75,80,false,yes
    sunny,75,70,true,yes
    overcast,72,90,true,yes
    overcast,81,75,false,yes
    rainy,71,91,true,no

This data is taken from the weather.arff file that ships with Weka, and the format, such as the @attribute lines, follows the Weka data format. The code below expects the format shown above, which can be described as follows:

1) The first m lines begin with @attribute and declare the m attributes. The last one is the target attribute.
2) If an attribute is numeric, the line is @attribute, a space, the attribute name, a space, and then numeric. If it is discrete, the line is @attribute, a space, the attribute name, a space, and then the discrete values enclosed in curly braces and separated by commas.
3) The target attribute must be discrete. (This is a requirement of my code only; the general HotSpot algorithm does not have it.)

If the target attribute needs to be continuous, my code can be modified to support that.

2. Reading the data

The data reading in HotSpot Association Rule Algorithm (1) was written only for discrete data, so it has to be changed. Continuous values are kept as they are, and a boolean array is needed to record whether each attribute column is discrete or continuous.

The replacement reading code is shown below:

    while ((tempString = reader.readLine()) != null) {
        // Header lines come first; each one declares an attribute
        if (tempString.indexOf(HSUtils.FILEFORMAT) == 0) {
            String attr = "";
            String[] attrStates = null;
            if (tempString.contains("{")) {
                // Discrete attribute: the name is followed by {v1, v2, ...}
                attr = tempString.substring(HSUtils.FILEFORMAT.length(),
                        tempString.indexOf("{")).trim();
                attrStates = tempString.substring(tempString.indexOf("{") + 1,
                        tempString.indexOf("}")).split(",");
                for (int i = 0; i < attrStates.length; i++) {
                    attrStates[i] = attrStates[i].trim();
                }
                numericList.add(false);
                this.attributeStates.put(attr, attrStates); // register the states here
            } else { // numeric
                if (tempString.contains("numeric")) {
                    attr = tempString.substring(HSUtils.FILEFORMAT.length(),
                            tempString.indexOf("numeric")).trim();
                    numericList.add(true);
                } else {
                    // Malformed header line
                    throw new Exception("Data format error, please check!");
                }
            }
            attrList.add(attr);
            line++;
            continue;
        }
        // First data row: freeze the header information into arrays
        if (flag) {
            this.attributes = new String[line];
            this.isNumeric = new Boolean[line];
            attrList.toArray(this.attributes); // copy the values into the array
            numericList.toArray(this.isNumeric);
            flag = false;
        }
        String[] tempStrings = tempString.split(splitter);
        lists.add(strArr2IntArr(tempStrings));
    }

This is only the code inside the while loop. It initializes the data according to the format rules described earlier. (The converted data is stored in a List here. An array could be used instead, and converting the List to an array would make later operations faster. If you want to optimize, this is a good place to start.)
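The loop above ends by passing each data row to a conversion helper whose body is not shown in this article. The following is a minimal sketch of what such a helper might do, assuming that numeric cells keep their value and discrete cells are replaced by the subscript of their state. The class name RowEncoder and all of its members are hypothetical and are not taken from the original code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not from the article): encodes one data row as float[]. */
final class RowEncoder {
    private final String[] attributes;          // attribute names in column order
    private final boolean[] isNumeric;          // true when the column is continuous
    private final Map<String, String[]> states; // discrete attribute -> declared values

    RowEncoder(String[] attributes, boolean[] isNumeric, Map<String, String[]> states) {
        this.attributes = attributes;
        this.isNumeric = isNumeric;
        this.states = states;
    }

    /** Numeric cells keep their value; discrete cells become the subscript of their state. */
    float[] encode(String line) {
        String[] cells = line.split(",");
        if (cells.length != attributes.length) {
            throw new IllegalArgumentException(
                    "Expected " + attributes.length + " values: " + line);
        }
        float[] row = new float[cells.length];
        for (int i = 0; i < cells.length; i++) {
            String cell = cells[i].trim();
            row[i] = isNumeric[i]
                    ? Float.parseFloat(cell)
                    : stateIndex(states.get(attributes[i]), cell);
        }
        return row;
    }

    private static int stateIndex(String[] values, String value) {
        for (int i = 0; i < values.length; i++) {
            if (values[i].equals(value)) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unknown discrete value: " + value);
    }

    public static void main(String[] args) {
        Map<String, String[]> states = new HashMap<>();
        states.put("outlook", new String[] {"sunny", "overcast", "rainy"});
        states.put("windy", new String[] {"true", "false"});
        states.put("play", new String[] {"yes", "no"});
        RowEncoder encoder = new RowEncoder(
                new String[] {"outlook", "temperature", "humidity", "windy", "play"},
                new boolean[] {false, true, true, false, false},
                states);
        // prints [2.0, 71.0, 91.0, 0.0, 1.0]
        System.out.println(Arrays.toString(encoder.encode("rainy,71,91,true,no")));
    }
}
```

Storing every column as a float keeps a row in a single primitive array, which is presumably why the later snippets work on List<float[]>.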

3. Node definition for the HotSpot association rule tree:

Because continuous attributes are now supported, each node gets an additional boolean field, lessThan, which indicates whether the node tests for values less than or equal to the node's value, or greater than it. In addition, stateIndex must now hold a numeric value (the value of the current node) rather than the subscript of a discrete state.
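As an illustration of the node described above, here is a small sketch of a node that carries the lessThan flag and a numeric split value. The class RuleNode and its field and method names are hypothetical; the article's own node class has more fields (counts, children) that are left out here.

```java
/** Hypothetical node of the rule tree; not the article's actual class. */
final class RuleNode {
    final int splitAttrIndex; // column tested by this node
    final float splitValue;   // numeric threshold, or state subscript for a discrete column
    final boolean numeric;    // whether the tested column is continuous
    final boolean lessThan;   // numeric only: true means "<=", false means ">"

    RuleNode(int splitAttrIndex, float splitValue, boolean numeric, boolean lessThan) {
        this.splitAttrIndex = splitAttrIndex;
        this.splitValue = splitValue;
        this.numeric = numeric;
        this.lessThan = lessThan;
    }

    /** Returns true when the encoded row satisfies this node's test. */
    boolean matches(float[] row) {
        float v = row[splitAttrIndex];
        if (!numeric) {
            return v == splitValue;
        }
        return lessThan ? v <= splitValue : v > splitValue;
    }

    /** Renders the test, for example "humidity > 90.0". */
    String describe(String attrName) {
        String op = numeric ? (lessThan ? " <= " : " > ") : " = ";
        return attrName + op + splitValue;
    }

    public static void main(String[] args) {
        RuleNode node = new RuleNode(2, 90f, true, false);
        float[] row = {2f, 71f, 91f, 0f, 1f}; // rainy,71,91,true,no in encoded form
        System.out.println(node.describe("humidity") + " matches: " + node.matches(row));
    }
}
```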

4. Algorithm pseudo-code (build process)

When a candidate node is evaluated in the algorithm's pseudo-code, the Weka source code uses a separate method for continuous attributes: evaluateNumeric. My code for this part follows the Weka source closely. One point to note is that once evaluateNumeric is called, the data is sorted by one column, that is, the whole two-dimensional array is sorted on that column. The Weka source does this with the quicksort method of Instances (recursive, not covered in detail here). My code simply converts the List into a two-dimensional array and sorts that, as follows:

    /**
     * Sort by attrIndex; attrIndex must be a numeric column.
     * This method may need optimizing: would an array be faster than a List?
     * Consider using arrays.
     * @param intData
     * @param attrIndex
     * @return
     */
    private List<float[]> sortBasedOnAttr(List<float[]> intData, final int attrIndex) {
        float[][] tmpData = new float[intData.size()][];
        intData.toArray(tmpData);
        Arrays.sort(tmpData, new Comparator<float[]>() {
            @Override
            public int compare(float[] o1, float[] o2) {
                if (o1[attrIndex] == o2[attrIndex]) {
                    return 0;
                }
                return o1[attrIndex] > o2[attrIndex] ? 1 : -1;
            }
        });
        List<float[]> returnList = new ArrayList<float[]>();
        for (int i = 0; i < tmpData.length; i++) {
            returnList.add(tmpData[i]);
        }
        return returnList;
    }

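To show why the sort matters, the sketch below scans a sorted numeric column once and, at every boundary between two distinct values, scores the two candidate tests "<= value" and "> value" by the share of target rows they capture. This is a simplified illustration under my own assumptions, not a port of Weka's evaluateNumeric, and it leaves out whatever extra checks the Weka method applies. The names NumericSplitSketch, Split, bestSplit and minCount are hypothetical, and the data is the weather table from section 1 encoded with state subscripts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of choosing a split point on a sorted numeric column. */
final class NumericSplitSketch {
    /** A candidate test on one column together with the counts behind it. */
    static final class Split {
        final float value; final boolean lessThan; final int hits; final int total;
        Split(float value, boolean lessThan, int hits, int total) {
            this.value = value; this.lessThan = lessThan; this.hits = hits; this.total = total;
        }
        double ratio() { return (double) hits / total; }
    }

    /** Returns the test with the highest target ratio, or null if none has minCount hits. */
    static Split bestSplit(List<float[]> data, int attrIndex, int targetIndex,
                           float targetState, int minCount) {
        List<float[]> sorted = new ArrayList<>(data);
        sorted.sort(Comparator.comparingDouble(r -> r[attrIndex]));
        int totalHits = 0;
        for (float[] r : sorted) {
            if (r[targetIndex] == targetState) totalHits++;
        }
        Split best = null;
        int leftHits = 0;
        for (int i = 0; i < sorted.size() - 1; i++) {
            if (sorted.get(i)[targetIndex] == targetState) leftHits++;
            float v = sorted.get(i)[attrIndex];
            if (v == sorted.get(i + 1)[attrIndex]) continue; // not a boundary yet
            int leftTotal = i + 1;
            best = better(best, new Split(v, true, leftHits, leftTotal), minCount);
            best = better(best, new Split(v, false, totalHits - leftHits,
                    sorted.size() - leftTotal), minCount);
        }
        return best;
    }

    private static Split better(Split current, Split candidate, int minCount) {
        if (candidate.hits < minCount) return current;
        return (current == null || candidate.ratio() > current.ratio()) ? candidate : current;
    }

    /** Weather data encoded as outlook, temperature, humidity, windy, play (no = 1). */
    static List<float[]> weather() {
        return Arrays.asList(
                new float[] {0, 85, 85, 1, 1}, new float[] {0, 80, 90, 0, 1},
                new float[] {1, 83, 86, 1, 0}, new float[] {2, 70, 96, 1, 0},
                new float[] {2, 68, 80, 1, 0}, new float[] {2, 65, 70, 0, 1},
                new float[] {1, 64, 65, 0, 0}, new float[] {0, 72, 95, 1, 1},
                new float[] {0, 69, 70, 1, 0}, new float[] {2, 75, 80, 1, 0},
                new float[] {0, 75, 70, 0, 0}, new float[] {1, 72, 90, 0, 0},
                new float[] {1, 81, 75, 1, 0}, new float[] {2, 71, 91, 0, 1});
    }

    public static void main(String[] args) {
        Split s = bestSplit(weather(), 2, 4, 1f, 1);
        System.out.println("humidity " + (s.lessThan ? "<= " : "> ") + s.value
                + " [" + s.hits + "/" + s.total + "]");
    }
}
```

Because the rows are sorted, the counts for both sides of every boundary come from a single pass with a running total, instead of re-filtering the data for each candidate value.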
Also, when child nodes are built recursively, the node's rule is generated differently for numeric and discrete attributes, as follows:

    double[] newSplitVals = splitVals.clone();
    byte[] newTests = tests.clone();
    newSplitVals[attrStateSup.getAttrIndex()] = attrStateSup.getStateIndex() + 1;
    // Test code: 1 = numeric "<=", 3 = numeric ">", 2 = discrete "="
    newTests[attrStateSup.getAttrIndex()] = isNumeric[attrStateSup.getAttrIndex()]
            ? (attrStateSup.isLessThan() ? (byte) 1 : (byte) 3)
            : (byte) 2;
    HotSpotHashKey key = new HotSpotHashKey(newSplitVals, newTests);

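The snippet above builds a key from the split values and the per-attribute test codes, but the key class itself is not shown. Presumably it serves as a lookup key so that the same combination of tests, reached through different branch orders, is only expanded once. The sketch below illustrates that idea with a value-based equals and hashCode; the class RuleKey and its method with are hypothetical, and only the codes 1, 2 and 3 come from the snippet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical lookup key: one split value and one test code per attribute. */
final class RuleKey {
    static final byte LESS_OR_EQUAL = 1; // numeric "<="
    static final byte EQUAL = 2;         // discrete "="
    static final byte GREATER = 3;       // numeric ">"

    private final double[] splitVals;
    private final byte[] tests;

    RuleKey(double[] splitVals, byte[] tests) {
        this.splitVals = splitVals.clone();
        this.tests = tests.clone();
    }

    /** Returns a copy of this key with one more attribute test filled in. */
    RuleKey with(int attrIndex, double value, byte test) {
        double[] v = splitVals.clone();
        byte[] t = tests.clone();
        v[attrIndex] = value;
        t[attrIndex] = test;
        return new RuleKey(v, t);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof RuleKey)) {
            return false;
        }
        RuleKey other = (RuleKey) o;
        return Arrays.equals(splitVals, other.splitVals) && Arrays.equals(tests, other.tests);
    }

    @Override
    public int hashCode() {
        return 31 * Arrays.hashCode(splitVals) + Arrays.hashCode(tests);
    }

    public static void main(String[] args) {
        RuleKey root = new RuleKey(new double[5], new byte[5]);
        // The same two tests applied in a different order
        RuleKey a = root.with(1, 70.0, GREATER).with(2, 90.0, GREATER);
        RuleKey b = root.with(2, 90.0, GREATER).with(1, 70.0, GREATER);
        Set<RuleKey> seen = new HashSet<>();
        System.out.println(seen.add(a)); // true: first time this rule is seen
        System.out.println(seen.add(b)); // false: same rule reached by another path
    }
}
```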
When child nodes are built recursively, the way the child dataset is generated also has to be adjusted, as follows:

    /**
     * Get all rows whose attribute at splitAttributeIndex satisfies the test
     * against splitValue.
     *
     * @param intData
     * @param splitAttributeIndex
     * @param splitValue
     * @return
     */
    private List<float[]> getSubData(List<float[]> intData, int splitAttributeIndex,
            float splitValue, boolean lessThan) {
        List<float[]> subData = new ArrayList<float[]>();
        for (float[] d : intData) {
            if (isNumeric[splitAttributeIndex]) {
                if (lessThan) {
                    if (d[splitAttributeIndex] <= splitValue) {
                        subData.add(d);
                    }
                } else {
                    if (d[splitAttributeIndex] > splitValue) {
                        subData.add(d);
                    }
                }
            } else {
                if (d[splitAttributeIndex] == splitValue) {
                    subData.add(d);
                }
            }
        }
        return subData;
    }

The toString method of a node, used when printing the HotSpot association rule tree:

    /**
     * Formatted output
     */
    public String toString() {
        String tmp = HSUtils.isNumeric(splitAttrIndex) ? (this.lessThan ? "<=" : ">") : "=";
        String attrState = HSUtils.isNumeric(splitAttrIndex)
                ? String.valueOf(this.attrStateIndex)
                : HSUtils.getAttrState(splitAttrIndex, (int) attrStateIndex);
        // The argument of formatPercent is missing from the source text and is left open here.
        return HSUtils.getAttr(this.splitAttrIndex) + tmp + attrState
                + " (" + HSUtils.formatPercent( + "[" + this.stateCount + "/" + this.allCount + "])";
    }

When the association rule tree is printed, the code likewise has to check whether the current attribute is discrete or continuous.

The code output is:

    File reading finished, and the attributes and their states are initialized!
    States of attribute outlook: [sunny-->0,overcast-->1,rainy-->2,]
    States of attribute temperature: [numeric]
    States of attribute humidity: [numeric]
    States of attribute windy: [true-->0,false-->1,]
    States of attribute play: [yes-->0,no-->1,]

    The rule tree is as follows:

    play = no  (35.71% [5/14])
    |  temperature > 83.0  (100.00% [1/1])
    |  humidity > 90.0  (66.67% [2/3])
    |  |  temperature > 70.0  (100% [2/2])
    |  |  humidity <= 95.0  (100% [2/2])
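Each line of the tree can be read as a conjunction of the tests on the path from the root, with the bracket showing how many matching rows have play = no out of all matching rows. The sketch below recomputes those counts directly from the weather data, encoded with the state subscripts printed above. The class RuleCheck and its support method are hypothetical and exist only for this check.

```java
import java.util.Locale;
import java.util.function.Predicate;

/** Hypothetical check: recomputes the [hits/total] counts shown in the rule tree. */
final class RuleCheck {
    static final int TEMPERATURE = 1, HUMIDITY = 2, PLAY = 4;
    static final float NO = 1f; // play: yes = 0, no = 1

    /** Weather data encoded as outlook, temperature, humidity, windy, play. */
    static final float[][] WEATHER = {
        {0, 85, 85, 1, 1}, {0, 80, 90, 0, 1}, {1, 83, 86, 1, 0}, {2, 70, 96, 1, 0},
        {2, 68, 80, 1, 0}, {2, 65, 70, 0, 1}, {1, 64, 65, 0, 0}, {0, 72, 95, 1, 1},
        {0, 69, 70, 1, 0}, {2, 75, 80, 1, 0}, {0, 75, 70, 0, 0}, {1, 72, 90, 0, 0},
        {1, 81, 75, 1, 0}, {2, 71, 91, 0, 1}
    };

    /** Returns {hits, total}: rows matching every condition, and how many have play = no. */
    @SafeVarargs
    static int[] support(Predicate<float[]>... conditions) {
        int hits = 0;
        int total = 0;
        for (float[] row : WEATHER) {
            boolean all = true;
            for (Predicate<float[]> condition : conditions) {
                if (!condition.test(row)) {
                    all = false;
                    break;
                }
            }
            if (!all) {
                continue;
            }
            total++;
            if (row[PLAY] == NO) {
                hits++;
            }
        }
        return new int[] {hits, total};
    }

    static String format(int[] s) {
        return String.format(Locale.ROOT, "%.2f%% [%d/%d]", 100.0 * s[0] / s[1], s[0], s[1]);
    }

    public static void main(String[] args) {
        System.out.println("play = no: " + format(support()));
        System.out.println("humidity > 90: " + format(support(r -> r[HUMIDITY] > 90)));
        System.out.println("humidity > 90 and temperature > 70: "
                + format(support(r -> r[HUMIDITY] > 90, r -> r[TEMPERATURE] > 70)));
    }
}
```

The counts agree with the printed tree: 5 of 14 rows at the root, 2 of 3 for humidity > 90, and 2 of 2 once temperature > 70 is added.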

Share. Grow up, be happy

Down-to-earth, focus

When reprinting, please cite the blog address:
