HotSpot Association Rule Algorithm (1) -- Mining Discrete Data


When association rule algorithms come up, people generally think of Apriori or FP-Growth; very few think of HotSpot. Either the algorithm has few applications, or my search skills are lacking: I found very little about it online. Only this page, http://wiki.pentaho.com/display/datamining/hotspot+segmentation-profiling, gives a brief analysis, and I have seen little else. Practical algorithm software such as Weka already contains the algorithm; it can be found under Associate --> HotSpot, and running it shows an interface like the following:

The parameters set in the red box are explained below:

-c last: the column where the target is located; "last" means the final column. A numeric value may also be given (e.g. 1 indicates the first column);

-V first: the index of the target state within the target column (from which you can see the target column should be discrete); "first" means index 0, and a numeric index may also be given;

-S 0.13: minimum support; it is multiplied by the total number of samples to obtain an absolute support count;

-M 2: the maximum branching factor, i.e. the maximum number of child nodes per node;

-I 0.01: interpreted in Weka as the minimum improvement in target value; whether this is the same as the traditional confidence is unclear;
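For example, converting the -S fraction into an absolute support count might look like this (a minimal sketch assuming a simple floor conversion; the exact rounding used by Weka may differ):

```java
public class SupportCount {
    public static void main(String[] args) {
        double minSupport = 0.13;   // the -S parameter
        int numInstances = 24;      // contact-lenses has 24 records
        // assumption: floor of (fraction * sample count)
        int minSupportCount = (int) (minSupport * numInstances);
        System.out.println(minSupportCount); // 3
    }
}
```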


Note: the code in this article follows Weka's implementation of the HotSpot algorithm and covers discrete data only; the code is available for download.

1. Data:

```
@attribute age {young, pre-presbyopic, presbyopic}
@attribute spectacle-prescrip {myope, hypermetrope}
@attribute astigmatism {no, yes}
@attribute tear-prod-rate {reduced, normal}
@attribute contact-lenses {soft, hard, none}
young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
...
presbyopic,hypermetrope,yes,normal,none
```
This data format follows Weka. The first 5 lines are included because the attributes need to be encoded, so the possible states of each attribute are collected in advance to simplify the later steps;
2. Single node definition:

```java
public class HSNode {
    private int splitAttrIndex;    // attribute index
    private int attrStateIndex;    // attribute state index
    private int allCount;          // number of records in the current dataset
    private int stateCount;        // number of records matching the target state
    private double support;        // support of this node
    private List<HSNode> children; // child nodes

    public HSNode() { }
}
```

splitAttrIndex is the index of the corresponding attribute astigmatism (2 here, counting from 0); attrStateIndex is the index of its state no (0 here); allCount is 12; stateCount is 5; support is 41.67% (i.e. 5/12); children holds the child nodes. (The indices come from the encoding derived from the first few lines of the file: e.g. age is the first attribute and is encoded as 0, and young, its first state, is also encoded as 0.)

3. Algorithm pseudo-code (a plain-text description, admittedly not very formal):

```
1. Create the root node;
2. Create child nodes:
   2.1 For all data, compute the support of each state of each attribute column:
       if support >= minSupport
           add that attribute state to the list of potential child nodes;
       end
   2.2 Iterate over the potential child node list:
       if the rule generated by the current node is NOT in the global rule set
           add the current node to the child node list;
           add the rule generated by the current node to the global rule set;
       end
   2.3 Iterate over the child node list:
       for each child node, go back to step 2 and recurse;
```
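The steps above can be sketched as a small self-contained program (a hypothetical illustration on toy data, not the author's implementation; the name buildChildren and the duplicate check via a string set are my own simplifications, whereas the actual code uses HotSpotHashKey for rule deduplication):

```java
import java.util.*;

public class HotSpotSketch {
    static final int MIN_SUPPORT_COUNT = 3;
    static final double MIN_IMPROVEMENT = 0.01;
    static Set<String> globalRules = new HashSet<String>(); // step 2.2: global rule set
    static List<String> accepted = new ArrayList<String>(); // rules in the order found

    // data: encoded rows; targetAttr/targetState identify the target column and state
    static void buildChildren(List<int[]> data, double parentTarget,
                              String rulePrefix, int targetAttr, int targetState) {
        int numAttrs = data.get(0).length;
        for (int attr = 0; attr < numAttrs; attr++) {            // step 2.1
            if (attr == targetAttr) continue;                    // never split on the target
            Map<Integer, int[]> counts = new TreeMap<Integer, int[]>(); // state -> {total, matches}
            for (int[] row : data) {
                int[] c = counts.get(row[attr]);
                if (c == null) counts.put(row[attr], c = new int[2]);
                c[0]++;
                if (row[targetAttr] == targetState) c[1]++;
            }
            for (Map.Entry<Integer, int[]> e : counts.entrySet()) {
                int total = e.getValue()[0], match = e.getValue()[1];
                if (total < MIN_SUPPORT_COUNT || match < MIN_SUPPORT_COUNT) continue;
                double merit = (double) match / total;
                if ((merit - parentTarget) / parentTarget < MIN_IMPROVEMENT) continue;
                String rule = (rulePrefix.isEmpty() ? "" : rulePrefix + " & ")
                        + "attr" + attr + "=" + e.getKey();
                if (!globalRules.add(rule)) continue;            // step 2.2: skip known rules
                accepted.add(rule + " (" + match + "/" + total + ")");
                List<int[]> sub = new ArrayList<int[]>();        // step 2.3: recurse on subset
                for (int[] row : data) if (row[attr] == e.getKey()) sub.add(row);
                buildChildren(sub, merit, rule, targetAttr, targetState);
            }
        }
    }

    public static void main(String[] args) {
        // toy data: columns attr0, attr1, class; target: class == 1 (4 of the 8 rows)
        List<int[]> data = Arrays.asList(
                new int[]{0, 0, 1}, new int[]{0, 1, 1}, new int[]{0, 0, 1}, new int[]{0, 1, 0},
                new int[]{1, 0, 0}, new int[]{1, 1, 0}, new int[]{1, 0, 0}, new int[]{1, 1, 1});
        buildChildren(data, 4.0 / 8, "", 2, 1);
        System.out.println(accepted); // only attr0=0 improves on the parent's 4/8
    }
}
```

On this toy data only attr0=0 (3 of its 4 rows match the target) survives the support and improvement tests; its own children all fail the minimum support count and the recursion stops.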

4. Implementation of the key steps:

4.1 Reading and initialization of data:

1) Read the header lines of the file and initialize two variables, attributes and attributeStates, which hold all attribute names and the states of each attribute respectively;

```java
while ((tempString = reader.readLine()) != null) {
    // header lines start with the @attribute marker
    if (tempString.indexOf(HSUtils.FILEFORMAT) == 0) {
        String attr = tempString.substring(HSUtils.FILEFORMAT.length(),
                tempString.indexOf("{")).trim();
        String[] attrStates = tempString.substring(tempString.indexOf("{") + 1,
                tempString.indexOf("}")).split(",");
        for (int i = 0; i < attrStates.length; i++) {
            attrStates[i] = attrStates[i].trim();
        }
        attrList.add(attr);
        line++;
        this.attributeStates.put(attr, attrStates);
        continue;
    }
    if (flag) {
        this.attributes = new String[line];
        attrList.toArray(this.attributes); // copy the values into the array
        flag = false;
    }
    String[] tempStrings = tempString.split(splitter);
    lists.add(strArr2IntArr(tempStrings));
}
```
2) The data lines that follow are converted into int arrays; the strArr2IntArr function is as follows:

```java
/**
 * Convert a String array to an int array
 */
private int[] strArr2IntArr(String[] sArr) throws Exception {
    int[] iArr = new int[sArr.length];
    for (int i = 0; i < sArr.length; i++) {
        iArr[i] = getAttrCode(sArr[i], i);
    }
    return iArr;
}

/**
 * Look up the code of attrState within attribute attrIndex
 */
private int getAttrCode(String attrState, int attrIndex) throws Exception {
    String[] attrStates = attributeStates.get(attributes[attrIndex]);
    for (int i = 0; i < attrStates.length; i++) {
        if (attrState.equals(attrStates[i])) {
            return i;
        }
    }
    throw new Exception("Coding error!"); // reaching here means the state was not found
}
```

The main work when reading the data is converting the discrete string values into numeric codes; the encoding rules are as follows:

```
age:                [young --> 0, pre-presbyopic --> 1, presbyopic --> 2]
spectacle-prescrip: [myope --> 0, hypermetrope --> 1]
astigmatism:        [no --> 0, yes --> 1]
tear-prod-rate:     [reduced --> 0, normal --> 1]
contact-lenses:     [soft --> 0, hard --> 1, none --> 2]
```
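As a quick check of the encoding, the first data row young,myope,no,reduced,none can be encoded by looking up each value's position in its attribute's state list (a standalone sketch; the state tables are copied from above):

```java
import java.util.Arrays;

public class EncodeDemo {
    public static void main(String[] args) {
        // state lists from the header, in attribute order
        String[][] states = {
            {"young", "pre-presbyopic", "presbyopic"},
            {"myope", "hypermetrope"},
            {"no", "yes"},
            {"reduced", "normal"},
            {"soft", "hard", "none"}
        };
        String[] record = {"young", "myope", "no", "reduced", "none"};
        int[] encoded = new int[record.length];
        for (int i = 0; i < record.length; i++) {
            // the code of a value is its index within the attribute's state list
            encoded[i] = Arrays.asList(states[i]).indexOf(record[i]);
        }
        System.out.println(Arrays.toString(encoded)); // [0, 0, 0, 0, 2]
    }
}
```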
4.2 Initializing the root node

```java
// read the file and build the encoded data
List<int[]> intData = readFileAndInitial(HSUtils.FILEPATH, HSUtils.SPLITTER);
int splitAttributeIndex = attributes.length - 1; // index of the target column
int stateIndex = HSUtils.LABELSTATE;
int numInstances = intData.size();               // total number of records
int[] labelStateCount = attrStateCount(intData, attributes.length - 1);
HSUtils.setMinSupportCount(numInstances);
double targetValue = 1.0 * labelStateCount[HSUtils.LABELSTATE] / numInstances;
// create the root node
HSNode root = new HSNode(splitAttributeIndex, stateIndex,
        labelStateCount[stateIndex], numInstances);
double[] splitVals = new double[attributes.length];
byte[] tests = new byte[attributes.length];
root.setChildren(constructChildrenNodes(intData, targetValue, splitVals, tests));
```
labelStateCount holds the counts of each state of the target attribute. The target state here is soft, which occurs 5 times among the 24 samples, so its support is 5/24 = 20.83%;

The constructChildrenNodes function creates all child nodes. It receives these parameters: intData, all the (encoded) data; targetValue, the support of the current node; and the splitVals and tests arrays, which are mainly used to generate the nodes' rules;

4.3 Creating a child node:

1) Calculate Potential child nodes:

The list of potential child nodes is built by calling evaluateAttr once per attribute. evaluateAttr decides, for each state of the attribute, whether it meets the requirements and, if so, adds it to the priority queue pq:

```java
/**
 * Decide whether states of attribute attrIndex should be added to pq as candidates
 */
private void evaluateAttr(PriorityQueue<AttrStateSup> pq, List<int[]> intData,
        int attrIndex, double targetValue) {
    int[] counts = attrStateCount(intData, attrIndex);
    boolean ok = false;
    // only consider attribute values that result in subsets that meet/exceed min support
    for (int i = 0; i < counts.length; i++) {
        if (counts[i] >= HSUtils.getMinSupportCount()) {
            ok = true;
            break;
        }
    }
    if (ok) {
        double subsetMatrix = 0.0;
        for (int stateIndex = 0; stateIndex < counts.length; stateIndex++) {
            subsetMatrix = attrStateCount(intData, attrIndex, stateIndex,
                    attributes.length - 1, HSUtils.LABELSTATE);
            if (counts[stateIndex] >= HSUtils.getMinSupportCount()
                    && subsetMatrix >= HSUtils.getMinSupportCount()) {
                double merit = 1.0 * subsetMatrix / counts[stateIndex];
                double delta = merit - targetValue;
                if (delta / targetValue >= HSUtils.MINCONFIDENCE) {
                    pq.add(new AttrStateSup(attrIndex, stateIndex,
                            counts[stateIndex], (int) subsetMatrix));
                }
            }
        }
    } // ok
}
```

Here the count of each state of attribute attrIndex within the current dataset is first collected into the counts[] array. If every state's count is below the minimum support count, the attribute is not considered as a candidate for pq. Otherwise, for each state, the number of records in which the chosen target state (e.g. soft) and the current attribute state (e.g. young) appear together is computed (2 for the first state) and assigned to subsetMatrix; if subsetMatrix >= the minimum support count, the confidence-like test (translated here as confidence; Weka calls it minimum improvement) is evaluated as in the code above, and states that pass are added to pq, the list of candidate child nodes;
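Plugging in the article's numbers for astigmatism=no against the target contact-lenses=soft shows the test passing (a standalone arithmetic check; the variable names mirror the code above, and the minimum support count of 3 assumes a floor conversion of 0.13 * 24):

```java
public class CandidateCheck {
    public static void main(String[] args) {
        int numInstances = 24;                    // total records
        int minSupportCount = 3;                  // 0.13 * 24, rounded down (assumption)
        double minConfidence = 0.01;              // Weka's minImprovement

        int stateCount = 12;                      // records with astigmatism=no
        int subsetMatrix = 5;                     // of those, records that are also soft
        double targetValue = 5.0 / numInstances;  // support of soft at the parent: 20.83%

        double merit = 1.0 * subsetMatrix / stateCount; // 5/12 = 41.67%
        double delta = merit - targetValue;
        boolean added = stateCount >= minSupportCount
                && subsetMatrix >= minSupportCount
                && delta / targetValue >= minConfidence;
        System.out.println(added); // true: astigmatism=no becomes a candidate
    }
}
```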

2) Create a global rule and add a child node

The global rules here are generated using HotSpotHashKey; the precise meaning of the rule encoding is not fully understood (it is probably related to the algorithm's underlying principle, but no relevant paper could be found).

After a node is added, its rule is recorded as well; this prevents a child node from being added again with a rule that an earlier child node already produced;

3) For each child node, process the children of that node:

```java
// process the child nodes
for (int i = 0; i < children.size(); i++) {
    HSNode child = children.get(i);
    child.setChildren(constructChildrenNodes(
            getSubData(intData, child.getSplitAttrIndex(), child.getAttrStateIndex()),
            child.getSupport(),
            keyList.get(i).getM_splitValues(),
            keyList.get(i).getM_testTypes()));
}
```
Recursion is used here to make the calls easy to handle. Note that generating a node's rule uses two arrays, newSplitValues and newTests, which must be passed downward; so when each child node generates its rule, the arrays are added to a keyList, and when iterating over the child nodes to process their children, the matching rule arrays can be found;

getSubData returns the records of the current dataset whose value for the given attribute equals the given state, as follows:

```java
/**
 * Return all records whose attribute splitAttributeIndex has state stateIndex
 */
private List<int[]> getSubData(List<int[]> intData,
        int splitAttributeIndex, int stateIndex) {
    List<int[]> subData = new ArrayList<int[]>();
    for (int[] d : intData) {
        if (d[splitAttributeIndex] == stateIndex) {
            subData.add(d);
        }
    }
    return subData;
}
```
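For example (a hypothetical standalone snippet using the same filtering idea), keeping the rows whose attribute 2 (astigmatism) equals state 0 (no) from three encoded records:

```java
import java.util.*;

public class SubsetDemo {
    // same filtering idea as getSubData: keep rows whose value at attrIndex equals stateIndex
    static List<int[]> getSubData(List<int[]> data, int attrIndex, int stateIndex) {
        List<int[]> sub = new ArrayList<int[]>();
        for (int[] d : data) {
            if (d[attrIndex] == stateIndex) sub.add(d);
        }
        return sub;
    }

    public static void main(String[] args) {
        List<int[]> data = Arrays.asList(
                new int[]{0, 0, 0, 0, 2},
                new int[]{0, 0, 0, 1, 0},
                new int[]{0, 0, 1, 0, 2});
        // keep rows where attribute 2 (astigmatism) has state 0 (no)
        System.out.println(getSubData(data, 2, 0).size()); // 2
    }
}
```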

4.4 Print Rule Tree

```java
/**
 * Print the rule tree
 */
public void printHSNode(HSNode node, int level) {
    printLevelTab(level);
    System.out.print(node + "\n");
    List<HSNode> children = node.getChildren();
    for (HSNode child : children) {
        printHSNode(child, level + 1);
    }
}
```
Note that the current node can be printed directly because its toString method is overridden, as follows:

```java
/**
 * Formatted output
 */
public String toString() {
    return HSUtils.getAttr(this.splitAttrIndex) + "="
            + HSUtils.getAttrState(splitAttrIndex, attrStateIndex)
            + "  (" + HSUtils.formatPercent(this.support)
            + " [" + this.stateCount + "/" + this.allCount + "])";
}
```

4.5 Algorithm Invocation:

```java
package fz.hotspot;

import fz.hotspot.dataobject.HSNode;

public class HotSpotTest {

    public static void main(String[] args) throws Exception {
        String file = "D:/jars/weka-src/data/contact-lenses.txt";
        int labelStateIndex = 0;     // index of the target state within the target column
        int maxBranches = 2;         // maximum number of branches
        double minSupport = 0.13;    // minimum support
        double minConfidence = 0.01; // minimum confidence (Weka calls this minImprovement)
        HotSpot hs = new HotSpot();
        HSNode root = hs.run(file, labelStateIndex, maxBranches, minSupport, minConfidence);
        System.out.println("\nThe rule tree is as follows:\n");
        hs.printHSNode(root, 0);
    }
}
```
The rules tree is printed as follows:

```
contact-lenses=soft  (20.83% [5/24])
|astigmatism=no  (41.67% [5/12])
| |tear-prod-rate=normal  (83.33% [5/6])
| | |spectacle-prescrip=hypermetrope  (100% [3/3])
| |spectacle-prescrip=hypermetrope  (50.00% [3/6])
|tear-prod-rate=normal  (41.67% [5/12])
| |spectacle-prescrip=hypermetrope  (50% [3/6])
```


It can be seen that this is consistent with what Weka gives.


Recently I have been reading "Dark Time", which suggests that ideas are best written down: doing so deepens your own understanding, exercises your ability to express things clearly in writing (an ability programmers tend to lack), and lets others probe the blind spots in your thinking.

The analysis of the algorithm in this article is my own understanding and represents only my own viewpoint.


Share, grow, be happy

Down-to-earth, focus

When reprinting, please cite the blog address: http://blog.csdn.net/fansy1990


