The CBA algorithm: classification based on association rules

Source: Internet
Author: User

More data mining algorithms: https://github.com/linyiqun/DataMiningAlgorithm

Introduction

CBA stands for Classification Based on Association; it is a classification algorithm built on association rules. Association rules bring to mind Apriori and FP-Tree, both of which are association rule mining algorithms. CBA uses Apriori to mine association rules and then uses those rules to classify, so in a sense it can be regarded as an integrated mining algorithm.

Algorithm principle

As a classification algorithm, CBA is given a number of known attribute values and asked to determine the value of the decision (class) attribute. The judgment is based on the frequent itemsets mined by the Apriori algorithm: if a frequent itemset contains both the known attribute values and a class attribute value, we check whether it yields an association rule from the known values to that class value. If the rule meets the minimum confidence, the class value in the frequent itemset is taken as the final classification result. The detailed steps are as follows:

1. Input the data records, one row of attribute values per record.

2. Replace each attribute value with a number (numbers are assigned by scanning each column from top to bottom), so that the records resemble the transaction records used by Apriori.

3. Run the Apriori algorithm on the transaction records to mine the frequent itemsets.

4. Take the attribute values of the query and find the frequent itemsets that satisfy it (they must contain the query attribute values and a class attribute value). If such an association rule can be derived, the classification succeeds and the class is output.
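The rule check in step 4 can be sketched in isolation. The class below is a hypothetical helper, not part of the article's code: it skips the Apriori mining step and computes the confidence of "query values => class value" directly, as support(query + class) divided by support(query). The four transactions are toy data that only borrow the article's numbering (3 = Senior, 9 = Fair, 10 = Excellent, 11 = CLassNo, 12 = CLassYes); they are not the article's data set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the confidence check; not the article's CBATool. */
final class RuleConfidenceSketch {

    /** Counts the transactions that contain every item in items. */
    static int supportCount(List<Set<Integer>> transactions, Set<Integer> items) {
        int count = 0;
        for (Set<Integer> t : transactions) {
            if (t.containsAll(items)) {
                count++;
            }
        }
        return count;
    }

    /** confidence(query => classId) = support(query + classId) / support(query). */
    static double confidence(List<Set<Integer>> transactions, Set<Integer> query, int classId) {
        int queryCount = supportCount(transactions, query);
        if (queryCount == 0) {
            return 0.0; // the query never occurs, so no rule can be derived
        }
        Set<Integer> whole = new HashSet<>(query);
        whole.add(classId);
        return supportCount(transactions, whole) * 1.0 / queryCount;
    }

    /** Returns the first class id whose rule reaches minConf, or null if none does. */
    static Integer classify(List<Set<Integer>> transactions, Set<Integer> query,
                            List<Integer> classIds, double minConf) {
        for (int classId : classIds) {
            if (confidence(transactions, query, classId) >= minConf) {
                return classId;
            }
        }
        return null;
    }

    /** Toy transactions (assumed for illustration only). */
    static List<Set<Integer>> toyTransactions() {
        List<Set<Integer>> transactions = new ArrayList<>();
        transactions.add(new HashSet<>(Arrays.asList(3, 9, 12)));
        transactions.add(new HashSet<>(Arrays.asList(3, 9, 12)));
        transactions.add(new HashSet<>(Arrays.asList(3, 10, 11)));
        transactions.add(new HashSet<>(Arrays.asList(3, 9, 11)));
        return transactions;
    }

    public static void main(String[] args) {
        Set<Integer> query = new HashSet<>(Arrays.asList(3, 9)); // Senior, Fair
        List<Integer> classIds = Arrays.asList(11, 12);
        System.out.println(confidence(toyTransactions(), query, 12));
        System.out.println(classify(toyTransactions(), query, classIds, 0.6));
    }
}
```

In the toy data the query {3, 9} occurs in three transactions and two of them carry class 12, so the rule toward class 12 has confidence two thirds: it passes a threshold of 0.6 and fails one of 0.7.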

The test data I used earlier for the CART algorithm is reused here as the test data for the CBA algorithm:

Rid Age Income Student CreditRating BuysComputer
1 … High No Fair CLassNo
2 … High No Excellent CLassNo
3 … High No Fair CLassYes
4 … Medium No Fair CLassYes
5 … Low Yes Fair CLassYes
6 … Low Yes Excellent CLassNo
7 … Low Yes Excellent CLassYes
8 … Medium No Fair CLassNo
9 9 Low Yes Fair CLassYes
10 … Medium Yes Fair CLassYes
11 … Medium Yes Excellent CLassYes
12 33 Medium No Excellent CLassYes
13 … High Yes Fair CLassYes
14 … Medium No Excellent CLassNo

Mapping of attribute values to numbers:

Medium=5, CLassYes=12, Excellent=10, Low=6, Fair=9, CLassNo=11, Young=1, Middle_aged=2, Yes=8, No=7, High=4, Senior=3
After this substitution, each record becomes a transaction, in basically the same format as the input to the Apriori algorithm. The join and prune steps that follow are standard Apriori steps and are not described in detail here; the Apriori implementation is covered in a separate article.
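The number substitution can be sketched on its own. The class below is a hypothetical illustration, not the article's code: it scans column by column, top to bottom, gives each new value the next number starting at 1, and then encodes every row. It uses only three sample rows whose Age column is already bucketed (an assumption made for the example), so the numbers it assigns differ from the full mapping listed above. Like the article's approach, it assumes a value string never appears in two different columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the value-to-number substitution. */
final class AttributeEncodingSketch {

    /** Scans column by column, top to bottom; each unseen value gets the next id. */
    static Map<String, Integer> buildMapping(List<String[]> rows) {
        Map<String, Integer> valueToId = new LinkedHashMap<>();
        if (rows.isEmpty()) {
            return valueToId;
        }
        int columns = rows.get(0).length;
        for (int col = 0; col < columns; col++) {
            for (String[] row : rows) {
                if (!valueToId.containsKey(row[col])) {
                    valueToId.put(row[col], valueToId.size() + 1);
                }
            }
        }
        return valueToId;
    }

    /** Replaces every value with its id, turning each row into a transaction. */
    static List<int[]> encode(List<String[]> rows, Map<String, Integer> valueToId) {
        List<int[]> transactions = new ArrayList<>();
        for (String[] row : rows) {
            int[] transaction = new int[row.length];
            for (int col = 0; col < row.length; col++) {
                transaction[col] = valueToId.get(row[col]);
            }
            transactions.add(transaction);
        }
        return transactions;
    }

    /** Three sample rows without the id column; the age buckets are assumed. */
    static List<String[]> sampleRows() {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Young", "High", "No", "Fair", "CLassNo"});
        rows.add(new String[] {"Middle_aged", "High", "No", "Fair", "CLassYes"});
        rows.add(new String[] {"Senior", "Medium", "No", "Fair", "CLassYes"});
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> rows = sampleRows();
        Map<String, Integer> mapping = buildMapping(rows);
        System.out.println(mapping);
        for (int[] transaction : encode(rows, mapping)) {
            System.out.println(Arrays.toString(transaction));
        }
    }
}
```

Keeping the inverse map as well (number back to value) is what lets the final class number be turned back into a readable label.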

Code implementation of the algorithm

The test data is the data shown above.

CBATool.java:

    package DataMining_CBA;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;

    import DataMining_CBA.AprioriTool.AprioriTool;
    import DataMining_CBA.AprioriTool.FrequentItem;

    /**
     * CBA algorithm (classification based on association rules) tool class
     *
     * @author lyq
     */
    public class CBATool {
        // Age column name and age category names
        public final String AGE = "Age";
        public final String AGE_YOUNG = "Young";
        public final String AGE_MIDDLE_AGED = "Middle_aged";
        public final String AGE_SENIOR = "Senior";
        // NOTE: the numeric age cut-offs were lost from the source text;
        // 20 and 40 are assumed values
        private static final int YOUNG_MAX_AGE = 20;
        private static final int MIDDLE_AGED_MAX_AGE = 40;

        // Test data file path
        private String filePath;
        // Minimum support threshold rate
        private double minSupportRate;
        // Minimum confidence threshold, used to decide whether a rule qualifies
        private double minConf;
        // Minimum support count
        private int minSupportCount;
        // Attribute column names
        private String[] attrNames;
        // Numbers that represent the class attribute values
        private ArrayList<Integer> classTypes;
        // Test data, one string array per row
        private ArrayList<String[]> totalDatas;
        // Apriori algorithm tool class
        private AprioriTool aprioriTool;
        // Attribute value to number map, and its inverse
        private HashMap<String, Integer> attr2Num;
        private HashMap<Integer, String> num2Attr;

        public CBATool(String filePath, double minSupportRate, double minConf) {
            this.filePath = filePath;
            this.minConf = minConf;
            this.minSupportRate = minSupportRate;
            readDataFile();
        }

        /**
         * Read data from file
         */
        private void readDataFile() {
            File file = new File(filePath);
            ArrayList<String[]> dataArray = new ArrayList<String[]>();

            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                String str;
                while ((str = in.readLine()) != null) {
                    // Columns are separated by a single space
                    dataArray.add(str.split(" "));
                }
            } catch (IOException e) {
                e.printStackTrace();
            }

            totalDatas = new ArrayList<>(dataArray);
            attrNames = totalDatas.get(0);
            minSupportCount = (int) (minSupportRate * totalDatas.size());

            attributeReplace();
        }

        /**
         * Replace attribute values with numbers so that frequent itemsets can be mined
         */
        private void attributeReplace() {
            int currentValue = 1;
            int num = 0;
            String s;
            attr2Num = new HashMap<>();
            num2Attr = new HashMap<>();
            classTypes = new ArrayList<>();

            // Scan column by column, top to bottom, skipping the header row and the id column
            for (int j = 1; j < attrNames.length; j++) {
                for (int i = 1; i < totalDatas.size(); i++) {
                    s = totalDatas.get(i)[j];
                    // Only the numeric Age column is bucketed here;
                    // other numeric columns could be handled in a similar way
                    if (attrNames[j].equals(AGE)) {
                        num = Integer.parseInt(s);
                        if (num <= YOUNG_MAX_AGE && num > 0) {
                            totalDatas.get(i)[j] = AGE_YOUNG;
                        } else if (num > YOUNG_MAX_AGE && num <= MIDDLE_AGED_MAX_AGE) {
                            totalDatas.get(i)[j] = AGE_MIDDLE_AGED;
                        } else if (num > MIDDLE_AGED_MAX_AGE) {
                            totalDatas.get(i)[j] = AGE_SENIOR;
                        }
                    }

                    if (!attr2Num.containsKey(totalDatas.get(i)[j])) {
                        attr2Num.put(totalDatas.get(i)[j], currentValue);
                        num2Attr.put(currentValue, totalDatas.get(i)[j]);
                        if (j == attrNames.length - 1) {
                            // The last column is the class column; record its numbers
                            classTypes.add(currentValue);
                        }
                        currentValue++;
                    }
                }
            }

            // Substitute the numbers into the raw data so each record looks like a transaction
            for (int i = 1; i < totalDatas.size(); i++) {
                for (int j = 1; j < attrNames.length; j++) {
                    s = totalDatas.get(i)[j];
                    if (attr2Num.containsKey(s)) {
                        totalDatas.get(i)[j] = attr2Num.get(s) + "";
                    }
                }
            }
        }

        /**
         * Use Apriori to compute all frequent itemsets
         *
         * @return all frequent itemsets
         */
        private ArrayList<FrequentItem> aprioriCalculate() {
            String[] tempArray;
            ArrayList<String[]> copyData = new ArrayList<>(totalDatas);

            // Remove the attribute name row
            copyData.remove(0);
            // Remove the id in the first column
            for (int i = 0; i < copyData.size(); i++) {
                String[] array = copyData.get(i);
                tempArray = new String[array.length - 1];
                System.arraycopy(array, 1, tempArray, 0, tempArray.length);
                copyData.set(i, tempArray);
            }

            aprioriTool = new AprioriTool(copyData, minSupportCount);
            aprioriTool.computeLink();
            return aprioriTool.getTotalFrequentItems();
        }

        /**
         * Classification based on association rules
         *
         * @param attrValues
         *            known attribute values, e.g. "Age=Senior,CreditRating=Fair"
         * @return the class label, or null if no rule qualifies
         */
        public String CBAJudge(String attrValues) {
            int value = 0;
            // Final class label
            String classType = null;
            String[] tempArray;
            // Known attribute values, as numbers
            ArrayList<String> attrValueList = new ArrayList<>();
            ArrayList<FrequentItem> totalFrequentItems;

            totalFrequentItems = aprioriCalculate();
            // Split the query into individual attribute conditions
            String[] array = attrValues.split(",");
            for (String record : array) {
                tempArray = record.split("=");
                value = attr2Num.get(tempArray[1]);
                attrValueList.add(value + "");
            }

            // Look for a qualifying itemset among the frequent itemsets
            for (FrequentItem item : totalFrequentItems) {
                // Skip itemsets too short to hold the query values plus a class value
                if (item.getIdArray().length < (attrValueList.size() + 1)) {
                    continue;
                }

                // The itemset must contain all the query attribute values
                if (itemIsSatisfied(item, attrValueList)) {
                    tempArray = item.getIdArray();
                    classType = classificationBaseRules(tempArray);

                    if (classType != null) {
                        // Map the number back to the class label
                        classType = num2Attr.get(Integer.parseInt(classType));
                        break;
                    }
                }
            }

            return classType;
        }

        /**
         * Derive a class from a frequent itemset if the rule is confident enough
         *
         * @param items
         *            frequent itemset
         * @return the class number as a string, or null if the rule fails
         */
        private String classificationBaseRules(String[] items) {
            String classType = null;
            String[] arrayTemp;
            // Records containing the non-class items (the rule's left-hand side)
            int count1 = 0;
            // Records containing all items of the itemset
            int count2 = 0;
            // Confidence
            double confidenceRate;

            String[] noClassTypeItems = new String[items.length - 1];
            for (int i = 0, k = 0; i < items.length; i++) {
                if (!classTypes.contains(Integer.parseInt(items[i]))) {
                    noClassTypeItems[k] = items[i];
                    k++;
                } else {
                    classType = items[i];
                }
            }

            for (String[] array : totalDatas) {
                // Remove the id column
                arrayTemp = new String[array.length - 1];
                System.arraycopy(array, 1, arrayTemp, 0, array.length - 1);
                if (isStrArrayContain(arrayTemp, noClassTypeItems)) {
                    count1++;

                    if (isStrArrayContain(arrayTemp, items)) {
                        count2++;
                    }
                }
            }

            // confidence = support(all items) / support(non-class items)
            confidenceRate = count2 * 1.0 / count1;
            if (confidenceRate >= minConf) {
                return classType;
            } else {
                // The minimum confidence is not met, so this association rule is invalid
                return null;
            }
        }

        /**
         * Determines whether a single string is contained in a string array
         *
         * @param array
         *            string array
         * @param s
         *            the string to look for
         * @return true if s is in array
         */
        private boolean strIsContained(String[] array, String s) {
            boolean isContained = false;

            for (String str : array) {
                if (str.equals(s)) {
                    isContained = true;
                    break;
                }
            }

            return isContained;
        }

        /**
         * Checks whether array2 is contained in array1; they need not be identical
         *
         * @param array1
         * @param array2
         * @return true if every element of array2 is in array1
         */
        private boolean isStrArrayContain(String[] array1, String[] array2) {
            boolean isContain = true;
            for (String s2 : array2) {
                isContain = false;
                for (String s1 : array1) {
                    // s2 counts as contained as soon as it is found in array1
                    if (s2.equals(s1)) {
                        isContain = true;
                        break;
                    }
                }

                // One missing element means array2 is not contained in array1
                if (!isContain) {
                    break;
                }
            }

            return isContain;
        }

        /**
         * Determines whether a frequent itemset satisfies a query
         *
         * @param item
         *            frequent itemset to check
         * @param attrValues
         *            list of query attribute values
         * @return true if the itemset holds all query values and a class value
         */
        private boolean itemIsSatisfied(FrequentItem item, ArrayList<String> attrValues) {
            boolean isContained = false;
            String[] array = item.getIdArray();

            for (String s : attrValues) {
                isContained = true;
                if (!strIsContained(array, s)) {
                    isContained = false;
                    break;
                }
            }

            if (isContained) {
                isContained = false;
                // The itemset must also contain a class attribute value
                for (Integer type : classTypes) {
                    if (strIsContained(array, type + "")) {
                        isContained = true;
                        break;
                    }
                }
            }

            return isContained;
        }
    }

The calling class, Client.java:

    package DataMining_CBA;

    import java.text.MessageFormat;

    /**
     * CBA algorithm: a classification algorithm based on association rules
     *
     * @author lyq
     */
    public class Client {
        public static void main(String[] args) {
            String filePath = "C:\\Users\\lyq\\Desktop\\icon\\input.txt";
            String attrDesc = "Age=Senior,CreditRating=Fair";
            String classification = null;

            // Minimum support threshold rate
            double minSupportRate = 0.2;
            // Minimum confidence threshold
            double minConf = 0.7;

            CBATool tool = new CBATool(filePath, minSupportRate, minConf);
            classification = tool.CBAJudge(attrDesc);
            // A literal apostrophe must be doubled inside a MessageFormat pattern
            System.out.println(MessageFormat.format(
                    "{0}''s association classification result is {1}",
                    attrDesc, classification));
        }
    }

The output of the code is:

    Frequent 1 itemsets:
    {1,},{2,},{3,},{4,},{5,},{6,},{7,},{8,},{9,},{10,},{11,},{12,},
    Frequent 2 itemsets:
    {1,7,},{1,9,},{1,11,},{2,12,},{3,5,},{3,8,},{3,9,},{3,12,},{4,7,},{4,9,},{5,7,},{5,9,},{5,10,},{5,12,},{6,8,},{6,12,},{7,9,},{7,10,},{7,11,},{7,12,},{8,9,},{8,10,},{8,12,},{9,12,},{10,11,},{10,12,},
    Frequent 3 itemsets:
    {1,7,11,},{3,9,12,},{6,8,12,},{8,9,12,},
    Frequent 4 itemsets:
    Frequent 5 itemsets:
    Frequent 6 itemsets:
    Frequent 7 itemsets:
    Frequent 8 itemsets:
    Frequent 9 itemsets:
    Frequent 10 itemsets:
    Frequent 11 itemsets:
    Age=Senior,CreditRating=Fair's association classification result is CLassYes

The sizes listed with nothing after them have no frequent itemsets. The Apriori algorithm class is covered separately; only the CBA part is shown here.

Analysis of the algorithm

Before implementing the CBA algorithm, I expected it to be a wrapper around the Apriori algorithm, and it is, with two additions: the input data is converted to numbers beforehand, and the numbers are converted back when the result is output. The core is still Apriori's frequent itemset mining.

Problems encountered during implementation

One bug I ran into was in sorting the frequent 1-itemsets. The cause turned out to be String.compareTo(): the order should have been 1, 2, ..., 11, 12, but with that method it became 1, 10, 2, ..., so 10 was treated as smaller than 2. After checking the comparison rules of String.compareTo(), I understood that it compares character code (ASCII) values position by position, and the leading 1 of 10 is smaller than 2. I then switched to comparing the values as integers. It is a small problem, but if the 1-itemsets are not in order, the subsequent join operation may miss itemsets; I have been bitten by this before.
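The difference between the two orderings is easy to reproduce. The following is a minimal sketch with hypothetical class and method names (not taken from the article's Apriori code): one method sorts the item ids with the default String ordering, the other parses them and compares them as integers.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical demo of string ordering versus numeric ordering of item ids. */
final class ItemSortSketch {

    /** Sorts with String.compareTo, i.e. character by character. */
    static String[] sortAsStrings(String[] ids) {
        String[] copy = ids.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Sorts by the integer value of each id. */
    static String[] sortAsIntegers(String[] ids) {
        String[] copy = ids.clone();
        Arrays.sort(copy, Comparator.comparingInt((String s) -> Integer.parseInt(s)));
        return copy;
    }

    public static void main(String[] args) {
        String[] ids = {"1", "2", "10", "11", "12", "3"};
        System.out.println(Arrays.toString(sortAsStrings(ids)));  // [1, 10, 11, 12, 2, 3]
        System.out.println(Arrays.toString(sortAsIntegers(ids))); // [1, 2, 3, 10, 11, 12]
    }
}
```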

My understanding of the CBA algorithm

The CBA algorithm cleverly uses association rules to assign classes, which sets it apart from other classification algorithms. How well it works depends on the quality of the underlying Apriori implementation.
