Association rule mining: learning the Apriori algorithm and its Java implementation
Association rule mining discovers interesting associations or correlations between item sets in large amounts of data. A typical example is market basket analysis: by discovering relationships between the different products that customers place in their shopping baskets, it analyzes customers' shopping habits, which helps retailers devise marketing strategies and guide sales. Abroad there is the story of beer and diapers; in China there is the story of instant noodles and ham. This article uses the Apriori algorithm as an example to introduce association rule mining and its Java implementation.
What are association rules:
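Let D be a set of records, and let A and B be item sets that occur in the records of D. An association rule is written as: A -> B [support(A -> B) = P(A ∪ B), confidence(A -> B) = P(B | A)], where P(A ∪ B) is the probability that a record contains every item of both A and B.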
Representation of association rules:
instant noodles -> ham [support = 0.2, confidence = 0.8]
A rule's support and confidence are two measures of how interesting it is; they reflect the usefulness and the certainty of the discovered rule, respectively. In the rule above, a support of 0.2 means that purchases containing both instant noodles and ham account for 20% of all records (in reality it should not be that many; otherwise people would be eating instant noodles every day). A confidence of 0.8 means that 80% of the people who buy instant noodles also buy ham at the same time (I belong to that 80% anyway).
If a mined association rule meets both the minimum support threshold and the minimum confidence threshold, the rule is considered interesting.
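To make the two measures concrete, here is a minimal sketch that computes them over a handful of baskets. The class name SupportConfidence, the helper basket, and the toy data are all made up for illustration; they are not part of the program shown later in this article.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SupportConfidence {
    // Hypothetical helper: builds one basket (transaction) from item names.
    static Set<String> basket(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // support(X): fraction of transactions that contain every item of X.
    static double support(List<Set<String>> transactions, Set<String> items) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(items)) {
                count++;
            }
        }
        return (double) count / transactions.size();
    }

    // confidence(A -> B) = support(A ∪ B) / support(A), i.e. P(B | A).
    static double confidence(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return support(transactions, union) / support(transactions, a);
    }

    public static void main(String[] args) {
        // Made-up toy data: five baskets.
        List<Set<String>> transactions = Arrays.asList(
                basket("noodles", "ham"),
                basket("noodles", "ham", "cola"),
                basket("noodles"),
                basket("bread", "milk"),
                basket("bread"));
        Set<String> a = basket("noodles");
        Set<String> b = basket("ham");
        System.out.println("support(noodles -> ham) = " + support(transactions, basket("noodles", "ham")));
        System.out.println("confidence(noodles -> ham) = " + confidence(transactions, a, b));
    }
}
```

In this toy data, two of the five baskets contain both items, and two of the three baskets with noodles also contain ham, so the support is 2/5 and the confidence is 2/3. A rule would then be kept only if both values reach the chosen thresholds.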
Important: all non-empty subsets of a frequent item set must also be frequent. (If a set fails the support test, all of its supersets fail it as well.)
The idea of the Apriori algorithm: an iterative, level-by-level search. First, find the set of frequent 1-item sets, denoted L1. L1 is used to find the set of frequent 2-item sets L2, L2 is used to find L3, and so on, until no frequent k-item set can be found.
Each iteration of the Apriori algorithm has two phases:
1. Join step: to find L(k), a set of candidate k-item sets is generated by joining L(k-1) with itself.
2. Prune step: candidates that are not frequent are removed according to their support counts, which gives the frequent set L(k). The iteration repeats until no item set meeting the minimum support can be generated.
Important property: the fact that all non-empty subsets of a frequent item set must also be frequent is what pruning relies on. A candidate can be deleted as soon as one of its subsets is not frequent, which greatly reduces the amount of data.
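The join step and the subset-based pruning can be sketched as follows. This is the textbook prefix-join variant, not the exact join used in the program later in this article, and it assumes that every item set is stored as a sorted list with no duplicates. The class name AprioriGen and its method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AprioriGen {
    // Builds candidate k-item sets from the frequent (k-1)-item sets.
    // Assumption: each item set is a sorted list and all item sets have the same size.
    static List<List<String>> generateCandidates(List<List<String>> frequentPrev) {
        Set<List<String>> frequentSet = new HashSet<>(frequentPrev);
        List<List<String>> candidates = new ArrayList<>();
        for (int i = 0; i < frequentPrev.size(); i++) {
            for (int j = i + 1; j < frequentPrev.size(); j++) {
                List<String> a = frequentPrev.get(i);
                List<String> b = frequentPrev.get(j);
                int last = a.size() - 1;
                // Join step: only join sets that share their first k-2 items.
                if (b.size() != a.size() || !a.subList(0, last).equals(b.subList(0, last))) {
                    continue;
                }
                String lastA = a.get(last);
                String lastB = b.get(last);
                if (lastA.equals(lastB)) {
                    continue;
                }
                List<String> candidate = new ArrayList<>(a.subList(0, last));
                // Keep the candidate sorted so later joins still work.
                if (lastA.compareTo(lastB) < 0) {
                    candidate.add(lastA);
                    candidate.add(lastB);
                } else {
                    candidate.add(lastB);
                    candidate.add(lastA);
                }
                // Prune step: every (k-1)-subset must itself be frequent.
                if (allSubsetsFrequent(candidate, frequentSet)) {
                    candidates.add(candidate);
                }
            }
        }
        return candidates;
    }

    // Checks each subset obtained by leaving out exactly one item.
    static boolean allSubsetsFrequent(List<String> candidate, Set<List<String>> frequentSet) {
        for (int skip = 0; skip < candidate.size(); skip++) {
            List<String> subset = new ArrayList<>(candidate);
            subset.remove(skip); // removes the element at index skip
            if (!frequentSet.contains(subset)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up frequent 2-item sets.
        List<List<String>> l2 = Arrays.asList(
                Arrays.asList("A", "B"),
                Arrays.asList("A", "C"),
                Arrays.asList("B", "C"),
                Arrays.asList("B", "D"));
        System.out.println(generateCandidates(l2)); // prints [[A, B, C]]
    }
}
```

In the made-up input, {A, B} and {A, C} join into {A, B, C}, which survives because {B, C} is also frequent. {B, C} and {B, D} join into {B, C, D}, which is pruned without scanning the data because {C, D} is not frequent.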
[Figure: flowchart of the Apriori algorithm]
[Figure: worked example]
The code is pasted directly below. Some parts are a little verbose; the program is long mainly because it prints every stage of the mining process to the console, which makes the algorithm easier to follow.
The algorithm logic itself is clear, though, and essentially comes down to a single while loop.
package cluster;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Maximal frequent pattern mining implemented with the Apriori algorithm.
 * Uses support only; confidence is not calculated.
 * @author push_pop
 */
public class AprioriMyself {
    private static final double MIN_SUPPORT = 0.2; // minimum support
    private static boolean endTag = false;         // loop state: true once no frequent item set is found
    // Dataset. Row 0 holds the item names and column 0 of every row is the record id,
    // which is why the loops below start at index 1.
    static List<List<String>> record = new ArrayList<>();

    public static void main(String[] args) {
        // ---------- read the data ----------
        record = getRecord();
        System.out.println("Dataset record read in matrix form:");
        printItemsets(record);

        // ---------- candidate 1-item sets ----------
        List<List<String>> candidateItemset = findFirstCandidate();
        System.out.println("Candidate 1-item sets after the first scan:");
        printItemsets(candidateItemset);

        // ---------- frequent 1-item sets ----------
        List<List<String>> frequentItemset = getSupportedItemset(candidateItemset);
        System.out.println("Frequent 1-item sets after the first scan:");
        printItemsets(frequentItemset);

        // ---------- iteration ----------
        while (!endTag) {
            // Join: candidate k-item sets are built from the frequent (k-1)-item sets.
            List<List<String>> nextCandidateItemset = getNextCandidate(frequentItemset);
            System.out.println("Candidate item sets after the join:");
            printItemsets(nextCandidateItemset);

            // Prune: frequent k-item sets are selected from the candidate k-item sets.
            List<List<String>> nextFrequentItemset = getSupportedItemset(nextCandidateItemset);
            System.out.println("Frequent item sets after scanning:");
            printItemsets(nextFrequentItemset);

            // If the loop is about to end, output the last non-empty frequent sets.
            if (endTag) {
                System.out.println("Apriori algorithm ---> frequent item sets:");
                printItemsets(frequentItemset);
            }

            // Initial values for the next iteration.
            candidateItemset = nextCandidateItemset;
            frequentItemset = nextFrequentItemset;
        }
    }

    /** Prints one item set per line, items separated by a space. */
    private static void printItemsets(List<List<String>> itemsets) {
        for (List<String> itemset : itemsets) {
            for (String item : itemset) {
                System.out.print(item + " ");
            }
            System.out.println();
        }
    }

    /** Returns the items of a record, i.e. everything after the id in column 0. */
    private static List<String> itemsOf(List<String> row) {
        return row.size() > 1 ? row.subList(1, row.size()) : new ArrayList<String>();
    }

    /**
     * Reads the data from a txt file.
     * @return the records, with the header line as row 0
     */
    public static List<List<String>> getRecord() {
        List<List<String>> record = new ArrayList<>();
        String encoding = "GBK"; // character encoding (avoids garbled Chinese characters)
        File file = new File("simple.txt");
        if (!file.isFile()) {
            System.out.println("The specified file cannot be found!");
            return record;
        }
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), encoding))) {
            String lineTXT;
            while ((lineTXT = bufferedReader.readLine()) != null) { // read the file line by line
                if (lineTXT.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                // Assumption: fields are separated by whitespace (the original separator is not recoverable).
                String[] lineString = lineTXT.trim().split("\\s+");
                boolean isHeader = record.isEmpty(); // the header line is stored verbatim
                List<String> lineList = new ArrayList<>();
                for (int i = 0; i < lineString.length; i++) {
                    String cell = lineString[i];
                    // Handle T, F, YES, NO cells of a matrix-form file.
                    if (!isHeader && (cell.endsWith("T") || cell.endsWith("YES"))) {
                        lineList.add(record.get(0).get(i)); // store the item name of this column
                    } else if (!isHeader && (cell.endsWith("F") || cell.endsWith("NO"))) {
                        // F, NO: the item is absent, so nothing is stored
                    } else {
                        lineList.add(cell);
                    }
                }
                record.add(lineList);
            }
        } catch (Exception e) {
            System.out.println("An error occurred while reading the file content");
            e.printStackTrace();
        }
        return record;
    }

    /**
     * Joins the current frequent item sets with themselves to find the next candidate sets.
     * @param frequentItemset the frequent (k-1)-item sets
     * @return the candidate k-item sets
     */
    private static List<List<String>> getNextCandidate(List<List<String>> frequentItemset) {
        List<List<String>> nextCandidateItemset = new ArrayList<>();
        for (int i = 0; i < frequentItemset.size(); i++) {
            Set<String> base = new HashSet<>(frequentItemset.get(i)); // row i of the frequent sets
            int hsLengthBefore = base.size(); // size before joining
            for (int h = i + 1; h < frequentItemset.size(); h++) {
                // Join row i with row h (h > i). A copy is used so that row i stays unchanged.
                Set<String> hsSet = new HashSet<>(base);
                hsSet.addAll(frequentItemset.get(h));
                int hsLengthAfter = hsSet.size();
                // Keep the union only if exactly one new item was added, it is contained in
                // at least one record, and it is not already among the candidates.
                if (hsLengthBefore + 1 == hsLengthAfter
                        && isSubsetOf(hsSet, record)
                        && isNotHave(hsSet, nextCandidateItemset)) {
                    nextCandidateItemset.add(new ArrayList<>(hsSet));
                }
            }
        }
        return nextCandidateItemset;
    }

    /**
     * Checks that the newly formed candidate is not yet in the new candidate sets.
     * Candidates are compared as sets, so the order of the items does not matter.
     */
    private static boolean isNotHave(Set<String> hsSet, List<List<String>> nextCandidateItemset) {
        for (List<String> candidate : nextCandidateItemset) {
            if (hsSet.equals(new HashSet<>(candidate))) {
                return false;
            }
        }
        return true;
    }

    /** Checks whether hsSet is a subset of at least one record in record2. */
    private static boolean isSubsetOf(Set<String> hsSet, List<List<String>> record2) {
        for (int i = 1; i < record2.size(); i++) {
            if (itemsOf(record2.get(i)).containsAll(hsSet)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Prunes the candidate k-item sets to obtain the frequent k-item sets.
     * @param candidateItemset the candidate k-item sets
     * @return the candidates whose support count reaches the minimum support
     */
    private static List<List<String>> getSupportedItemset(List<List<String>> candidateItemset) {
        boolean end = true;
        List<List<String>> supportedItemset = new ArrayList<>();
        for (List<String> candidate : candidateItemset) {
            int count = countFrequent(candidate); // number of records containing the candidate
            if (count >= MIN_SUPPORT * (record.size() - 1)) { // row 0 is the header
                supportedItemset.add(candidate);
                end = false;
            }
        }
        endTag = end; // keep iterating only while frequent item sets exist
        if (endTag) {
            System.out.println("No item set meets the minimum support; the join ends");
        }
        return supportedItemset;
    }

    /** Counts the records that contain every item of list. */
    private static int countFrequent(List<String> list) {
        int count = 0;
        for (int i = 1; i < record.size(); i++) {
            if (itemsOf(record.get(i)).containsAll(list)) {
                count++;
            }
        }
        return count;
    }

    /** Builds the candidate 1-item sets: one single-item list per distinct item in the records. */
    private static List<List<String>> findFirstCandidate() {
        List<List<String>> tableList = new ArrayList<>();
        Set<String> hs = new HashSet<>();
        for (int i = 1; i < record.size(); i++) { // row 0 holds the item names
            hs.addAll(itemsOf(record.get(i)));
        }
        for (String item : hs) {
            List<String> tempList = new ArrayList<>();
            tempList.add(item);
            tableList.add(tempList);
        }
        return tableList;
    }
}
The drawbacks of the Apriori algorithm are also obvious:
1. When the data volume is large, a huge number of candidate sets is generated. N frequent 1-item sets can produce N * (N-1) / 2 candidate 2-item sets.
2. The database has to be scanned many times. Every self-join produces a new set of candidates, and counting their support requires scanning the data again.
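To get a feeling for the first drawback, the candidate count can be evaluated for a few values of N. This is only a small arithmetic sketch; the class name CandidateGrowth and the sample values of N are made up for illustration.

```java
class CandidateGrowth {
    // Number of candidate 2-item sets produced by joining n frequent 1-item sets:
    // every unordered pair of distinct items, i.e. n * (n - 1) / 2.
    static long candidatePairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        // long is used because the count exceeds the int range for large n.
        for (long n : new long[] {10, 1_000, 100_000}) {
            System.out.println(n + " frequent 1-item sets -> "
                    + candidatePairs(n) + " candidate 2-item sets");
        }
    }
}
```

For 10, 1,000 and 100,000 frequent items this gives 45, 499,500 and 4,999,950,000 candidate pairs, and each of those candidates still has to be counted against the data.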
The improvements will be covered in the next blog post: FP-Tree, an association rule mining algorithm that does not generate candidate sets.