Learning the Apriori Algorithm, with a Java Implementation

Source: Internet
Author: User

Association rule mining discovers interesting associations or correlations between itemsets in large amounts of data. A typical example is market basket analysis: by finding which products appear together in customers' shopping baskets, it reveals shopping habits and helps retailers plan marketing strategies and guide sales. Abroad there is the "beer and diapers" story; at home there is the story of instant noodles and ham. This article introduces association rule mining and implements it in Java, using the Apriori algorithm as the example.

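What is an association rule:

For a set of records D and two itemsets A and B that occur in the records of D, an association rule has the form A ---> B [support(A -> B) = P(A ∪ B), confidence(A -> B) = P(B | A)]. Here P(A ∪ B) is the probability that a record contains all the items of both A and B.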

Representation of an association rule:

Noodles ------> Ham [support=0.2, confidence=0.8]

Support and confidence are the two measures of rule interestingness; they reflect the usefulness and the certainty of a discovered rule, respectively. The rule above says that records containing both noodles and ham account for 20% of all records (the real figure should not be that high, or people would be eating instant noodles every day). The confidence of 0.8 says that 80% of the records that contain noodles also contain ham (I belong to that 80% anyway).

An association rule is called interesting if it satisfies both the minimum support threshold and the minimum confidence threshold.
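As a rough illustration of these two measures and the threshold test, the following Java sketch computes support and confidence over a tiny hand-made transaction list. The class name RuleMeasures, its method names, and the sample data are assumptions made for this example; they are not part of the program given later in this article.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class RuleMeasures {

    // Convenience factory for a small item set.
    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Fraction of transactions that contain every item in `items`.
    static double support(List<Set<String>> transactions, Set<String> items) {
        int hits = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(items)) {
                hits++;
            }
        }
        return (double) hits / transactions.size();
    }

    // confidence(A -> B) = support(A and B together) / support(A).
    static double confidence(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        Set<String> both = new HashSet<>(a);
        both.addAll(b);
        return support(transactions, both) / support(transactions, a);
    }

    // A rule is interesting when it reaches both thresholds.
    static boolean isInteresting(List<Set<String>> transactions, Set<String> a, Set<String> b,
                                 double minSupport, double minConfidence) {
        Set<String> both = new HashSet<>(a);
        both.addAll(b);
        return support(transactions, both) >= minSupport
                && confidence(transactions, a, b) >= minConfidence;
    }

    public static void main(String[] args) {
        // Made-up sample data: four shopping baskets.
        List<Set<String>> transactions = Arrays.asList(
                setOf("noodles", "ham"),
                setOf("noodles", "ham"),
                setOf("noodles"),
                setOf("beer"));
        Set<String> a = setOf("noodles");
        Set<String> b = setOf("ham");
        System.out.println("support    = " + support(transactions, setOf("noodles", "ham")));
        System.out.println("confidence = " + confidence(transactions, a, b));
        System.out.println("interesting at (0.2, 0.6): " + isInteresting(transactions, a, b, 0.2, 0.6));
    }
}
```

In this sample, two of the four baskets contain both items and three contain noodles, so the rule Noodles -> Ham has support 2/4 and confidence 2/3.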

Important property: every non-empty subset of a frequent itemset must also be frequent. (Equivalently, if an itemset fails the minimum support test, all of its supersets fail it too.)


Idea of the Apriori algorithm: an iterative, level-wise search. First find the set of frequent 1-itemsets, denoted L1; L1 is used to find the set of frequent 2-itemsets L2, L2 is used to find L3, and so on, until no frequent k-itemsets can be found.
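The level-wise search can be sketched in a few lines of Java. This is a minimal sketch under stated assumptions, not the program given later in this article: the class name LevelWiseSketch, the method frequentItemsets, the use of an absolute count (minCount) instead of a support ratio, and the sample data are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class, for illustration only.
class LevelWiseSketch {

    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Returns all itemsets contained in at least minCount transactions.
    static List<Set<String>> frequentItemsets(List<Set<String>> transactions, int minCount) {
        List<Set<String>> result = new ArrayList<>();

        // Level 1 candidates: every distinct item as a one-element set.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> t : transactions) {
            for (String item : t) {
                candidates.add(setOf(item));
            }
        }

        while (!candidates.isEmpty()) {
            // Scan the data once and keep the candidates that are frequent: this is L(k).
            List<Set<String>> level = new ArrayList<>();
            for (Set<String> c : candidates) {
                int count = 0;
                for (Set<String> t : transactions) {
                    if (t.containsAll(c)) {
                        count++;
                    }
                }
                if (count >= minCount) {
                    level.add(c);
                }
            }
            result.addAll(level);

            // Build the next candidates from L(k): unions that are exactly one item larger.
            Set<Set<String>> next = new HashSet<>();
            for (int i = 0; i < level.size(); i++) {
                for (int j = i + 1; j < level.size(); j++) {
                    Set<String> union = new HashSet<>(level.get(i));
                    union.addAll(level.get(j));
                    if (union.size() == level.get(i).size() + 1) {
                        next.add(union);
                    }
                }
            }
            candidates = next; // empty when L(k) is empty, which ends the loop
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = Arrays.asList(
                setOf("a", "b", "c"), setOf("a", "b"), setOf("a", "c"), setOf("b"));
        System.out.println(frequentItemsets(transactions, 2));
    }
}
```

With the sample data and minCount = 2, the frequent itemsets are {a}, {b}, {c}, {a, b} and {a, c}; the candidate {a, b, c} occurs only once, so the search stops at level 3. The order of the printed sets is not fixed because hash sets are used.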

Each iteration of the Apriori algorithm has two steps:

1. Join step: to find L(k), generate the set of candidate k-itemsets by joining L(k-1) with itself.

2. Prune step: count the support of each candidate and remove the candidates that are not frequent, which leaves L(k). The iteration stops when no itemset satisfying the minimum support is produced.
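The join step is often written as a prefix join over sorted itemsets: two (k-1)-itemsets are merged when they agree on everything except their last item. The sketch below shows that textbook variant; it is not the union-based join used by the program later in this article, and the class name JoinStep and the requirement that every itemset is a sorted list are assumptions of the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class, for illustration only.
class JoinStep {

    // Joins L(k-1) with itself. Assumes every itemset is a sorted list of the same size (>= 1).
    static List<List<String>> join(List<List<String>> frequent) {
        List<List<String>> candidates = new ArrayList<>();
        for (List<String> a : frequent) {
            for (List<String> b : frequent) {
                int last = a.size() - 1;
                // Same first k-2 items, and a's last item sorts before b's last item,
                // so each candidate is generated exactly once and stays sorted.
                boolean samePrefix = a.subList(0, last).equals(b.subList(0, last));
                if (samePrefix && a.get(last).compareTo(b.get(last)) < 0) {
                    List<String> candidate = new ArrayList<>(a);
                    candidate.add(b.get(last));
                    candidates.add(candidate);
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<List<String>> l2 = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("a", "c"),
                Arrays.asList("b", "c"));
        System.out.println(join(l2)); // prints [[a, b, c]]
    }
}
```

Only {a, b} and {a, c} share the prefix "a", so the three frequent 2-itemsets produce the single candidate {a, b, c}.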

The Apriori property is applied during pruning: a candidate can be deleted as soon as any of its non-empty subsets is not frequent, which greatly reduces the number of candidates whose support has to be counted.
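A possible way to apply this property before counting support is to check every (k-1)-subset of a candidate against L(k-1). The following is a sketch only; the class name PruneStep, the method allSubsetsFrequent, and the representation of itemsets as sorted lists are assumptions of the example, and the program later in this article does not perform this check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class, for illustration only.
class PruneStep {

    // True if every (k-1)-subset of the candidate is a known frequent itemset.
    // Assumes the candidate and all frequent itemsets are sorted lists.
    static boolean allSubsetsFrequent(List<String> candidate, Set<List<String>> frequent) {
        for (int skip = 0; skip < candidate.size(); skip++) {
            List<String> subset = new ArrayList<>(candidate);
            subset.remove(skip); // remove(int index): drop one item, keep the order
            if (!frequent.contains(subset)) {
                return false;    // one infrequent subset is enough to discard the candidate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<List<String>> l2 = new HashSet<>();
        l2.add(Arrays.asList("a", "b"));
        l2.add(Arrays.asList("a", "c"));
        // {b, c} is not frequent, so {a, b, c} can be dropped without scanning the data.
        System.out.println(allSubsetsFrequent(Arrays.asList("a", "b", "c"), l2)); // prints false
    }
}
```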

Algorithm flowchart: [figure not preserved]

Worked example: [figure not preserved]


The code follows. Some parts are a bit redundant; the main reason the program is long is that it prints the mining process to the console, which makes the algorithm's steps easier to follow.

The idea of the algorithm is nevertheless clear: essentially a single while loop does the job.


package cluster;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

/**
 * Apriori algorithm for mining the maximal frequent patterns.
 * Support is computed; confidence is not.
 * Data layout implied by the loops: row 0 holds the item names (header),
 * column 0 of every row holds the record ID.
 * @author push_pop
 */
public class AprioriMyself {

    private static final double MIN_SUPPROT = 0.2; // minimum support
    private static boolean endTag = false;         // loop state
    static List<List<String>> record = new ArrayList<List<String>>(); // data set

    public static void main(String args[]) {
        // ************* Read the data set *************
        record = getRecord();
        System.out.println("Data set read in as a matrix:");
        printItemsets(record);

        // ************* Candidate 1-itemsets *************
        List<List<String>> candidateItemset = findFirstCandidate();
        System.out.println("Candidate 1-itemsets after the first scan:");
        printItemsets(candidateItemset);

        // ************* Frequent 1-itemsets *************
        List<List<String>> frequentItemset = getSupprotedItemset(candidateItemset);
        System.out.println("Frequent 1-itemsets after the first scan:");
        printItemsets(frequentItemset);

        // ************* Iteration *************
        while (!endTag) {
            // Join step: frequent (k-1)-itemsets -> candidate k-itemsets
            List<List<String>> nextCandidateItemset = getNextCandidate(frequentItemset);
            System.out.println("Candidate set after this scan:");
            printItemsets(nextCandidateItemset);

            // Prune step: candidate k-itemsets -> frequent k-itemsets
            List<List<String>> nextFrequentItemset = getSupprotedItemset(nextCandidateItemset);
            System.out.println("Frequent set after this scan:");
            printItemsets(nextFrequentItemset);

            // When the loop is about to end, print the maximal patterns found in the previous round
            if (endTag) {
                System.out.println("Apriori algorithm ---> frequent set");
                printItemsets(frequentItemset);
            }

            // Initial values for the next round
            candidateItemset = nextCandidateItemset;
            frequentItemset = nextFrequentItemset;
        }
    }

    /** Prints one itemset per line. */
    private static void printItemsets(List<List<String>> itemsets) {
        for (List<String> itemset : itemsets) {
            for (String item : itemset) {
                System.out.print(item + " ");
            }
            System.out.println();
        }
    }

    /** Items of a record without its ID column (column 0). */
    private static List<String> itemsOf(List<String> row) {
        return row.size() > 1 ? row.subList(1, row.size()) : Collections.<String>emptyList();
    }

    /**
     * Reads the txt data.
     * @return the records, header row first
     */
    public static List<List<String>> getRecord() {
        List<List<String>> record = new ArrayList<List<String>>();
        File file = new File("simple.txt");
        if (!file.isFile()) {
            System.out.println("Cannot find the specified file!");
            return record;
        }
        String encoding = "GBK"; // character encoding (avoids garbled Chinese text)
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), encoding))) {
            String lineTXT;
            while ((lineTXT = bufferedReader.readLine()) != null) { // read one line
                // The delimiter was lost in the source text; whitespace is assumed here.
                String[] lineString = lineTXT.trim().split("\\s+");
                List<String> lineList = new ArrayList<String>();
                boolean isHeader = record.isEmpty(); // the header row is stored verbatim
                for (int i = 0; i < lineString.length; i++) {
                    // Handle T, F, YES, NO cells of a boolean matrix
                    if (!isHeader && (lineString[i].endsWith("T") || lineString[i].endsWith("YES"))) {
                        lineList.add(record.get(0).get(i)); // store the item name from the header
                    } else if (!isHeader && (lineString[i].endsWith("F") || lineString[i].endsWith("NO"))) {
                        // F and NO cells are not stored
                    } else {
                        lineList.add(lineString[i]);
                    }
                }
                record.add(lineList);
            }
        } catch (Exception e) {
            System.out.println("Error while reading the file contents");
            e.printStackTrace();
        }
        return record;
    }

    /**
     * Self-joins the current frequent itemsets to build the next candidate set.
     * @param frequentItemset frequent (k-1)-itemsets
     * @return candidate k-itemsets
     */
    private static List<List<String>> getNextCandidate(List<List<String>> frequentItemset) {
        List<List<String>> nextCandidateItemset = new ArrayList<List<String>>();
        for (int i = 0; i < frequentItemset.size(); i++) {
            HashSet<String> hsSetTemp = new HashSet<String>(frequentItemset.get(i)); // itemset i
            int hsLength_before = hsSetTemp.size(); // size before the join
            for (int h = i + 1; h < frequentItemset.size(); h++) {
                // Join itemset i with itemset h (h > i), always starting from an unchanged copy of itemset i
                HashSet<String> hsSet = new HashSet<String>(hsSetTemp);
                hsSet.addAll(frequentItemset.get(h));
                int hsLength_after = hsSet.size();
                // Keep the union if exactly one new item was added, it occurs in some record,
                // and it is not already among the candidates
                if (hsLength_before + 1 == hsLength_after
                        && isSubsetOf(hsSet, record)
                        && isnotHave(hsSet, nextCandidateItemset)) {
                    nextCandidateItemset.add(new ArrayList<String>(hsSet));
                }
            }
        }
        return nextCandidateItemset;
    }

    /**
     * Checks that the newly formed candidate is not yet in the new candidate set.
     * Itemsets are compared as sets, so the element order does not matter.
     */
    private static boolean isnotHave(HashSet<String> hsSet, List<List<String>> nextCandidateItemset) {
        for (List<String> candidate : nextCandidateItemset) {
            if (hsSet.equals(new HashSet<String>(candidate))) {
                return false;
            }
        }
        return true;
    }

    /**
     * Checks whether hsSet is a subset of at least one record in record2.
     */
    private static boolean isSubsetOf(HashSet<String> hsSet, List<List<String>> record2) {
        for (int i = 1; i < record2.size(); i++) { // row 0 is the header
            if (itemsOf(record2.get(i)).containsAll(hsSet)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Prunes the candidate k-itemsets to obtain the frequent k-itemsets.
     * @param candidateItemset candidate k-itemsets
     * @return frequent k-itemsets
     */
    private static List<List<String>> getSupprotedItemset(List<List<String>> candidateItemset) {
        boolean end = true;
        List<List<String>> supportedItemset = new ArrayList<List<String>>();
        for (int i = 0; i < candidateItemset.size(); i++) {
            int count = countFrequent(candidateItemset.get(i)); // number of records containing it
            if (count >= MIN_SUPPROT * (record.size() - 1)) {   // the header row is not a record
                supportedItemset.add(candidateItemset.get(i));
                end = false;
            }
        }
        endTag = end; // keep going as long as some frequent itemset exists
        if (endTag) {
            System.out.println("No itemset satisfies the minimum support; joining ends");
        }
        return supportedItemset;
    }

    /**
     * Counts the records that contain every item of list.
     */
    private static int countFrequent(List<String> list) {
        int count = 0;
        for (int i = 1; i < record.size(); i++) { // row 0 is the header
            if (itemsOf(record.get(i)).containsAll(list)) {
                count++;
            }
        }
        return count;
    }

    /**
     * Builds the candidate 1-itemsets: every distinct item that occurs in a record.
     */
    private static List<List<String>> findFirstCandidate() {
        HashSet<String> hs = new HashSet<String>();
        for (int i = 1; i < record.size(); i++) { // row 0 holds the item names
            hs.addAll(itemsOf(record.get(i)));
        }
        List<List<String>> tableList = new ArrayList<List<String>>();
        for (String item : hs) {
            List<String> tempList = new ArrayList<String>();
            tempList.add(item);
            tableList.add(tempList);
        }
        return tableList;
    }
}





The flaws of the Apriori algorithm are also obvious:

1. With a large amount of data, a huge number of candidate sets is generated: N frequent 1-itemsets may produce N*(N-1)/2 candidate 2-itemsets.

2. The database has to be scanned many times: every self-join is followed by another full scan to count the support of the new candidates.
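To get a feel for how quickly the candidate count in the first point grows, the small Java sketch below evaluates N*(N-1)/2 for a few values of N. The class name CandidateCount and the chosen values of N are made up for the illustration.

```java
// Hypothetical class, for illustration only.
class CandidateCount {

    // Number of unordered pairs that can be formed from n frequent 1-itemsets.
    static long candidatePairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        for (long n : new long[] {10, 1000, 100000}) {
            System.out.println(n + " frequent 1-itemsets -> up to "
                    + candidatePairs(n) + " candidate 2-itemsets");
        }
    }
}
```

For N = 1000 this is already 499500 candidate 2-itemsets, each of which has to be counted against the data.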

Improvements will be covered in the next blog post: FP-tree, an association rule mining algorithm that does not generate candidate sets.
