Introduction to the Apriori algorithm for frequent pattern mining and its Java implementation

Source: Internet
Author: User

Frequent patterns are patterns that occur frequently in a data set, such as itemsets, subsequences, or substructures. For example, a set of goods (such as milk and bread) that frequently appears together in a transaction data set is a frequent itemset.


Some basic concepts

Support: support(A=>B) = P(A ∪ B), the probability that a transaction contains every item of both A and B.

Confidence: confidence(A=>B) = P(B|A) = support_count(A ∪ B) / support_count(A)

Frequent k-itemsets: An itemset containing k items is called a k-itemset. If the support of an itemset I satisfies the predefined minimum support threshold, I is called a frequent itemset.
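
The two measures above can be computed directly from a list of transactions. The following is a minimal sketch, not part of the original program: the class name SupportConfidenceSketch, its method names, and the milk/bread transactions are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch; all names here are hypothetical, not from the article's code.
class SupportConfidenceSketch {

    // Helper that builds an itemset from item names.
    static Set<String> items(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // Number of transactions that contain every item of the given itemset.
    static int supportCount(List<Set<String>> transactions, Set<String> itemset) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    // support(A => B) = P(A ∪ B): fraction of transactions containing all items of A and B.
    static double support(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) supportCount(transactions, union) / transactions.size();
    }

    // confidence(A => B) = P(B | A) = supportCount(A ∪ B) / supportCount(A).
    static double confidence(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) supportCount(transactions, union) / supportCount(transactions, a);
    }

    public static void main(String[] args) {
        List<Set<String>> d = Arrays.asList(
                items("milk", "bread"),
                items("milk", "bread", "butter"),
                items("bread"),
                items("milk", "eggs"));
        // milk and bread occur together in 2 of the 4 transactions
        System.out.println(support(d, items("milk"), items("bread")));
        // 2 of the 3 transactions containing milk also contain bread
        System.out.println(confidence(d, items("milk"), items("bread")));
    }
}
```

Representing each transaction as a Set keeps the containment test simple; a real implementation would usually avoid rescanning the whole database for every itemset.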


The idea of the algorithm

The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 and is the original algorithm for mining frequent itemsets for Boolean association rules. As the name suggests, the algorithm uses prior knowledge about the properties of frequent itemsets. Apriori uses an iterative, level-wise search in which k-itemsets are used to explore (k+1)-itemsets. First, the database is scanned to accumulate the count of each item, and the items that meet the minimum support are collected to form the set of frequent 1-itemsets, denoted L1. Then L1 is used to find the set of frequent 2-itemsets L2, L2 is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database.
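
The first pass of the level-wise search, finding L1, can be sketched as follows. This is an illustrative sketch under stated assumptions: the class and method names are hypothetical, transactions are lists of item names, and each item is assumed to appear at most once per transaction.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch; names are hypothetical, not from the article's code.
class FrequentOneItemsetsSketch {

    // One scan of the database: count every item, then drop the items
    // whose count is below the minimum support count.
    static Map<String, Integer> findFrequentOneItemsets(List<List<String>> transactions,
                                                        int minSupportCount) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> t : transactions) {
            for (String item : t) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        counts.values().removeIf(c -> c < minSupportCount);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> d = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("a", "c"),
                Arrays.asList("a", "d"),
                Arrays.asList("b", "c"));
        // "d" occurs only once, so it is not frequent for a minimum support count of 2
        System.out.println(findFrequentOneItemsets(d, 2)); // prints {a=3, b=2, c=3}
    }
}
```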

As you can imagine, the computational cost of the algorithm is very high. To make the level-wise generation of frequent itemsets more efficient, the Apriori property is used to compress the search space: all non-empty subsets of a frequent itemset must also be frequent. In other words, if an itemset has a non-empty subset that is not frequent, the itemset itself cannot be frequent.
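
The property holds because every transaction that contains an itemset also contains each of its subsets, so a subset can never have a smaller count than the itemset. The sketch below checks this on the nine sample transactions used by the Java program later in this article; the class and method names are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch; names are hypothetical, not from the article's code.
class AprioriPropertySketch {

    static Set<String> items(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // Number of transactions that contain every item of the itemset.
    static int supportCount(List<Set<String>> transactions, Set<String> itemset) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    // Checks that each subset obtained by dropping one item is at least
    // as frequent as the itemset itself.
    static boolean subsetsAtLeastAsFrequent(List<Set<String>> transactions, Set<String> itemset) {
        int whole = supportCount(transactions, itemset);
        for (String dropped : itemset) {
            Set<String> subset = new HashSet<>(itemset);
            subset.remove(dropped);
            if (supportCount(transactions, subset) < whole) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Set<String>> d = Arrays.asList(
                items("1", "2", "5"), items("2", "4"), items("2", "3"),
                items("1", "2", "4"), items("1", "3"), items("2", "3"),
                items("1", "3"), items("1", "2", "3", "5"), items("1", "2", "3"));
        System.out.println(supportCount(d, items("1", "2", "5")));             // count of {1,2,5}
        System.out.println(supportCount(d, items("1", "2")));                  // count of the subset {1,2}
        System.out.println(subsetsAtLeastAsFrequent(d, items("1", "2", "5"))); // the property holds
    }
}
```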

How is the Apriori property used in the algorithm? To understand this, we examine how Lk-1 is used to find Lk, where k>=2. This consists of two main steps: the join step and the prune step.

Join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the j-th item of li (for example, l1[k-2] is the second-to-last item of l1). For an efficient implementation, Apriori assumes that the items in a transaction or itemset are sorted in lexicographic order. For a (k-1)-itemset li, this means the items are sorted so that li[1]<li[2]<...<li[k-1]. In the join of Lk-1 with itself, two elements of Lk-1 are joinable if their first (k-2) items are the same. That is, elements l1 and l2 of Lk-1 are joined if (l1[1]=l2[1]) ^ (l1[2]=l2[2]) ^ ... ^ (l1[k-2]=l2[k-2]) ^ (l1[k-1]<l2[k-1]). The condition l1[k-1]<l2[k-1] simply ensures that no duplicates are generated. The itemset resulting from joining l1 and l2 is {l1[1], l1[2], ..., l1[k-1], l2[k-1]}.
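
The join condition can be sketched as follows. This is a minimal illustration, assuming each itemset is a non-empty, lexicographically sorted list of item names; the class and method names are hypothetical. The input in main is the set of frequent 2-itemsets of this article's sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch; names are hypothetical, not from the article's code.
class JoinStepSketch {

    // Joins L(k-1) with itself. Two itemsets are joined when their first k-2
    // items are equal and the last item of l1 is smaller than the last item of l2.
    static List<List<String>> join(List<List<String>> prevFrequent) {
        List<List<String>> candidates = new ArrayList<>();
        for (List<String> l1 : prevFrequent) {
            for (List<String> l2 : prevFrequent) {
                int last = l1.size() - 1;
                if (l1.size() == l2.size()
                        && l1.subList(0, last).equals(l2.subList(0, last))
                        && l1.get(last).compareTo(l2.get(last)) < 0) {
                    List<String> candidate = new ArrayList<>(l1);
                    candidate.add(l2.get(last));
                    candidates.add(candidate);
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<List<String>> l2 = Arrays.asList(
                Arrays.asList("1", "2"), Arrays.asList("1", "3"), Arrays.asList("1", "5"),
                Arrays.asList("2", "3"), Arrays.asList("2", "4"), Arrays.asList("2", "5"));
        // prints [[1, 2, 3], [1, 2, 5], [1, 3, 5], [2, 3, 4], [2, 3, 5], [2, 4, 5]]
        System.out.println(join(l2));
    }
}
```

Because the last items are compared with a strict "less than", {1,2} joined with {1,3} yields {1,2,3} only once, and never {1,3,2}.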

Prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all frequent k-itemsets are contained in Ck. Scanning all transactions to determine the count of each candidate in Ck yields Lk: every candidate whose count is not less than the minimum support count is frequent. However, Ck can be very large, so this scan can be expensive. To compress Ck, the Apriori property is used: every non-empty subset of a frequent itemset must also be frequent, so if any (k-1)-subset of a candidate k-itemset is not in Lk-1, the candidate cannot be frequent and can be removed from Ck.
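
The pruning test can be sketched as follows. This is an illustrative sketch with hypothetical class and method names; itemsets are assumed to be sorted lists of item names, and the data in main are the frequent 2-itemsets of this article's sample data together with the six 3-item candidates that joining them produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch; names are hypothetical, not from the article's code.
class PruneStepSketch {

    // Returns true if some (k-1)-subset of the candidate is missing from L(k-1).
    static boolean hasInfrequentSubset(List<String> candidate, Set<List<String>> prevFrequent) {
        for (int i = 0; i < candidate.size(); i++) {
            List<String> subset = new ArrayList<>(candidate);
            subset.remove(i); // remove the item at index i, leaving a (k-1)-subset
            if (!prevFrequent.contains(subset)) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the candidates whose (k-1)-subsets are all frequent.
    static List<List<String>> prune(List<List<String>> candidates, Set<List<String>> prevFrequent) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> c : candidates) {
            if (!hasInfrequentSubset(c, prevFrequent)) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<List<String>> l2 = new HashSet<>(Arrays.asList(
                Arrays.asList("1", "2"), Arrays.asList("1", "3"), Arrays.asList("1", "5"),
                Arrays.asList("2", "3"), Arrays.asList("2", "4"), Arrays.asList("2", "5")));
        List<List<String>> c3 = Arrays.asList(
                Arrays.asList("1", "2", "3"), Arrays.asList("1", "2", "5"),
                Arrays.asList("1", "3", "5"), Arrays.asList("2", "3", "4"),
                Arrays.asList("2", "3", "5"), Arrays.asList("2", "4", "5"));
        // {1,3,5} is dropped because {3,5} is not in L2; the others are dropped similarly
        System.out.println(prune(c3, l2)); // prints [[1, 2, 3], [1, 2, 5]]
    }
}
```

Only the two surviving candidates then need to be counted against the database, instead of all six.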



Pseudo code

Algorithm: Apriori
Input: D - transaction database; min_sup - minimum support count threshold
Output: L - frequent itemsets in D
Method:
    L1 = find_frequent_1-itemsets(D);        // find all frequent 1-itemsets
    for (k = 2; Lk-1 != empty; k++) {
        Ck = apriori_gen(Lk-1);              // generate candidates and prune
        for each transaction t in D {        // scan D to count the candidates
            Ct = subset(Ck, t);              // get the candidates that are subsets of t
            for each candidate c in Ct
                c.count++;
        }
        Lk = {c in Ck | c.count >= min_sup}
    }
    return L = all frequent itemsets (the union of all Lk);

Procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)
    for each itemset l1 in Lk-1
        for each itemset l2 in Lk-1
            if (l1[1]=l2[1]) && (l1[2]=l2[2]) && ... && (l1[k-2]=l2[k-2]) && (l1[k-1]<l2[k-1]) then {
                c = l1 join l2;              // join step: generate a candidate
                if has_infrequent_subset(c, Lk-1) then
                    delete c;                // prune step: delete the infrequent candidate
                else
                    add c to Ck;
            }
    return Ck;

Procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)
    for each (k-1)-subset s of c
        if s does not belong to Lk-1 then
            return true;
    return false;


Java implementation

The Java code follows the pseudocode closely, so it should be relatively easy to understand.

package com.zhyoulun.apriori;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Apriori2 {

    private final static int SUPPORT = 2;           // minimum support count threshold
    private final static double CONFIDENCE = 0.7;   // confidence threshold
    private final static String ITEM_SPLIT = ";";   // separator between items
    private final static String CON = "->";         // separator between the two sides of a rule

    /**
     * Main procedure of the algorithm.
     * @param dataList the transactions, each encoded as "item;item;...;"
     * @return all frequent itemsets with their support counts
     */
    public Map<String, Integer> apriori(ArrayList<String> dataList) {
        Map<String, Integer> stepFrequentSetMap = new HashMap<>();
        stepFrequentSetMap.putAll(findFrequentOneSets(dataList));

        Map<String, Integer> frequentSetMap = new HashMap<String, Integer>(); // all frequent itemsets
        frequentSetMap.putAll(stepFrequentSetMap);

        while (stepFrequentSetMap != null && stepFrequentSetMap.size() > 0) {
            Map<String, Integer> candidateSetMap = aprioriGen(stepFrequentSetMap);
            Set<String> candidateKeySet = candidateSetMap.keySet();

            // scan D and count the candidates
            for (String data : dataList) {
                for (String candidate : candidateKeySet) {
                    boolean flag = true;
                    String[] strings = candidate.split(ITEM_SPLIT);
                    for (String string : strings) {
                        // match whole items only: wrapping both sides in separators
                        // prevents "2;" from matching inside "12;"
                        if (!(ITEM_SPLIT + data).contains(ITEM_SPLIT + string + ITEM_SPLIT)) {
                            flag = false;
                            break;
                        }
                    }
                    if (flag)
                        candidateSetMap.put(candidate, candidateSetMap.get(candidate) + 1);
                }
            }

            // keep the candidates that meet the support threshold
            stepFrequentSetMap.clear();
            for (String candidate : candidateKeySet) {
                Integer count = candidateSetMap.get(candidate);
                if (count >= SUPPORT)
                    stepFrequentSetMap.put(candidate, count);
            }

            // merge into the set of all frequent itemsets
            frequentSetMap.putAll(stepFrequentSetMap);
        }

        return frequentSetMap;
    }

    /**
     * Find the frequent 1-itemsets.
     * @param dataList the transactions
     * @return frequent 1-itemsets with their support counts
     */
    private Map<String, Integer> findFrequentOneSets(ArrayList<String> dataList) {
        Map<String, Integer> resultSetMap = new HashMap<>();

        for (String data : dataList) {
            String[] strings = data.split(ITEM_SPLIT);
            for (String string : strings) {
                string += ITEM_SPLIT;
                if (resultSetMap.get(string) == null) {
                    resultSetMap.put(string, 1);
                } else {
                    resultSetMap.put(string, resultSetMap.get(string) + 1);
                }
            }
        }

        // drop items below the support threshold (the sample data has none)
        resultSetMap.values().removeIf(count -> count < SUPPORT);
        return resultSetMap;
    }

    /**
     * Generate the candidate set from the frequent itemsets of the previous step.
     * @param setMap frequent (k-1)-itemsets
     * @return candidate k-itemsets, each with count 0
     */
    private Map<String, Integer> aprioriGen(Map<String, Integer> setMap) {
        Map<String, Integer> candidateSetMap = new HashMap<>();

        Set<String> candidateSet = setMap.keySet();
        for (String s1 : candidateSet) {
            String[] strings1 = s1.split(ITEM_SPLIT);
            String s1String = "";
            for (String temp : strings1)
                s1String += temp + ITEM_SPLIT;

            for (String s2 : candidateSet) {
                String[] strings2 = s2.split(ITEM_SPLIT);

                boolean flag = true;
                for (int i = 0; i < strings1.length - 1; i++) {
                    if (strings1[i].compareTo(strings2[i]) != 0) {
                        flag = false;
                        break;
                    }
                }
                if (flag && strings1[strings1.length - 1].compareTo(strings2[strings1.length - 1]) < 0) {
                    // join step: generate the candidate
                    String c = s1String + strings2[strings2.length - 1] + ITEM_SPLIT;
                    // prune step: keep the candidate only if it has no infrequent subset
                    if (!hasInfrequentSubset(c, setMap)) {
                        candidateSetMap.put(c, 0);
                    }
                }
            }
        }

        return candidateSetMap;
    }

    /**
     * Use the Apriori property to decide whether a candidate can be discarded.
     * @param candidateSet the candidate k-itemset
     * @param setMap frequent (k-1)-itemsets
     * @return true if some (k-1)-subset of the candidate is not frequent
     */
    private boolean hasInfrequentSubset(String candidateSet, Map<String, Integer> setMap) {
        String[] strings = candidateSet.split(ITEM_SPLIT);

        // build every (k-1)-subset of the candidate and check that it is frequent
        for (int i = 0; i < strings.length; i++) {
            String subString = "";
            for (int j = 0; j < strings.length; j++) {
                if (j != i) {
                    subString += strings[j] + ITEM_SPLIT;
                }
            }

            if (setMap.get(subString) == null)
                return true;
        }

        return false;
    }

    /**
     * Generate association rules from the frequent itemsets.
     * @param frequentSetMap all frequent itemsets with their support counts
     * @return rules of the form "A->B" mapped to their confidence
     */
    public Map<String, Double> getRelationRules(Map<String, Integer> frequentSetMap) {
        Map<String, Double> relationsMap = new HashMap<>();
        Set<String> keySet = frequentSetMap.keySet();

        for (String key : keySet) {
            List<String> keySubset = subset(key);
            for (String keySubsetItem : keySubset) {
                // the subset keySubsetItem is also a frequent itemset
                Integer count = frequentSetMap.get(keySubsetItem);
                if (count != null) {
                    Double confidence = (1.0 * frequentSetMap.get(key)) / (1.0 * frequentSetMap.get(keySubsetItem));
                    if (confidence > CONFIDENCE)
                        relationsMap.put(keySubsetItem + CON + expect(key, keySubsetItem), confidence);
                }
            }
        }

        return relationsMap;
    }

    /**
     * Compute all non-empty proper subsets of a set.
     *
     * @param sourceSet the itemset, encoded as "item;item;...;"
     * @return all non-empty proper subsets
     *
     * So that it can be reused elsewhere, this does not use recursion.
     * Reference: http://blog.163.com/[email protected]/blog/static/3980524020109784356915/
     * Idea: suppose the set S is (a,b,c,d). Its size is 4, so it has 2^4 subsets, numbered 0-15,
     * in binary 0000, 0001, ..., 1111. Each bit marks whether one element is included, so
     * 0000 is the empty set and 1111 is {a,b,c,d}.
     */
    private List<String> subset(String sourceSet) {
        List<String> result = new ArrayList<>();

        String[] strings = sourceSet.split(ITEM_SPLIT);
        // non-empty proper subsets: skip 0 (empty set) and 2^n - 1 (the full set)
        for (int i = 1; i < (int) (Math.pow(2, strings.length)) - 1; i++) {
            String item = "";
            String flag = "";
            int ii = i;
            do {
                flag += "" + ii % 2;
                ii = ii / 2;
            } while (ii > 0);
            for (int j = flag.length() - 1; j >= 0; j--) {
                if (flag.charAt(j) == '1') {
                    item = strings[j] + ITEM_SPLIT + item;
                }
            }
            result.add(item);
        }

        return result;
    }

    /**
     * Set difference A \ B.
     * @param stringA
     * @param stringB
     * @return the items of A that are not in B
     */
    private String expect(String stringA, String stringB) {
        String result = "";

        String[] stringAs = stringA.split(ITEM_SPLIT);
        String[] stringBs = stringB.split(ITEM_SPLIT);

        for (int i = 0; i < stringAs.length; i++) {
            boolean flag = true;
            for (int j = 0; j < stringBs.length; j++) {
                if (stringAs[i].compareTo(stringBs[j]) == 0) {
                    flag = false;
                    break;
                }
            }
            if (flag)
                result += stringAs[i] + ITEM_SPLIT;
        }

        return result;
    }

    public static void main(String[] args) {
        ArrayList<String> dataList = new ArrayList<>();
        dataList.add("1;2;5;");
        dataList.add("2;4;");
        dataList.add("2;3;");
        dataList.add("1;2;4;");
        dataList.add("1;3;");
        dataList.add("2;3;");
        dataList.add("1;3;");
        dataList.add("1;2;3;5;");
        dataList.add("1;2;3;");

        System.out.println("= Data Set ==========");
        for (String string : dataList) {
            System.out.println(string);
        }

        Apriori2 apriori2 = new Apriori2();

        System.out.println("= frequent itemsets ==========");

        Map<String, Integer> frequentSetMap = apriori2.apriori(dataList);
        Set<String> keySet = frequentSetMap.keySet();
        for (String key : keySet) {
            System.out.println(key + " : " + frequentSetMap.get(key));
        }

        System.out.println("= Association Rules ==========");
        Map<String, Double> relationRulesMap = apriori2.getRelationRules(frequentSetMap);
        Set<String> rrKeySet = relationRulesMap.keySet();
        for (String rrKey : rrKeySet) {
            System.out.println(rrKey + " : " + relationRulesMap.get(rrKey));
        }
    }
}


Calculation results

= Data Set ==========
1;2;5;
2;4;
2;3;
1;2;4;
1;3;
2;3;
1;3;
1;2;3;5;
1;2;3;
= frequent itemsets ==========
1;2; : 4
1;3; : 4
5; : 2
2;3; : 4
4; : 2
2;4; : 2
1;5; : 2
3; : 6
2; : 7
1; : 6
1;2;5; : 2
1;2;3; : 2
2;5; : 2
= Association Rules ==========
4;->2; : 1.0
5;->1;2; : 1.0
5;->1; : 1.0
1;5;->2; : 1.0
5;->2; : 1.0
2;5;->1; : 1.0


Reference:

http://blog.csdn.net/zjd950131/article/details/8071414

http://www.cnblogs.com/zacard-orc/p/3646979.html

Data Mining: Concepts and Techniques


When reprinting, please cite the source: http://blog.csdn.net/zhyoulun/article/details/41978401

