Design and Implementation of the Apriori Association Rule Algorithm Based on STL

Source: Internet
Author: User

Association rule mining is one of many "rule-based" data mining methods. The basic theory of association rules is assumed to be familiar to the reader and is not described in detail here; the following describes the algorithm design.

The main idea of the Apriori algorithm:

1. Candidate item sets are constructed based on the principle that "every subset of a frequent item set must be frequent, and every superset of a non-frequent item set must be non-frequent". The support of each candidate item set is then calculated by traversing the transaction database, which yields the frequent item sets.
2. Association rules are generated from the frequent item sets.

In my opinion, association rules are relatively simple in theory, and I believe many people feel the same way, but designing and implementing the algorithm is quite difficult. The crux of the problem is that it is hard to design a suitable general data structure for frequent item sets (one that works for 1-item sets, 2-item sets, ... n-item sets) and a reasonable data structure for rules. Without these two data structures, association rules remain theory on paper, like flowers seen through fog.

The data structure of a rule is easy to design. A rule has only a few members (condition, conclusion, support, confidence, and lift), so it can be represented by a struct:

typedef struct
{
  char condition[80];
  char conclusion[80];
  double sup;   // support
  double conf;  // confidence
  double lift;  // lift
} rule;

All rules can then be kept in a list of this struct (list<rule> lst_rule) or in a dynamic array (vector<rule> vt_rule).

/* For any questions, email me at datamining@163.com or contact me on QQ: 275869936. From http://blog.sina.com.cn/dataminer321 */

The data structure of an item set is more complicated. Let us first describe the structure and uses of an item set to get an overall understanding of it.

An item set and the set of items in a transaction are alike: both are collections of several items, {A, B, C, D, E, G, ...}, and the number of elements (items) ranges from 1 to n (for an n-item set).

Uses of an item set:

1. Two frequent n-item sets are joined into a candidate (n+1)-item set.
2. When calculating the support of candidate item sets, we determine whether a candidate item set is contained in the current transaction (auxiliary function 2).

This requires that the data structure be able to hold both n items and n+1 items; that is, it must adapt to changes in the number of items.

Here, the data structure of an item set contains two parts: 1. a string holding the collection of items (which also suits transactions); 2. a support count. A map<string, int> can be used for storage. The first parameter is a string, that is, the STL string. The item set (or transaction) is saved as a string in which items are separated by the symbol "|". For example, the item set composed of items A, B, and C is stored as the string m_itemset = "A|B|C". When it is used, auxiliary function 1 splits m_itemset at each "|" and stores the items in vector<string> vt_itemset, where each element of the vector is one item. The second parameter is the support count.

Next we discuss the second step, generating rules from frequent item sets. The implementation of this step is rarely mentioned in papers, which mostly discuss the first step, generating frequent item sets, probably because the first step is crucial to the performance of the algorithm.

One n-item set can generate 2^n - 2 rules: every non-empty proper subset of its items can serve as the condition, with the remaining items as the conclusion, so the condition and conclusion of each rule together contain exactly the n items of the item set [method 1]. Nothing more needs to be considered here, for example that each n-item set contains n (n-1)-item subsets, each of which can in turn generate 2^(n-1) - 2 rules. The reason is that our frequent item sets include the frequent 1-item sets, the frequent 2-item sets, and so on up to the frequent n-item sets (rather than only the frequent item sets of maximum length), so as long as we follow [method 1] to generate the rules for each frequent item set, all the rules are obtained.

Auxiliary function 1:

/* Function call example:
   m_strsource = "A|B|C";  // input parameter
   substr = "|";           // input parameter
   vector<string> vitem;
   getitemsfromstring(m_strsource, vitem, substr);
   vitem[0] = "A";         // output parameter
   vitem[1] = "B";
   vitem[2] = "C";
*/
void getitemsfromstring(const string &m_strsource, vector<string> &vitem, string substr)
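{
  vitem.clear();
  const string::size_type nsep = substr.size(); // separator length; substr must not be empty
  string::size_type j;
  string::size_type i = m_strsource.find(substr, 0);

  if (i == string::npos)
  {
    // m_strsource is a single item
    vitem.push_back(m_strsource);
  }
  else
  {
    // first item: everything before the first separator
    string m_strtemp = m_strsource.substr(0, i);
    vitem.push_back(m_strtemp);

    while (i != string::npos)
    {
      j = m_strsource.find(substr, i + nsep);

      if (j == string::npos)
      {
        // last item: everything after the final separator
        m_strtemp = m_strsource.substr(i + nsep);
        vitem.push_back(m_strtemp);
      }
      else
      {
        // item between two consecutive separators
        m_strtemp = m_strsource.substr(i + nsep, j - i - nsep);
        vitem.push_back(m_strtemp);
      }

      i = j;
    } // end of while
  } // end of else
}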
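
The first use of an item set described above, joining two frequent n-item sets into a candidate (n+1)-item set, can be sketched on top of the "|"-separated string format. This is only a minimal sketch under assumptions the article does not state: the items inside each string are kept in sorted order, and the join condition is the textbook Apriori one (the two sets share their first n-1 items and differ in the last). The names splitItems, joinItems, and joinCandidates are hypothetical helpers introduced for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Split "A|B|C" into {"A", "B", "C"} (hypothetical helper, single-character separator).
std::vector<std::string> splitItems(const std::string& itemset, char sep = '|') {
    std::vector<std::string> items;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = itemset.find(sep, start);
        if (pos == std::string::npos) {
            items.push_back(itemset.substr(start));
            break;
        }
        items.push_back(itemset.substr(start, pos - start));
        start = pos + 1;
    }
    return items;
}

// Join {"A", "B", "C"} back into "A|B|C" (hypothetical helper).
std::string joinItems(const std::vector<std::string>& items, char sep = '|') {
    std::string out;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i > 0) out += sep;
        out += items[i];
    }
    return out;
}

// Assumed join rule: if two sorted n-item sets share their first n-1 items and
// differ in the last one, return the candidate (n+1)-item set; otherwise "".
std::string joinCandidates(const std::string& a, const std::string& b) {
    const std::vector<std::string> va = splitItems(a);
    const std::vector<std::string> vb = splitItems(b);
    if (va.size() != vb.size()) return "";
    const std::size_t n = va.size();
    if (!std::equal(va.begin(), va.end() - 1, vb.begin())) return "";
    if (va[n - 1] == vb[n - 1]) return "";
    // Keep the shared prefix, then append the two differing items in sorted order.
    std::vector<std::string> merged(va.begin(), va.end() - 1);
    merged.push_back(std::min(va[n - 1], vb[n - 1]));
    merged.push_back(std::max(va[n - 1], vb[n - 1]));
    return joinItems(merged);
}
```

Under these assumptions, joinCandidates("A|B", "A|C") yields "A|B|C", while two sets with different prefixes, such as "A|B" and "C|D", yield an empty string. The resulting string can be used directly as a key in the map<string, int> with an initial count of 0.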
Auxiliary function 2: determine whether the candidate item set v1 is contained in transaction v2

bool isin(const vector<string> &v1, const vector<string> &v2)
{
  int nsize1 = v1.size();
  int nsize2 = v2.size();

  for (int i = 0; i < nsize1; i++) // for1: every item of the candidate
  {
    bool m_bflag1 = false;
    for (int j = 0; j < nsize2; j++) // for2: search the transaction
    {
      if (v1[i] == v2[j])
      {
        m_bflag1 = true;
        break;
      }
    } // end of for2

    if (!m_bflag1)
      return false; // item v1[i] is missing from the transaction

  } // end of for1

  return true;
}
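
To show how a containment test of this kind might be combined with the map<string, int> storage when counting support, here is a minimal sketch. It assumes transactions are stored in the same "|"-separated format as item sets; splitItems, containsAll, countSupport, and keepFrequent are hypothetical names, and the minimum support is expressed as an absolute count, which the article does not specify.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Split "A|B|C" into {"A", "B", "C"} (hypothetical helper, single-character separator).
std::vector<std::string> splitItems(const std::string& itemset, char sep = '|') {
    std::vector<std::string> items;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = itemset.find(sep, start);
        if (pos == std::string::npos) {
            items.push_back(itemset.substr(start));
            break;
        }
        items.push_back(itemset.substr(start, pos - start));
        start = pos + 1;
    }
    return items;
}

// True if every item of the candidate also appears in the transaction.
bool containsAll(const std::vector<std::string>& candidate,
                 const std::vector<std::string>& transaction) {
    for (const std::string& item : candidate) {
        if (std::find(transaction.begin(), transaction.end(), item) == transaction.end())
            return false;
    }
    return true;
}

// One pass over the transactions: increment the count of each contained candidate.
void countSupport(std::map<std::string, int>& candidates,
                  const std::vector<std::string>& transactions) {
    for (const std::string& t : transactions) {
        const std::vector<std::string> titems = splitItems(t);
        for (auto& entry : candidates) {
            if (containsAll(splitItems(entry.first), titems)) ++entry.second;
        }
    }
}

// Keep only the candidates whose support count reaches minCount (assumed threshold).
std::map<std::string, int> keepFrequent(const std::map<std::string, int>& candidates,
                                        int minCount) {
    std::map<std::string, int> frequent;
    for (const auto& entry : candidates) {
        if (entry.second >= minCount) frequent.insert(entry);
    }
    return frequent;
}
```

Splitting each candidate key again for every transaction is wasteful; a real implementation would probably split each candidate once and cache the result. The sketch favors brevity over speed.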

 

With these two data structures and auxiliary functions, we believe that you can design your association rule mining algorithm.
