Calculating confidence with the FP association-rule algorithm, and a MapReduce design


Description: based on the FP algorithm source code in Mahout.

The project implementing the FP association-rule algorithm with confidence calculation can be downloaded (it is a standalone implementation only; there is no MapReduce code).

The idea for calculating confidence with the FP association-rule algorithm is as follows:

1. First use the original FP-tree association-rule algorithm to mine all frequent itemsets together with their supports. Note that every frequent itemset must be output, without merging; this requires modifying the FP-tree code so that all frequent itemsets are emitted at the relevant steps. (This article's implementation modifies the Mahout FP-tree code so that all frequent itemsets are output.)

For example, take the following data as the original transaction set:

milk, eggs, bread, potato chips
eggs, popcorn, potato chips, beer
eggs, bread, potato chips
milk, eggs, bread, popcorn, potato chips, beer
milk, bread, beer
eggs, bread, beer
milk, bread, potato chips
milk, eggs, bread, butter, potato chips
milk, eggs, butter, potato chips

2. Get all the frequent itemsets as follows:

0,2,3=4
2,4=3
0,1,2,3=3
3=6
2=7
1,2=5
1=7
0=7
0,3=5
0,2=6
0,1=5
4=4
0,1,2=4
0,1,3=4
1,3=5
1,4=3
In the frequent itemsets above, the number after the equals sign is the support. Each itemset is shown in encoded form, with the coding {potato chips = 0, bread = 1, eggs = 2, milk = 3, beer = 4}. Note that the items within each itemset are sorted in ascending order of their codes.
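As a sanity check, the listed supports can be reproduced by brute-force counting over the nine transactions. The sketch below is my own illustration, not part of the original project; popcorn and butter are omitted from the coding because they fall below the support threshold.

```java
import java.util.HashSet;
import java.util.Set;

public class SupportCheck {
    // The nine sample transactions, encoded with
    // {potato chips=0, bread=1, eggs=2, milk=3, beer=4}
    static final int[][] TRANSACTIONS = {
        {3, 2, 1, 0},    // milk, eggs, bread, potato chips
        {2, 0, 4},       // eggs, popcorn, potato chips, beer
        {2, 1, 0},       // eggs, bread, potato chips
        {3, 2, 1, 0, 4}, // milk, eggs, bread, popcorn, potato chips, beer
        {3, 1, 4},       // milk, bread, beer
        {2, 1, 4},       // eggs, bread, beer
        {3, 1, 0},       // milk, bread, potato chips
        {3, 2, 1, 0},    // milk, eggs, bread, butter, potato chips
        {3, 2, 0}        // milk, eggs, butter, potato chips
    };

    // Count how many transactions contain every item of the given itemset
    public static long support(int... itemset) {
        long count = 0;
        for (int[] t : TRANSACTIONS) {
            Set<Integer> items = new HashSet<>();
            for (int i : t) items.add(i);
            boolean all = true;
            for (int i : itemset) all &= items.contains(i);
            if (all) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(support(0, 2, 3)); // 4, matching 0,2,3=4 above
        System.out.println(support(1, 3));    // 5, matching 1,3=5
        System.out.println(support(4));       // 4, matching 4=4
    }
}
```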

Calculate the confidence for each frequent itemset (only itemsets with two or more items are considered):

1) For a frequent n-itemset, find the support of its prefix, defined as its first n-1 items; for example, the prefix of the frequent itemset 0,2,3 is 0,2. If a frequent n-itemset exists, its prefix (a frequent (n-1)-itemset) necessarily exists as well, provided step 1 really did output all frequent itemsets.

2) The confidence of the n-itemset is its support divided by the support of its prefix.
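Points 1) and 2) can be sketched in a few lines. This is my own illustration; the method name `confidence` and the string-keyed map layout are assumptions, not the project's API:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfidenceSketch {
    // Confidence of an itemset such as "0,2,3": its support divided by
    // the support of its prefix "0,2"
    public static double confidence(Map<String, Long> supports, String itemset) {
        int cut = itemset.lastIndexOf(',');
        if (cut == -1) {
            throw new IllegalArgumentException("need at least 2 items");
        }
        String prefix = itemset.substring(0, cut);
        // If the n-itemset is frequent its prefix is frequent too, so the
        // lookup cannot miss when the map holds all frequent itemsets
        return supports.get(itemset) * 1.0 / supports.get(prefix);
    }

    public static void main(String[] args) {
        Map<String, Long> supports = new HashMap<>();
        supports.put("0,2", 6L);
        supports.put("0,2,3", 4L);
        // prints support(0,2,3) / support(0,2) = 4/6
        System.out.println(confidence(supports, "0,2,3"));
    }
}
```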

3. Following point 2, a confidence can be computed for every frequent itemset, but there is a problem: this only gives the confidence of the itemset as mined, e.g. 0,2,3 (the rule {0,2} => {3}), and not of 0,3,2 (the rule {0,3} => {2}). In the FP algorithm, 0,2,3 and 0,3,2 are the same frequent itemset, but as rules their confidences differ.

4. Problem 3 can be solved as follows:

For each itemset (containing n items), take each item in turn as the last item (the last item is the rule's consequent; for the frequent itemset 0,1,2,3, each of 0, 1, 2 and 3 is used once as the last item, and the confidence of the corresponding rule is output), while the remaining items keep their relative order. For example, the frequent itemset 0,1,2,3=3 should produce:

1,2,3,0=3
0,2,3,1=3
0,1,3,2=3
0,1,2,3=3

Because support does not depend on the order of the items before the last one (with 0 as the last item, the prefixes 1,2,3 and 2,1,3 and 3,2,1 all give the same support), this output covers the confidence of every rule derivable from the frequent itemsets. Applying this expansion (point 4) to the original FP-tree output yields exactly the data that the calculation in point 2 needs.

5. Given the output of point 4 (still just frequent itemsets with supports, but with each original record expanded into n records, compared with the result of point 3), computing the confidence of each frequent itemset is really a lookup between two tables: table A's n-itemsets look up table B's (n-1)-itemsets to compute the n-itemset's confidence, and tables A and B are identical. This maps naturally onto MapReduce, as explained later.
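The expansion in point 4 can be sketched as follows (a standalone illustration with my own names, not the project's code): every item is moved to the last position once, while the remaining items keep their relative order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExpandSketch {
    // Return one copy of the itemset per choice of last item; the other
    // items keep their original relative order
    public static List<int[]> expand(int[] itemset) {
        List<int[]> out = new ArrayList<>();
        out.add(itemset.clone()); // the original order covers the original last item
        for (int i = 0; i < itemset.length - 1; i++) {
            int[] copy = itemset.clone();
            int moved = copy[i];
            // shift the items after position i left by one, then append item i
            System.arraycopy(copy, i + 1, copy, i, copy.length - 1 - i);
            copy[copy.length - 1] = moved;
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] p : expand(new int[]{0, 1, 2, 3})) {
            System.out.println(Arrays.toString(p));
        }
        // prints [0, 1, 2, 3], [1, 2, 3, 0], [0, 2, 3, 1], [0, 1, 3, 2]
    }
}
```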
Based on Mahout's single-machine FP implementation, the main code for computing the confidence is as follows:

1. Mahout's implementation interacts with files on HDFS, but it can just as well interact with local files or keep everything in memory. The implementation in this article stores everything directly in memory; writing to a file would also work, although duplicate records might then need to be filtered out. Some code in the FP implementation is therefore modified, for example the following call:
```java
generateTopKFrequentPatterns(
    new TransactionIterator<A>(transactionStream, attributeIdMapping),
    attributeFrequency,
    minSupport,
    k,
    reverseMapping.size(),
    returnFeatures,
    new TopKPatternsOutputConverter<A>(output, reverseMapping),
    updater);
```

is modified to:

```java
generateTopKFrequentPatterns(
    new TransactionIterator<A>(transactionStream, attributeIdMapping),
    attributeFrequency,
    minSupport,
    k,
    reverseMapping.size(),
    returnFeatures,
    reverseMapping);
```
There are many modifications of this kind in the FP-tree code; they are not all repeated here. See the source download for this article's project for details.
2. The frequent itemsets finally output by Mahout's FP-tree are consolidated, so some frequent itemsets are never output (only the maximal frequent itemsets are), while the algorithm described above needs all of them. Therefore, in generateSinglePathPatterns(FPTree tree, int k, long minSupport) and the other pattern-generation function modified in the source download, add one statement before the return:

```java
addFrequentPatternMaxHeap(frequentPatterns);
```

The specific code for this method is:

```java
/**
 * Stores all frequent itemsets
 * @param patternsOut
 */
private static void addFrequentPatternMaxHeap(FrequentPatternMaxHeap patternsOut) {
  // reading the Pattern fields directly is awkward here, so parse its
  // toString() output for now
  String[] pStr = null;
  for (Pattern p : patternsOut.getHeap()) {
    pStr = p.toString().split("-");
    if (pStr.length <= 0) {
      continue;
    }
    // strip spaces and the enclosing brackets to reduce storage
    pStr[0] = pStr[0].replaceAll(" ", "");
    pStr[0] = pStr[0].substring(1, pStr[0].length() - 1);
    if (patterns.containsKey(pStr[0])) {
      if (patterns.get(pStr[0]) < p.support()) {
        // keep only the largest support seen for this itemset
        patterns.remove(pStr[0]);
        patterns.put(pStr[0], p.support());
      }
    } else {
      patterns.put(pStr[0], p.support());
    }
  }
}
```
With this in place, all frequent itemsets are collected into the static map variable patterns.
3. Expand the frequent itemsets generated by the FP-tree, as described in point 4 of the idea above:

```java
/**
 * Expand each frequent itemset so that every item appears once as the last item
 */
public void generateFatPatterns() {
  int[] patternInts = null;
  for (String p : patterns.keySet()) {
    patternInts = getIntsFromPattern(p);
    if (patternInts.length == 1) {
      // a frequent 1-itemset needs no expansion
      fatPatterns.put(String.valueOf(patternInts[0]), patterns.get(p));
    } else {
      putInts2FatPatterns(patternInts, patterns.get(p));
    }
  }
}

/**
 * Output one copy of the itemset per choice of last item, adding each to fatPatterns
 * @param patternInts
 * @param support
 */
private void putInts2FatPatterns(int[] patternInts, long support) {
  String patternStr = ints2Str(patternInts);
  // the original order already covers the original last item
  fatPatterns.put(patternStr, support);
  for (int i = 0; i < patternInts.length - 1; i++) {
    // move item i to the end while keeping the remaining items in their
    // relative order; work on a copy so the shared array is not modified
    int[] rotated = patternInts.clone();
    System.arraycopy(patternInts, i + 1, rotated, i, patternInts.length - 1 - i);
    rotated[patternInts.length - 1] = patternInts[i];
    patternStr = ints2Str(rotated);
    fatPatterns.put(patternStr, support);
  }
}
```

4. Compute the confidence for the expanded frequent itemsets output above:

```java
public void savePatterns(String output, Map<String, Long> map) {
  // empty patternsMap first
  patternsMap.clear();
  String preItem = null;
  for (String p : map.keySet()) {
    // a 1-itemset has no prefix, so there is nothing to look up
    if (p.lastIndexOf(",") == -1) {
      continue;
    }
    // find the prefix
    preItem = p.substring(0, p.lastIndexOf(","));
    if (map.containsKey(preItem)) {
      // map.get(p) is the itemset's support, map.get(preItem) the prefix's support
      patternsMap.put(p, map.get(p) * 1.0 / map.get(preItem));
    }
  }
  FPTreeDriver.createFile(patternsMap, output);
}
```
This calculation is simple because the frequent itemsets and their supports all sit in a map. Comparing the association rules generated from the expanded and from the unexpanded frequent itemsets shows the difference:

Here you can see that with the expansion, the rules 1,3 and 3,1 are no longer the same: one confidence is 5/7 (about 0.714) and the other 5/6 (about 0.833). In other words, the chance of buying item 3 given item 1 differs from the chance of buying item 1 given item 3. A simple everyday example: the probability of buying a remote control after buying a TV set is larger than the probability of buying a TV set after buying a remote control.
But what if the expanded frequent itemsets and their supports are too large to fit entirely in memory?
This can be accomplished using the idea of MapReduce.
The MapReduce design for computing confidence is as follows:

0. Assume that the expanded frequent itemsets, together with their supports, already exist on HDFS as file A.
1. Copy file A to get file B.

2. Following "Hadoop multi-file format input" (http://blog.csdn.net/fansy1990/article/details/26267637), design two different mappers, one processing A and one processing B. The A mapper outputs each frequent itemset directly: the key is the itemset, the value is its support plus the label a. The B mapper processes only itemsets with more than one item: the key is the itemset's prefix (its first n-1 items), and the value is the itemset's last item plus its support plus the label b.

3. The records from the two mappers meet in the reducer, grouped by key. For each key, traverse its value list: an a-labelled value supplies the denominator; each b-labelled value is then handled by dividing its support by that denominator, which gives the confidence, and outputting the key plus the b value's last item as the output key, with the confidence as the output value.

For example, mapper A outputs <(0,2,3), 5+a>, where the key (0,2,3) is a frequent itemset and the value 5+a carries its support 5 and the label a. Mapper B outputs:

<(0,2,3), 4+3+b>: from the frequent itemset (0,2,3,4); the key is its prefix (0,2,3), and the value carries the last item 4, the support 3, and the label b;
<(0,2,3), 5+3+b>: from the frequent itemset (0,2,3,5); the key is its prefix (0,2,3), and the value carries the last item 5, the support 3, and the label b;
<(0,2,3), 6+2+b>: from the frequent itemset (0,2,3,6); the key is its prefix (0,2,3), and the value carries the last item 6, the support 2, and the label b.

The reducer then receives <(0,2,3), [(5+a), (4+3+b), (5+3+b), (6+2+b)]>. Traversing the value list, the a-labelled support becomes the denominator, 5. Each b-labelled value is then processed: for (4+3+b), since it carries the label b, its support 3 is divided by the denominator 5 and <(0,2,3,4), 3/5> is output; likewise <(0,2,3,5), 3/5> and <(0,2,3,6), 2/5>, and so on.
A few notes: 1) when a key's values contain only the a label, nothing needs to be output, because such an itemset is maximal and cannot be the prefix of any longer frequent itemset; 2) the values above are better designed as a custom Writable type rather than plain strings; 3) the MapReduce code for the confidence computation is not implemented; the download provided with this article is the standalone version, which assumes the expanded frequent itemsets fit in memory.
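The reducer logic described above can be simulated in plain Java. This is a sketch with my own names, not actual Hadoop code; in a real job this logic would live inside a Reducer's reduce() method:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfidenceReducerSketch {
    // values labelled "a" carry the key itemset's own support (the
    // denominator); values labelled "b" carry lastItem+support of a longer
    // itemset whose prefix is the key. Output: (key,lastItem) -> confidence.
    public static Map<String, Double> reduce(String key, List<String> values) {
        long denominator = -1;
        // first pass: find the single a-labelled record
        for (String v : values) {
            String[] parts = v.split("\\+");
            if (parts[parts.length - 1].equals("a")) {
                denominator = Long.parseLong(parts[0]);
            }
        }
        Map<String, Double> out = new LinkedHashMap<>();
        if (denominator <= 0) return out; // no a record: nothing to divide by
        // second pass: emit one confidence per b-labelled record
        for (String v : values) {
            String[] parts = v.split("\\+");
            if (parts[parts.length - 1].equals("b")) {
                String lastItem = parts[0];
                long support = Long.parseLong(parts[1]);
                out.put(key + "," + lastItem, support * 1.0 / denominator);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> out = reduce("0,2,3",
            Arrays.asList("5+a", "4+3+b", "5+3+b", "6+2+b"));
        System.out.println(out); // {0,2,3,4=0.6, 0,2,3,5=0.6, 0,2,3,6=0.4}
    }
}
```

Note that a key whose value list holds only the a-labelled record (a maximal itemset) produces no output, matching note 1) above.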

Share, grow, be happy

When reprinting, please cite the original blog: http://blog.csdn.net/fansy1990





