Calculating confidence with the FP association-rule algorithm, and a MapReduce design


Description: based on the source code of Mahout's FP-Growth implementation.

The project that calculates confidence for FP association rules is available for download (it is a standalone implementation only; there is no MapReduce code).

Calculating confidence with the FP association-rule algorithm is based on the following ideas:

1. First use the original FP-Tree association-rule mining to obtain all frequent itemsets and their supports. Note that this step must output all frequent itemsets without merging them, so the relevant FP-tree code has to be changed to output every frequent itemset at the appropriate steps. (PS: the Mahout FP-tree implementation was modified for this, and it is not certain that it outputs every frequent itemset.)

For example, take the following data as the original transaction set:

milk, eggs, bread, potato chips
eggs, popcorn, potato chips, beer
eggs, bread, potato chips
eggs, bread, popcorn, potato chips, beer
milk, bread, beer
eggs, bread, beer
milk, bread, potato chips
milk, eggs, bread, butter, potato chips
milk, eggs, butter, potato chips

2. This yields the full set of frequent itemsets, such as the following:

0,2,3=4
2,4=3
0,1,2,3=3
3=6
2=7
1,2=5
1=7
0=7
0,3=5
0,2=6
0,1=5
4=4
0,1,2=4
0,1,3=4
1,3=5
1,4=3
In the frequent itemsets above, the support follows the equals sign, and each frequent itemset is shown in encoded form. The coding is: {potato chips = 0, milk = 3, eggs = 2, bread = 1, beer = 4}. You can also see that the codes within each frequent itemset are arranged in ascending order.

Then calculate the confidence for each frequent itemset (only itemsets with 2 or more items are considered):

1) For a frequent n-itemset, look up the support of its forward (the forward is defined as its first n-1 items; for example, the forward of the frequent itemset 0,2,3 is 0,2). If a frequent n-itemset exists, then its forward, a frequent (n-1)-itemset, must also exist (given that the mined itemsets really are all the frequent itemsets, this rule must hold);

2) The confidence of the n-itemset is obtained by dividing its support by the support of its forward.
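As a concrete check against the itemsets listed above: the confidence of 0,2,3 (the rule 0,2 → 3) is support(0,2,3) / support(0,2) = 4/6 ≈ 0.667. A minimal Java sketch of this division (the map and method names here are illustrative, not part of Mahout's API):

```java
import java.util.HashMap;
import java.util.Map;

public class ConfidenceExample {
    /** Confidence of an n-itemset = support(n-itemset) / support(its forward). */
    static double confidence(Map<String, Long> supports, String itemset) {
        // the forward is everything before the last comma
        String forward = itemset.substring(0, itemset.lastIndexOf(','));
        return supports.get(itemset) * 1.0 / supports.get(forward);
    }

    public static void main(String[] args) {
        Map<String, Long> supports = new HashMap<>();
        supports.put("0,2", 6L);   // support of the forward {chips, eggs}
        supports.put("0,2,3", 4L); // support of {chips, eggs, milk}
        System.out.println(confidence(supports, "0,2,3")); // 4/6
    }
}
```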

3. The confidence of every frequent itemset can be calculated by the method in 2. But there is a problem: you can only calculate the confidence of 0,2,3 (the rule 0,2 → 3), not of 0,3,2 (the rule 0,3 → 2). In the FP algorithm, 0,2,3 and 0,3,2 are the same frequent itemset, but their confidences, once calculated, are different;

4. For the problem in 3, the following solution can be used.

For each record (containing n items), use each item in turn as the back (the back is defined as the last item). For example, the frequent itemset 0,1,2,3 can take 0, 1, 2, or 3 as its back. The outputs are the same frequent itemset with different confidences, while the remaining (forward) items keep their relative order. So for the frequent itemset 0,1,2,3=3, the output should be: 1,2,3,0=3; 0,2,3,1=3; 0,1,3,2=3; 0,1,2,3=3. For the same back, the order of the forward has no effect on the confidence (with 0 as the back, the forwards 2,1,3 and 3,2,1 have the same support), so with this expansion the algorithm's confidence output over the frequent itemsets is relatively complete. Applying this method to the frequent itemsets produced by the original FP tree yields the data needed by the calculation in 2.

5. From the output of 4 (which is still just frequent itemsets with supports, except that each itemset with more than one item has gone from 1 record to n records), computing the confidence of each frequent itemset is really a lookup between two tables: each n-itemset of table A is matched against the (n-1)-itemsets of table B to compute its confidence. Tables A and B are identical, so the idea maps naturally onto MapReduce; this is elaborated later.
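The expansion in step 4 can be sketched as follows: rotate each item of the itemset into the last ("back") position by swapping it with the current last item, keeping one variant per possible back. This is an illustrative sketch, not the author's code (his version appears later as generateFatPatterns):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExpandExample {
    /** Produce one variant of the itemset per possible back item. */
    static List<int[]> expand(int[] items) {
        List<int[]> out = new ArrayList<>();
        out.add(items.clone()); // original order: the last item is already the back
        for (int i = 0; i < items.length - 1; i++) {
            int[] copy = items.clone();
            int tmp = copy[i]; // swap item i into the back position
            copy[i] = copy[items.length - 1];
            copy[items.length - 1] = tmp;
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        // for 0,1,2,3 this yields 4 variants, each with a distinct back
        for (int[] v : expand(new int[]{0, 1, 2, 3})) {
            System.out.println(Arrays.toString(v));
        }
    }
}
```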


For the Mahout standalone FP implementation, the code to calculate confidence is mainly as follows:

1. Mahout's implementation interacts with HDFS files. Here it could instead interact with local files, or keep everything directly in memory. The implementation in this article stores everything in memory, though storing to file should also work (with an extra pass to filter repeated records). Some code in the standalone FP version was therefore changed. For example, the following code:
generateTopKFrequentPatterns(
    new TransactionIterator<A>(transactionStream, attributeIdMapping),
    attributeFrequency, minSupport, k, reverseMapping.size(), returnFeatures,
    new TopKPatternsOutputConverter<A>(output, reverseMapping),
    updater);
was changed to the following:
generateTopKFrequentPatterns(
    new TransactionIterator<A>(transactionStream, attributeIdMapping),
    attributeFrequency, minSupport, k, reverseMapping.size(), returnFeatures,
    reverseMapping);
There are many such changes inside the FP-tree code; they are not repeated here. For details, download this article's source code project.
2. Because the frequent itemsets finally output by Mahout's FP tree are consolidated (only the maximal frequent itemsets are output), while the algorithm described above requires all frequent itemsets, a call is added before the return in the function generateSinglePathPatterns(FPTree tree, int k, long minSupport) (the original applies the same change in two places with this signature):
addFrequentPatternMaxHeap(frequentPatterns);

The detailed code for this method is:
/**
 * Store all frequent itemsets
 * @param patternsOut
 */
private static void addFrequentPatternMaxHeap(FrequentPatternMaxHeap patternsOut) {
    String[] pStr = null;
    // the Pattern type is awkward to use directly here, so parse its string form for now
    for (Pattern p : patternsOut.getHeap()) {
        pStr = p.toString().split("-");
        if (pStr.length <= 0) {
            continue;
        }
        // clean up the string before storing it
        pStr[0] = pStr[0].replaceAll(" ", "");
        pStr[0] = pStr[0].substring(1, pStr[0].length() - 1);
        if (patterns.containsKey(pStr[0])) {
            if (patterns.get(pStr[0]) < p.support()) { // keep only the highest support
                patterns.remove(pStr[0]);
                patterns.put(pStr[0], p.support());
            }
        } else {
            patterns.put(pStr[0], p.support());
        }
    }
}
This operation is assumed to obtain the full set of frequent itemsets, which is stored in the static map variable patterns.


3. Following point 4 of the idea described above, expand the frequent itemsets generated by FP:
/**
 * Generate the expanded frequent itemsets (with supports) from the
 * original frequent itemsets
 */
public void generateFatPatterns() {
    int[] patternInts = null;
    for (String p : patterns.keySet()) {
        patternInts = getIntsFromPattern(p);
        if (patternInts.length == 1) { // frequent 1-itemset
            fatPatterns.put(String.valueOf(patternInts[0]), patterns.get(p));
        } else {
            putInts2FatPatterns(patternInts, patterns.get(p));
        }
    }
}

/**
 * Output each item of the array as the back, adding each variant to fatPatterns
 * @param patternInts
 * @param support
 */
private void putInts2FatPatterns(int[] patternInts, Long support) {
    String patternStr = ints2Str(patternInts);
    fatPatterns.put(patternStr, support); // this handles the last item as back
    for (int i = 0; i < patternInts.length - 1; i++) {
        // the last back was handled above; do not reuse the same array
        patternStr = ints2Str(swap(patternInts, i, patternInts.length - 1));
        fatPatterns.put(patternStr, support);
    }
}

4. Calculate the confidence for the expanded frequent itemsets output above:
public void savePatterns(String output, Map<String, Long> map) {
    // empty patternsMap first
    patternsMap.clear();
    String preItem = null;
    for (String p : map.keySet()) {
        // a 1-itemset has no forward, so skip it
        if (p.lastIndexOf(",") == -1) {
            continue;
        }
        // find the forward
        preItem = p.substring(0, p.lastIndexOf(","));
        if (map.containsKey(preItem)) {
            // map.get(p) is the itemset's support, map.get(preItem) the forward's support
            patternsMap.put(p, map.get(p) * 1.0 / map.get(preItem));
        }
    }
    FPTreeDriver.createFile(patternsMap, output);
}
Because the frequent itemsets and their supports are stored in a map, the calculation is straightforward.

The association rules generated from the expanded and the unexpanded frequent itemsets can be compared as follows:

Here you can see that with expansion the rules 1,3 and 3,1 are not the same: one confidence is 0.714, the other 0.8. This shows that the probability of inferring 3 from 1 is not as great as the probability of inferring 1 from 3. A simple everyday example: the probability of buying a remote control after buying a TV is certainly greater than that of buying a TV after buying a remote control.
However, if the expanded itemsets and their supports are too large to fit entirely into memory, what should be done?
This can be accomplished using the idea of MapReduce.


The MapReduce design for calculating confidence:

0. Assume the expanded frequent itemsets with their supports already exist on HDFS, as file A.
1. Copy file A to get file B.
2. Referring to "Hadoop multi-file format input" (http://blog.csdn.net/fansy1990/article/details/26267637), design two different mappers, one processing A and one processing B.

The mapper for A outputs each frequent itemset directly: the key is the itemset, and the value is its support plus label A. The mapper for B outputs only the frequent itemsets with more than one item: the key is the itemset's forward (defined as its first n-1 items), and the value is the itemset's back (defined as its last item) plus the support and label B.
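The two mapper outputs described above can be sketched as plain functions (this is an illustrative sketch of the record formats only; a real job would implement Hadoop's Mapper interface):

```java
public class MapperSketch {
    /** Mapper for file A: emit the itemset itself as key, "support+A" as value. */
    static String[] mapA(String itemset, long support) {
        return new String[]{itemset, support + "+A"};
    }

    /**
     * Mapper for file B: for itemsets with more than one item, emit the
     * forward (all but the last item) as key and "back+support+B" as value.
     */
    static String[] mapB(String itemset, long support) {
        int cut = itemset.lastIndexOf(',');
        if (cut == -1) return null; // a 1-itemset has no forward; emit nothing
        String forward = itemset.substring(0, cut);
        String back = itemset.substring(cut + 1);
        return new String[]{forward, back + "+" + support + "+B"};
    }

    public static void main(String[] args) {
        String[] a = mapA("0,2,3", 5);
        String[] b = mapB("0,2,3,4", 3);
        System.out.println(a[0] + " -> " + a[1]); // 0,2,3 -> 5+A
        System.out.println(b[0] + " -> " + b[1]); // 0,2,3 -> 4+3+B
    }
}
```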

3. The data from the two mappers in 2 is pooled in the reducer, which proceeds as follows: for each key, iterate over its value collection. A value carrying label A is stored in a variable that serves as the denominator; for each value carrying label B, divide its support by that denominator. The result is the confidence: the output key is the input key plus the B value's back, and the output value is the confidence.

For example, mapper A outputs <(0,2,3), 5+A>: (0,2,3) is a frequent itemset and serves as the key; 5+A is the value, where 5 is the support and A is the label. Mapper B outputs <(0,2,3), 4+3+B> for the frequent itemset (0,2,3,4): the key is the forward (0,2,3); in the value, 4 is the back, 3 is the support, and B is the label. Likewise it outputs <(0,2,3), 5+3+B> for the frequent itemset (0,2,3,5), where 5 is the back, 3 the support, B the label; and <(0,2,3), 6+2+B> for (0,2,3,6), where 6 is the back, 2 the support, B the label.

The data gathered at the reducer is then <(0,2,3), [(5+A), (4+3+B), (5+3+B), (6+2+B)]>. Iterating over the value collection: first the support carrying label A becomes the denominator, i.e. 5; then each B-labelled value is processed, e.g. for (4+3+B) its support is divided by the denominator, 3/5, and <(0,2,3,4), 3/5> is output. Continuing the iteration yields <(0,2,3,5), 3/5>, <(0,2,3,6), 2/5>, and so on.
Note that: 1) when a reducer key has only an A-labelled record, nothing is output, because that key is a maximal frequent itemset and cannot be the forward of any frequent itemset; 2) the values above would be better designed as a custom Writable type rather than a plain string; 3) the MapReduce code for calculating confidence is not implemented; the download provided with this article is the standalone version, which assumes the expanded frequent itemsets fit into memory.
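The reducer logic described above can be sketched without Hadoop as a plain function over one key and its value list (names and string formats are illustrative; a real job would implement Hadoop's Reducer interface and a custom Writable):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfidenceReducerSketch {
    /**
     * For one key (a forward itemset) and its values: take the A-labelled
     * support as the denominator, then for each B-labelled "back+support+B"
     * value emit the rule (key,back) with its confidence.
     */
    static Map<String, Double> reduce(String key, List<String> values) {
        Map<String, Double> out = new LinkedHashMap<>();
        long denominator = -1;
        for (String v : values) { // first pass: find the A-labelled support
            if (v.endsWith("+A")) {
                denominator = Long.parseLong(v.split("\\+")[0]);
            }
        }
        if (denominator <= 0) return out; // no A record for this key
        for (String v : values) { // second pass: one confidence per B value
            if (v.endsWith("+B")) {
                String[] parts = v.split("\\+"); // back, support, label
                out.put("(" + key + "," + parts[0] + ")",
                        Long.parseLong(parts[1]) * 1.0 / denominator);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("5+A", "4+3+B", "5+3+B", "6+2+B");
        System.out.println(reduce("0,2,3", values));
        // {(0,2,3,4)=0.6, (0,2,3,5)=0.6, (0,2,3,6)=0.4}
    }
}
```

A key with only the A-labelled record produces an empty map here, matching note 1) above.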

Share, grow, be happy

Reprint Please specify blog address: http://blog.csdn.net/fansy1990





