Source code analysis of FPGrowthDriver for mahout Association Rules

Source: Internet
Author: User

First of all, much of part 2 of my previous article on the source code of the Mahout association rules was incorrect, so I am rewriting it here.

Run the following command on the command line to print the usage of FPGrowthDriver, the Mahout association rule driver:

[java]
bin/hadoop jar $mahout_home/core/target/mahout-core-0.7-job.jar org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -h

1. Open the source file of FPGrowthDriver. Its main operation is to call the runPFPGrowth method of PFPGrowth. Because we only consider the parallel version, set the -method parameter to mapreduce, which takes the following branch:

[java]
else if ("mapreduce".equalsIgnoreCase(classificationMethod)) {
  Configuration conf = new Configuration();
  HadoopUtil.delete(conf, outputDir);
  PFPGrowth.runPFPGrowth(params);
}

Control then passes from the driver to PFPGrowth, that is, FPGrowthDriver --> PFPGrowth.

2. Open the PFPGrowth source file and look at the runPFPGrowth method. It performs the following four operations:

[java]
// 2.1
startParallelCounting(params, conf);
// 2.2
// save feature list to dcache
List<Pair<String,Long>> fList = readFList(params);
saveFList(fList, params, conf);
// 2.3
startParallelFPGrowth(params, conf);
// 2.4
startAggregating(params, conf);

Operations 2.1 and 2.2 were described in part 1 of the source code analysis of the Mahout association rules.

2.3 This step starts a Job whose Mapper, Combiner, and Reducer are ParallelFPGrowthMapper, ParallelFPGrowthCombiner, and ParallelFPGrowthReducer:

[java]
job.setMapperClass(ParallelFPGrowthMapper.class);
job.setCombinerClass(ParallelFPGrowthCombiner.class);
job.setReducerClass(ParallelFPGrowthReducer.class);

The original data is listed in Table 1. The -g parameter of FPGrowthDriver is set to 2, that is, two groups.

Table 1
[html]
milk, eggs, bread, potato chips
eggs, popcorn, potato chips, beer
eggs, bread, potato chips
milk, eggs, bread, popcorn, potato chips, beer
milk, bread, beer
eggs, bread, beer
milk, bread, potato chips
milk, eggs, bread, butter, potato chips
milk, eggs, butter, potato chips
milk 1, eggs 1, bread 1, potato chips 1
eggs 1, popcorn 1, potato chips 1, beer 1
eggs 1, bread 1, potato chips 1
milk 1, eggs 1, bread 1, popcorn 1, potato chips 1, beer 1
milk 1, bread 1, beer 1
eggs 1, bread 1, beer 1
milk 1, bread 1, potato chips 1
milk 1, eggs 1, bread 1, butter 1, potato chips 1
milk 1, eggs 1, butter 1, potato chips 1

2.3.1 Main operations of ParallelFPGrowthMapper:

2.3.1.1 The setup function of ParallelFPGrowthMapper. This function reads the global fList (see the FP tree article on the Mahout association rules: Parallel FP-Growth for Query Recommendation) and stores it in a Map:

[java]
int i = 0;
for (Pair<String,Long> e : PFPGrowth.readFList(context.getConfiguration())) {
  fMap.put(e.getFirst(), i++);
}

For the raw data, Table 2 lists each item with its code and its number of occurrences (fMap itself stores item name -> code). Codes start at 0 and increase by 1, with items taken in descending order of the number of occurrences:

Table 2
[html]
item              code   count
potato chips      0      7
potato chips 1    1      7
bread             2      7
bread 1           3      7
eggs              4      7
eggs 1            5      7
milk              6      6
milk 1            7      6
beer              8      4
beer 1            9      4

2.3.1.2 The map function of ParallelFPGrowthMapper. This function has two main parts.

Part 1: each transaction of the raw data is output as codes in fMap order, and items that do not appear in fMap are deleted. For example, for

[html]
eggs, bread, potato chips

the output should be [0, 2, 4], and for

[html]
milk 1, eggs 1, bread 1, popcorn 1, potato chips 1, beer 1

the output should be [1, 3, 5, 7, 9].

Part 2: how is the output above assigned to groups?
numGroups was set to 2 above, that is, two groups. The numGroups parameter applies to the fList: the fList is divided into several groups so that they can be processed in parallel. Codes 0~4 form the first group, whose id is 0, and codes 5~9 form the second group, whose id is 1.

If no item in the output of the first part exceeds the last code of the first group (4 in this example), only one record is output, the transaction itself. For example, for [0, 2, 4] the key of the record output by map is the group id 0 and the value is [0, 2, 4].

Otherwise two records are output. For example, for [1, 3, 5, 7, 9], one record is the transaction itself, with key 1 and value [1, 3, 5, 7, 9]; the other record has key 0 and value [1, 3]. In other words, the output of the first part is cut at the group boundary, and each group receives the items up to and including its own last item. For example, the output for [0, 2, 4, 6] is:

[html]
0 [0, 2, 4]
1 [0, 2, 4, 6]

Applying this to all of the original data gives the complete map output:

Table 3
[map output for all transactions; table contents not preserved]

The value that map outputs is a TransactionTree built with the following constructor, in which transactionSet is an attribute of TransactionTree (this is the initial form of a TransactionTree):

[java]
public TransactionTree(IntArrayList items, Long support) {
  representedAsList = true;
  transactionSet = Lists.newArrayList();
  transactionSet.add(new Pair<IntArrayList,Long>(items, support));
}

2.3.2 The reduce function of ParallelFPGrowthCombiner:

[java]
TransactionTree cTree = new TransactionTree();
for (TransactionTree tr : values) {
  for (Pair<IntArrayList,Long> p : tr) {
    cTree.addPattern(p.getFirst(), p.getSecond());
  }
}
context.write(key, cTree.getCompressedTree());

Here a new TransactionTree is created, the records of the same group (group id) are put into it with the addPattern method, and finally getCompressedTree returns a compressed TransactionTree, which is output.

The attributes of TransactionTree include:

[java]
int[] attribute;
int[] childCount;
int[][] nodeChildren;
long[] nodeCount;
int nodes;
boolean representedAsList;
List<Pair<IntArrayList,Long>> transactionSet;

For example, three records (the second ending in item 4 and the third in item 8) are added in turn with addPattern. The effect is as follows:

[Figure: tree after adding record 1 and record 2]
[Figure: tree after adding record 3]

When records are added with the addPattern method, the representedAsList attribute of the TransactionTree is false, transactionSet is null, and the other attributes hold the corresponding values.

The method above creates one TransactionTree per group id, so two TransactionTrees are created for the Table 3 data. Each TransactionTree is then compressed with getCompressedTree. The compression replaces the array representation with a list representation. After compression the representedAsList attribute is true and transactionSet is not null but holds the following values:

[html]
id: 0, value: {([1], 2) ([1, 3], 5) ([2], 1) ([2, 4], 1) ([0, 2], 1) ([0, 2, 4], 4) ([0, 4], 2) ([3], 2)}
id: 1, value: {([0, 2, 4, 6], 2) ([0, 2, 4, 6, 8], 1) ([0, 2, 6], 1) ([0, 4, 8], 1) ([0, 4, 6], 1) ([2, 6, 8], 1) ([2, 4, 8], 1) ([1, 5, 9], 1) ([1, 5, 7], 1) ([1, 3, 5], 1) ([1, 3, 5, 7], 2) ([1, 3, 5, 7, 9], 1) ([1, 3, 7], 1) ([3, 7, 9], 1) ([3, 5, 9], 1)}

In effect, the result above is the number of times each transaction appears in Table 3.
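The encoding, grouping, and counting steps of 2.3.1.2 and 2.3.2 can be sketched in plain Java without any Hadoop or Mahout classes. This is only an illustration, not the Mahout implementation: the class name GroupShardSketch and its methods are hypothetical, and the group id is assumed to be the code divided by the group size (5 here), which reproduces the 0~4 and 5~9 split described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupShardSketch {

    /** Part 1: encode a transaction with fMap, drop unknown items, sort the codes ascending. */
    static List<Integer> encode(List<String> transaction, Map<String, Integer> fMap) {
        List<Integer> codes = new ArrayList<>();
        for (String item : transaction) {
            Integer code = fMap.get(item);
            if (code != null) {
                codes.add(code);
            }
        }
        Collections.sort(codes);
        return codes;
    }

    /**
     * Part 2: scan the encoded transaction from the last item to the first. The first
     * time a group id is seen, the prefix ending at that item is emitted for that group.
     */
    static Map<Integer, List<Integer>> shards(List<Integer> encoded, int maxPerGroup) {
        Map<Integer, List<Integer>> out = new TreeMap<>();
        for (int j = encoded.size() - 1; j >= 0; j--) {
            int groupId = encoded.get(j) / maxPerGroup; // assumption: group = code / group size
            if (!out.containsKey(groupId)) {
                out.put(groupId, new ArrayList<>(encoded.subList(0, j + 1)));
            }
        }
        return out;
    }

    /** Net effect of the combiner: count identical records per group id. */
    static Map<Integer, Map<List<Integer>, Long>> aggregate(List<List<Integer>> encodedTransactions,
                                                            int maxPerGroup) {
        Map<Integer, Map<List<Integer>, Long>> counts = new TreeMap<>();
        for (List<Integer> t : encodedTransactions) {
            for (Map.Entry<Integer, List<Integer>> e : shards(t, maxPerGroup).entrySet()) {
                counts.computeIfAbsent(e.getKey(), k -> new LinkedHashMap<>())
                      .merge(e.getValue(), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Item codes taken from Table 2.
        String[] items = {"potato chips", "potato chips 1", "bread", "bread 1", "eggs",
                          "eggs 1", "milk", "milk 1", "beer", "beer 1"};
        Map<String, Integer> fMap = new LinkedHashMap<>();
        for (int i = 0; i < items.length; i++) {
            fMap.put(items[i], i);
        }
        List<Integer> encoded = encode(Arrays.asList(
                "milk 1", "eggs 1", "bread 1", "popcorn 1", "potato chips 1", "beer 1"), fMap);
        System.out.println(encoded);            // [1, 3, 5, 7, 9]
        System.out.println(shards(encoded, 5)); // {0=[1, 3], 1=[1, 3, 5, 7, 9]}
    }
}
```

Running aggregate over the encoded transactions of Table 1 yields the same per-group counts as the compressed trees listed above, for example ([0, 2, 4], 4) for id 0.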
2.3.3 ParallelFPGrowthReducer:

2.3.3.1 The setup function. Like the setup function of the Mapper, it reads the fList file, but here the items are read into List<String> featureReverseMap and the corresponding frequencies into LongArrayList freqList.

2.3.3.2 The reduce function. It first generates a localFList, that is, gList, with the generateFList method of TransactionTree. This method lists every item contained in a TransactionTree together with the frequency of that item in the TransactionTree. For example, the gLists generated from the two TransactionTrees, with the corresponding frequencies, are [(0, 7), (1, 7), (2, 7), (3, 7), (4, 7)] and [(1, 7), ..., (9, 4)]. The generateTopKFrequentPatterns method of FPGrowth is then called:

[java]
FPGrowth<Integer> fpGrowth = new FPGrowth<Integer>();
fpGrowth.generateTopKFrequentPatterns(...);

Open the source code of FPGrowth and find the following method:

[java]
public final void generateTopKFrequentPatterns(Iterator<Pair<List<A>,Long>> transactionStream,
                                               Collection<Pair<A,Long>> frequencyList,
                                               long minSupport,
                                               int K,
                                               Collection<A> returnableFeatures,
                                               OutputCollector<A,List<Pair<List<A>,Long>>> output,
                                               StatusUpdater updater)

First, the fList is replaced by the gList. For example, for the second TransactionTree the gList, with items sorted in descending order of frequency, is:

[html]
[(1, 7), (3, 7), (5, 7), (0, 6), (2, 6), (4, 6), (6, 6), (7, 6), (8, 4), (9, 4)]

The items are then recoded according to their position in this local gList (old code = new code):

[html]
{0 = 3, 1 = 0, 2 = 4, 3 = 1, 4 = 5, 5 = 2, 6 = 6, 7 = 7, 8 = 8, 9 = 9}

So the original record {[0, 2, 4, 6], 2} is changed to {[3, 4, 5, 6], 2}.

This function then calls generateTopKFrequentPatterns. At first I thought it was calling itself again (this is where my earlier explanation went wrong). In fact it calls another method with the same name but different input parameters. That method introduces FPTree, which will be analyzed in the next article.
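The local recoding in 2.3.3.2 can be checked with a short stand-alone Java sketch. It is not Mahout code: the class LocalRecodeSketch and its methods are hypothetical, the input is the compressed tree for id 1 listed in 2.3.2, and ties in frequency are assumed to be broken by ascending original code, which is the order that reproduces the mapping shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalRecodeSketch {

    /** The (pattern, count) pairs of the compressed tree for group id 1. */
    static Map<List<Integer>, Long> secondTree() {
        Map<List<Integer>, Long> t = new LinkedHashMap<>();
        t.put(Arrays.asList(0, 2, 4, 6), 2L);
        t.put(Arrays.asList(0, 2, 4, 6, 8), 1L);
        t.put(Arrays.asList(0, 2, 6), 1L);
        t.put(Arrays.asList(0, 4, 8), 1L);
        t.put(Arrays.asList(0, 4, 6), 1L);
        t.put(Arrays.asList(2, 6, 8), 1L);
        t.put(Arrays.asList(2, 4, 8), 1L);
        t.put(Arrays.asList(1, 5, 9), 1L);
        t.put(Arrays.asList(1, 5, 7), 1L);
        t.put(Arrays.asList(1, 3, 5), 1L);
        t.put(Arrays.asList(1, 3, 5, 7), 2L);
        t.put(Arrays.asList(1, 3, 5, 7, 9), 1L);
        t.put(Arrays.asList(1, 3, 7), 1L);
        t.put(Arrays.asList(3, 7, 9), 1L);
        t.put(Arrays.asList(3, 5, 9), 1L);
        return t;
    }

    /** Local item frequencies: for every item, sum the counts of the patterns containing it. */
    static Map<Integer, Long> itemFrequencies(Map<List<Integer>, Long> patterns) {
        Map<Integer, Long> freq = new TreeMap<>();
        for (Map.Entry<List<Integer>, Long> e : patterns.entrySet()) {
            for (int item : e.getKey()) {
                freq.merge(item, e.getValue(), Long::sum);
            }
        }
        return freq;
    }

    /** Old code -> new code; new codes follow descending frequency (ties: ascending old code). */
    static Map<Integer, Integer> recodeMap(Map<Integer, Long> freq) {
        List<Integer> items = new ArrayList<>(freq.keySet());
        items.sort((a, b) -> {
            int byFreq = Long.compare(freq.get(b), freq.get(a));
            return byFreq != 0 ? byFreq : Integer.compare(a, b);
        });
        Map<Integer, Integer> mapping = new TreeMap<>();
        for (int i = 0; i < items.size(); i++) {
            mapping.put(items.get(i), i);
        }
        return mapping;
    }

    /** Applies the mapping to one transaction and sorts the new codes ascending. */
    static List<Integer> recode(List<Integer> transaction, Map<Integer, Integer> mapping) {
        List<Integer> out = new ArrayList<>();
        for (int item : transaction) {
            out.add(mapping.get(item));
        }
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> mapping = recodeMap(itemFrequencies(secondTree()));
        System.out.println(mapping);
        // {0=3, 1=0, 2=4, 3=1, 4=5, 5=2, 6=6, 7=7, 8=8, 9=9}
        System.out.println(recode(Arrays.asList(0, 2, 4, 6), mapping)); // [3, 4, 5, 6]
    }
}
```

Recoding by local frequency keeps the most frequent items of the group at the smallest codes, which is the ordering an FP-tree normally relies on.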
