FP Association Rules Mining

Blog categories: Hadoop, MapReduce

Last year the company split 1 into 4, then into 3, and then into 25. It really was like the 72 transformations. A nobody like me watched with a burst of dread, but a year later it has not affected my colleagues or my work, and I have not heard any negative news. Nice; it looks like the split was handled at a high level. One big consequence of splitting into 25 is that front-end traffic is bound to be divided, which must be a tangled problem. That is getting off topic, so I will stop.

This year my technical direction has leaned more toward BI and algorithms, which is also my personal interest. The team focuses on CRM, although the term mentioned more often now is CEM, as if you would be embarrassed to greet people while still talking about CRM. To improve the user experience, we are building something for user behavior analysis. The idea is to collect user behavior in order to serve members better, and one of the landing points is to infer, from a member's status and behavior, the purpose of an incoming call, that is, what the problem is.

There are three mainstream association rule algorithms: Apriori, the partition-based algorithm, and FP-growth. Each has its own focus. Encyclopedia entry: http://baike.baidu.com/view/1076817.htm


The basic idea of all three is to find frequent itemsets. Apriori finds them iteratively, which is inefficient, but the idea is very clear. The partition-based algorithm optimizes Apriori's performance. FP-growth is the algorithm designed by Jiawei Han; it only needs to scan the database twice, which is a big performance improvement. Mahout also has an implementation of it, and Mahout supports MapReduce well, so I chose it.
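
To make the "two scans" concrete, here is a minimal single-machine sketch in Java. It is not Mahout code: the class `FpTreeSketch`, its `Node` type, and the alphabetical tie-break are assumptions made for illustration. Scan 1 counts item support; scan 2 inserts each transaction, filtered and sorted by descending support, as a path in a prefix tree so that shared prefixes are stored once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Mahout's implementation.
final class FpTreeSketch {

    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        int count; // number of transactions whose path passes through this node
    }

    static Node build(List<List<String>> db, int minSupport) {
        // Scan 1: count the transactions that contain each item.
        Map<String, Integer> support = new HashMap<>();
        for (List<String> transaction : db) {
            for (String item : new LinkedHashSet<>(transaction)) {
                support.merge(item, 1, Integer::sum);
            }
        }
        // Scan 2: keep frequent items, sort them, insert them as one path.
        Node root = new Node();
        for (List<String> transaction : db) {
            List<String> items = new ArrayList<>();
            for (String item : new LinkedHashSet<>(transaction)) {
                if (support.get(item) >= minSupport) {
                    items.add(item);
                }
            }
            // Descending support; ties broken alphabetically (an assumption).
            items.sort((a, b) -> {
                int bySupport = Integer.compare(support.get(b), support.get(a));
                return bySupport != 0 ? bySupport : a.compareTo(b);
            });
            Node node = root;
            for (String item : items) {
                node = node.children.computeIfAbsent(item, k -> new Node());
                node.count++;
            }
        }
        return root;
    }

    // Number of nodes below the given node.
    static int size(Node node) {
        int n = 0;
        for (Node child : node.children.values()) {
            n += 1 + size(child);
        }
        return n;
    }

    public static void main(String[] args) {
        List<List<String>> db = Arrays.asList(
            Arrays.asList("bread", "milk"),
            Arrays.asList("bread", "milk", "beer"),
            Arrays.asList("bread", "eggs"));
        System.out.println("nodes in tree: " + size(build(db, 2)));
    }
}
```

With a minimum support of 2, only bread and milk survive in this toy database, so the three transactions collapse into a single bread -> milk path.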





One, environment


FP wiki: https://cwiki.apache.org/confluence/display/MAHOUT/Parallel+Frequent+Pattern+Mining


Hadoop 0.20.2 + Mahout 0.5 + JDK 1.6. Different versions are incompatible with each other; I have stepped into this pit myself. For the pitfalls and the installation steps, see my previous article. Many problems that come up during installation can be solved with a Baidu search, but quite a few require Google and the foreign forums. Every time Baidu did not help I turned to Google, and the explanations on the foreign forums were usually quite brief but solved the problem. So I strongly recommend checking the foreign forums first when searching. It feels like the problems people hit domestically are always more varied.


Second, before getting to the algorithm, a word on how to configure Eclipse for debugging:


1, Install the MapReduce plug-in for Eclipse. Just find one online and install it; it should not depend on the Hadoop version.


2, Next, configure the plug-in with your Hadoop information. Because it has to interact with the Hadoop environment, it needs to know the NameNode listening port and the JobTracker listening port. If you have forgotten your own configuration, look in the files below. The configuration files differ between Hadoop versions; mine is Hadoop 0.20.2 (this is another pit).


core-site.xml file: fs.default.name


mapred-site.xml file: mapred.job.tracker


After that you can run and debug in Eclipse.


Two, algorithm logic


The program's main entry point is FPGrowthDriver, which is really just a launcher class. It parses the input parameters, such as input and output, and chooses stand-alone or distributed computation according to the method parameter. For the specific parameters, see the Mahout FP-tree wiki page (otherwise you will be told that the input parameters are incorrect). I set method to mapreduce; the code is as follows:


// Choose stand-alone or distributed execution based on the method parameter.
if ("sequential".equalsIgnoreCase(classificationMethod)) {
  runFPGrowth(params);
} else if ("mapreduce".equalsIgnoreCase(classificationMethod)) {
  Configuration conf = new Configuration();
  HadoopUtil.delete(conf, outputDir); // remove any previous output first
  PFPGrowth.runPFPGrowth(params);
}


PFPGrowth.runPFPGrowth: the main computational logic is all in this method. It calls 5 methods, so the computation can be divided into 5 steps, and I explain below what each step does. I referred to another blog earlier; it covers the first few steps in great detail, with pictures, but leaves many details unmentioned. Blog address: http://www.cnblogs.com/zhangchaoyang/articles/2198946.html


1, Count


startParallelCounting: this is a word count that counts each element in the DB. Calling it "count" makes it easy to remember and understand. The output is


fList, e.g. [(potato chips, 7), (bread, 7), (eggs, 7), (milk, 5), (beer, 4)]. It is used later, so these variable names are best understood and remembered.
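
As a rough single-machine illustration of this counting step, the sketch below counts items and returns them sorted by descending count. The class `CountingSketch`, the method `buildFList`, and the `minSupport` cut-off are hypothetical names used for illustration; the real step runs as a MapReduce job.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the counting step, not Mahout's MapReduce job.
final class CountingSketch {

    // Returns (item, count) pairs with count >= minSupport, highest count first.
    static List<Map.Entry<String, Long>> buildFList(List<List<String>> db, long minSupport) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (List<String> transaction : db) {
            for (String item : transaction) {
                counts.merge(item, 1L, Long::sum); // the "word count" part
            }
        }
        List<Map.Entry<String, Long>> fList = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() >= minSupport) {
                fList.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            }
        }
        // List.sort is stable, so equal counts keep their first-seen order.
        fList.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return fList;
    }

    public static void main(String[] args) {
        List<List<String>> db = Arrays.asList(
            Arrays.asList("milk", "bread"),
            Arrays.asList("bread", "beer"),
            Arrays.asList("bread", "milk"),
            Arrays.asList("eggs"));
        System.out.println(buildFList(db, 2)); // prints [bread=3, milk=2]
    }
}
```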


2, Group


startGroupingItems: splits fList into n groups, each holding at most maxPerGroup elements, where maxPerGroup = fList.size() / numGroups.


The grouped data goes into gList, e.g. {potato chips = 0, milk = 3, eggs = 2, bread = 1, beer = 4}, where the number is the group ID.


This step does not use MapReduce; it completes locally.

gList is the mapping from each fList element (or the element's number) to its group.
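
The grouping can be sketched as follows. This is a guess at the logic rather than the actual Mahout source: it assumes items are assigned to groups in fList order (which matches the gList example above) and that maxPerGroup is rounded up when fList.size() is not divisible by numGroups. `GroupingSketch` and `buildGList` are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the grouping step.
final class GroupingSketch {

    // fListItems: item names in fList order. numGroups must be positive.
    static Map<String, Integer> buildGList(List<String> fListItems, int numGroups) {
        int maxPerGroup = fListItems.size() / numGroups;
        // Assumption: round up so a remainder does not create extra groups.
        if (maxPerGroup * numGroups != fListItems.size()) {
            maxPerGroup++;
        }
        Map<String, Integer> gList = new LinkedHashMap<>();
        for (int i = 0; i < fListItems.size(); i++) {
            gList.put(fListItems.get(i), i / maxPerGroup); // group ID
        }
        return gList;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("potato chips", "bread", "eggs", "milk", "beer");
        System.out.println(buildGList(items, 5));
    }
}
```

With five items and five groups, every item gets its own group, numbered in fList order, which reproduces the gList shown above.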


3, Code


startTransactionSorting: numbers the items, removes duplicates, and sorts. Input and output look like the following, except that [0, 1, 2] is not an array but a TransactionTree. Up to this point the logic is still very clear; building the tree structure starts in step four.


milk, eggs, bread, potato chips ->> single-branch tree [0, 1, 2]
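
A minimal sketch of the numbering, de-duplication, and sorting is shown below. It assumes an item's number is its position in fList and that items missing from fList are dropped; it returns a plain list rather than Mahout's TransactionTree. `EncodingSketch` and `encode` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of the encoding step.
final class EncodingSketch {

    static List<Integer> encode(List<String> transaction, List<String> fListItems) {
        TreeSet<Integer> ids = new TreeSet<>(); // removes duplicates and sorts
        for (String item : transaction) {
            int id = fListItems.indexOf(item); // assumed: number = position in fList
            if (id >= 0) {
                ids.add(id); // items not in fList are skipped
            }
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        List<String> fListItems = Arrays.asList("potato chips", "bread", "eggs", "milk", "beer");
        List<String> transaction = Arrays.asList("eggs", "bread", "potato chips", "bread");
        System.out.println(encode(transaction, fListItems)); // prints [0, 1, 2]
    }
}
```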


4, Tree


startParallelFPGrowth


Map: reads the trees output by step three and splits each one into multiple trees, for example [0, 1, 2] --> [0, 1, 2], [0, 1], [0]. For each of these the output key is the groupId (for [0, 1, 2], the groupId that item 2 maps to) and the value is the corresponding tree.


Reducer: builds the step-three outputs that share the same groupId into one corresponding tree. It then iterates through the header entries and recursively finds all parent nodes in the tree, so that one element in the header entries corresponds to multiple paths. These paths are then taken as input to step three, iterating over the ...
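
The map-side split can be sketched like this. It follows the usual parallel FP-growth idea of walking a transaction from right to left and emitting, for each group seen for the first time, the prefix ending at that item. Whether Mahout 0.5 does exactly this is an assumption; `GroupSplitSketch`, `split`, and the `itemToGroup` array are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the mapper's split, not Mahout's mapper class.
final class GroupSplitSketch {

    // itemToGroup[i] is the group ID of item number i (the gList from step two).
    static Map<Integer, List<Integer>> split(List<Integer> transaction, int[] itemToGroup) {
        Map<Integer, List<Integer>> out = new LinkedHashMap<>();
        for (int j = transaction.size() - 1; j >= 0; j--) {
            int groupId = itemToGroup[transaction.get(j)];
            // The first hit from the right is the longest prefix for this group.
            if (!out.containsKey(groupId)) {
                out.put(groupId, new ArrayList<>(transaction.subList(0, j + 1)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] oneItemPerGroup = {0, 1, 2, 3, 4};
        System.out.println(split(Arrays.asList(0, 1, 2), oneItemPerGroup));
    }
}
```

With one item per group, [0, 1, 2] yields [0, 1, 2] for group 2, [0, 1] for group 1, and [0] for group 0, matching the example above. When several items share a group, only the longest prefix is emitted for that group.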


5, Mining


startAggregating: based on the output of step four, outputs the top K frequent patterns.
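
One generic way to keep only the top K patterns is a bounded min-heap, sketched below. This shows the top-K idea only and makes no claim about how Mahout aggregates internally; `TopKSketch`, `topK`, and the string encoding of patterns are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch of a top-K selection by support.
final class TopKSketch {

    static List<Map.Entry<String, Long>> topK(Map<String, Long> patternSupport, int k) {
        Comparator<Map.Entry<String, Long>> bySupport =
            (a, b) -> Long.compare(a.getValue(), b.getValue());
        // Min-heap: the weakest pattern kept so far sits at the head.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(bySupport);
        for (Map.Entry<String, Long> e : patternSupport.entrySet()) {
            heap.add(e);
            if (heap.size() > k) {
                heap.poll(); // evict the pattern with the lowest support
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(bySupport.reversed()); // highest support first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> support = new HashMap<>();
        support.put("bread", 7L);
        support.put("bread eggs", 6L);
        support.put("bread milk", 5L);
        support.put("beer", 4L);
        System.out.println(topK(support, 2));
    }
}
```

The heap never holds more than K + 1 entries, so memory stays bounded however many patterns step four produces.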








PS: I am pressed for time; today I still have to set up the environment and do ETL, so I will write this much for now and update later. I realize I have not written in as much detail as the blog I referenced, and that one has diagrams. I will add a diagram later and extract some of my code notes, which should make things clearer and easier for readers. Execution, execution! The picture will come. Calling it a day and going home ... 2013-03-30