FP Association Rules Mining
Blog categories: Hadoop, MapReduce

Last year the company reorganized: 1 unit was split into 4, then into 3, then into 25, with some 72 personnel moves. Watching it happen I felt a wave of dread, but a year later it has affected neither my colleagues' work nor mine, and I have heard no bad news, so apparently it was handled at a high level. One side effect of splitting into 25 units is that front-end traffic inevitably gets divided up, which must be a headache; but that is off topic, so I will stop there. This year my technical direction leans more toward algorithms, which is a personal interest; the team is focused on CRM, though what everyone mentions now is CEM, as if it were embarrassing to even say CRM out loud. To improve the user experience we are building user behavior analysis: collect user behavior to serve members better, and one concrete goal is to infer, from a member's status and behavior, what they intend to buy. There are three mainstream association rule algorithms: Apriori, partition-based algorithms, and FP-Growth, each with its own focus. Encyclopedia entry: http://baike.baidu.com/view/1076817.htm
The basic idea of all of them is to find frequent itemsets. Apriori finds them iteratively, which is inefficient but conceptually very clear; partition-based algorithms optimize its performance. FP-Growth is Jiawei Han's algorithm and only needs to scan the database twice, a big performance improvement. Mahout ships a corresponding implementation with good MapReduce support, so I chose it.
I. Environment
FP-Growth wiki: https://cwiki.apache.org/confluence/display/MAHOUT/Parallel+Frequent+Pattern+Mining
Hadoop 0.20.2 + Mahout 0.5 + JDK 1.6. Different versions are incompatible; I stepped into this pit myself (see my earlier article for the pitfalls and installation). Many of the problems you hit during installation can be solved with a Baidu search, but many others need Google and the foreigners' forums. Every time Baidu failed me, Google did not: the foreigners' explanations are often quite brief, but they actually solve the problem. So I strongly recommend searching the English forums directly before anything else; it feels like the problems discussed domestically are always more scattered.
II. Before the algorithm, a few words on configuring Eclipse for debugging
1. Install the MapReduce plug-in for Eclipse; any copy you find online should do, and it should not matter which Hadoop version it was built against.
2. Next, configure the plug-in with your Hadoop information, since it has to talk to the Hadoop environment: it needs the NameNode listening port and the JobTracker listening port. If you have forgotten what you configured, check the config files. Different Hadoop versions keep these settings in different files; mine is Hadoop 0.20.2 (this is also a pit):
core-site.xml: fs.default.name
mapred-site.xml: mapred.job.tracker
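For reference, the relevant entries in a Hadoop 0.20.x setup look roughly like this; the host and port values here are placeholders, so use whatever your own cluster actually listens on:

```xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>  <!-- NameNode address, placeholder -->
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>  <!-- JobTracker address, placeholder -->
  </property>
</configuration>
```

These are the same host:port pairs the Eclipse plug-in asks for.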
With that in place you can run and debug in Eclipse.
III. Algorithm logic
The program's main entry is FPGrowthDriver, which is really just a launcher: it parses the input parameters, such as the input and output paths, and based on the method parameter chooses between stand-alone and distributed computation. For the exact parameters see the Mahout FP-Growth wiki page (or just pass something wrong and read the usage message it prints). I specified mapreduce as the method; the code looks like this:
    if ("sequential".equalsIgnoreCase(classificationMethod)) {
      runFPGrowth(params);
    } else if ("mapreduce".equalsIgnoreCase(classificationMethod)) {
      Configuration conf = new Configuration();
      HadoopUtil.delete(conf, outputDir);
      PFPGrowth.runPFPGrowth(params);
    }
PFPGrowth.runPFPGrowth holds the main computational logic. It calls five methods in turn, so the computation splits into five steps; below I explain in detail what each step does. There is another blog post that covers the first few steps very thoroughly, with pictures, but it leaves out many details: http://www.cnblogs.com/zhangchaoyang/articles/2198946.html
1. Count
startParallelCounting is essentially a WordCount: it counts the occurrences of every item in the DB, hence the name Count, which makes it easy to remember. The output is the
fList: [(potato chips, 7), (bread, 7), (eggs, 7), (milk, 5), (beer, 4)]. This list is used again later, so these variable names are worth understanding and remembering.
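To make the step concrete, here is a hypothetical single-machine sketch of what startParallelCounting computes with MapReduce; the class and method names are my own, not Mahout's:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FListSketch {
    // Count each item's occurrences across all transactions, then sort
    // descending by support: that ordered (item, count) list is the fList.
    public static List<Map.Entry<String, Integer>> buildFList(List<List<String>> transactions) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> t : transactions) {
            for (String item : t) {
                counts.merge(item, 1, Integer::sum); // the "WordCount" part
            }
        }
        List<Map.Entry<String, Integer>> fList = new ArrayList<>(counts.entrySet());
        fList.sort((a, b) -> b.getValue() - a.getValue()); // most frequent first
        return fList;
    }
}
```

In the real job the counting is distributed: each mapper emits (item, 1) and the reducers sum, but the resulting fList is the same idea.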
2. Group
startGroupingItems divides the fList into numGroups groups, each holding maxPerGroup elements, where maxPerGroup = fList.size() / numGroups.
The grouped data is the gList: {potato chips = 0, milk = 3, eggs = 2, bread = 1, beer = 4}, where the number is the group id.
This step does not use MapReduce; it runs locally.
The gList is simply the mapping from each fList element (or the element's fList index) to its group.
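A minimal sketch of the grouping, again with hypothetical names: each fList position is assigned to a contiguous block of maxPerGroup items, and the block number becomes the group id. With the five items above and numGroups = 5 this reproduces the gList in the text:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupingSketch {
    // Cut the fList (ordered by frequency) into numGroups blocks of
    // maxPerGroup items and map every item to the id of its block.
    public static Map<String, Integer> buildGList(List<String> fListItems, int numGroups) {
        int maxPerGroup = fListItems.size() / numGroups; // integer division, as in the text
        Map<String, Integer> gList = new HashMap<>();
        for (int i = 0; i < fListItems.size(); i++) {
            // clamp so leftover items (when size % numGroups != 0) land in the last group
            gList.put(fListItems.get(i), Math.min(i / maxPerGroup, numGroups - 1));
        }
        return gList;
    }
}
```

Because the step is just this index arithmetic over the in-memory fList, there is no need for a MapReduce job.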
3. Code
startTransactionSorting encodes each transaction: items are replaced by numbers (their fList positions), duplicates are removed, and the result is sorted. Input and output look like the line below, except that [0, 1, 2] is not an array but a TransactionTree. At this point the positional logic is clear; step four starts building the tree structure.
milk, eggs, bread, potato chips ->> single-branch tree [0, 1, 2]
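The per-transaction encoding above can be sketched like this (my own helper, not Mahout's API; in Mahout the result is wrapped in a TransactionTree rather than returned as an array):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class TransactionEncoder {
    // Replace each item by its fList index, drop duplicates, and sort
    // ascending, so the most frequent item (index 0) comes first.
    public static int[] encode(List<String> transaction, Map<String, Integer> fListIndex) {
        SortedSet<Integer> ids = new TreeSet<>(); // dedupes and sorts in one go
        for (String item : transaction) {
            Integer id = fListIndex.get(item);
            if (id != null) {   // items missing from the fList are infrequent: skip them
                ids.add(id);
            }
        }
        int[] out = new int[ids.size()];
        int i = 0;
        for (int id : ids) out[i++] = id;
        return out;
    }
}
```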
4. Tree
startParallelFPGrowth
Map: reads the trees output by step three and splits each into multiple trees, e.g. [0, 1, 2] -> [0, 1, 2], [0, 1], [0]. It outputs k = groupId (the group id that the item's fList index maps to) and v = the corresponding tree.
Reducer: builds one tree per group id out of all the step-three-style output sharing that group id. It then iterates over the header-table entries and recursively collects all parent nodes from the tree, so that each single element in the header table corresponds to multiple paths; those paths are then fed back in as step-three-style input, iterating.
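The map-side splitting can be sketched as follows (hypothetical names; in Mahout the values are TransactionTrees, here they are plain arrays). Walking the sorted transaction from the tail, we emit for each group id encountered the prefix ending at that position, at most once per group, so with one item per group [0, 1, 2] yields exactly the three trees shown above:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class GroupPrefixSplit {
    // For each group id appearing in the transaction, emit
    // (groupId, prefix of the transaction up to that group's last item).
    public static Map<Integer, int[]> split(int[] transaction, Map<Integer, Integer> gList) {
        Map<Integer, int[]> emitted = new HashMap<>();
        for (int j = transaction.length - 1; j >= 0; j--) {
            int groupId = gList.get(transaction[j]);
            if (!emitted.containsKey(groupId)) { // emit each group at most once
                emitted.put(groupId, Arrays.copyOfRange(transaction, 0, j + 1));
            }
        }
        return emitted;
    }
}
```

This is what lets each reducer mine its group independently: every group receives every prefix that could contribute to its items' patterns.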
5. Mining
startAggregating takes the output of step four and emits the top-K frequent patterns.
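The aggregation idea, keeping only the K best-supported patterns, can be sketched with a size-K min-heap; this is my own illustration of the top-K selection, not Mahout's actual internals:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopKPatterns {
    // Keep the k patterns with the highest support: push every (pattern,
    // support) pair into a min-heap and evict the smallest past size k.
    public static List<Map.Entry<List<Integer>, Long>> topK(
            Map<List<Integer>, Long> patterns, int k) {
        PriorityQueue<Map.Entry<List<Integer>, Long>> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<List<Integer>, Long> e : patterns.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll(); // drop the least supported
        }
        List<Map.Entry<List<Integer>, Long>> out = new ArrayList<>(heap);
        out.sort((a, b) -> Long.compare(b.getValue(), a.getValue())); // best first
        return out;
    }
}
```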
PS: Time is tight; today I have to set up the environment and do ETL, so I will write just this much and update later. I noticed the blog post I referenced does not cover the details I wanted; they have a diagram, and I will add one too later, plus pull out some of my code notes, which should make things much clearer for readers. Execution, execution! The diagram will come. That's it for now, heading home... 2013-03-30