The Parallel Frequent Pattern Mining Algorithm FP-Growth and Its Command-Line Usage in Mahout

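Today I looked into the parallel frequent pattern mining algorithm PFP-Growth and how to run it from the Mahout command line. These are brief notes on the experiment, kept for future reference: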

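Environment: JDK 1.7 + Hadoop 2.2.0 (single-node pseudo-distributed cluster) + Mahout 0.6. (Mahout 0.8 and 0.9 do not include this algorithm. Somewhat to my surprise, Mahout 0.6 runs fine on Hadoop 2.2.0.)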

Part of the input data; each line represents one shopping basket:

4750,19394,25651,6395,5592
26180,10895,24571,23295,20578,27791,2729,8637
7380,18805,25086,19048,3190,21995,10908,12576
3458,12426,20578
1880,10702,1731,5185,18575,28967
21815,10872,18730
20626,17921,28930,14580,2891,11080
18075,6548,28759,17133
7868,15200,13494
7868,28617,18097,22999,16323,8637,7045,25733
12189,8816,22950,18465,13258,27791,20979
26728
17512,14821,18741
26619,14470,21899,6731
5184
28653,28662,18353,27437,5661,12078,11849,15784,7248,7061,18612,24277,4807,15584,9671,18741,3647,1000

......
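To make the input format concrete, here is a minimal Python sketch that reads basket lines like the ones above and counts how many baskets each item appears in. It is only an illustration of the idea of a support threshold, not Mahout's implementation; the function names are made up for this example, and the sample is a handful of lines copied from the data above.

```python
from collections import Counter

# A few basket lines copied from the sample above; one line is one basket.
SAMPLE = """\
3458,12426,20578
26180,10895,24571,23295,20578,27791,2729,8637
7868,15200,13494
7868,28617,18097,22999,16323,8637,7045,25733
26728
"""


def parse_baskets(text):
    """Split each non-empty line on commas into a list of item IDs (kept as strings)."""
    baskets = []
    for line in text.splitlines():
        items = [tok.strip() for tok in line.split(",") if tok.strip()]
        if items:
            baskets.append(items)
    return baskets


def item_support(baskets):
    """Count in how many baskets each item appears (an item counts once per basket)."""
    counts = Counter()
    for basket in baskets:
        counts.update(set(basket))
    return counts


def frequent_items(baskets, min_support):
    """Keep only the items whose support reaches the threshold."""
    return {item: n for item, n in item_support(baskets).items() if n >= min_support}


baskets = parse_baskets(SAMPLE)
print(frequent_items(baskets, min_support=2))
```

On the full data set the same counting with a threshold of 3 corresponds to the single-item patterns that survive the `-s 3` setting used in the command below.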

The command I ran:

mahout fpg -i /workspace/dataguru/hadoopdev/week13/fpg/in/ -o /workspace/dataguru/hadoopdev/week13/fpg/out -method mapreduce -s 3

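Parameter notes:

-i is the input path. Because the job runs on Hadoop, it must be an HDFS path; the actual input file in this experiment is /workspace/dataguru/hadoopdev/week13/fpg/in/user2items.csv

-o is the output path, a directory in HDFS

-method mapreduce selects the MapReduce (parallel) implementation, and -s 3 sets the minimum support to 3

The complete parameter list was given in a table in the original post (not preserved in this copy).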

The output directory after the command completes:

casliyang@singlehadoop:~$ hadoop dfs -ls /workspace/dataguru/hadoopdev/week13/fpg/out
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 4 items
-rw-r--r--   3 casliyang supergroup       5567 2014-06-17 17:50 /workspace/dataguru/hadoopdev/week13/fpg/out/fList
drwxr-xr-x   - casliyang supergroup          0 2014-06-17 17:51 /workspace/dataguru/hadoopdev/week13/fpg/out/fpgrowth
drwxr-xr-x   - casliyang supergroup          0 2014-06-17 17:51 /workspace/dataguru/hadoopdev/week13/fpg/out/frequentpatterns
drwxr-xr-x   - casliyang supergroup          0 2014-06-17 17:50 /workspace/dataguru/hadoopdev/week13/fpg/out/parallelcounting

The mined frequent patterns are in the frequentpatterns directory:

casliyang@singlehadoop:~$ hadoop dfs -ls /workspace/dataguru/hadoopdev/week13/fpg/out/frequentpatterns
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 2 items
-rw-r--r--   3 casliyang supergroup          0 2014-06-17 17:51 /workspace/dataguru/hadoopdev/week13/fpg/out/frequentpatterns/_SUCCESS
-rw-r--r--   3 casliyang supergroup      10017 2014-06-17 17:51 /workspace/dataguru/hadoopdev/week13/fpg/out/frequentpatterns/part-r-00000

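This file is a serialized file (a Hadoop SequenceFile) and cannot be viewed directly. Mahout provides a command that converts it to plain text: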

mahout seqdumper -s /workspace/dataguru/hadoopdev/week13/fpg/out/frequentpatterns/part-r-00000 -o /home/casliyang/outpattern

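Note that the path given to -o must be on the local Linux file system rather than HDFS, and the target file must be created in advance; otherwise the command reports an error.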
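Creating the target file beforehand can be done with `touch`, or from a script. The sketch below shows one way to do it in Python with `pathlib`; the helper name is made up for this example, and the requirement itself is simply what I observed with Mahout 0.6.

```python
import tempfile
from pathlib import Path


def ensure_local_output_file(path):
    """Create the parent directory and an empty file if they do not exist yet."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.touch(exist_ok=True)  # an existing file keeps its contents
    return target


# Demo in a temporary directory; in the experiment above the path
# would be /home/casliyang/outpattern.
demo_path = Path(tempfile.mkdtemp()) / "outpattern"
print(ensure_local_output_file(demo_path).is_file())
```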

Part of the final output written to /home/casliyang/outpattern:

Key: 29099: Value: ([29099],18), ([29099, 4479],3)
Key: 29202: Value: ([29202],3)
Key: 29203: Value: ([29203],9), ([14020, 29203],3)
Key: 29224: Value: ([29224],3)
Key: 29547: Value: ([29547],5)
Key: 2963: Value: ([2963],8), ([2963, 21146],3)
Key: 2999: Value: ([2999],3)
Key: 3032: Value: ([3032],4)
Key: 3047: Value: ([3047],4)
Key: 3151: Value: ([3151],7), ([14020, 3151],4)
Key: 3181: Value: ([3181],3)
Key: 3228: Value: ([3228],14)
Key: 3313: Value: ([3313],3)
Key: 3324: Value: ([3324],3)
Key: 3438: Value: ([3438],3)
Key: 3458: Value: ([3458],4)
Key: 3627: Value: ([3627],11), ([3627, 11176],3)

......

Meaning:

Key: the item ID

Value: the frequent patterns involving that item, each with its support count
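A dumped line can be turned back into structured data with a couple of regular expressions. The following Python sketch assumes every line has exactly the `Key: <id>: Value: ([items],count), ...` shape shown above; the function name is made up for this example, and lines that do not match are skipped.

```python
import re

# "Key: 29099: Value: ([29099],18), ([29099, 4479],3)"
LINE_RE = re.compile(r"^Key:\s*(\S+?):\s*Value:\s*(.*)$")
# One "([item, item, ...],support)" group inside the value part.
PATTERN_RE = re.compile(r"\(\[([^\]]*)\],\s*(\d+)\)")


def parse_line(line):
    """Return (item_id, [(pattern_tuple, support), ...]) or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    key, value = m.group(1), m.group(2)
    patterns = [
        (tuple(item.strip() for item in items.split(",")), int(support))
        for items, support in PATTERN_RE.findall(value)
    ]
    return key, patterns


# Two lines copied from the output above.
for raw in (
    "Key: 29099: Value: ([29099],18), ([29099, 4479],3)",
    "Key: 29202: Value: ([29202],3)",
):
    print(parse_line(raw))
```

Item IDs are kept as strings so that the parser does not depend on them being numeric.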

With the mined frequent patterns in hand, a program can process them further according to business needs.
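As one possible example of such follow-up processing (not something the experiment above did), the sketch below estimates the confidence of simple association rules from the patterns listed under one key, using the usual definition confidence(A -> B) = support(A and B) / support(A). The function name and the choice of confidence as the measure are assumptions for illustration; the input is one entry from the output above, written out by hand.

```python
def rule_confidences(key, patterns):
    """For each multi-item pattern containing `key`, estimate confidence(key -> other items).

    `patterns` is a list of (tuple_of_item_ids, support) pairs for one key.
    """
    support = dict(patterns)
    base = support.get((key,))  # support of the single-item pattern [key]
    rules = []
    if not base:
        return rules
    for items, count in patterns:
        if len(items) > 1 and key in items:
            rest = tuple(i for i in items if i != key)
            rules.append((key, rest, count / base))
    return rules


# "Key: 3151: Value: ([3151],7), ([14020, 3151],4)" from the output above.
patterns_3151 = [(("3151",), 7), (("14020", "3151"), 4)]
for antecedent, consequent, conf in rule_confidences("3151", patterns_3151):
    print(f"{antecedent} -> {consequent}: confidence {conf:.2f}")
```

Here item 3151 appears in 7 baskets and together with 14020 in 4 of them, so the estimated confidence of 3151 -> 14020 is 4/7.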

Mahout really is a great open-source project!

Author: CSDN blog, u010967382
