Using Hadoop for Related-Product Statistics


When reprinting, please cite the source: http://blog.csdn.net/xiaojimanman/article/details/40184581

I have been reading Hadoop-related books for the past few days, and having gotten a feel for it, I wrote a small statistics program of my own, modeled on the classic WordCount example.

Requirement Description:

Based on a supermarket's sales lists, calculate how strongly products are associated with each other (that is, the number of times products A and B are bought together).

Data format:

The supermarket sales lists are simplified to the following format: each line represents one sales list, and the products on it are separated by ",".


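The original sample data is not available, so as a stand-in, a hypothetical sample of five sales lists in this format (with placeholder product names A, B, C, D) might look like:

A,B,C
A,B
B,C,D
A,C
B,D

The worked examples below continue with these five lists.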
Requirement Analysis:

This requirement is implemented with a Hadoop MapReduce job.

The map function splits each sales list into associated product pairs: for each pair, the output key is product A and the value is an associated product B. The split of the first three lists is shown below.


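Continuing the hypothetical sample, splitting the first three lists into product pairs gives:

A,B,C  ->  (A,B) (A,C) (B,C)
A,B    ->  (A,B)
B,C,D  ->  (B,C) (B,D) (C,D)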
In order to count the products associated with both A and B, each pair of products A and B is output as two records: A-B and B-A.

The reduce function performs grouped statistics on the products associated with each product A: it counts the number of times each product appears among the values and outputs the key "product A|product B" with the number of occurrences of that combination as the value. The five sample records above are analyzed step by step below.

First, the map function produces the following records.


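For the five sample lists, with both directions emitted per pair, the map output is:

A,B,C  ->  A:B, B:A, A:C, C:A, B:C, C:B
A,B    ->  A:B, B:A
B,C,D  ->  B:C, C:B, B:D, D:B, C:D, D:C
A,C    ->  A:C, C:A
B,D    ->  B:D, D:B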
In the reduce phase, the values output by map are grouped by key and counted, with the result shown below.


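Grouping the sample map outputs by key and counting each value gives:

key A: values {B, C, B, C}       ->  B: 2, C: 2
key B: values {A, C, A, C, D, D} ->  A: 2, C: 2, D: 2
key C: values {A, B, B, D, A}    ->  A: 2, B: 2, D: 1
key D: values {B, C, B}          ->  B: 2, C: 1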
Finally, "product A|product B" is used as the key and the combination count as the value. The output result is as follows.


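For the hypothetical sample, with no minimum-count filter applied, the final output would be:

A|B 2
A|C 2
B|A 2
B|C 2
B|D 2
C|A 2
C|B 2
C|D 1
D|B 2
D|C 1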
With this, the analysis of the requirement is complete. Next, let's look at the concrete code implementation.

Code implementation:

The code is not described in detail here; for specifics, refer to the comments in the code.

package com;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Test extends Configured implements Tool {

    /**
     * Map class: preprocesses the data.
     * Output key is product A; value is an associated product B.
     * @author lulei
     */
    public static class MapT extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            if (!(line == null || "".equals(line))) {
                // Split the sales list into individual products
                String[] vs = line.split(",");
                // Combine the products two by two to form records
                for (int i = 0; i < vs.length - 1; i++) {
                    if ("".equals(vs[i])) {
                        // Skip empty entries
                        continue;
                    }
                    for (int j = i + 1; j < vs.length; j++) {
                        if ("".equals(vs[j])) {
                            continue;
                        }
                        // Emit both directions so every product collects its partners
                        context.write(new Text(vs[i]), new Text(vs[j]));
                        context.write(new Text(vs[j]), new Text(vs[i]));
                    }
                }
            }
        }
    }

    /**
     * Reduce class: counts the associations.
     * Output key is "product A|product B"; value is the association count.
     * @author lulei
     */
    public static class ReduceT extends Reducer<Text, Text, Text, IntWritable> {
        private int count;

        /**
         * Initialization: read the minimum record count from the job parameters.
         */
        public void setup(Context context) {
            String countStr = context.getConfiguration().get("count");
            try {
                this.count = Integer.parseInt(countStr);
            } catch (Exception e) {
                this.count = 0;
            }
        }

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String keyStr = key.toString();
            HashMap<String, Integer> hashMap = new HashMap<String, Integer>();
            // Use a hash map to count how often each product B appears
            for (Text value : values) {
                String valueStr = value.toString();
                if (hashMap.containsKey(valueStr)) {
                    hashMap.put(valueStr, hashMap.get(valueStr) + 1);
                } else {
                    hashMap.put(valueStr, 1);
                }
            }
            // Output only the combinations whose count is not less than the minimum
            for (Entry<String, Integer> entry : hashMap.entrySet()) {
                if (entry.getValue() >= this.count) {
                    context.write(new Text(keyStr + "|" + entry.getKey()),
                            new IntWritable(entry.getValue()));
                }
            }
        }
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        // Pass the minimum count threshold to the reducers
        conf.set("count", arg0[2]);

        Job job = new Job(conf);
        job.setJobName("jobtest");
        job.setJarByClass(Test.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // The mapper emits Text/Text while the reducer emits Text/IntWritable,
        // so the map output classes must be declared separately
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(MapT.class);
        job.setReducerClass(ReduceT.class);

        FileInputFormat.addInputPath(job, new Path(arg0[0]));
        FileOutputFormat.setOutputPath(job, new Path(arg0[1]));

        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    /**
     * @param args input path, output path, minimum count
     */
    public static void main(String[] args) {
        // Expect exactly three arguments
        if (args.length != 3) {
            System.exit(-1);
        }
        try {
            int res = ToolRunner.run(new Configuration(), new Test(), args);
            System.exit(res);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Upload and run:

Package the program into a JAR file and upload it to the cluster, then upload the test data to the HDFS distributed file system.

Run the following command:


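The original command is not available. Assuming the JAR is named test.jar, and given that the driver class is com.Test and the three arguments are the input path, the output path, and the minimum count (the paths here are hypothetical), the invocation would look something like:

hadoop jar test.jar com.Test /input/shopping /output/relate 1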
After the job completes, check the output on the HDFS file system.


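Continuing the hypothetical run above, the reducer output lands in the standard part-r-00000 file, which could be inspected with:

hadoop fs -cat /output/relate/part-r-00000

For the sample data, it would contain the combination counts listed earlier (A|B 2, A|C 2, and so on).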
With this, a complete MapReduce program is done. I will continue learning about Hadoop~
