Using Hadoop for Related-Product Statistics


When reprinting, please cite the source: http://blog.csdn.net/xiaojimanman/article/details/40184581

I have been reading Hadoop-related books for the past few days, and having gotten a feel for it, I wrote a small statistics program of my own, modeled on the classic WordCount example.

Requirement Description:

Based on a supermarket's sales lists, calculate how strongly products are associated with each other (that is, the number of times products A and B are bought together).

Data format:

The supermarket sales lists are simplified to the following format: each line represents one sales list, and the products on it are separated by ",".


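The original sample data is not available, so as a stand-in, a hypothetical sample of five sales lists in this format (with placeholder product names A, B, C, D) might look like:

A,B,C
A,B
B,C,D
A,C
B,D

The worked examples below continue with these five lists.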
Requirement Analysis:

This requirement is implemented with a Hadoop MapReduce job.

The map function splits each sales list into associated product pairs: for each pair, the output key is product A and the value is an associated product B. The split of the first three lists is shown below.


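Continuing the hypothetical sample, splitting the first three lists into product pairs gives:

A,B,C  ->  (A,B) (A,C) (B,C)
A,B    ->  (A,B)
B,C,D  ->  (B,C) (B,D) (C,D)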
In order to count the products associated with both A and B, each pair of products A and B is output as two records: A-B and B-A.

The reduce function performs grouped statistics on the products associated with each product A: it counts the number of times each product appears among the values and outputs the key "product A|product B" with the number of occurrences of that combination as the value. The five sample records above are analyzed step by step below.

First, the map function produces the following records.


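For the five sample lists, with both directions emitted per pair, the map output is:

A,B,C  ->  A:B, B:A, A:C, C:A, B:C, C:B
A,B    ->  A:B, B:A
B,C,D  ->  B:C, C:B, B:D, D:B, C:D, D:C
A,C    ->  A:C, C:A
B,D    ->  B:D, D:B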
In the reduce phase, the values output by map are grouped by key and counted, with the result shown below.


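Grouping the sample map outputs by key and counting each value gives:

key A: values {B, C, B, C}       ->  B: 2, C: 2
key B: values {A, C, A, C, D, D} ->  A: 2, C: 2, D: 2
key C: values {A, B, B, D, A}    ->  A: 2, B: 2, D: 1
key D: values {B, C, B}          ->  B: 2, C: 1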
Finally, "product A|product B" is used as the key and the combination count as the value. The output result is as follows.


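For the hypothetical sample, with no minimum-count filter applied, the final output would be:

A|B 2
A|C 2
B|A 2
B|C 2
B|D 2
C|A 2
C|B 2
C|D 1
D|B 2
D|C 1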
With this, the analysis of the requirement is complete. Next, let's look at the concrete code implementation.

Code implementation:

The code is not described in detail here; for specifics, refer to the comments in the code.

package com;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Test extends Configured implements Tool {

    /**
     * Map class: preprocesses the data.
     * Output key is product A; value is an associated product B.
     * @author lulei
     */
    public static class MapT extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            if (!(line == null || "".equals(line))) {
                // Split the sales list into individual products
                String[] vs = line.split(",");
                // Combine the products two by two to form records
                for (int i = 0; i < vs.length - 1; i++) {
                    if ("".equals(vs[i])) {
                        // Skip empty entries
                        continue;
                    }
                    for (int j = i + 1; j < vs.length; j++) {
                        if ("".equals(vs[j])) {
                            continue;
                        }
                        // Emit both directions so every product collects its partners
                        context.write(new Text(vs[i]), new Text(vs[j]));
                        context.write(new Text(vs[j]), new Text(vs[i]));
                    }
                }
            }
        }
    }

    /**
     * Reduce class: counts the associations.
     * Output key is "product A|product B"; value is the association count.
     * @author lulei
     */
    public static class ReduceT extends Reducer<Text, Text, Text, IntWritable> {
        private int count;

        /**
         * Initialization: read the minimum record count from the job parameters.
         */
        public void setup(Context context) {
            String countStr = context.getConfiguration().get("count");
            try {
                this.count = Integer.parseInt(countStr);
            } catch (Exception e) {
                this.count = 0;
            }
        }

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String keyStr = key.toString();
            HashMap<String, Integer> hashMap = new HashMap<String, Integer>();
            // Use a hash map to count how often each product B appears
            for (Text value : values) {
                String valueStr = value.toString();
                if (hashMap.containsKey(valueStr)) {
                    hashMap.put(valueStr, hashMap.get(valueStr) + 1);
                } else {
                    hashMap.put(valueStr, 1);
                }
            }
            // Output only the combinations whose count is not less than the minimum
            for (Entry<String, Integer> entry : hashMap.entrySet()) {
                if (entry.getValue() >= this.count) {
                    context.write(new Text(keyStr + "|" + entry.getKey()),
                            new IntWritable(entry.getValue()));
                }
            }
        }
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        // Pass the minimum count threshold to the reducers
        conf.set("count", arg0[2]);

        Job job = new Job(conf);
        job.setJobName("jobtest");
        job.setJarByClass(Test.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // The mapper emits Text/Text while the reducer emits Text/IntWritable,
        // so the map output classes must be declared separately
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(MapT.class);
        job.setReducerClass(ReduceT.class);

        FileInputFormat.addInputPath(job, new Path(arg0[0]));
        FileOutputFormat.setOutputPath(job, new Path(arg0[1]));

        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    /**
     * @param args input path, output path, minimum count
     */
    public static void main(String[] args) {
        // Expect exactly three arguments
        if (args.length != 3) {
            System.exit(-1);
        }
        try {
            int res = ToolRunner.run(new Configuration(), new Test(), args);
            System.exit(res);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Upload and run:

Package the program into a JAR file and upload it to the cluster, then upload the test data to the HDFS distributed file system.

Run the following command:


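The original command is not available. Assuming the JAR is named test.jar, and given that the driver class is com.Test and the three arguments are the input path, the output path, and the minimum count (the paths here are hypothetical), the invocation would look something like:

hadoop jar test.jar com.Test /input/shopping /output/relate 1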
After the job completes, check the output on the HDFS file system.


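Continuing the hypothetical run above, the reducer output lands in the standard part-r-00000 file, which could be inspected with:

hadoop fs -cat /output/relate/part-r-00000

For the sample data, it would contain the combination counts listed earlier (A|B 2, A|C 2, and so on).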
With this, a complete MapReduce program is done. I will continue learning about Hadoop~
