Hadoop advanced programming (1) --- composite keys and custom input types

Introduction: The basic approach of MapReduce to big data is to process records that have little dependence on one another, dividing a large problem into many small ones so that it becomes simple and feasible to solve. At the same time, the MapReduce framework hides many processing details: data splitting, task scheduling, data communication, fault tolerance, and load balancing are all handled by the system. For many problems you can accept the framework's defaults and only need to design the map function and the reduce function.
In general, a simple <key, value> pair is enough. In some more complex situations, however, a composite key enables processing that reduces network communication overhead and improves computing efficiency. Example: the inverted index is the most commonly used data structure in document retrieval systems and is widely used in full-text search. A forward index maps each document to the words it contains; an inverted index stores the reverse mapping, from a word or phrase to the document or documents that contain it, so that documents can be found by their content instead of by scanning every document. Code:
package reverseindex;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * Inverted index: map each word to the files that contain it.
 * Input:
 *   XD is a good man        -> file1.txt
 *   Good boy is XD          -> file2.txt
 *   XD like beautiful women -> file3.txt
 * Expected output:
 *   XD        -> file1.txt file2.txt file3.txt
 *   is        -> file1.txt file2.txt
 *   a         -> file1.txt
 *   good      -> file1.txt file2.txt
 *   man       -> file1.txt
 *   boy       -> file2.txt
 *   like      -> file3.txt
 *   beautiful -> file3.txt
 *   women     -> file3.txt
 *
 * The map function emits the pair <"word:filename", "1">, which lets the
 * combiner compute per-file word frequencies; the combiner then rewrites the
 * pair as <"word", "filename:frequency"> so that all records for the same
 * word are sent to the same reducer by the hash partitioner.
 *
 * @author XD
 */
public class InverseIndex {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        private final Text keyInfo = new Text();   // composite key: word + file name
        private final Text valueInfo = new Text(); // value: word frequency
        private FileSplit split;                   // the input split being processed

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            split = (FileSplit) context.getInputSplit(); // the split this <key, value> came from
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // index where the file name starts within the split's path
                int splitIndex = split.getPath().toString().indexOf("file");
                keyInfo.set(itr.nextToken() + ":" + split.getPath().toString().substring(splitIndex));
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class Combine extends Reducer<Text, Text, Text, Text> {
        private final Text info = new Text(); // holds the new value after re-splitting the key

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (Text val : values) {
                sum += Integer.parseInt(val.toString());
            }
            int splitIndex = key.toString().indexOf(":");
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum); // new value: "filename:frequency"
            key.set(key.toString().substring(0, splitIndex));               // new key: the word alone
            context.write(key, info);
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        private final Text result = new Text(); // the final output value

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder list = new StringBuilder();
            for (Text val : values) {
                list.append(val.toString()).append(";"); // entries for different files are separated by ";"
            }
            result.set(list.toString());
            context.write(key, result);
        }
    }
}
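Since the job driver is not shown, the composite-key flow can be checked outside Hadoop with a plain-Java sketch: the map step emits <"word:file", 1>, the combine step sums counts per file, and the reduce step concatenates the per-file entries. The class and method names below are hypothetical, not part of the original program.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch (no Hadoop) of the composite-key flow above: map emits
// <"word:file", "1">, the combiner rewrites it to <"word", "file:count">,
// and the reducer concatenates the file lists, separated by ";".
public class InverseIndexSketch {

    public static Map<String, String> buildIndex(Map<String, String> docs) {
        // map + combine: word -> (file -> count)
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                counts.computeIfAbsent(word, w -> new LinkedHashMap<>())
                      .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        // reduce: word -> "file1:count;file2:count;..."
        Map<String, String> index = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            StringBuilder list = new StringBuilder();
            for (Map.Entry<String, Integer> f : e.getValue().entrySet()) {
                list.append(f.getKey()).append(":").append(f.getValue()).append(";");
            }
            index.put(e.getKey(), list.toString());
        }
        return index;
    }
}
```

Running it on the three example files produces, for instance, "file1.txt:1;file2.txt:1;file3.txt:1;" for the word XD, matching the expected output of the MapReduce job (up to per-file frequency annotations).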
User-defined data types: Hadoop provides many built-in data types, but for some complicated problems these simple built-in types cannot meet the user's needs, and a custom data type is required. A custom type must implement the Writable interface so that its data can be serialized for network transmission and for file input and output. In addition, if the type is to be used as a key, or if its values need to be compared, it must implement the WritableComparable interface. Example:
package com.rpc.nefu;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A custom serializable type that can also serve as a key: it implements
// WritableComparable so instances can be serialized and compared.
public class KeyValue implements WritableComparable<KeyValue> {
    public int x, y;

    public KeyValue() {
        this.x = 0;
        this.y = 0;
    }

    public KeyValue(int x1, int y1) {
        this.x = x1;
        this.y = y1;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    public int distanceFromOrigin() {
        return x * x + y * y;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof KeyValue)) {
            return false;
        }
        KeyValue other = (KeyValue) o;
        return this.x == other.x && this.y == other.y;
    }

    @Override
    public int hashCode() {
        return Float.floatToIntBits(x) ^ Float.floatToIntBits(y);
    }

    @Override
    public String toString() {
        return Integer.toString(x) + "," + Integer.toString(y);
    }

    @Override
    public int compareTo(KeyValue o) {
        // Order by x only; ties on x compare as equal.
        if (x > o.x) {
            return 1;
        } else if (x == o.x) {
            return 0;
        } else {
            return -1;
        }
    }
}
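The heart of the Writable contract is that write(DataOutput) and readFields(DataInput) must round-trip a value through a byte stream. This can be demonstrated without Hadoop at all; the PointSketch class below is a hypothetical stand-in for KeyValue that uses only java.io.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hadoop-free sketch of the Writable contract: write() serializes the fields
// to a byte stream and readFields() restores them in the same order.
public class PointSketch {
    public int x, y;

    public PointSketch(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public void write(DataOutputStream out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    public void readFields(DataInputStream in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }

    // Serialize to an in-memory buffer, then deserialize into a fresh object.
    public static PointSketch roundTrip(PointSketch p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            PointSketch copy = new PointSketch(0, 0);
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory streams should not fail
        }
    }
}
```

Note that readFields must consume the fields in exactly the order write produced them; swapping the two readInt calls would silently exchange x and y.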


