Hadoop advanced programming (1) --- composite keys and custom input types

Introduction: The basic approach of MapReduce to big data is to process records that have little dependence on one another, dividing a large problem into many small ones so that it becomes simple and feasible to solve. At the same time, the MapReduce framework hides many processing details: data splitting, task scheduling, data communication, fault tolerance, and load balancing are all handled by the system. For many problems you can accept the framework's defaults and only need to design the map function and the reduce function.
In general, a simple <key, value> pair is enough. In some more complex situations, however, a composite key enables processing that reduces network communication overhead and improves computing efficiency. Example: the inverted index is the most commonly used data structure in document retrieval systems and is widely used in full-text search. A forward index maps each document to the words it contains; an inverted index stores the reverse mapping, from a word or phrase to the document or documents that contain it, so that documents can be found by their content instead of by scanning every document. Code:
package reverseindex;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * Inverted index: map each word to the files that contain it.
 * Input:
 *   XD is a good man        -> file1.txt
 *   Good boy is XD          -> file2.txt
 *   XD like beautiful women -> file3.txt
 * Expected output:
 *   XD        -> file1.txt file2.txt file3.txt
 *   is        -> file1.txt file2.txt
 *   a         -> file1.txt
 *   good      -> file1.txt file2.txt
 *   man       -> file1.txt
 *   boy       -> file2.txt
 *   like      -> file3.txt
 *   beautiful -> file3.txt
 *   women     -> file3.txt
 *
 * The map function emits the pair <"word:filename", "1">, which lets the
 * combiner compute per-file word frequencies; the combiner then rewrites the
 * pair as <"word", "filename:frequency"> so that all records for the same
 * word are sent to the same reducer by the hash partitioner.
 *
 * @author XD
 */
public class InverseIndex {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        private final Text keyInfo = new Text();   // composite key: word + file name
        private final Text valueInfo = new Text(); // value: word frequency
        private FileSplit split;                   // the input split being processed

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            split = (FileSplit) context.getInputSplit(); // the split this <key, value> came from
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // index where the file name starts within the split's path
                int splitIndex = split.getPath().toString().indexOf("file");
                keyInfo.set(itr.nextToken() + ":" + split.getPath().toString().substring(splitIndex));
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class Combine extends Reducer<Text, Text, Text, Text> {
        private final Text info = new Text(); // holds the new value after re-splitting the key

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (Text val : values) {
                sum += Integer.parseInt(val.toString());
            }
            int splitIndex = key.toString().indexOf(":");
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum); // new value: "filename:frequency"
            key.set(key.toString().substring(0, splitIndex));               // new key: the word alone
            context.write(key, info);
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        private final Text result = new Text(); // the final output value

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder list = new StringBuilder();
            for (Text val : values) {
                list.append(val.toString()).append(";"); // entries for different files are separated by ";"
            }
            result.set(list.toString());
            context.write(key, result);
        }
    }
}
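Since the job driver is not shown, the composite-key flow can be checked outside Hadoop with a plain-Java sketch: the map step emits <"word:file", 1>, the combine step sums counts per file, and the reduce step concatenates the per-file entries. The class and method names below are hypothetical, not part of the original program.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch (no Hadoop) of the composite-key flow above: map emits
// <"word:file", "1">, the combiner rewrites it to <"word", "file:count">,
// and the reducer concatenates the file lists, separated by ";".
public class InverseIndexSketch {

    public static Map<String, String> buildIndex(Map<String, String> docs) {
        // map + combine: word -> (file -> count)
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                counts.computeIfAbsent(word, w -> new LinkedHashMap<>())
                      .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        // reduce: word -> "file1:count;file2:count;..."
        Map<String, String> index = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            StringBuilder list = new StringBuilder();
            for (Map.Entry<String, Integer> f : e.getValue().entrySet()) {
                list.append(f.getKey()).append(":").append(f.getValue()).append(";");
            }
            index.put(e.getKey(), list.toString());
        }
        return index;
    }
}
```

Running it on the three example files produces, for instance, "file1.txt:1;file2.txt:1;file3.txt:1;" for the word XD, matching the expected output of the MapReduce job (up to per-file frequency annotations).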
User-defined data types: Hadoop provides many built-in data types, but for some complicated problems these simple built-in types cannot meet the user's needs, and a custom data type is required. A custom type must implement the Writable interface so that its data can be serialized for network transmission and for file input and output. In addition, if the type is to be used as a key, or if its values need to be compared, it must implement the WritableComparable interface. Example:
package com.rpc.nefu;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A custom serializable type that can also serve as a key: it implements
// WritableComparable so instances can be serialized and compared.
public class KeyValue implements WritableComparable<KeyValue> {
    public int x, y;

    public KeyValue() {
        this.x = 0;
        this.y = 0;
    }

    public KeyValue(int x1, int y1) {
        this.x = x1;
        this.y = y1;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    public int distanceFromOrigin() {
        return x * x + y * y;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof KeyValue)) {
            return false;
        }
        KeyValue other = (KeyValue) o;
        return this.x == other.x && this.y == other.y;
    }

    @Override
    public int hashCode() {
        return Float.floatToIntBits(x) ^ Float.floatToIntBits(y);
    }

    @Override
    public String toString() {
        return Integer.toString(x) + "," + Integer.toString(y);
    }

    @Override
    public int compareTo(KeyValue o) {
        // Order by x only; ties on x compare as equal.
        if (x > o.x) {
            return 1;
        } else if (x == o.x) {
            return 0;
        } else {
            return -1;
        }
    }
}
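The heart of the Writable contract is that write(DataOutput) and readFields(DataInput) must round-trip a value through a byte stream. This can be demonstrated without Hadoop at all; the PointSketch class below is a hypothetical stand-in for KeyValue that uses only java.io.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hadoop-free sketch of the Writable contract: write() serializes the fields
// to a byte stream and readFields() restores them in the same order.
public class PointSketch {
    public int x, y;

    public PointSketch(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public void write(DataOutputStream out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    public void readFields(DataInputStream in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }

    // Serialize to an in-memory buffer, then deserialize into a fresh object.
    public static PointSketch roundTrip(PointSketch p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            PointSketch copy = new PointSketch(0, 0);
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory streams should not fail
        }
    }
}
```

Note that readFields must consume the fields in exactly the order write produced them; swapping the two readInt calls would silently exchange x and y.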


