Patterns, Algorithms, and Use Cases for Hadoop MapReduce


This article was originally published on the well-known technical blog Highly Scalable Blog and was translated by @juliashine. Thanks to the translator for sharing the work.

About the translator: Juliashine is an engineer with several years of experience, currently working on large-scale data processing and analysis, with an interest in Hadoop and the NoSQL ecosystem.

"MapReduce Patterns, Algorithms, and use Cases"

Address: "MapReduce patterns, algorithms and use Cases"

This article summarizes several MapReduce patterns and algorithms that have appeared on the web or in papers, and systematically explains the differences between them. All descriptive text and code use the standard Hadoop MapReduce model, including mappers, reducers, combiners, partitioners, and sorting. The framework is illustrated by a figure in the original post.

Basic MapReduce Patterns

Counting and Summing

Problem statement: There are a number of documents, each composed of a set of terms. You need to calculate the number of occurrences of each term across all documents, or compute some other statistic over a field of the records. For example, given a log file in which each record contains a response time, calculate the average response time.

Solution:

Let's start with a simple example. In the code snippet below, the mapper emits a count of 1 each time it encounters a term, and the reducer traverses the list of counts for each term and sums them to obtain the term's frequency.

class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Reducer
   method Reduce(term t, counts [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
         sum = sum + c
      Emit(term t, count sum)
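As a concrete illustration, here is a minimal Hadoop Streaming sketch of the same pattern in Python. It is only a sketch: it assumes whitespace-separated terms and tab-separated key/value lines, and the script names mapper.py and reducer.py are hypothetical.

#!/usr/bin/env python
# mapper.py -- emit (term, 1) for every term in the input
import sys

for line in sys.stdin:
    for term in line.split():
        print("%s\t%d" % (term, 1))

#!/usr/bin/env python
# reducer.py -- sum the counts for each term; Hadoop delivers reducer input sorted by key
import sys
from itertools import groupby

def pairs(stream):
    for line in stream:
        term, count = line.rstrip("\n").split("\t", 1)
        yield term, int(count)

for term, group in groupby(pairs(sys.stdin), key=lambda kv: kv[0]):
    print("%s\t%d" % (term, sum(c for _, c in group)))

With Hadoop Streaming, scripts like these would typically be passed as the -mapper and -reducer arguments of the job.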

The drawback of this approach is obvious: the mapper emits a very large number of trivial counts. The amount of data passed to the reducer can be reduced by first aggregating the counts within each document:

class Mapper
   method Map(docid id, doc d)
      H = new AssociativeArray
      for all term t in doc d do
         H{t} = H{t} + 1
      for all term t in H do
         Emit(term t, count H{t})
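A hedged Python sketch of the same in-mapper aggregation for a streaming job; the per-line dictionary plays the role of the associative array H above, and the script name is hypothetical.

#!/usr/bin/env python
# mapper_inmap.py -- count terms within each document (here, each line) before emitting
import sys
from collections import Counter

for line in sys.stdin:
    counts = Counter(line.split())        # aggregate inside the document first
    for term, n in counts.items():
        print("%s\t%d" % (term, n))       # the reducer from the basic pattern is reused unchanged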

To accumulate counts not just within a single document but across all documents processed by one mapper node, you can use a combiner:

class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Combiner
   method Combine(term t, [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
         sum = sum + c
      Emit(term t, count sum)

class Reducer
   method Reduce(term t, counts [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
         sum = sum + c
      Emit(term t, count sum)

Application:
Log analysis, data querying

Collating

Problem Statement:

There is a set of items, each with several properties. The goal is to store all items that share the same value of some property in one file, or to otherwise group items by that property value. The most typical application is building an inverted index.

Solution:

The solution is straightforward. In the mapper, the required property value of each item is used as the key, and the item itself is emitted as the value. The reducer receives the items grouped by property value and can then process or store them. When building an inverted index, each item is a word and the property value is the ID of the document containing the word.
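A minimal sketch of the collating pattern as an inverted index, assuming input lines of the form "docid<TAB>text" (the format and script names are assumptions for illustration):

#!/usr/bin/env python
# inverted_index_mapper.py -- emit (word, docid) for every word in the document
import sys

for line in sys.stdin:
    docid, _, text = line.rstrip("\n").partition("\t")
    for word in text.split():
        print("%s\t%s" % (word, docid))

#!/usr/bin/env python
# inverted_index_reducer.py -- collect the document IDs for each word
import sys
from itertools import groupby

def pairs(stream):
    for line in stream:
        word, docid = line.rstrip("\n").split("\t", 1)
        yield word, docid

for word, group in groupby(pairs(sys.stdin), key=lambda kv: kv[0]):
    doc_ids = sorted(set(docid for _, docid in group))
    print("%s\t%s" % (word, ",".join(doc_ids)))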
Application:
Inverted indexes, ETL

Filtering (Grepping), Parsing, and Validation

Problem Statement:

Suppose there is a set of records and you need to either find all the records that satisfy some condition, or transform each record into another representation (the transformation is applied to each record independently, i.e. processing one record does not depend on any other record). Text parsing, extraction of particular values, and format conversion are examples of the latter case.

Solution:

Very simple: the mapper processes the records one by one and emits only the accepted records, or the records in their transformed form. No reducer is required.
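A sketch of a map-only filtering job in Python. The predicate (keep lines whose third whitespace-separated field is the HTTP status code "500") and the script name are made-up examples; such a job is run with the number of reduce tasks set to zero.

#!/usr/bin/env python
# filter_mapper.py -- keep only records that satisfy the predicate; no reducer needed
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split()
    if len(fields) > 2 and fields[2] == "500":   # hypothetical predicate
        print(line.rstrip("\n"))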
Application:
Log analysis, data querying, ETL, data validation

Distributed Task Execution

Problem Statement:

A large computation can be decomposed into multiple parts, and the partial results can then be combined to obtain the final result.

Solution: Split the data into multiple partitions that serve as the inputs. Each mapper processes one partition, performing the same computation and emitting its partial result; the reducer then combines the results from all mappers into the final one.
Case study: Digital communication system simulation
Simulation software for a digital communication system such as WiMAX passes a large volume of random data through the system model and computes the probability of transmission errors. Each mapper runs the simulation for 1/N of the sample data and computes the error rate for its portion; the reducer then computes the overall (average) error rate.
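A hedged sketch of this pattern: each mapper reports the number of bit errors and bits simulated for its slice, and a single reducer combines them into the overall error rate. The function run_simulation is a placeholder for the real system model; the file names and formats are assumptions.

#!/usr/bin/env python
# simulation_mapper.py -- each mapper simulates its work items and emits partial tallies
import sys
import random

def run_simulation(seed):
    # Placeholder for the real channel model: returns (bit_errors, bits_simulated).
    rng = random.Random(seed)
    bits = 10000
    errors = sum(1 for _ in range(bits) if rng.random() < 0.001)
    return errors, bits

for line in sys.stdin:                      # each input line is one work item / seed
    errors, bits = run_simulation(int(line.strip()))
    print("ber\t%d\t%d" % (errors, bits))   # a single key, so one reducer sees all tallies

#!/usr/bin/env python
# simulation_reducer.py -- combine partial tallies into the overall error rate
import sys

total_errors = total_bits = 0
for line in sys.stdin:
    _, errors, bits = line.rstrip("\n").split("\t")
    total_errors += int(errors)
    total_bits += int(bits)
print("error_rate\t%g" % (total_errors / float(total_bits)))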
Application:
Engineering simulation, numerical analysis, performance testing

Sorting

Problem Statement:

There are a number of records that need to be sorted by some rule or processed in order.

Solution: Simple sorting is straightforward: the mappers emit the value to be sorted on as the key and the entire record as the value. Sorting in practice is more subtle, however, which is why it is sometimes called the heart of MapReduce ("heart" in the sense that sorting huge data sets is the benchmark used to demonstrate Hadoop's computational power, and that MapReduce itself essentially works by sorting keys). In practice, composite keys are commonly used to achieve secondary sorting and grouping.
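A minimal sketch of the basic idea: the mapper emits the sort field as the key and the whole record as the value, and the framework's shuffle delivers keys to each reducer in sorted order (a suitable partitioner, such as Hadoop's TotalOrderPartitioner, is needed for a globally sorted result). The field index below is a hypothetical example.

#!/usr/bin/env python
# sort_mapper.py -- emit (sort_key, record); the shuffle sorts by key
import sys

for line in sys.stdin:
    record = line.rstrip("\n")
    fields = record.split("\t")
    sort_key = fields[1]                    # hypothetical: sort on the second field
    print("%s\t%s" % (sort_key, record))

# The reducer can be the identity: it simply re-emits records, which now arrive in key order.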

MapReduce was initially only able to sort by key, but there are also techniques that leverage Hadoop features to sort by value; see the blog post referenced in the original article for details.

Note that if MapReduce is used in the style of BigTable concepts, it is more beneficial to sort the initial data rather than the intermediate data, and to keep the data in sorted order. In other words, it is more efficient to sort the data once at load time than to sort it for every query.
Application:
ETL, data analysis

Non-Basic MapReduce Patterns

Iterative Message Passing (Graph Processing)

Problem Statement:

Suppose there is a network of entities with relationships between them. A state must be computed for each entity based on the properties of the other entities in its neighborhood. This state could be the distance to other nodes, an indication that a neighbor with certain properties exists, a characterization of the neighborhood density, and so on.

Solution:

Such a network is stored as a set of nodes, each node containing a list of its adjacent node IDs. Conceptually, MapReduce jobs are performed iteratively: at each iteration, each node sends messages to its adjacent nodes, and each adjacent node updates its state based on the messages it receives. Iteration stops when some condition is met, for example a maximum number of iterations is reached (say, the network diameter) or the state changes negligibly between two consecutive iterations. Technically, the mapper emits messages keyed by the ID of each adjacent node; all messages are then grouped by the receiving node, so the reducer can recompute that node's state and emit the node with its updated state. The algorithm is shown below:

class Mapper
   method Map(id n, object N)
      Emit(id n, object N)
      for all id m in N.OutgoingRelations do
         Emit(id m, message getMessage(N))

class Reducer
   method Reduce(id m, [s1, s2,...])
      M = null
      messages = []
      for all s in [s1, s2,...] do
         if IsObject(s) then
            M = s
         else               // s is a message
            messages.add(s)
      M.State = calculateState(messages)
      Emit(id m, item M)
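The following local Python sketch emulates one map/shuffle/reduce round of the framework above so the data flow is easier to follow. It is a simulation, not a Hadoop job; get_message and calculate_state are the two problem-specific hooks, and the graph representation (a dict of node records with a state and an out-neighbor list) is an assumption for illustration.

from collections import defaultdict

def mapreduce_iteration(nodes, get_message, calculate_state):
    # nodes: dict of id -> {'state': ..., 'out': [neighbor ids]}  (hypothetical layout)
    shuffle = defaultdict(list)
    # "Map": every node re-emits itself and sends one message to each neighbor.
    for nid, node in nodes.items():
        shuffle[nid].append(('object', node))
        for m in node['out']:
            shuffle[m].append(('message', get_message(node)))
    # "Reduce": rebuild each node and recompute its state from the received messages.
    updated = {}
    for nid, values in shuffle.items():
        node = next((v for tag, v in values if tag == 'object'), None)
        if node is None:                    # a neighbor id with no node object; skip it
            continue
        messages = [v for tag, v in values if tag == 'message']
        if messages:
            node = dict(node, state=calculate_state(node['state'], messages))
        updated[nid] = node
    return updated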

In this way the state of a node propagates quickly through the network: nodes that have been "infected" infect their neighbors in turn, as illustrated by the figure in the original post.


Case study: Availability propagation through a tree of categories
Problem Statement:

This problem comes from real e-commerce applications. Goods are organized into categories that form a tree: larger categories (such as Men's, Women's, Kids') are subdivided into smaller ones (such as Men's Trousers or Women's Dresses) down to leaf categories that cannot be divided further (such as Men's Blue Jeans). A leaf category is either available (it contains at least one item) or unavailable (no item belongs to it). A non-leaf category is considered available if it contains at least one available subcategory. The goal is to find all available categories in the tree, given the availability of the leaf categories.

Solution:

This problem can be solved with the framework described in the previous section; we only need to define the getMessage and calculateState methods as follows:

class N
   State in {True = 2, False = 1, null = 0},
   initialized 1 or 2 for end-of-line categories, 0 otherwise

method getMessage(object N)
   return N.State

method calculateState(state s, data [d1, d2,...])
   return max( [d1, d2,...] )

Case study: Breadth-first search
Problem statement: You need to calculate the distance (number of hops) from one source node to all other nodes in a graph.

Solution: The source node sends a signal with value 0 to all of its adjacent nodes; each node forwards the signal on to its own neighbors, incrementing the value by 1 at each hop:

class N
   State is distance, initialized 0 for source node, INFINITY for all other nodes

method getMessage(N)
   return N.State + 1

method calculateState(state s, data [d1, d2,...])
   return min( [d1, d2,...] )
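Using the mapreduce_iteration sketch introduced earlier, a hedged breadth-first-search example might look like this (the small sample graph is made up):

INF = float('inf')

def bfs_get_message(node):
    return node['state'] + 1                      # distance grows by one per hop

def bfs_calculate_state(current, messages):
    return min([current] + messages)              # keep the shortest distance seen so far

# Hypothetical graph: state is 0 for the source and infinity elsewhere.
graph = {
    'a': {'state': 0,   'out': ['b', 'c']},
    'b': {'state': INF, 'out': ['d']},
    'c': {'state': INF, 'out': ['d']},
    'd': {'state': INF, 'out': []},
}
for _ in range(3):                                # iterate up to the graph diameter
    graph = mapreduce_iteration(graph, bfs_get_message, bfs_calculate_state)
# graph['d']['state'] is now 2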

Case study: PageRank and mapper-side data aggregation
The PageRank algorithm was introduced by Google to compute the relevance of a web page from the pages that link to it. The real algorithm is quite complex, but the core idea is that weights propagate through the graph: the weight of a node is computed by aggregating the weights of the nodes linking to it.

class N
   State is PageRank

method getMessage(object N)
   return N.State / N.OutgoingRelations.size()

method calculateState(state s, data [d1, d2,...])
   return ( sum([d1, d2,...]) )

It is worth pointing out that representing the state by a single numeric score is a simplification; in practice it can be beneficial to aggregate these values on the mapper side. The following code fragment shows the modified logic (for the PageRank algorithm):

class Mapper
   method Initialize
      H = new AssociativeArray
   method Map(id n, object N)
      p = N.PageRank / N.OutgoingRelations.size()
      Emit(id n, object N)
      for all id m in N.OutgoingRelations do
         H{m} = H{m} + p
   method Close
      for all id n in H do
         Emit(id n, value H{n})

class Reducer
   method Reduce(id m, [s1, s2,...])
      M = null
      p = 0
      for all s in [s1, s2,...] do
         if IsObject(s) then
            M = s
         else
            p = p + s
      M.PageRank = p
      Emit(id m, item M)
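A hedged local sketch of one simplified PageRank iteration that mirrors this mapper-side aggregation (no damping factor or dangling-node handling, and the tiny graph is made up):

from collections import defaultdict

def pagerank_iteration(nodes):
    # nodes: dict of id -> {'rank': float, 'out': [neighbor ids]}  (hypothetical layout)
    contributions = defaultdict(float)
    for nid, node in nodes.items():
        if node['out']:
            share = node['rank'] / len(node['out'])   # getMessage: rank / out-degree
            for m in node['out']:
                contributions[m] += share             # mapper-side aggregation into H
    return {nid: {'rank': contributions.get(nid, 0.0), 'out': node['out']}
            for nid, node in nodes.items()}           # calculateState: sum of messages

# Hypothetical three-page graph, initial rank 1/3 each.
graph = {'x': {'rank': 1/3.0, 'out': ['y', 'z']},
         'y': {'rank': 1/3.0, 'out': ['z']},
         'z': {'rank': 1/3.0, 'out': ['x']}}
for _ in range(10):
    graph = pagerank_iteration(graph)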

Application:
Graph analysis, web indexing

Distinct Values (Counting Unique Items)

Problem statement: Records contain a value field F and a category field G. For each value of G, count the total number of distinct values of F that occur in records having that value of G (i.e., count unique F values grouped by G).

This problem arises in faceted search (which some e-commerce sites call narrowed search).

Record 1: F=1, G={a, b}
Record 2: F=2, G={a, d, e}
Record 3: F=1, G={b}
Record 4: F=3, G={a, b}

Result:
a -> 3   // F=1, F=2, F=3
b -> 2   // F=1, F=3
d -> 1   // F=2
e -> 1   // F=2

Solution I:

The first approach solves the problem in two phases. In the first phase, the mapper emits composite keys of the form (G, F), and the reducer emits each distinct pair exactly once, guaranteeing the uniqueness of each F within a G. In the second phase, the pairs are grouped by the G value and the number of entries in each group is counted.

First stage:

class Mapper
   method Map(null, record [value f, categories [g1, g2,...]])
      for all category g in [g1, g2,...]
         Emit(record [g, f], count 1)

class Reducer
   method Reduce(record [g, f], counts [n1, n2,...])
      Emit(record [g, f], null)

Second Stage:

class Mapper
   method Map(record [f, g], null)
      Emit(value g, count 1)

class Reducer
   method Reduce(value g, counts [n1, n2,...])
      Emit(value g, sum( [n1, n2,...] ))
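To make the two phases concrete, here is a hedged local Python sketch that mimics both jobs on the small example above (a simulation of the data flow, not a Hadoop job):

from collections import defaultdict

records = [(1, {'a', 'b'}), (2, {'a', 'd', 'e'}), (1, {'b'}), (3, {'a', 'b'})]

# Phase 1: emit composite keys (g, f); the reducer keeps each distinct pair once.
phase1 = set()
for f, categories in records:
    for g in categories:
        phase1.add((g, f))

# Phase 2: group the (g, f) pairs by g and count them.
result = defaultdict(int)
for g, f in phase1:
    result[g] += 1
# result == {'a': 3, 'b': 2, 'd': 1, 'e': 1}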

Solution II:

The second method requires only one MapReduce job, but it is not very scalable. The algorithm is simple: the mapper emits (value, category) pairs; for each value, the reducer removes duplicate categories and increments a counter for each remaining category; after all reduce calls have finished, the per-category counters are emitted. This method applies only when the number of categories is limited and the number of records sharing the same F value is not too large, for example when processing web logs and classifying users: the total number of users is large, but the number of events per user and the number of categories are limited. It is worth mentioning that a combiner can be used in this scheme to remove duplicate categories before the data is sent to the reducer.

class Mapper
   method Map(null, record [value f, categories [g1, g2,...]])
      for all category g in [g1, g2,...]
         Emit(value f, category g)

class Reducer
   method Initialize
      H = new AssociativeArray : category -> count
   method Reduce(value f, categories [g1, g2,...])
      [g1', g2',...] = ExcludeDuplicates( [g1, g2,...] )
      for all category g in [g1', g2',...]
         H{g} = H{g} + 1
   method Close
      for all category g in H do
         Emit(category g, count H{g})

Application:
Log analysis, counting unique users

Cross-Correlation

Problem statement: There is a set of tuples, each containing several items. For every pair of items, count the number of tuples in which the two items occur together. If the total number of items is N, then N*N values must be reported.

This situation is common in text analysis (items are words and tuples are sentences) and in market analysis (customers who buy this also tend to buy that). If N*N is small enough to fit in the memory of a single machine, the implementation is straightforward.

Pairs approach

The first approach is to emit all item pairs in the mapper and sum the counts for each pair in the reducer. The drawback of this approach is that the benefit of using combiners is limited, since most item pairs are likely to be distinct and cannot be aggregated effectively.

class Mapper
   method Map(null, items [i1, i2,...])
      for all item i in [i1, i2,...]
         for all item j in [i1, i2,...]
            Emit(pair [i j], count 1)

class Reducer
   method Reduce(pair [i j], counts [c1, c2,...])
      s = sum( [c1, c2,...] )
      Emit(pair [i j], count s)

Stripes approach (translator's note: I am not sure how best to translate this name.)

The second approach is to group the data by the first item of each pair and maintain an associative array (a "stripe") in which the counters of all co-occurring items are accumulated. The reducer receives all stripes for a leading item i, merges them, and emits the same pairs as the first approach. This produces a relatively small number of intermediate keys, so the framework spends less on sorting; combiners can be used effectively; accumulation is done in memory, which can cause problems if not done carefully; the implementation is more complex; and in general, "stripes" is faster than "pairs".

class Mapper
   method Map(null, items [i1, i2,...])
      for all item i in [i1, i2,...]
         H = new AssociativeArray : item -> counter
         for all item j in [i1, i2,...]
            H{j} = H{j} + 1
         Emit(item i, stripe H)

class Reducer
   method Reduce(item i, stripes [H1, H2,...])
      H = new AssociativeArray : item -> counter
      H = merge-sum( [H1, H2,...] )
      for all item j in H.keys()
         Emit(pair [i j], H{j})
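A hedged streaming-style sketch of the stripes approach in Python. Stripes are serialized as JSON dictionaries, which is just one possible encoding, and self-pairs are excluded here as a design choice; the script names are hypothetical.

#!/usr/bin/env python
# stripes_mapper.py -- one associative array ("stripe") per leading item
import sys, json
from collections import Counter

for line in sys.stdin:
    items = line.split()
    for i in items:
        stripe = Counter(j for j in items if j != i)   # co-occurring items in this tuple
        print("%s\t%s" % (i, json.dumps(stripe)))

#!/usr/bin/env python
# stripes_reducer.py -- merge-sum all stripes for the same leading item
import sys, json
from itertools import groupby
from collections import Counter

def pairs(stream):
    for line in stream:
        i, stripe = line.rstrip("\n").split("\t", 1)
        yield i, Counter(json.loads(stripe))

for i, group in groupby(pairs(sys.stdin), key=lambda kv: kv[0]):
    merged = Counter()
    for _, stripe in group:
        merged += stripe
    for j, count in merged.items():
        print("%s,%s\t%d" % (i, j, count))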

Application:
Text analysis, market analysis

Reference: Lin J., Dyer C. Data-Intensive Text Processing with MapReduce

Relational MapReduce Patterns

In this section we discuss how to perform the main relational operations with MapReduce.

Selection (Filtering)

class Mapper
   method Map(rowkey key, tuple t)
      if t satisfies the predicate
         Emit(tuple t, null)
Projection

Projection is only slightly more complex than selection; in this case we use a reducer to eliminate possible duplicates.

class Mapper
   method Map(rowkey key, tuple t)
      tuple g = project(t)   // extract required fields to tuple g
      Emit(tuple g, null)

class Reducer
   method Reduce(tuple t, array n)   // n is an array of nulls
      Emit(tuple t, null)
Union

All records from the two data sets to be united are fed to the mappers; the reducer is used to eliminate duplicates.

class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, null)

class Reducer
   method Reduce(tuple t, array n)   // n is an array of one or two nulls
      Emit(tuple t, null)

Intersection

All records from the two data sets to be intersected are fed to the mappers; the reducer emits only those records that appear twice. This works only because each record has a primary key and can appear at most once in each data set.

class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, null)

class Reducer
   method Reduce(tuple t, array n)   // n is an array of one or two nulls
      if n.size() = 2
         Emit(tuple t, null)
Difference

Suppose there are two data sets R and S and we want to compute the difference R − S. The mapper tags each tuple to indicate which set (R or S) it came from; the reducer emits only those records that exist in R but not in S.

class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, string t.SetName)   // t.SetName is either 'R' or 'S'

class Reducer
   method Reduce(tuple t, array n)   // array n can be ['R'], ['S'], ['R', 'S'], or ['S', 'R']
      if n.size() = 1 and n[1] = 'R'
         Emit(tuple t, null)
Group Aggregation (GroupBy and Aggregation)

Grouping and aggregation can be performed in a single MapReduce job as follows: the mapper extracts the group-by value and the value to aggregate from each tuple, and the reducer aggregates the values it receives for each group. Typical aggregations such as sum or max can be computed in a streaming fashion and therefore do not require holding all values in memory at once. Other cases, however, require a two-stage MapReduce job; the distinct values pattern described above is an example of this kind.

class Mapper
   method Map(null, tuple [value GroupBy, value AggregateBy, value ...])
      Emit(value GroupBy, value AggregateBy)

class Reducer
   method Reduce(value GroupBy, [v1, v2,...])
      Emit(value GroupBy, aggregate( [v1, v2,...] ))   // aggregate() : sum(), max(),...
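A hedged streaming-style sketch of grouping with a streaming (constant-memory) aggregate, here SUM; the tab-separated (group, value) record format and the script name are assumptions.

#!/usr/bin/env python
# groupby_reducer.py -- sum values per group; the mapper just emits (group, value) pairs
import sys
from itertools import groupby

def pairs(stream):
    for line in stream:
        group, value = line.rstrip("\n").split("\t", 1)
        yield group, float(value)

for group, rows in groupby(pairs(sys.stdin), key=lambda kv: kv[0]):
    print("%s\t%g" % (group, sum(v for _, v in rows)))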
Joining

The MapReduce framework handles joins well, but there are different techniques depending on the data volumes and processing-efficiency requirements. In this section we introduce the basic approaches; articles dedicated to this topic are listed in the references of the original article.
Repartition Join (Reduce-Side Join, Sort-Merge Join)
This algorithm joins data sets R and L on key K. The mapper iterates over all tuples from R and L, extracts key K from each tuple, and emits the tuple, tagged with its source (R or L), with K as the key. The reducer partitions the tuples for a given K into two containers (one for R, one for L) and then loops over both containers in nested loops to produce the cross product; each emitted result contains K, the data from R, and the data from L. This approach has the following drawbacks: the mapper must emit every tuple, even for keys that occur in only one of the sets; and the reducer must hold all data for one key in memory; if the data does not fit in memory, it must be spilled to disk, which increases disk I/O.

Nevertheless, the repartition join is still the most general and commonly used approach, especially when none of the other optimization techniques apply.

class Mapper
   method Map(null, tuple [join_key k, value v1, value v2,...])
      Emit(join_key k, tagged_tuple [set_name tag, values [v1, v2,...]])

class Reducer
   method Reduce(join_key k, tagged_tuples [t1, t2,...])
      H = new AssociativeArray : set_name -> values
      for all tagged_tuple t in [t1, t2,...]   // separate values into 2 arrays
         H{t.tag}.add(t.values)
      for all values r in H{'R'}   // produce a cross-join of the two arrays
         for all values l in H{'L'}
            Emit(null, [k r l])
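A hedged streaming-style sketch of the reduce-side join in Python; tuples are assumed to arrive at the reducer as tab-separated lines "key<TAB>tag<TAB>payload", where tag is R or L, and the script name is hypothetical.

#!/usr/bin/env python
# repartition_join_reducer.py -- cross-join the R and L tuples that share a key
import sys
from itertools import groupby

def rows(stream):
    for line in stream:
        key, tag, payload = line.rstrip("\n").split("\t", 2)
        yield key, tag, payload

for key, group in groupby(rows(sys.stdin), key=lambda r: r[0]):
    buckets = {'R': [], 'L': []}
    for _, tag, payload in group:
        buckets[tag].append(payload)           # in a real job, large groups may spill to disk
    for r in buckets['R']:
        for l in buckets['L']:
            print("%s\t%s\t%s" % (key, r, l))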

Replicated Join (Map-Side Join, Hash Join)
In practice it is very common to join a small data set with a large one (for example, users with log records). Suppose we want to join sets R and L where R is relatively small. R can be distributed to all mappers; each mapper loads it and indexes it by the join key, the most common and efficient indexing technique being a hash table. Each mapper then iterates over the tuples of L and joins them with the corresponding tuples of R stored in the hash table. This approach is very efficient because the data in L does not need to be sorted or transferred over the network, but R must be small enough to be distributed to every mapper.

class Mapper
   method Initialize
      H = new AssociativeArray : join_key -> tuple from R
      R = loadR()
      for all [join_key k, tuple [r1, r2,...]] in R
         H{k} = H{k}.append([r1, r2,...])

   method Map(join_key k, tuple l)
      for all tuple r in H{k}
         Emit(null, tuple [k r l])
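A hedged streaming-style sketch of the map-side (replicated) join: the small set R is read from a local file shipped to every mapper (for example via Hadoop's distributed cache), indexed in a hash table, and probed for each tuple of L. The file name small_set_R.tsv, the tab-separated formats, and the script name are assumptions.

#!/usr/bin/env python
# replicated_join_mapper.py -- hash-join each tuple of L against an in-memory copy of R
import sys
from collections import defaultdict

# Build the hash index over R (assumed to be available as a local file on every mapper).
R = defaultdict(list)
with open("small_set_R.tsv") as f:            # hypothetical file name
    for line in f:
        k, payload = line.rstrip("\n").split("\t", 1)
        R[k].append(payload)

# Stream over L and emit one joined record per matching tuple of R; no reducer is needed.
for line in sys.stdin:
    k, l_payload = line.rstrip("\n").split("\t", 1)
    for r_payload in R.get(k, []):
        print("%s\t%s\t%s" % (k, r_payload, l_payload))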
