Reprinted from: yangguan.org/mapreduce-patterns-algorithms-and-use-cases; translated from: highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns. This article summarizes several common MapReduce patterns and algorithms found on the web and in the literature, and systematically explains the differences between these techniques.
All descriptive text and code use the standard Hadoop MapReduce model, including Mappers, Reducers, Combiners, Partitioners, and sorting.
Basic MapReduce Patterns

Counting and Summing
Problem description: There are a number of documents, each consisting of a set of terms. Compute the number of occurrences of each term across all documents, or some other statistic over these terms. For example, given a log file in which each record contains a response time, compute the average response time.
Solution: Let's start with a simple example. In the code snippet below, the Mapper emits a count of 1 every time it encounters a given word, and the Reducer iterates over these counts and adds them up.
class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Reducer
   method Reduce(term t, counts [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
         sum = sum + c
      Emit(term t, count sum)
The obvious disadvantage of this approach is that the Mapper emits a huge number of trivial counts. The counts can be accumulated per document first, to reduce the amount of data transferred to the Reducer:
class Mapper
   method Map(docid id, doc d)
      H = new AssociativeArray
      for all term t in doc d do
         H{t} = H{t} + 1
      for all term t in H do
         Emit(term t, count H{t})
To accumulate counts not only within a single document but across all the documents processed by one Mapper node, use a Combiner:
class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Combiner
   method Combine(term t, [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
         sum = sum + c
      Emit(term t, count sum)

class Reducer
   method Reduce(term t, counts [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
         sum = sum + c
      Emit(term t, count sum)
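To make the data flow concrete, below is a minimal local simulation of this pattern in Python (illustrative only, not Hadoop code: the shuffle phase is emulated with a dictionary, and one "mapper" runs per document). The combiner pre-aggregates each mapper's output before the simulated shuffle:

from collections import Counter, defaultdict

docs = ["a rose is a rose", "a war is a war"]

def mapper(doc):
    # Emit (term, 1) for every term in the document.
    return [(term, 1) for term in doc.split()]

def combiner(pairs):
    # Pre-aggregate the counts produced by a single mapper.
    local = Counter()
    for term, c in pairs:
        local[term] += c
    return list(local.items())

# Simulated shuffle: group all (term, count) pairs by term.
groups = defaultdict(list)
for doc in docs:                          # one "mapper" per document
    for term, c in combiner(mapper(doc)):
        groups[term].append(c)

# Reducer: sum the partial counts for each term.
for term, counts in sorted(groups.items()):
    print(term, sum(counts))              # a 4, is 2, rose 2, war 2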
Applications: log analysis, data querying
Collating

Problem description: There is a set of items, each with some property. All items sharing a property value need to be saved into one file, or the items need to be grouped by property value in some other way. The most typical application is an inverted index.
Solution: The solution is straightforward. In the Mapper, the required property of each item is emitted as the key and the item itself as the value. The Reducer receives all items grouped by property value and can process or save them. When building an inverted index, each item is a term (word) and the property value is the ID of the document the term occurs in.

Applications: inverted index, ETL
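As an illustration, here is a minimal local sketch of the collating pattern that builds an inverted index; the toy documents and the dictionary-based shuffle are stand-ins for real input splits and the Hadoop shuffle:

from collections import defaultdict

docs = {1: "mapreduce patterns", 2: "mapreduce use cases", 3: "design patterns"}

def mapper(doc_id, text):
    # Emit (term, doc_id): the grouping property is the key.
    return [(term, doc_id) for term in text.split()]

# Simulated shuffle: group document IDs by term.
index = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in mapper(doc_id, text):
        index[term].append(d)

# Reducer: the grouped values are already the posting lists; just save them.
for term, postings in sorted(index.items()):
    print(term, "->", sorted(postings))   # e.g. patterns -> [1, 3]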
Filtering ("Grepping"), Parsing, and Validation

Problem description: There is a set of records. Find all records that satisfy some condition, or transform each record into another representation (the transformation is applied to each record independently; the operation on one record does not depend on any other record). Text parsing, extraction of specific fields, and format conversion all belong to the latter case.
Solution: Very simple: process the records one by one in the Mapper and emit the accepted records or their transformed representations.

Applications: log analysis, data querying, ETL, data validation
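A small sketch of a filtering job, assuming hypothetical web-server log lines ending in an HTTP status code; since the Mapper alone does all the work, a real Hadoop job of this kind can run with zero reducers:

import re

log = [
    "2012-02-01 GET /index.html 200",
    "2012-02-01 GET /missing 404",
    "2012-02-02 POST /login 500",
]

def mapper(record):
    # Emit the record only if it satisfies the predicate (status >= 400);
    # a transformation would emit a modified record here instead.
    m = re.search(r" (\d{3})$", record)
    if m and int(m.group(1)) >= 400:
        yield record

for record in log:
    for bad in mapper(record):
        print(bad)   # prints the two failed requests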
Distributed Task Execution

Problem description: A large computation can be divided into multiple parts whose partial results are then merged to obtain the final result.
Solution: Split the data into multiple parts, one per Mapper. Each Mapper processes its part, performs the same computation, and emits its result; the Reducer combines the results of the multiple Mappers into one.

Case study: simulation of a digital communication system. Simulation software for a digital communication system such as WiMAX transmits a large volume of random data through the system model and computes the probability of transmission errors. Each Mapper processes a 1/N sample of the data and computes the error rate for its part; the Reducer then computes the average error rate.

Applications: engineering simulation, digital analysis, performance testing
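The sketch below imitates this scheme with a toy "channel model" (a random bit flip with probability flip_p, a stand-in for a real system model such as WiMAX). Note that each simulated mapper emits an (errors, samples) pair rather than a local rate, so the reducer's average stays correct even if the parts differ in size:

import random

def simulate_part(seed, samples=10_000, flip_p=0.01):
    # One mapper's 1/N share of the simulation: push random bits through
    # a noisy channel and count the transmission errors.
    rng = random.Random(seed)
    errors = sum(1 for _ in range(samples) if rng.random() < flip_p)
    return errors, samples            # emit a partial (numerator, denominator)

# Map phase: N independent parts; reduce phase: merge the partial sums.
partials = [simulate_part(seed) for seed in range(8)]
total_errors = sum(e for e, _ in partials)
total_samples = sum(s for _, s in partials)
print("error rate:", total_errors / total_samples)   # close to flip_p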
Sorting

Problem description: There is a set of records that need to be sorted by some rule or processed in a defined order.
Solution: Simple sorting is easy: the Mappers emit the attribute to sort on as the key and the entire record as the value. In practical applications, however, sorting is often used in much cleverer ways, which is why it is sometimes called the heart of MapReduce (both because the experiments demonstrating Hadoop's computing power were big-data sorting benchmarks, and because sorting by key is the central step of Hadoop's processing pipeline). In practice, composite keys are often used to achieve secondary sorting and grouping, as sketched below. Out of the box, MapReduce sorts only by key, but there are techniques that leverage Hadoop's features to sort by value as well; see the blog post linked from the original article. Note also that, following the BigTable concept, MapReduce is better suited to sorting the initial data than intermediate data, i.e., to keeping the data in sorted order: sorting data once at insertion time is more efficient than sorting it for every query.

Applications: ETL, data analysis
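Here is a minimal sketch of the composite-key (secondary sort) idea: sorting on a (natural key, secondary key) pair delivers each group's values to the reducer already ordered. In real Hadoop this additionally requires a custom Partitioner and grouping comparator so that all records with the same natural key reach the same Reduce call; the snippet simulates both steps in memory:

from itertools import groupby

# (user, response_time) records; each user's times should arrive sorted.
records = [("bob", 250), ("alice", 120), ("bob", 80), ("alice", 300)]

# Composite key (user, time): the framework's sort phase orders both parts.
shuffled = sorted(records)

# Grouping on the natural key only: each reduce call then sees its values
# already ordered by the secondary key.
for user, group in groupby(shuffled, key=lambda kv: kv[0]):
    print(user, [t for _, t in group])   # alice [120, 300] / bob [80, 250]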
Non-Basic MapReduce Patterns

Iterative Message Passing (Graph Processing)

Problem description: There is a network of entities with relationships between them. A state must be computed for each entity based on the properties of its neighbors. The state can be, for example, the distance to other nodes, an indication that a node with certain properties is nearby, or a characteristic of neighborhood density.
Solution: The network is stored as a set of nodes, where each node contains a list of the IDs of its adjacent nodes. Conceptually, MapReduce jobs are performed iteratively: at each iteration, every node sends messages to its neighbors, and each neighbor updates its state based on the messages it receives. Iteration stops when some condition is met, for example when a maximum number of iterations (on the order of the network diameter) is reached, or when the states barely change between two consecutive iterations. Technically, the Mapper emits the messages keyed by the IDs of the adjacent nodes, all messages are grouped by the receiving node, and the Reducer recomputes each node's state and rewrites the nodes whose state has changed. The algorithm is shown below:
class Mapper
   method Map(id n, object N)
      Emit(id n, object N)
      for all id m in N.OutgoingRelations do
         Emit(id m, message getMessage(N))

class Reducer
   method Reduce(id m, [s1, s2,...])
      M = null
      messages = []
      for all s in [s1, s2,...] do
         if IsObject(s) then
            M = s
         else                     // s is a message
            messages.add(s)
      M.State = calculateState(messages)
      Emit(id m, item M)
A node's state propagates through the network quickly: "infected" nodes infect their neighbors in turn.

Case study: availability propagation through a tree of categories
Problem description: This problem comes from a real e-commerce application. Goods are organized into categories that form a tree: larger categories (such as Men, Women, Children) are divided into smaller ones (such as Men's Trousers or Women's Clothes), down to categories that cannot be divided further (like Men's Blue Jeans). A leaf category is either available (it contains goods) or not (no goods belong to it). A non-leaf category is considered available if at least one of its subcategories is available. The goal is to determine all available categories of the tree, given the availability of the leaf categories.
Solution: This problem can be solved with the framework defined in the previous section, by defining the getMessage and calculateState methods as follows:
class N
   State in {True = 2, False = 1, null = 0},
      initialized 1 or 2 for leaf categories, 0 otherwise

method getMessage(object N)
   return N.State

method calculateState(state s, data [d1, d2,...])
   return max( [d1, d2,...] )
Case study: breadth-first search
Problem description: Calculate the distance (number of hops) from one source node of a graph to all other nodes.
Solution: The source node sends a signal with value 0 to all of its neighbors, and the neighbors propagate the received signal onward to their own neighbors, incrementing the value by 1 at each hop:
class N
   State is distance,
      initialized 0 for source node, INFINITY for all other nodes

method getMessage(N)
   return N.State + 1

method calculateState(state s, data [d1, d2,...])
   return min( [d1, d2,...] )
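A runnable simulation of this breadth-first search on a toy graph (the graph, node names, and the in-memory "inbox" standing in for the shuffle are all illustrative). One detail: the node's current state is included in the minimum, so that, for example, the source keeps distance 0 even after receiving messages from its neighbors:

INF = float("inf")

# Adjacency lists of a small undirected graph; "a" is the source node.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
state = {n: 0 if n == "a" else INF for n in graph}

def get_message(n):
    return state[n] + 1

def calculate_state(old, messages):
    # The current state is included so an already-minimal distance
    # (e.g. the source's 0) is never overwritten by a larger message.
    return min([old] + messages)

changed = True
while changed:                                   # one MapReduce job per pass
    inbox = {n: [] for n in graph}
    for n, neighbours in graph.items():          # map: send state+1 to neighbours
        if state[n] != INF:
            for m in neighbours:
                inbox[m].append(get_message(n))
    new_state = {n: calculate_state(state[n], inbox[n]) for n in graph}  # reduce
    changed = new_state != state
    state = new_state

print(state)   # {'a': 0, 'b': 1, 'c': 1, 'd': 2}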
Case study: PageRank and Mapper-side data aggregation

The PageRank algorithm was proposed by Google to calculate the relevance of a web page from the web pages linking to it. The real algorithm is quite complex, but its core idea is that weights can be propagated: each node computes its weight from the weights passed along by the nodes that link to it, and each node distributes its own weight evenly over its outgoing links.
class N
   State is PageRank

method getMessage(object N)
   return N.State / N.OutgoingRelations.size()

method calculateState(state s, data [d1, d2,...])
   return ( sum([d1, d2,...]) )
It should be pointed out that using a single value as the score is actually a simplification; in real scenarios the value needs to be aggregated on the Mapper side, in the spirit of the Combiner optimization described earlier. The following code snippet shows the modified logic (for the PageRank algorithm):
class Mapper
   method Initialize
      H = new AssociativeArray
   method Map(id n, object N)
      p = N.PageRank / N.OutgoingRelations.size()
      Emit(id n, object N)
      for all id m in N.OutgoingRelations do
         H{m} = H{m} + p
   method Close
      for all id n in H do
         Emit(id n, value H{n})

class Reducer
   method Reduce(id m, [s1, s2,...])
      M = null
      p = 0
      for all s in [s1, s2,...] do
         if IsObject(s) then
            M = s
         else
            p = p + s
      M.PageRank = p
      Emit(id m, item M)
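A compact simulation of this simplified PageRank (no damping factor, matching the pseudocode above; the tiny link graph is illustrative). The dictionary H plays the role of the Mapper-side associative array that accumulates rank shares before they are "emitted":

from collections import defaultdict

# Out-links of a tiny three-page web graph; ranks start out uniform.
links = {"x": ["y", "z"], "y": ["z"], "z": ["x"]}
rank = {n: 1.0 / len(links) for n in links}

for _ in range(30):                      # fixed number of iterations
    H = defaultdict(float)               # Mapper-side associative array
    for n, outs in links.items():        # map phase
        share = rank[n] / len(outs)
        for m in outs:
            H[m] += share                # accumulate instead of emitting each share
    rank = {n: H[n] for n in links}      # reduce phase: summed shares become ranks

print({n: round(r, 3) for n, r in rank.items()})   # converges to x=0.4, y=0.2, z=0.4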
Applications: graph analysis, web indexing

Distinct Values (Counting Unique Items)
Problem description: Records contain a value field F and category fields G. For each category, count the number of distinct F values among the records containing that category (i.e., group by category). This problem is common in faceted search (called "narrowed search" by some e-commerce sites):
Record 1: F=1, G={a, b}
Record 2: F=2, G={a, d, e}
Record 3: F=1, G={b}
Record 4: F=3, G={a, b}

Result:
a -> 3   // F=1, F=2, F=3
b -> 2   // F=1, F=3
d -> 1   // F=2
e -> 1   // F=2
Solution I: The first approach solves the problem in two stages. In the first stage, the Mapper emits composite (G, F) pairs and the Reducer emits each pair exactly once, guaranteeing the uniqueness of the F values. In the second stage, the pairs are grouped by the G value and the number of items in each group is counted.

Stage 1:
class Mapper
   method Map(null, record [value f, categories [g1, g2,...]])
      for all category g in [g1, g2,...]
         Emit(record [g, f], count 1)

class Reducer
   method Reduce(record [g, f], counts [n1, n2,...])
      Emit(record [g, f], null)
Stage 2:
class Mapper
   method Map(record [f, g], null)
      Emit(value g, count 1)

class Reducer
   method Reduce(value g, counts [n1, n2,...])
      Emit(value g, sum( [n1, n2,...] ))
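Both stages can be traced with the example records above; in this local sketch a Python set plays the role of the Stage 1 Reducer, since its only job is to collapse duplicate (g, f) pairs:

from collections import defaultdict

# Records: (F value, set of categories G), as in the example above.
records = [(1, {"a", "b"}), (2, {"a", "d", "e"}), (1, {"b"}), (3, {"a", "b"})]

# Stage 1 - mapper emits composite (g, f) pairs; the reducer only
# deduplicates them, which a set models directly.
pairs = set()
for f, categories in records:
    for g in categories:
        pairs.add((g, f))

# Stage 2 - mapper emits (g, 1) per unique pair; reducer sums the counts.
counts = defaultdict(int)
for g, f in pairs:
    counts[g] += 1

print(dict(sorted(counts.items())))   # {'a': 3, 'b': 2, 'd': 1, 'e': 1}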
Solution II: The second approach requires only one MapReduce job, but it does not scale well. The algorithm is simple: the Mapper emits (value, category) pairs; the Reducer deduplicates the categories for each value and increments a counter for each remaining category; after all values have been processed, the Reducer emits the accumulated totals. This method is suitable when the number of categories is limited and the number of records sharing the same F value is not too large, e.g., web-log processing with user classification: the total number of users is high, but the number of events per user is limited, as is the number of categories derived from them. It is worth noting that a Combiner can be used in this scheme to remove duplicate categories before the data is transmitted to the Reducer.
class Mapper
   method Map(null, record [value f, categories [g1, g2,...] )
      for all category g in [g1, g2,...]
         Emit(value f, category g)

class Reducer
   method Initialize
      H = new AssociativeArray : category -> count
   method Reduce(value f, categories [g1, g2,...])
      [g1', g2',...] = ExcludeDuplicates( [g1, g2,...] )
      for all category g in [g1', g2',...]
         H{g} = H{g} + 1
   method Close
      for all category g in H do
         Emit(category g, count H{g})
Applications: log analysis, unique user counting

Cross-Correlation
Problem description: There is a set of tuples, each composed of several items. For each possible pair of items, count the number of tuples in which both items co-occur. If the total number of items is N, then N*N values should be reported. This situation is common in text analysis (the items are words and the tuples are sentences) and in market analysis (which other items customers who bought this item might also buy). If N*N is small enough to fit in the memory of a single machine, implementation is straightforward.
Pairs approach: The first approach is to emit every pair of items from the Mapper and sum the counts for identical pairs in the Reducer. The drawbacks of this approach are:
- The benefits of using combiners are limited, because it is likely that all item pairs are unique.
- Memory cannot be used effectively (there is no in-memory accumulation)
class Mapper
   method Map(null, items [i1, i2,...] )
      for all item i in [i1, i2,...]
         for all item j in [i1, i2,...]
            Emit(pair [i j], count 1)

class Reducer
   method Reduce(pair [i j], counts [c1, c2,...])
      s = sum([c1, c2,...])
      Emit(pair [i j], count s)
Stripes approach: The second approach is to group the data by the first item of each pair and maintain an associative array (a "stripe") in which the counters for all co-occurring items are accumulated. The Reducer receives all stripes for a leading item i, merges them, and emits the same result as the Pairs approach. This approach has the following properties:
- The number of intermediate keys is smaller, which reduces sorting overhead.
- Combiners can be used effectively.
- Accumulation is performed in memory, which can cause problems if not implemented correctly.
- The implementation is more complex.
- In general, "stripes" is faster than "pairs".
class Mapper
   method Map(null, items [i1, i2,...] )
      for all item i in [i1, i2,...]
         H = new AssociativeArray : item -> counter
         for all item j in [i1, i2,...]
            H{j} = H{j} + 1
         Emit(item i, stripe H)

class Reducer
   method Reduce(item i, stripes [H1, H2,...])
      H = new AssociativeArray : item -> counter
      H = merge-sum( [H1, H2,...] )
      for all item j in H.keys()
         Emit(pair [i j], H{j})
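The following sketch runs both approaches on toy market-basket data and checks that they agree (unlike the pseudocode, self-pairs (i, i) are skipped here, which only affects the diagonal of the co-occurrence matrix):

from collections import Counter, defaultdict

baskets = [["milk", "bread", "beer"], ["milk", "bread"], ["bread", "beer"]]

# Pairs approach: emit ((i, j), 1) for every co-occurring pair, sum in reduce.
pairs = Counter()
for items in baskets:
    for i in items:
        for j in items:
            if i != j:
                pairs[(i, j)] += 1

# Stripes approach: one associative array per leading item, merged in reduce.
stripes = defaultdict(Counter)
for items in baskets:
    for i in items:
        for j in items:
            if i != j:
                stripes[i][j] += 1      # far fewer intermediate keys than pairs

# Both produce identical co-occurrence counts.
assert all(pairs[(i, j)] == c for i, s in stripes.items() for j, c in s.items())
print(pairs[("milk", "bread")])         # 2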
Applications: text analysis, market analysis

References:
- Lin J., Dyer C. Data-Intensive Text Processing with MapReduce.
Relational MapReduce Patterns

In this section we discuss how the main relational operators can be implemented with MapReduce.

Selection
class Mapper
   method Map(rowkey key, tuple t)
      if t satisfies the predicate
         Emit(tuple t, null)
Projection

Projection is only slightly more complex than selection: in this case a Reducer is used to eliminate possible duplicates.
class Mapper
   method Map(rowkey key, tuple t)
      tuple g = project(t)   // extract required fields to tuple g
      Emit(tuple g, null)

class Reducer
   method Reduce(tuple t, array n)   // n is an array of nulls
      Emit(tuple t, null)
Union

All records from both datasets are fed to the Mapper, and duplicates are eliminated in the Reducer.
class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, null)

class Reducer
   method Reduce(tuple t, array n)   // n is an array of one or two nulls
      Emit(tuple t, null)
Intersection

The records from the two datasets to be intersected are fed to the Mapper, and the Reducer emits only those records that occurred twice. This works because each record has a primary key and occurs at most once in each dataset.
class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, null)

class Reducer
   method Reduce(tuple t, array n)   // n is an array of one or two nulls
      if n.size() = 2
         Emit(tuple t, null)
Difference

Suppose there are two datasets, R and S, and the difference R − S is needed. The Mapper tags each tuple with the name of the dataset it came from, and the Reducer emits only those records that exist in R but not in S.
class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, string t.SetName)   // t.SetName is either 'R' or 'S'

class Reducer
   method Reduce(tuple t, array n)   // array n can be ['R'], ['S'], ['R', 'S'], or ['S', 'R']
      if n.size() = 1 and n[1] = 'R'
         Emit(tuple t, null)
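Since intersection and difference differ only in the Reducer's test on the tag array, one tagged job can illustrate both; the tuples below are toy data:

from collections import defaultdict

R = [("alice", 1), ("bob", 2)]
S = [("bob", 2), ("carol", 3)]

# Mapper: emit each tuple as the key, tagged with the name of its source set.
tagged = defaultdict(list)
for t in R:
    tagged[t].append("R")
for t in S:
    tagged[t].append("S")

# Reducer: the tag array decides membership (each tuple occurs at most
# once per set, so the array has length one or two).
intersection = [t for t, tags in tagged.items() if len(tags) == 2]
difference   = [t for t, tags in tagged.items() if tags == ["R"]]

print(intersection)   # [('bob', 2)]
print(difference)     # [('alice', 1)]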
Group By and Aggregation

Grouping and aggregation can be performed in a single MapReduce job as follows. The Mapper extracts the group-by field and the field to aggregate from each tuple and emits them; the Reducer aggregates the values it receives for each group. Typical aggregation functions such as sum or max can be computed in a streaming fashion, so there is no need to hold all values in memory at once. Some scenarios, however, require a two-phase MapReduce job; the Distinct Values pattern discussed above is an example.
class Mapper
   method Map(null, tuple [value GroupBy, value AggregateBy, value ...])
      Emit(value GroupBy, value AggregateBy)

class Reducer
   method Reduce(value GroupBy, [v1, v2,...])
      Emit(value GroupBy, aggregate( [v1, v2,...] ))   // aggregate(): sum(), max(), ...
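A minimal local sketch of the group-by pattern, using hypothetical (page, response time) tuples; sum and max are examples of aggregates computable in a streaming fashion with constant per-group state:

from collections import defaultdict

# (GroupBy, AggregateBy) pairs, e.g. (page, response time in ms).
rows = [("/home", 120), ("/home", 80), ("/login", 300), ("/login", 100)]

# Mapper: emit the group-by field as key, the aggregated field as value.
groups = defaultdict(list)
for group_by, aggregate_by in rows:
    groups[group_by].append(aggregate_by)

# Reducer: apply the aggregate function to each group's values.
for group_by, values in groups.items():
    print(group_by, "sum:", sum(values), "max:", max(values))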
Joining

The MapReduce framework handles joins well, but there are a number of techniques whose applicability depends on the data volumes and the processing-efficiency requirements. In this section we describe the basic approaches; some articles on the topic are listed in the references below.

Repartition Join (Reduce-side join, sort-merge join)

This algorithm joins datasets R and L on key K. The Mapper iterates over all tuples from R and L and emits each of them with K as the key, tagged with the dataset (R or L) it came from. The Reducer splits the data for one value of K into two containers (one for R, one for L) and then traverses the two containers in a nested loop to produce the cross product; each emitted result contains the data from R, the data from L, and K. This approach has the following disadvantages:
- The Mapper emits absolutely all the data, even for keys that occur in only one of the sets.
- The Reducer must hold all the data for one key in memory. If the data does not fit, it must be spilled to disk, which increases disk I/O.
Nevertheless, the repartition join is the most generic technique, used especially when the other optimization techniques are not applicable.
class Mapper
   method Map(null, tuple [join_key k, value v1, value v2,...])
      Emit(join_key k, tagged_tuple [set_name tag, values [v1, v2,...] ])

class Reducer
   method Reduce(join_key k, tagged_tuples [t1, t2,...])
      H = new AssociativeArray : set_name -> values
      for all tagged_tuple t in [t1, t2,...]   // separate values into 2 arrays
         H{t.tag}.add(t.values)
      for all values r in H{'R'}               // produce a cross-join of the two arrays
         for all values l in H{'L'}
            Emit(null, [k r l] )
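A local sketch of the repartition join on toy user/event data: the map phase tags and re-keys both inputs, and the reduce phase performs the per-key cross product:

from collections import defaultdict

R = [("u1", "alice"), ("u2", "bob")]                        # (join key, user name)
L = [("u1", "login"), ("u1", "logout"), ("u2", "login")]    # (join key, event)

# Map phase: tag every tuple with its source set and key it by the join key.
shuffled = defaultdict(list)
for k, v in R:
    shuffled[k].append(("R", v))
for k, v in L:
    shuffled[k].append(("L", v))

# Reduce phase: split each key's tuples into two buckets and cross-join them.
for k, tagged in shuffled.items():
    buckets = defaultdict(list)
    for tag, v in tagged:
        buckets[tag].append(v)
    for r in buckets["R"]:
        for l in buckets["L"]:
            print(k, r, l)   # u1 alice login / u1 alice logout / u2 bob login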
Replicated Join (Map-side join, hash join)

In practice, it is very common to join a small dataset with a large one (e.g., a list of users with a log of their events). Assume we join two sets, R and L, where R is relatively small. R can then be distributed to all Mappers, and each Mapper loads it and indexes it by the join key; the most common and effective indexing technique is a hash table. The Mapper then iterates over L and joins each of its tuples with the matching tuples of R stored in the hash table. This approach is very efficient because the data in L is neither sorted nor transmitted over the network, but R must be small enough to be distributed to all Mappers.
class Mapper
   method Initialize
      H = new AssociativeArray : join_key -> tuple from R
      R = loadR()
      for all [ join_key k, tuple [r1, r2,...] ] in R
         H{k} = H{k}.append( [r1, r2,...] )

   method Map(join_key k, tuple l)
      for all tuple r in H{k}
         Emit(null, tuple [k r l] )
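And a local sketch of the replicated join with the same toy data: the small relation R is hash-indexed once (the counterpart of method Initialize), after which the large relation L is simply streamed past it; keys of L with no match in R are dropped, as in an inner join:

from collections import defaultdict

R = [("u1", "alice"), ("u2", "bob")]                       # small set, replicated
L = [("u1", "login"), ("u3", "login"), ("u2", "logout")]   # large set, streamed

# Mapper setup (method Initialize): build a hash index over R once.
H = defaultdict(list)
for k, r in R:
    H[k].append(r)

# Map phase: stream over L and probe the hash table; no shuffle or
# reducer is needed, so L is never sorted or sent over the network.
for k, l in L:
    for r in H[k]:
        print(k, r, l)   # u1 alice login / u2 bob logout ("u3" has no match)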
References:
- Join Algorithms using Map/Reduce
- Optimizing Joins in a MapReduce Environment
MapReduce algorithms used in machine learning and mathematics
- C. T. Chu et al. provide an excellent description of machine learning algorithms for MapReduce in the article Map-Reduce for Machine Learning on Multicore.
- FFT using MapReduce: http://www.slideshare.net/hortonworks/large-scale-math-with-hadoop-mapreduce
- MapReduce for integer factorization: http://www.javiertordable.com/files/MapreduceForIntegerFactorization.pdf
- Matrix multiplication with MapReduce: http://csl.skku.edu/papers/CS-TR-2010-330.pdf and http://www.norstad.org/matrix-multiply/index.html
Original article: [Reprinted] MapReduce Patterns, Algorithms, and Use Cases. Thanks to the original author for sharing.