Original English article: "MapReduce Patterns, Algorithms, and Use Cases", https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
In this article we summarize some of the common MapReduce patterns and algorithms found on the web or in research papers, and systematically explain the differences between these techniques. All descriptions and code samples use the standard Hadoop MapReduce model, with Mappers, Reducers, Combiners, Partitioners, and sorting, as shown below:
Basic MapReduce Patterns: Counting and Summing
Problem Statement: There are a number of documents, each consisting of a set of terms. You need to calculate the number of occurrences of each term across all documents, or some other statistic over the terms. For example, given a log file where each record contains a response time, you need to calculate the average response time.
Solution:
Let's start with a simple example. In the approach below, the Mapper emits a count of 1 for every word it encounters, and the Reducer goes through the counts for each word and sums them up.
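A minimal plain-Python sketch of this idea (the function names and the tiny in-memory driver that simulates the shuffle phase are illustrative, not part of Hadoop):

from collections import defaultdict

def map_word_count(doc_id, text):
    # Emit a count of 1 for every word occurrence in the document.
    for word in text.split():
        yield word, 1

def reduce_word_count(word, counts):
    # Sum all the 1s emitted for this word.
    yield word, sum(counts)

def run(docs):
    # Tiny in-memory driver that simulates the shuffle phase.
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_word_count(doc_id, text):
            groups[key].append(value)
    return {k: v for key, values in groups.items() for k, v in reduce_word_count(key, values)}

print(run({"d1": "to be or not to be"}))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}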
The obvious drawback of this approach is that the Mapper emits a huge number of trivial counters. The amount of data passed to the Reducer can be reduced by first summing the counts of each word within a document:
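A sketch of the same Mapper with per-document aggregation; the Reducer from the previous snippet is unchanged (names are illustrative):

from collections import Counter

def map_word_count_per_document(doc_id, text):
    # Count words within the document first, then emit one pair per distinct word.
    for word, count in Counter(text.split()).items():
        yield word, count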
If you want to accumulate counts not just within a single document but across all documents processed by one Mapper node, you can use a Combiner:
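In Hadoop a Combiner is a reducer-like step applied to a Mapper node's local output before the shuffle. A hedged sketch of that idea, simulated in plain Python and reusing the per-document Mapper from the previous snippet:

from collections import defaultdict

def combine_word_count(word, partial_counts):
    # Runs on the mapper node: pre-sums the counts for one word across all
    # documents handled by that node, reducing the data sent to the Reducer.
    yield word, sum(partial_counts)

def run_mapper_node(docs):
    # Local aggregation across all documents handled by this mapper node.
    local = defaultdict(list)
    for doc_id, text in docs.items():
        for word, count in map_word_count_per_document(doc_id, text):
            local[word].append(count)
    for word, counts in local.items():
        yield from combine_word_count(word, counts)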
Application:
Log analysis, data querying
Collating
Problem Statement :
There are a number of entries, each with several attributes. You need to store all entries that share the same attribute value in one file, or otherwise group the entries by attribute value. The most typical application is an inverted index.
Solution:
The solution is straightforward. In the Mapper, emit the required attribute value of each entry as the key and the entry itself as the value. The Reducer receives the entries grouped by attribute value and can then process or save them. If you are building an inverted index, each entry is a word and the attribute value is the ID of the document in which the word occurs.
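A minimal sketch of an inverted index built this way (the document contents and names are illustrative):

from collections import defaultdict

def map_invert(doc_id, text):
    # Key = the word (the attribute we group by), value = the document ID.
    for word in set(text.split()):
        yield word, doc_id

def reduce_invert(word, doc_ids):
    # All document IDs containing this word arrive grouped together.
    yield word, sorted(doc_ids)

docs = {"d1": "big data tools", "d2": "big ideas"}
groups = defaultdict(list)
for doc_id, text in docs.items():
    for word, value in map_invert(doc_id, text):
        groups[word].append(value)
index = {w: ids for word, values in groups.items() for w, ids in reduce_invert(word, values)}
# index == {'big': ['d1', 'd2'], 'data': ['d1'], 'tools': ['d1'], 'ideas': ['d2']}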
Application:
Inverted index, ETL
Filtering (Grepping), Parsing, and Validation
Problem Statement:
Suppose you have a set of records and you need to find all records that satisfy some condition, or transform each record into another representation (the transformation is applied to each record independently, i.e., processing one record does not depend on any other record). Text parsing, extraction of specific values, and format conversion all belong to the latter case.
Solution:
Very simple: the Mapper takes records one by one and emits the accepted records or their transformed versions.
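A sketch of a map-only filtering-and-parsing job; the record format and the predicate below are assumptions made for illustration:

def map_filter_and_parse(line_number, line):
    # Keep only records that satisfy the condition, emitting a parsed form.
    fields = line.strip().split(",")
    if len(fields) == 3 and fields[2] == "ERROR":
        yield fields[0], {"timestamp": fields[1], "level": fields[2]}
    # Records that fail the condition are simply not emitted; no Reducer is needed.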
Application:
Log analysis, data query, ETL, data check
Distributed Task execution
Problem Statement:
A large computation can be divided into multiple parts, and the partial results are then merged to obtain the final result.
Solution: Split the data into multiple parts to serve as input for the Mappers. Each Mapper processes one part, performing the same computation and producing a partial result; the Reducer combines the partial results from all Mappers into the final result.
Case study: Digital communication system simulation
Digital communication simulation software, such as a WiMAX system simulator, transmits a large volume of random data through the system model and computes the probability of transmission errors. Each Mapper processes 1/N of the sample data and calculates the error rate for its portion; the Reducer then computes the average error rate.
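A hedged sketch of the aggregation structure only; the per-bit channel model below is a random placeholder, not the actual WiMAX simulation:

import random

def simulate_bit(bit):
    # Placeholder channel model: flips the bit with 1% probability.
    return bit if random.random() > 0.01 else 1 - bit

def map_simulate(chunk_id, samples):
    # Each Mapper runs the system model over its 1/N share of the samples
    # and reports (errors, total) for that share.
    errors = sum(1 for bit in samples if simulate_bit(bit) != bit)
    yield "stats", (errors, len(samples))

def reduce_error_rate(key, partial_results):
    # Combine the per-Mapper counts into one overall error probability.
    errors = sum(e for e, n in partial_results)
    total = sum(n for e, n in partial_results)
    yield "error_rate", errors / total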
Application:
Physical and engineering simulation, numerical analysis, performance testing
Sort
Problem Statement:
There are a number of records and you need to sort them by some rule, or process them in a certain order.
Solution: Simple sorting is easy: the Mappers emit the attribute value to sort on as the key and the whole record as the value. Sorting in real applications is a bit more ingenious, however, which is why sorting is said to be the heart of MapReduce (and Hadoop). In practice, composite keys are commonly used to achieve secondary sorting and grouping, as in the sketch below.
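A minimal illustration of the composite-key idea in plain Python; it only mimics what Hadoop's shuffle does (the partitioner and grouping-comparator details are Hadoop-specific and omitted):

from itertools import groupby

# Records: (user_id, timestamp, payload). We want records grouped by user,
# with each user's records already ordered by timestamp ("secondary sort").
records = [("u2", 17, "b"), ("u1", 5, "a"), ("u1", 3, "c"), ("u2", 2, "d")]

# The Mapper would emit a composite key (user_id, timestamp); the framework
# sorts by the full key but groups reducer input by user_id only.
sorted_records = sorted(records, key=lambda r: (r[0], r[1]))
for user, group in groupby(sorted_records, key=lambda r: r[0]):
    print(user, [(ts, payload) for _, ts, payload in group])
# u1 [(3, 'c'), (5, 'a')]
# u2 [(2, 'd'), (17, 'b')]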
MapReduce initially only has the ability to sort by key, but there are also techniques that leverage Hadoop's features to sort by value. If you want to know more, you can read this blog.
It is worth noting, following the BigTable concepts, that MapReduce is better used to sort the initial data rather than the intermediate data, that is, to keep the data in a continuously sorted state. In other words, it can be more efficient to sort the data once at insertion time than to sort it at every query.
Application:
ETL, data analysis
Non-Basic MapReduce Patterns: Iterative Message Passing (Graph Processing)
Problem Statement:
Suppose there is a network of entities with relationships between them. Each entity needs to compute a state based on the properties of the other entities in its neighborhood. The state can be the distance to other nodes, an indication that a neighboring node with a certain property exists, a characteristic of neighborhood density, and so on.
Solution:
The network is stored as a set of nodes, where each node contains a list of the IDs of its adjacent nodes. Conceptually, MapReduce jobs are run iteratively; in each iteration, every node sends a message to its neighbors, and each neighbor updates its state based on the messages it receives. Iteration stops when some condition is met, such as reaching the maximum number of iterations (the network radius) or negligible state change between two consecutive iterations. Technically, the Mapper emits messages keyed by the ID of the destination node, all messages are grouped by the receiving node, and the Reducer recomputes each node's state and rewrites the nodes whose state changed. The algorithm is shown below:
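A minimal Python sketch of one iteration of this framework; getMessage and calculateState are the problem-specific hooks that the case studies below define (the defaults here match the availability-propagation case study):

def get_message(node):
    # Problem-specific; this default matches the category-tree case study below.
    return node["state"]

def calculate_state(state, messages):
    # Problem-specific; take the "strongest" of the old state and the messages.
    return max([state] + messages)

def map_propagate(node_id, node):
    # Re-emit the node itself (so its adjacency list survives the iteration)
    # and send a message to every neighbor.
    yield node_id, ("node", node)
    for neighbor_id in node["neighbors"]:
        yield neighbor_id, ("message", get_message(node))

def reduce_propagate(node_id, values):
    # Collect the node object and all incoming messages, then recompute the state.
    node, messages = None, []
    for kind, value in values:
        if kind == "node":
            node = value
        else:
            messages.append(value)
    node["state"] = calculate_state(node["state"], messages)
    yield node_id, node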
The state of a node can spread quickly across the whole network; nodes that have been "infected" go on to infect their neighbors, and the whole process looks like this:
Case study: Availability propagation through a tree of categories
Problem Statement:
This problem comes from a real e-commerce application. Goods are organized into categories that form a tree: larger categories (such as Men, Women, Children) are divided into smaller ones (such as Men's Trousers or Women's Clothing), down to leaf categories that have no subcategories (such as Men's Blue Jeans). A leaf category is either available (it contains at least one product) or not available (it contains no products). A non-leaf category is considered available if it contains at least one available subcategory. The goal is to determine all available categories in the tree, given the availability of the leaf categories.
Solution:
This problem can be solved with the framework described in the previous section. We define the getMessage and calculateState methods as follows:
class N
    State in {True = 2, False = 1, null = 0},
    initialized 1 or 2 for end-of-line categories, 0 otherwise

method getMessage(object N)
    return N.State

method calculateState(state s, data [d1, d2,...])
    return max( [d1, d2,...] )
Case study: Breadth-First search
Problem Statement: You need to calculate the distance (number of hops) from one source node to all other nodes in a graph.
Solution: The source node sends a signal with value 0 to all its adjacent nodes; the adjacent nodes forward the received signal on to their own neighbors, incrementing the signal value by 1 at each forwarding step:
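Under the same framework, the breadth-first-search hooks could look like this (a sketch; infinity stands for "not reached yet"):

INFINITY = float("inf")  # initial state of every node except the source (whose state is 0)

def get_message(node):
    # A reached node tells its neighbors that they are one hop further away.
    return node["state"] + 1

def calculate_state(state, messages):
    # Keep the shortest distance seen so far.
    return min([state] + messages)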
Case study: PageRank and Mapper-side data aggregation
This algorithm was proposed by Google; it uses the well-known PageRank approach to compute the relevance of a web page from the pages that link to it. The real algorithm is quite complex, but the core idea is that weights can be propagated: the weight of a node is computed from the weights of the nodes connected to it.
Using a single explicit value as the score is actually a simplification; in practice it pays to perform aggregation on the Mapper side to compute these contributions. The following snippet shows the changed logic (for PageRank):
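A hedged sketch of a Mapper that aggregates the rank contributions locally before emitting them (an "in-mapper combining" variant; the class layout mirrors Hadoop's setup/map/cleanup life cycle, but the names are illustrative):

from collections import defaultdict

class PageRankMapper:
    def __init__(self):
        # Accumulate contributions per target node locally instead of emitting
        # one tiny record per outgoing link.
        self.contributions = defaultdict(float)

    def map(self, node_id, node):
        share = node["rank"] / len(node["out_links"])
        yield node_id, ("node", node)            # pass the graph structure through
        for target in node["out_links"]:
            self.contributions[target] += share  # aggregate on the Mapper side

    def close(self):
        # Emit the locally aggregated contributions once, at the end of the split.
        for target, total in self.contributions.items():
            yield target, ("rank_share", total)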
Application:
Graph analysis, web indexing
Counting Distinct Values (Unique Items Counting)
Problem Statement: Records contain a value field F and a category field G. For each value of G, count the number of distinct values of F among the records in that G group.
This problem has applications in faceted search (which some e-commerce sites call narrowed search).
Solution I:
The first approach solves the problem in two phases. In the first phase, the Mapper emits composite (F, G) pairs as keys, and the Reducer emits each distinct pair once, which guarantees the uniqueness of F values within each G. In the second phase, the pairs are grouped by G and the number of items in each group is counted.
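A minimal sketch of the two phases in plain Python (field names F and G follow the problem statement; the driver and shuffle are omitted):

def phase1_map(record):
    g, f = record["G"], record["F"]
    yield (f, g), None            # composite key: one group per distinct (F, G) pair

def phase1_reduce(composite_key, _values):
    yield composite_key, None     # emit each distinct (F, G) pair exactly once

def phase2_map(composite_key, _value):
    f, g = composite_key
    yield g, 1                    # one distinct F value counted for this G

def phase2_reduce(g, ones):
    yield g, sum(ones)            # number of distinct F values per G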
Solution II:
The second approach requires only one MapReduce job, but it does not scale as well. The algorithm is simple: the Mapper emits each value together with its categories; the Reducer deduplicates the categories of each value and increments a per-category counter by 1 for every category the value belongs to; after all values have been processed, the Reducer emits the total counts. This approach is applicable only when the number of categories is limited and the number of records with the same F value is not too large, for example web log processing combined with user classification: the total number of users is high, but the number of events per user is limited, and the number of categories obtained this way is also limited. It is worth mentioning that a Combiner can be used in this scheme to remove duplicate (value, category) pairs before the data is transferred to the Reducer.
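A sketch of the single-job variant; the Reducer keeps an in-memory table of per-category counts, which is why it only works for a limited number of categories (names are illustrative):

from collections import defaultdict

def map_value_categories(record):
    f, categories = record["F"], record["G_list"]
    for g in categories:
        yield f, g                    # key = the value F, payload = one of its categories

class DistinctCountReducer:
    def __init__(self):
        # Per-category counters kept in memory for the whole Reducer lifetime.
        self.per_category = defaultdict(int)

    def reduce(self, f, categories):
        for g in set(categories):     # deduplicate the categories of this value
            self.per_category[g] += 1

    def close(self):
        for g, count in self.per_category.items():
            yield g, count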
Application:
Log analysis, unique user counting
Cross-correlation
Problem Statement: There is a set of groups, each consisting of several items. Count the number of times each pair of items co-occurs within a group. If the number of items is N, then N*N values must be computed.
This is common in text analysis (where the items are words and the groups are sentences) and in market analysis (customers who buy one item may also buy another). If N*N is small enough to fit in the memory of a single machine, the implementation is straightforward.
Pairs Approach
The first approach is to emit every pair of items in the Mapper and then sum the counts for the same pair in the Reducer (see the sketch after this list). The drawbacks of this approach are:
- The benefit of using Combiners is limited, since it is likely that most item pairs are unique.
- In-memory accumulation is not possible.
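A sketch of the pairs approach (ordered pairs, so both (a, b) and (b, a) are counted, matching the N*N statement above):

from itertools import permutations

def map_pairs(group_id, items):
    # Emit a counter of 1 for every ordered pair of distinct items in the group.
    for a, b in permutations(set(items), 2):
        yield (a, b), 1

def reduce_pairs(pair, counts):
    yield pair, sum(counts)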
Stripes Approach
The second approach groups the data by the first item of each pair and maintains an associative array (a "stripe") in which the counters for all co-occurring items are accumulated. The Reducer receives all stripes for a leading item, merges them, and emits the same result as the pairs approach (see the sketch after this list). This approach:
- produces a smaller number of intermediate keys, reducing sorting overhead;
- can make effective use of Combiners;
- allows in-memory accumulation, which can cause problems (running out of memory) if not done carefully;
- is more complex to implement;
- is, in general, faster than the pairs approach.
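A sketch of the stripes approach using a Counter as the associative array:

from collections import Counter

def map_stripes(group_id, items):
    items = set(items)
    for a in items:
        # One "stripe" per leading item: counts of every item co-occurring with it.
        yield a, Counter(b for b in items if b != a)

def reduce_stripes(item, stripes):
    merged = Counter()
    for stripe in stripes:
        merged.update(stripe)          # element-wise merge of the stripes
    for other, count in merged.items():
        yield (item, other), count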
Application:
Text analysis, market analysis
References:
- Lin, J., Dyer, C. Data-Intensive Text Processing with MapReduce
Expressing relational patterns with MapReduce
In this section we discuss how to express the main relational operators with MapReduce.
Selection
Projection
Projection is only slightly more complex than selection: in this case the Reducer can be used to eliminate possible duplicates.
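A sketch of projection with duplicate elimination in the Reducer; the projected fields are an illustrative assumption:

def map_project(rowkey, record):
    # Keep only the projected fields; the projected tuple itself becomes the key.
    yield (record["user_id"], record["country"]), None

def reduce_project(projected_tuple, _values):
    # Identical projected tuples arrive in the same reduce call,
    # so emitting once per key removes duplicates.
    yield projected_tuple, None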
Union
Mappers are fed all records from the two sets to be united; the Reducer is used to eliminate duplicates.
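A sketch of the union; the record itself (assumed hashable, e.g. a tuple) is used as the key so duplicates collapse:

def map_union(rowkey, record):
    yield record, None          # records from both input sets go through the same Mapper

def reduce_union(record, _values):
    yield record, None          # a record present in both sets is still emitted only once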
Intersection
Mappers are fed all records from the two sets to be intersected; the Reducer emits only the records that occurred twice. This works only because each record has a primary key and appears at most once in each set.
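A sketch of the intersection; it relies on each record (assumed hashable) appearing at most once per input set:

def map_intersect(rowkey, record):
    yield record, 1

def reduce_intersect(record, counts):
    # Seen exactly twice = present once in each of the two sets.
    if sum(counts) == 2:
        yield record, None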
Difference
Suppose there are two sets of records, R and S, and we want to compute the set difference R − S. The Mapper tags each tuple with the name of the set it comes from (R or S); the Reducer emits only those records that are present in R but not in S.
class Mapper
    method Map(rowkey key, tuple t)
        Emit(tuple t, string t.SetName)    // t.SetName is either 'R' or 'S'

class Reducer
    method Reduce(tuple t, array n)        // array n can be ['R'], ['S'], ['R', 'S'], or ['S', 'R']
        if n.size() = 1 and n[1] = 'R'
            Emit(tuple t, null)
Group Aggregation (GroupBy and Aggregation)
Grouping and aggregation can be done in a single MapReduce job as follows. The Mapper extracts the grouping key and the value to aggregate from each record; the Reducer then aggregates the values it receives for each key. Typical aggregations such as sum or maximum can be computed in a streaming fashion and therefore do not require all values to be kept in memory at the same time. Other cases, however, require two-phase MapReduce; the unique-values pattern discussed above is an example of this type.
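A sketch of a streamable group-by aggregation (sum); the field names are illustrative:

def map_group(rowkey, record):
    # The grouping field becomes the key; the value to aggregate is the payload.
    yield record["group"], record["amount"]

def reduce_sum(group, amounts):
    total = 0
    for amount in amounts:      # a running sum never needs all values in memory at once
        total += amount
    yield group, total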
Joining
The MapReduce framework can handle joins well, but there are several techniques to choose from depending on data volumes and processing-efficiency requirements. In this section we describe the basic approaches; the reference articles listed below study this topic in depth.
Repartition Join (Reduce-side join, sort-merge join)
This algorithm joins the datasets R and L on a key k. The Mapper iterates over all tuples of R and L and emits each tuple with key k, tagged to indicate whether it came from R or L. The Reducer splits the data for one key k into two containers (one for R, one for L) and then loops over both containers to produce the join; each emitted result contains the key k, the data from R, and the data from L. This approach has the following drawbacks:
- The Mapper has to emit all of the data, even for keys that appear in only one of the sets.
- The Reducer has to keep all the data for one key in memory; if the data does not fit in memory, it has to be cached to disk, which increases disk I/O.
Nevertheless, the repartition join is still the most common approach, especially when none of the other optimization techniques is applicable:
class Mapper
    method Map(null, tuple [join_key k, value v1, value v2,...])
        Emit(join_key k, tagged_tuple [set_name tag, values [v1, v2, ...] ] )

class Reducer
    method Reduce(join_key k, tagged_tuples [t1, t2,...])
        H = new AssociativeArray : set_name -> values
        for all tagged_tuple t in [t1, t2,...]      // separate values into 2 arrays
            H{t.tag}.add(t.values)
        for all values r in H{'R'}                  // produce a cross-join of the two arrays
            for all values l in H{'L'}
                Emit(null, [k r l] )
Replicated Join (Map-side join, hash join)
In practical applications it is common to join a small dataset with a large one (for example, a user list with a log of user events). Suppose we want to join two sets, R and L, where R is relatively small. R can be distributed to all Mappers, and each Mapper loads it and indexes it by the join key; the most common and efficient indexing technique is a hash table. The Mapper then iterates over L and joins each record with the corresponding records of R stored in the hash table. This approach is very effective because there is no need to sort set L or to transmit it over the network, but R must be small enough to be distributed to all Mappers.
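A sketch of the replicated (map-side hash) join; small_set_r stands for the replicated copy of R that every Mapper receives, for example via Hadoop's distributed cache (an assumption made here for illustration):

from collections import defaultdict

class ReplicatedJoinMapper:
    def __init__(self, small_set_r):
        # R is small enough to ship to every Mapper; index it by the join key.
        self.index = defaultdict(list)
        for key, r_value in small_set_r:
            self.index[key].append(r_value)

    def map(self, rowkey, l_record):
        key, l_value = l_record
        # Probe the in-memory hash table of R for every record of L; no sorting
        # or shuffling of L and no Reducer is required.
        for r_value in self.index.get(key, []):
            yield key, (r_value, l_value)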
References:
- Join algorithms using Map/reduce
- Optimizing Joins in a MapReduce environment
MapReduce algorithms for machine learning and math
- C. T. Chu et al. give an excellent description of machine learning algorithms for MapReduce in the article "Map-Reduce for Machine Learning on Multicore".
- FFT using MapReduce: http://www.slideshare.net/hortonworks/large-scale-math-with-hadoop-mapreduce
- MapReduce for integer factorization: http://www.javiertordable.com/files/mapreduceforintegerfactorization.pdf
- Matrix multiplication with MapReduce: http://csl.skku.edu/papers/cs-tr-2010-330.pdf and http://www.norstad.org/matrix-multiply/index.html