Table of contents for these book notes: http://www.cnblogs.com/mdyang/archive/2011/06/29/data-intensive-text-prcessing-with-mapreduce-contents.html
Currently, the most effective way to process large-scale data is "divide and conquer".
Divide and conquer: break a large problem into several smaller, relatively independent sub-problems and solve them separately. Because the sub-problems are relatively independent, they can be processed concurrently or in parallel, whether with multiple threads, multiple processes, multiple cores, or multiple processors (a cluster).
How the problem is divided depends on the application scenario. Considerations include, but are not limited to:
· How should the problem be broken into sub-tasks?
· How should sub-tasks be assigned to workers (a worker may be a thread, a process, a processor core, or a processor/machine)?
· How should the workers be synchronized?
· How should intermediate results be shared among workers (for example, worker A needs an intermediate result produced by worker B)?
· How should software/hardware failures be handled?
In the past, programmers had to handle all of these issues themselves. For example, in shared-memory concurrent/parallel programs, the programmer must use synchronization primitives such as mutexes to explicitly control access to shared data structures, and must avoid deadlocks and race conditions. Parallel programming frameworks such as OpenMP (shared-memory parallel programming) and MPI (the Message Passing Interface) package the low-level synchronization and communication mechanisms of parallel programming as libraries, but even so, programmers still have to deal with many low-level details that are irrelevant to the problem being solved.
MapReduce's biggest advantage is that it provides an abstraction at the system level, hiding most of the underlying details from the programmer. The programmer only needs to think about how to solve the problem itself, without worrying about the many unrelated details (communication, synchronization, and so on) that parallel programming usually brings.
In addition, MapReduce distributes the dataset across the nodes and runs the processing code directly on the nodes that hold the data, which improves efficiency.
This chapter mainly introduces the MapReduce programming model and the distributed file system.
Section 2.1 introduces functional programming (FP), which inspired the design of MapReduce;
Section 2.2 describes mappers, reducers, and the basic MapReduce programming model;
Section 2.3 discusses the role of the execution framework in running MapReduce programs (jobs);
Section 2.4 covers partitioners and combiners;
Section 2.5 covers the distributed file system;
Section 2.6 covers the Hadoop cluster architecture;
Section 2.7 is a summary.
2.1 Functional Programming (FP)
Basic concepts of FP
Figure 2.1 The built-in higher-order functions map and fold
The design of MapReduce is inspired by FP. FP languages include Lisp and ML. An important feature of FP is support for higher-order functions (functions that accept other functions as arguments). The two most common built-in higher-order functions are map and fold (see Figure 2.1). Map accepts a function f as an argument and applies f to every element of an input list L. Fold accepts two arguments, a function g and an initial value i: it applies g to the initial value and the first element of L, stores the result in an intermediate variable, then applies g to the intermediate variable and the next element, and so on; the loop ends when the last element has been processed, and the intermediate result is the output. Pseudocode for map and fold:

map(f):     for each e in L:  e ← f(e)
fold(g, i): for each e in L:  i ← g(e, i)
Example: the sum of squares of all elements of L can be computed by defining f(x) = x² (i.e., λx.x²) and g(x, y) = x + y (i.e., λx.λy.x + y). The computation can then be expressed as map(λx.x²) over L, followed by fold(λx.λy.x + y, i) with the initial value i = 0.
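As a concrete (single-machine, non-parallel) illustration, the same sum-of-squares computation can be written with Python's built-in map and functools.reduce, Python's counterpart of fold:

from functools import reduce

# f: the function map applies to every element of the list
def square(x):
    return x * x

# g: the function fold uses to combine the accumulator with each element
def add(acc, x):
    return acc + x

L = [1, 2, 3, 4]

# map transforms every element of L with f ...
squares = map(square, L)

# ... and fold (reduce in Python) combines them, starting from the initial value 0
total = reduce(add, squares, 0)

print(total)  # 1 + 4 + 9 + 16 = 30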
Map can be seen as a transformation of the dataset (with the transformation defined by f); fold can be seen as an aggregation of the data (with the aggregation defined by g).
Intuitively, the operation performed by map (applying f to every element of L) is easy to parallelize. Fold, by contrast, places some requirements on data locality: before g can be executed, the elements of L must first be brought together. In practice, however, many fold functions do not need access to all elements of L at once (for example, a fold may only need the intermediate variable and the current element of L), so fold can often be parallelized as well.
Relationship between FP and MapReduce
The description above essentially summarizes how MapReduce works: the execution of the map function in FP corresponds to the map phase of MapReduce, and fold corresponds to the reduce phase.
MapReduce defines a two-step method for processing large-scale data:
1) execute a user-defined map function to generate intermediate results;
2) process the intermediate results with a user-defined reduce function.
What the programmer needs to do is define the map and reduce functions, analogous to supplying f to map and g to fold.
The MapReduce execution framework takes care of the actual execution.
MapReduce may look simple and functionally quite limited. However, if a complex algorithm can be decomposed into a sequence of map-reduce steps, it is still possible to design complex parallel algorithms with MapReduce (this is covered in later chapters).
What exactly is MapReduce?
To be precise, MapReduce refers to three interrelated but distinct concepts:
1) the MapReduce programming model;
2) the MapReduce execution framework (also called the runtime environment, i.e., the runtime);
3) MapReduce implementations (Google's own implementation, and the Java-based open-source implementation Hadoop).
Besides Hadoop there are many other MapReduce implementations (for example, implementations for GPGPUs and for the Cell architecture). Hadoop's implementation differs in some ways from Google's original design of MapReduce; these differences are explained later. This book uses Hadoop as the reference implementation, because Hadoop is currently the most mature open-source MapReduce implementation and has the largest developer community.
2.2 Mappers and Reducers
Data Structure
The basic data structure in MapReduce is the key-value pair. Keys and values may be primitive types such as integers, floating-point numbers, strings, or byte arrays, or more complex types (lists, tuples, associative arrays, and so on). Complex custom structures can be defined with libraries such as Protocol Buffers, Thrift, and Avro.
Many datasets can be described as collections of key-value pairs. For example:
1) a web-page dataset: key = page URL, value = HTML source of the page;
2) a graph: key = node ID, value = list of adjacent nodes.
In some cases the key is not needed at all; in others the key serves as an ID that identifies the data item. The details are discussed later, in Chapter 3.
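For example, the two datasets above might be represented as key-value pairs roughly like this (the concrete values are invented for illustration):

# 1) A web-page dataset: key = page URL, value = HTML source of that page
pages = [
    ("http://example.com/index.html", "<html><body>hello</body></html>"),
    ("http://example.com/about.html", "<html><body>about</body></html>"),
]

# 2) A graph: key = node ID, value = list of adjacent node IDs
graph = [
    ("n1", ["n2", "n3"]),
    ("n2", ["n3"]),
    ("n3", []),
]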
Defining mappers and reducers
In MapReduce, the programmer needs to define a mapper and a reducer:
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]
(This book uses [...] to denote a list.)
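As a rough sketch, these signatures could be written with Python type hints as follows; the type-variable names follow the notation above, and this is only illustrative, not Hadoop's actual API:

from typing import Callable, Iterable, TypeVar

K1 = TypeVar("K1"); V1 = TypeVar("V1")
K2 = TypeVar("K2"); V2 = TypeVar("V2")
K3 = TypeVar("K3"); V3 = TypeVar("V3")

# map: (k1, v1) -> [(k2, v2)]
Mapper = Callable[[K1, V1], Iterable[tuple[K2, V2]]]

# reduce: (k2, [v2]) -> [(k3, v3)]
Reducer = Callable[[K2, Iterable[V2]], Iterable[tuple[K3, V3]]]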
The input to a MapReduce program is usually key-value data stored in the underlying distributed file system (see Section 2.5).
MapReduce execution process
The MapReduce execution process can be briefly described as follows:
1) the input data is fed to the mappers, one pair at a time, producing a list of new key-value pairs (the intermediate results);
1.5) the intermediate results are grouped by key and sorted by key (this step is performed implicitly by the framework);
2) each group is fed to the corresponding reducer for processing, and the output is written to the file system;
(1) a reducer processes its groups in key order, and the reducers run in parallel;
(2) R reducers produce R output files.
Usually there is no need to merge these R files, because they often serve as the input to the next MapReduce program.
Figure 2.2 demonstrates the two steps.
Figure 2.2 Simplified view of the MapReduce computation
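To make the data flow concrete, here is a minimal single-machine Python sketch of the steps above (map, implicit group-and-sort by key, reduce). It only imitates the data flow, not how Hadoop actually executes jobs, and the helper name simulate_mapreduce is my own:

from itertools import groupby
from operator import itemgetter

def simulate_mapreduce(records, mapper, reducer):
    # Step 1: apply the mapper to every input (key, value) pair
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))
    # Step 1.5 (implicit): group the intermediate pairs by key, in key order
    intermediate.sort(key=itemgetter(0))
    # Step 2: apply the reducer to each (key, list of values) group
    output = []
    for k, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        output.extend(reducer(k, values))
    return output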
A simple example
The pseudocode in Figure 2.3 counts the number of occurrences of each word in a document collection.

class Mapper
    method Map(docid a, doc d)
        for all term t ∈ doc d do
            Emit(term t, count 1)

class Reducer
    method Reduce(term t, counts [c1, c2, ...])
        sum ← 0
        for all count c ∈ counts [c1, c2, ...] do
            sum ← sum + c
        Emit(term t, count sum)

Figure 2.3 Pseudocode for word count using MapReduce
In Figure 2.3, the input key-value pairs are (docid, doc) pairs stored in the distributed file system, where docid is an ID that uniquely identifies a document and doc is the document body. A mapper takes one (docid, doc) pair, tokenizes the document text, and emits a key-value pair (word, 1) for each word: the key is the word itself, and the value is 1 (i.e., one count). The (word, 1) pairs are then grouped and partitioned and sent to the reducers. A reducer adds up the counts of all (word, 1) pairs that share the same word, and writes the final, sorted results to a file.
The MapReduce execution environment guarantees that all key-value pairs with the same key (here, the intermediate results produced by the mappers; in Figure 2.3, the (word, 1) pairs with the same word) are delivered to the same reducer.
The grouping and partitioning of intermediate results is done by the partitioner, which is described in Section 2.4.
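The pseudocode of Figure 2.3 translates almost directly into Python. Reusing the simulate_mapreduce helper sketched above, a toy run on two made-up documents looks like this:

def wordcount_mapper(docid, doc):
    # Emit (term, 1) for every token in the document body
    for term in doc.split():
        yield (term, 1)

def wordcount_reducer(term, counts):
    # Sum the partial counts for one term
    yield (term, sum(counts))

docs = [("doc1", "the quick fox"),
        ("doc2", "the lazy dog")]

print(simulate_mapreduce(docs, wordcount_mapper, wordcount_reducer))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]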
Hadoop and its differences from Google's MapReduce
The algorithm pseudocode in this book is written against Hadoop, so it reflects some characteristics of Hadoop's MapReduce implementation.
In Hadoop, a mapper/reducer is an object that implements the map/reduce method.
In the map phase, a Mapper object is created for each map task, and the execution framework applies the map method defined by that Mapper to each input data item. The programmer can suggest the number of map tasks, but the execution framework makes the final decision (based on the physical layout of the input data). Sections 2.5 and 2.6 cover this in detail.
The reduce phase is similar: a Reducer object is created for each reduce task, and the framework applies the reduce method to each intermediate-result group received by that reducer. Unlike the map phase, the programmer has full control over the number of reduce tasks. This is detailed in the discussion of task execution in Section 2.6 (which assumes you have read Section 2.5 first).
Hadoop's implementation differs from Google's own MapReduce implementation in the following ways:
1) In Hadoop, a reducer receives a key together with an iterator over a list holding the values of all key-value pairs whose key equals the current key, and this list is unordered. For example, if (k1, v1), (k1, v2), (k1, v3), (k1, v1) are all assigned to reducer 1, then reducer 1 actually receives (k1, iterator over the list [v1, v2, v3, v1]). Google's implementation additionally lets the programmer request a secondary sort, i.e., sorting by value (all key-value pairs are already sorted by key before partitioning; a secondary sort then also sorts the values within each key group). With a secondary sort, the data received by the reducer is ordered not only by key, but the value list for each key is ordered as well. How to implement a secondary sort in Hadoop is described in Section 3.4;
2) In Google's implementation, the reducer may not change the input key; the reducer's output key must be the same as its input key (for example, given the inputs ("a", 1), ("the", 1), ("the", 1), the reducer can only output ("a", 1) and ("the", 2)). Hadoop has no such restriction: the output key can be anything.
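A small sketch of difference 1: without a secondary sort, a reducer that needs its values in order has to buffer and sort them itself, whereas a secondary sort would deliver them already ordered. This only illustrates the contract, not Hadoop's mechanism (a Hadoop technique is described in Section 3.4):

# Values for one key, in the arbitrary order a Hadoop reducer might see them
key, values = "k1", ["v3", "v1", "v2", "v1"]

# Without a secondary sort, the reducer must buffer and sort the values itself
print(key, sorted(values))  # k1 ['v1', 'v1', 'v2', 'v3']

# With a secondary sort (as Google's implementation allows), the framework
# would already deliver the values for each key in this sorted order.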
Notes
What constraints apply to mappers and reducers?
Be careful when a mapper or reducer uses an external resource (such as an SQL database), because contention for the shared resource may become a performance bottleneck.
Flexible uses of MapReduce
1) A MapReduce program with no reducer (equivalent to a do-nothing reducer that simply emits its input unchanged). The output of such a program is just the mappers' output. Example scenarios: parsing a large text collection (a collection of many documents) or processing a large image collection (a collection of many images), where each item can be processed independently and the results need not be merged (see the sketch after this list);
2) Conversely, a MapReduce program with no mapper makes no sense.
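A minimal sketch of case 1: with no reducer, the mapper output is written out directly (in Hadoop this corresponds to configuring a job with zero reduce tasks). The documents and the parse_mapper function here are made up for illustration:

def parse_mapper(docid, doc):
    # Each document is processed independently; upper-casing stands in for
    # whatever real per-document parsing is needed
    yield (docid, doc.upper())

docs = [("doc1", "some raw text"),
        ("doc2", "more raw text")]

# No reducer: the mappers' output is emitted as-is
for docid, doc in docs:
    for key, value in parse_mapper(docid, doc):
        print(key, value)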
Other data storage
In general, MapReduce reads its input from the distributed file system and writes its results back to files there, but it is not limited to this:
1) Bigtable, Google's distributed sparse table store; HBase is an open-source implementation of Bigtable (and can be integrated with Hadoop);
2) MPP relational databases (massively parallel processing relational databases);
3) some computations need no input at all (for example, computing π).