Reposted from: http://blog.csdn.net/yangbutao/article/details/8331937
The entire processing pipeline consists of parsing (building an abstract syntax tree, AST, with ANTLR), semantic analysis (the SemanticAnalyzer generates query blocks), logical plan generation (the operator tree), logical plan optimization, physical plan generation (the task tree), and physical plan execution.
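To make the ordering of these phases concrete, here is a condensed sketch loosely based on Hive's Driver.compile(); the exact class and method signatures vary across Hive versions, so treat it as pseudocode for the sequence of phases rather than a definitive implementation.

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Context;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer;
import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.SemanticAnalyzerFactory;

public class CompileSketch {
    public static void compile(String query, HiveConf conf) throws Exception {
        Context ctx = new Context(conf);
        ParseDriver pd = new ParseDriver();
        ASTNode ast = pd.parse(query, ctx);        // ANTLR: query text -> AST
        BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(conf, ast);
        sem.analyze(ast, ctx);  // semantic analysis, operator tree generation,
                                // logical optimization, physical plan generation
        // sem.getRootTasks() now holds the root tasks of the physical plan
    }
}
```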
The following figure (original author unknown) gives an overview of this process.
The emphasis here is on physical plan generation and execution.
The physical plan is generated from the logical operator tree. The plan is executed as Task objects: each Task holds a Work object, and the Work is the description of that part of the physical plan.
The main Work classes are FetchWork, MoveWork, MapredWork, CopyWork, DDLWork, FunctionWork, ExplainWork, and ConditionalWork.
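The Task/Work split can be modeled as follows. The classes below are a self-contained toy model for illustration only, not Hive's actual classes:

```java
import java.io.Serializable;

// Toy model of the Task/Work split (illustrative, not Hive's real classes).
abstract class Work implements Serializable {}     // describes one plan step

class ToyMapredWork extends Work {                 // e.g. a map/reduce stage
    String planDescription;
}

abstract class Task<T extends Work> {
    protected final T work;                        // the description to execute
    Task(T work) { this.work = work; }
    public abstract int execute();                 // 0 on success, like Hive
}
```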
Executing the physical plan amounts to invoking the execute() method of each task.
The main Task classes are FetchTask, ConditionalTask, CopyTask, DDLTask, ExplainTask, MapRedTask, and MoveTask.
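Continuing the toy model above, a driver loop over the tasks might look like this; Hive actually schedules tasks along a dependency graph, so the flat list here is a simplification:

```java
import java.util.List;

// Toy driver loop: run each task of the physical plan in order and stop
// at the first failure. (Hive schedules tasks by their dependencies.)
class PlanRunner {
    static int runPlan(List<Task<?>> tasks) {
        for (Task<?> t : tasks) {
            int rc = t.execute();
            if (rc != 0) {
                return rc;       // abort the plan on a failing task
            }
        }
        return 0;
    }
}
```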
MapRedTask implements the MapReduce client side: based on its Work description (MapredWork), it generates a plan XML file, which is passed as a command parameter of hadoop jar [params] to the MapReduce job for execution (ExecMapper, ExecReducer).
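Conceptually, the submission path looks like the sketch below: serialize the MapredWork to an XML plan file (Hive of this era used java.beans.XMLEncoder for plan serialization) and launch a hadoop jar child process. The jar name and the exact driver arguments here are illustrative; the real command line is assembled by Hive and depends on the version.

```java
import java.beans.XMLEncoder;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Sketch of MapRedTask's submission path (paths/arguments are illustrative).
class MapRedSubmitSketch {
    static void submit(Object mapredWork) throws IOException {
        File plan = File.createTempFile("hive-plan", ".xml");
        try (XMLEncoder enc = new XMLEncoder(new FileOutputStream(plan))) {
            enc.writeObject(mapredWork);    // serialize the Work description
        }
        // Hand the plan to a child JVM via "hadoop jar [params]".
        ProcessBuilder pb = new ProcessBuilder(
            "hadoop", "jar", "hive-exec.jar",
            "org.apache.hadoop.hive.ql.exec.ExecDriver",
            "-plan", plan.getAbsolutePath());
        pb.inheritIO();
        pb.start();
    }
}
```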
The following diagram illustrates how data is processed inside the MapReduce job:
FileFormat: when defining a table you specify the data storage format (STORED AS), such as TEXTFILE, SEQUENCEFILE, or RCFILE, and you can also plug in a custom storage format (via the ROW FORMAT / STORED AS clauses). The file format determines how records (Writable objects) are stored in a file: it provides the file reading used on the map side and the file writing used on the reduce side.
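This read/write split can be illustrated with two minimal interfaces; they are simplified stand-ins for Hadoop's actual InputFormat/RecordReader and OutputFormat/RecordWriter machinery, not its real API:

```java
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Toy contracts for a file format: how Writable records enter and leave a
// file. Reading serves the map side, writing serves the reduce side.
interface RecordFileReader {
    boolean next(Writable record) throws IOException;  // fill the next record
    void close() throws IOException;
}

interface RecordFileWriter {
    void write(Writable record) throws IOException;    // append one record
    void close() throws IOException;
}
```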
SerDe: converts the data format, turning the Writable read from the file into the row objects the operators work on (and back again when writing).
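That contract boils down to two conversions, as in this simplified interface; Hive's real SerDe (in org.apache.hadoop.hive.serde2) also involves ObjectInspectors and checked exceptions, omitted here:

```java
import org.apache.hadoop.io.Writable;

// Simplified view of the SerDe contract: deserialize a file record into the
// row object the operators consume, and serialize a row back on write.
interface RowSerDe {
    Object deserialize(Writable fileRecord);   // file record -> operator row
    Writable serialize(Object row);            // operator row -> file record
}
```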