Hadoop Versions and Ecosystem
1. Hadoop Versions
(1) Introduction to Apache Hadoop versions
Apache open source project development process:
Trunk (backbone branch): New features are developed on the main branch, known as trunk.
Feature branches: Many new features are unstable or incomplete at first, so each is developed on its own feature branch; once the feature matures, the branch is merged back into trunk.
Candidate branches: Candidate branches are split off from trunk periodically and are the branches that releases are cut from. Once a candidate branch is created, no new features are added to it; if bugs are fixed on a candidate branch, a new release is cut from it. Releases from candidate branches are the stable releases.
Causes of the Hadoop version confusion:
Major features were developed on a branch: After the 0.20 branch was released, major features continued to be developed on that branch rather than on trunk, the branch was never merged back, and the 0.20 line became the mainstream.
Out-of-order releases: Version 0.22 was released later than version 0.23.
Version renaming: The 0.20.205 release of the 0.20 branch was renamed to version 1.0; the two versions are identical, only the name changed.
Apache Hadoop version diagram:
(2) Introduction to Apache Hadoop version features
First-generation Hadoop features:
Append: Support for appending to files in HDFS; this avoids data loss when using HBase and is a prerequisite for running HBase.
RAID: Keeps data reliable while reducing the number of data block replicas by introducing parity (check) blocks.
Symlink: Support for symbolic links in HDFS.
Security: Hadoop's security mechanism.
NameNode HA: To avoid a NameNode single point of failure, an HA cluster runs two NameNodes.
Second-generation Hadoop features:
HDFS Federation: A single NameNode limits how far HDFS can scale; federation allows multiple NameNodes, each managing a different part of the namespace (different directories), providing access isolation and horizontal scaling.
YARN: Addresses MapReduce scalability and support for multiple computation frameworks. YARN is a new resource management framework that separates the JobTracker's resource management and job control functions: the ResourceManager is responsible for resource management, and a per-application ApplicationMaster is responsible for job control.
0.20 branch: Only this branch (and its 1.0 renaming) produced stable releases; the other branches are unstable.
0.20.2 (stable): Contains none of the features above; the classic version.
0.20.203 (stable): Contains Append; does not contain Symlink, RAID, or NameNode HA.
0.20.205 / 1.0 (stable): Contains Append and Security; does not contain Symlink, RAID, or NameNode HA.
1.0.1~1.0.4 (stable): Fix bugs in 1.0.0 and add some performance improvements.
0.21 branch (unstable): Contains Append, RAID, Symlink, and NameNode HA; no Security.
0.22 branch (unstable): Contains Append, RAID, Symlink, and NameNode HA; does not contain MapReduce security.
0.23 branch:
0.23.0 (unstable): Second-generation Hadoop, adding HDFS Federation and YARN.
0.23.1~0.23.5 (unstable): Fix some bugs in 0.23.0 and add some optimizations.
2.0.0-alpha~2.0.2-alpha (unstable): Add NameNode HA and wire-compatibility features.
(3) Cloudera Hadoop versions and the corresponding Apache Hadoop versions
2. Hadoop Ecosystem
Apache support: Hadoop's core projects are hosted by Apache, and beyond the core there are several related Apache projects that are an integral part of the Hadoop ecosystem.
HDFS: Distributed file system for reliable storage of massive amounts of data.
MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
Pig: A data flow language and runtime environment for exploring very large datasets.
HBase: A distributed, column-oriented database. HBase uses HDFS as its underlying storage and supports both batch computation with MapReduce and random (point) reads.
ZooKeeper: A distributed coordination service for Hadoop clusters; it provides primitives for building distributed applications so they can cope with the uncertainty introduced by partial failures.
Sqoop: A tool for efficiently transferring bulk data between relational databases and HDFS.
Common: Components and interfaces for distributed file systems and general I/O, including serialization, Java RPC, and persistent data structures.
Avro: A serialization system that supports efficient cross-language RPC and persistent data storage.
Introduction to the MapReduce Model
MapReduce introduction: MapReduce is a programming model for data processing.
Multi-language support: MapReduce programs can be written in a variety of languages, such as Java, Ruby, Python, and C++.
Inherently parallel: MapReduce programs are inherently parallel, so they can be run in parallel across the cluster.
1. MapReduce Data Model Analysis
MapReduce Data Model:
Two phases: A MapReduce job is divided into two phases: the map phase and the reduce phase.
Input/output: Each phase takes key-value pairs as input and produces key-value pairs as output; the programmer can choose the input and output types.
Two functions: The programmer supplies two functions, the map function and the reduce function (a minimal sketch follows).
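To make the two functions concrete, here is a minimal word-count sketch written against the Hadoop Java MapReduce API (the org.apache.hadoop.mapreduce classes); the class names TokenizerMapper and IntSumReducer are just illustrative, not part of Hadoop itself:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: input pairs are (line offset, line text), output pairs are (word, 1).
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1)
        }
    }
}

// Reduce phase: input is (word, list of counts), output is (word, total count).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // add up all counts for this word
        }
        result.set(sum);
        context.write(key, result);     // emit (word, total)
    }
}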
MapReduce job composition: A job is the unit of work and consists of the input data, the MapReduce program, and configuration information (see the driver sketch after this list).
Job control: Job execution is controlled by one JobTracker and multiple TaskTrackers.
JobTracker role: The JobTracker coordinates the tasks running on the TaskTrackers and performs unified scheduling.
TaskTracker role: TaskTrackers run the actual map and reduce tasks of the MapReduce program.
Unified scheduling: While tasks run, each TaskTracker reports its progress to the JobTracker, and the JobTracker keeps a record of the overall progress of every task.
Task failure handling: If a task on a TaskTracker fails, the JobTracker reschedules it on a different TaskTracker.
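A minimal driver sketch that assembles such a job, assuming the TokenizerMapper and IntSumReducer classes from the sketch above and input/output paths passed on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // configuration information
        Job job = new Job(conf, "word count");                // the unit of work
        job.setJarByClass(WordCount.class);                   // locate the jar containing this program
        job.setMapperClass(TokenizerMapper.class);            // the MapReduce program: map ...
        job.setReducerClass(IntSumReducer.class);             // ... and reduce
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input data
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);     // submit the job and wait
    }
}

The JobTracker/TaskTracker details described above stay hidden behind job.waitForCompletion; the driver only declares what to run, not where.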
2. Map Data Flow
Input splits: When a MapReduce job runs, the input data is divided into fixed-size pieces called input splits.
One split per map task: Each split is processed by its own map task, which runs the user-defined map function over every record in the split.
Parallel processing: Processing each split in its own map task takes less time than processing all of the input in a single pass.
Load balancing: Machines in a cluster vary in performance; with many splits, faster machines simply take on more of them, which balances load better than dividing the data evenly per machine and makes fuller use of the cluster.
Reasonable split size: The smaller the splits, the better the load balancing, but the total overhead of managing splits and creating map tasks grows, so a sensible split size has to be chosen; the default is 64 MB, the same as the HDFS block size (see the configuration sketch after this list).
Data locality optimization: A map task runs most efficiently on the node where its input data is stored locally.
Split = block: When a split corresponds to a single HDFS block, its data is stored on one node, which is the most efficient case.
Split > block: If a split is larger than a block, its data is spread over multiple nodes, so the map task has to pull data across the network from other nodes, which reduces efficiency.
Map task output: When a map task finishes, its results are written to the local disk, not to HDFS.
Intermediate output: Map output is only intermediate data; it is passed to the reduce tasks, the reduce output is the final result, and the map intermediate data is eventually deleted.
Map task failure: If a map task fails, it is rerun on another node and the intermediate results are recomputed.
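As a hedged sketch of how split size can be influenced through the new-API FileInputFormat (the helper class and the 64/128 MB values below are illustrative choices, not required defaults):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    // FileInputFormat picks splitSize = max(minSize, min(maxSize, blockSize)),
    // so with the default bounds the split size equals the HDFS block size.
    static void configureSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // at least 64 MB per split
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // at most 128 MB per split
    }
}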
3. Reduce Data Flow
Map vs. reduce count: The number of map tasks is usually much greater than the number of reduce tasks.
No data locality advantage: A reduce task's input is the output of the map tasks, so most of a reduce task's input data is not local.
Data merging: The map task outputs are transferred over the network to the node running the reduce task, where they are merged and then passed as input to the reduce function.
Result output: Reduce output is written directly to HDFS.
Reducer count: The number of reduce tasks is not derived from the input size; it is specified explicitly in the job configuration (see the sketch below).
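A minimal sketch of setting the reducer count on a job (the value 4 is arbitrary and purely illustrative):

import org.apache.hadoop.mapreduce.Job;

public class ReducerCountConfig {
    // Explicitly choose how many reduce tasks (and hence output partitions) the job runs.
    // Setting the count to 0 yields a map-only job whose map output is written straight to HDFS.
    static void configureReducers(Job job) {
        job.setNumReduceTasks(4);
    }
}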
MapReduce Data Flow Diagram Analysis:
Data flow of a single MapReduce job:
Map output partitioning: When there is more than one reduce task, each map task partitions its output, creating one partition for each reduce task.
Keys and partitions: Map output contains many different keys; all records that share a key go to the same reduce task, and a single map task may send output to several reduce tasks.
Partition function: The partition function can be user-defined, but normally the default partitioner, which partitions by hashing the key, is used (see the sketch after this list).
Shuffle: The data flow between the map tasks and the reduce tasks is known as the shuffle.
Reduce data sources: Each reduce task receives input from multiple map tasks.
Map data destinations: The output of each map task is sent to multiple reduce tasks.
No reduce: When the data can be processed fully in parallel, no reduce phase is needed and the job runs map tasks only.
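For illustration, the following partitioner reproduces the hash-by-key behaviour of Hadoop's default HashPartitioner; defining your own class like this is only needed when the default scheme does not fit:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes every record with the same key to the same reduce task by hashing the key.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

A job would opt in with job.setPartitionerClass(WordPartitioner.class); without that call Hadoop already hashes keys this way.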
4. Introducing the Combiner
MapReduce bottleneck: Available bandwidth on the cluster limits the number of MapReduce jobs, because a large amount of data has to be transferred between the map and reduce phases.
Solution: A combiner function merges the output of each map task locally before it is sent to the reduce tasks, so less data crosses the network (see the sketch below).
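In the word-count sketch above, the reducer can double as the combiner because summing partial counts is commutative and associative; a minimal sketch of wiring it in:

import org.apache.hadoop.mapreduce.Job;

public class CombinerConfig {
    // Run IntSumReducer (from the earlier sketch) on each map task's local output,
    // so partial sums rather than raw (word, 1) pairs cross the network.
    static void configureCombiner(Job job) {
        job.setCombinerClass(IntSumReducer.class);
    }
}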
5. Hadoop Streaming
Hadoop multi-language support: Java, Python, Ruby, C++.
Multi-language: Hadoop Streaming allows you to write the map and reduce functions in languages other than Java.
Standard streams: Hadoop Streaming uses UNIX standard streams as the interface between Hadoop and the user program, so MapReduce programs can be written in any language that can read from standard input and write to standard output.
Streaming and text: Streaming is naturally suited to text processing; in text mode it presents the data as a stream of lines.
Map input and output: Map input is passed to the map program line by line on standard input, and the map program writes its results to standard output.
Map output format: Each output key-value pair is written as a single tab-delimited line on standard output.
Reduce input and output: The reduce program reads tab-delimited key-value lines, sorted by key by the Hadoop framework, from standard input and writes its results to standard output (a sketch of the protocol follows).
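Streaming map and reduce programs are usually written in scripting languages such as Python or Ruby; to keep the examples in one language, here is a hedged sketch of the same line-oriented protocol in Java. Streaming can run any executable that reads records from standard input and writes tab-delimited key-value lines to standard output:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// A streaming-style word-count mapper: one input line per record on stdin,
// one "key<TAB>value" line per output record on stdout.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");   // emit (word, 1) as a tab-delimited line
                }
            }
        }
    }
}

A matching reducer would read the sorted key-value lines from standard input, sum the values for each key, and print one key-value line per key; both programs are then handed to the hadoop-streaming JAR via its -mapper and -reducer options.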
6. Hadoop Pipes
Pipes concept: Pipes is the C++ interface to Hadoop MapReduce.
Common misunderstanding: Unlike Streaming, Pipes does not use standard input and output streams to communicate between Hadoop and the map and reduce code, nor does it use JNI.
How it works: Pipes uses sockets as the channel over which the TaskTracker communicates with the process running the C++ map or reduce function.