HDFS configuration:
Configuration parameters set on the client override the corresponding parameters on the server side.
Example: number of copies (replication factor), block size
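The override order can be sketched with plain Python dictionaries. This is only a model of the lookup order, not Hadoop's own configuration classes. dfs.replication and dfs.blocksize are the HDFS property names for the number of copies and the block size; the values below are illustrative assumptions.

```python
from collections import ChainMap

# Server-side settings, as they might appear in the cluster's hdfs-site.xml
# (values are illustrative assumptions).
server_conf = {"dfs.replication": "3", "dfs.blocksize": "134217728"}

# Client-side settings: only the replication factor is overridden here.
client_conf = {"dfs.replication": "2"}

# ChainMap searches its maps from left to right, so a client value wins
# whenever both sides define the same key.
effective = ChainMap(client_conf, server_conf)

print(effective["dfs.replication"])  # value taken from the client
print(effective["dfs.blocksize"])    # falls back to the server value
```

Any key the client does not set falls through to the server-side value.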
HDFS file storage:
A block occupies only the actual size of its data on the server, but HDFS is still not suited to storing small files, because every small file consumes NameNode metadata space.
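A small arithmetic sketch of the point above, assuming a block size of 128 MB (an assumption for illustration; the real value comes from the cluster configuration). The helper block_layout is hypothetical and only models how a file's bytes are divided into blocks.

```python
MB = 1024 * 1024

def block_layout(file_size, block_size=128 * MB):
    """Return the sizes, in bytes, of the blocks a file is split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes

# A 300 MB file: two full blocks plus one 44 MB block.
large = block_layout(300 * MB)

# 1000 files of 1 KB each: 1000 tiny blocks, each tracked by the NameNode.
small_blocks = sum(len(block_layout(1024)) for _ in range(1000))

# The same data merged into a single file fits in one block.
merged_blocks = len(block_layout(1000 * 1024))
```

The disk usage is the same in both cases, but the unmerged layout leaves the NameNode with a thousand files and blocks to track instead of one.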
To optimize for small files, merge them before uploading.
Example: compression, text file merging
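A minimal sketch of text file merging on the local disk before upload. The function name merge_text_files, the .txt filter and the file names are assumptions made for the example.

```python
import os
import tempfile

def merge_text_files(src_dir, merged_path):
    """Concatenate every .txt file in src_dir into merged_path; return the file count."""
    names = sorted(n for n in os.listdir(src_dir) if n.endswith(".txt"))
    with open(merged_path, "w", encoding="utf-8") as out:
        for name in names:
            with open(os.path.join(src_dir, name), encoding="utf-8") as src:
                text = src.read()
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")  # keep records from different files on separate lines
    return len(names)

# Demo in a temporary directory with three small files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        with open(os.path.join(tmp, f"part{i}.txt"), "w", encoding="utf-8") as f:
            f.write(f"record {i}")
    merged = os.path.join(tmp, "merged.dat")
    count = merge_text_files(tmp, merged)
    with open(merged, encoding="utf-8") as f:
        merged_text = f.read()
```

The single merged file can then be uploaded in place of the many small ones, for example with hdfs dfs -put.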
HDFS extension:
The traditional way of running distributed tasks:
Distribute task resources (jar, configuration files ...) and allocate hardware resources
Set up the run environment on each task node and start execution
Monitor the execution status of each phase of the task
Retry when a task fails
Schedule and aggregate intermediate results
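Two of the steps above, retrying failed tasks and aggregating intermediate results, are what a hand-written distributed job would have to code itself. A rough single-process sketch, in which run_with_retry and flaky_sum are hypothetical names and the failure is simulated:

```python
def run_with_retry(task, max_attempts=3):
    """Call task(); on an exception, retry until max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt

# A simulated task that fails twice before succeeding.
attempts = {"n": 0}

def flaky_sum():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated node failure")
    return sum(range(10))

# Each task produces an intermediate result; the driver aggregates them.
partials = [
    run_with_retry(flaky_sum),
    run_with_retry(lambda: sum(range(10, 20))),
]
total = sum(partials)
```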
Hadoop's abstraction of distributed computing:
YARN: resource scheduler, responsible for hardware resource scheduling, task assignment, environment configuration, and starting tasks.
MapReduce: distributed computing framework, responsible for monitoring task execution, retrying failures, and scheduling intermediate results.
Spark, Storm: real-time computing
MapReduce
- Mapper:
Reads one line of data at a time
Outputs a set of key-value pairs
The number of mappers equals the number of blocks
- Shuffle:
Merges the mapper output, grouping values by key
- Reduce:
Business logic processing
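The three phases can be sketched in a single Python process with a word count. This is only a model of the data flow, not the Hadoop API; the function names and input lines are assumptions.

```python
from collections import defaultdict

def mapper(line):
    """Take one line and emit a (word, 1) pair for every word in it."""
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    """Group all values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reducer(key, values):
    """Business logic: here, sum the counts for one word."""
    return key, sum(values)

lines = ["hello hadoop", "hello yarn"]
mapped = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(mapped)
counts = dict(reducer(key, values) for key, values in grouped.items())
```

Only mapper and reducer contain job-specific logic; the grouping step in between is the same for every job.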
Hadoop serialization mechanism:
- The current serialization mechanism in Hadoop is Writable; it is to be replaced by Avro in subsequent versions
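Hadoop's Writable interface asks a type to provide two methods, write(DataOutput) and readFields(DataInput). The Python analogue below mirrors that shape. WordCountRecord is a hypothetical type, and the byte layout is chosen for illustration only; it does not claim to match the encoding of any built-in Hadoop type.

```python
import io
import struct

class WordCountRecord:
    """Hypothetical record that serializes itself field by field."""

    def __init__(self, word="", count=0):
        self.word = word
        self.count = count

    def write(self, out):
        data = self.word.encode("utf-8")
        out.write(struct.pack(">i", len(data)))   # 4-byte big-endian length
        out.write(data)                           # UTF-8 bytes of the word
        out.write(struct.pack(">q", self.count))  # 8-byte big-endian count

    def read_fields(self, inp):
        (length,) = struct.unpack(">i", inp.read(4))
        self.word = inp.read(length).decode("utf-8")
        (self.count,) = struct.unpack(">q", inp.read(8))

# Round trip: serialize one record, then restore it into an empty object.
buf = io.BytesIO()
WordCountRecord("hello", 2).write(buf)
raw = buf.getvalue()

restored = WordCountRecord()
restored.read_fields(io.BytesIO(raw))
```

Fields are read back in exactly the order they were written, and no field names or type information are stored, which keeps the output compact.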
MapReduce task submission methods
- Jar package: hadoop jar Wordcount.jar Count
The MR job is submitted to the cluster; this is the cluster run mode
- Local mode
Run the main method directly in Eclipse
- Eclipse Hadoop plug-in
MapReduce task execution process
- RunJar: client
- ResourceManager: resource manager, the boss
- NodeManager: manages task execution
- MRAppMaster: task start-up, monitoring, retry on failure
- YarnChild: Mapper and Reducer
- RunJar applies to the ResourceManager to submit a job
- ResourceManager returns the job ID and a job submission path (hdfs://)
- RunJar submits the job files (jar, configuration job.xml, split.xml) to HDFS
- RunJar reports to the ResourceManager that the job submission is complete
- ResourceManager allocates resources and writes the job to the task queue
- NodeManager actively fetches tasks from the ResourceManager
- NodeManager starts the MRAppMaster in a container
- MRAppMaster registers with the ResourceManager
- ResourceManager returns resource information to the MRAppMaster
- MRAppMaster starts the Mappers (detailed Mapper and Reducer process ... )
- MRAppMaster starts the Reducers
- When the task completes, MRAppMaster deregisters from the ResourceManager and releases its resources
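The sequence above can be followed in a toy single-process simulation. Nothing here is YARN's real API: the class, the method names, the job ID format and the staging path are all hypothetical, and HDFS is modelled as a dictionary.

```python
from collections import deque

log = []

class ResourceManager:
    """Toy stand-in for the ResourceManager; not the real YARN interface."""

    def __init__(self):
        self.queue = deque()
        self.registered = set()

    def new_application(self):
        job_id = "job_0001"                               # hypothetical ID
        return job_id, f"hdfs://example/staging/{job_id}"  # hypothetical path

    def submit(self, job_id, path):
        self.queue.append((job_id, path))                 # write to the task queue
        log.append(f"RM queued {job_id}")

    def poll(self):
        return self.queue.popleft() if self.queue else None

    def register(self, job_id):
        self.registered.add(job_id)
        log.append(f"AM registered {job_id}")
        return {"containers": 2}                          # resource information

    def unregister(self, job_id):
        self.registered.discard(job_id)
        log.append(f"AM unregistered {job_id}")

def node_manager_poll(rm, hdfs):
    """The NodeManager fetches a job, then hosts the application master."""
    job = rm.poll()
    if job is None:
        return None
    job_id, path = job
    log.append(f"NM started AM for {job_id}")
    rm.register(job_id)
    assert "job.xml" in hdfs[path]                        # job files come from HDFS
    log.append("AM ran mapper")
    log.append("AM ran reducer")
    rm.unregister(job_id)
    return job_id

hdfs = {}
rm = ResourceManager()
job_id, staging_path = rm.new_application()               # client applies for a job
hdfs[staging_path] = ["Wordcount.jar", "job.xml", "split.xml"]  # client uploads files
rm.submit(job_id, staging_path)                           # client reports submission
finished = node_manager_poll(rm, hdfs)
```

Printing log after the run shows the order of events, which mirrors the steps listed above: queueing, application master start, registration, mapper, reducer, deregistration.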
Big Data Learning Note 3: HDFS extension and MapReduce work process