Big Data Learning Note 3 -- HDFS Extensions and the MapReduce Work Process


HDFS configuration:

    • Configuration parameters set on the client override the corresponding parameters on the server side.

    • Examples: replication factor (number of copies) and block size; see the sketch below.
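
A minimal sketch of a client-side override, assuming a reachable HDFS cluster; dfs.replication and dfs.blocksize are the standard property keys, and the paths are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClientOverride {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Client-side values take precedence over the server defaults
            conf.set("dfs.replication", "2");                 // number of copies
            conf.setLong("dfs.blocksize", 64L * 1024 * 1024); // 64 MB blocks
            FileSystem fs = FileSystem.get(conf);
            // Files written through this client use the overridden settings
            fs.copyFromLocalFile(new Path("/tmp/local.txt"),
                                 new Path("/user/demo/remote.txt"));
            fs.close();
        }
    }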

HDFS file storage:

    • On the server, a block occupies only the actual size of the data it holds. HDFS is still not suited to storing many small files, because each file consumes NameNode metadata space regardless of its size.

    • To optimize for small files, merge them before uploading.

    • Examples: compression, merging text files (a merging sketch follows).
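
One common approach is to pack small files into a single SequenceFile keyed by file name. A minimal sketch, assuming small files under the local directory /tmp/small (both paths are illustrative):

    import java.io.File;
    import java.nio.file.Files;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFileMerger {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/user/demo/merged.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                // One record per small file: key = file name, value = raw bytes
                for (File f : new File("/tmp/small").listFiles()) {
                    byte[] body = Files.readAllBytes(f.toPath());
                    writer.append(new Text(f.getName()), new BytesWritable(body));
                }
            }
        }
    }

This way the NameNode tracks a single merged file instead of one metadata entry per small file.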

HDFS extensions:

    • HDFS exposes a REST API (WebHDFS), which is platform-agnostic.

    • It is served by an embedded Jetty container.

    • File operations can be issued as REST commands over HTTP; see the example below.
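
A minimal sketch of a WebHDFS REST call using the Java 11+ HttpClient, assuming a NameNode web port of 50070 (the Hadoop 2.x default; 3.x uses 9870) and an illustrative path:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WebHdfsList {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // LISTSTATUS is the WebHDFS equivalent of "hdfs dfs -ls"
            HttpRequest req = HttpRequest.newBuilder(URI.create(
                    "http://localhost:50070/webhdfs/v1/user/demo?op=LISTSTATUS"))
                    .GET()
                    .build();
            HttpResponse<String> resp =
                    client.send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.body()); // JSON listing of FileStatuses
        }
    }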

The traditional way of running distributed tasks:

    1. Distribute task resources (JAR, configuration files, ...) and allocate hardware resources

    2. Set up the runtime environment on each task node and start execution

    3. Monitor the status of each phase of task execution

    4. Retry failed tasks

    5. Schedule and aggregate intermediate results

How Hadoop abstracts these distributed concerns:

    • YARN: the resource scheduler, responsible for hardware resource scheduling, task assignment, environment setup, and task launch.

    • MapReduce: the distributed computing framework, responsible for monitoring task execution, retrying failures, and scheduling intermediate results.

    • Spark, Storm: real-time computing.

MapReduce

    • Mapper:
      Reads one line of input at a time
      Outputs a set of key-value pairs
      The number of Mappers equals the number of blocks (one per input split by default)
    • Shuffle:
      Merges and sorts the Mapper output by key
    • Reduce:
      Runs the business-logic processing (see the WordCount sketch below)
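
A minimal WordCount sketch of this Mapper/Shuffle/Reduce contract (the standard Hadoop example, shown here for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Mapper: called once per input line; emits (word, 1) pairs
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }

        // Reducer: receives all counts for one word (already merged by the shuffle)
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(word, new IntWritable(sum)); // business logic: summation
            }
        }
    }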

Hadoop serialization mechanism:

    • Hadoop's current serialization mechanism is the Writable interface; it is expected to be replaced by Avro in subsequent versions (a custom Writable sketch follows).
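
A minimal sketch of a custom Writable, with illustrative (id, score) fields not taken from the note:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class ScoreRecord implements Writable {
        private long id;
        private double score;

        public ScoreRecord() {} // Hadoop requires a no-arg constructor

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);       // fields are serialized in a fixed order...
            out.writeDouble(score);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();      // ...and must be deserialized in the same order
            score = in.readDouble();
        }
    }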

MapReduce Task Submission Methods

    1. JAR package: hadoop jar wordcount.jar Count
      The MR job is submitted to the cluster; this is how jobs run in cluster mode
    2. Local mode
      Run the main method directly in Eclipse
    3. The Eclipse Hadoop plug-in
      (A driver sketch for methods 1 and 2 follows this list.)
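
A minimal driver sketch tying these submission methods to code; the class name Count and the paths are illustrative. Packaged into a JAR, it runs as "hadoop jar wordcount.jar Count <in> <out>" (cluster mode); calling main() directly from an IDE gives local mode:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Count {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(Count.class);          // ships this JAR to the cluster
            job.setMapperClass(WordCount.TokenMapper.class);
            job.setReducerClass(WordCount.SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Blocks until the job finishes, printing progress as it runs
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }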

MapReduce Task Execution Process

    • RunJar: the client process
    • ResourceManager: the resource manager (the "boss")
    • NodeManager: manages task execution on each node
    • MRAppMaster: task startup, monitoring, and failure retry
    • YarnChild: the process that runs the Mapper and Reducer
    1. RunJar submits a job application to the ResourceManager
    2. The ResourceManager returns a job ID and a staging path (hdfs://...) for the submission
    3. RunJar uploads the job files (JAR, job.xml, split.xml) to that path on HDFS
    4. RunJar reports to the ResourceManager that the submission is complete
    5. The ResourceManager allocates resources and writes the task into its task queue
    6. NodeManagers actively poll the ResourceManager to pick up tasks
    7. A NodeManager starts the MRAppMaster in a container
    8. The MRAppMaster registers with the ResourceManager
    9. The ResourceManager returns resource information to the MRAppMaster
    10. The MRAppMaster starts the Mappers (detailed Mapper/Reducer process ...)
    11. The MRAppMaster starts the Reducers
    12. When execution completes, the MRAppMaster deregisters and releases its resources back to the ResourceManager
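
While a job moves through these steps, the ResourceManager's view of running applications can be inspected programmatically. A minimal sketch using the standard YarnClient API (not part of the original note), assuming a running YARN cluster reachable through the default configuration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    public class ListApps {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new Configuration());
            yarn.start();
            // Each report is one application the ResourceManager is tracking
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.println(app.getApplicationId() + "  "
                        + app.getName() + "  "
                        + app.getYarnApplicationState());
            }
            yarn.stop();
        }
    }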
