MRv2 Working Mechanism, the Fair Scheduler, MapReduce Compression, and Side Data


For large clusters of more than 4,000 nodes, the MapReduce system described in the previous section began to hit scalability bottlenecks, so in 2010 a team at Yahoo! started designing the next generation of MapReduce: YARN (Yet Another Resource Negotiator).

YARN remedies the scalability bottleneck of MRv1 by splitting the JobTracker into separate entities. In MRv1 the JobTracker was responsible both for job scheduling and for task progress monitoring: tracking tasks, restarting failed or slow tasks, and bookkeeping such as maintaining counter totals. YARN divides these two roles between two independent daemons:

  • Resource Manager: manages the use of resources across the cluster.
  • Application Master: manages the lifecycle of an application running on the cluster.

The Application Master negotiates with the Resource Manager for the cluster's compute resources, described in terms of containers (each container has a specific memory limit), and then runs the application's processes in those containers. The containers are monitored by the Node Managers running on the cluster nodes.

In fact, MapReduce is just one kind of YARN application, and different YARN applications can coexist on the same cluster. For example, an MR application can run at the same time as an MPI application (MPI is a communication protocol whose goals are high performance, large scale, and portability). This significantly improves manageability and cluster utilization.

MapReduce on YARN involves more entities than classic MapReduce:

  • the client that submits the MapReduce job;
  • YARN's Resource Manager;
  • YARN's Node Managers;
  • the MapReduce Application Master, which coordinates the tasks of the MapReduce job. It and the MapReduce tasks run in containers that are allocated by the Resource Manager and managed by the Node Managers.

Compared with classic MapReduce, the YARN way of running MapReduce adds the Application Master, failure-flag optimizations, and changes in how MRv2 status updates are propagated.
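As a rough illustration of the two daemons above, a minimal yarn-site.xml might look like the following sketch (the property names are standard YARN configuration keys; the hostname is a placeholder):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where the ResourceManager daemon runs -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.example.com</value>
  </property>
  <!-- NodeManagers run the shuffle service so reducers can fetch map output -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```

Each NodeManager registers with the ResourceManager at that hostname and reports the containers it is monitoring.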
Early Hadoop used a FIFO scheduling algorithm to run jobs; job priorities were soon added (like DotA: VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW), and the scheduler selects the highest-priority job first. Under FIFO scheduling, priorities do not support preemption, so a high-priority job can still be blocked by a long-running, lower-priority job that started earlier.

1. Fair Scheduler. The goal is to let every user share the cluster's capacity fairly. Jobs are placed in job pools, and a user who submits many jobs does not thereby get more cluster resources. You can customize the minimum capacity of a job pool in terms of map and reduce task slots, and you can set a weight for each pool. The Fair Scheduler supports preemption: if a pool has not received its fair share for a certain period of time, the scheduler terminates tasks in pools running over their share, and the freed slots go to the pools running below their share.

2. Capacity Scheduler. Designed for multi-user scheduling, the Capacity Scheduler lets users simulate a MapReduce cluster that uses a FIFO scheduling policy, with finer-grained control.

Compression. It is almost always a good idea to compress the map output as it is written to disk (by default it is not compressed). Compressed map output must be decompressed in memory on the reduce side; after all the map outputs have been copied, this phase merges the (already sorted) map outputs.

Input splits versus HDFS blocks. Suppose a file is divided into 5 lines, and the line boundaries do not line up with the HDFS block boundaries. Split boundaries are aligned with the logical record boundaries (the line boundaries), so the first split contains the first 5 lines, even though the fifth line straddles block one and block two; the second split then starts at line six.
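To make the pool settings concrete, here is a sketch of a Fair Scheduler allocation file in the classic (MRv1) format; the pool name and numbers are made up for illustration:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Hypothetical pool for a "research" group -->
  <pool name="research">
    <minMaps>10</minMaps>      <!-- guaranteed minimum map task slots -->
    <minReduces>5</minReduces> <!-- guaranteed minimum reduce task slots -->
    <weight>2.0</weight>       <!-- twice the fair share of a weight-1.0 pool -->
  </pool>
</allocations>
```

If the "research" pool stays below its minimum share past the preemption timeout, the scheduler kills tasks in over-share pools to free slots for it.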
The write() method of MultipleOutputs can take a base path that is interpreted relative to the output path; because it may contain the file path separator (/), you can create paths of arbitrary depth.

Lazy output. FileOutputFormat subclasses always produce output files, even when those files end up empty, which is why LazyOutputFormat exists. It guarantees that a partition's output file is created only when the first record is actually written to it. To use it, call its setOutputFormatClass() method with the JobConf and the underlying output format as arguments.

MR advanced features. Hadoop maintains several built-in counters for each job.

Side data distribution. Side data is the additional read-only data a job needs in order to process the main dataset; the problem is how to make the side data available to all map or reduce tasks easily and efficiently. There are two approaches:

1. Serialize the side data into the job configuration (JobConf). This wastes memory, since every daemon that handles the job loads the configuration.
2. Use the distributed cache, which copies files to the task nodes before the tasks run.
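The idea behind side data can be sketched without any Hadoop dependency: each task loads a small read-only lookup table once, then uses it to enrich every record of the main dataset. In a real Hadoop job this loading would happen in the mapper's setup phase, reading a file shipped via the distributed cache; the class and method names below are illustrative, not Hadoop API.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of side data (no Hadoop dependency).
public class SideDataSketch {
    // Simulates parsing the cached side file, e.g. tab-separated
    // "code<TAB>full name" lines, into an in-memory lookup table.
    static Map<String, String> loadLookup(String cachedFileContents) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : cachedFileContents.split("\n")) {
            String[] parts = line.split("\t", 2);
            lookup.put(parts[0], parts[1]);
        }
        return lookup;
    }

    // Simulates the per-record map() logic: replace a country code with
    // its full name; unknown codes pass through unchanged.
    static String mapRecord(String record, Map<String, String> lookup) {
        return lookup.getOrDefault(record, record);
    }

    public static void main(String[] args) {
        // The side data is loaded once per task, not once per record.
        Map<String, String> lookup = loadLookup("US\tUnited States\nCN\tChina");
        System.out.println(mapRecord("US", lookup)); // prints "United States"
        System.out.println(mapRecord("JP", lookup)); // prints "JP" (unknown code)
    }
}
```

The key property is that the lookup table is read-only and loaded once per task, which is exactly what makes side data cheap to distribute compared with joining it as a second input.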

