Distributed Systems Fundamentals [2] -- Distributed Computing (Map/Reduce)

Document directory
  • I. Glossary Comparison
  • II. Basic Architecture
  • III. Computing Process
  • IV. Map Task Details
  • V. Reduce Task Details
  • VI. Distributed Support
  • VII. Summary
2. Distributed Computing (Map/Reduce)

Distributed computing is also a broad concept; here it refers narrowly to a distributed framework designed after the Map/Reduce model. In Hadoop, the distributed file system largely exists to serve the needs of distributed computation. We said earlier that a distributed file system is "a file system plus distribution"; by the same token, distributed computing can be viewed as "computation plus distribution". From the computing point of view, the Map/Reduce framework accepts key-value pairs in various formats as input and, after the computation, produces output files in a user-defined format. From the distributed point of view, the input files of a distributed computation are usually large and spread across many machines, so single-machine computation is either impossible or too inefficient; the Map/Reduce framework therefore has to provide a mechanism for scaling the computation out to a practically unlimited cluster of machines. Following this definition, our understanding of Map/Reduce can proceed along these two threads...

In the Map/Reduce framework, each computing request is called a job. To complete a job, the framework applies a two-step strategy. First, it splits the job into several map tasks and assigns them to different machines for execution. Each map task takes a portion of the input file as its own input and, after some computation, produces an intermediate file in a certain format; this format is exactly the same as that of the final output, but each intermediate file contains only part of the data. Once all the map tasks are finished, the framework moves to the next step: merging these intermediate files to obtain the final output. For this, the system creates several reduce tasks and likewise assigns them to different machines. Their goal is to summarize the intermediate files produced by the map tasks into the final output files. Of course, this summary is not always as trivial as 1 + 1 = 2, and that is exactly where the value of the reduce task lies. After these steps, the job is done and the target files have been produced. The key to the whole algorithm is the added step of generating intermediate files, which greatly improves flexibility and guarantees distributed scalability...

I. Glossary Comparison
As with the distributed file system, Google, Hadoop, and this article each use their own terminology. To keep things consistent, the correspondences are listed below...
Term used here / Hadoop term / Google term, with explanations:
  • Job / Job / Job: Each computing request from a user is called a job.
  • Job Server / JobTracker / Master: The server to which users submit jobs; it is also responsible for assigning each job's tasks and for managing all the task servers.
  • Task Server / TaskTracker / Worker: The worker that executes the specific tasks.
  • Task / Task / Task: Each job has to be split up and handled by multiple servers; the unit of execution after splitting is called a task.
  • Backup Task / Speculative Task / Backup Task: Any task may fail or run slowly. To limit the cost of that, the system schedules the same task on another task server as well; that duplicate is the backup task.
II. Basic Architecture
Like the distributed file system, a Map/Reduce cluster is made up of three kinds of roles. The job server is called the JobTracker in Hadoop and the Master in Google's paper. The former name tells us that the job server manages all the jobs running in this framework; the latter tells us that it is also the core that assigns tasks for each job. Like the master of HDFS, it is a single point, which greatly simplifies synchronization. The roles that actually execute the user-defined operations are the task servers. Each job is split into many tasks, including map tasks and reduce tasks; tasks are the basic unit of execution and must all be assigned to a suitable task server to run. While executing its tasks, a task server keeps reporting the status of each task to the job server, so that the job server knows the overall execution status of every job, can allocate new tasks, and so on...

Besides the job manager and the executors, there must also be a submitter of tasks, which is the client. As with the distributed file system, the client is not a separate process but a set of APIs. The user writes whatever custom logic is required and, through the client code, submits the job with its related content and configuration to the job server, and can keep monitoring its execution status...

Like the HDFS implementation, Hadoop Map/Reduce uses protocol interfaces for communication between servers. The implementer acts as the RPC server, and the caller invokes it through an RPC proxy; most of the communication is carried out this way. The concrete architecture of each server and the protocols running inside it can be seen in the figure referenced in the original post. Compared with HDFS, there are slightly fewer protocols: there is no direct communication between the client and the task servers, nor between task servers themselves. This does not mean the client has no interest in the execution status of individual tasks, or that task servers never need information about each other; rather, the connections between machines would be far more complex than in HDFS and direct communication would be too hard to maintain, so the job server takes on the work of collecting and forwarding this information. The figure also shows that a task server does not fight alone: like Sun Wukong conjuring up a troop of little helpers, it spawns child processes to execute the specific tasks. In my opinion this is mainly for safety: the task code is submitted by the user and the data is also specified by the user, so their quality naturally varies, and a bad failure would otherwise take down the whole task-server process. Putting each task into its own process keeps rights and responsibilities clearly separated...

Compared with the distributed file system, the Map/Reduce framework has another characteristic: it is highly customizable. Many algorithms in a file system are fixed and intuitive and do not change much with the content being stored. A general-purpose computing framework faces far more varied problems, and it is hard to have one cure for all diseases. As a framework, Map/Reduce tries to extract the common requirements and implement them once and, more importantly, must provide a good extension mechanism so that users can plug in their own algorithms. Hadoop is implemented in Java, so custom extension through reflection is quite easy. The JobConf class defines a large number of interfaces and is essentially a one-stop display of everything customizable in the Hadoop Map/Reduce framework. Many of its set methods accept a Class<? extends Xxx> argument...
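As an illustration of that Class<? extends Xxx> pattern, here is a hedged fragment (not a complete program, and not code from this article) showing how a job might swap in its own pluggable pieces through JobConf; MyJob, MyMapper, MyReducer and MyPartitioner are hypothetical user-defined classes:

```java
// A minimal sketch of JobConf's reflection-based customization hooks (old mapred API).
JobConf conf = new JobConf(MyJob.class);
conf.setMapperClass(MyMapper.class);           // Class<? extends Mapper>
conf.setReducerClass(MyReducer.class);         // Class<? extends Reducer>
conf.setCombinerClass(MyReducer.class);        // optional local reduce over map output
conf.setPartitionerClass(MyPartitioner.class); // Class<? extends Partitioner>
conf.setInputFormat(TextInputFormat.class);    // how input files are split and parsed
conf.setOutputFormat(TextOutputFormat.class);  // how the final files are written
```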
III. Computing Process

If everything proceeds step by step, the whole computation of a job follows the flow: job submission -> map task allocation and execution -> reduce task allocation and execution -> job completion. Within the execution of each task there are three sub-steps: input preparation -> algorithm execution -> output generation. Following this flow, we can quickly sort out how a job runs through the entire Map/Reduce framework...

1. Submitting a Job
Before submitting a job, everything that should be configured must be configured, because once the job is handed to the job server it enters a fully automated process; the user can at most supervise it and, say, punish tasks that misbehave...

Basically, the submission phase requires the following from the programmer. First, write all the custom code; at a minimum, that means the map and reduce code. In Hadoop, the map code derives from the Mapper<K1, V1, K2, V2> interface and the reduce code from the Reducer<K2, V2, K3, V3> interface. The generic parameters exist to support arbitrary key and value types. Each interface has only one method, map and reduce respectively, and both take four parameters: the first two are the input key and value data structures, the third is the output collector, and the last is an instance of the Reporter class, which can be used for counters and progress reporting. Besides these two interfaces, there are many other interfaces that can be implemented, such as the Partitioner<K2, V2> interface used to partition the map output (more on that later)...

Then comes the code of the main function. Its main content is to instantiate a JobConf object and call its rich set of setXxx methods to configure everything required, including the input and output file paths, the map and reduce classes, and even the format support classes needed to read and write the files, and so on...

Finally, call JobClient's runJob method and hand it this JobConf object. runJob first calls the submitJob method of the JobSubmissionProtocol to submit the job to the job server, and then loops, continuously calling the protocol's getTaskCompletionEvents method to obtain TaskCompletionEvent instances and learn the execution status of each task of this job...
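To make the above concrete, here is a small, self-contained word-count job written against the old org.apache.hadoop.mapred API described in this article. It is an illustrative sketch, not code from the article; the class name WordCount and the argument handling are arbitrary choices:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper<K1, V1, K2, V2>: input is (line offset, line text), output is (word, 1).
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reducer<K2, V2, K3, V3>: sums the counts collected for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);   // local "mini-reduce" on each map's output
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);                // submit and wait, polling task events
  }
}
```

Run with an input directory and an output directory as arguments, it submits the job through JobClient.runJob and blocks until completion, reporting progress gathered through the task-completion events mentioned above.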
2. Map Task Allocation

When a job is submitted to the job server, the job server generates a number of map tasks for it, each of which converts part of the input into an intermediate file whose format matches that of the final output. Usually the input of a job consists of files on the distributed file system (in a single-machine environment the file system can of course be a local one...), since that pairs naturally with distributed computation. The input of one map task is usually a data block of the input file, or part of one, and never spans data blocks: once a task crossed block boundaries, several servers might be involved and unnecessary complexity would follow...

When the job arrives from the client, the job server creates a JobInProgress object to manage it. After the job has been split into map tasks, the tasks are hung, in advance, on the task-server topology tree maintained by the job server, according to where the data blocks live on the distributed file system. For example, if a map task needs a certain data block and that block has three replicas, the task is attached to the three servers holding them; this can be regarded as a pre-allocation...

For allocation, JobInProgress relies on JobInProgressListener and TaskScheduler. TaskScheduler, as its name implies, is the policy class for task allocation (to simplify the description, it stands here for all of its subclasses...). It keeps track of the task information of all jobs, and its assignTasks function, which accepts a TaskTrackerStatus as its parameter, assigns new tasks to a task server according to that server's status and the status of the outstanding tasks. To know the status of all the tasks of all jobs, the scheduler registers several JobInProgressListeners with the JobTracker...

Task allocation is an important matter: it means giving the right task to the right server. There are clearly two steps in it: first choose a job, then choose a task within that job. Like all allocation work, task allocation is complicated. A poor allocation may increase network traffic or lower the efficiency of already loaded servers, and there is no one-size-fits-all answer: different business backgrounds may need different algorithms. That is why Hadoop ships many TaskScheduler subclasses, with Facebook and Yahoo both having contributed their own. The default scheduler in Hadoop is the JobQueueTaskScheduler class. The basic order in which it hands out tasks is: map cleanup task (which clears the expired files and environment of the map tasks...) -> map setup task (which sets up the map execution environment...) -> map tasks -> reduce cleanup task -> reduce setup task -> reduce tasks.

On this basis, consider the allocation of map tasks specifically. When a task server is ready to work and asks for new tasks, JobQueueTaskScheduler starts allocating from the job with the highest priority. Each time it assigns one task, it also sets aside a margin of spare capacity, just in case. For example, suppose the system currently has three jobs with priorities 3, 2 and 1, each with one assignable map task, and a task server asks for work with the capacity to carry three tasks. JobQueueTaskScheduler will first take a task from the priority-3 job and give it to the server, then reserve one slot as padding; after that, only the priority-2 job's task can still be given to this server, and the priority-1 job gets nothing. The idea behind this strategy is to serve the high-priority jobs first, and not merely first: they also get spare capacity reserved for their unexpected needs. Such preferential treatment keeps the high-priority jobs happy, while the low-priority jobs have to live with whatever is left...

Once the scheduler has decided which job to draw a task from, the concrete allocation, after a chain of calls, is carried out by JobInProgress's findNewMapTask function. Its algorithm is simple: do the best it can for this server and give it the most suitable task available; that is, as long as there are still assignable tasks, one is handed out, without worrying about who comes later. The job server checks, from the server nearest to the requester outward, whether there are unassigned tasks mounted there (the pre-allocation described above). If every task has already been assigned, it then checks whether speculative execution is enabled; if so, an unfinished task is assigned one more time (more on this later...)...

For the job server, handing out a task does not mean it is fully off the hook and can forget about it. The task may fail on the task server, or run very slowly, and the job server then has to arrange for it to be run again. Therefore each run is recorded under a TaskAttemptID: as far as the task servers are concerned, every run is only an attempt; the reduce task accepts only one successful output per map task, and the other attempts have laboured in vain...
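The per-job priority that JobQueueTaskScheduler sorts on can be set when the job is configured; a hedged fragment using the old-API JobConf (conf as in the word-count example above, not code from this article):

```java
// Priority influences which job the default scheduler draws tasks from first.
// Values, from highest to lowest: VERY_HIGH, HIGH, NORMAL (the default), LOW, VERY_LOW.
conf.setJobPriority(JobPriority.HIGH);
```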
3. Map Task Execution

As in HDFS, the task server reports to the job server through heartbeat messages, telling it the current status of every task running on it and applying for new tasks. Concretely, the TaskTracker calls the heartbeat method of the InterTrackerProtocol. This method accepts a TaskTrackerStatus object as a parameter, describing the state of the task server at that moment; when the server is willing to take new tasks, it also passes acceptNewTasks as true, telling the job server that it is ready to shoulder more work. After the JobTracker receives and processes these parameters, it returns a HeartbeatResponse object, which carries a set of TaskTrackerActions directing the task server's next steps. The system defines a whole family of TaskTrackerAction subclasses; some carry extra parameters and some are mere markers. They are not detailed here, as a glance at the code makes them clear...

When the actions received by the TaskTracker include a LaunchTaskAction, it starts executing the newly assigned task. Inside the TaskTracker there are TaskTracker.TaskLauncher threads (two of them: one waiting for map tasks, one waiting for reduce tasks) that wait for new tasks to arrive. Once a task arrives, the createRunner method is eventually called to construct a TaskRunner object, and a new thread is created to execute it. For a map task the corresponding runner is MapTaskRunner, a subclass of TaskRunner, but the core logic lives in TaskRunner itself. TaskRunner first downloads all the required files, unpacks them, and records them in a global cache, a shared directory that all tasks of this job can use; soft links are used to link the file names into each task's working directory. It then configures a JvmEnv object describing the JVM execution environment according to the various parameters, and calls JvmManager's launchJvm method, handing the work over to the JvmManager. The JvmManager manages all the running task child processes on this TaskTracker. The current implementation takes a pooled approach: there is a fixed number of slots; if the slots are not full, a new child JVM is started; otherwise an idle process is looked for, and if it belongs to the same job the task is handed to it directly, otherwise the process is killed and replaced with a new one. Each child process is managed by a JvmRunner, again in its own separate thread. From the implementation, though, this pooling mechanism does not seem to actually take effect: the child process sits in an endless wait without blocking the parent's threads, and the parent's bookkeeping variables are never adjusted, so once allocated they stay marked busy forever. The real execution carrier is the Child class. It contains a main function; when the process is launched, the relevant parameters are passed in, it unpacks them, constructs the corresponding Task instance, and calls its run function to execute it. Each child process may execute a configured number of tasks, which is the pooled configuration mentioned above; but as far as I can tell the mechanism never really runs, and in practice a child process never gets the chance to execute new tasks, it simply waits until the pool is full and is then killed off with a single stroke. Perhaps my eyes are failing me and I just cannot see where it is wired up...
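For what it is worth, the "number of tasks per child JVM" knob discussed above is exposed on JobConf in later Hadoop 1.x-era APIs; a hedged fragment, not part of this article's code (conf as in the word-count example):

```java
// How many tasks of the same job one child JVM may run before being torn down;
// -1 is commonly documented as "no limit", i.e. reuse the JVM indefinitely.
conf.setNumTasksToExecutePerJvm(1);
```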
4. Reduce Task Allocation and Execution

Compared with map tasks, the allocation of reduce tasks is simple: basically, once all the map tasks are complete, any idle task server that asks gets one. Because the outputs of the map tasks vary so much, devising a globally optimal allocation algorithm here would hardly be worth the candle. The construction and startup of a reduce task are basically the same as for a map task, and there is nothing more to say...

In fact, the biggest difference between a reduce task and a map task is that a map task's files all lie on its local disk, whereas a reduce task has to collect its input from everywhere. The job server tells the task server running the reduce task the addresses of the servers on which map tasks have completed, and the reduce task server contacts those original map task servers (local ones excepted, of course...) and downloads the data over the TaskTracker's HTTP service. This implicit, direct data connection is the biggest difference between executing a reduce task and executing a map task...

5. Job Completion

When all the reduce tasks are complete and the required data has been written to the distributed file system, the whole job is done.

This process involves many classes, many files, and many servers, so it is hard to narrate. A picture is worth a thousand words, so the original post offers two diagrams. The first is a sequence diagram, simulating a job consisting of three map tasks and one reduce task. It shows that during execution, whenever a task is too slow or fails, an extra attempt is added, in exchange for the shortest overall completion time. As soon as all the map tasks are finished, the reduce work begins (in reality it does not even have to wait that long...); for each map task, it only counts as successful once the reduce task has finished downloading its data, otherwise it is treated as failed and a new attempt is needed... The second figure is not the author's own and is not reproduced; it describes the state of all the servers in the whole Map/Reduce run, the overall flow, where each process lives, and the inputs and outputs, and it shows clearly how the basic Map/Reduce flow is completed. A few points may not be obvious from that figure: the log files actually live on HDFS, which is not marked; step 5 is in fact pushed by the TaskTracker rather than the JobTracker; and in steps 8 and 11, the MapTask and ReduceTask that are created each run in an independent process in Hadoop...

IV. Map Task Details

Having seen the overall flow of map and reduce, we now zoom into the execution details. The input of a map task is a file on the distributed file system containing key-value pairs. To specify the input of each map task, we need to understand the file format well enough to split it into blocks and to separate the key and value information out of each block. In Hadoop the input file format is described by InputFormat<K, V>; in JobConf its default value is the TextInputFormat class (see getInputFormat), which is a special FileInputFormat<LongWritable, Text> subclass, and FileInputFormat<K, V> is in turn an InputFormat<K, V>. Through this chain it is easy to see that the default file format is plain text, with keys of type LongWritable (an integer) and values of type Text (a string). Knowing the file type is not enough, though; we also need a way to separate each record in the file into a key-value pair, and that is the job of RecordReader<K, V>. In the getRecordReader method we can see that, by default, TextInputFormat pairs up with the LineRecordReader class, a special RecordReader<LongWritable, Text> subclass that treats each line as one record, the byte offset at which the line starts as the key, and the whole line's text as the value. With the format handled and keys and values separated, what remains is to describe the slice of input handed to each map task, which is the InputSplit interface; for file input its implementation is FileSplit, which carries the file name, the start position, the length, and the set of server addresses that store it...
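If the input is not plain lines of text, the default pairing can be swapped out through JobConf; a hedged fragment illustrating the point (not from the article, conf as in the word-count example):

```java
// Default: TextInputFormat + LineRecordReader, i.e. the mapper sees
// (LongWritable byteOffsetOfLine, Text lineContents) pairs, one per line.
conf.setInputFormat(TextInputFormat.class);

// If each line is already "key <TAB> value" text, KeyValueTextInputFormat
// parses it so that both the key and the value arrive at the mapper as Text.
conf.setInputFormat(KeyValueTextInputFormat.class);
```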
After the map task obtains its InputSplit, it reads the records one by one and calls the user-defined Mapper for computation (see the run methods of MapRunner<K1, V1, K2, V2> and MapTask). MapTask passes an OutputCollector<K, V> object to the Mapper as its output data structure; it defines a single collect function that accepts one key-value pair. Two OutputCollector subclasses are defined in MapTask. One is MapTask.DirectMapOutputCollector<K, V>, which is just as direct as its name suggests: it holds a RecordWriter<K, V> object, and every time collect is called, RecordWriter's write method writes the pair straight into a local file. If RecordWriter<K, V> looks familiar, that is because it is the mirror image of the RecordReader<K, V> from the previous section: the data structures correspond one to one, one for input and one for output. The output side is likewise symmetric, pairing RecordWriter<K, V> with OutputFormat<K, V>, whose default implementations are LineRecordWriter<K, V> and TextOutputFormat<K, V>...

Besides this very direct implementation, MapTask has a more elaborate one, MapTask.MapOutputBuffer<K extends Object, V extends Object>. If simplicity trumps everything, why keep a complicated version next to a perfectly simple one? Because good looks often come with thorns: the simple implementation writes to a file on every collect call, and such frequent disk operations make it inefficient. The complicated version exists to solve that. It sets aside a block of memory as a cache, defines a ratio as a threshold, and starts a thread to monitor the cache. Output from collect is first written into the cache; when the monitoring thread finds that the cached content has exceeded the threshold proportion, it suspends writes, creates a new file, and flushes the cached content into it in one batch...

Why call it a flush rather than a copy? Because it is not a simple copy: before being written out, the cached records are sorted and run through the combiner. If the term combiner sounds unfamiliar, think of the reducer: the combiner is also a Reducer class, set through JobConf's setCombinerClass. In common configurations the combiner is simply the same reducer subclass that the user defined for the reduce tasks; it just operates on a smaller scope. On the local server running the map task, it performs a reduce pass over the small portion of data that this map has processed, which shrinks the content to be transferred and speeds things up. Each flush of the cache opens a new file, so by the time all of this task's input has been processed, there are several sorted, combined spill files. The system then gathers them and performs a multi-way merge to obtain a single sorted, combined intermediate file (strictly speaking, one per partition; ignoring partitioning for the moment, think of it as one...). This is the intermediate file that the reduce tasks dream of...
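The cache size and the spill threshold described above correspond to job configuration knobs of the Hadoop 1.x era; a hedged fragment with illustrative values (names from that era, not from the article, conf and Reduce as in the word-count example):

```java
// Size (MB) of the in-memory buffer that MapOutputBuffer.collect() writes into.
conf.setInt("io.sort.mb", 100);
// Fraction of the buffer at which the monitoring thread starts sorting,
// running the combiner, and spilling the contents to a new on-disk file.
conf.setFloat("io.sort.spill.percent", 0.80f);
// The combiner itself: a local "mini-reduce" applied to each spill before it is written.
conf.setCombinerClass(Reduce.class);
```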
Besides caching, merging and combining, the elaborate version of the output collector also does partitioning. Partitioning is done through the Partitioner<K2, V2> class, whose default implementation is HashPartitioner<K2, V2>; a job can customize it through JobConf's setPartitionerClass. What is partitioning for? In short, it splits a map task's output into several files (usually as many as there are reduce tasks), so that each reduce task only has to handle files of its own class. The benefit is large; an example makes it clear. Suppose a job is doing word counting, so the intermediate result of each map task should be a file keyed by word whose values are counts. If there is only one reduce task, all is well: it collects the files from all the map tasks and works out the final output. But if a single reduce task cannot carry the load, or is too slow, several reduce tasks have to run in parallel, and then the earlier model breaks: each reduce task receives output from only some of the map tasks, so its result cannot stand alone, because the same word may have been counted in different reduce tasks and still needs to be added up at the end, and the power of Map/Reduce is not being used at all. This is where partitioning comes in. Suppose there are two reduce tasks; the map output is split into two classes, one holding words from the top half of the alphabet and the other holding words from the bottom half. Each reduce task obtains its own class of intermediate files from all the map tasks and produces its own portion of the output; the final result is then just the concatenation of the reduce outputs. In essence, this turns the reduce input from a vertical split into a horizontal one. The role of the Partitioner is simply to accept a key-value pair and return the number of its partition; it does this just before the cache is flushed to a file, where it amounts to nothing more than choosing a different file name, with no other logic needing to change...
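As a sketch of the alphabetical split used in that example, here is what a custom partitioner could look like against the old mapred API. The class is hypothetical and not from the article; in practice the default HashPartitioner is usually sufficient:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sends words beginning with a-m to reduce task 0 and the rest to reduce task 1,
// falling back gracefully if the job runs with a different number of reduce tasks.
public class AlphabetPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // no configuration needed for this example
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String word = key.toString();
    char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
    int bucket = (first <= 'm') ? 0 : 1;
    return bucket % numPartitions;
  }
}
```

It would be registered with conf.setPartitionerClass(AlphabetPartitioner.class); every map task then partitions its spill files accordingly, and each reduce task pulls only its own partition from all of the map outputs.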
Besides caching, merging and partitioning, the elaborate output collector also supports skipping bad records, which will be covered in the section on distributed support below and is set aside for now...

V. Reduce Task Details

In theory, the execution of a reduce task is more complicated than that of a map task, because it must first collect its input files before it can process them. A reduce task has three steps: copy, sort, reduce (see ReduceTask's run method). The copy step collects data onto the local machine from the servers that executed the map tasks. The copying is done by the ReduceTask.ReduceCopier class, which has an inner class, MapOutputCopier, each instance of which is responsible, in its own thread, for copying the output files from one map task server. What is copied from a remote server (it can of course also be local...) is held as a MapOutput object, which may be serialized in memory or on disk, the choice being adjusted automatically according to memory usage. The whole copy process is dynamic; that is, it does not assume the input information is fixed once and for all. It keeps calling the getMapCompletionEvents method of the TaskUmbilicalProtocol to ask its parent TaskTracker about the completion status of this job's map tasks (the TaskTracker in turn asks the JobTracker and relays the answer...). Once the addresses of servers with completed map output are known, a thread is started to perform the actual copying. Meanwhile, an in-memory merge thread and an on-disk merge thread work alongside, merging and sorting the freshly downloaded files (which, as noted, may still be in memory...) ahead of time, so as to reduce the number of input files and lighten the load of the later sort...

The sort step is a continuation of that merging. It runs after all copying is finished, because even though merging was going on concurrently, there may be a tail of work left undone. After this step, it really is done: a single new file merging all the map task outputs exists, and the map output files so laboriously gathered from other servers promptly finish their historical mission and are swept away and deleted...
The best show is saved for last: the final stage of a reduce task is the reduce itself. It too prepares an OutputCollector, but unlike MapTask's, this one is simpler: it just opens a RecordWriter and writes once per collect call. The biggest difference is that the file system handed to this RecordWriter is basically the distributed file system, HDFS. On the input side, ReduceTask reads a series of customizable classes from JobConf, such as getMapOutputKeyClass, getMapOutputValueClass, and getOutputKeyComparator, to construct the key type required by the Reducer and the iterator over the values (one key here usually corresponds to a group of values). For the concrete machinery it is worth looking at Merger.MergeQueue, RawKeyValueIterator, ReduceTask.ReduceValuesIterator, and so on. With input and output in place, the user-defined Reducer is called over and over in a loop, and at last the reduce stage is complete...
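One practical consequence of those getMapOutputKeyClass/getMapOutputValueClass lookups: if the intermediate (map output) types differ from the job's final output types, they must be declared explicitly so that the reduce side can deserialize and sort what it has copied. A hedged fragment, not from the article (conf as in the word-count example, where the types happen to coincide and this would be unnecessary):

```java
// Declare the intermediate key/value types produced by the map tasks...
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
// ...and, separately, the types written by the reduce tasks to the final output.
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
```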
VI. Distributed Support

1. Keeping the Servers Correct

The cast of Map/Reduce servers is much like that of HDFS, so the life-saving measures are visibly similar; without further ado, here are the answers. The Map/Reduce client merely submits the job and then pulls up a bench to watch the show, holding no critical state, so if it crashes, no harm is done. The task servers must keep heartbeat contact with the job server at all times; if one runs into trouble, the job server hands the tasks running on it over to someone else. The job server, being the single point, recovers using checkpoints (the equivalent of the HDFS image) and a history log (the equivalent of the HDFS edit log). The content persisted for recovery includes the state of each job, the state of each of its tasks, and the state of each task attempt. With this, plus the dynamic re-registration of task servers, recovery is easy even if the whole nest has to be moved. JobHistory is the static class concerned with the history records. At heart it just writes logs, but in the Hadoop implementation it is an object-oriented encapsulation of log writing that also makes heavy use of the observer pattern, which makes it less than intuitive; in essence it opens several log files and writes to them through its many interfaces. These logs, moreover, are stored on the distributed file system, so there is no need for a SecondaryXxx standing by as in HDFS; it is good to have a giant's shoulders to stand on. The class used for recovery on the job server is JobTracker.RecoveryManager. When the job server starts, it calls its recover method to replay the contents of the log files. The steps are spelled out clearly in the comments; please read them for yourself...

2. Keeping Task Execution Correct and Fast

The execution of a whole job follows the barrel principle: the slowest map task and the slowest reduce task determine the system's overall completion time (unless, of course, their share of the whole run is so small as to be negligible...). So speeding up the slowest tasks as much as possible is critical. The strategy used is simple but not simplistic: execute a task more than once. When all the outstanding tasks have been assigned, the tasks that "got rich first" have already finished, and a task server is still tirelessly asking for work, the job server starts serving the leftovers: a task that is running slowly on one server is additionally assigned to a new server and executed at the same time. Both servers do their best, and whichever finishes first has its result accepted. This policy carries an implicit assumption: we believe the input split algorithm is fair, so if a task is slow, it is not because the task itself is unusually heavy, but because the server carrying it is struggling or about to give up; transplanted to a fresh environment, the same task may well thrive with half the effort...
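Speculative execution can be switched on or off per job; a hedged fragment using the old-API JobConf setters, not from the article (conf as in the word-count example):

```java
// Allow the job server to launch backup (speculative) attempts for tasks that run
// noticeably slower than their siblings; the first attempt to finish wins and the
// remaining attempts are killed.
conf.setMapSpeculativeExecution(true);
conf.setReduceSpeculativeExecution(true);
```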

Of course, there will always be the occasional task that chokes and cannot finish on any server. That tells us the problem lies not with the servers but with a defect in the task itself. Where is the defect? Every task of a job runs the same code; if the other tasks succeed and this one keeps failing, the problem evidently lies in its input: the input contains bad records that the program cannot handle. Once this is said, the remedy suggests itself: what cannot be cured can at least be avoided. MapTask.SkippingRecordReader<K, V> in MapTask and ReduceTask.SkippingReduceValuesIterator<KEY, VALUE> in ReduceTask exist to do exactly this. The principle is simple: before reading a record, the current position information is wrapped in a SortedRanges.Range object and handed to the TaskTracker through the task's reportNextRecordRange method; the TaskTracker places it in the TaskStatus object and reports it to the JobTracker with the next heartbeat. In this way the job server always knows which position each task is currently reading. When the task fails and is executed again, a set of SortedRanges is added to the assigned task information; as the new MapTask or ReduceTask reads, it checks these ranges, and if the current record lies inside one of these minefields it skips over it and moves on. The road is tortuous, but the future is bright...
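Hadoop exposes this skipping behaviour through the SkipBadRecords helper class; a hedged fragment with illustrative thresholds, not from the article (conf as in the word-count example):

```java
// Only fall back to skipping mode after an attempt has already failed this many times.
SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
// Upper bounds on how many records (map side) or key groups (reduce side)
// may be skipped around a bad record before the task is considered failed.
SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
SkipBadRecords.setReducerMaxSkipGroups(conf, 1);
```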

VII. Summary

For Map/Reduce, the real difficulty lies in making it broadly applicable: building an execution framework that can cure a hundred diseases. Hadoop has done this well, but only by understanding the whole flow can you help it do even better...
