Why Hive RCFile merge jobs produce duplicate data

Source: Internet
Author: User
Tags: commit, file size, log, stack trace

A few days ago, a DW user reported that inserting data into an RCFile table with "INSERT OVERWRITE TABLE ... PARTITION (xx) SELECT ..." produced duplicate files. Looking at the job log, we found that map task 000005 had two task attempts, the second being a speculative execution. Both attempts renamed their temp files to final files in the task's close() function, rather than committing through the MapReduce framework's two-phase commit protocol, in which a task commits only when the TaskTracker receives a CommitTaskAction, guaranteeing that only one attempt's result becomes the official one.

The output in the task log is as follows:

attempt_201304111550_268224_m_000005_0 renamed path hdfs://10.2.6.102/tmp/hive-deploy/hive_2013-05-30_10-13-59_124_8643833043783438119/_task_tmp.-ext-10000/hp_cal_month=2013-04/_tmp.000005_0 to hdfs://10.2.6.102/tmp/hive-deploy/hive_2013-05-30_10-13-59_124_8643833043783438119/_tmp.-ext-10000/hp_cal_month=2013-04/000005_0. File size is 666922
attempt_201304111550_268234_m_000005_1 renamed path hdfs://10.2.6.102/tmp/hive-deploy/hive_2013-05-30_10-13-59_124_8643833043783438119/_task_tmp.-ext-10000/hp_cal_month=2013-04/_tmp.000005_1 to hdfs://10.2.6.102/tmp/hive-deploy/hive_2013-05-30_10-13-59_124_8643833043783438119/_tmp.-ext-10000/hp_cal_month=2013-04/000005_1. File size is 666922
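Both renames above happen in the mapper's own close() method, so every attempt that reaches close() publishes a file. By contrast, when output goes through the framework's OutputCommitter, each attempt writes into its own work directory and the file is promoted only for the single attempt the TaskTracker tells to commit. The following is a minimal sketch of that side-file pattern using the old mapred API; the class name SideFileHelper is made up here for illustration and is not Hive code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SideFileHelper {
  // Returns a path inside the task attempt's work directory
  // (under _temporary/_attempt_...). Files written here are moved to the
  // final output directory by the OutputCommitter only for the attempt
  // that is told to commit, so a speculative attempt cannot leave a
  // duplicate behind.
  public static Path sideFilePath(JobConf conf, String fileName) {
    Path workDir = FileOutputFormat.getWorkOutputPath(conf);
    return new Path(workDir, fileName);
  }
}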

In fact, this Hive statement launches only one job ("Launching Job 1 out of 1"). When the first job ends, a conditional task analyzes the average file size under each partition: if it is less than hive.merge.smallfiles.avgsize (default 16MB), the first job was a map-only job, and hive.merge.mapfiles is enabled (default true), a second merge job is launched, and RCFileMergeMapper merges the small files generated earlier. There are two workarounds. One is to turn off speculative execution, though a slow task may then become a bottleneck. The other is to disable the merge job (set hive.merge.mapfiles=false); RCFileMergeMapper is then never used, but this leaves a large number of small files that never get merged.
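For illustration, here is a rough sketch of the average-size check described above; it is not Hive's conditional-task code, and the class name, partition directory argument, and threshold argument are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AvgFileSizeCheck {
  // Returns true if the average file size under partDir is below the
  // threshold (e.g. 16 * 1024 * 1024 for the 16MB default), i.e. a merge
  // job would be worth launching.
  public static boolean shouldMerge(Configuration conf, Path partDir,
      long avgSizeThreshold) throws java.io.IOException {
    FileSystem fs = partDir.getFileSystem(conf);
    FileStatus[] files = fs.listStatus(partDir);
    if (files == null || files.length == 0) {
      return false;
    }
    long total = 0;
    for (FileStatus f : files) {
      total += f.getLen();
    }
    return (total / files.length) < avgSizeThreshold;
  }
}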
If query planning decides to start the merge job, it creates a BlockMergeTask (which inherits from Task) and runs its execute method. That method sets the corresponding JobConf parameters, such as mapred.mapper.class and hive.rcfile.merge.output.dir, and then creates a JobClient and calls submitJob. The map-side logic lives in RCFileMergeMapper, which extends MapReduceBase, the abstract base class of the old mapred API, and overrides the configure and close methods. The rename operation mentioned earlier happens in close: the run method of MapRunner loops over the input, calling the mapper's map method, and finally calls the mapper's close method:

public void close() throws IOException {
  // close writer
  if (outWriter == null) {
    return;
  }

  outWriter.close();
  outWriter = null;

  if (!exception) {
    FileStatus fss = fs.getFileStatus(outPath);
    LOG.info("renamed path " + outPath + " to " + finalPath
        + ". File size is " + fss.getLen());
    // every attempt that reaches this point renames its own temp file,
    // including a speculative attempt
    if (!fs.rename(outPath, finalPath)) {
      throw new IOException("Unable to rename output to " + finalPath);
    }
  } else {
    if (!autoDelete) {
      fs.delete(outPath, true);
    }
  }
}
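The driver side described earlier, setting the JobConf parameters and submitting through JobClient, roughly reduces to the following sketch. MergeJobDriverSketch is a made-up name, and the body is simplified from the description above rather than copied from BlockMergeTask:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class MergeJobDriverSketch {
  // Configure a map-only merge job and submit it. The two property names
  // are the ones mentioned above; the mapper reads
  // hive.rcfile.merge.output.dir in its configure() method.
  public static RunningJob submitMergeJob(JobConf job, Class<?> mapperClass,
      String mergeOutputDir) throws java.io.IOException {
    job.set("mapred.mapper.class", mapperClass.getName());
    job.set("hive.rcfile.merge.output.dir", mergeOutputDir);
    job.setNumReduceTasks(0); // map-only job
    JobClient jc = new JobClient(job);
    return jc.submitJob(job);
  }
}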

After the job completes, different attempts of the same task may have produced result files at the same time, but Hive obviously takes this into account: RCFileMergeMapper.jobClose is called after the merge job finishes. It backs up the output directory, moves the merged data into the output directory, and calls the Utilities.removeTempOrDuplicateFiles method to delete duplicate files. The deletion logic extracts the task ID from the file name; if the same task ID has two files, the smaller one is deleted. However, in version 0.9, RCFileMergeMapper does not support the case where the target table is a dynamically partitioned table, so duplicate files remain. The patch at https://issues.apache.org/jira/browse/HIVE-3149?attachmentOrder=asc solves the problem.
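The duplicate-removal rule (keep the larger of two files that share a task ID) can be sketched as follows. DuplicateAttemptCleaner and the file-name parsing are illustrative assumptions, not the actual Utilities.removeTempOrDuplicateFiles implementation:

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DuplicateAttemptCleaner {
  // Assumes file names like "000005_0", where the part before the
  // underscore identifies the task. When two attempts of the same task
  // left files behind, keep the larger one and delete the other.
  public static void removeDuplicates(FileSystem fs, Path dir)
      throws java.io.IOException {
    Map<String, FileStatus> best = new HashMap<String, FileStatus>();
    FileStatus[] files = fs.listStatus(dir);
    if (files == null) {
      return;
    }
    for (FileStatus f : files) {
      String name = f.getPath().getName();
      int idx = name.indexOf('_');
      String taskId = idx > 0 ? name.substring(0, idx) : name;
      FileStatus kept = best.get(taskId);
      if (kept == null) {
        best.put(taskId, f);
      } else if (f.getLen() > kept.getLen()) {
        fs.delete(kept.getPath(), false); // delete the smaller duplicate
        best.put(taskId, f);
      } else {
        fs.delete(f.getPath(), false);
      }
    }
  }
}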

The final processing logic sits in the finally block at the end of the merge task's execute method. The original source catches the exception there without doing anything with it, so I added some stack trace output and set the return value:

finally {
  try {
    if (ctxCreated) {
      ctx.clear();
    }
    if (rj != null) {
      if (returnVal != 0) {
        rj.killJob();
      }
      HadoopJobExecHelper.runningJobKillURIs.remove(rj.getJobID());
      jobID = rj.getID().toString();
    }
    RCFileMergeMapper.jobClose(outputPath, success, job, console,
        work.getDynPartCtx());
  } catch (Exception e) {
    // added: print the stack trace and mark the task as failed
    // instead of silently swallowing the exception
    console.printError("RCFile Merger Job Close Error", "\n"
        + org.apache.hadoop.util.StringUtils.stringifyException(e));
    e.printStackTrace(System.err);
    success = false;
    returnVal = -500;
  }
}
