Our Hadoop production environment runs two versions, one of which is 1.0.3; to support log compression and splitting, we back-ported the BZIP2 compression feature from hadoop-1.2 into it. Everything worked well.
To meet the company's needs for iterative computation (complex Hive SQL, ad recommendation algorithms, machine learning, etc.), we built our own Spark cluster: initially standalone mode, version spark-0.9.1, with Shark support.
Once it went online, the problems followed. The deadliest: when Shark processed Hadoop bzip2 files, the results were usually off, sometimes badly so (for example, counting a log of some 150 million (15kw) rows with Shark returned only about 30 million (3kw) rows).
Clearly, some part of the Shark + Hive + Spark + Hadoop stack had a bug. It was the first time I had faced such a complicated system, and it gave me a headache.
So I plunged in: deployed a Shark + Hive + Spark + Hadoop development environment, debugged, and examined each link in the chain (I also combed through the Spark core source code along the way), but never found the problem.
Later, at a Spark technology conference, while talking with peers, it suddenly hit me: Spark tasks run with thread-level concurrency, while Hadoop MapReduce tasks run with process-level concurrency. Could BZIP2 have a thread-safety problem?
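To see why the distinction matters, here is a minimal sketch (with hypothetical class and method names, not the real Hadoop code) of the failure mode: a `static` flag lives once per JVM, so under Spark's thread-level concurrency every task in the same executor JVM shares it, whereas under MapReduce's process-level concurrency each task gets its own JVM and its own copy.

```java
// Hypothetical sketch: a JVM-wide static flag poisoning a sibling task.
public class StaticFlagDemo {
    // Mimics the buggy field: one copy shared by ALL instances in the JVM.
    static boolean skipDecompression = false;

    // Task A behaves like a helper that sets the flag as a side channel
    // before constructing its own stream.
    static void taskA() {
        skipDecompression = true;
    }

    // Task B merely reads the flag when it opens its own stream. In a
    // separate process it would see false; in a shared JVM it sees true
    // and skips real decompression, silently dropping rows.
    static boolean taskB() {
        return skipDecompression;
    }

    public static boolean simulate() throws InterruptedException {
        Thread a = new Thread(StaticFlagDemo::taskA);
        a.start();
        a.join(); // A finishes before B even starts: no data race needed,
                  // the shared static state alone is enough to corrupt B.
        return taskB();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("task B sees skipDecompression = " + simulate());
    }
}
```

Note that the bug does not even require a racy interleaving: plain sequential use from two tasks in one JVM already leaks state from one stream to the next.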
Back home, I checked the BZip2Codec-related code and finally found where the problem was (it was 3 a.m., no Eclipse, just vim). I couldn't wait to recompile Hadoop and Spark and test: the BZIP2 results came out correct!
I've been busy lately, and submitting a patch to the community requires a lengthy process, so it has not yet been submitted upstream. The specific patch follows; if peers hit a similar problem, feel free to use it as a reference.
```
Index: src/core/org/apache/hadoop/io/compress/bzip2/CBZip2InputStream.java
===================================================================
--- src/core/org/apache/hadoop/io/compress/bzip2/CBZip2InputStream.java (version 525)
+++ src/core/org/apache/hadoop/io/compress/bzip2/CBZip2InputStream.java (version 510)
@@ -129,9 +129,7 @@
   int computedBlockCRC, computedCombinedCRC;

   private boolean skipResult = false; // used by skipToNextMarker
-  // modified by jicheng.song
-  // private static boolean skipDecompression = false;
-  private boolean skipDecompression = false;
+  private static boolean skipDecompression = false;

   // Variables used by setup* methods exclusively
@@ -317,18 +315,13 @@
   * @throws IOException
   *
   */
-  // modified by jicheng.song
-  // public static long numberOfBytesTillNextMarker(final InputStream in) throws IOException {
-  public long numberOfBytesTillNextMarker(final InputStream in) throws IOException {
-    this.skipDecompression = true;
-    this.in = new BufferedInputStream(in, 1024 * 9); // >1 MB buffer
-    this.readMode = readMode;
-    // CBZip2InputStream anObject = null;
+  public static long numberOfBytesTillNextMarker(final InputStream in) throws IOException {
+    CBZip2InputStream.skipDecompression = true;
+    CBZip2InputStream anObject = null;
-    // anObject = new CBZip2InputStream(in, READ_MODE.BYBLOCK);
+    anObject = new CBZip2InputStream(in, READ_MODE.BYBLOCK);
-    return this.getProcessedByteCount();
+    return anObject.getProcessedByteCount();
   }

   public CBZip2InputStream(final InputStream in) throws IOException {
@@ -402,9 +395,7 @@
     if (skipDecompression) {
       changeStateToProcessABlock();
-      // modified by jicheng.song
-      // CBZip2InputStream.skipDecompression = false;
-      this.skipDecompression = false;
+      CBZip2InputStream.skipDecompression = false;
     }
     final int hi = offs + len;
```
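Stripped of diff noise, the essence of the fix is moving the flag from class scope to instance scope, so each stream (and therefore each task thread) carries its own state. A hypothetical sketch of the resulting pattern (illustrative names, not the actual Hadoop class):

```java
// Hypothetical sketch of the thread-safe pattern: per-instance state
// instead of a JVM-wide static flag.
public class PerInstanceFlagDemo {
    private boolean skipDecompression = false; // one flag per stream instance

    public void markSkip() {
        skipDecompression = true;
    }

    public boolean skips() {
        return skipDecompression;
    }

    public static void main(String[] args) {
        PerInstanceFlagDemo a = new PerInstanceFlagDemo();
        PerInstanceFlagDemo b = new PerInstanceFlagDemo();
        a.markSkip(); // only stream a is affected
        System.out.println("a skips = " + a.skips());
        System.out.println("b skips = " + b.skips()); // b keeps its own flag
    }
}
```

With instance fields, concurrent tasks in one executor JVM can no longer clobber each other's decompression state, which matches the behavior the patch restores for Spark.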
In short: processing bzip2 files with Spark in production exposed a thread-safety bug in Hadoop's BZIP2 codec.