I recently looked into how to read and write compressed files in Spark. There are three main approaches, illustrated below using LZO compression as an example.
/******************* old Hadoop API *******************/
import com.hadoop.mapred.DeprecatedLzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

val confHadoop = new JobConf()
confHadoop.set("mapred.output.compress", "true")
confHadoop.set("mapred.output.compression.codec", "com.hadoop.compression.lzo.LzopCodec")
val textFile = sc.hadoopFile(args(0), classOf[DeprecatedLzoTextInputFormat],
  classOf[LongWritable], classOf[Text], 1)
textFile.saveAsHadoopFile(args(1), classOf[LongWritable], classOf[Text],
  classOf[TextOutputFormat[LongWritable, Text]], confHadoop)

/******************* new Hadoop API *******************/
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val job = new Job()
job.setOutputFormatClass(classOf[TextOutputFormat[LongWritable, Text]])
job.getConfiguration().set("mapred.output.compress", "true")
job.getConfiguration().set("mapred.output.compression.codec", "com.hadoop.compression.lzo.LzopCodec")
val textFile = sc.newAPIHadoopFile(args(0), classOf[LzoTextInputFormat],
  classOf[LongWritable], classOf[Text], job.getConfiguration())
textFile.saveAsNewAPIHadoopFile(args(1), classOf[LongWritable], classOf[Text],
  classOf[TextOutputFormat[LongWritable, Text]], job.getConfiguration())

/******************* textFile *******************/
import com.hadoop.compression.lzo.LzopCodec

val textFile = sc.textFile(args(0), 1)
textFile.saveAsTextFile(args(1), classOf[LzopCodec])
These three approaches cover essentially all of the major file read/write APIs that Spark provides. The first uses the legacy Hadoop API exposed by Spark: the compression properties are configured on a JobConf, and the InputFormat and OutputFormat are declared when reading and writing. The second uses the new Hadoop API and works much the same way. The third, saveAsTextFile, is the simplest: you just pass the codec class when writing.
For Spark to read and write compressed files, some basic configuration is needed so that Spark can load the native libraries and jars required by the compression format, for example:
spark.executor.extraLibraryPath=/usr/lib/native
spark.executor.extraClassPath=/usr/lib/hadoop/lib/hadoop-lzo.jar
Spark supports three ways of setting configuration properties; from lowest to highest priority they are: configuring the parameters in conf/spark-defaults.conf, passing them when submitting the program with spark-submit or spark-shell, and setting them inside the Spark program, either through System.setProperty or on the SparkConf object. If the same parameter is configured more than once, the value set in the highest-priority way wins; a programmatic example is sketched below.
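For illustration, here is a minimal sketch of the highest-priority option: setting the same two properties on a SparkConf before creating the SparkContext. The application name is made up; the property names and paths are the ones used above.

// Minimal sketch: configure the executor library path and classpath programmatically
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("LzoExample") // hypothetical application name
  .set("spark.executor.extraLibraryPath", "/usr/lib/native")
  .set("spark.executor.extraClassPath", "/usr/lib/hadoop/lib/hadoop-lzo.jar")
val sc = new SparkContext(conf)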
The configuration above only covers compression for the executors; the corresponding properties for the driver need to be passed at submit time:
--driver-class-path /usr/lib/hadoop/lib/hadoop-lzo.jar --driver-library-path /usr/lib/native
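Putting the executor and driver settings together, a submit command might look like the following sketch. The main class, application jar, and input/output paths are placeholders; the compression-related options are the ones described above. This assumes your spark-submit supports the --conf flag; otherwise put those two executor properties in conf/spark-defaults.conf instead.

spark-submit \
  --class com.example.LzoExample \
  --conf spark.executor.extraLibraryPath=/usr/lib/native \
  --conf spark.executor.extraClassPath=/usr/lib/hadoop/lib/hadoop-lzo.jar \
  --driver-class-path /usr/lib/hadoop/lib/hadoop-lzo.jar \
  --driver-library-path /usr/lib/native \
  lzo-example.jar /input/path /output/path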
When using Spark SQL, once the compression-related properties are configured for the executors and the driver, compressed files in the Hive warehouse directory can be read normally (I tested on CDH5.0.0 with Spark 1.0). If you also want the output produced by the Hive computation to be compressed, you can set the compression-related properties through the hql() method, for example:
HQL ("Set Io.compression.codecs=com.hadoop.compression.lzo.lzocodec,com.hadoop.compression.lzo.lzopcodec") hql (" Set Io.compression.codec.lzo.class=com.hadoop.compression.lzo.lzocodec ") hql (" Set Mapred.output.compression.codec =com.hadoop.compression.lzo.lzopcodec ")