Spark Read/Write Compressed File API Usage


I recently looked into how to read and write files in compressed formats in Spark. There are mainly the three approaches shown below, using LZO compression as the example.


    import com.hadoop.compression.lzo.LzopCodec
    import com.hadoop.mapred.DeprecatedLzoTextInputFormat
    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.output.{TextOutputFormat => NewTextOutputFormat}

    /******************* old Hadoop API *************************/
    val confHadoop = new JobConf()
    confHadoop.set("mapred.output.compress", "true")
    confHadoop.set("mapred.output.compression.codec", "com.hadoop.compression.lzo.LzopCodec")
    val textFile = sc.hadoopFile(args(0), classOf[DeprecatedLzoTextInputFormat],
      classOf[LongWritable], classOf[Text], 1)
    textFile.saveAsHadoopFile(args(1), classOf[LongWritable], classOf[Text],
      classOf[TextOutputFormat[LongWritable, Text]], confHadoop)

    /******************* new Hadoop API *************************/
    val job = new Job()
    job.setOutputFormatClass(classOf[NewTextOutputFormat[LongWritable, Text]])
    job.getConfiguration().set("mapred.output.compress", "true")
    job.getConfiguration().set("mapred.output.compression.codec", "com.hadoop.compression.lzo.LzopCodec")
    val newTextFile = sc.newAPIHadoopFile(args(0), classOf[LzoTextInputFormat],
      classOf[LongWritable], classOf[Text], job.getConfiguration())
    newTextFile.saveAsNewAPIHadoopFile(args(1), classOf[LongWritable], classOf[Text],
      classOf[NewTextOutputFormat[LongWritable, Text]], job.getConfiguration())

    /******************* textFile *************************/
    val lines = sc.textFile(args(0), 1)
    lines.saveAsTextFile(args(1), classOf[LzopCodec])

These three approaches cover the main file read/write APIs that Spark provides. The first uses Spark's wrapper for the legacy Hadoop API: the compression properties are set on a JobConf, and the InputFormat and OutputFormat classes are declared explicitly when reading and writing. The second uses the new Hadoop API and follows the same pattern as the first. The third, textFile/saveAsTextFile, is the simplest: you just specify a codec class when writing.

For Spark to read and write compressed files, some basic configuration is needed so that Spark can load the native libraries and JARs for the required compression format, for example:

    spark.executor.extraLibraryPath=/usr/lib/native
    spark.executor.extraClassPath=/usr/lib/hadoop/lib/hadoop-lzo.jar

Spark supports three ways of setting configuration properties. In order of priority, from lowest to highest: setting parameters in conf/spark-defaults.conf; passing parameters when submitting the program with spark-submit or spark-shell; and setting parameters inside the Spark program itself, either through the System.setProperty method or on a SparkConf object. If the same parameter is configured more than once, the value set by the highest-priority method wins.
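As a minimal sketch of the highest-priority method, the same executor properties can be set on a SparkConf before the context is created (the application name here is a hypothetical placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Properties set programmatically override spark-submit arguments
    // and conf/spark-defaults.conf.
    val conf = new SparkConf()
      .setAppName("LzoExample") // hypothetical application name
      .set("spark.executor.extraLibraryPath", "/usr/lib/native")
      .set("spark.executor.extraClassPath", "/usr/lib/hadoop/lib/hadoop-lzo.jar")
    val sc = new SparkContext(conf)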

The configuration above covers compression on the executors; the driver's compression-related properties must be supplied at submit time:

    --driver-class-path /usr/lib/hadoop/lib/hadoop-lzo.jar --driver-library-path /usr/lib/native
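Putting it together, a submit command might look like the sketch below. The main class, JAR name, and input/output paths are hypothetical, and the executor-side properties are assumed to come from conf/spark-defaults.conf as shown earlier:

    # Hypothetical class, JAR, and arguments
    spark-submit \
      --class com.example.LzoExample \
      --driver-class-path /usr/lib/hadoop/lib/hadoop-lzo.jar \
      --driver-library-path /usr/lib/native \
      lzo-example.jar /input/path /output/path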

When using Spark SQL, once the executor and driver compression properties are configured as above, compressed files under the Hive warehouse directory are read normally (the versions I tested were CDH 5.0.0 and Spark 1.0). If you also want the query output to be compressed, as it would be from Hive, you can set the compression-related properties through the hql() method, for example:

HQL ("Set Io.compression.codecs=com.hadoop.compression.lzo.lzocodec,com.hadoop.compression.lzo.lzopcodec") hql (" Set Io.compression.codec.lzo.class=com.hadoop.compression.lzo.lzocodec ") hql (" Set Mapred.output.compression.codec =com.hadoop.compression.lzo.lzopcodec ")


This article is from the "17 blog" blog; please keep the source: http://xiaowuliao.blog.51cto.com/3681673/1536527
