Spark Read/Write Compressed File API Usage


I recently looked into how to read and write files in compressed formats in Spark. There are mainly the three approaches shown below, using LZO compression as the example.


    import com.hadoop.compression.lzo.LzopCodec
    import com.hadoop.mapred.DeprecatedLzoTextInputFormat
    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.output.{TextOutputFormat => NewTextOutputFormat}

    /******************* old Hadoop API *************************/
    val confHadoop = new JobConf()
    confHadoop.set("mapred.output.compress", "true")
    confHadoop.set("mapred.output.compression.codec", "com.hadoop.compression.lzo.LzopCodec")
    val textFile = sc.hadoopFile(args(0), classOf[DeprecatedLzoTextInputFormat],
      classOf[LongWritable], classOf[Text], 1)
    textFile.saveAsHadoopFile(args(1), classOf[LongWritable], classOf[Text],
      classOf[TextOutputFormat[LongWritable, Text]], confHadoop)

    /******************* new Hadoop API *************************/
    val job = new Job()
    job.setOutputFormatClass(classOf[NewTextOutputFormat[LongWritable, Text]])
    job.getConfiguration().set("mapred.output.compress", "true")
    job.getConfiguration().set("mapred.output.compression.codec", "com.hadoop.compression.lzo.LzopCodec")
    val newTextFile = sc.newAPIHadoopFile(args(0), classOf[LzoTextInputFormat],
      classOf[LongWritable], classOf[Text], job.getConfiguration())
    newTextFile.saveAsNewAPIHadoopFile(args(1), classOf[LongWritable], classOf[Text],
      classOf[NewTextOutputFormat[LongWritable, Text]], job.getConfiguration())

    /******************* textFile *************************/
    val lines = sc.textFile(args(0), 1)
    lines.saveAsTextFile(args(1), classOf[LzopCodec])

These three approaches cover the main file read/write APIs that Spark provides. The first uses Spark's wrapper for the legacy Hadoop API: the compression properties are set on a JobConf, and the InputFormat and OutputFormat classes are declared explicitly when reading and writing. The second uses the new Hadoop API and follows the same pattern as the first. The third, textFile/saveAsTextFile, is the simplest: you just specify a codec class when writing.

For Spark to read and write compressed files, some basic configuration is needed so that Spark can load the native libraries and JARs for the required compression format, for example:

    spark.executor.extraLibraryPath=/usr/lib/native
    spark.executor.extraClassPath=/usr/lib/hadoop/lib/hadoop-lzo.jar

Spark supports three ways of setting configuration properties. In order of priority, from lowest to highest: setting parameters in conf/spark-defaults.conf; passing parameters when submitting the program with spark-submit or spark-shell; and setting parameters inside the Spark program itself, either through the System.setProperty method or on a SparkConf object. If the same parameter is configured more than once, the value set by the highest-priority method wins.
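As a minimal sketch of the highest-priority method, the same executor properties can be set on a SparkConf before the context is created (the application name here is a hypothetical placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Properties set programmatically override spark-submit arguments
    // and conf/spark-defaults.conf.
    val conf = new SparkConf()
      .setAppName("LzoExample") // hypothetical application name
      .set("spark.executor.extraLibraryPath", "/usr/lib/native")
      .set("spark.executor.extraClassPath", "/usr/lib/hadoop/lib/hadoop-lzo.jar")
    val sc = new SparkContext(conf)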

The configuration above covers compression on the executors; the driver's compression-related properties must be supplied at submit time:

    --driver-class-path /usr/lib/hadoop/lib/hadoop-lzo.jar --driver-library-path /usr/lib/native
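Putting it together, a submit command might look like the sketch below. The main class, JAR name, and input/output paths are hypothetical, and the executor-side properties are assumed to come from conf/spark-defaults.conf as shown earlier:

    # Hypothetical class, JAR, and arguments
    spark-submit \
      --class com.example.LzoExample \
      --driver-class-path /usr/lib/hadoop/lib/hadoop-lzo.jar \
      --driver-library-path /usr/lib/native \
      lzo-example.jar /input/path /output/path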

When using Spark SQL, once the executor and driver compression properties are configured as above, compressed files under the Hive warehouse directory are read normally (the versions I tested were CDH 5.0.0 and Spark 1.0). If you also want the query output to be compressed, as it would be from Hive, you can set the compression-related properties through the hql() method, for example:

HQL ("Set Io.compression.codecs=com.hadoop.compression.lzo.lzocodec,com.hadoop.compression.lzo.lzopcodec") hql (" Set Io.compression.codec.lzo.class=com.hadoop.compression.lzo.lzocodec ") hql (" Set Mapred.output.compression.codec =com.hadoop.compression.lzo.lzopcodec ")


This article is from the "17 blog" blog; please keep the source: http://xiaowuliao.blog.51cto.com/3681673/1536527
