Spark External Datasets
Spark can create RDDs from any storage source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, and Amazon S3. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
1. An RDD of text lines can be created with SparkContext's textFile method. This method takes a file path URI as a parameter, reads the corresponding file line by line, and returns the lines as a collection of strings. For example:
scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
2. If a local file path is passed to the textFile method, the file must be accessible at the same path on all worker machines in the Spark cluster (for example, by copying it to every node or using a network-mounted shared file system).
3. In Spark, all file-based input methods (including textFile) support directories, compressed files, and wildcards. For example:
textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
4. The textFile method also accepts an optional second parameter that specifies the number of partitions of the resulting RDD. By default, Spark creates one partition per HDFS block of the file (say, M partitions). You may request more than M partitions, but not fewer.
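As a sketch of the partition parameter (the path and partition count here are illustrative, not from the original article):

scala> val logs = sc.textFile("hdfs:///data/logs.txt", 10)   // request at least 10 partitions
scala> logs.getNumPartitions                                 // inspect the actual partition count

Requesting more partitions than blocks can improve parallelism for CPU-heavy transformations, since each partition is processed by one task.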
5. In addition to the textFile method, Spark provides the following methods to load external data:
(1) SparkContext.wholeTextFiles
This method reads a directory containing many small files and returns each one as a (filename, content) pair, with the file name as the key and the file's entire content as the value. This contrasts with textFile, which returns one record per line of each file.
(2) SparkContext's sequenceFile[K, V]
For SequenceFiles, use the sequenceFile[K, V] method to load external data, where K and V are the key and value types in the file. These should be subclasses of Hadoop's Writable interface, such as IntWritable and Text.
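A short sketch, assuming a SequenceFile of Int keys and String values at a hypothetical path (Spark's implicit converters map common Writables such as IntWritable and Text to the corresponding native Scala types):

scala> val pairs = sc.sequenceFile[Int, String]("hdfs:///data/pairs.seq")
scala> pairs.take(5)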
(3) SparkContext.hadoopRDD
For other Hadoop InputFormats, use the hadoopRDD method to load external data sources. This method takes a configured JobConf, the InputFormat class, the key class, and the value class.
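As an illustration, TextInputFormat (which yields LongWritable byte offsets as keys and Text lines as values) can be loaded this way; the input path is hypothetical:

scala> import org.apache.hadoop.mapred.{JobConf, TextInputFormat, FileInputFormat}
scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> val conf = new JobConf()
scala> FileInputFormat.setInputPaths(conf, "hdfs:///data/logs.txt")   // set the input path on the JobConf
scala> val rdd = sc.hadoopRDD(conf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])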
6. A simple way to save an RDD:
RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
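A round-trip sketch (the output path is illustrative; note that objectFile requires the element type as a type parameter when reading back):

scala> val nums = sc.parallelize(1 to 100)
scala> nums.saveAsObjectFile("hdfs:///tmp/nums")        // writes serialized Java objects
scala> val restored = sc.objectFile[Int]("hdfs:///tmp/nums")
scala> restored.count()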