Spark External Datasets


Spark can create RDDs from any storage source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, and Amazon S3. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

1. A text-file RDD can be created through SparkContext's textFile method. This method takes a file path URI as a parameter and reads the file line by line, producing a collection of lines. For example:

scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08

2. If you pass a local file path to the textFile method, you must ensure that the file is accessible at the same path on all worker nodes in the Spark cluster, for example by copying the file to each node or by using a network-mounted shared file system.

3. In Spark, all file-based input methods (including textFile) support directories, compressed files, and wildcards. For example:

textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

4. The textFile method also accepts an optional second parameter, which specifies the number of partitions of the resulting RDD. By default, Spark creates one partition per HDFS block of the file, so a file spanning M blocks yields M partitions. You can request more partitions than M, but not fewer. For example:
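A minimal sketch of both forms (data.txt is a placeholder file name):

scala> val lines = sc.textFile("data.txt")       // default: one partition per HDFS block
scala> val lines8 = sc.textFile("data.txt", 8)   // request at least 8 partitions
scala> lines8.partitions.length                  // returns 8 or more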

5. In addition to the textFile method, Spark provides the following methods to load external data:

(1) SparkContext.wholeTextFiles

This method reads all the small files under a path and returns them as (filename, content) pairs: each file's path is the key and its entire content is the value. This contrasts with textFile, which returns one record per line of each file. For example:
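A sketch of the difference between the two methods, assuming /my/directory holds several small text files:

scala> val pairs = sc.wholeTextFiles("/my/directory")  // RDD[(String, String)]: (file path, file content)
scala> pairs.keys.collect().foreach(println)           // one entry per file
scala> val lines = sc.textFile("/my/directory")        // RDD[String]: one entry per line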

(2) SparkContext.sequenceFile[K, V]

For SequenceFiles, use the sequenceFile[K, V] method to load the data, where K and V are the types of the keys and values in the file. These should be subclasses of Hadoop's Writable interface, such as IntWritable and Text. For example:
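A sketch for a SequenceFile whose keys are IntWritable and values are Text (the path is a placeholder):

scala> import org.apache.hadoop.io.{IntWritable, Text}
scala> val sf = sc.sequenceFile("/my/seqfile", classOf[IntWritable], classOf[Text])
scala> // Hadoop reuses Writable objects, so convert to plain Scala types before caching
scala> sf.map { case (k, v) => (k.get, v.toString) }.take(5)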

(3) SparkContext.hadoopRDD

For other Hadoop InputFormats, you can use the hadoopRDD method to load external data sources. This method takes a JobConf together with the input format class, the key class, and the value class. For example:
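A sketch that reads plain text through the classic mapred API (the path is illustrative; any InputFormat works the same way):

scala> import org.apache.hadoop.mapred.{JobConf, TextInputFormat, FileInputFormat}
scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> val conf = new JobConf()
scala> FileInputFormat.setInputPaths(conf, "/my/directory")
scala> val rdd = sc.hadoopRDD(conf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
scala> rdd.map { case (_, line) => line.toString }.take(3)  // key is the byte offset, value is the line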

6. A simple way to save an RDD:

RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD. For example:
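A minimal round trip (the output path is a placeholder and must not already exist):

scala> val nums = sc.parallelize(1 to 100)
scala> nums.saveAsObjectFile("/my/objfile")              // writes serialized Java objects
scala> val restored = sc.objectFile[Int]("/my/objfile")  // reads them back as RDD[Int]
scala> restored.count()                                  // res0: Long = 100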
