Spark External Datasets
Spark can create RDDs from any storage source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, and Amazon S3. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
1. An RDD of text lines can be created with SparkContext's textFile method. This method takes a file path URI as a parameter, reads the corresponding file line by line, and returns the lines as a collection of strings. For example:
scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
2. If a local file path is passed to the textFile method, the file must be accessible at the same path on all worker machines in the Spark cluster (for example, by copying it to every node or using a network-mounted shared file system).
3. In Spark, all file-based input methods (including textFile) support directories, compressed files, and wildcards. For example:
textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
4. The textFile method also accepts an optional second parameter that specifies the number of partitions of the resulting RDD. By default, Spark creates one partition per HDFS block of the file (say, M partitions). You may request more than M partitions, but not fewer.
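As a sketch of the partition parameter (the path and partition count here are illustrative, not from the original article):

scala> val logs = sc.textFile("hdfs:///data/logs.txt", 10)   // request at least 10 partitions
scala> logs.getNumPartitions                                 // inspect the actual partition count

Requesting more partitions than blocks can improve parallelism for CPU-heavy transformations, since each partition is processed by one task.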
5. In addition to the textFile method, Spark provides the following methods to load external data:
(1) SparkContext.wholeTextFiles
This method reads a directory containing many small files and returns each one as a (filename, content) pair, with the file name as the key and the file's entire content as the value. This contrasts with textFile, which returns one record per line of each file.
(2) SparkContext's sequenceFile[K, V]
For SequenceFiles, use the sequenceFile[K, V] method to load external data, where K and V are the key and value types in the file. These should be subclasses of Hadoop's Writable interface, such as IntWritable and Text.
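A short sketch, assuming a SequenceFile of Int keys and String values at a hypothetical path (Spark's implicit converters map common Writables such as IntWritable and Text to the corresponding native Scala types):

scala> val pairs = sc.sequenceFile[Int, String]("hdfs:///data/pairs.seq")
scala> pairs.take(5)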
(3) SparkContext.hadoopRDD
For other Hadoop InputFormats, use the hadoopRDD method to load external data sources. This method takes a configured JobConf, the InputFormat class, the key class, and the value class.
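As an illustration, TextInputFormat (which yields LongWritable byte offsets as keys and Text lines as values) can be loaded this way; the input path is hypothetical:

scala> import org.apache.hadoop.mapred.{JobConf, TextInputFormat, FileInputFormat}
scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> val conf = new JobConf()
scala> FileInputFormat.setInputPaths(conf, "hdfs:///data/logs.txt")   // set the input path on the JobConf
scala> val rdd = sc.hadoopRDD(conf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])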
6. A simple way to save an RDD:
RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
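A round-trip sketch (the output path is illustrative; note that objectFile requires the element type as a type parameter when reading back):

scala> val nums = sc.parallelize(1 to 100)
scala> nums.saveAsObjectFile("hdfs:///tmp/nums")        // writes serialized Java objects
scala> val restored = sc.objectFile[Int]("hdfs:///tmp/nums")
scala> restored.count()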