This article describes how to create a Spark RDD.
1. Parallelizing a collection (created from a Scala collection): a local Scala collection --> Spark RDD
spark-shell --master spark://hadoop01:7077
scala> val arr = Array(1,2,3,4,5)
scala> val rdd = sc.parallelize(arr)
scala> val rdd = sc.makeRDD(arr)
scala> rdd.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5)
Creating an RDD by parallelizing a collection is suitable for local testing and experiments.
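As a minimal sketch (run inside spark-shell, where sc is already provided), parallelize also accepts an optional number of partitions as a second argument:
val arr = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(arr, 3)   // spread the collection across 3 partitions
rdd.partitions.length              // Int = 3
rdd.collect                        // Array(1, 2, 3, 4, 5)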
2. From an external file system, such as HDFS
Read a file from HDFS:
val rdd2 = sc.textFile("hdfs://hadoop01:9000/words.txt")
Read a local file:
val rdd2 = sc.textFile("file:///root/words.txt")
scala> val rdd2 = sc.textFile("file:///root/words.txt")
scala> rdd2.collect
res2: Array[String] = Array(hadoop hbase java, hbase java spark, java, hadoop hive hive, hive hbase)
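textFile also accepts a minimum number of partitions as a second argument. A small sketch, assuming the same words.txt as above:
val rdd2 = sc.textFile("hdfs://hadoop01:9000/words.txt", 4)   // read with at least 4 partitions
rdd2.partitions.length                                        // >= 4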
3. Transforming a parent RDD into a new child RDD
Calling a transformation method on an existing RDD generates a new RDD.
Whenever a transformation operator is called, a new RDD is generated; the element type of the new RDD is determined by the return type of the function passed to the operator.
Note: action operators do not generate new RDDs.
scala> val rdd = sc.parallelize(arr)
scala> rdd.collect
res3: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val rdd2 = rdd.map(_*100)
scala> rdd2.collect
res4: Array[Int] = Array(100, 200, 300, 400, 500)
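Transformations can be chained and remain lazy until an action is called. A minimal sketch, assuming the rdd2 built above (100, 200, 300, 400, 500):
val rdd3 = rdd2.filter(_ > 200).map(_ + 1)   // transformations only: no job runs yet
rdd3.collect                                 // Array(301, 401, 501) -- action triggers the computation
rdd3.count                                   // Long = 3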
In Spark, the methods defined on RDDs are referred to by a specific term: operators.