Commonly Used Spark RDD Operators

Source: Internet
Author: User
This tutorial walks through commonly used RDD operators in Spark, using short exercises that can be run in spark-shell.

Start spark-shell to test:
spark-shell --master spark://node1:7077
Exercise 1: map, filter
//Create an RDD by parallelizing a local collection
val rdd1 = sc.parallelize(List(5, 6, 4, 7, 3, 8, 2, 9, 1, 10))
//Multiply each element in rdd1 by 2, then sort in ascending order
val rdd2 = rdd1.map(_ * 2).sortBy(x => x, true)
//Keep only the elements greater than or equal to 5
val rdd3 = rdd2.filter(_ >= 5)
//Return the elements to the driver as an array
rdd3.collect
Exercise 2: flatMap
val rdd1 = sc.parallelize(Array("a b c", "d e f", "h i j"))
//Split each element of rdd1 on spaces, then flatten the results into a single RDD of words
val rdd2 = rdd1.flatMap(_.split(" "))
rdd2.collect
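The difference between map and flatMap is easiest to see by counting elements. The following is a minimal sketch, assuming the same spark-shell session (where sc is predefined); the names lines, nested, and flat are illustrative and not part of the original exercise.

```scala
// Assumes a spark-shell session, where `sc` (the SparkContext) is predefined.
val lines = sc.parallelize(Array("a b c", "d e f", "h i j"))

// map keeps one output element per input element: an RDD[Array[String]]
val nested = lines.map(_.split(" "))

// flatMap splits each line and flattens the pieces: an RDD[String]
val flat = lines.flatMap(_.split(" "))

val nestedCount = nested.count() // 3 arrays, one per input line
val flatCount = flat.count()     // 9 individual words
```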
Exercise 3: union, intersection, distinct
val rdd1 = sc.parallelize(List(5, 6, 4, 3))
val rdd2 = sc.parallelize(List(1, 2, 3, 4))
//Compute the union (duplicates are kept)
val rdd3 = rdd1.union(rdd2)
//Compute the intersection
val rdd4 = rdd1.intersection(rdd2)
//Remove duplicates from the union
rdd3.distinct.collect
rdd4.collect
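To see why distinct is applied after union, the element counts can be compared. This is a small sketch under the assumption that it runs in the same spark-shell session (sc predefined); the names left, right, and merged are illustrative.

```scala
// Assumes a spark-shell session, where `sc` (the SparkContext) is predefined.
val left = sc.parallelize(List(5, 6, 4, 3))
val right = sc.parallelize(List(1, 2, 3, 4))

// union concatenates the two RDDs, so 3 and 4 appear twice
val merged = left.union(right)
val unionCount = merged.count()               // 8
val distinctCount = merged.distinct().count() // 6 (the values 1 to 6)

// intersection returns only the shared elements, without duplicates
val shared = left.intersection(right).collect().sorted // Array(3, 4)
```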
Exercise 4: join, groupByKey
val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
//Inner join on key: only keys present in both RDDs are kept
val rdd3 = rdd1.join(rdd2)
rdd3.collect
//Compute the union
val rdd4 = rdd1 union rdd2
rdd4.collect
//Group the values by key
val rdd5 = rdd4.groupByKey
rdd5.collect
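Because collect returns results in no guaranteed order after a shuffle, it can be easier to inspect the results of join and groupByKey as maps. The sketch below assumes the same spark-shell session (sc predefined); the names x, y, joined, and groupedSums are illustrative, and summing the grouped values is just one possible follow-up step.

```scala
// Assumes a spark-shell session, where `sc` (the SparkContext) is predefined.
val x = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val y = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))

// join keeps only the keys found in both RDDs: tom and jerry
val joined = x.join(y).collectAsMap()

// groupByKey on the union collects every value per key; here each group is summed
val groupedSums = x.union(y).groupByKey().mapValues(_.sum).collectAsMap()

println(joined("tom"))        // (1,1)
println(groupedSums("jerry")) // 5
```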
Exercise 5: cogroup
val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("jim", 2)))
//Group the values of both RDDs by key
val rdd3 = rdd1.cogroup(rdd2)
//Unlike groupByKey on a union, cogroup keeps the values from each RDD in separate collections
rdd3.collect
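The difference between cogroup and groupByKey can be made concrete by running both on the same data. This is a minimal sketch, assuming the same spark-shell session (sc predefined); the names a, b, cogrouped, and grouped are illustrative, and the values are sorted only to make the printed output stable.

```scala
// Assumes a spark-shell session, where `sc` (the SparkContext) is predefined.
val a = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
val b = sc.parallelize(List(("jerry", 2), ("tom", 1), ("jim", 2)))

// cogroup: one pair of collections per key, values kept apart by source RDD
val cogrouped = a.cogroup(b)
  .mapValues { case (fromA, fromB) => (fromA.toList.sorted, fromB.toList.sorted) }
  .collectAsMap()

// groupByKey on the union: a single collection per key, the source RDD is lost
val grouped = a.union(b).groupByKey()
  .mapValues(_.toList.sorted)
  .collectAsMap()

println(cogrouped("tom")) // (List(1, 2),List(1))
println(grouped("tom"))   // List(1, 1, 2)
println(cogrouped("jim")) // (List(),List(2))
```

A key that exists in only one RDD, such as jim, still appears in the cogroup result, with an empty collection on the other side.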
Exercise 6: reduce
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))
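//reduce is an action: it aggregates the elements and returns the result (an Int) to the driver
val sum = rdd1.reduce(_ + _)
//The result is a plain value, not an RDD, so there is nothing to collect
sum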
Exercise 7: reduceByKey, sortByKey
val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2), ("shuke", 1)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 3), ("shuke", 2), ("kitty", 5)))
val rdd3 = rdd1.union(rdd2)
//Sum the values for each key
val rdd4 = rdd3.reduceByKey(_ + _)
rdd4.collect
//Sort by value in descending order: swap key and value, sort by key, then swap back
val rdd5 = rdd4.map(t => (t._2, t._1)).sortByKey(false).map(t => (t._2, t._1))
rdd5.collect
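The swap, sort, swap-back pattern above can also be written with sortBy, which takes a key-extraction function directly. The following is a sketch assuming the same spark-shell session (sc predefined); scores is an illustrative RDD holding the per-key sums that the reduceByKey step above produces.

```scala
// Assumes a spark-shell session, where `sc` (the SparkContext) is predefined.
// Illustrative data: the per-key sums from the exercise above.
val scores = sc.parallelize(List(("tom", 4), ("jerry", 5), ("kitty", 7), ("shuke", 3)))

// Sort by the value (second tuple field) in descending order, without swapping
val byValueDesc = scores.sortBy(_._2, ascending = false).collect()

byValueDesc.foreach(println)
// (kitty,7)
// (jerry,5)
// (tom,4)
// (shuke,3)
```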
Exercise 8: repartition, coalesce
val rdd1 = sc.parallelize(1 to 10, 3)
//Use repartition to change the number of partitions of rdd1
//Decrease the number of partitions
rdd1.repartition(2).partitions.size
//Increase the number of partitions
rdd1.repartition(4).partitions.size
//Use coalesce to change the number of partitions of rdd1
//Decrease the number of partitions
rdd1.coalesce(2).partitions.size
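Note: repartition can both increase and decrease the number of partitions of an RDD, and it always performs a shuffle. By default, coalesce can only decrease the number of partitions; asking it for more partitions has no effect unless shuffle = true is passed. repartition(n) is equivalent to coalesce(n, shuffle = true).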
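The behavior described in the note can be checked directly. This is a minimal sketch, assuming the same spark-shell session (sc predefined); the names nums, unchanged, grown, and viaRepartition are illustrative.

```scala
// Assumes a spark-shell session, where `sc` (the SparkContext) is predefined.
val nums = sc.parallelize(1 to 10, 3)

// Without a shuffle, coalesce cannot increase the partition count
val unchanged = nums.coalesce(4).partitions.size // 3

// With shuffle = true, coalesce can increase it
val grown = nums.coalesce(4, shuffle = true).partitions.size // 4

// repartition always shuffles, so it can increase the count as well
val viaRepartition = nums.repartition(4).partitions.size // 4
```

When only reducing the number of partitions, coalesce without a shuffle is usually the cheaper choice, because it merges existing partitions instead of redistributing every record.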