RDD, DataFrame, DataSet Introduction

RDD
Advantages:
Compile-time type safety
Type errors can be caught at compile time.
Object-oriented programming style
Data can be manipulated directly in the object.field (class-name-dot) style.
Disadvantages:
Performance overhead of serialization and deserialization
Both communication between cluster nodes and I/O operations require serializing and deserializing the object's structure and data.
Performance overhead of GC
Frequent creation and destruction of objects inevitably increases GC pressure.

val sparkConf = new SparkConf().setMaster("local").setAppName("test")
  .set("spark.port.maxRetries", "100") // the value was garbled in the source; "100" is only a placeholder
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
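
As a quick sketch of the object-oriented style and compile-time type safety described above (the Person case class and the sample values are made up for illustration):

case class Person(name: String, age: Int)
val people = spark.sparkContext.parallelize(Seq(Person("a", 1), Person("b", 2)))
// Fields are accessed in the object.field style, and a mistake such as
// people.filter(_.age > "x") would be rejected at compile time
val names = people.filter(_.age > 1).map(_.name)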


DataFrame:

DataFrame introduces a schema and off-heap storage.
Schema: every row of the underlying RDD has the same structure, and this structure is stored in the schema. Since Spark can read the structure from the schema, only the data itself needs to be serialized and deserialized during communication and I/O; the structural part can be omitted.
Off-heap: memory outside the JVM heap, managed directly by the operating system rather than the JVM. Spark can serialize data in binary form (without the structure) into off-heap memory and operate on that memory directly when the data is processed. Because Spark understands the schema, it knows how to interpret the bytes.
Off-heap is like a territory and the schema is like a map: with both a map and its own territory, Spark can manage the data itself, no longer restricted by the JVM and no longer subject to GC.
Through the schema and off-heap memory, DataFrame overcomes the drawbacks of RDD, but it also loses RDD's advantages: DataFrame is not type-safe, and its API is not object-oriented.
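
As a side note, off-heap memory for Spark's execution and storage can be switched on through its standard configuration keys; a minimal sketch (the size value is purely illustrative):

import org.apache.spark.SparkConf
// Standard Spark settings for off-heap execution/storage memory; 512m is only an example size
val offHeapConf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "512m")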

import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object Run {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("test").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new SQLContext(sc)
    /**
      * id      age
      * 1       30
      * 2       29
      * 4       21
      */
    // the age values were garbled in the source; 30, 29 and 21 are placeholders
    val idAgeRDDRow = sc.parallelize(Array(Row(1, 30), Row(2, 29), Row(4, 21)))

    val schema = StructType(Array(StructField("id", DataTypes.IntegerType), StructField("age", DataTypes.IntegerType)))

    val idAgeDF = sqlContext.createDataFrame(idAgeRDDRow, schema)
    // The API is not object-oriented
    idAgeDF.filter(idAgeDF.col("age") > 25) // 25 stands in for the garbled comparison value in the source
    // No compile error: DataFrame is not compile-time type-safe
    idAgeDF.filter(idAgeDF.col("age") > "")
  }
}
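
To see the schema that Spark keeps alongside the data, a small sketch (idAgeDF is the DataFrame from the example above):

// Print the structure Spark stores for the DataFrame
idAgeDF.printSchema()
// root
//  |-- id: integer (nullable = true)
//  |-- age: integer (nullable = true)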


1. Unlike RDD and Dataset, in a DataFrame the type of each row is fixed to Row, so the value of each field can only be obtained by parsing, for example:

testDF.foreach { line =>
  val col1 = line.getAs[String]("col1")
  val col2 = line.getAs[String]("col2")
}
2. DataFrame and Dataset both support Spark SQL operations such as select and groupBy; they can also be registered as temporary tables/views and queried with SQL statements, for example:
dataDF.createOrReplaceTempView("tmp")
spark.sql("select row, date from tmp where date is not null order by date").show(100, false)
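
The same kind of query can also be written with the DataFrame/Dataset DSL instead of a SQL string; a small sketch (dataDF and its column names are taken from the snippet above):

import org.apache.spark.sql.functions.col
// DSL equivalents of the SQL above; "row" and "date" are the assumed column names
dataDF.select("row", "date")
  .where(col("date").isNotNull)
  .orderBy("date")
  .show(100, false)
dataDF.groupBy("date").count().show()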

3. DataFrame and Dataset support some particularly convenient ways of saving data, such as saving to CSV with a header, so the field name of each column is clear at a glance.

// Save
val saveOptions = Map("header" -> "true", "delimiter" -> "\t", "path" -> "hdfs://172.xx.xx.xx:9000/test")
datawDF.write.format("com.databricks.spark.csv").mode(SaveMode.Overwrite).options(saveOptions).save()
// Read
val options = Map("header" -> "true", "delimiter" -> "\t", "path" -> "hdfs://172.xx.xx.xx:9000/test")
val datarDF = spark.read.options(options).format("com.databricks.spark.csv").load()
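
Since Spark 2.x the csv data source is built in, so the same save and read can be written without the Databricks package; a sketch using the same placeholder path and options:

import org.apache.spark.sql.SaveMode
// Built-in csv writer/reader; the HDFS path is the same placeholder as above
datawDF.write.option("header", "true").option("delimiter", "\t")
  .mode(SaveMode.Overwrite).csv("hdfs://172.xx.xx.xx:9000/test")
val dataRDF = spark.read.option("header", "true").option("delimiter", "\t")
  .csv("hdfs://172.xx.xx.xx:9000/test")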

Dataset:
The main comparison here is between Dataset and DataFrame, because Dataset and DataFrame have exactly the same member functions; the only difference is the type of each row.
A DataFrame can also be called Dataset[Row]: the type of each row is Row, and it is not parsed, so which fields a row contains and what type each field has are unknown; a specific field can only be extracted with the getAs method mentioned above or by pattern matching on the Row.
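
For completeness, extracting fields from a Row by pattern matching looks roughly like this (testDF and its two string columns are assumed from the earlier getAs snippet, and spark.implicits._ supplies the encoder):

import org.apache.spark.sql.Row
// Pattern matching on Row as an alternative to getAs
val parsed = testDF.map { case Row(col1: String, col2: String) => (col1, col2) }
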
In a Dataset, by contrast, the type of each row is not fixed: once the case class has been defined, the information in each row can be accessed freely.

case class Coltest(col1: String, col2: Int) extends Serializable // define field names and types
/**
  rdd
  ("a", 1)
  ("b", 1)
  ("a", 1)
  */
val test: Dataset[Coltest] = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS

test.foreach { line =>
  println(line.col1)
  println(line.col2)
}
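
Because the row type is now the case class, field access is checked at compile time, unlike the DataFrame filter shown earlier; a small sketch on the same test Dataset:

// col2 is known to be an Int, so this compiles and runs
test.filter(_.col2 > 0).show()
// test.filter(_.col2 > "")  // would not compile: an Int cannot be compared to a String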

Dataset is very convenient when you need to access a specific field. However, when writing highly generic functions, the row type is not fixed and could be any number of different case classes that the function cannot adapt to; in that situation DataFrame, i.e. Dataset[Row], solves the problem better.

Dataset combines the benefits of RDD and DataFrame and introduces a new concept: the Encoder.
When serializing data, the Encoder generates bytecode that interacts with off-heap memory, so that data can be accessed on demand without deserializing the entire object. Spark does not yet provide a custom Encoder API, but one may be added in the future.
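
The built-in Encoders are exposed through org.apache.spark.sql.Encoders and through spark.implicits._; a small sketch of how a Dataset picks one up (Coltest is the case class defined above, spark the SparkSession from the first snippet):

import org.apache.spark.sql.{Encoder, Encoders}
// An explicit product encoder for the case class; normally spark.implicits._ supplies it implicitly
val coltestEncoder: Encoder[Coltest] = Encoders.product[Coltest]
val coltestDS = spark.createDataset(Seq(Coltest("a", 1), Coltest("b", 2)))(coltestEncoder)
coltestDS.show()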

The following code, extracted from the DataFrame example above, creates a DataFrame in 1.6.x (using the same placeholder ages):

val idAgeRDDRow = sc.parallelize(Array(Row(1, 30), Row(2, 29), Row(4, 21)))

val schema = StructType(Array(StructField("id", DataTypes.IntegerType), StructField("age", DataTypes.IntegerType)))

val idAgeDF = sqlContext.createDataFrame(idAgeRDDRow, schema)

But the same code in 2.0.0-preview no longer creates quite the same thing, even though the result is still called a DataFrame.

The implementation of the sqlContext.createDataFrame(rowRDD, schema) method still declares DataFrame as its return type:
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame = {
  sparkSession.createDataFrame(rowRDD, schema)
}
But it is actually a Dataset, because DataFrame is declared as Dataset[Row]:

package object sql {
  // ...irrelevant code omitted

  type DataFrame = Dataset[Row]
}

So when we migrate from 1.6.x to 2.0.0, we end up using Dataset without any modification to our code.

The following is a sample of Dataset code:

import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object Test {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Test").setMaster("local") // for debugging, do not use local[*]
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val idAgeRDDRow = sc.parallelize(Array(Row(1, 30), Row(2, 29), Row(4, 21)))
    val schema = StructType(Array(StructField("id", DataTypes.IntegerType), StructField("age", DataTypes.IntegerType)))
    // In 2.0.0-preview this line creates a DataFrame, which is actually a Dataset[Row]
    val idAgeDS = sqlContext.createDataFrame(idAgeRDDRow, schema)
    // In 2.0.0-preview a custom Encoder is not supported, the Row type cannot be used directly, and a custom bean does not work
    // The official documentation also has an example of creating a Dataset from a bean, but I could not get it to run
    // So for now a Dataset[Row] has to be created through the createDataFrame method
    // sqlContext.createDataset(idAgeRDDRow)
    // Types such as String, Integer and Long can be used to create a Dataset directly
    Seq(1, 2, 3).toDS().show()
    sqlContext.createDataset(sc.parallelize(Array(1, 2, 3))).show()
  }
}
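
Since DataFrame is only a type alias for Dataset[Row], a value such as idAgeDS above can be typed either way; a trivial sketch under that assumption:

import org.apache.spark.sql.{DataFrame, Dataset, Row}
// No conversion is needed: in 2.0.0 DataFrame and Dataset[Row] are the same type
val asDataFrame: DataFrame = idAgeDS
val asRowDataset: Dataset[Row] = idAgeDS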


DataFrame/Dataset to RDD:

The conversion is simple:

val rdd1 = testDF.rdd
val rdd2 = testDS.rdd
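
Note the element types that come out of the conversion; a quick sketch under the same testDF/testDS definitions used in this article:

// testDF.rdd yields RDD[Row], while testDS.rdd keeps the case class element type
val rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = testDF.rdd
val coltestRDD: org.apache.spark.rdd.RDD[Coltest] = testDS.rdd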


RDD to DataFrame:
import spark.implicits._
val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")


In general, a row of data is written together as a tuple, and the field names are then specified in toDF.

RDD to Dataset:

import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable // define field names and types
val testDS = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS


Note that when the type of each row (the case class) is defined, the field names and types are already given; afterwards you only need to put values into the case class.

Dataset to DataFrame:

This is also very simple, because it just wraps the case class into a Row.

import spark.implicits._
val testDF = testDS.toDF


DataFrame to Dataset:
import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable // define field names and types
val testDS = testDF.as[Coltest]


This approach gives the type of each column through the case class and then uses the as method to convert to a Dataset; it is especially handy when the data is a DataFrame and each field needs to be handled individually.
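
As a sketch of that per-field handling (reusing the testDS: Dataset[Coltest] obtained above; the uppercase transform is only an illustration):

// Every field is statically typed after .as[Coltest], so per-field logic is straightforward
val upperCased = testDS.map(row => row.copy(col1 = row.col1.toUpperCase))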

Note: when using these conversions you must add import spark.implicits._, otherwise toDF and toDS cannot be used.


