Spark's growth path: Dataset and DataFrame

Source: Internet
Author: User
Tags: json, serialization

Datasets and Dataframes

Contents: Preface · Source Code (DataFrame, Dataset) · Create a DataSet · Read a JSON string · RDD conversion to Dataset · Summary · DataFrame · Summary

Preface

The concepts of Dataset and DataFrame were introduced in Spark 1.6, and the Spark SQL API is built on these two abstractions. The stable version of Structured Streaming, released in 2.2, also depends on the Spark SQL API, and Spark MLlib has begun converting from the RDD API to the DataFrame API. Step by step, these moves indicate that the future of Spark is the Dataset, so in this article the author intends to take a serious look at it.

Source Code

Let's look at the source code of these two objects first.

DataFrame

package object sql {
  ...
  type DataFrame = Dataset[Row]
}
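In practice the alias means that a DataFrame and a Dataset differ only in their element type, so the two views can be converted back and forth. A minimal sketch, assuming the Person case class and the spark.implicits._ import that appear later in this article:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._                                      // spark is an existing SparkSession

case class Person(name: String, age: Int)                     // same case class as in the examples below

val ds: Dataset[Person]   = Seq(Person("Andy", 32)).toDS()    // typed view
val df: DataFrame         = ds.toDF()                         // untyped view, i.e. Dataset[Row]
val back: Dataset[Person] = df.as[Person]                     // attach an Encoder to recover the typed view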

You can see that DataFrame is just a special case of Dataset, so let's first understand the Dataset interface.

Dataset

private[sql] object Dataset {
  def apply[T : Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
  }

  def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
    val qe = sparkSession.sessionState.executePlan(logicalPlan)
    qe.assertAnalyzed()
    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
  }
}


class Dataset[T] private[sql](
    @transient val sparkSession: SparkSession,
    @DeveloperApi @InterfaceStability.Unstable @transient val queryExecution: QueryExecution,
    encoder: Encoder[T])
  extends Serializable {

A Dataset is similar to an RDD, but unlike an RDD it has a specialized encoder for serializing JVM objects both for processing and for sending over the network. Anyone who has developed Spark programs knows that an RDD relies on Kryo or the Java serializer for this. Serialization turns objects into binary, and a Dataset can operate on that binary data directly and run its transformations on it, which makes it more powerful than the RDD.
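To make the encoder concrete, here is a small sketch of the Encoders factory (it assumes the Person case class defined in the next section): Encoders.product derives a columnar encoder from a case class, while Encoders.kryo falls back to Kryo and stores the whole object as a single binary field.

import org.apache.spark.sql.{Encoder, Encoders}

val personEncoder: Encoder[Person] = Encoders.product[Person]   // derived from the case class fields
println(personEncoder.schema)                                    // the tabular schema Person is mapped to
val kryoEncoder: Encoder[Person] = Encoders.kryo[Person]         // opaque binary blob, no columnar structure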
Because a Dataset is strongly typed, a type must be supplied to describe the data.

Create a DataSet

case class Person(name: String, age: Int)

def main(args: Array[String]): Unit = {
  import spark.implicits._
  val caseClassDF = Seq(Person("Andy", 32)).toDS()
  caseClassDF.show()
}

The code above automatically maps the data to a table with the columns name and age. Let's look at the result:

+----+---+
|name|age|
+----+---+
|Andy| 32|
+----+---+

As you can see, SparkSession automatically treats each Person object as one row of data in the table.
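Because caseClassDF is a Dataset[Person] rather than a plain DataFrame, transformations can use Person's fields directly; a small sketch for illustration:

caseClassDF.map(p => p.name.toUpperCase).show()   // typed lambda; the result encoder comes from spark.implicits._
caseClassDF.filter(_.age > 18).show()             // field access checked at compile time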

If we do not wrap the data in Person, the code becomes the following:

val caseClassDF = Seq("Andy", "Doctorq").toDS()
caseClassDF.show()

+-------+
|  value|
+-------+
|   Andy|
|Doctorq|
+-------+

By default the column is named "value" and each element is treated as one record. If a row of data has more than one attribute, it is easier to wrap it in an object.

Read a JSON string

val path = "src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
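The result depends on the contents of people.json; the sketch below assumes one JSON object per line, such as {"name":"Andy","age":32}. Note that spark.read.json infers whole numbers as bigint, so a cast may be needed before .as[Person] when Person.age is declared as Int:

val raw = spark.read.json("src/main/resources/people.json")
raw.printSchema()                                               // e.g. age: long, name: string
val typedPeople = raw.withColumn("age", raw("age").cast("int")).as[Person]
typedPeople.show()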

The interface SparkSession provides for reading external files is very simple.

RDD conversion to Dataset

val peopleDS = spark.sparkContext
  .textFile("src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDS()
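The conversion also works in the opposite direction; a short sketch using the peopleDS built above:

val peopleRDD = peopleDS.rdd      // back to an RDD[Person]
val peopleDF  = peopleDS.toDF()   // or drop the static type and work with Row objects
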
Summary

There are many ways to obtain a Dataset object: ordinary collections and external data sources can easily be converted to a DS, but remember to bring in the implicit conversions with import spark.implicits._. One more point: you need to specify the element type.

DataFrame

A DataFrame is a Dataset whose element type is the Row object, so let's take a look at Row:

Row defines the number of elements in a row and a schema that describes the data type of each field. In that sense a Row is similar to a collection: it defines its size and the type of each element, and it provides methods for getting the value at a given position.
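A minimal sketch of the Row API, assuming df is a DataFrame with the columns name: String and age: Int:

import org.apache.spark.sql.Row

val firstRow: Row = df.first()
println(firstRow.length)                // number of fields in the row
println(firstRow.schema)                // StructType describing each field's name and type
val name = firstRow.getString(0)        // positional access; the caller supplies the type
val age  = firstRow.getAs[Int]("age")   // access by field name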

Earlier we said that creating a Dataset requires specifying the element type; a DataFrame is the special case where that type is Row. Row is generic enough to hold data for which no dedicated class can be defined, and in such cases a DF is used instead.

Summary

The Dataset, as the new fundamental data abstraction in Spark, will gradually replace the RDD and officially become the underlying support; the ongoing migration of Structured Streaming and of the machine learning library are both signals of this. Before reading the source I assumed the Dataset inherited from the RDD, but in fact it is independent, although it can be created from an RDD.
