Datasets and DataFrames
Contents: Preface · Source Code (DataFrame, Dataset) · Create a Dataset · Read a JSON File · Convert an RDD to a Dataset · Summary · DataFrame · Summary
Preface
The concepts of Dataset and DataFrame were introduced in Spark 1.6, and the Spark SQL API is built on these two abstractions. Structured Streaming, whose stable version arrived in Spark 2.2, also relies on the Spark SQL API, and Spark MLlib is gradually moving from the RDD API to the DataFrame API. Step by step, these moves signal that the future of Spark is the Dataset, so in this article I want to take a serious look at it.
Source Code
Let's look at the source code of these two types first.
DataFrame
package object sql {
  ...
  type DataFrame = Dataset[Row]
}
You can see that DataFrame is just a special case of Dataset, so let's first understand the Dataset interface.
Dataset
private[sql] object Dataset {
  def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
  }

  def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
    val qe = sparkSession.sessionState.executePlan(logicalPlan)
    qe.assertAnalyzed()
    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
  }
}

class Dataset[T] private[sql](
    @transient val sparkSession: SparkSession,
    @DeveloperApi @InterfaceStability.Unstable @transient val queryExecution: QueryExecution,
    encoder: Encoder[T])
  extends Serializable {
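Since DataFrame is only a type alias for Dataset[Row], the two convert back and forth freely. A minimal sketch, assuming a live SparkSession named spark (the Person case class and sample values here are just illustrative):

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

case class Person(name: String, age: Int)

val ds: Dataset[Person] = Seq(Person("Andy", 32)).toDS()   // typed Dataset
val df: DataFrame = ds.toDF()                              // drop the static type, keep the data and schema
val typedAgain: Dataset[Person] = df.as[Person]            // reattach the type via an implicit encoder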
A Dataset is similar to an RDD, but unlike an RDD it uses a dedicated encoder to serialize JVM objects for processing and for sending them over the network. Anyone who has written Spark programs knows that RDDs rely on Kryo or the Java serializer for this. Those serializers only turn objects into bytes, whereas a Dataset can run many of its transformations directly on the serialized binary data, which makes it more powerful than the RDD.
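To make the encoder visible, here is a small sketch that derives encoders explicitly for the Person class from the snippet above (normally import spark.implicits._ supplies them behind the scenes):

import org.apache.spark.sql.{Encoder, Encoders}

// Product encoder: knows the fields of Person, so Spark can store and transform
// the data in its compact binary format without fully deserializing it.
val personEncoder: Encoder[Person] = Encoders.product[Person]

// Kryo encoder: treats the object as an opaque blob in a single binary column,
// much like RDD-style Kryo serialization.
val kryoEncoder: Encoder[Person] = Encoders.kryo[Person]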
Because a Dataset is strongly typed, a type must be supplied to describe its data.
Create a Dataset
case class Person(name: String, age: Int)

def main(args: Array[String]): Unit = {
  import spark.implicits._
  val caseClassDF = Seq(Person("Andy", 32)).toDS()
  caseClassDF.show()
}
The above code automatically maps the data to a table with columns name and age. Let's take a look at the result:
+----+---+
|name|age|
+----+---+
|Andy| 32|
+----+---+
As you can see, SparkSession automatically turns each Person object into one row of the table.
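You can also inspect the schema Spark derived from the case class; a quick check on the caseClassDF from above should print something like this:

caseClassDF.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)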
If we do not wrap the data in Person, the code becomes:
val caseClassDF = Seq("Andy", "Doctorq").toDS()
caseClassDF.show()

+-------+
|  value|
+-------+
|   Andy|
|Doctorq|
+-------+
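If the elements carry several fields but you do not want to define a case class, a Dataset of tuples is another option; Spark then names the columns _1, _2, and so on. A minimal sketch, assuming the same spark session and import spark.implicits._ (the sample values are made up):

val tupleDS = Seq(("Andy", 32), ("Doctorq", 30)).toDS()
tupleDS.show()

+-------+--+
|     _1|_2|
+-------+--+
|   Andy|32|
|Doctorq|30|
+-------+--+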
As these examples show, when the data is not wrapped in a class, Spark falls back to generic column names (value for plain values, _1, _2, ... for tuples) and treats each element as one record. When a row of data has more than one field, it is easier to wrap it in a case class such as Person so the columns get meaningful names.
Read a JSON File
val path = "src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
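Here people.json is assumed to be line-delimited JSON, one object per line, e.g. {"name": "Andy", "age": 32}; that is the format spark.read.json expects by default, and as[Person] then maps each object onto the case class by field name.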
The interface SparkSession provides for reading external files is very simple.
Convert an RDD to a Dataset
val peopleDS = spark.sparkContext
  .textFile("src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDS()
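The conversion also works in the other direction: every Dataset exposes its underlying RDD. A one-line sketch, reusing the peopleDS from above:

val peopleRDD = peopleDS.rdd   // back to an RDD[Person]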
Summary
There are many ways to get a Dataset: an ordinary collection or external data can easily be converted to a DS, but remember to bring the implicit conversions into scope with import spark.implicits._. Also note that you need to specify the element type.
DataFrame
A DataFrame is simply a Dataset whose element type is Row, so let's look at the Row object:
Row defines the number of elements in a row and, through its schema, the type of each field. A Row behaves much like a collection: it knows its size and the type of each element, and it provides methods to fetch the value at a given position.
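A minimal sketch of working with a Row directly (the generic Row factory and the positional getters live in org.apache.spark.sql; the values are made up):

import org.apache.spark.sql.Row

val row = Row("Andy", 32)
row.length        // 2, the number of elements in the row
row.getString(0)  // "Andy", typed access by position
row.getInt(1)     // 32
row(1)            // the same value, but typed as Any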
When we discussed the Dataset we said that creating one requires specifying the element type. DataFrame is the special case that simply uses Row as that type parameter and leaves all field handling to Row, which is handy when the data cannot (or need not) be described by a dedicated class; in that situation, use a DF.
Summary
As Spark's new fundamental data abstraction, the Dataset will gradually replace the RDD and officially become the underlying foundation; the ongoing migration of streaming and the machine learning library are clear signals of that. I used to read articles claiming that the Dataset inherits from the RDD, but in fact it is independent of it and can merely be converted from an RDD.