Datasets and DataFrames
Contents: Preface · Source Code (DataFrame, Dataset) · Create a Dataset · Read a JSON File · Convert an RDD to a Dataset · Summary · DataFrame · Summary
Preface
The concepts of Dataset and DataFrame were introduced in Spark 1.6, and the Spark SQL API is built on these two abstractions. Structured Streaming, whose stable version arrived in Spark 2.2, also relies on the Spark SQL API, and Spark MLlib is gradually moving from the RDD API to the DataFrame API. Step by step, these moves signal that the future of Spark is the Dataset, so in this article I want to take a serious look at it.
Source Code
Let's look at the source code of these two types first.
DataFrame
package object sql {
  ...
  type DataFrame = Dataset[Row]
}
You can see that DataFrame is just a special case of Dataset, so let's first understand the Dataset interface.
Dataset
private[sql] object Dataset {
  def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
  }

  def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
    val qe = sparkSession.sessionState.executePlan(logicalPlan)
    qe.assertAnalyzed()
    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
  }
}

class Dataset[T] private[sql](
    @transient val sparkSession: SparkSession,
    @DeveloperApi @InterfaceStability.Unstable @transient val queryExecution: QueryExecution,
    encoder: Encoder[T])
  extends Serializable {
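Since DataFrame is only a type alias for Dataset[Row], the two convert back and forth freely. A minimal sketch, assuming a live SparkSession named spark (the Person case class and sample values here are just illustrative):

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

case class Person(name: String, age: Int)

val ds: Dataset[Person] = Seq(Person("Andy", 32)).toDS()   // typed Dataset
val df: DataFrame = ds.toDF()                              // drop the static type, keep the data and schema
val typedAgain: Dataset[Person] = df.as[Person]            // reattach the type via an implicit encoder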
A Dataset is similar to an RDD, but unlike an RDD it uses a dedicated encoder to serialize JVM objects for processing and for sending them over the network. Anyone who has written Spark programs knows that RDDs rely on Kryo or the Java serializer for this. Those serializers only turn objects into bytes, whereas a Dataset can run many of its transformations directly on the serialized binary data, which makes it more powerful than the RDD.
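To make the encoder visible, here is a small sketch that derives encoders explicitly for the Person class from the snippet above (normally import spark.implicits._ supplies them behind the scenes):

import org.apache.spark.sql.{Encoder, Encoders}

// Product encoder: knows the fields of Person, so Spark can store and transform
// the data in its compact binary format without fully deserializing it.
val personEncoder: Encoder[Person] = Encoders.product[Person]

// Kryo encoder: treats the object as an opaque blob in a single binary column,
// much like RDD-style Kryo serialization.
val kryoEncoder: Encoder[Person] = Encoders.kryo[Person]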
Because a Dataset is strongly typed, a type must be supplied to describe its data.
Create a Dataset
case class Person(name: String, age: Int)

def main(args: Array[String]): Unit = {
  import spark.implicits._
  val caseClassDF = Seq(Person("Andy", 32)).toDS()
  caseClassDF.show()
}
The above code automatically maps the data to a table with columns name and age. Let's take a look at the result:
+----+---+
|name|age|
+----+---+
|Andy| 32|
+----+---+
As you can see, SparkSession automatically turns each Person object into one row of the table.
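You can also inspect the schema Spark derived from the case class; a quick check on the caseClassDF from above should print something like this:

caseClassDF.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)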
If we do not wrap the data in Person, the code becomes:
val caseClassDF = Seq("Andy", "Doctorq").toDS()
caseClassDF.show()

+-------+
|  value|
+-------+
|   Andy|
|Doctorq|
+-------+
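If the elements carry several fields but you do not want to define a case class, a Dataset of tuples is another option; Spark then names the columns _1, _2, and so on. A minimal sketch, assuming the same spark session and import spark.implicits._ (the sample values are made up):

val tupleDS = Seq(("Andy", 32), ("Doctorq", 30)).toDS()
tupleDS.show()

+-------+--+
|     _1|_2|
+-------+--+
|   Andy|32|
|Doctorq|30|
+-------+--+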
As these examples show, when the data is not wrapped in a class, Spark falls back to generic column names (value for plain values, _1, _2, ... for tuples) and treats each element as one record. When a row of data has more than one field, it is easier to wrap it in a case class such as Person so the columns get meaningful names.
Read a JSON File
val path = "src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
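Here people.json is assumed to be line-delimited JSON, one object per line, e.g. {"name": "Andy", "age": 32}; that is the format spark.read.json expects by default, and as[Person] then maps each object onto the case class by field name.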
The interface SparkSession provides for reading external files is very simple.
Convert an RDD to a Dataset
val peopleDS = spark.sparkContext
  .textFile("src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDS()
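The conversion also works in the other direction: every Dataset exposes its underlying RDD. A one-line sketch, reusing the peopleDS from above:

val peopleRDD = peopleDS.rdd   // back to an RDD[Person]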
Summary
There are many ways to get a Dataset: an ordinary collection or external data can easily be converted to a DS, but remember to bring the implicit conversions into scope with import spark.implicits._. Also note that you need to specify the element type.
DataFrame
A DataFrame is simply a Dataset whose element type is Row, so let's look at the Row object:
Row defines the number of elements in a row and, through its schema, the type of each field. A Row behaves much like a collection: it knows its size and the type of each element, and it provides methods to fetch the value at a given position.
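A minimal sketch of working with a Row directly (the generic Row factory and the positional getters live in org.apache.spark.sql; the values are made up):

import org.apache.spark.sql.Row

val row = Row("Andy", 32)
row.length        // 2, the number of elements in the row
row.getString(0)  // "Andy", typed access by position
row.getInt(1)     // 32
row(1)            // the same value, but typed as Any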
When we discussed the Dataset we said that creating one requires specifying the element type. DataFrame is the special case that simply uses Row as that type parameter and leaves all field handling to Row, which is handy when the data cannot (or need not) be described by a dedicated class; in that situation, use a DF.
Summary
As Spark's new fundamental data abstraction, the Dataset will gradually replace the RDD and officially become the underlying foundation; the ongoing migration of streaming and the machine learning library are clear signals of that. I used to read articles claiming that the Dataset inherits from the RDD, but in fact it is independent of it and can merely be converted from an RDD.