An important reason Apache Spark attracts a large community of developers is that it provides simple, easy-to-use APIs for manipulating big data across multiple languages such as Scala, Java, Python, and R.
This article focuses on the three Apache Spark 2.0 APIs: RDD, DataFrame, and Dataset. It covers their respective usage scenarios, their performance and optimizations, and the cases where DataFrames and Datasets should be used instead of RDDs. Most of the article is devoted to DataFrames and Datasets, because unifying them is the focus of the Apache Spark 2.0 API.
The main motivation behind the unified Apache Spark 2.0 API is the quest to simplify Spark: reduce the number of concepts users have to learn and provide a way to process structured data. Beyond structure, Spark also offers higher-level abstractions and an API in the form of a domain-specific language (DSL).
Resilient Distributed Dataset (RDD)
The RDD has been Spark's core API since its inception. An RDD is an immutable, distributed, resilient dataset that can be partitioned across the nodes of a Spark cluster, and it exposes a low-level, distributed API for manipulating the data, consisting of transformations and actions.
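As a minimal sketch of that low-level API (the local SparkSession setup and the sample numbers are assumptions added for illustration), transformations such as filter and map are lazy, while an action such as reduce triggers the distributed computation:

```scala
import org.apache.spark.sql.SparkSession

// Local session used only for this sketch
val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Create an RDD by partitioning an in-memory collection across the cluster
val numbers = sc.parallelize(1 to 100)

// Transformations are lazy: nothing executes yet
val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

// An action triggers the actual distributed computation
val total = evenSquares.reduce(_ + _)
println(s"Sum of even squares: $total")
```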
When should you use an RDD?
Typical scenarios for using RDDs:
- You need low-level transformations and actions to control exactly how your dataset is processed;
- Your data is unstructured, such as media streams or streams of text;
- You want to manipulate your data with functional programming constructs rather than a domain-specific language (DSL);
- You do not care about a schema, for example you do not need to access attributes by name or column, and do not care about a columnar format;
- You are willing to forgo the optimizations that DataFrames and Datasets provide for structured and semi-structured data.
Was the RDD abandoned in Apache Spark 2.0? You may be wondering: has the RDD become a "second-class citizen"? Will it simply be phased out? The answer is, of course, no! As described below, Spark users can seamlessly convert among RDDs, DataFrames, and Datasets with extremely simple API calls (a short conversion sketch follows the DataFrame introduction below).

DataFrame

A DataFrame, like an RDD, is an immutable, distributed, resilient dataset. The difference is that a DataFrame organizes the data into named columns, that is, structured data, similar to a table in a traditional database. DataFrames are designed to make big-data processing easier: they let developers import structured datasets into a higher-level abstraction and provide a domain-specific language (DSL) API for manipulating those datasets. In Spark 2.0, the DataFrame API is merged with the Dataset API to unify data processing. Because this unification happened "somewhat urgently", many Spark developers are still unfamiliar with the high-level, type-safe Dataset API.
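To illustrate how cheap those conversions are, here is a small sketch that moves the same records between the three representations; the `Person` case class, the sample data, and the session setup are assumptions added for this example:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only for this conversion example
case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("conversions").master("local[*]").getOrCreate()
import spark.implicits._

// Start from an RDD of case-class objects
val peopleRDD = spark.sparkContext.parallelize(Seq(Person("Ann", 32), Person("Bo", 27)))

// RDD -> DataFrame: columns are inferred from the case class fields
val peopleDF = peopleRDD.toDF()

// DataFrame -> Dataset[Person]: restores compile-time types
val peopleDS = peopleDF.as[Person]

// Dataset (or DataFrame) -> RDD again
val backToRDD = peopleDS.rdd
```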
DataSet

In Spark 2.0, the Dataset offers two quite different API flavors: a strongly-typed API and a weakly-typed (untyped) API, as shown in the table below. A DataFrame is a special Dataset whose rows are weakly-typed JVM objects. A Dataset, by contrast, is a collection of strongly-typed JVM objects, defined via a Scala case class or a Java class.

Strongly-typed and weakly-typed APIs
| Language | Main Abstraction |
| --- | --- |
| Scala | Dataset[T] & DataFrame (alias for Dataset[Row]) |
| Java | Dataset<T> |
| Python* | DataFrame |
| R* | DataFrame |
*Note: Python and R do not have compile-time type safety, so they offer only the weakly-typed API, the DataFrame.

The benefits of the Dataset API

As a Spark developer, you gain the following benefits from the unified DataFrame and Dataset API in Spark 2.0:
1. Static typing and run-time type safety

Viewing static typing and run-time type safety as a spectrum, SQL is the least restrictive and the Dataset is the most restrictive. For example, a syntax error in a Spark SQL query string is not discovered until run time, which is costly; with DataFrames and Datasets, such errors are caught at compile time, saving development time and cost. The Dataset API is built on lambda functions and typed JVM objects, so any mismatch of typed parameters is flagged during compilation. Using Datasets therefore saves development time.
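A minimal sketch of the distinction, assuming the `spark` session, `import spark.implicits._`, and the `Dataset[DeviceIoTData]` named `ds` that are introduced later in this article; the failing variants are left commented out:

```scala
// A SQL string with a typo is only rejected at run time, when Spark
// parses and analyzes the query:
// spark.sql("selct temp from iot_devices")    // run-time parse/analysis error

// An untyped DataFrame expression names columns with strings, so a typo
// is also only caught at run time:
// ds.toDF().select($"tempp")                  // run-time AnalysisException

// A typed Dataset lambda receives a DeviceIoTData, so a misspelled field
// does not compile at all:
// ds.map(d => d.tempp)                        // compile-time error

val warm = ds.filter(d => d.temp > 25)         // fully checked at compile time
```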
2. High-level abstractions and a custom view over structured and semi-structured data

A DataFrame is a special case of Dataset[Row] that provides a structured view over semi-structured data. For example, suppose you have a massive dataset of IoT device events expressed as JSON. JSON is a semi-structured format, and you can give it a custom typed view as a Dataset: Dataset[DeviceIoTData].
{ "device_id":198164, "device_name":"SENSOR-PAD-198164OWOMCJZ", "IP":"80.55.20.25", "CCA2":"PL", "CCA3":"POL", "cn":"Poland", "Latitude":53.08, "Longitude":18.62, " Scale":"Celsius", "Temp": +, "Humidity": $, "Battery_level":8, "C02_level":1408, "LCD":"Red", "timestamp":1458081226051 }
Use Scala to define a case class, DeviceIoTData, that models this JSON data:
case class DeviceIoTData (battery_level: Long, c02_level: Long, cca2: String, cca3: String, cn: String, device_id: Long, device_name: String, humidity: Long, ip: String, latitude: Double, lcd: String, longitude: Double, scale: String, temp: Long, timestamp: Long)
Next, read the data from the JSON file:
// read the JSON file and create the dataset from the
// case class DeviceIoTData
// ds is now a collection of JVM Scala objects DeviceIoTData
val ds = spark.read.json("/databricks-public-datasets/data/iot/iot_devices.json").as[DeviceIoTData]
Three things happen when Spark executes this code:
- Spark reads the JSON file, infers its schema, and creates a DataFrame;
- At this stage the data lives in a DataFrame = Dataset[Row], a collection of generic Row objects, because the exact type is not yet known;
- Spark then converts Dataset[Row] into Dataset[DeviceIoTData], so that each record is a DeviceIoTData Scala JVM object (see the sketch below).
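To make those steps visible, here is a small sketch that separates the untyped DataFrame from the typed Dataset; the local SparkSession setup is an assumption added for illustration, and the file path is the one used above:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder().appName("iot-read").master("local[*]").getOrCreate()
import spark.implicits._

// Steps 1-2: read the JSON and infer the schema; the result is an
// untyped DataFrame, i.e. Dataset[Row]
val df: DataFrame = spark.read.json("/databricks-public-datasets/data/iot/iot_devices.json")
df.printSchema()

// Step 3: attach the case class to obtain a strongly-typed Dataset
val ds: Dataset[DeviceIoTData] = df.as[DeviceIoTData]
ds.show(5)
```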
3. Easy-to-use API

Although structured data places some restrictions on how a Spark program can operate on a dataset, it introduces rich semantics and an easy-to-use, domain-specific language. Most computations can be expressed with the Dataset's high-level API. For example, operations such as agg, select, avg, map, filter, or groupBy can run directly against a Dataset of typed DeviceIoTData objects, and expressing the computation in the domain-specific API is straightforward. For example, use filter() and map() to create another Dataset:
// Use filter(), map(), groupBy() country, and compute avg()
// for temperatures and humidity. This operation results in
// another immutable Dataset. The query is simpler to read,
// and expressive
val dsAvgTmp = ds.filter(d => d.temp > 25).map(d => (d.temp, d.humidity, d.cca3)).groupBy($"_3").avg()

// display the resulting dataset
display(dsAvgTmp)
4. Performance and optimization
There are two reasons why the DataFrame and Dataset APIs bring space efficiency and performance optimizations:
First, the DataFrame and Dataset APIs are built on top of the Spark SQL engine, which uses the Catalyst optimizer to generate optimized logical and physical query plans. The R, Java, Scala, and Python DataFrame/Dataset APIs all feed into the same optimizer, so identical queries are optimized identically and gain the same space and speed efficiency.
Second, because Spark, acting as a compiler, understands the JVM object types in a Dataset, it maps those type-specific JVM objects onto Tungsten's memory management using Encoders. Tungsten's Encoders serialize and deserialize JVM objects efficiently and generate compact bytecode that executes faster.
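To peek at what Catalyst and Tungsten produce, a hedged sketch using the dsAvgTmp query built above:

```scala
import org.apache.spark.sql.Encoders

// Print the parsed, analyzed, and optimized logical plans plus the
// physical plan that Catalyst generates for the query above.
dsAvgTmp.explain(true)

// The encoder Spark derives for the case class; it maps DeviceIoTData
// objects into Tungsten's compact binary format.
val deviceEncoder = Encoders.product[DeviceIoTData]
println(deviceEncoder.schema.treeString)
```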
When should you use a DataFrame or a Dataset?
- You want rich semantics, high-level abstractions, and a domain-specific language API: use a DataFrame or Dataset;
- Your processing of semi-structured data calls for high-level expressions such as filter, map, aggregation, average, sum, SQL queries, column access, and lambda functions: use a DataFrame or Dataset (see the SQL sketch after this list);
- You want to benefit from compile-time type safety, Catalyst optimization, and Tungsten code generation: use a DataFrame or Dataset;
- You want a unified and simplified API across Spark's libraries: use a DataFrame or Dataset;
- If you are an R user: use DataFrames;
- If you are a Python user: use DataFrames.
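As a concrete illustration of the SQL-query and column-access point above, a minimal sketch, assuming the `spark` session, `import spark.implicits._`, and the `Dataset[DeviceIoTData]` `ds` from earlier; the view name and thresholds are illustrative:

```scala
// Register the Dataset as a temporary view so it can be queried with SQL
ds.createOrReplaceTempView("iot_devices")

// Declarative SQL over the same data
val avgC02ByCountry = spark.sql(
  "SELECT cn, avg(c02_level) AS avg_c02 FROM iot_devices GROUP BY cn ORDER BY avg_c02 DESC")
avgC02ByCountry.show(10)

// The equivalent mix of a typed lambda filter and column access on the Dataset itself
val lowBattery = ds.filter(d => d.battery_level < 2).select($"device_name", $"cn", $"battery_level")
lowBattery.show(10)
```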
You can seamlessly convert a DataFrame or Dataset back into an RDD with a simple call to .rdd:
// select specific fields from the Dataset, apply a predicate
// using the where() method, convert to an RDD, and show first 10
// RDD rows
val deviceEventsDS = ds.select($"device_name", $"cca3", $"c02_level").where($"c02_level" > 1300)

// convert to RDDs and take the first 10 rows
val eventsRDD = deviceEventsDS.rdd.take(10)
Summary
From the analysis above, it should be clear when to choose an RDD, a DataFrame, or a Dataset. RDDs suit applications that need low-level functional programming and fine-grained control over the dataset; DataFrames and Datasets suit structured data, offering high-level, domain-specific language (DSL) programming together with better space efficiency and speed.
Figure: Apache Spark 2.0's three APIs, RDD, DataFrame, and Dataset.