This article explains Spark's structured data processing, covering Spark SQL, DataFrame, Dataset, and the Spark SQL services. It focuses on Spark 1.6.x (at the time of writing, Spark 1.6.2 had just been released and a preview of Spark 2.0 had been published); because Spark evolves quickly, please refer to the official Spark SQL documentation for the latest information.
The article uses Scala to illustrate Spark SQL, and most of the code is meant to be run in spark-shell, so please keep that in mind.
Overview
Compared with the Spark RDD API, Spark SQL carries more information about structured data and the operations on it, and it uses this information to perform additional optimizations, making the processing of structured data more efficient and convenient.
There are several ways to use Spark SQL: the SQL interface, the DataFrame API, and the Dataset API. Regardless of the API or programming language, they all run on the same execution engine, so you can switch between them freely; each has its own characteristics, and you can choose whichever suits you.
SQL
One way to use Spark SQL is to run SQL queries as SQL statements. When SQL is used from within a programming language, the result is returned as a DataFrame.
DataFrame
A DataFrame is a distributed collection in which data is organized into named columns. It is conceptually equivalent to a table in a relational database, but more heavily optimized under the hood. A DataFrame can be built from many data sources, such as an existing RDD, a structured file, an external database, or a Hive table.
The predecessor of DataFrame was SchemaRDD, which was renamed DataFrame starting with Spark 1.3.0. The main difference is that DataFrame no longer inherits directly from RDD, although it implements most of the RDD functionality; you can still call the .rdd method on a DataFrame to convert it to an RDD. An RDD can be thought of as a distributed collection of opaque objects whose detailed schema Spark does not know, whereas a DataFrame is a distributed collection of Row objects organized into named columns with detailed schema information, which allows Spark SQL to perform certain execution optimizations.
A DataFrame not only offers richer operators than an RDD; more importantly, it can benefit from plan optimization (thanks to the Catalyst optimizer), and the Tungsten project brings a significant boost to DataFrame execution efficiency (although Tungsten optimizations may also be added to the RDD API in later releases).
However, in some cases an RDD can express logic that is awkward to express with a DataFrame, which led to the later introduction of the Dataset API; the Dataset combines the benefits of RDDs and DataFrames.
For more on Tungsten optimizations, see: Project Tungsten: Let Spark Squeeze Hardware Performance to Its Limits.
Dataset
The Dataset is an experimental interface added in Spark 1.6 that aims to combine the benefits of RDDs (strong typing, meaning type-safety checks at compile time, and the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. You can construct a Dataset from JVM objects and then manipulate it with functional transformation operators (map/flatMap/filter, and so on), much like an RDD.
The Dataset implements a custom serialization format through Encoders, which lets certain operations run directly on serialized data without deserialization. In addition, the Dataset benefits from many performance optimizations, including Tungsten.
In fact, the Dataset covers the functionality of the DataFrame, which creates considerable redundancy, so Spark 2.0 will unify them: the Dataset API is kept, and DataFrame becomes Dataset[Row], i.e. a special case of Dataset.
API Evolution
Spark has developed rapidly, moving from the original RDD API, to the DataFrame API, to the arrival of the Dataset in a surprisingly short time, with substantial performance improvements along the way.
When choosing an API, prefer DataFrame & Dataset: they perform well and will benefit from future optimizations, while the RDD API is maintained mainly for compatibility with programs written for earlier versions. Subsequent Spark libraries, such as MLlib, Streaming, and GraphX, will all move toward DataFrame & Dataset.
For a more detailed discussion of these three APIs and how to choose between them, see: Apache Spark: RDD, DataFrame or Dataset?
Entry Point
Starting with SQLContext
To use Spark SQL, you first have to create a SQLContext object; all it needs is an existing SparkContext, as follows:
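A minimal sketch of creating a SQLContext (in spark-shell, sc and sqlContext are already provided, so this is only needed in a standalone program):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("SparkSQLExample")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Import implicits to enable the $-notation and RDD-to-DataFrame conversions
import sqlContext.implicits._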
Alternatively, you can use HiveContext, a superset of SQLContext that provides some additional functionality: the HiveQL parser, access to Hive user-defined functions, and the ability to read data from Hive tables. You do not need an existing Hive installation to use HiveContext. Future versions of Spark SQL may gradually narrow the gap between SQLContext and HiveContext.
SQLContext currently offers only a simple SQL parser, whereas HiveContext can use both the HiveQL and SQL parsers, with HiveQL as the default. The default parser can be changed with the following setting:
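A sketch of switching the parser via the spark.sql.dialect option (shown here on a HiveContext):

// "hiveql" is the default dialect for HiveContext; "sql" selects the simple SQL parser
sqlContext.setConf("spark.sql.dialect", "sql")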
For now, however, the HiveQL parser is more complete, so HiveQL is recommended.
Creating and Using DataFrames
With a SQLContext, applications can create DataFrames from an existing RDD, a structured file, an external database, or a Hive table. The following code creates a DataFrame from a JSON file that ships with the Spark release package:
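A minimal sketch using the people.json sample file bundled with the Spark distribution:

val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()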
DataFrame provides a domain-specific language (DSL) that makes it convenient to manipulate structured data. Here are a few usage examples; for the complete set of DataFrame operations, see org.apache.spark.sql.DataFrame in the Spark API documentation:
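A few DSL operations on the DataFrame created above, sketched after the official examples:

// Print the schema in a tree format
df.printSchema()
// Select only the "name" column
df.select("name").show()
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// Select people older than 21
df.filter(df("age") > 21).show()
// Count people by age
df.groupBy("age").count().show()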
In addition, the org.apache.spark.sql.functions singleton object contains further functions for operating on DataFrames, mainly in these areas: aggregation, collection operations, string processing, sorting, date calculations, common mathematical functions, checksum operations, window operations, user-defined function support, and so on.
Executing SQL queries in a program
We can execute a SQL query from a program with SQLContext.sql(); the result is returned as a DataFrame.
We can also register a DataFrame as a temporary table and then run SQL queries against it; the lifetime of a temporary table ends when the SQLContext it is associated with goes away. Examples are as follows:
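A sketch of registering the DataFrame above as a temporary table and querying it:

df.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
teenagers.show()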
Creating and Using Datasets
A Dataset is similar to an RDD, but instead of using Java serialization or the Kryo library, it serializes objects with dedicated Encoders. Encoders are usually generated automatically (in Scala, via implicit conversions), and their serialization format allows Spark to perform many operations, such as filtering, sorting, and hashing, directly on the serialized bytes without deserializing them.
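A small sketch of creating and transforming Datasets, following the Spark 1.6 examples (run in spark-shell after import sqlContext.implicits._):

// Encoders for common types are provided automatically by the implicits
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // returns Array(2, 3, 4)

// Encoders are also generated implicitly for case classes
case class Person(name: String, age: Long)
val ds = Seq(Person("Andy", 32)).toDS()
ds.show()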
Interoperating with RDDs
Spark SQL supports two different ways to convert an RDD to a DataFrame. The first uses reflection to infer the schema of an RDD that contains objects of a specific type; this keeps your code concise, but requires that you know the schema when writing the Spark program (for example, the objects in the RDD are instances of a case class you defined yourself). The second goes through a programmatic interface: you construct a schema and apply it to an existing RDD to convert it to a DataFrame, which suits cases where the columns and their types are not known until run time.
Inferring the Schema with Reflection
The Scala interface for Spark SQL supports automatically converting an RDD that contains case classes to a DataFrame. The case class defines the schema of the table, and the parameter names of the class are read via reflection and become the column names. Case classes can also be nested or contain complex types such as sequences or arrays. Examples are as follows:
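A sketch of the reflection-based conversion, using the people.txt sample file shipped with Spark:

// sc is an existing SparkContext; sqlContext is an existing SQLContext
import sqlContext.implicits._

case class Person(name: String, age: Int)

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")

val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)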
Programmatically Specifying the Schema
When a case class cannot be defined in advance (for example, the structure of a record is encoded in a string, or a text dataset needs to be parsed into different fields for different users), an RDD can be converted to a DataFrame in the following three steps (see the sketch after this list):
Create an RDD of Row objects from the original RDD.
Create a StructType that matches the structure of the Rows from step 1 to represent the schema.
Apply the schema to the RDD of Rows from step 1 via SQLContext.createDataFrame().
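A sketch of the three steps, again using the bundled people.txt file:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string, e.g. read from user input or a config file
val schemaString = "name age"

// Step 1: convert records of the original RDD to Rows
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Step 2: build a StructType matching the structure of the Rows
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true)))

// Step 3: apply the schema to the RDD of Rows
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")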
Data Sources
A DataFrame can be operated on as a standard RDD, and it can also be registered as a temporary table, which allows you to run SQL queries against it. The DataFrame interface can handle a variety of data sources, and Spark SQL has built-in support for several very useful formats, including JSON, Parquet, and JDBC, with Parquet as the default. In addition, when a SQL query against these data sources uses only a subset of the fields, Spark SQL is smart enough to scan only the fields that are actually used.
Generic Load/Save Functions
DataFrameReader and DataFrameWriter wrap up the common load and save functions, all of which use Parquet as the default data format. Examples are as follows:
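A sketch using the default (Parquet) format and the users.parquet sample file shipped with Spark:

val df = sqlContext.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")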
Specify options manually
You can also manually specify the data source, along with any extra options you want to pass to it. A data source is usually specified by its fully qualified name (such as org.apache.spark.sql.parquet), but built-in data sources can also be referred to by their short names (json, parquet, jdbc), and data can be converted between the different source types. Examples are as follows:
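A sketch that reads JSON and writes Parquet by naming the formats explicitly:

val df = sqlContext.read.format("json").load("examples/src/main/resources/people.json")
df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")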
Querying files directly with SQL
Instead of using the read API to load a file into a DataFrame and then querying it, you can also run SQL statements directly against the file:
val df = sqlContext.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
Save to persistent table
You can use DataFrameWriter.saveAsTable() to save a DataFrame as a persistent table; depending on the settings, the table may be saved in a Hive-compatible format or in a Spark SQL-specific format. See the API documentation for details.
We can load such a table back into a DataFrame with SQLContext.table() or DataFrameReader.table().
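A minimal sketch, assuming sqlContext is a HiveContext so the table is actually persisted (the table name people_table is just an illustrative choice):

df.write.saveAsTable("people_table")
val saved = sqlContext.table("people_table")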
Parquet file
Parquet is a columnar storage format supported by many other data processing systems. It stores records with nested fields efficiently and supports all Spark SQL data types. Spark SQL automatically preserves the schema of the original data when reading and writing Parquet files. For compatibility reasons, all columns are automatically converted to nullable when a Parquet file is written.
Loading data
The following is an example:
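A sketch of writing and reading Parquet, reusing the people DataFrame from earlier:

val people = sqlContext.read.json("examples/src/main/resources/people.json")
people.write.parquet("people.parquet")

// Parquet files are self-describing, so the schema is preserved
val parquetFile = sqlContext.read.parquet("people.parquet")
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)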
Partition discovery
Table partitioning is a common optimization in many systems such as Hive. In a partitioned table, data is usually stored in separate directories, with the partition column names and values encoded in each partition's directory name. The Parquet data source can discover and infer partition information automatically. The following is the directory structure of a population table partitioned by the two columns gender and country:
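A sketch of such a directory layout (mirroring the example in the official documentation):

path
└── to
    └── table
        ├── gender=male
        │   ├── country=US
        │   │   └── data.parquet
        │   └── country=CN
        │       └── data.parquet
        └── gender=female
            ├── country=US
            │   └── data.parquet
            └── country=CN
                └── data.parquet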
When path/to/table is read with SQLContext.read.parquet or SQLContext.read.load, Spark SQL automatically extracts the partition information from the paths, and the returned DataFrame has schema information like the following:
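Assuming the data files contain name and age columns, the inferred schema would look roughly like this:

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)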
In the schema above, the data types of the partition columns are inferred automatically; currently, numeric types and string types are supported. If you do not want the types to be inferred automatically, you can use the option spark.sql.sources.partitionColumnTypeInference.enabled, which defaults to true; setting it to false disables type inference, and the string type is used for partition columns.
Starting with Spark 1.6.0, partition discovery by default only finds partitions under the given path. If the user passes path/to/table/gender=male as the path to read from, gender will not be treated as a partition column. You can set basePath in the data source options to specify the base path at which partition discovery should start. For example, if path/to/table/gender=male is still used as the data path but basePath is set to path/to/table/, gender will be treated as a partition column.
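A one-line sketch of the basePath option:

val df = sqlContext.read.option("basePath", "path/to/table/").parquet("path/to/table/gender=male")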
Schema Merging
Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. This means you can start with a simple schema and gradually add columns to build up a complex one. That approach can leave schema information scattered across different Parquet files, and the Parquet data source can detect this automatically and merge the schemas of all those files.
However, because schema merging is a relatively expensive operation and is unnecessary in most cases, it is turned off by default starting with Spark 1.5.0. To turn it on: when reading Parquet files, set the data source option mergeSchema to true, or set the global SQL option spark.sql.parquet.mergeSchema to true. Examples are as follows:
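A sketch of schema merging across two partition directories, following the official example (the data/test_table path is just an illustrative location):

import sqlContext.implicits._

// Create a simple DataFrame and store it in a partition directory
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")

// Create another DataFrame in a new partition directory, adding a new column and dropping an existing one
val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")

// Read the partitioned table; the merged schema contains single, double, triple, and the partition column key
val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table")
df3.printSchema()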
JSON Data
SQLContext.read.json() can load an RDD[String] or JSON files and convert them to a DataFrame.
One thing to note is that the JSON file used here is not an arbitrary, typical JSON file: each line must be a valid, self-contained JSON object, and an object that spans multiple lines will cause a failure. The same applies to the RDD[String].
The following is an example:
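A sketch of loading JSON from both a file and an RDD[String]:

// Load from a JSON file where each line is a separate JSON object
val people = sqlContext.read.json("examples/src/main/resources/people.json")
people.printSchema()
people.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")

// Alternatively, create a DataFrame from an RDD[String] holding one JSON object per element
val anotherPeopleRDD = sc.parallelize("""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)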
Databases
Spark SQL can also use JDBC to read data from other databases. You should prefer this over JdbcRDD, because it returns the data as a DataFrame, which is much easier to work with. Note, however, that this is different from the Spark SQL JDBC service: the Spark SQL JDBC service allows other applications to run queries through a JDBC connection to Spark SQL.
To connect to a particular database from Spark SQL, you first need to put your database's JDBC driver on the classpath via the SPARK_CLASSPATH environment variable. For example, to connect to a MySQL database from spark-shell, you can use a command like the following:
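A sketch of such a command (the driver jar path below is an assumed example; adjust it to wherever your MySQL connector jar lives):

SPARK_CLASSPATH=/path/to/mysql-connector-java-bin.jar ./bin/spark-shell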
Data can be loaded from a database into a DataFrame with SQLContext.read.format("jdbc").options(...).load() or with SQLContext.read.jdbc(...). In the following example, we load data from a MySQL database:
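A sketch of reading a MySQL table over JDBC (the host, database, table, user, and password below are illustrative placeholders):

val jdbcDF = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:mysql://localhost:3306/test?user=root&password=secret",
  "dbtable" -> "people",
  "driver" -> "com.mysql.jdbc.Driver"
)).load()
jdbcDF.show()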
The JDBC driver jar needs to be present at the same path on all nodes, because, like Spark core itself, the work is distributed.
Some databases require names to be referenced in uppercase; for example, Oracle requires uppercase table names.
Distributed SQL engine
Spark SQL can also act as a distributed SQL engine: you can connect to it over JDBC/ODBC or through the Spark SQL command-line interface and run interactive SQL queries directly.
Running the Thrift JDBC/ODBC Server
The Thrift JDBC/ODBC server implemented in Spark SQL corresponds to HiveServer2 in Hive. You can start the JDBC/ODBC service with the following command; by default it listens on localhost:10000:
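The launch script shipped in the Spark distribution:

./sbin/start-thriftserver.sh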
This script accepts all of the options that the spark-submit command accepts, and additionally supports --hiveconf property=value options for configuring Hive properties. You can run ./sbin/start-thriftserver.sh --help for a complete list of options.
You can use beeline to connect to a running Spark SQL engine. The command is as follows:
Connecting with beeline requires a username and password. In non-secure mode, simply enter your own username and leave the password empty. For secure mode, see the beeline documentation.
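A sketch of connecting with the beeline client bundled with Spark (the host and port assume the default localhost:10000):

./bin/beeline
beeline> !connect jdbc:hive2://localhost:10000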
Running the Spark SQL CLI
The Spark SQL CLI makes it convenient to query Hive through the Hive metastore from within Spark SQL. So far, the Spark SQL CLI cannot communicate with the Thrift JDBC/ODBC server. The script is mainly useful for local development; on a shared cluster, each user should instead connect to the Thrift JDBC/ODBC server.
Before using the Spark SQL CLI, note the following:
Copy the hive-site.xml configuration file to the $SPARK_HOME/conf directory.
Add the JDBC driver jar to SPARK_CLASSPATH in ./conf/spark-env.sh.
The command to start the Spark SQL CLI is as follows:
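The launch script from the Spark distribution:

./bin/spark-sql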
When spark-sql is started without specifying a master, it runs in local mode; the master can be set to either a standalone master address or yarn. When the master is set to yarn (spark-sql --master yarn), you can monitor the whole job execution through the http://master:8088 page. If the Spark master is configured in ./conf/spark-defaults.conf, there is no need to specify it when starting spark-sql. Once spark-sql has started, you can type SQL statements at the prompt.
For the available commands and options of the Spark SQL CLI, run ./bin/spark-sql --help.
Performance tuning
Cache data in memory
By calling SQLContext.cacheTable("tableName") or DataFrame.cache(), Spark SQL can cache a table in memory in a columnar format. Spark SQL will then scan only the columns that are used, and it automatically compresses the data to reduce memory consumption and GC overhead. You can call SQLContext.uncacheTable("tableName") to remove a cached table from memory. Alternatively, in SQL/HiveQL you can cache and uncache tables with CACHE TABLE tableName and UNCACHE TABLE tableName.
The motivation is the same as for caching RDDs: if you will run many tasks or queries over the same data, that data is worth caching.
Caching data in memory also supports a number of options; see the Performance Tuning section of the official Spark SQL documentation.
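A sketch of caching and uncaching the temporary table registered earlier:

sqlContext.cacheTable("people")
// ... run repeated queries against the cached table ...
sqlContext.sql("SELECT COUNT(*) FROM people").show()
sqlContext.uncacheTable("people")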
Other Tuning-Related Parameters
There are other parameters that can be used to optimize query performance; see the Performance Tuning section of the official Spark SQL documentation.
End.