Spark SQL Programming Guide (Python)

Preface

Spark SQL allows us to perform relational queries with SQL or Hive SQL in the Spark environment. Its core is a special type of Spark RDD: SchemaRDD. A SchemaRDD is similar to a table in a traditional relational database and consists of two parts:

Rows: the data itself, as Row objects;
Schema: the description of a data row, i.e. column names, column data types, whether a column is nullable, and so on.

A SchemaRDD can be created in four ways:

(1) from an existing RDD;
(2) from a Parquet file;
(3) from a JSON dataset;
(4) by running Hive SQL against Hive tables.

Since Parquet files are not yet used on our platform, only the other three ways are discussed here for the time being.

Note:
Spark SQL is currently an alpha component.
SQLContext (HiveContext)

The entry point of Spark SQL is SQLContext, and initializing a SQLContext requires a SparkContext. SQLContext currently uses only a simple SQL parser with limited functionality, and since many data warehouses are built on top of Hive, Spark offers another option: HiveContext. HiveContext uses the more complete HiveQL parser, can call Hive UDFs, can access the data of an existing Hive data warehouse, and also supports every data source that SQLContext supports, so HiveContext is the recommended choice. Its initialization is similar to that of SQLContext; both are shown in the sketch below.
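The original code listing is not reproduced here; the following is a minimal sketch of both initializations under the Spark 1.2 Python API (the application name is a placeholder):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext

# A SparkContext is the prerequisite for either SQL context.
conf = SparkConf().setAppName("SparkSQLExample")  # placeholder application name
sc = SparkContext(conf=conf)

# SQLContext: simple SQL parser, limited functionality.
sqlContext = SQLContext(sc)

# HiveContext: HiveQL parser, Hive UDFs, access to an existing Hive warehouse,
# and support for every data source that SQLContext supports.
hiveContext = HiveContext(sc)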
Data Source

The data source of Spark SQL (of a SchemaRDD) can simply be understood as an ordinary Spark RDD: every operation that can be applied to a Spark RDD can also be applied to a SchemaRDD. In addition, a SchemaRDD can be "registered" as a temporary table and its data can then be analyzed with SQL (Hive SQL) statements; the data being analyzed is still the data held by the underlying Spark RDD.
SchemaRDD

The data source of a SchemaRDD is in fact a Spark RDD, but a plain Spark RDD differs from a SchemaRDD: it lacks a "schema". Spark therefore provides two ways of converting a Spark RDD into a SchemaRDD, which in effect means attaching a "schema" to the Spark RDD.

(1) Inferring the schema using reflection

If the element type of a Spark RDD is Row, Spark can infer the schema of that RDD through reflection and convert it into a SchemaRDD. When Spark infers the schema by reflection it only uses the first element (Row) of the RDD, so the completeness of the data must be ensured. Building a Row requires a list of key-value pairs, for example:

Row(id=1, name="a", age=28)

This list of key-value pairs already defines the column names and column values of the data row, so only the column types need to be inferred.

The processing logic of the code sample (see the sketch below) can be divided into the following steps:

a. create a list of strings, datas, to simulate the data source;
b. "parallelize" datas into the Spark RDD source, whose element type is string;
c. split each element of source and build Row objects, giving the Spark RDD rows, whose element type is Row; rows now satisfies the condition for conversion to a SchemaRDD: its element type is Row;
d. use HiveContext to infer the schema of rows and convert it into the SchemaRDD people; people.printSchema() shows the result of the inferred schema;
e. register the SchemaRDD people as the temporary table "people";
f. execute the SQL query select * from people where age > 28 and age < 30 and save the result as the Spark RDD results; the output of results.printSchema() shows that results is itself a SchemaRDD, so we can go on and register it as a temporary table as well;
g. register the SchemaRDD results as the temporary table "people2", execute the SQL query select name from people2, and save the result as the Spark RDD results2; from f we already know that results2 is a SchemaRDD, and results2.printSchema() shows that its element type is Row and that, because of the query statement (select name), it contains only one column, named name;
h. a SchemaRDD also supports all Spark RDD operations; here we use map to convert the name values in results2 to uppercase and print the final output.

The example above illustrates three points:

a. a Spark RDD whose element type is Row can be converted into a SchemaRDD;
b. a SchemaRDD can be registered as a temporary table against which SQL queries are executed, and the query result is again a SchemaRDD;
c. a SchemaRDD supports all Spark RDD operations.
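The original code listing is not reproduced here; the sketch below follows steps a through h using the Spark 1.2 Python API. The format of the simulated data (space-separated id, name and age) and the application name are assumptions:

from pyspark import SparkContext
from pyspark.sql import HiveContext, Row

sc = SparkContext(appName="InferSchemaExample")  # placeholder application name
hiveContext = HiveContext(sc)

# a. a small list of strings standing in for the real data source
datas = ["1 a 28", "2 b 29", "3 c 30"]

# b. parallelize it into an RDD of strings
source = sc.parallelize(datas)

# c. split each line and build Row objects; column names are given here,
#    column types are inferred later from the first Row
rows = source.map(lambda line: line.split(" ")) \
             .map(lambda words: Row(id=int(words[0]), name=words[1], age=int(words[2])))

# d. infer the schema by reflection, obtaining a SchemaRDD
people = hiveContext.inferSchema(rows)
people.printSchema()

# e. register it as a temporary table
people.registerTempTable("people")

# f. query it; the result is again a SchemaRDD
results = hiveContext.sql("select * from people where age > 28 and age < 30")
results.printSchema()

# g. a query result can itself be registered as a temporary table
results.registerTempTable("people2")
results2 = hiveContext.sql("select name from people2")
results2.printSchema()

# h. SchemaRDDs still support ordinary RDD operations
print(results2.map(lambda row: row.name.upper()).collect())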
(2) Specifying the schema by encoding

Inferring the schema by reflection requires that we can build a Spark RDD whose element type is Row and then convert it into a SchemaRDD. In some cases we may need a more flexible way of controlling the construction of the SchemaRDD, and that is what specifying the schema by encoding is for. It is done in three steps:

a. build a Spark RDD whose element type is tuple or list;
b. build a schema that matches the tuples or lists in a;
c. apply the schema from b to the Spark RDD from a.

The processing logic of the code sample (see the sketch below) corresponds to these three steps. Note that id and age are declared as IntegerType, so the string values coming from the data source must be explicitly converted to that type.
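A minimal sketch of the three steps using the Spark 1.2 Python API; the data format and names are the same assumptions as in the previous sketch:

from pyspark import SparkContext
from pyspark.sql import HiveContext, StructType, StructField, StringType, IntegerType

sc = SparkContext(appName="ApplySchemaExample")  # placeholder application name
hiveContext = HiveContext(sc)

# a. an RDD whose elements are tuples; every field is still a string here
source = sc.parallelize(["1 a 28", "2 b 29", "3 c 30"])
tuples = source.map(lambda line: tuple(line.split(" ")))

# b. a schema that matches the tuples field by field; id and age are declared
#    as IntegerType, so the string values must be converted explicitly
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# the forced type conversion happens here, before the schema is applied
typed = tuples.map(lambda t: (int(t[0]), t[1], int(t[2])))

# c. apply the schema to the RDD, obtaining a SchemaRDD
people = hiveContext.applySchema(typed, schema)
people.printSchema()
people.registerTempTable("people")
print(hiveContext.sql("select name, age from people").collect())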
JSON Datasets

Spark can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. This "automatic" behaviour is provided by two methods:

jsonFile: loads data from a directory of files in which every line is a JSON string (if a JSON string spans several lines, parsing may fail);
jsonRDD: loads data from an existing RDD in which every element is a JSON string.

Two points can be drawn from the code example:

a. if the input data is a text file of JSON strings, we can use jsonFile to build a Spark RDD directly, and it is in fact a SchemaRDD;
b. if the element type of a Spark RDD is string and the strings are in JSON format, we can convert it into a SchemaRDD with jsonRDD.
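The original code listing is not reproduced here; a minimal sketch of both methods using the Spark 1.2 Python API (the file path, field names and application name are placeholders):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="JsonExample")  # placeholder application name
hiveContext = HiveContext(sc)

# jsonFile: every line of every file under the path must be one complete JSON string
people = hiveContext.jsonFile("/path/to/people.json")  # placeholder path
people.printSchema()
people.registerTempTable("people")

# jsonRDD: every element of the RDD must be one complete JSON string
jsonStrings = sc.parallelize([
    '{"id": 1, "name": "a", "age": 28}',
    '{"id": 2, "name": "b", "age": 29}',
])
people2 = hiveContext.jsonRDD(jsonStrings)
people2.printSchema()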
Hive Tables

Hive tables are already "tables", so there is nothing to create or convert; we simply query them with SQL. The official code example is sketched below.
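A sketch along the lines of the example in the official Spark documentation; the table name src and the sample data path come from that example and are assumptions with respect to any concrete environment:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="HiveTablesExample")  # placeholder application name
hiveContext = HiveContext(sc)

# the table lives in the Hive warehouse, so it is queried directly with HiveQL
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
results = hiveContext.sql("FROM src SELECT key, value").collect()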
Hive UDF (register function)

Spark SQL supports Hive UDFs when HiveContext is used. "UDF" here covers both the UDFs built into Hive itself and our own extended UDFs (permanent functions); the latter do not work properly in the actual spark-1.2.0-cdh5.3.2 version and were fixed by patching the source code. This section focuses on Spark SQL's register function mechanism, which lets us dynamically create functions for use in SQL queries; they effectively work like Hive UDFs.

The processing logic of the code sample is similar to the previous ones: first create the SchemaRDD people by encoding and register it as a table (note that there is another way to do this: HiveContext registerRDDAsTable), then execute the query statement and print the result. The special part is that the query statement uses a custom SQL function named "myFunc" that does not exist beforehand (unlike a Hive UDF); it is created and registered dynamically while our application is running, using HiveContext registerFunction.

For Python, creating a custom function can be split into two steps: (1) define a Python function; (2) register the Python function defined in (1) for use in SQL. The registered name may differ from the name of the function. A lambda expression can also be used to combine the definition and the registration in one step, as in the example above.

Our custom SQL functions can also be used together with Hive UDFs; in the original example, func.iptolocationbysina is a Hive UDF (permanent function) and mychange is a custom SQL function used in the same query.

As these two examples show, custom SQL functions are far more flexible than Hive UDFs. Creating a Hive UDF is complex: it has to be coded in Java, deployed as a jar, and declared as a temporary function or permanent function before it can be used, and every update of a Hive UDF means re-coding and re-deploying the jar. A custom SQL function, by contrast, is created dynamically at runtime, and creating or updating it is simple; it is the recommended approach when coding in Python.
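The original code listings are not reproduced here; the sketch below shows the register-function mechanism using the Spark 1.2 Python API. The table contents, the registered names myFunc and mychange, and the application name follow the description above; the Hive UDF func.iptolocationbysina is environment-specific and is omitted:

from pyspark import SparkContext
from pyspark.sql import HiveContext, StructType, StructField, StringType, IntegerType

sc = SparkContext(appName="RegisterFunctionExample")  # placeholder application name
hiveContext = HiveContext(sc)

# build the SchemaRDD "people" by encoding, as in the earlier sketch
rdd = sc.parallelize([(1, "a", 28), (2, "b", 29)])
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
people = hiveContext.applySchema(rdd, schema)

# the alternative registration method mentioned above
hiveContext.registerRDDAsTable(people, "people")

# (1) define a Python function, (2) register it under a SQL name;
# the registered name does not have to match the function name
def to_upper(name):
    return name.upper()

hiveContext.registerFunction("myFunc", to_upper, StringType())

# definition and registration can also be combined in one step with a lambda
hiveContext.registerFunction("mychange", lambda name: name + "_changed", StringType())

print(hiveContext.sql("select myFunc(name), mychange(name) from people").collect())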
Summary

Spark SQL gives us powerful data analysis capabilities, mainly in the following three areas:

(1) a Spark RDD can be converted into a SchemaRDD, either by inferring the schema through reflection or by specifying the schema by encoding, and the SchemaRDD can then be used as a "data table"; this lets us analyze the data with SQL statements and saves a lot of coding effort;
(2) Spark SQL allows us to dynamically create custom SQL functions on demand while the application is running, extending the data processing power of SQL;
(3) a SchemaRDD supports all Spark RDD operations, so if SQL cannot express our computation logic, we can fall back on the rich Spark RDD API.