Spark SQL Tutorial
Spark SQL lets relational queries expressed in SQL, HiveQL, or Scala be run on Spark. Its core component is a new type of RDD, SchemaRDD, which carries a schema describing the data type of every column in a row, much like a table in a relational database. A SchemaRDD can be created from an existing RDD or from a Parquet file, and, most importantly, it can read data from Hive using HiveQL.
Here are some examples that you can run in the Spark shell.
First we create the familiar context; as anyone who knows Spark is aware, once we have a context we can perform all kinds of operations.
val sc: SparkContext // an existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
Data Sources
Spark SQL supports operating on a variety of data sources through the SchemaRDD interface. Once a data set has been loaded, it can be registered as a table and even joined with data from other sources; a short sketch follows.
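For example (a minimal sketch, not in the original text, assuming a people table registered from an RDD and a parquetFile table loaded from Parquet, both as shown in the sections below), two sources can be combined in a single SQL query:

// Hypothetical sketch: join a table registered from an RDD ("people")
// with one registered from a Parquet file ("parquetFile").
// Both registrations are shown in the sections that follow.
val joined = sqlContext.sql(
  "SELECT p.name, q.age FROM people p JOIN parquetFile q ON p.name = q.name")
joined.collect().foreach(println)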
RDDs
One type of table that Spark SQL supports is an RDD of Scala case classes; the case class defines the schema of the table. Here is an example:
// sc is an existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD into a SchemaRDD
import sqlContext.createSchemaRDD

// A case class supports at most 22 fields in Scala 2.10; to get around this limit,
// it is best to define a class that implements the Product interface
case class Person(name: String, age: Int)

// Create an RDD of Person objects, then register it as a table
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

// SQL can be written directly; the sql method is provided by sqlContext
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// teenagers is a SchemaRDD; it supports all the normal RDD operations
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
The approach above is not ideal: for a table with dozens of fields you would have to assign every one of them, and the operations currently supported are fairly simple. To express more complex operations, look at the HiveQL support provided by HiveContext, described below. A sketch of the Product-interface workaround mentioned in the comment above also follows.
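The comment in the example notes that a Scala 2.10 case class is limited to 22 fields and suggests implementing the Product interface instead. A minimal sketch of that idea is shown here; the PersonRecord class and its three fields are hypothetical, and the same pattern extends to tables with far more than 22 columns, assuming Spark SQL reflects the constructor parameters of a Product class the same way it does for a case class.

// Hypothetical sketch: a row type implemented as a plain class with typed
// constructor parameters and the Product interface, instead of a case class.
class PersonRecord(val name: String, val age: Int, val city: String)
  extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[PersonRecord]
  def productArity: Int = 3
  def productElement(n: Int): Any = n match {
    case 0 => name
    case 1 => age
    case 2 => city
  }
}

An RDD[PersonRecord] could then be converted and registered just like the RDD[Person] above.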
Parquet Files
Parquet is a columnar storage format supported by many data processing systems. Parquet provides a column-oriented data representation with efficient compression, is available to every project in the Hadoop ecosystem, and is independent of any particular data processing framework, data model, or programming language. Spark SQL can read and write Parquet files and automatically preserves the schema of the original data.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD into a SchemaRDD
import sqlContext.createSchemaRDD

// people is the RDD[Person] from the example above
val people: RDD[Person] = ...

// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored in Parquet format
people.saveAsParquetFile("people.parquet")

// Read in the file created above; the result of loading a Parquet file is also a SchemaRDD
val parquetFile = sqlContext.parquetFile("people.parquet")

// Register it as a table, then use it in SQL statements
parquetFile.registerAsTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
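As a quick check (not part of the original example), the schema preservation mentioned above can be observed by printing the schema of the loaded file; the output shown in the comments is an assumption based on the Person case class.

// Hypothetical check: the schema stored in the Parquet file should match the
// Person case class that produced it.
parquetFile.printSchema()
// root
//  |-- name: StringType
//  |-- age: IntegerType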
JSON Datasets
JSON (JavaScript Object Notation) is a lightweight data interchange format. It is based on a subset of JavaScript (Standard ECMA-262, 3rd Edition, December 1999). JSON uses a completely language-independent text format, but adopts conventions familiar from the C family of languages (C, C++, C#, Java, JavaScript, Perl, Python, and so on). These features make JSON an ideal data interchange language: it is easy for humans to read and write, and easy for machines to parse and generate, which helps with network transfer.
Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. This conversion can be done with either of two methods on SQLContext:
jsonFile - loads data from a directory of JSON files, where each line of each file is a JSON object.
jsonRDD - loads data from an existing RDD, where each element of the RDD is a string containing a JSON object.
// sc is an existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// A JSON dataset is pointed to by a path;
// the path can be either a single text file or a directory storing text files
val path = "examples/src/main/resources/people.json"
// Create a SchemaRDD from the file(s) pointed to by the path
val people = sqlContext.jsonFile(path)

// The inferred schema can be visualized using the printSchema() method
people.printSchema()
// root
//  |-- age: IntegerType
//  |-- name: StringType

// Register this SchemaRDD as a table
people.registerAsTable("people")

// SQL statements can be run by using the sql method provided by sqlContext
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// Alternatively, a SchemaRDD can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string
val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
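As a small extension of the example above (not in the original text), the second SchemaRDD can also be registered as a table and its nested address object queried with dot notation, assuming the SQL parser resolves nested fields this way:

// Hypothetical follow-up: query a nested field of the JSON object
anotherPeople.registerAsTable("anotherPeople")
val cities = sqlContext.sql("SELECT name, address.city FROM anotherPeople")
cities.collect().foreach(println)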
Hive Tables
Spark SQL also supports reading and writing data stored in Apache Hive. However, Hive pulls in a large number of dependencies, and the default Spark assembly does not include them. You need to recompile by running SPARK_HIVE=true sbt/sbt assembly/assembly, or by adding the -Phive flag when building with Maven; this produces a new assembly jar that includes Hive, and that jar must then be placed on all worker nodes. In addition, hive-site.xml needs to be copied into the conf directory. If you have not deployed Hive, the examples below can also use LocalHiveContext instead of HiveContext.
val sc: SparkContext // an existing SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Importing this context makes its SQL methods implicitly available
import hiveContext._

hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
hql("FROM src SELECT key, value").collect().foreach(println)
Or it can be written in the following form:
// sc is an existing SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
hiveContext.hql("FROM src SELECT key, value").collect().foreach(println)
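If Hive is not deployed, the LocalHiveContext mentioned above can be dropped in instead. A minimal sketch, assuming the same Hive-enabled Spark build:

// Sketch only: LocalHiveContext works like HiveContext but creates a local
// metastore and warehouse in the current working directory, so no existing
// Hive installation is required.
val localHiveContext = new org.apache.spark.sql.hive.LocalHiveContext(sc)
localHiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")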
Writing Language-Integrated Relational Queries
Language-integrated relational queries are currently only supported in Scala.
Spark SQL also supports a domain-specific language for writing queries. Again, using the example data from above:
// sc is an existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// people is the RDD[Person] from the earlier example
val people: RDD[Person] = ...

// The following statement is the same as 'SELECT name FROM people WHERE age >= 10 AND age <= 19'
val teenagers = people.where('age >= 10).where('age <= 19).select('name)
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
The DSL (domain-specific language) uses Scala symbols, marked by a leading tick ('), to refer to columns in the underlying table. Implicit conversions turn these symbols into expressions that are evaluated by the SQL execution engine. A complete list of the supported features can be found in the ScalaDoc.
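As a further illustration (not from the original text, and assuming these operators behave as described in the ScalaDoc), several columns can be selected at once and combined with a filter:

// Hypothetical DSL usage with the same people table: filter on age and
// project two columns.
val adults = people.where('age > 20).select('name, 'age)
adults.map(t => t(0) + " is " + t(1) + " years old").collect().foreach(println)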