Spark SQL is a Spark module for processing structured data. It provides DataFrames as its programming abstraction and can also act as a distributed SQL query engine.
DataFrames
A DataFrame is a distributed collection of data organized into named columns. It is equivalent to a table in a relational database or a data frame in R/Python, but with many more optimizations under the hood. DataFrames can be constructed from structured data files, Hive tables, external databases, or existing RDDs.
1. The entry point:
The entry point is the SQLContext class or one of its subclasses. A SQLContext is created from a SparkContext; here we use PySpark, whose shell already provides the SparkContext as sc:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
You can also use HiveContext, which provides a superset of SQLContext's functionality: writing queries with the more complete HiveQL parser, using Hive UDFs, reading data from Hive tables, and so on. Using HiveContext does not require an existing Hive installation; by default Spark packages HiveContext separately to avoid excessive dependency on Hive.
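As a minimal sketch (assuming the same sc provided by the PySpark shell), a HiveContext is created the same way and can be used wherever a SQLContext is expected:

from pyspark.sql import HiveContext

# sc is the SparkContext provided by the PySpark shell.
sqlContext = HiveContext(sc)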
2. Creating DataFrames
Creating a DataFrame from a JSON file:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.json("examples/src/main/resources/people.json")

# Displays the content of the DataFrame to stdout
df.show()
Note:
Here you may need to upload the file to HDFS first (the file ships with the Spark installation directory, version 1.4):
hadoop fs -mkdir examples/src/main/resources/
hadoop fs -put /appcom/spark/examples/src/main/resources/* /user/hdpuser/examples/src/main/resources/
3. DataFrame operations
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Create the DataFrame
df = sqlContext.read.json("examples/src/main/resources/people.json")

# Show the content of the DataFrame
df.show()
## age  name
## null Michael
## 30   Andy
## 19   Justin

# Print the schema in a tree format
df.printSchema()
## root
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
## name
## Michael
## Andy
## Justin

# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
## name    (age + 1)
## Michael null
## Andy    31
## Justin  20

# Select people older than 21
df.filter(df['age'] > 21).show()
## age name
## 30  Andy

# Count people by age
df.groupBy("age").count().show()
## age  count
## null 1
## 19   1
## 30   1
4. Running SQL queries programmatically
SQLContext can run SQL queries programmatically and return the result as a DataFrame.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT * FROM table")
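Note that the table referenced in the query must be registered first. A minimal sketch, assuming df is the people DataFrame loaded in section 2; the table name people and the age filter are illustrative only:

# Register the DataFrame as a temporary table so SQL can reference it by name.
df.registerTempTable("people")

# Run a query against the registered table; the result is again a DataFrame.
adults = sqlContext.sql("SELECT name FROM people WHERE age > 20")
adults.show()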
5. Interoperating with RDDs
There are two ways to convert an RDD into a DataFrame:
- Use reflection to infer the schema of an RDD that contains objects of a specific type. This approach leads to more concise code and works well when you already know the schema.
- Use a programmatic interface to construct a schema and apply it to an existing RDD.
I. Inferring the schema using reflection
Spark SQL can convert an RDD of Row objects into a DataFrame and infer the data types. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys define the column names of the table, and the types are inferred by looking at the first row. (So the first row of the RDD must not have missing values.) Future versions may infer the data types by looking at more data, as is already done for JSON files.
# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")

# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
    print teenName
II. Programmatically specifying the schema
Programmatically specifying a schema takes three steps:
- Create an RDD of tuples or lists from the original RDD.
- Create a schema represented by a StructType that matches the structure of the tuples or lists in the RDD created in step one.
- Apply the schema to the RDD via the createDataFrame method provided by SQLContext.
# Import SQLContext and data types
from pyspark.sql import SQLContext
from pyspark.sql.types import *

# sc is an existing SparkContext.
sqlContext = SQLContext(sc)

# Load a text file and convert each line to a tuple.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD.
schemaPeople = sqlContext.createDataFrame(people, schema)

# Register the DataFrame as a table.
schemaPeople.registerTempTable("people")

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM people")

# The results of SQL queries are RDDs and support all the normal RDD operations.
names = results.map(lambda p: "Name: " + p.name)
for name in names.collect():
    print name
Spark SQL and DataFrame Guide (1.4.1) -- DataFrames