Spark SQL and DataFrame Guide (1.4.1)--Dataframes

Source: Internet
Author: User
Tags: reflection, pyspark, hadoop fs

Spark SQL is a Spark module for processing structured data. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

DataFrames

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from structured data files, Hive tables, external databases, or existing RDDs.

1. Entry point

The entry point is the SQLContext class or one of its subclasses, and a SQLContext is created from a SparkContext. Here we use PySpark, where the SparkContext is already available as sc:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

You can also use a HiveContext, which provides a superset of the functionality of SQLContext: for example, writing queries with the more complete HiveQL parser, using Hive UDFs, reading data from Hive tables, and so on.

Using a HiveContext does not require an existing Hive installation; Spark packages HiveContext separately only to avoid pulling all of Hive's dependencies into the default build.
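
As a minimal sketch (assuming a Spark build that includes Hive support), a HiveContext is created the same way as a SQLContext:

from pyspark.sql import HiveContext

# HiveContext takes the same SparkContext and adds HiveQL parsing and Hive UDF support.
sqlContext = HiveContext(sc)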

2. Creating DataFrames
Create a DataFrame from a JSON file:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()

Note:
You may need to copy the example file to HDFS first (it ships in the Spark 1.4 installation directory):

hadoop fs -mkdir examples/src/main/resources/
hadoop fs -put /appcom/spark/examples/src/main/resources/* /user/hdpuser/examples/src/main/resources/

3. DataFrame operations

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Create the DataFrame
df = sqlContext.read.json("examples/src/main/resources/people.json")

# Show the content of the DataFrame
df.show()
## age  name
## null Michael
## 30   Andy
## 19   Justin

# Print the schema in a tree format
df.printSchema()
## root
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
## name
## Michael
## Andy
## Justin

# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
## name    (age + 1)
## Michael null
## Andy    31
## Justin  20

# Select people older than 21
df.filter(df['age'] > 21).show()
## age name
## 30  Andy

# Count people by age
df.groupBy("age").count().show()
## age  count
## null 1
## 19   1
## 30   1

4. Running SQL queries programmatically
The sql method of SQLContext lets an application run SQL queries programmatically and returns the result as a DataFrame.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT * FROM table")
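
For such a query to find anything, a table must be registered with the SQLContext first. The following is a minimal sketch that reuses the people.json example file from above (the column names are those of that file):

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Register a DataFrame as a temporary table so SQL can refer to it by name.
people = sqlContext.read.json("examples/src/main/resources/people.json")
people.registerTempTable("people")

# The query result is again a DataFrame.
df = sqlContext.sql("SELECT name FROM people WHERE age IS NOT NULL")
df.show()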

5. Interoperating with RDDs

There are two ways to convert an RDD into a DataFrame:

    • Use reflection to infer the schema of an RDD that contains specific types of objects. This approach leads to more concise code and works well when you already know the schema.
    • Use the programmatic interface: construct a schema and apply it to an existing RDD.

I. Inferring the schema using reflection
Spark SQL can convert an RDD of Row objects into a DataFrame and infer the data types. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class.

The keys define the column names of the table, and the types are inferred from the first row of data (so the first row of the RDD must not have missing fields). Future versions may sample more of the data to infer the types, as is already done for JSON files today.
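
As a quick illustration of the kwargs style before the full example (a hedged sketch; the names and ages below are made up), the inferred schema can be checked with printSchema:

from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

# Keys become column names; the values in the first Row drive type inference.
rows = sc.parallelize([Row(name="Alice", age=25), Row(name="Bob", age=30)])
df = sqlContext.createDataFrame(rows)

# Shows name as string and age as long, both inferred.
df.printSchema()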

# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")

# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
    print teenName

II. Specifying the schema programmatically
Specifying a schema programmatically takes three steps:

    1. Create an RDD of tuples or lists from the original RDD.
    2. Create a schema, represented by a StructType, that matches the structure of the tuples or lists in the RDD created in step 1.
    3. Apply the schema to the RDD via the createDataFrame method provided by SQLContext.

# Import SQLContext and data types
from pyspark.sql import SQLContext
from pyspark.sql.types import *

# sc is an existing SparkContext.
sqlContext = SQLContext(sc)

# Load a text file and convert each line to a tuple.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD.
schemaPeople = sqlContext.createDataFrame(people, schema)

# Register the DataFrame as a table.
schemaPeople.registerTempTable("people")

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM people")

# The results of SQL queries are RDDs and support all the normal RDD operations.
names = results.map(lambda p: "Name: " + p.name)
for name in names.collect():
    print name
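
The example above declares every column as a string. If a column should have another type, the StructField types can be mixed; the following is only a hedged variation of the example (the sample rows are made up), not part of the original guide:

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sqlContext = SQLContext(sc)

# Mixed column types: name as a string, age as an integer.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# The tuple values must match the declared types.
rows = sc.parallelize([("Alice", 25), ("Bob", 30)])
df = sqlContext.createDataFrame(rows, schema)
df.printSchema()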
