2 DataFrames
Similar to the pandas DataFrame in Python, PySpark also has a DataFrame, which can be processed much faster than an unstructured RDD.
Spark 2.0 replaced SQLContext with SparkSession. The various Spark contexts, including
HiveContext, SQLContext, StreamingContext, and SparkContext,
are all merged into SparkSession, which now serves as the single entry point for reading data and working with Spark.
2.1 Creating DataFrames
Preparatory work:
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder \
...     .appName("Python Spark SQL basic example") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
>>> sc = spark.sparkContext
First create an RDD named stringJSONRDD, then convert it to a DataFrame.
>>> stringJSONRDD = sc.parallelize((
...     """{"id": "123",
...         "name": "Katie",
...         "age": 19,
...         "eyeColor": "brown"}""",
...     """{"id": "234",
...         "name": "Michael",
...         "age": 22,
...         "eyeColor": "green"}""",
...     """{"id": "345",
...         "name": "Simone",
...         "age": 23,
...         "eyeColor": "blue"}"""))
Convert it to a DataFrame with the spark.read.json(...) method:
>>> swimmersJSON = spark.read.json(stringJSONRDD)
Create a temporary view:
>>> swimmersJSON.createOrReplaceTempView("swimmersJSON")
Note that creating a temporary view is a DataFrame transformation: nothing is executed until you run an action (for example, an SQL query), as the sketch below illustrates.
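A minimal sketch of this lazy evaluation, reusing the swimmersJSON DataFrame from above (the comments about when Spark does the work are the point of the sketch):
>>> # Registering the view is lazy; no Spark job runs here.
>>> swimmersJSON.createOrReplaceTempView("swimmersJSON")
>>> # show() is an action, so only now does Spark parse the
>>> # JSON strings and materialize the rows.
>>> spark.sql("SELECT COUNT(1) FROM swimmersJSON").show()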
2.2 Querying DataFrames
The .show() method prints the first 20 rows by default.
>>> swimmersJSON.show()
SQL query:
>>> spark.sql("SELECT * FROM swimmersJSON").collect()
There are two different ways to convert existing RDDs to DataFrames (or Datasets):
The first is to let Spark infer the schema by reflection; you can then inspect the inferred schema with the printSchema() method.
>>> swimmersJSON.printSchema()
root
 |-- age: long (nullable = true)
 |-- eyeColor: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
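Reflection also works on RDDs of Row objects, not just JSON strings; a minimal sketch (the sample Row here is hypothetical):
>>> from pyspark.sql import Row
>>> # Spark inspects the Row fields to infer column names and types.
>>> rowRDD = sc.parallelize([Row(id='123', name='Katie', age=19)])
>>> spark.createDataFrame(rowRDD).printSchema()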
The second is to specify the schema programmatically.
>>> from pyspark.sql.types import *
# Generate comma-delimited data
>>> stringCSVRDD = sc.parallelize([
...     (123, 'Katie', 19, 'brown'),
...     (234, 'Michael', 22, 'green'),
...     (345, 'Simone', 23, 'blue')])
# Specify the schema
>>> schema = StructType([
...     StructField('id', LongType(), True),
...     StructField('name', StringType(), True),
...     StructField('age', LongType(), True),
...     StructField('eyeColor', StringType(), True)])
The StructField class breaks down into:
name: the name of the field
dataType: the data type of the field
nullable: whether values of this field can be null (illustrated in the sketch below)
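A brief sketch of what the nullable flag means in practice (strict_schema is a made-up name for illustration; by default createDataFrame verifies local data against the schema):
>>> strict_schema = StructType([
...     StructField('id', LongType(), False),     # id must not be null
...     StructField('name', StringType(), True)]) # name may be null
>>> # A None in a non-nullable column fails schema verification:
>>> spark.createDataFrame([(None, 'x')], strict_schema)  # raises ValueError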
Apply the schema we created to the stringCSVRDD RDD (that is, the generated CSV-style data) and create a temporary view so that we can query it with SQL:
# Apply the schema to the RDD and create a DataFrame
>>> swimmers = spark.createDataFrame(stringCSVRDD, schema)
# Create a temporary view using the DataFrame
>>> swimmers.createOrReplaceTempView('swimmers')
>>> swimmers.printSchema()
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- eyeColor: string (nullable = true)
Note that id is now a long, whereas reflection on the JSON data inferred it as a string.
2.2.1 Querying with the DataFrame API
Row count:
>>> swimmers.count()
3
Filter conditions
# Get the id, age where age = 22
>>> swimmers.select('id', 'age').filter('age = 22').show()
+---+---+
| id|age|
+---+---+
|234| 22|
+---+---+
# Another way to write the above query is below
>>> swimmers.select(swimmers.id, swimmers.age).filter(swimmers.age == 22).show()
+---+---+
| id|age|
+---+---+
|234| 22|
+---+---+
# Get the name, eyeColor where eyeColor like 'b%'
>>> swimmers.select('name', 'eyeColor').filter("eyeColor like 'b%'").show()
+------+--------+
|  name|eyeColor|
+------+--------+
| Katie|   brown|
|Simone|    blue|
+------+--------+
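The same query can also be written with column expressions from pyspark.sql.functions, a common idiom in larger pipelines; a minimal sketch:
>>> from pyspark.sql.functions import col
>>> swimmers.select(col('name'), col('eyeColor')) \
...     .filter(col('eyeColor').startswith('b')) \
...     .show()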
2.2.2 Querying with SQL
We created the temporary view earlier, so we can now query it with SQL statements. Row count:
>>> spark.sql('SELECT COUNT(1) FROM swimmers').show()
+--------+
|count(1)|
+--------+
|       3|
+--------+
Filter conditions
# Get the id, age where age = 22 in SQL
>>> spark.sql('SELECT id, age FROM swimmers WHERE age = 22').show()
+---+---+
| id|age|
+---+---+
|234| 22|
+---+---+
>>> spark.sql ("Select name, Eyecolor from swimmers where eyecolor like ' b% '"). Show ()
+------ +--------+
| name|eyecolor|
+------+--------+
| katie| brown|
| simone| blue|
+------+--------+
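Because the view is registered, SQL and the DataFrame API can be mixed freely; a small sketch pulling the view back as a DataFrame:
>>> # spark.table() returns a registered view as a DataFrame,
>>> # so SQL results can feed straight into DataFrame operations.
>>> df = spark.table('swimmers')
>>> df.filter(df.age > 20).count()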
2.3 Example Applications
# Set file paths
flightPerfFilePath = "/databricks-datasets/flights/departuredelays.csv"
airportsFilePath = "/databricks-datasets/flights/airport-codes-na.txt"

# Obtain the airports dataset
airports = spark.read.csv(airportsFilePath, header='true', inferSchema='true', sep='\t')
airports.createOrReplaceTempView("airports")

# Obtain the departure delays dataset
flightPerf = spark.read.csv(flightPerfFilePath, header='true')
flightPerf.createOrReplaceTempView("FlightPerformance")

# Cache the departure delays dataset
flightPerf.cache()

# Query sum of flight delays by city and origin code (for Washington State)
spark.sql("""SELECT a.City, f.origin, SUM(f.delay) AS Delays
             FROM FlightPerformance f
             JOIN airports a
               ON a.IATA = f.origin
             WHERE a.State = 'WA'
             GROUP BY a.City, f.origin
             ORDER BY SUM(f.delay) DESC""").show()

+-------+------+--------+
|   City|origin|  Delays|
+-------+------+--------+
|Seattle|   SEA|159086.0|
|Spokane|   GEG| 12404.0|
|  Pasco|   PSC|   949.0|
+-------+------+--------+
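One design note on the cache() call above: it keeps the delays dataset in memory so that repeated queries against the view do not re-read the CSV file. When the analysis is finished, the cache can be released (a standard PySpark call, shown here as an optional cleanup sketch):

# Release the cached departure delays dataset.
flightPerf.unpersist()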