PySpark Study Notes Two

Source: Internet
Author: User
Tags: json pyspark databricks

2 DataFrames
Similar to the pandas DataFrame in Python, PySpark also has a DataFrame, which is processed much faster than an unstructured RDD.

Spark 2.0 replaced SQLContext with SparkSession. The various Spark contexts, including
HiveContext, SQLContext, StreamingContext, and SparkContext,
are all merged into SparkSession, which now serves as the single entry point for reading data.
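For example, once a SparkSession exists (a minimal sketch, assuming the session is named spark as in the pyspark shell or a Databricks notebook), the roles of the old contexts are all reachable from that single object:

# Assumption: `spark` is an existing SparkSession (e.g. from the pyspark shell).
spark.sparkContext            # the underlying SparkContext (the old sc)
spark.sql("SELECT 1").show()  # SQL queries (the old SQLContext / HiveContext role)
spark.read                    # DataFrameReader, the entry point for loading data
spark.catalog.listTables()    # catalog / metastore access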

2.1 Creating DataFrames
Preparatory work:

>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder \
...     .appName("Python Spark SQL basic example") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
>>> sc = spark.sparkContext

First create a stringJSONRDD RDD and then convert it to a DataFrame.

>>> stringJSONRDD = sc.parallelize((
...     """{"id": "123",
...         "name": "Katie",
...         "age": 19,
...         "eyeColor": "brown"}""",
...     """{"id": "234",
...         "name": "Michael",
...         "age": 22,
...         "eyeColor": "green"}""",
...     """{"id": "345",
...         "name": "Simone",
...         "age": 23,
...         "eyeColor": "blue"}"""))

Convert it to a DataFrame with the spark.read.json(...) method:

>>> swimmersJSON = spark.read.json(stringJSONRDD)

Create a temporary table:

>>> swimmersJSON.createOrReplaceTempView("swimmersJSON")

Note that creating a temporary table is a DataFrame transformation; nothing is executed until you perform an action (for example, running an SQL query).
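A quick sketch of that laziness (reusing the names from above; explain() only prints the query plan and does not execute it):

# Planning only: nothing is read from the JSON yet.
adults = spark.sql("SELECT id, age FROM swimmersJSON WHERE age > 20")
adults.explain()   # prints the physical plan; still no job has run
adults.collect()   # the action: only now is the query actually executed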

2.2 Querying DataFrames
The .show() method prints the first 20 rows by default.

>>> swimmersJSON.show()

SQL query:
>>> spark.sql("SELECT * FROM swimmersJSON").collect()
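Note that collect() pulls every row back to the driver, so on large tables a bounded action is safer. A small sketch (not part of the original notes):

# Bounded alternatives to collect() for inspecting data
swimmersJSON.take(2)                  # first two rows as a list of Row objects
swimmersJSON.show(2, truncate=False)  # print two rows without truncating columns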

There are two different ways to convert an existing RDD into a DataFrame (or Dataset):
The first is to let Spark infer the schema by reflection; the result can be inspected with the printSchema() method.

>>> swimmersJSON.printSchema()
root
 |-- age: long (nullable = true)
 |-- eyeColor: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
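
For reference, reflection also works when building a DataFrame from an RDD of Row objects; a minimal sketch (the extra swimmer below is made up for illustration):

from pyspark.sql import Row

# Spark deduces the column names and types from the Row's Python values.
rowRDD = sc.parallelize([Row(id='456', name='Jade', age=20, eyeColor='grey')])
spark.createDataFrame(rowRDD).printSchema()   # age is inferred as long, the rest as string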

The second is to define the schema in the program itself.

>>> from pyspark.sql.types import *
# Generate comma-delimited data
>>> stringCSVRDD = sc.parallelize([
...     (123, 'Katie', 19, 'brown'),
...     (234, 'Michael', 22, 'green'),
...     (345, 'Simone', 23, 'blue')])
# Specify schema
>>> schema = StructType([
...     StructField('id', LongType(), True),
...     StructField('name', StringType(), True),
...     StructField('age', LongType(), True),
...     StructField('eyeColor', StringType(), True)])

The StructField class breaks down into the following parts (a nested example is sketched below):
name: the name of the field
dataType: the data type of the field
nullable: whether the value of this field can be null
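
These three pieces can also describe nested columns; a minimal sketch (the address field below is made up for illustration):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# A nested struct column built from the same StructField building blocks
addressType = StructType([
    StructField('city', StringType(), True),
    StructField('zip', StringType(), True)])

personSchema = StructType([
    StructField('id', LongType(), False),        # nullable=False: id is expected to be present
    StructField('name', StringType(), True),
    StructField('address', addressType, True)])  # dataType can itself be a StructType

print(personSchema.simpleString())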

Apply the schema we created to the stringCSVRDD RDD (that is, the generated CSV-like data) and create a temporary view so that we can query it with SQL:

# Apply the schema to the RDD and create the DataFrame
>>> swimmers = spark.createDataFrame(stringCSVRDD, schema)
# Create a temporary view using the DataFrame
>>> swimmers.createOrReplaceTempView('swimmers')
>>> swimmers.printSchema()
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- eyeColor: string (nullable = true)

2.2.1 Querying with the DataFrame API
Row count

>>> swimmers.count()
3
Filter conditions
# Get the id, age where age = 22
>>> swimmers.select('id', 'age').filter('age = 22').show()
+---+---+
| id|age|
+---+---+
|234| 22|
+---+---+

# Another way to write the above query
>>> swimmers.select(swimmers.id, swimmers.age).filter(swimmers.age == 22).show()
+---+---+
| id|age|
+---+---+
|234| 22|
+---+---+

# Get the name, eyeColor where eyeColor like 'b%'
>>> swimmers.select("name", "eyeColor").filter("eyeColor like 'b%'").show()
+------+--------+
|  name|eyeColor|
+------+--------+
| Katie|   brown|
|Simone|    blue|
+------+--------+
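
A couple more DataFrame API operations on the same data, as a sketch (the aggregate is trivial with only three rows):

from pyspark.sql import functions as F

# Sort by age, descending
swimmers.orderBy(swimmers.age.desc()).show()
# Group by eye color and compute a count and an average age
swimmers.groupBy('eyeColor') \
    .agg(F.count('id').alias('n'), F.avg('age').alias('avg_age')) \
    .show()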

2.2.2 Querying with SQL
We created the view earlier, so we can now query it with SQL statements.
Row count

>>> spark.sql('select count(1) from swimmers').show()
+--------+
|count(1)|
+--------+
|       3|
+--------+
Filter conditions
# Get the id, age where age = 22 in SQL
>>> spark.sql('select id, age from swimmers where age = 22').show()
+---+---+
| id|age|
+---+---+
|234| 22|
+---+---+

>>> spark.sql ("Select name, Eyecolor from swimmers where eyecolor like ' b% '"). Show ()
+------ +--------+
|  name|eyecolor|
+------+--------+
| katie|   brown|
| simone|    blue|
+------+--------+
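
The same kind of aggregate can be written directly in SQL against the view (again just a sketch; both forms compile to the same plan):

spark.sql("""
    select eyeColor, count(1) as n, avg(age) as avg_age
    from swimmers
    group by eyeColor
    order by n desc
""").show()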

2.3 Example Applications

# Set file paths
flightPerfFilePath = "/databricks-datasets/flights/departuredelays.csv"
airportsFilePath = "/databricks-datasets/flights/airport-codes-na.txt"

# Obtain the airports dataset
airports = spark.read.csv(airportsFilePath, header='true', inferSchema='true', sep='\t')
airports.createOrReplaceTempView("airports")

# Obtain the departure delays dataset
flightPerf = spark.read.csv(flightPerfFilePath, header='true')
flightPerf.createOrReplaceTempView("FlightPerformance")

# Cache the departure delays dataset
flightPerf.cache()

# Query the sum of flight delays by city and origin code (for Washington state)
spark.sql("""
    select a.City, f.origin, sum(f.delay) as Delays
    from FlightPerformance f
      join airports a
        on a.IATA = f.origin
    where a.State = 'WA'
    group by a.City, f.origin
    order by sum(f.delay) desc
""").show()

+-------+------+--------+
|   City|origin|  Delays|
+-------+------+--------+
|Seattle|   SEA|159086.0|
|Spokane|   GEG| 12404.0|
|  Pasco|   PSC|   949.0|
+-------+------+--------+
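
Once the analysis is done, the cached dataset can be released; a small sketch (unpersist() is simply the counterpart of cache()):

# Release the cached departure-delays DataFrame
flightPerf.unpersist()
# Or drop every cached table/DataFrame for this session at once
spark.catalog.clearCache()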
