PySpark Study Notes Two

Source: Internet
Author: User
Tags: json pyspark databricks

2 DataFrames
Similar to the pandas DataFrame in Python, PySpark also has a DataFrame, which is processed much faster than an unstructured RDD.

Spark 2.0 replaced SQLContext with SparkSession. The various Spark contexts, including
HiveContext, SQLContext, StreamingContext, and SparkContext,
are all merged into SparkSession, which now serves as the single entry point for reading data.
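For example, once a SparkSession exists (a minimal sketch, assuming the session is named spark as in the pyspark shell or a Databricks notebook), the roles of the old contexts are all reachable from that single object:

# Assumption: `spark` is an existing SparkSession (e.g. from the pyspark shell).
spark.sparkContext            # the underlying SparkContext (the old sc)
spark.sql("SELECT 1").show()  # SQL queries (the old SQLContext / HiveContext role)
spark.read                    # DataFrameReader, the entry point for loading data
spark.catalog.listTables()    # catalog / metastore access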

2.1 Creating DataFrames
Preparatory work:

>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder \
...     .appName("Python Spark SQL basic example") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
>>> sc = spark.sparkContext

First create a stringJSONRDD RDD and then convert it to a DataFrame.

>>> stringJSONRDD = sc.parallelize((
...     """{"id": "123",
...         "name": "Katie",
...         "age": 19,
...         "eyeColor": "brown"}""",
...     """{"id": "234",
...         "name": "Michael",
...         "age": 22,
...         "eyeColor": "green"}""",
...     """{"id": "345",
...         "name": "Simone",
...         "age": 23,
...         "eyeColor": "blue"}"""))

Convert it to a DataFrame with the spark.read.json(...) method:

>>> swimmersJSON = spark.read.json(stringJSONRDD)

Create a temporary table:

>>> swimmersJSON.createOrReplaceTempView("swimmersJSON")

Note that creating a temporary table is a DataFrame transformation; nothing is executed until you perform an action (for example, running an SQL query).
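A quick sketch of that laziness (reusing the names from above; explain() only prints the query plan and does not execute it):

# Planning only: nothing is read from the JSON yet.
adults = spark.sql("SELECT id, age FROM swimmersJSON WHERE age > 20")
adults.explain()   # prints the physical plan; still no job has run
adults.collect()   # the action: only now is the query actually executed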

2.2 Querying DataFrames
The .show() method prints the first 20 rows by default.

>>> swimmersJSON.show()

SQL query:
>>> spark.sql("SELECT * FROM swimmersJSON").collect()
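Note that collect() pulls every row back to the driver, so on large tables a bounded action is safer. A small sketch (not part of the original notes):

# Bounded alternatives to collect() for inspecting data
swimmersJSON.take(2)                  # first two rows as a list of Row objects
swimmersJSON.show(2, truncate=False)  # print two rows without truncating columns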

There are two different ways to convert an existing RDD into a DataFrame (or Dataset):
The first is to let Spark infer the schema by reflection; the result can be inspected with the printSchema() method.

>>> swimmersJSON.printSchema()
root
 |-- age: long (nullable = true)
 |-- eyeColor: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
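
For reference, reflection also works when building a DataFrame from an RDD of Row objects; a minimal sketch (the extra swimmer below is made up for illustration):

from pyspark.sql import Row

# Spark deduces the column names and types from the Row's Python values.
rowRDD = sc.parallelize([Row(id='456', name='Jade', age=20, eyeColor='grey')])
spark.createDataFrame(rowRDD).printSchema()   # age is inferred as long, the rest as string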

The second is to define the schema in the program itself.

>>> from pyspark.sql.types import *
# Generate comma-delimited data
>>> stringCSVRDD = sc.parallelize([
...     (123, 'Katie', 19, 'brown'),
...     (234, 'Michael', 22, 'green'),
...     (345, 'Simone', 23, 'blue')])
# Specify schema
>>> schema = StructType([
...     StructField('id', LongType(), True),
...     StructField('name', StringType(), True),
...     StructField('age', LongType(), True),
...     StructField('eyeColor', StringType(), True)])

The StructField class breaks down into the following parts (a nested example is sketched below):
name: the name of the field
dataType: the data type of the field
nullable: whether the value of this field can be null
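
These three pieces can also describe nested columns; a minimal sketch (the address field below is made up for illustration):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# A nested struct column built from the same StructField building blocks
addressType = StructType([
    StructField('city', StringType(), True),
    StructField('zip', StringType(), True)])

personSchema = StructType([
    StructField('id', LongType(), False),        # nullable=False: id is expected to be present
    StructField('name', StringType(), True),
    StructField('address', addressType, True)])  # dataType can itself be a StructType

print(personSchema.simpleString())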

Apply the schema we created to the stringCSVRDD RDD (that is, the generated CSV-like data) and create a temporary view so that we can query it with SQL:

# Apply the schema to the RDD and create the DataFrame
>>> swimmers = spark.createDataFrame(stringCSVRDD, schema)
# Create a temporary view using the DataFrame
>>> swimmers.createOrReplaceTempView('swimmers')
>>> swimmers.printSchema()
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- eyeColor: string (nullable = true)

2.2.1 Querying with the DataFrame API
Row count

>>> swimmers.count()
3
Filter conditions
# Get the id, age where age = 22
>>> swimmers.select('id', 'age').filter('age = 22').show()
+---+---+
| id|age|
+---+---+
|234| 22|
+---+---+

# Another way to write the above query
>>> swimmers.select(swimmers.id, swimmers.age).filter(swimmers.age == 22).show()
+---+---+
| id|age|
+---+---+
|234| 22|
+---+---+

# Get the name, eyeColor where eyeColor like 'b%'
>>> swimmers.select("name", "eyeColor").filter("eyeColor like 'b%'").show()
+------+--------+
|  name|eyeColor|
+------+--------+
| Katie|   brown|
|Simone|    blue|
+------+--------+
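
A couple more DataFrame API operations on the same data, as a sketch (the aggregate is trivial with only three rows):

from pyspark.sql import functions as F

# Sort by age, descending
swimmers.orderBy(swimmers.age.desc()).show()
# Group by eye color and compute a count and an average age
swimmers.groupBy('eyeColor') \
    .agg(F.count('id').alias('n'), F.avg('age').alias('avg_age')) \
    .show()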

2.2.2 Querying with SQL
We created the view earlier, so we can now query it with SQL statements.
Row count

>>> spark.sql('select count(1) from swimmers').show()
+--------+
|count(1)|
+--------+
|       3|
+--------+
Filter conditions
# Get the id, age where age = 22 in SQL
>>> spark.sql('select id, age from swimmers where age = 22').show()
+---+---+
| id|age|
+---+---+
|234| 22|
+---+---+

>>> spark.sql ("Select name, Eyecolor from swimmers where eyecolor like ' b% '"). Show ()
+------ +--------+
|  name|eyecolor|
+------+--------+
| katie|   brown|
| simone|    blue|
+------+--------+
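
The same kind of aggregate can be written directly in SQL against the view (again just a sketch; both forms compile to the same plan):

spark.sql("""
    select eyeColor, count(1) as n, avg(age) as avg_age
    from swimmers
    group by eyeColor
    order by n desc
""").show()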

2.3 Example Applications

# Set file paths
flightPerfFilePath = "/databricks-datasets/flights/departuredelays.csv"
airportsFilePath = "/databricks-datasets/flights/airport-codes-na.txt"

# Obtain the airports dataset
airports = spark.read.csv(airportsFilePath, header='true', inferSchema='true', sep='\t')
airports.createOrReplaceTempView("airports")

# Obtain the departure delays dataset
flightPerf = spark.read.csv(flightPerfFilePath, header='true')
flightPerf.createOrReplaceTempView("FlightPerformance")

# Cache the departure delays dataset
flightPerf.cache()

# Query the sum of flight delays by city and origin code (for Washington state)
spark.sql("""
    select a.City, f.origin, sum(f.delay) as Delays
    from FlightPerformance f
      join airports a
        on a.IATA = f.origin
    where a.State = 'WA'
    group by a.City, f.origin
    order by sum(f.delay) desc
""").show()

+-------+------+--------+
|   City|origin|  Delays|
+-------+------+--------+
|Seattle|   SEA|159086.0|
|Spokane|   GEG| 12404.0|
|  Pasco|   PSC|   949.0|
+-------+------+--------+
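
Once the analysis is done, the cached dataset can be released; a small sketch (unpersist() is simply the counterpart of cache()):

# Release the cached departure-delays DataFrame
flightPerf.unpersist()
# Or drop every cached table/DataFrame for this session at once
spark.catalog.clearCache()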
