From Pandas to Apache Spark's DataFrame

August 12, 2015, by Olivier Girardot

This was a cross-post from the blog of Olivier Girardot. Olivier is a software engineer and the co-founder of Lateral Thoughts, where he works on machine learning, Big Data, and DevOps solutions.

With the introduction of window operations in Spark 1.4, you can finally port pretty much any relevant piece of Pandas' DataFrame computation to Apache Spark's parallel computation framework using Spark SQL's DataFrame. If you're not yet familiar with Spark's DataFrame, don't hesitate to check out RDDs are the new bytecode of Apache Spark and come back here after.

I figured some feedback on how to port existing complex code might be useful, so the goal of this article is to take a few concepts from Pandas DataFrame and see how we can translate them to PySpark's DataFrame using Spark 1.4.

DISCLAIMER: A few operations that you can do in Pandas don't translate to Spark well. Please remember that DataFrames in Spark are like RDDs in the sense that they're an immutable data structure. Therefore things like:

# to create a new column "three"
df['three'] = df['one'] * df['one']

can't exist, just because this kind of affectation goes against the principles of Spark. Another example would be trying to access a single element within a DataFrame by index. Don't forget that you're using a distributed data structure, not an in-memory random-access data structure.

To be clear, this doesn't mean that you can't do the same kind of thing (i.e. create a new column) using Spark; it means that you have to think immutable/distributed and re-write parts of your code, mostly the parts that are not purely thought of as transformations on a stream of data.
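For instance, here is a minimal sketch of the immutable way to get the effect of the Pandas snippet above (assuming a Spark DataFrame df that has a column named "one"; withColumn is covered in more detail below):

# Spark never mutates a DataFrame in place: you derive a new one instead.
# Sketch assuming df is a Spark DataFrame with a column "one".
df_with_three = df.withColumn('three', df['one'] * df['one'])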

So let's dive in.

Column Selection

This is not much different in Pandas and Spark, but you have to take into account the immutable character of your DataFrame. First let's create two DataFrames, one in Pandas *pdf* and one in Spark *df*:

# Pandas => pdf
In [ ]: pdf = pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])

In [18]: pdf.A
Out[18]:
0    1
1    2
2    3
Name: A, dtype: int64

# SPARK SQL => df
In [ ]: df = sqlCtx.createDataFrame([(1, 4), (2, 5), (3, 6)], ["A", "B"])

In [20]: df
Out[20]: DataFrame[A: bigint, B: bigint]

In [ ]: df.show()
+-+-+
|A|B|
+-+-+
|1|4|
|2|5|
|3|6|
+-+-+

Now in Spark SQL or Pandas you use the same syntax to refer to a column:

In [27]: df.A
Out[27]: Column<A>

In [28]: df['A']
Out[28]: Column<A>

In [29]: pdf.A
Out[29]:
0    1
1    2
2    3
Name: A, dtype: int64

In [30]: pdf['A']
Out[30]:
0    1
1    2
2    3
Name: A, dtype: int64

The output may look different, but these are still the same ways of referencing a column using Pandas or Spark. The only difference is that in Pandas it's a mutable data structure that you can change, which is not the case in Spark.

Column Adding

In [ ]: pdf['C'] = 0

In [32]: pdf
Out[32]:
   A  B  C
0  1  4  0
1  2  5  0
2  3  6  0

# In Spark SQL you'll use the withColumn or the select method,
# but you'll need to create a "Column", a simple int won't do:
In [33]: df.withColumn('C', 0)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-33-fd1261f623cf> in <module>()
----> 1 df.withColumn('C', 0)

/users/ogirardot/downloads/spark-1.4.0-bin-hadoop2.4/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
   1196         """
   1197         return self.select('*', col.alias(colName))
   1198
   1199     @ignore_unicode_prefix

AttributeError: 'int' object has no attribute 'alias'

# Here's your new best friend "pyspark.sql.functions.*"
# If you can't create it from composing columns,
# this package contains all the functions you'll need:
In [ ]: from pyspark.sql import functions as F

In [36]: df.withColumn('C', F.lit(0))
Out[36]: DataFrame[A: bigint, B: bigint, C: int]

In [ ]: df.withColumn('C', F.lit(0)).show()
+-+-+-+
|A|B|C|
+-+-+-+
|1|4|0|
|2|5|0|
|3|6|0|
+-+-+-+

Most of the time in Spark SQL you can use strings to reference columns, but there are cases where you'll want to use the Column objects rather than strings. In Spark SQL, DataFrame columns are allowed to have the same name; they'll be given unique names inside of Spark SQL, but this means that you can't reference them with the column name only, as this becomes ambiguous. When you need to manipulate columns using expressions like adding two columns to each other, twice the value of this column or even "is the column value larger than 0?", you won't be able to use simple strings and will need the Column reference. Finally, if you need renaming, cast or any other complex feature, you'll need the Column reference too.

Here's an example:

In [39]: df.withColumn('C', df.A * 2)
Out[39]: DataFrame[A: bigint, B: bigint, C: bigint]

In [ ]: df.withColumn('C', df.A * 2).show()
+-+-+-+
|A|B|C|
+-+-+-+
|1|4|2|
|2|5|4|
|3|6|6|
+-+-+-+

In [ ]: df.withColumn('C', df.B > 0).show()
+-+-+----+
|A|B|   C|
+-+-+----+
|1|4|true|
|2|5|true|
|3|6|true|
+-+-+----+
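To illustrate the first point about duplicated column names, here is a minimal sketch (df2 and its values are made up for the example): after a join, the plain string "A" would be ambiguous, while the Column references keep track of which DataFrame they come from:

# df2 shares the column name "A" with df, so referencing "A" by string alone
# would be ambiguous after the join; df.A and df2.A stay unambiguous.
df2 = sqlCtx.createDataFrame([(1, 10), (2, 20), (3, 30)], ["A", "C"])
joined = df.join(df2, df.A == df2.A)
joined.select(df.A, df2.C).show()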

When you're selecting columns to create another projected DataFrame, you can also use expressions:

In [42]: df.select(df.B > 0)
Out[42]: DataFrame[(B > 0): boolean]

In [ ]: df.select(df.B > 0).show()
+-------+
|(B > 0)|
+-------+
|   true|
|   true|
|   true|
+-------+

As you can see, the column name will actually be computed according to the expression you defined; if you want to rename this, you'll need to use the alias method on Column:

In [ ]: df.select((df.B > 0).alias("is_positive")).show()
+-----------+
|is_positive|
+-----------+
|       true|
|       true|
|       true|
+-----------+
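The cast case mentioned earlier goes through the Column reference as well; here is a minimal sketch assuming the same df (the column name "B_as_double" is just for the example):

# Cast column B to double and rename the result, combining the Column methods
# cast and alias.
df.select(df.B.cast("double").alias("B_as_double")).show()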

All of the expressions that we're building here can be used for filtering, adding a new column or even inside aggregations, so once you get a general idea of how it works, you'll be fluent throughout all of the DataFrame manipulation framework.

Filtering

Filtering is pretty straightforward too: you can use the RDD-like filter method and copy any of your existing Pandas expressions/predicates for filtering:

In [48]: pdf[(pdf.B > 0) & (pdf.A < 2)]
Out[48]:
   A  B  C
0  1  4  0

In [ ]: df.filter((df.B > 0) & (df.A < 2)).show()
+-+-+
|A|B|
+-+-+
|1|4|
+-+-+

In [ ]: df[(df.B > 0) & (df.A < 2)].show()
+-+-+
|A|B|
+-+-+
|1|4|
+-+-+
Aggregations

What can be confusing at first when using aggregations is that the minute you write groupBy, you're not using a DataFrame object anymore, you're actually using a GroupedData object, and you need to specify your aggregations to get back the output DataFrame:

In [77]: df.groupBy("A")
Out[77]: <pyspark.sql.group.GroupedData at 0x10dd11d90>

In [78]: df.groupBy("A").avg("B")
Out[78]: DataFrame[A: bigint, AVG(B): double]

In [ ]: df.groupBy("A").avg("B").show()
+-+------+
|A|AVG(B)|
+-+------+
|1|   4.0|
|2|   5.0|
|3|   6.0|
+-+------+

As syntactic sugar, if you need only one aggregation you can use the simplest functions like avg, count, max, min, mean and sum directly on GroupedData, but most of the time this will be too simple and you'll want to create a few aggregations during a single groupBy operation. After all (c.f. RDDs are the new bytecode of Apache Spark), this is one of the greatest features of DataFrames. To do so you'll be using the agg method:

In [ ]: df.groupBy("A").agg(F.avg("B"), F.min("B"), F.max("B")).show()
+-+------+------+------+
|A|AVG(B)|MIN(B)|MAX(B)|
+-+------+------+------+
|1|   4.0|     4|     4|
|2|   5.0|     5|     5|
|3|   6.0|     6|     6|
+-+------+------+------+

Of course, just like before, you can use any expression, especially column compositions, alias definitions, etc., and some other non-trivial functions:

In [ ]: df.groupBy("A").agg(
   ...:     F.first("B").alias("my first"),
   ...:     F.last("B").alias("my last"),
   ...:     F.sum("B").alias("my everything")
   ...: ).show()
+-+--------+-------+-------------+
|A|my first|my last|my everything|
+-+--------+-------+-------------+
|1|       4|      4|            4|
|2|       5|      5|            5|
|3|       6|      6|            6|
+-+--------+-------+-------------+
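As a small illustration of a column composition inside an aggregation, here is a minimal sketch assuming the same df and the F alias imported above (the output column name is made up):

# An expression such as df.B * 2 can be aggregated directly, just like a plain column.
df.groupBy("A").agg(F.sum(df.B * 2).alias("twice the sum of B")).show()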
Complex Operations & Windows

Now that Spark 1.4 is out, the DataFrame API provides an efficient and easy-to-use window-based framework. This single feature is what makes any Pandas-to-Spark migration actually doable for 99% of the projects, even considering some of Pandas' features that seemed hard to reproduce in a distributed environment.

A simple example we can pick is the diff: in Pandas you can compute a diff on a column, and Pandas will compare the value of one line to the last one and compute the difference between them. This is typically the kind of feature that is hard to do in a distributed environment, because each line is supposed to be treated independently. Now, with Spark 1.4 window operations, you can define a window on which Spark will execute some aggregation functions, but relative to a specific line. Here's how to port some existing Pandas code using diff:

In [ ]: df = sqlCtx.createDataFrame([(1, 4), (1, 5), (2, 6), (2, 6), (3, 0)], ["A", "B"])

In [ ]: pdf = df.toPandas()

In [96]: pdf
Out[96]:
   A  B
0  1  4
1  1  5
2  2  6
3  2  6
4  3  0

In [98]: pdf['diff'] = pdf.B.diff()

In [102]: pdf
Out[102]:
   A  B  diff
0  1  4   NaN
1  1  5     1
2  2  6     1
3  2  6     0
4  3  0    -6

In Pandas you can compute a diff on an arbitrary column, with no regard for keys, no regard for order or anything. It's cool... but most of the time it's not exactly what you want, and you might end up cleaning up the mess afterwards by setting the column value back to NaN from one line to another when the keys change.
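For reference, the usual Pandas way to keep the diff within each key (instead of cleaning up afterwards) is to compute it per group; a minimal sketch assuming the same pdf:

# diff is computed independently within each group of A, so the first row of
# each key gets NaN instead of a value carried over from the previous key.
pdf['diff'] = pdf.groupby('A')['B'].diff()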

Here's how you can do such a thing in PySpark using window functions, a key and, if you want, a specific order:

In [107]: from pyspark.sql.window import Window

In [108]: window_over_A = Window.partitionBy("A").orderBy("B")

In [109]: df.withColumn("diff", F.lead("B").over(window_over_A) - df.B).show()
+---+---+----+
|  A|  B|diff|
+---+---+----+
|  1|  4|   1|
|  1|  5|null|
|  2|  6|   0|
|  2|  6|null|
|  3|  0|null|
+---+---+----+
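If you prefer the same orientation as Pandas' diff (current row minus the previous one within each key), here is a minimal sketch using F.lag instead of F.lead, assuming the same df and window_over_A as above:

# lag("B") is the previous value of B within the window, so this follows the
# "current minus previous" convention of Pandas' diff, per key.
df.withColumn("diff", df.B - F.lag("B").over(window_over_A)).show()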

With that you are now able to compute a diff line by line, ordered or not, given a specific key. The great point about window operations is that you're not actually breaking the structure of your data. Let me explain myself.

When you're computing some kind of aggregation (once again according to a key), you'll usually be executing a groupBy operation given this key and computing the multiple metrics that you'll need (at the same time if you're lucky, otherwise in multiple reduceByKey or aggregateByKey transformations).

But whether you're using RDDs or DataFrames, if you're not using window operations then you'll actually crush your data in a part of your flow and then you'll need to join back the results of your aggregations to the main dataflow. Window operations allow you to execute your computation and copy the results as additional columns without any explicit join.
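As an illustration of that difference, here is a minimal sketch (the avg_B column name is made up) that adds the per-key average of B first with a groupBy followed by a join, then with a window, assuming the same df, F and Window as above:

# 1) groupBy "crushes" the data down to one row per key, so the result has to
#    be joined back onto the original rows (leaving a duplicated key column to
#    clean up afterwards).
avg_per_key = df.groupBy("A").agg(F.avg("B").alias("avg_B"))
via_join = df.join(avg_per_key, df.A == avg_per_key.A)

# 2) A window computes the same per-key average and copies it onto every row,
#    without any explicit join.
via_window = df.withColumn("avg_B", F.avg("B").over(Window.partitionBy("A")))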

This is a quick way to enrich your data by adding rolling computations as just another column directly. Two additional resources are worth noting regarding these new features: the official Databricks blog article on window operations and Christophe Bourguignat's article evaluating Pandas and Spark DataFrame differences.

To sum up, you now have all the tools you need in Spark 1.4 to port any Pandas computation in a distributed environment using the very similar DataFrame API.

Original: https://databricks.com/blog/2015/08/12/from-pandas-to-apache-sparks-dataframe.html

