This article describes how to use Python to draw data waterfall diagrams. the Pandas and matplotlib libraries are used in this tutorial. For more information, see
Introduction
Waterfall chart is a useful tool for drawing certain types of data. Not surprisingly, we can use Pandas and matplotlib to create a reusable waterfall chart.
Before proceeding, I want to tell you which type of chart I refer. I will create a 2D waterfall chart described in Wikipedia.
A typical use of this type of chart is to display the + and-values that act as a "bridge" between the start value and the end value. For this reason, financial personnel sometimes call it a bridge. Similar to other examples I used earlier, this type of drawing is not easy to generate in Excel. of course, there must be a way to generate it, but it is not easy to remember.
The key point to remember about a waterfall chart is that it is essentially a stacked bar chart, but it has a blank bottom bar, so the top bar will be "suspended" in the air. Let's get started.
Create a chart
First, execute the standard input and ensure that IPython can display the matplot diagram.
import numpy as npimport pandas as pdimport matplotlib.pyplot as plt %matplotlib inline
Set the data we want to plot the waterfall chart and load it into the data frame (DataFrame.
Data needs to start with your starting value, but you need to give the final total number. We will calculate it below.
index = ['sales','returns','credit fees','rebates','late charges','shipping']data = {'amount': [350000,-30000,-7500,-25000,95000,-7000]}trans = pd.DataFrame(data=data,index=index)
I used the convenient display function in IPython to control the content that I want to display more easily.
from IPython.display import displaydisplay(trans)
The greatest skill of a waterfall chart is to calculate the content of the stacked bar chart at the bottom. I have learned a lot from the stackoverflow discussion.
First, we get the sum.
display(trans.amount.cumsum())sales 350000returns 320000credit fees 312500rebates 287500late charges 382500shipping 375500Name: amount, dtype: int64
This looks good, but we need to move data from one location to the right.
blank=trans.amount.cumsum().shift(1).fillna(0)display(blank) sales 0returns 350000credit fees 320000rebates 312500late charges 287500shipping 382500Name: amount, dtype: float64
We need to add a net total to the trans and blank data frames.
total = trans.sum().amounttrans.loc["net"] = totalblank.loc["net"] = totaldisplay(trans)display(blank)
sales 0returns 350000credit fees 320000rebates 312500late charges 287500shipping 382500net 375500Name: amount, dtype: float64
Create steps for displaying changes.
step = blank.reset_index(drop=True).repeat(3).shift(-1)step[1::3] = np.nandisplay(step) 0 00 NaN0 3500001 3500001 NaN1 3200002 3200002 NaN2 3125003 3125003 NaN3 2875004 2875004 NaN4 3825005 3825005 NaN5 3755006 3755006 NaN6 NaNName: amount, dtype: float64
For "net" rows, in order not to double the stack, we need to ensure that the blank value is 0.
blank.loc["net"] = 0
Then, draw a picture and see what it looks like.
my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, title="2014 Sales Waterfall")my_plot.plot(step.index, step.values,'k')
It looks pretty good, but let's try to format the y axis to make it more readable. To this end, we use FuncFormatter and Python2.7 + syntax to cut off decimals and add a comma to the format.
def money(x, pos): 'The two args are the value and tick position' return "${:,.0f}".format(x) from matplotlib.ticker import FuncFormatterformatter = FuncFormatter(money)
Then, combine them.
my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, title="2014 Sales Waterfall")my_plot.plot(step.index, step.values,'k')my_plot.set_xlabel("Transaction Types")my_plot.yaxis.set_major_formatter(formatter)
Complete Script
The basic graph works normally, but I want to add some labels and make some small format modifications. The following is my final script:
import numpy as npimport pandas as pdimport matplotlib.pyplot as pltfrom matplotlib.ticker import FuncFormatter #Use python 2.7+ syntax to format currencydef money(x, pos): 'The two args are the value and tick position' return "${:,.0f}".format(x)formatter = FuncFormatter(money) #Data to plot. Do not include a total, it will be calculatedindex = ['sales','returns','credit fees','rebates','late charges','shipping']data = {'amount': [350000,-30000,-7500,-25000,95000,-7000]} #Store data and create a blank series to use for the waterfalltrans = pd.DataFrame(data=data,index=index)blank = trans.amount.cumsum().shift(1).fillna(0) #Get the net total number for the final element in the waterfalltotal = trans.sum().amounttrans.loc["net"]= totalblank.loc["net"] = total #The steps graphically show the levels as well as used for label placementstep = blank.reset_index(drop=True).repeat(3).shift(-1)step[1::3] = np.nan #When plotting the last element, we want to show the full bar,#Set the blank to 0blank.loc["net"] = 0 #Plot and labelmy_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, figsize=(10, 5), title="2014 Sales Waterfall")my_plot.plot(step.index, step.values,'k')my_plot.set_xlabel("Transaction Types") #Format the axis for dollarsmy_plot.yaxis.set_major_formatter(formatter) #Get the y-axis position for the labelsy_height = trans.amount.cumsum().shift(1).fillna(0) #Get an offset so labels don't sit right on top of the barmax = trans.max()neg_offset = max / 25pos_offset = max / 50plot_offset = int(max / 15) #Start label looploop = 0for index, row in trans.iterrows(): # For the last item in the list, we don't want to double count if row['amount'] == total: y = y_height[loop] else: y = y_height[loop] + row['amount'] # Determine if we want a neg or pos offset if row['amount'] > 0: y += pos_offset else: y -= neg_offset my_plot.annotate("{:,.0f}".format(row['amount']),(loop,y),ha="center") loop+=1 #Scale up the y axis so there is room for the labelsmy_plot.set_ylim(0,blank.max()+int(plot_offset))#Rotate the labelsmy_plot.set_xticklabels(trans.index,rotation=0)my_plot.get_figure().savefig("waterfall.png",dpi=200,bbox_inches='tight')
Run the script to generate the following beautiful chart:
Last thought
If you are not familiar with the waterfall chart before, I hope this example can show you how useful it is. I think some people may think it is a bit bad to need so much script code for a chart. In some respects, I agree with this idea. If you just make a waterfall chart and will not touch it again later, you should continue to use the Excel method.
However, what if a waterfall chart is really useful and you need to copy it to 100 customers? What will you do next? Using Excel is a challenge, and using the scripts in this article to create 100 different tables is quite easy. Once again, the true value of this program is that it facilitates you to create a program that is easy to copy when you need to expand this solution.
I really like learning more about Pandas, matplotlib, and IPothon. I'm glad that this method can help you, and hope other people can learn some knowledge from it and apply it to their daily work.