Introduction
The purpose of this article is to show how to use pandas to perform some common Excel tasks. Some of the examples are trivial, but I think showing these simple things is just as important as the complex functions you can find elsewhere. As an extra benefit, I'm going to do some fuzzy string matching to show a few tricks, and to show how pandas can tap into the full Python module ecosystem to do something that is simple in Python but complex in Excel.
Make sense? Let's get started.
Adding Sums to Rows and Columns
The first task I'm going to cover is summing some columns to add a total column, and then adding a total row.
First we import the Excel data into the pandas data frame.
import pandas as pd
import numpy as np

df = pd.read_excel("excel-comp-data.xlsx")
df.head()
We want to add a total column to show total sales for the three months Jan, Feb and Mar.
This is straightforward in both Excel and pandas. In Excel, I added the formula =SUM(G2:I2) to column J. It looks like this in Excel:
Here's how we operate in Pandas:
df["Total"] = df["Jan"] + df["Feb"] + df["Mar"]
df.head()
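The same row total can also be computed without naming each column in the arithmetic. Here is a minimal, self-contained sketch on a toy frame (the column names are assumed to match the article's data):

```python
import pandas as pd

# Toy frame standing in for the article's sales data (values are made up).
df = pd.DataFrame({"account": ["A", "B"],
                   "Jan": [100, 200],
                   "Feb": [150, 250],
                   "Mar": [50, 75]})

# sum(axis=1) adds across each row, equivalent to Jan + Feb + Mar by hand
df["Total"] = df[["Jan", "Feb", "Mar"]].sum(axis=1)
print(df["Total"].tolist())  # [300, 525]
```

This form scales better if you later have many month columns to add up.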
Next, let's calculate some summary information and other values for each column. As shown in the Excel table below, we are going to do these things:
As you can see, we added =SUM(G2:G16) in row 17 of each month's column to get the sum for that month.
It is simple to perform a column-level analysis in pandas. Here are some examples:
df["Jan"].sum(), df["Jan"].mean(), df["Jan"].min(), df["Jan"].max()
(1462000, 97466.666666666672, 10000, 162000)
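If you want several of these statistics at once, describe() bundles them into a single call. A quick sketch on toy values (not the article's full 15-row dataset):

```python
import pandas as pd

# Three toy values; the real Jan column has 15 rows.
df = pd.DataFrame({"Jan": [10000, 95000, 162000]})

# describe() returns count, mean, std, min, quartiles and max together
stats = df["Jan"].describe()
print(stats["min"], stats["max"])  # 10000.0 162000.0
```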
Next, let's total each month's column to get a grand-total row. Here pandas and Excel differ a little. In Excel it's easy to just add a sum formula in the cell below each column; because pandas needs to maintain the integrity of the entire DataFrame, a few more steps are required.
First, create the sum of each of the columns:
sum_row = df[["Jan", "Feb", "Mar", "Total"]].sum()
sum_row

Jan      1462000
Feb      1507000
Mar       717000
Total    3686000
dtype: int64
This is intuitive, but you'll need to do some fine-tuning if you want to display the sum as a single line in the table.
We need to transpose the data to convert this Series of numbers into a DataFrame that can be easily merged into the existing data. The T attribute lets us flip the data from rows into columns.
df_sum = pd.DataFrame(data=sum_row).T
df_sum
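To see what the transpose is doing in isolation, here is a small sketch with the Series built by hand (values copied from the sum output above):

```python
import pandas as pd

# The column sums as a Series, as produced by df[...].sum()
sum_row = pd.Series({"Jan": 1462000, "Feb": 1507000, "Mar": 717000})

# Wrapping the Series in a DataFrame gives one column of numbers;
# .T flips it into a single row whose columns are the month names.
df_sum = pd.DataFrame(data=sum_row).T
print(df_sum.shape)  # (1, 3)
```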
The last thing we need to do before adding the totals back is to add the missing columns. We use reindex for this. The trick is to add all of the columns and then let pandas fill in the values that are missing.
df_sum = df_sum.reindex(columns=df.columns)
df_sum
Now that we have a well-formed dataframe, we can use append to add it to the existing content.
df_final = df.append(df_sum, ignore_index=True)
df_final.tail()
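One caveat: DataFrame.append was deprecated and then removed in pandas 2.0, so on newer versions the same step is written with pd.concat. A minimal sketch with toy data:

```python
import pandas as pd

# Toy stand-ins for df and df_sum from the article.
df = pd.DataFrame({"name": ["A", "B"], "Total": [300, 525]})
sum_row = df[["Total"]].sum()
df_sum = pd.DataFrame(data=sum_row).T.reindex(columns=df.columns)

# pd.concat with ignore_index=True is the modern equivalent of
# df.append(df_sum, ignore_index=True)
df_final = pd.concat([df, df_sum], ignore_index=True)
print(df_final["Total"].tolist())  # [300, 525, 825]
```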
Additional Data Transformations
For another example, let's try to add a state abbreviation to the dataset.
In Excel, the easiest way to do this is to add a new column, use the VLOOKUP function on the state name, and fill in the abbreviation column.
I did this, and here's a screenshot of the result:
You'll notice that after the VLOOKUP, some values were not retrieved correctly. That's because we misspelled some of the states' names. Handling this in Excel is a huge challenge for large datasets.
Fortunately, with pandas we can leverage the powerful Python ecosystem. Thinking about how to solve this kind of messy-data problem, I considered doing some fuzzy text matching to determine the correct value.
Fortunately, other people have done a lot of work in this area. The FuzzyWuzzy library contains some very useful functions for solving this type of problem. First, make sure you have it installed.
The other piece we need is a mapping of state names to their abbreviations. Rather than typing them all in by hand, a quick Google search turns up this code.
First, import the appropriate FuzzyWuzzy functions and define our state-name mapping table.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

state_to_code = {"VERMONT": "VT", "GEORGIA": "GA", "IOWA": "IA",
                 "ARMED FORCES PACIFIC": "AP", "GUAM": "GU", "KANSAS": "KS",
                 "FLORIDA": "FL", "AMERICAN SAMOA": "AS", "NORTH CAROLINA": "NC",
                 "HAWAII": "HI", "NEW YORK": "NY", "CALIFORNIA": "CA",
                 "ALABAMA": "AL", "IDAHO": "ID",
                 "FEDERATED STATES OF MICRONESIA": "FM",
                 "ARMED FORCES AMERICAS": "AA", "DELAWARE": "DE",
                 "ALASKA": "AK", "ILLINOIS": "IL", "ARMED FORCES AFRICA": "AE",
                 "SOUTH DAKOTA": "SD", "CONNECTICUT": "CT", "MONTANA": "MT",
                 "MASSACHUSETTS": "MA", "PUERTO RICO": "PR",
                 "ARMED FORCES CANADA": "AE", "NEW HAMPSHIRE": "NH",
                 "MARYLAND": "MD", "NEW MEXICO": "NM", "MISSISSIPPI": "MS",
                 "TENNESSEE": "TN", "PALAU": "PW", "COLORADO": "CO",
                 "ARMED FORCES MIDDLE EAST": "AE", "NEW JERSEY": "NJ",
                 "UTAH": "UT", "MICHIGAN": "MI", "WEST VIRGINIA": "WV",
                 "WASHINGTON": "WA", "MINNESOTA": "MN", "OREGON": "OR",
                 "VIRGINIA": "VA", "VIRGIN ISLANDS": "VI",
                 "MARSHALL ISLANDS": "MH", "WYOMING": "WY", "OHIO": "OH",
                 "SOUTH CAROLINA": "SC", "INDIANA": "IN", "NEVADA": "NV",
                 "LOUISIANA": "LA", "NORTHERN MARIANA ISLANDS": "MP",
                 "NEBRASKA": "NE", "ARIZONA": "AZ", "WISCONSIN": "WI",
                 "NORTH DAKOTA": "ND", "ARMED FORCES EUROPE": "AE",
                 "PENNSYLVANIA": "PA", "OKLAHOMA": "OK", "KENTUCKY": "KY",
                 "RHODE ISLAND": "RI", "DISTRICT OF COLUMBIA": "DC",
                 "ARKANSAS": "AR", "MISSOURI": "MO", "TEXAS": "TX",
                 "MAINE": "ME"}
Here are some examples of how fuzzy text matching functions work.
process.extractOne("Minnesotta", choices=state_to_code.keys())

('MINNESOTA', 95)

process.extractOne("Alabammazzz", choices=state_to_code.keys(), score_cutoff=80)
Now that we know how this works, we can create a function that takes a state name and converts it to a valid abbreviation. We use a score_cutoff of 80 here; you can experiment to see which value works best for your data. Note that the return value is either a valid abbreviation or np.nan, so some values in the column will be NaN.
def convert_state(row):
    abbrev = process.extractOne(row["state"], choices=state_to_code.keys(),
                                score_cutoff=80)
    if abbrev:
        return state_to_code[abbrev[0]]
    return np.nan
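If FuzzyWuzzy isn't available, the standard library's difflib can play a similar role. This is a sketch of the same idea, not the article's actual approach: get_close_matches stands in for process.extractOne, and its cutoff is on a 0-1 scale rather than 0-100.

```python
import difflib

# A tiny slice of the mapping, just for illustration.
state_to_code = {"MINNESOTA": "MN", "ALABAMA": "AL"}

def convert_state_name(name, cutoff=0.8):
    # get_close_matches returns up to n approximate matches above cutoff
    matches = difflib.get_close_matches(name.upper(), state_to_code.keys(),
                                        n=1, cutoff=cutoff)
    return state_to_code[matches[0]] if matches else None

print(convert_state_name("Minnesotta"))  # MN
print(convert_state_name("zzzz"))        # None
```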
We add the new column in the position we want and initialize it with NaN:
df_final.insert(6, "abbrev", np.nan)
df_final.head()
We use apply to add abbreviations to the appropriate columns.
df_final['abbrev'] = df_final.apply(convert_state, axis=1)
df_final.tail()
I think this is pretty cool. We have developed a very simple process to intelligently clean up this data. Obviously it's no big deal when you have 15 or so rows. But what if you had 15,000? In Excel you would have to do some manual cleanup.
Subtotals
In the final section of this article, let's compute some subtotals by state.
In Excel, we'll use the Subtotal tool to do this.
The output is as follows:
Creating subtotals in Pandas is done using GroupBy.
df_sub = df_final[["abbrev", "Jan", "Feb", "Mar", "Total"]].groupby('abbrev').sum()
df_sub
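Here is the same groupby idea as a self-contained sketch with a few toy rows, to make the collapsing behavior concrete:

```python
import pandas as pd

# Toy rows standing in for df_final.
df_final = pd.DataFrame({"abbrev": ["TX", "TX", "ME"],
                         "Total": [100, 200, 50]})

# groupby('abbrev').sum() collapses each state's rows into one subtotal,
# much like Excel's Subtotal tool.
df_sub = df_final[["abbrev", "Total"]].groupby("abbrev").sum()
print(df_sub.loc["TX", "Total"])  # 300
```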
Next, we want to format the data as currency by applying applymap to all of the values in the dataframe.
def money(x):
    return '${:,.0f}'.format(x)

formatted_df = df_sub.applymap(money)
formatted_df
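As a self-contained check of the formatter on toy values (note that in pandas 2.1+ DataFrame.map is the preferred name for applymap):

```python
import pandas as pd

def money(x):
    # ${:,.0f} adds a dollar sign and thousands separators, no decimals
    return '${:,.0f}'.format(x)

# Two toy subtotal values.
df_sub = pd.DataFrame({"Total": [162000, 10000]}, index=["TX", "ME"])
formatted_df = df_sub.applymap(money)
print(formatted_df["Total"].tolist())  # ['$162,000', '$10,000']
```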
Formatting looks like it's going well and now we can get the sum as before.
sum_row = df_sub[["Jan", "Feb", "Mar", "Total"]].sum()
sum_row

Jan      1462000
Feb      1507000
Mar       717000
Total    3686000
dtype: int64
We transpose the values into columns and format them as currency.
df_sub_sum = pd.DataFrame(data=sum_row).T
df_sub_sum = df_sub_sum.applymap(money)
df_sub_sum
Finally, add the sum to the dataframe.
final_table = formatted_df.append(df_sub_sum)
final_table
You'll notice that the index of the sum row is '0'. We can give it a more meaningful label using rename.
final_table = final_table.rename(index={0: "Total"})
final_table
Conclusion
By now, most people know that pandas can perform a lot of complex manipulations on data, just like Excel. As I've been learning pandas, I find that I still try to remember how I would do things in Excel rather than in pandas. I realize that comparing them head-to-head isn't entirely fair: they are different tools. But I want to reach people who know Excel and are curious about the alternatives available for their data-analysis needs. I hope these examples help others feel confident that they can replace their fragile, complex Excel data manipulations with pandas.