This article introduces how to use the Python pandas library to work with data in Excel files, covering basic operations such as adding sum columns, formatting cells as currency, and grouping with subtotals.
Introduction
The purpose of this article is to show you how to perform some common Excel tasks in pandas. Some of the examples are trivial, but I think it is just as important to cover these simple cases as the more complex functionality you can find elsewhere. As an extra benefit, I will do some fuzzy string matching to demonstrate a few small tricks and to show how pandas can draw on the full Python module ecosystem to do something that is simple in Python but complicated in Excel.
Does that make sense? Let's get started.
Adding a sum column
The first task I want to cover is adding up several columns and storing the result in a new sum column.
First, import the Excel data into a pandas DataFrame.
import pandas as pd
import numpy as np

df = pd.read_excel("excel-comp-data.xlsx")
df.head()
We want to add a total column showing the combined sales for the three months Jan, Feb, and Mar.
In both Excel and pandas this is simple and straightforward. In Excel, I added the formula =SUM(G2:I2) to column J. In Excel, it looks like this:
Below, we operate in pandas as follows:
df["total"] = df["Jan"] + df["Feb"] + df["Mar"]
df.head()
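As a side note, if there were many month columns, summing across an axis scales better than adding the columns by hand. A minimal sketch with made-up numbers standing in for the Excel data:

```python
import pandas as pd

# Hypothetical stand-in for the Excel data
sales = pd.DataFrame({
    "Jan": [100, 200],
    "Feb": [110, 210],
    "Mar": [120, 220],
})

# axis=1 sums across the columns for each row, so adding a
# fourth month only means extending the column list
sales["total"] = sales[["Jan", "Feb", "Mar"]].sum(axis=1)
print(sales["total"].tolist())  # [330, 630]
```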
Next, let's calculate some summary statistics and other values for each column. As the following Excel table shows, we want to do this:
As you can see, =SUM(G2:G16) was added in row 17 of each month column to get that month's total.
It is easy to perform row-and-column analysis in pandas. The following are some examples:
df["Jan"].sum(), df["Jan"].mean(), df["Jan"].min(), df["Jan"].max()

(1462000, 97466.666666666672, 10000, 162000)
Now we need to add up each month's total to get a grand total. Here pandas and Excel differ a little. Adding the monthly totals in an Excel cell is simple, but pandas requires a few extra steps to keep the whole DataFrame consistent.
First, create the sum of each column:
sum_row = df[["Jan", "Feb", "Mar", "total"]].sum()
sum_row

Jan      1462000
Feb      1507000
Mar       717000
total    3686000
dtype: int64
This is intuitive, but if you want to display the total value as a single row in the table, you still need to make some minor adjustments.
We need to transpose the data, converting this Series of numbers into a DataFrame so it can be merged into the existing data more easily. The T attribute lets us turn the row-oriented data into column-oriented data.
df_sum = pd.DataFrame(data=sum_row).T
df_sum
The last thing we need to do before adding the totals back is to add the missing columns. We use reindex for this. The trick is to add all of our columns and then let pandas fill in the missing values.
df_sum = df_sum.reindex(columns=df.columns)
df_sum
Now we have a nicely formatted DataFrame, and we can add it to the existing data using append.
df_final = df.append(df_sum, ignore_index=True)
df_final.tail()
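Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; in newer versions, pd.concat does the same job. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical data in place of the Excel file
data = pd.DataFrame({"Jan": [100, 200], "Feb": [110, 210]})

# One-row DataFrame holding the column totals
totals = pd.DataFrame(data=data.sum()).T

# pd.concat replaces the removed DataFrame.append
merged = pd.concat([data, totals], ignore_index=True)
print(merged)
```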
Extra Data Transformation
In another example, let's try to add state abbreviations to the data set.
In Excel, the simplest way would be to add a new column, use the VLOOKUP function on the state name, and fill in the abbreviation column.
I performed this operation and the result is as follows:
You can see that after the VLOOKUP, some values were not retrieved correctly. That is because we misspelled some of the state names. Handling this problem in Excel is a huge challenge (for large data sets).
Fortunately, with pandas we have the full power of the Python ecosystem at our disposal. Thinking about how to solve this kind of messy data problem, I decided to try some fuzzy text matching to determine the correct values.
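For comparison, pandas' exact-match analogue of VLOOKUP is Series.map, and it shows exactly why misspellings are a problem: unmatched names come back as NaN. A minimal sketch with a hypothetical two-entry lookup table:

```python
import pandas as pd

# Hypothetical two-entry lookup table for illustration
codes = {"MINNESOTA": "MN", "TEXAS": "TX"}
states = pd.Series(["Minnesota", "Texas", "Minnesotta"])  # last one misspelled

# Series.map does an exact lookup: uppercase first, then map;
# any key that is not found becomes NaN
abbrev = states.str.upper().map(codes)
print(abbrev.tolist())  # ['MN', 'TX', nan]
```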
Fortunately, others have already done a lot of work in this area. The fuzzywuzzy library contains some functions useful for exactly this kind of problem. First, make sure it is installed.
The other piece we need is the mapping table between state names and their abbreviations. Rather than typing it in by hand, a quick Google search turns it up.
First, import the appropriate fuzzywuzzy functions and define our state-name mapping table.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

state_to_code = {
    "VERMONT": "VT", "GEORGIA": "GA", "IOWA": "IA", "Armed Forces Pacific": "AP",
    "GUAM": "GU", "KANSAS": "KS", "FLORIDA": "FL", "AMERICAN SAMOA": "AS",
    "NORTH CAROLINA": "NC", "HAWAII": "HI", "NEW YORK": "NY", "CALIFORNIA": "CA",
    "ALABAMA": "AL", "IDAHO": "ID", "FEDERATED STATES OF MICRONESIA": "FM",
    "Armed Forces Americas": "AA", "DELAWARE": "DE", "ALASKA": "AK",
    "ILLINOIS": "IL", "Armed Forces Africa": "AE", "SOUTH DAKOTA": "SD",
    "CONNECTICUT": "CT", "MONTANA": "MT", "MASSACHUSETTS": "MA",
    "PUERTO RICO": "PR", "Armed Forces Canada": "AE", "NEW HAMPSHIRE": "NH",
    "MARYLAND": "MD", "NEW MEXICO": "NM", "MISSISSIPPI": "MS", "TENNESSEE": "TN",
    "PALAU": "PW", "COLORADO": "CO", "Armed Forces Middle East": "AE",
    "NEW JERSEY": "NJ", "UTAH": "UT", "MICHIGAN": "MI", "WEST VIRGINIA": "WV",
    "WASHINGTON": "WA", "MINNESOTA": "MN", "OREGON": "OR", "VIRGINIA": "VA",
    "VIRGIN ISLANDS": "VI", "MARSHALL ISLANDS": "MH", "WYOMING": "WY",
    "OHIO": "OH", "SOUTH CAROLINA": "SC", "INDIANA": "IN", "NEVADA": "NV",
    "LOUISIANA": "LA", "NORTHERN MARIANA ISLANDS": "MP", "NEBRASKA": "NE",
    "ARIZONA": "AZ", "WISCONSIN": "WI", "NORTH DAKOTA": "ND",
    "Armed Forces Europe": "AE", "PENNSYLVANIA": "PA", "OKLAHOMA": "OK",
    "KENTUCKY": "KY", "RHODE ISLAND": "RI", "DISTRICT OF COLUMBIA": "DC",
    "ARKANSAS": "AR", "MISSOURI": "MO", "TEXAS": "TX", "MAINE": "ME"}
Here are some examples of how fuzzy text matching functions work.
process.extractOne("Minnesotta", choices=state_to_code.keys())

('MINNESOTA', 95)

process.extractOne("AlaBAMMazzz", choices=state_to_code.keys(), score_cutoff=80)
Now that we know how this works, we can create our own function to take the value of the state column and convert it to a valid abbreviation. Here we use a score_cutoff of 80; you can experiment to see which value works best for your data. Note that the return value is either a valid abbreviation or np.nan, so the column will always contain valid values.
def convert_state(row):
    abbrev = process.extractOne(row["state"], choices=state_to_code.keys(), score_cutoff=80)
    if abbrev:
        return state_to_code[abbrev[0]]
    return np.nan
Add the column in the position we want it and initialize it with NaN:
df_final.insert(6, "abbrev", np.nan)
df_final.head()
We use apply to fill the abbreviation column:
df_final['abbrev'] = df_final.apply(convert_state, axis=1)
df_final.tail()
I think this is pretty cool. We have developed a very simple process for intelligently cleaning up this data. Obviously, with only about 15 rows of data this is nothing remarkable. But what if there were 15,000 rows? In Excel, you would have to clean them up by hand.
Category summary
In the final section of this article, let's create some subtotals by state.
In Excel, we use the subtotal tool.
The output is as follows:
Create a category summary in pandas using groupby.
df_sub = df_final[["abbrev", "Jan", "Feb", "Mar", "total"]].groupby('abbrev').sum()
df_sub
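To see what groupby is doing here, a self-contained sketch with a hypothetical mini data set: rows sharing a key are collected into groups, and the aggregate is applied within each group.

```python
import pandas as pd

# Hypothetical mini data set to illustrate groupby
toy = pd.DataFrame({
    "abbrev": ["TX", "TX", "MN"],
    "Jan": [100, 200, 50],
})

# Rows with the same abbrev are grouped, then summed per group
by_state = toy.groupby("abbrev").sum()
print(by_state.loc["TX", "Jan"])  # 300
```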
Then we want to format the data as currency by using applymap on all the values in the DataFrame.
def money(x):
    return "${:,.0f}".format(x)

formatted_df = df_sub.applymap(money)
formatted_df
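The format spec `{:,.0f}` does two things: the comma inserts thousands separators and `.0f` rounds to zero decimal places. A quick self-contained check:

```python
def money(x):
    # "," adds thousands separators; ".0f" rounds to zero decimals
    return "${:,.0f}".format(x)

print(money(1462000))   # $1,462,000
print(money(97466.67))  # $97,467
```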
The formatting looks good; now we can get the totals as before.
sum_row = df_sub[["Jan", "Feb", "Mar", "total"]].sum()
sum_row

Jan      1462000
Feb      1507000
Mar       717000
total    3686000
dtype: int64
Convert the value to a column and format it.
df_sub_sum = pd.DataFrame(data=sum_row).T
df_sub_sum = df_sub_sum.applymap(money)
df_sub_sum
Finally, add the sum to DataFrame.
final_table = formatted_df.append(df_sub_sum)
final_table
You will notice that the index of the total row is 0. We want to change it using rename:
final_table = final_table.rename(index={0: "Total"})
final_table
Conclusion
By now, most people know that pandas can perform many complex operations on data, just like Excel. While learning pandas, I still found myself trying to remember how I would do things in Excel rather than in pandas. I realize the comparison may not be entirely fair, since they are very different tools. However, I wanted to reach people who know Excel and want to learn about alternative tools that can meet their data analysis needs. I hope these examples help others feel confident that they can use pandas to replace their fragmented, complex Excel files for data manipulation.