Use the pandas framework of Python to perform data tutorials in Excel files,

Last Update:2015-04-01 Source: Internet

Author: User

Tags vlookup function

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Use the pandas framework of Python to perform data tutorials in Excel files,

Introduction

The purpose of this article is to show you how to use pandas to execute some common Excel tasks. Some examples are trivial, but I think it is equally important to present these simple things with complex functions that you can find elsewhere. As an extra benefit, I will perform some fuzzy string matching to demonstrate some small tricks, it also shows how pandas uses the complete Python module system to do something simple in Python, but complicated in Excel.

Does that make sense? Let's get started.
Add a sum for a row

The first task I want to introduce is to add several columns and then add a sum column.

First, import the excel Data to the pandas data framework.

import pandas as pdimport numpy as npdf = pd.read_excel("excel-comp-data.xlsx")df.head()

We want to add a sum column to show the total sales for Jan, Feb, and Mar for three months.

In Excel and pandas, this is simple and straightforward. For Excel, I added the formula sum (G2: I2) to column J ). In Excel, it looks like this:

Below, we operate in pandas as follows:

df["total"] = df["Jan"] + df["Feb"] + df["Mar"]df.head()

Next, let's calculate some summary information and other values for each column. As shown in the following Excel table, we need to do this:

As you can see, SUM (G2: G16) is added to the 17th row of the column indicating the month to get the total of the month.
It is easy to perform row-and-Column Analysis in pandas. The following are some examples:

df["Jan"].sum(), df["Jan"].mean(),df["Jan"].min(),df["Jan"].max() (1462000, 97466.666666666672, 10000, 162000)

Now we need to sum up the sum of every month to get their sum. Pandas and Excel are a little different here. Adding the sum of each month in an Excel cell is simple. Pandas requires some additional steps to maintain the integrity of the entire DataFrame.

First, create the sum column of all columns

sum_row=df[["Jan","Feb","Mar","total"]].sum()sum_row Jan   1462000Feb   1507000Mar    717000total  3686000dtype: int64

This is intuitive, but if you want to display the total value as a single row in the table, you still need to make some minor adjustments.

We need to transform the data and convert this series of numbers into DataFrame so that it can be more easily merged into existing data. The T function allows us to change the data arranged by row to the data arranged by column.

df_sum=pd.DataFrame(data=sum_row).Tdf_sum

The last thing we need to do before calculating the sum is to add missing columns. We use reindex to help us complete this process. The trick is to add all the columns and then let pandas add all the missing data.

df_sum=df_sum.reindex(columns=df.columns)df_sum

Now we have a DataFrame in good format. We can use append to add it to the existing content.

df_final=df.append(df_sum,ignore_index=True)df_final.tail()

Extra Data Transformation

In another example, let's try to add the abbreviation of status to the dataset.

For Excel, the easiest way is to add a new column, use the vlookup function for the state name, and fill the abbreviated column.

I performed this operation and the result is as follows:

You can note that after vlookup, some values are not obtained correctly. This is because we misspelled the names of some States. Processing this problem in Excel is a huge challenge (for large datasets)

Fortunately, with pandas, we can use the powerful python ecosystem. Considering how to solve such troublesome data problems, I would like to conduct some fuzzy text matching to determine the correct value.

Fortunately, others have already done a lot of work in this area. The fuzzy wuzzy library contains some useful functions to solve such problems. First, make sure that you have installed him.

Another piece of code we need is the ing table between the State name and its abbreviation. Instead of entering them in person, Google will find the code.

First, import the appropriate fuzzywuzzy function and define our state name ing table.

from fuzzywuzzy import fuzzfrom fuzzywuzzy import processstate_to_code = {"VERMONT": "VT", "GEORGIA": "GA", "IOWA": "IA", "Armed Forces Pacific": "AP", "GUAM": "GU",         "KANSAS": "KS", "FLORIDA": "FL", "AMERICAN SAMOA": "AS", "NORTH CAROLINA": "NC", "HAWAII": "HI",         "NEW YORK": "NY", "CALIFORNIA": "CA", "ALABAMA": "AL", "IDAHO": "ID", "FEDERATED STATES OF MICRONESIA": "FM",         "Armed Forces Americas": "AA", "DELAWARE": "DE", "ALASKA": "AK", "ILLINOIS": "IL",         "Armed Forces Africa": "AE", "SOUTH DAKOTA": "SD", "CONNECTICUT": "CT", "MONTANA": "MT", "MASSACHUSETTS": "MA",         "PUERTO RICO": "PR", "Armed Forces Canada": "AE", "NEW HAMPSHIRE": "NH", "MARYLAND": "MD", "NEW MEXICO": "NM",         "MISSISSIPPI": "MS", "TENNESSEE": "TN", "PALAU": "PW", "COLORADO": "CO", "Armed Forces Middle East": "AE",         "NEW JERSEY": "NJ", "UTAH": "UT", "MICHIGAN": "MI", "WEST VIRGINIA": "WV", "WASHINGTON": "WA",         "MINNESOTA": "MN", "OREGON": "OR", "VIRGINIA": "VA", "VIRGIN ISLANDS": "VI", "MARSHALL ISLANDS": "MH",         "WYOMING": "WY", "OHIO": "OH", "SOUTH CAROLINA": "SC", "INDIANA": "IN", "NEVADA": "NV", "LOUISIANA": "LA",         "NORTHERN MARIANA ISLANDS": "MP", "NEBRASKA": "NE", "ARIZONA": "AZ", "WISCONSIN": "WI", "NORTH DAKOTA": "ND",         "Armed Forces Europe": "AE", "PENNSYLVANIA": "PA", "OKLAHOMA": "OK", "KENTUCKY": "KY", "RHODE ISLAND": "RI",         "DISTRICT OF COLUMBIA": "DC", "ARKANSAS": "AR", "MISSOURI": "MO", "TEXAS": "TX", "MAINE": "ME"}

Here are some examples of how fuzzy text matching functions work.

process.extractOne("Minnesotta",choices=state_to_code.keys()) ('MINNESOTA', 95) process.extractOne("AlaBAMMazzz",choices=state_to_code.keys(),score_cutoff=80)

Now I know how it works. We create our own function to accept the data in the state column and convert it into a valid abbreviation. The score_cutoff value is 80. You can make some adjustments to see which value is better for your data. You will notice that the returned value is either a valid abbreviation or an np. nan, so there will be some valid values in the domain.

def convert_state(row):  abbrev = process.extractOne(row["state"],choices=state_to_code.keys(),score_cutoff=80)  if abbrev:    return state_to_code[abbrev[0]]  return np.nan

Add this column to the cell we want to fill, and then fill it with NaN

df_final.insert(6, "abbrev", np.nan)df_final.head()

We use apply to add abbreviations to appropriate columns.

df_final['abbrev'] = df_final.apply(convert_state, axis=1)df_final.tail()

I think this is cool. We have developed a very simple process to intelligently clean up data. Obviously, when you only have about 15 rows of data, this is nothing remarkable. But what if it is 15000 rows? In Excel, you must perform manual cleanup.
Category summary

In the last section of this article, let's do some subtotal by state ).

In Excel, we use the subtotal tool.

The output is as follows:

Create a Category summary in pandas using groupby.

df_sub=df_final[["abbrev","Jan","Feb","Mar","total"]].groupby('abbrev').sum()df_sub

Then, we want to format the data unit as a currency by using applymap for all values in the data frame.

def money(x):  return "${:,.0f}".format(x) formatted_df = df_sub.applymap(money)formatted_df

The formatting looks smooth. Now we can get the sum as before.

sum_row=df_sub[["Jan","Feb","Mar","total"]].sum()sum_row

Jan   1462000Feb   1507000Mar    717000total  3686000dtype: int64

Convert the value to a column and format it.

df_sub_sum=pd.DataFrame(data=sum_row).Tdf_sub_sum=df_sub_sum.applymap(money)df_sub_sum

Finally, add the sum to DataFrame.

final_table = formatted_df.append(df_sub_sum)final_table

You can note that the index number of the total row is '0 '. We want to rename it using rename.

final_table = final_table.rename(index={0:"Total"})final_table

Conclusion

So far, most people have known that pandas can perform many complex operations on data-just like Excel. Because I have been learning pandas, but I found that I still try to remember how I completed these operations in Excel instead of in pandas. I realized that comparing them seems unfair-they are completely different tools. However, I would like to see who knows Excel and who want to learn other alternative tools that can meet their data analysis needs. I hope these examples can help others and give them confidence that they can use pandas to replace their fragmented and complex Excel files for data operations.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More