Tutorial: Using the Python pandas Library to Manipulate Data in Excel Files

Last Update:2017-01-19 Source: Internet

Author: User

Introduction

The purpose of this article is to show how to use pandas to perform some common Excel tasks. Some of the examples are trivial, but I think showing these simple things is just as important as the complex functions you can find elsewhere. As an added bonus, I'm going to do some fuzzy string matching to demonstrate a few tricks and to show how pandas can draw on the wider Python ecosystem to do something that is simple in Python but complex in Excel.

Sound reasonable? Let's get started.

Adding a Sum to a Row

The first task I'll cover is summing several columns to add a total column.

First, we import the Excel data into a pandas DataFrame.

import pandas as pd
import numpy as np
df = pd.read_excel("excel-comp-data.xlsx")
df.head()

We want to add a total column to show total sales for the three months Jan, Feb and Mar.

This is straightforward in both Excel and pandas. In Excel, I added the formula SUM(G2:I2) to column J. It looks like this in Excel:

Here's how we do it in pandas:

df["Total"] = df["Jan"] + df["Feb"] + df["Mar"]
df.head()
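
As a small, self-contained illustration of the same idea, the sketch below builds a hypothetical three-row frame (the names and figures are invented, not taken from excel-comp-data.xlsx) and shows that a row-wise sum over a list of columns gives the same result as adding the columns one by one.

```python
import pandas as pd

# Hypothetical sample data standing in for the Excel sheet
sales = pd.DataFrame({
    "name": ["Acme", "Binford", "Cyberdyne"],
    "Jan": [10000, 95000, 45000],
    "Feb": [62000, 45000, 120000],
    "Mar": [35000, 35000, 10000],
})

# Explicit addition, column by column
sales["Total"] = sales["Jan"] + sales["Feb"] + sales["Mar"]

# Equivalent row-wise sum; axis=1 means "across the columns of each row"
row_sum = sales[["Jan", "Feb", "Mar"]].sum(axis=1)

print(sales)
```

The sum(axis=1) form is handy when the list of columns is long or is built at run time.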

Next, let's calculate some summary information and other values for each column. As shown in the Excel table below, this is what we are going to do:

As you can see, we added SUM(G2:G16) to row 17 in each month's column to get the monthly totals.

Column-level analysis is simple in pandas. Here are some examples:

df["Jan"].sum(), df["Jan"].mean(), df["Jan"].min(), df["Jan"].max()

(1462000, 97466.666666666672, 10000, 162000)
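
If you want the same statistics for several columns at once, one option is DataFrame.agg with a list of function names. The sketch below is only an illustration on invented figures, not the article's data.

```python
import pandas as pd

# Hypothetical monthly figures, not the article's data
sales = pd.DataFrame({"Jan": [10000, 95000, 45000],
                      "Feb": [62000, 45000, 120000]})

# One call returns all four statistics for every selected column;
# the result is a DataFrame with one row per statistic
stats = sales[["Jan", "Feb"]].agg(["sum", "mean", "min", "max"])
print(stats)
```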

Now we want to add a total for each month and a grand total. Here pandas and Excel differ a little. In Excel it is easy to add a total cell under each month. Because pandas needs to maintain the integrity of the entire DataFrame, a few additional steps are required.

First, create the sum for each of the columns:

sum_row = df[["Jan", "Feb", "Mar", "Total"]].sum()
sum_row

Jan      1462000
Feb      1507000
Mar       717000
Total    3686000
dtype: int64

This is fairly intuitive, but a little more work is needed if you want to display the totals as a single row in the table.

We need to convert this Series into a DataFrame so that it can be easily merged into the existing data. The T attribute transposes the frame, so the sums end up laid out as a single row with one column per month.

df_sum = pd.DataFrame(data=sum_row).T
df_sum

The last thing we need to do before adding the totals back is to add the missing columns. We use reindex to do this for us. The trick is to pass in all of the original columns and let pandas fill in the missing values.

df_sum = df_sum.reindex(columns=df.columns)
df_sum

Now that we have a well-formed DataFrame, we can add it to the existing data with pd.concat (on pandas versions before 2.0, df.append(df_sum, ignore_index=True) does the same).

df_final = pd.concat([df, df_sum], ignore_index=True)
df_final.tail()
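
The steps above can be sketched end to end on a tiny hypothetical frame. The sketch uses pd.concat for the last step, since DataFrame.append is not available in pandas 2.0 and later; the column names and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one text column plus numeric month columns
df = pd.DataFrame({
    "name": ["Acme", "Binford"],
    "Jan": [10000, 95000],
    "Feb": [62000, 45000],
})

# 1. Sum the numeric columns (the result is a Series indexed by column name)
sum_row = df[["Jan", "Feb"]].sum()

# 2. Turn the Series into a one-row DataFrame
df_sum = pd.DataFrame(data=sum_row).T

# 3. Add the columns that are missing; pandas fills them with NaN
df_sum = df_sum.reindex(columns=df.columns)

# 4. Stack the total row under the original rows
df_final = pd.concat([df, df_sum], ignore_index=True)
print(df_final)
```

The total row has NaN in the text column, which is what makes it easy to tell apart from the data rows later.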

Additional Data Transformations

As another example, let's try to add a state abbreviation to the dataset.

In Excel, the easiest way to do this is to add a new column, run VLOOKUP on the state name, and fill in the abbreviation column.

I did this, and here's a screenshot of the result:

You will notice that after the VLOOKUP, some values did not come through correctly. That's because we misspelled some of the state names. Handling this in Excel would be a real challenge on large datasets.

Fortunately, with pandas we have the full power of the Python ecosystem at our disposal. In thinking about how to solve this kind of messy-data problem, I decided to try some fuzzy text matching to determine the correct value.

Better still, other people have already done a lot of work in this area. The fuzzywuzzy library has some very useful functions for this kind of problem. First, make sure you install it.

The other piece of code we need is a mapping from state name to abbreviation. Rather than typing it out yourself, a quick search will turn up this code.

First, import the appropriate fuzzywuzzy functions and define our state-name mapping table.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
state_to_code = {"VERMONT": "VT", "GEORGIA": "GA", "IOWA": "IA", "Armed Forces Pacific": "AP", "GUAM": "GU",
                 "KANSAS": "KS", "FLORIDA": "FL", "AMERICAN SAMOA": "AS", "NORTH CAROLINA": "NC", "HAWAII": "HI",
                 "NEW YORK": "NY", "CALIFORNIA": "CA", "ALABAMA": "AL", "IDAHO": "ID",
                 "FEDERATED STATES OF MICRONESIA": "FM", "Armed Forces Americas": "AA", "DELAWARE": "DE",
                 "ALASKA": "AK", "ILLINOIS": "IL", "Armed Forces Africa": "AE", "SOUTH DAKOTA": "SD",
                 "CONNECTICUT": "CT", "MONTANA": "MT", "MASSACHUSETTS": "MA", "PUERTO RICO": "PR",
                 "Armed Forces Canada": "AE", "NEW HAMPSHIRE": "NH", "MARYLAND": "MD", "NEW MEXICO": "NM",
                 "MISSISSIPPI": "MS", "TENNESSEE": "TN", "PALAU": "PW", "COLORADO": "CO",
                 "Armed Forces Middle East": "AE", "NEW JERSEY": "NJ", "UTAH": "UT", "MICHIGAN": "MI",
                 "WEST VIRGINIA": "WV", "WASHINGTON": "WA", "MINNESOTA": "MN", "OREGON": "OR", "VIRGINIA": "VA",
                 "VIRGIN ISLANDS": "VI", "MARSHALL ISLANDS": "MH", "WYOMING": "WY", "OHIO": "OH",
                 "SOUTH CAROLINA": "SC", "INDIANA": "IN", "NEVADA": "NV", "LOUISIANA": "LA",
                 "NORTHERN MARIANA ISLANDS": "MP", "NEBRASKA": "NE", "ARIZONA": "AZ", "WISCONSIN": "WI",
                 "NORTH DAKOTA": "ND", "Armed Forces Europe": "AE", "PENNSYLVANIA": "PA", "OKLAHOMA": "OK",
                 "KENTUCKY": "KY", "RHODE ISLAND": "RI", "DISTRICT OF COLUMBIA": "DC", "ARKANSAS": "AR",
                 "MISSOURI": "MO", "TEXAS": "TX", "MAINE": "ME"}

Here are some examples of how the fuzzy text matching functions work.

process.extractOne("Minnesotta", choices=state_to_code.keys())

('MINNESOTA', 95)

process.extractOne("Alabammazzz", choices=state_to_code.keys(), score_cutoff=80)
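
To see what a similarity score and a cutoff do without installing anything, here is a rough sketch that uses difflib from the Python standard library in place of fuzzywuzzy. It is a stand-in, not fuzzywuzzy's algorithm, so its scores will not necessarily agree with fuzzywuzzy's. The best_match helper and the three-entry codes table are hypothetical names made up for this example.

```python
from difflib import SequenceMatcher

# Small hypothetical subset of the state table
codes = {"MINNESOTA": "MN", "ALABAMA": "AL", "MAINE": "ME"}

def best_match(query, choices, score_cutoff=0):
    """Return (choice, score) for the closest choice, or None below the cutoff."""
    # Score every candidate on a 0-100 scale, comparing in upper case
    scored = [
        (choice, round(100 * SequenceMatcher(None, query.upper(), choice).ratio()))
        for choice in choices
    ]
    choice, score = max(scored, key=lambda pair: pair[1])
    return (choice, score) if score >= score_cutoff else None

# A near miss still finds its target
print(best_match("Minnesotta", codes.keys()))

# With a cutoff, a poor match is rejected and None comes back
print(best_match("Alabammazzz", codes.keys(), score_cutoff=80))
```

Raising the cutoff trades missed matches for fewer wrong ones, which is why it is worth trying a few values on your own data.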

Now that we know how it works, we can create our own function that takes the state name from a row and converts it to a valid abbreviation. Here we use a score_cutoff of 80. You can adjust it to see which value works best for your data. Note that the function returns either a valid abbreviation or np.nan, so the field always holds a usable value.

def convert_state(row):
    abbrev = process.extractOne(row["state"], choices=state_to_code.keys(), score_cutoff=80)
    if abbrev:
        return state_to_code[abbrev[0]]
    return np.nan

Add the column at the position we want and fill it with NaN values:

df_final.insert(6, "abbrev", np.nan)
df_final.head()

We use apply to add the abbreviations to the appropriate column.

df_final['abbrev'] = df_final.apply(convert_state, axis=1)
df_final.tail()
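
The row-wise pattern can be tried on a few invented rows. This sketch again swaps in difflib.get_close_matches from the standard library for fuzzywuzzy, so it is an approximation of the approach rather than the same matcher; the data, the short codes table and the 0.8 cutoff are assumptions chosen for the example.

```python
import difflib

import numpy as np
import pandas as pd

# Hypothetical lookup table and data with one misspelling and one junk value
codes = {"MINNESOTA": "MN", "ALABAMA": "AL", "TEXAS": "TX"}
df = pd.DataFrame({"state": ["Texas", "Minnesotta", "Narnia"],
                   "Total": [100, 200, 300]})

def convert_state(row):
    # get_close_matches keeps only candidates at or above the cutoff (0-1 scale)
    hits = difflib.get_close_matches(row["state"].upper(), list(codes), n=1, cutoff=0.8)
    return codes[hits[0]] if hits else np.nan

# axis=1 passes each row to the function as a Series
df["abbrev"] = df.apply(convert_state, axis=1)

# Rows that could not be matched are easy to find afterwards
unmatched = df[df["abbrev"].isna()]
print(df)
```

Returning np.nan for unmatched rows means isna() gives you a ready-made list of values to review by hand.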

I think this is pretty cool. We have developed a very simple process for intelligently cleaning up the data. Obviously, it's no big deal when you have only 15 rows. But what if you had 15,000? In Excel you would have to do some manual cleanup.

Subtotals

In the last section of this article, let's compute some subtotals by state.

In Excel, we'll use the Subtotal tool to do this.

The output is as follows:

Creating subtotals in pandas is done with groupby.

df_sub = df_final[["abbrev", "Jan", "Feb", "Mar", "Total"]].groupby('abbrev').sum()
df_sub
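
A minimal sketch of the grouping step on invented data is shown below. One detail worth knowing is that groupby leaves out rows whose key is NaN by default, so a totals row without an abbreviation does not get counted twice.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last one mimics a totals row with no abbreviation
df = pd.DataFrame({
    "abbrev": ["TX", "NC", "TX", np.nan],
    "Jan": [10, 20, 30, 60],
    "Total": [15, 25, 35, 75],
})

# Rows are grouped by abbreviation and each numeric column is summed;
# the group keys become the index of the result, in sorted order
df_sub = df[["abbrev", "Jan", "Total"]].groupby("abbrev").sum()
print(df_sub)
```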

Next, we want to format the values as currency by using applymap on every value in the DataFrame.

def money(x):
    return "${:,.0f}".format(x)

formatted_df = df_sub.applymap(money)
formatted_df
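
The format string does three things: it prefixes a dollar sign, inserts thousands separators, and rounds to zero decimal places. The sketch below shows this on invented subtotals. It applies the function column by column with Series.map, which is one way to get the same element-wise result; newer pandas releases (2.1 onward) also offer DataFrame.map as the renamed form of applymap.

```python
import pandas as pd

def money(x):
    # Dollar sign, thousands separators, no decimal places
    return "${:,.0f}".format(x)

# Hypothetical subtotals indexed by state abbreviation
df_sub = pd.DataFrame({"Jan": [1462000, 97466.67], "Feb": [717000, 10000]},
                      index=["TX", "NC"])

# Series.map applies the function to each value of one column;
# DataFrame.apply runs that over every column in turn
formatted = df_sub.apply(lambda col: col.map(money))
print(formatted)
```

Note that the formatted values are strings, so any arithmetic has to happen before this step.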

The formatting looks good, so now we can get the totals as before.

sum_row = df_sub[["Jan", "Feb", "Mar", "Total"]].sum()
sum_row

Jan      1462000
Feb      1507000
Mar       717000
Total    3686000
dtype: int64

Convert the values into a single row and format them.

df_sub_sum = pd.DataFrame(data=sum_row).T
df_sub_sum = df_sub_sum.applymap(money)
df_sub_sum

Finally, add the total row to the DataFrame.

# pd.concat is used here because DataFrame.append was removed in pandas 2.0
final_table = pd.concat([formatted_df, df_sub_sum])
final_table

You will notice that the index label of the total row is 0. We want to change that using rename.

final_table = final_table.rename(index={0: "Total"})
final_table
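
For reference, here is the total-row step on a small invented table, followed by a shorter alternative that assigns the sums to a new index label with .loc. The alternative is offered as one possible shortcut, not as the method this article follows, and the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical subtotals by state
df_sub = pd.DataFrame({"Jan": [20, 40], "Total": [25, 50]}, index=["NC", "TX"])

# Build the grand-total row, transpose it, and stack it under the subtotals
sum_row = df_sub.sum()
df_sub_sum = pd.DataFrame(data=sum_row).T   # the single row gets index label 0
final_table = pd.concat([df_sub, df_sub_sum])
final_table = final_table.rename(index={0: "Total"})

# Shorter alternative: assign the sums to a new index label directly
alt_table = df_sub.copy()
alt_table.loc["Total"] = df_sub.sum()
print(final_table)
```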

Conclusion

By now, most people know that pandas can do a lot of complex things with data, just like Excel. As I've been learning pandas, I've found that I still try to remember how I would do these things in Excel rather than in pandas. I realize the comparison may not seem fair, since they are very different tools. However, I want to reach people who know Excel and want to learn about alternatives that can meet their data analysis needs. I hope these examples help others and give them the confidence to use pandas in place of their fragmented and complex Excel data manipulation.

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
