Teach you how to use Pandas pivot tables to process data (with learning materials) and pandas learning materials

Source: Internet
Author: User
Tags pandas pivot

Teach you how to use Pandas pivot tables to process data (with learning materials) and pandas learning materials

Source: bole online-PyPer

Total2203 words,Read5Minutes.
This article mainly explains pandas's pivot_table function and teaches you how to use it for data analysis.


Introduction


Most people may have experience using pivot tables in Excel. In fact, Pandas also provides a similar function called pivot_table. Although pivot_table is very useful, I find that in order to format and output the content I need, I often need to remember its syntax. Therefore, this article will focus on the pandas function pivot_table and teach you how to use it for data analysis.


If you are not familiar with this concept, Wikipedia provides a detailed explanation of it. By the way, do you know that Microsoft has registered a trademark for the PivotTable table? I did not know before. Needless to say, the pivot table I will discuss below is not PivotTable.


As an extra benefit, I created a simple memo sheet that summarizes the shortt_table. You can find it at the end of this article. I hope it will help you. If it helps you, please let me know.


Data


One challenge in using pandas's pivot_table is that you need to make sure that you understand your data and have a clear understanding of what issues you want to solve through the pivot table. In fact, although pivot_table seems to be a simple function, it can quickly perform powerful data analysis.


In this article, I will track a sales channel (also called a funnel ). The basic problem is that some sales cycle is long (you can think about "enterprise software", "capital equipment", etc.), and managers want to learn more about it throughout the year.


Typical problems include:


  • What is the revenue of this channel?

  • What is a channel product?

  • Who has any products at what stage?

  • How likely is the transaction to end before the end of the year?


Many companies will use CRM tools or other software for sale to track this process. Although they may have an effective tool for data analysis, some must export the data to Excel and use

A pivot table tool to summarize the data.


Using Pandas pivot tables is a good choice and should have the following advantages:


  • Faster (once set)

  • Self-explanation (by checking the code, you will know what it has done)

  • Easy to generate reports or emails

  • More flexible, because you can define custom Aggregate functions


Read in the data


First, let's build the required environment.


If you want to continue with me, you can download this Excel file.


Import pandas as pd

Import numpy as np


Version reminder


Because the effect_table API has changed over time, to make the sample code in this article work properly, make sure that you have installed the latest version of Pandas (> 0.15 ). In this example, the category data type is also used, and it must be the latest version.


First, read data from our sales channels into data frames.


Df = pd. read_excel ("../in/sales-funnel.xlsx ")

Df. head ()



For convenience, we define the "Status" column in the above table as category and set the order according to the desired viewing method.


In fact, this is not strictly required, but it can help us maintain the desired order throughout the data analysis process.


Df ["Status"] = df ["Status"]. astype ("category ")

Df ["Status"]. cat. set_categories (["won", "pending", "presented", "declined"], inplace = True)


Process Data


Since we create a pivot table, I think the easiest way is to do it step by step. Add a project and check each step to verify that you get the expected results step by step. Do not be afraid of handling the order and variable complexity to view what looks best to meet your needs.


The simplest Pivot table must have a data frame and an index. In this example, we use the "Name" column as our index.


Pd. effect_table (df, index = ["Name"])



You can also have multiple indexes. In fact, most of the shortt_table parameters can obtain multiple values through the list.


Pd. effect_table (df, index = ["Name", "Rep", "Manager"])



This is interesting but not especially useful. What we may want to do is to set "Manager" and "Rep" as indexes to view the results.


To implement it, you only need to change the index.


Pd. effect_table (df, index = ["Manager", "Rep"])



As you can see, the pivot table is intelligent. It has begun to implement data aggregation and summarization by grouping the "Rep" column and the "Manager" column. Now let's take a look at what the pivot table can do for us.


Therefore, the "Account" and "Quantity" columns are useless for us. Therefore, by explicitly defining columns we care about using the "values" field, we can remove columns that do not care about.


Pd. effect_table (df, index = ["Manager", "Rep"], values = ["Price"])



The "Price" column automatically calculates the average value of the data, but we can also count or sum the elements of this column. To add these features, it is easy to use aggfunc and np. sum.


Pd. effect_table (df, index = ["Manager", "Rep"], values = ["Price"], aggfunc = np. sum)




Aggfunc can contain many functions. Let's try a way to use the mean and len functions in numpy for counting.


Pd. effect_table (df, index = ["Manager", "Rep"], values = ["Price"], aggfunc = [np. mean, len])



If we want to analyze the sales situation through different products, the variable "columns" will allow us to define one or more columns.


Column vs. Value


In my opinion, one confusing aspect of pivot_table is the use of "columns (columns)" and "values (values. Remember, the variable "columns" is optional. It provides an additional method to split the actual values you are concerned about. However, the aggregate function aggfunc is finally applied to the project you listed in the variable "values.


Pd. effect_table (df, index = ["Manager", "Rep"], values = ["Price"],

Columns = ["Product"], aggfunc = [np. sum])



However, non-numeric values (NaN) are somewhat distracting. If you want to remove them, you can use fill_value to set it to 0.


Pd. effect_table (df, index = ["Manager", "Rep"], values = ["Price"],

Columns = ["Product"], aggfunc = [np. sum], fill_value = 0)



In fact, I think adding the "Quantity" column will be helpful to us, so we will add "Quantity" to the "values" list.


Pd. effect_table (df, index = ["Manager", "Rep"], values = ["Price", "Quantity"],

Columns = ["Product"], aggfunc = [np. sum], fill_value = 0)



Pd. effect_table (df, index = ["Manager", "Rep", "Product"],

Values = ["Price", "Quantity"], aggfunc = [np. sum], fill_value = 0)



This display method makes more sense for this dataset. But what if I want to check the sum? "Margins = True" can implement this function for us.


Pd. effect_table (df, index = ["Manager", "Rep", "Product"],

Values = ["Price", "Quantity"],

Aggfunc = [np. sum, np. mean], fill_value = 0, margins = True)



Next, let's analyze this channel from a higher manager perspective. According to the previous definition of category, note how "Status" is sorted.


Pd. effect_table (df, index = ["Manager", "Status"], values = ["Price"],

Aggfunc = [np. sum], fill_value = 0, margins = True)




A convenient feature is that you can pass a dictionary to aggfunc to execute different functions for different values you choose. However, there is a side effect, that is, the tag must be more concise.


Pd. effect_table (df, index = ["Manager", "Status"], columns = ["Product"], values = ["Quantity", "Price"],

Aggfunc = {"Quantity": len, "Price": np. sum}, fill_value = 0)



In addition, you can provide a series of aggregate functions and apply them to each element in "values.


Table = pd. effect_table (df, index = ["Manager", "Status"], columns = ["Product"], values = ["Quantity", "Price"],

Aggfunc = {"Quantity": len, "Price": [np. sum, np. mean]}, fill_value = 0)

Table




It may be daunting to put all these things together at the same time, but once you start to process the data and add a new project step by step, you will be able to see how it works. My general rule of thumb is that once you use multiple "grouby", You need to evaluate whether using a pivot table is a good choice.


Advanced Pivot Table Filtering


Once you generate the required data, the data will exist in the data frame. Therefore, you can use a custom standard data frame function to filter data frames.


If you only want to view the data of a manager (such as Debra Henley), you can:


Table. query ('manager = ["Debra Henley"] ')



We can view all pending and successful transactions. The Code is as follows:


Table. query ('status = ["pending", "won"] ')


This is a very powerful feature in pivot_table, so once you get the data in the pivot_table format you need, don't forget that you have the power of pandas.



If you want to save it as a reference, here are the complete notes: http://nbviewer.ipython.org/url/

Pbpython.com/extras/pandas-performant-table-explained.ipynb


Data: http://pbpython.com/extras/sales-funnel.xlsx


Memo form


To sum up all of this, I have created a memo, and I hope it will help you remember how to use pandas's pivot_table.



Edit: 文

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.