For those unfamiliar with it, Pandas is the most popular data analysis library in the Python ecosystem. It can accomplish many tasks, including:
Reading and writing data in many formats
Selecting subsets of data
Calculating across rows and columns
Finding and filling in missing data
Applying operations to independent groups within the data
Reshaping data into different forms
Merging multiple datasets
Advanced time-series functionality
Visualization with Matplotlib and Seaborn
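Several of the tasks above can be sketched in a few lines. This is a minimal, hedged example using made-up in-memory data (in practice the data would come from a file via `pd.read_csv`):

```python
import pandas as pd

# Made-up data standing in for something like pd.read_csv("sales.csv")
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "units":  [10, 3, 7, None],   # one missing value
})

# Select a subset of rows
east = df[df["region"] == "east"]

# Find and fill missing data
df["units"] = df["units"].fillna(0)

# Apply an operation to each group (split-apply-combine)
totals = df.groupby("region")["units"].sum()
```

Each of these one-liners corresponds to an item in the list above; real analyses combine many of them.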
Although Pandas is powerful, it does not provide functionality for the entire data science pipeline. It typically sits between the tools used for data acquisition and storage and those used for modeling and prediction, playing the role of data exploration and cleaning.
Data Science Pipeline
For a typical data scientist, Pandas plays a very important role in the data pipeline. One quantitative indicator of this is its discussion frequency in the community, visible in the Stack Overflow Trends app (https://insights.stackoverflow.com/trends).
Pandas now tops the Python data science libraries on Stack Overflow, accounting for about 1% of all new questions submitted across the entire site.
The misuse of Stack Overflow
The chart above shows that many people use Pandas, and that many find it confusing. I have answered about 400 Pandas questions on Stack Overflow and seen firsthand how poorly people understand the library. Stack Overflow is enormously convenient for programmers, but it also has a major downside: because answers arrive instantly and gratification is immediate, people become reluctant to read the documentation and other resources already at their disposal. In fact, I suggest programmers spend a few weeks each year solving problems without Stack Overflow.
Teach you to learn Pandas
A few weeks ago I was asked how I would recommend practicing Pandas, so I posted a short guide on the r/datascience subreddit. The following expands on the points made in that post.
First of all, set the right goal. Your goal is not literally to "learn Pandas". Knowing how to perform individual operations in the library is useful, but it is not the same as the Pandas knowledge you need for real-world data analysis. You can divide your study into two categories:
Learning the Pandas library itself, independent of data analysis
Learning to use Pandas in real-world data analysis
The difference between the two is like the difference between learning how to chop a small branch in half and felling trees in a forest. Before we discuss this in more detail, let's look at both approaches.
Learning the Pandas library independent of data analysis: this approach mainly involves reading, and more importantly exploring, the official Pandas documentation (http://pandas.pydata.org/pandas-docs/stable/).
Learning to use Pandas in real-world data analysis: this approach involves finding and collecting real data and performing end-to-end analyses. Kaggle datasets are a great place to find data. However, I strongly recommend avoiding Kaggle until you can use Pandas comfortably.
Alternating learning
As you learn to use Pandas for data analysis, you should alternate between studying the fundamentals in the Pandas documentation and applying Pandas to real datasets. This is very important. Otherwise it is easy to become completely dependent on the basics, which suffice for most tasks but become cumbersome once more advanced operations would serve better.
Start with the document
If you have never touched Pandas but have a solid grounding in Python, I suggest starting with the official Pandas documentation. The documentation is extremely detailed, currently running to 2,195 pages. Yet even at that size it does not cover every operation, and it certainly does not cover every combination of functions, methods, and parameters available in Pandas.
Take full advantage of documentation
To get the most out of the documentation, don't just read it. I suggest working through about 15 of its sections. For each section, create a new Jupyter notebook. If you are unfamiliar with Jupyter notebooks, read this Data Camp article first: https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook
Build your first Jupyter notebook
Start with the section "Intro to data structures". Open it alongside your Jupyter notebook. As you read the documentation, type out (rather than copy) the code and execute it in the notebook. Executing the code yourself lets you explore the operations and try out new ways of using them.
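The kind of code you would type out from that section looks like this: a minimal sketch of the two core structures, with made-up values:

```python
import pandas as pd

# A Series is a labeled one-dimensional array
s = pd.Series([4, 7, -5], index=["a", "b", "c"])

# A DataFrame is a table of columns sharing a row index;
# each column is itself a Series
df = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})

print(s["b"])          # label-based access -> 7
print(df["y"].mean())  # a column is a Series with its own methods -> 20.0
```

Typing and running small fragments like this, then varying them, is the exploration the section is asking for.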
Next, select the section "Indexing and selecting data". Create a new Jupyter notebook, write and execute the code, then explore the different operations you learned. Selecting data is the hardest part for beginners, and I have written a lengthy Stack Overflow answer on loc vs iloc (https://stackoverflow.com/questions/28757389/loc-vs-iloc-vs-ix-vs-at-vs-iat/47098873#47098873), which you may want to read for another explanation.
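The loc/iloc distinction that trips beginners up can be shown in a few lines; the data below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"pop": [39.5, 29.1, 21.5], "area": [424, 695, 170]},
    index=["CA", "TX", "FL"],
)

# .loc selects by LABEL: the row labeled "TX", the column labeled "pop"
by_label = df.loc["TX", "pop"]

# .iloc selects by integer POSITION: second row, first column
by_position = df.iloc[1, 0]

# Both refer to the same cell here
assert by_label == by_position == 29.1

# Slicing differs too: .loc includes the end label,
# while .iloc excludes the end position (like normal Python slicing)
print(len(df.loc["CA":"TX"]))  # 2 rows
print(len(df.iloc[0:1]))       # 1 row
```

Keeping "labels vs positions" in mind resolves most selection confusion.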
After studying these two sections, you should understand the components of a DataFrame and a Series, and how to select different subsets of data. You can now read "10 minutes to pandas" for a broad overview of further useful operations. As with every section, create a new notebook.
Press SHIFT + TAB + TAB for help
I press Shift + Tab + Tab constantly while using Pandas. When the cursor is on a name, or inside parentheses, in valid Python code, a small scrollable box pops up showing that object's documentation. This box is invaluable to me, since it is impossible to remember every parameter name and its input type.
Pressing Shift + Tab + Tab opens the documentation pop-up
You can also press the Tab key directly after a dot (.) to get a drop-down list of every valid attribute and method.
Pressing Tab after a DataFrame (df.) lists 200+ valid attributes and methods
Main drawbacks of official documentation
Although the official documentation is extremely detailed, it does not teach you how to analyze real data: all of its data is artificial or randomly generated. Real data analysis chains several, sometimes dozens, of Pandas operations in sequence, and you will never encounter that by reading the documentation alone. Learning Pandas from the documentation is mechanical; each method is learned in isolation, disconnected from the others.
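To give a flavor of what "operations chained in sequence" means, here is a minimal sketch with made-up order data; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical raw data: one row per order
orders = pd.DataFrame({
    "city":   ["NYC", "NYC", "LA", "LA", "LA"],
    "amount": [120.0, None, 80.0, 95.0, 60.0],
})

# Several operations chained in sequence, as a real analysis would do
result = (
    orders
    .dropna(subset=["amount"])                       # drop missing amounts
    .assign(amount_k=lambda d: d["amount"] / 1000)   # derive a column
    .groupby("city", as_index=False)                 # split-apply-combine
    .agg(total=("amount", "sum"), n_orders=("amount", "size"))
    .sort_values("total", ascending=False)
)
```

Each step is simple on its own; the skill the documentation cannot teach is deciding which steps to chain, and in what order, for a real question.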
Build your First Data analysis
After reading these three sections of the documentation, you are ready to work with real data for the first time. As mentioned earlier, I recommend starting with Kaggle datasets. You can sort them by popularity, choosing, for example, the TMDB Movie Dataset. Download the data, then create a new Jupyter notebook for that dataset. You won't be able to do advanced data processing yet, but you should be able to practice the knowledge you gained from the first three sections of the documentation.
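A first pass over a downloaded dataset usually starts with the same few checks. The sketch below simulates the CSV in memory so it is self-contained; the file name and columns are made up, not the real TMDB schema:

```python
import io
import pandas as pd

# In practice: df = pd.read_csv("tmdb_movies.csv")
csv = io.StringIO(
    "title,budget,revenue\n"
    "A,100,250\n"
    "B,50,40\n"
    "C,0,10\n"
)
df = pd.read_csv(csv)

# The usual first-pass checks from the early documentation sections
print(df.head())        # first rows
print(df.shape)         # (rows, columns)
print(df.dtypes)        # column types
print(df.isna().sum())  # missing values per column

# A first question to ask of the data: which movies lost money?
losses = df[df["revenue"] < df["budget"]]
```

This is exactly the level of analysis the first three documentation sections prepare you for: load, inspect, select.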
View Kernel
Every Kaggle dataset has a Kernels section. Don't be misled by the name "kernel": it is just a Jupyter notebook that processes the Kaggle dataset in Python or R. Kernels are a great learning opportunity. After you have done some basic analysis of your own, open a few of the more popular Python kernels, read through several of them, and insert any code snippets that interest you into your own notebook.
If you don't understand something, ask in the comments section. You could create your own kernel, but for now I think it's better to work on your local machine.
Return to official documents
Once you have completed your first analysis, return to the documentation and read the remaining sections. Here is my suggested reading order:
Working with missing data
Group by: split-apply-combine
Reshaping and pivot tables
Merging and joining datasets
IO tools (text, CSV, HDF5, ...)
Working with text data
Visualization
Time series / date functionality
Time deltas
Categorical data
Computational tools
MultiIndex / advanced indexing
This order differs significantly from the order in the documentation's table of contents, and it covers the topics I consider most important. Some sections of the documentation are not listed above; you can read those on your own later.
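Two of the topics above, merging and missing data, come up in nearly every real analysis, so here is a minimal sketch of both together using made-up tables:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "name": ["Ann", "Bo", "Cy"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3],
                       "total": [20.0, 35.0, 15.0]})

# Merging/joining: a left join keeps every customer,
# even those with no orders (Bo gets NaN)
merged = customers.merge(orders, on="cust_id", how="left")

# Working with missing data: fill the gap left by the join
merged["total"] = merged["total"].fillna(0.0)

# Group by: total spend per customer
spend = merged.groupby("name")["total"].sum()
```

Notice how one topic feeds the next: the join introduces missing values, which must be handled before the aggregation is meaningful. This interplay is why the reading order matters.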
After reading the sections above and completing about 10 Kaggle kernels, you should understand the mechanics of Pandas without difficulty and be able to carry out real data analysis smoothly.
Learning Exploratory data analysis
Reading many popular Kaggle kernels will teach you a great deal about building good data analyses. For a more formal and rigorous approach, I suggest reading Chapter 4, "Exploratory Data Analysis", of Howard Seltman's online book (http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf).
Build your own kernels
You should consider creating your own kernels on Kaggle. Doing so forces you to write your programs clearly. Typically, code written for yourself is messy and unordered, unreadable to others (including your future self). When you publish a kernel online, I advise doing much better, as if your current or prospective employer were going to read it: write an executive summary or abstract at the start, then explain each block of code with comments. I usually write an exploratory, messier program first and then a completely separate, readable program as the final product. Here is a kernel by one of my students on the HR analytics dataset: https://www.kaggle.com/aselad/why-are-our-employees-leaving-prematurely
Don't just use Pandas; master it
There is a big difference between someone who merely uses Pandas and someone who has mastered it. Regular users often write poor code, because Pandas offers multiple functions and multiple ways of achieving the same result. Simple approaches will get you an answer, but they can be very inefficient.
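A classic illustration of this gap is looping over rows versus using vectorized column operations. The data below is made up; the point is the contrast between the two styles:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# A common beginner approach: loop row by row
# (it works, but is very slow on large data)
slow = []
for _, row in df.iterrows():
    slow.append(row["price"] * row["qty"])

# The idiomatic, vectorized equivalent: one expression over whole columns
fast = df["price"] * df["qty"]

assert slow == list(fast)  # same result, very different performance
```

On three rows the difference is invisible; on millions of rows the vectorized version is typically orders of magnitude faster, which is exactly the kind of knowledge that separates mastery from mere use.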
If you are a data scientist working in Python, you probably use Pandas frequently, so mastering it should be a priority: it can create a lot of value for you.
You will find many interesting tricks in this answer: https://stackoverflow.com/questions/17095101/outputting-difference-in-two-pandas-dataframes-side-by-side-highlighting-the-d/47112033#47112033
Use Stack Overflow to test your knowledge
If you cannot answer most Stack Overflow questions about a Python library, you do not really know it. That claim may be too strong, but in general Stack Overflow provides a good test bed for your understanding of a specific library. There are over 50,000 questions tagged Pandas on Stack Overflow, so you have a nearly endless supply against which to build your Pandas knowledge.
If you have never answered a question on Stack Overflow, I suggest first looking at questions that already have answers and trying to answer them using only the documentation. Once you feel you can put together a high-quality answer, try answering unanswered questions. Answering questions on Stack Overflow was the best way to strengthen my own Pandas skills.
Complete your own project
Kaggle kernels are great, but eventually you will need to tackle a unique task of your own. The first step is finding data, and there are many sources to choose from.
Once you find a dataset you want to explore, continue creating Jupyter notebooks in the same way, and when you have a good final result, post it on GitHub.
Summary
In short: as a beginner, learn the core mechanics of Pandas operations from the documentation, do data analysis on real datasets starting with Kaggle kernels, and finally test your knowledge on Stack Overflow.