Python Data Analysis (I): This lab covers pandas basics, data loading, storage and file formats, data wrangling, plotting and visualization

Source: Internet
Author: User
Tags arithmetic stock prices

Section 1: Pandas review
Section 2: Reading and writing text formats
Section 3: Using HTML and Web APIs
Section 4: Using databases
Section 5: Merging datasets
Section 6: Reshaping and pivoting
Section 7: Data transformation
Section 8: String operations
Section 9: Plotting and visualization

Pandas Review

I. Lab introduction

To follow this data analysis course, students need a foundation in the Python language and some knowledge of basic libraries such as NumPy and Matplotlib. Students can refer to the Python basics course and the Python scientific computing course on the lab platform (Shiyanlou).

pandas is the library of choice for the rest of our study of data analysis. It contains high-level data structures and manipulation tools that make data analysis faster and easier.

pandas combines NumPy's high-performance array computing with the flexible data manipulation capabilities of spreadsheets and relational databases, and it also provides sophisticated indexing. For users in the financial industry, pandas offers a wealth of high-performance time-series functionality and tools for financial data.

First, let us agree on a convention for importing pandas and NumPy, which is also Mumu's habit when writing:

In [1]: import pandas as pd
In [2]: import numpy as np

So whenever you see pd in the code, it refers to pandas, and np refers to NumPy. Mumu uses IPython; the leading In [ ] and Out [ ] prompts are produced by the system to mark input and output, and you do not need to type them. For background, refer to the Python basics course and the Python scientific computing course.

II. Introduction to pandas data structures

To use pandas, we first have to familiarize ourselves with its two main data structures: Series and DataFrame.

We import these two data structures from the pandas library:

In [3]: from pandas import Series, DataFrame

1. Series

A Series is an object that is similar to a one-dimensional array, consisting of a set of data and the associated data labels (that is, indexes).

A simple Series can be created from just a set of data. In the string representation of a Series, the index is on the left and the values are on the right. When we create a Series without specifying an index, pandas automatically creates an integer index from 0 to n-1 (where n is the length of the data). Every Series has a values attribute and an index attribute, which hold its array of values and its index.
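As a minimal sketch of the points above, the following creates a Series from a plain list and inspects its values and index. The numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
obj = pd.Series([4, 7, -5, 3])

print(obj)         # index on the left, values on the right
print(obj.values)  # the underlying array of values
print(obj.index)   # the automatically created integer index 0..n-1
```

Because no index was passed, the labels are simply the positions 0 through 3.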

Usually we want to specify the index value we need when we create the Series object:

This lets us access values in the Series not only by the default integer positions but also by the index labels we defined ourselves:

Note: the u prefix in the output above marks a Unicode string.
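A short sketch of a Series with a user-defined index, assuming made-up values and labels. Access by label uses square brackets, while `iloc` is used here for access by integer position.

```python
import pandas as pd

# Hypothetical values and labels for illustration.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

by_label = obj2["a"]        # look up a value by its label
by_position = obj2.iloc[2]  # look up the same value by integer position
subset = obj2[["c", "a"]]   # select several labels at once

print(by_label, by_position)
print(subset)
```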

If the data is stored in a Python dictionary, we can also create a Series directly from that dictionary:

If only a dictionary is passed in, the index of the resulting Series is made up of the dictionary's keys (in sorted order in older pandas versions; recent versions keep the dictionary's insertion order):

When both the dictionary sdata and a list of labels states are passed in, the three values in sdata whose keys match labels in states are found and placed in the corresponding positions. Because the remaining label in states has no corresponding value in sdata, its result is NaN (not a number), which pandas uses to represent missing data.
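The following sketch reproduces this behavior. The names `sdata` and `states` come from the text; the keys and numbers themselves are assumptions made up for illustration.

```python
import pandas as pd

# Assumed contents: the text names sdata and states but not their values.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)  # index comes from the dictionary keys

states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)  # align the dictionary to these labels

print(obj3)
print(obj4)  # "California" has no value in sdata, so it shows NaN
```

Note that "Utah" is dropped from `obj4` because it is not in `states`, and the integers become floats so that NaN can be stored.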

Pandas's isnull and notnull functions can be used to detect missing data, and there are similar instance methods in Series:
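A minimal sketch of missing-data detection, using a small made-up Series with one NaN. It shows both the top-level functions and the equivalent instance methods.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
s = pd.Series([1.5, np.nan, 3.0], index=["a", "b", "c"])

mask_func = pd.isnull(s)   # top-level function
mask_method = s.isnull()   # equivalent instance method
present = s.notnull()      # the negation

print(mask_func)
print(present)
```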

Series can automatically align different indexes in arithmetic operations, such as:
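To illustrate the alignment, here is a sketch that adds two Series whose indexes only partly overlap. The labels and numbers are made up.

```python
import pandas as pd

# Two hypothetical Series sharing only the labels "b" and "c".
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = s1 + s2  # values are matched by label, not by position
print(total)
```

Labels present in only one operand ("a" and "d") produce NaN in the result, since there is nothing to add them to.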

Besides the values and index attributes, a Series object has a name attribute, which can be thought of as its column header. Its index also has a name attribute, which can be thought of as the name of the index.

Of course, the index of a Series can be modified by assigning values:
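A small sketch of the name attributes and of replacing an index by assignment. All names and labels here are hypothetical.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# Replace the default integer index in place by assignment.
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]

# Name the Series itself and its index (hypothetical names).
obj.name = "score"
obj.index.name = "person"

print(obj)
```

The new index must have the same length as the data, otherwise pandas raises an error.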

2. DataFrame

DataFrame is a tabular data structure that has both a row index and a column index. There are many ways to build a DataFrame; one of the most common is to pass in a dictionary of equal-length lists or NumPy arrays:

When a DataFrame is created, a row index is added automatically on the left and the column index appears at the top. Both are available through the index and columns attributes of the instance. If we specify a sequence of columns, the columns of the DataFrame are arranged in that order:

Similarly, if a column named in columns is not found in the data, it is filled with missing values:
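The three points above can be sketched as follows, assuming a small made-up dataset. The column name "debt" is deliberately absent from the data to show the missing-value behavior.

```python
import pandas as pd

# Hypothetical data: a dictionary of equal-length lists.
data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}

frame = pd.DataFrame(data)  # row index 0..2 is created automatically

# Choose the column order explicitly; "debt" is not in data, so it is all NaN.
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

print(frame)
print(frame2)
print(frame2.columns)
```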

We can retrieve a column of a DataFrame by its column label, and a row by its row label through the ix indexing field:
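A sketch of column and row selection on made-up data. The `ix` field described in the text was removed in pandas 1.0, so this example assumes a recent pandas and uses its replacements: `loc` for labels and `iloc` for integer positions.

```python
import pandas as pd

# Hypothetical data with string row labels.
frame = pd.DataFrame(
    {"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]},
    index=["one", "two", "three"],
)

col = frame["year"]          # select a column by label
row = frame.loc["two"]       # select a row by label (was frame.ix["two"])
row_by_pos = frame.iloc[1]   # select the same row by integer position

print(col)
print(row)
```

Both the column and the row come back as Series objects.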

A column can be modified by assigning a value:

If we assign a Series to a DataFrame column, its index is aligned exactly with the index of the DataFrame, and any gaps are filled with missing values.

We can also use the del keyword to delete columns:
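The following sketch walks through column assignment and deletion on a small made-up DataFrame. The column names "debt" and "recent" are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["one", "two", "three"])

# Assigning a scalar fills the whole column (creating it if needed).
frame["debt"] = 16.5

# Assigning a Series aligns on the row index: "two" matches,
# "four" is ignored, and unmatched rows become NaN.
val = pd.Series([-1.2, -1.5], index=["two", "four"])
frame["debt"] = val

# Create a derived column, then delete it again with del.
frame["recent"] = frame["year"] >= 2001
del frame["recent"]

print(frame)
```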

Creating a DataFrame object with a nested dictionary (Dictionary of Dictionaries) is also a common way:
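A minimal sketch with a made-up nested dictionary. The outer keys become the columns and the inner keys are merged into the row index.

```python
import pandas as pd

# Hypothetical nested dictionary: {column label: {row label: value}}.
pop = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

frame3 = pd.DataFrame(pop)
print(frame3)    # Nevada has no entry for 2000, so that cell is NaN
print(frame3.T)  # transpose to swap rows and columns
```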

To summarize and extend the above, the following kinds of input can be passed to the DataFrame constructor. Students can try the ones not shown above for themselves:

Type | Description
2D ndarray | A matrix of data; row and column labels can also be passed in
Dictionary of arrays, lists, or tuples | Each sequence becomes a column of the DataFrame. All sequences must be the same length
NumPy structured/record array | Treated like the "dictionary of arrays" case
Dictionary of Series | Each Series becomes a column. If no index is passed explicitly, the indexes of the Series are merged to form the row index of the result
Dictionary of dictionaries | Each inner dictionary becomes a column. The keys are merged to form the row index of the result
List of dictionaries or Series | Each item becomes a row of the DataFrame. The union of the dictionary keys or Series indexes becomes the column labels of the DataFrame
List of lists or tuples | Treated like the 2D ndarray case
Another DataFrame | The indexes of that DataFrame are used unless different ones are passed explicitly
NumPy MaskedArray | Like the 2D ndarray case, except that masked values become NA/missing in the resulting DataFrame

Likewise, the index and columns of a DataFrame each have a name attribute:

The values attribute of a DataFrame returns the data as a two-dimensional ndarray:
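A brief sketch of both attributes, using made-up data and hypothetical axis names.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Nevada": [2.4, 2.9], "Ohio": [1.7, 3.6]},
    index=[2001, 2002],
)

# Name the two axes (hypothetical names).
frame.index.name = "year"
frame.columns.name = "state"

arr = frame.values  # two-dimensional NumPy array of the data
print(frame)
print(arr)
```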

There are many ways to select and rearrange data in a DataFrame; here is a list:

Type | Description
obj[val] | Select a single column or a set of columns from the DataFrame. Convenient in some special cases: Boolean array (filter rows), slice (slice rows), Boolean DataFrame (set values by condition)
obj.ix[val] | Select a single row or a subset of rows from the DataFrame
obj.ix[:, val] | Select a single column or a subset of columns
obj.ix[val1, val2] | Select rows and columns at the same time
reindex method | Conform one or more axes to a new index
xs method | Select a single row or column by label, returning a Series
icol, irow methods | Select a single column or row by integer position, returning a Series
get_value, set_value methods | Select a single value by row label and column label

III. Summarizing and computing descriptive statistics

pandas objects share a common set of mathematical and statistical methods. Most of them are reductions or summary statistics, which extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame.

Let's start with the descriptive statistics methods:

Method | Description
count | Number of non-NA values
describe | Compute summary statistics for a Series or for each DataFrame column
min, max | Compute the minimum and maximum values
argmin, argmax | Compute the index positions (integers) at which the minimum and maximum values are found
idxmin, idxmax | Compute the index labels at which the minimum and maximum values are found
quantile | Compute a sample quantile (from 0 to 1)
sum | Sum of the values
mean | Mean of the values
median | Arithmetic median (50% quantile) of the values
mad | Mean absolute deviation from the mean
var | Sample variance of the values
std | Sample standard deviation of the values
skew | Sample skewness (third moment) of the values
kurt | Sample kurtosis (fourth moment) of the values
cumsum | Cumulative sum of the values
cummin, cummax | Cumulative minimum and cumulative maximum of the values
cumprod | Cumulative product of the values
diff | Compute the first-order difference (useful for time series)
pct_change | Compute percent changes

Next is a list of common options for the reduction methods:

Option | Description
axis | The axis to reduce over: 0 for DataFrame rows, 1 for columns
skipna | Exclude missing values; the default is True
level | If the axis is hierarchically indexed (that is, a MultiIndex), reduce grouped by level
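As a sketch of a few of these methods and of the axis and skipna options, the example below uses a small made-up DataFrame that contains missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values.
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_sums = df.sum()                      # reduce down the rows: one value per column
row_sums = df.sum(axis=1)                # reduce across the columns: one value per row
strict_means = df.mean(axis=1, skipna=False)  # NaN if any value in the row is missing
max_labels = df.idxmax()                 # index label of each column's maximum
running = df.cumsum()                    # cumulative sum, not a reduction
summary = df.describe()                  # several summary statistics at once

print(col_sums)
print(strict_means)
print(summary)
```

With the default skipna=True, missing values are simply left out of each reduction; passing skipna=False makes any missing value propagate to the result.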

Next, we use some of the methods above to analyze the stock prices and trading volumes of several companies from Yahoo! Finance (the lab environment has no internet access, so students who are able to should try the following examples on their own computers):

(The output is long, so only part of it is shown. The data is stock trading information from January 1, 2010 to December 31, 2014. Students can type in the code themselves to view it; loading may take a while, so do not worry.)

With this data we can process and analyze it.

The corr method of Series computes the correlation coefficient of the overlapping, non-NA, index-aligned values in two Series. Similarly, cov computes the covariance:

Using the corrwith method of DataFrame, you can compute the correlation coefficients between its columns or rows and another Series or DataFrame. Passing in a Series returns a Series of correlation coefficients (one per column), and passing in a DataFrame computes the correlation coefficients between columns with matching names:
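Since the Yahoo! Finance data is not available here, the following sketch uses a tiny synthetic table instead. The tickers AAA, BBB and CCC and all numbers are invented; BBB is built as exactly twice AAA and CCC as its negative, so the expected correlations are easy to check.

```python
import pandas as pd

# Synthetic "returns" for three made-up tickers (not real market data).
returns = pd.DataFrame(
    {
        "AAA": [0.01, 0.02, -0.01, 0.03, 0.00],
        "BBB": [0.02, 0.04, -0.02, 0.06, 0.00],    # 2 * AAA
        "CCC": [-0.01, -0.02, 0.01, -0.03, 0.00],  # -1 * AAA
    }
)

corr_ab = returns["AAA"].corr(returns["BBB"])  # correlation of two Series
cov_ab = returns["AAA"].cov(returns["BBB"])    # covariance of two Series

# Correlation of every column with one Series.
corr_with_a = returns.corrwith(returns["AAA"])

print(corr_ab, cov_ab)
print(corr_with_a)
```

`returns.corr()` and `returns.cov()` would give the full correlation and covariance matrices for all pairs of columns.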

That covers a good deal of statistics and summarization. Methods and forms we have not discussed will be explained as they come up in later code. Data analysis also requires mathematical knowledge such as linear algebra and probability and statistics. Mumu will not take the time to explain this mathematics, so students who are unfamiliar with it should make time to review it.

IV. Handling missing data

In data analysis, the data we get is often incomplete, so what should we do about the missing values? pandas uses the floating-point value NaN (not a number) to represent missing data in both floating-point and non-floating-point arrays. It is simply a sentinel that is easy to detect:

Here are some of the methods for handling NA values:

Method | Description
dropna | Filter axis labels according to whether the values for each label contain missing data, with a threshold for how much missing data to tolerate
fillna | Fill in missing data with a specified value or with an interpolation method such as ffill (fill forward from the previous value) or bfill (fill backward from the next value)
isnull | Return an object of Boolean values indicating which values are missing/NA, with the same type as the source
notnull | The negation of isnull

Next, let's look at the code.

While writing the examples, Mumu used reset to clear all objects and libraries, so the imports have to be run again. Students just need to be aware of this.

For DataFrame objects, things are a little more complicated. We may want to discard rows or columns that are entirely NA, or those that contain any NA at all. By default, dropna discards any row that contains a missing value:

Passing how='all' discards only those rows that are entirely NA:

To discard columns in the same way, just pass axis=1 (axis=0 means rows). Sometimes we only want to keep rows that contain at least a certain number of observations, which can be done with the thresh parameter:
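The dropna variants described above can be sketched on a small made-up DataFrame as follows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 2 is entirely NA.
data = pd.DataFrame(
    [
        [1.0, 6.5, 3.0],
        [1.0, np.nan, np.nan],
        [np.nan, np.nan, np.nan],
        [np.nan, 6.5, 3.0],
    ]
)

complete_rows = data.dropna()        # drop rows containing any NA
not_all_na = data.dropna(how="all")  # drop only rows that are entirely NA

data[3] = np.nan                               # add a column that is entirely NA
cols_kept = data.dropna(axis=1, how="all")     # drop columns that are entirely NA
at_least_two = data.dropna(thresh=2)           # keep rows with at least 2 non-NA values

print(complete_rows)
print(not_all_na)
print(cols_kept)
print(at_least_two)
```

None of these calls modify `data`; each returns a new DataFrame.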

Sometimes we do not want to filter out missing data, but to fill in the gaps in some other way. The main points are described in the code:
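A minimal sketch of the common filling strategies, assuming a recent pandas and made-up data. Forward filling is written here with the `ffill` method; older code often spells it `fillna(method='ffill')`, which recent pandas versions deprecate.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [np.nan, 2.0, np.nan, np.nan],
    }
)

filled_zero = df.fillna(0)                         # one scalar for every gap
filled_per_col = df.fillna({"a": 0.5, "b": -1.0})  # a different value per column
filled_forward = df.ffill(limit=1)                 # carry the previous value forward, at most 1 step
filled_mean = df.fillna(df.mean())                 # fill each column with its own mean

print(filled_zero)
print(filled_per_col)
print(filled_forward)
print(filled_mean)
```

Each call returns a new object and leaves `df` unchanged unless inplace=True is passed.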

  

Parameters of the fillna function:

Parameter | Description
value | A scalar value or dictionary-like object used to fill missing values
method | Interpolation method. The default is 'ffill' if no other parameters are specified when the function is called
axis | The axis to fill along; the default is axis=0
inplace | Modify the calling object without producing a copy
limit | For forward and backward filling, the maximum number of consecutive values to fill

V. Hierarchical indexing

Hierarchical indexing is an important feature of pandas that lets us have multiple (two or more) index levels on a single axis. Put simply, it lets us work with higher-dimensional data in a lower-dimensional form.

An example makes this clear:
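As a small sketch with made-up data, the Series below has two index levels on its single axis. Selecting on either level and calling `unstack` show how the same data can be viewed as a two-dimensional table.

```python
import pandas as pd

# Hypothetical data: outer level "a"/"b"/"c", inner level 1/2.
data = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=[["a", "a", "b", "b", "c"], [1, 2, 1, 2, 1]],
)

inner = data["b"]       # everything under outer label "b"
cross = data.loc[:, 2]  # inner label 2 across all outer labels
table = data.unstack()  # pivot the inner level into columns

print(data.index)
print(inner)
print(cross)
print(table)  # the ("c", 2) cell is NaN because that pair does not exist
```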

VI. Homework

Having been introduced to so many basic pandas features, students should go back and review them. Studying data analysis also requires some mathematical background, so please review linear algebra and statistics in preparation for what lies ahead.
