Course outline: Section 1, Pandas Review; Section 2, Reading and Writing Text Formats; Section 3, Using HTML and Web APIs; Section 4, Using Databases; Section 5, Merging Datasets; Section 6, Reshaping and Pivoting; Section 7, Data Transformation; Section 8, String Operations; Section 9, Plotting and Visualization
Pandas Review
I. Lab introduction
To follow this data analysis course, students need a solid foundation in the Python language and some knowledge of basic libraries such as NumPy and Matplotlib. Students can refer to the basic Python language course and the Python scientific computing course on the Shiyanlou platform.
Pandas is the library of choice for the rest of our study of data analysis. It provides high-level data structures and manipulation tools that make data analysis faster and easier.
Pandas combines NumPy's array computing capabilities with the flexible data manipulation capabilities of spreadsheets and relational databases, and it also provides sophisticated indexing functionality. For users in the financial industry, pandas offers rich, high-performance time series functionality and tools for working with financial data.
First, let's agree on an import convention for pandas and NumPy, which is also Mumu's habit when writing code.
In [1]: import pandas as pd
In [2]: import numpy as np
So whenever students see pd in the code, it refers to pandas, and np refers to NumPy. Mumu uses IPython; the leading In [] and Out [] are prompts that IPython generates to mark input and output, so we do not need to type them. For background, refer to the basic Python tutorial and the Python scientific computing tutorial.
II. Introduction to pandas data structures
To use pandas, we first have to familiarize ourselves with its two main data structures: Series and DataFrame.
We import these two data structures from the pandas library:
In [3]: from pandas import Series, DataFrame
1. Series
A Series is an object that is similar to a one-dimensional array, consisting of a set of data and the associated data labels (that is, indexes).
A simple Series object can be created from nothing more than a set of data. The string representation of a Series shows the index on the left and the values on the right. When we create a Series without specifying an index for the data, pandas automatically creates an integer index from 0 to n-1 (where n is the length of the data). Every Series object has values and index attributes, which return the array of values and the index object, respectively:
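A minimal sketch of the behaviour described above. The values and the name obj are illustrative assumptions, not taken from the original example:

```python
import pandas as pd

# Build a Series from a plain list; pandas supplies the index 0..n-1.
obj = pd.Series([4, 7, -5, 3])

print(obj)         # index on the left, values on the right
print(obj.values)  # the underlying array of values
print(obj.index)   # the automatically created integer index
```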
Usually we want to specify the index value we need when we create the Series object:
This lets us access values in the Series not only by integer position but also through the index labels we defined ourselves:
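A short sketch of both access styles, using made-up values and labels (obj2 is a hypothetical name). It assumes a current pandas release, where positional access on a labelled Series is written with iloc:

```python
import pandas as pd

# Illustrative data with an explicit, user-defined index.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

by_label = obj2["a"]            # access through our own label
by_position = obj2.iloc[2]      # access through the integer position
subset = obj2[["c", "a", "d"]]  # several labels at once, in the order given
```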
Note: the u prefix in the output above marks a Unicode string.
In addition, if the data is stored in a Python dictionary, we can create a Series directly from that dictionary:
If only a dictionary is passed in, the index of the resulting Series consists of the dictionary's keys (sorted in older pandas versions, in insertion order in current ones):
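A sketch of both cases: a dictionary on its own, and a dictionary combined with an explicit index. Only the names sdata and states come from the text; the keys and values here are stand-ins chosen for illustration:

```python
import pandas as pd

# Illustrative data; these keys and values are assumptions.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)  # index is taken from the dictionary keys

# Passing an explicit index selects and orders the entries by label.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)  # "California" is not in sdata, so it gets NaN

print(obj4)
```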
When a dictionary (sdata) is passed together with an explicit index (states), the values in sdata whose keys match labels in states are found and placed in the corresponding positions; in the original example, three labels matched. For a label with no corresponding value in sdata, the result is NaN ("not a number"), which pandas uses to represent missing data.
Pandas's isnull and notnull functions can be used to detect missing data, and there are similar instance methods in Series:
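For instance, with a small made-up Series containing one missing value, the function form and the method form give the same answer:

```python
import numpy as np
import pandas as pd

obj = pd.Series([1.5, np.nan, 3.0], index=["a", "b", "c"])

mask_func = pd.isnull(obj)   # top-level function
mask_method = obj.isnull()   # equivalent instance method
present = pd.notnull(obj)    # True where a value is present
```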
In arithmetic operations, Series automatically aligns data by index label, for example:
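A minimal sketch with two hypothetical Series whose indexes only partly overlap:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Labels are matched before adding; a label present on only one side gives NaN.
total = a + b
print(total)
```

The result index is the union of the two indexes, which is why "x" and "w" appear with missing values.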
Besides the values and index attributes, a Series object has a name attribute, which can be thought of as a table header for the values; the index also has a name attribute, which is simply the name of the index.
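As a quick illustration (the labels "population" and "state" are made up for this sketch):

```python
import pandas as pd

obj = pd.Series([35000, 71000], index=["Ohio", "Texas"])
obj.name = "population"   # label for the values, shown when the Series is printed
obj.index.name = "state"  # label for the index

print(obj)
```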
The index of a Series can also be modified in place by assignment:
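A sketch of replacing the index, with hypothetical labels. The new index has to contain as many labels as there are values:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# Assigning to .index relabels the existing values without moving them.
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
```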
2. DataFrame
A DataFrame is a tabular data structure that has both a row index and a column index. There are many ways to build a DataFrame; one of the most common is to pass in a dictionary of equal-length lists or NumPy arrays:
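A minimal sketch of this construction. The column names and values are illustrative assumptions:

```python
import pandas as pd

# Each key becomes a column; all lists must have the same length.
data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}
frame = pd.DataFrame(data)
print(frame)
```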
When a DataFrame is created, a row index is added automatically on the left and the column index appears at the top; both are available on the instance through the index and columns attributes. For example, if we specify a sequence of columns, the DataFrame's columns are arranged in exactly that order:
Similarly, if a column we pass in is not found in the data, it appears in the result filled with missing values:
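Both points can be sketched together. The data is made up, and "debt" is a hypothetical column name that deliberately does not exist in the dictionary:

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}

# columns= fixes the column order; "debt" is absent from data, so it is all NaN.
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

print(frame2.columns)
print(frame2.index)
```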
A column of a DataFrame can be retrieved by its column label, and a row by its row label, but row access goes through the ix indexing field (ix was removed in pandas 1.0; loc and iloc are its replacements):
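A sketch of column and row access, written with loc and iloc on the assumption that a current pandas release is in use (the original example used ix). The data and row labels are illustrative:

```python
import pandas as pd

frame = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001, 2001]},
    index=["one", "two", "three"],
)

col = frame["state"]      # column by dictionary-style key
col_attr = frame.state    # same column by attribute access
row = frame.loc["two"]    # row by label
row_pos = frame.iloc[1]   # same row by integer position
```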
A column can be modified by assigning a value:
If we assign a Series to a DataFrame column, its labels are aligned exactly to the DataFrame's index, and any positions without a match are filled with missing values.
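A small sketch of both kinds of assignment, with made-up values and a hypothetical "debt" column:

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["one", "two", "three"])

frame["debt"] = 16.5  # a scalar is broadcast to every row

# A Series is aligned on the index; "one" has no entry, so it becomes NaN.
val = pd.Series([-1.2, -1.5], index=["two", "three"])
frame["debt"] = val
```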
We can also use the del keyword to delete columns:
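For example, on a small illustrative frame:

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "debt": [1.0, 2.0]})

del frame["debt"]  # removes the column in place
```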
Building a DataFrame from a nested dictionary (a dictionary of dictionaries) is another common approach:
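A sketch with made-up data. The outer keys become the columns, and the inner keys are combined into the row index:

```python
import pandas as pd

pop = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}
frame3 = pd.DataFrame(pop)

# Nevada has no entry for 2000, so that cell is NaN.
print(frame3)
```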
To summarize and extend this, the following kinds of data can be passed to the DataFrame constructor; students can try each of them for themselves:
Type | Description
2D ndarray | A matrix of data; row and column labels may also be passed
Dictionary of arrays, lists, or tuples | Each sequence becomes a column of the DataFrame. All sequences must be the same length
NumPy structured/record array | Treated like a dictionary of arrays
Dictionary of Series | Each Series becomes a column. If no index is passed explicitly, the indexes of the Series are unioned to form the row index of the result
Dictionary of dictionaries | Each inner dictionary becomes a column. The keys are unioned to form the row index of the result
List of dictionaries or Series | Each item becomes a row of the DataFrame. The union of the dictionary keys or Series indexes becomes the column labels of the DataFrame
List of lists or tuples | Treated like a 2D ndarray
Another DataFrame | The DataFrame's indexes are used unless different ones are passed explicitly
NumPy MaskedArray | Like the 2D ndarray case, except that masked values become NA/missing in the resulting DataFrame
Like a Series, a DataFrame's index and columns each have a name attribute:
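A brief sketch, with illustrative data and names:

```python
import pandas as pd

frame = pd.DataFrame({"Nevada": [2.4, 2.9], "Ohio": [1.7, 3.6]}, index=[2001, 2002])

frame.index.name = "year"     # name of the row index
frame.columns.name = "state"  # name of the column index
print(frame)
```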
The values attribute returns the data in the DataFrame as a two-dimensional ndarray:
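A sketch with made-up frames. When the columns hold different kinds of data, the resulting array falls back to a dtype that can hold all of them:

```python
import pandas as pd

frame = pd.DataFrame({"Nevada": [2.4, 2.9], "Ohio": [1.7, 3.6]}, index=[2001, 2002])
arr = frame.values  # 2 x 2 float array

# Mixed text and numeric columns produce an object array.
mixed = pd.DataFrame({"state": ["Ohio"], "pop": [1.5]}).values
```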
There are many ways to select and rearrange data in a DataFrame; here is a list. It reflects the older pandas API this tutorial was written against: ix, icol, irow, get_value, and set_value have since been removed in favor of loc, iloc, at, and iat.
Type | Description
obj[val] | Select a single column or a set of columns from the DataFrame. Convenient in a few special cases: boolean array (filter rows), slice (slice rows), boolean DataFrame (set values by condition)
obj.ix[val] | Select a single row or a subset of rows from the DataFrame
obj.ix[:, val] | Select a single column or a subset of columns
obj.ix[val1, val2] | Select rows and columns at the same time
reindex method | Conform one or more axes to a new index
xs method | Select a single row or column by label, returned as a Series
icol, irow methods | Select a single column or a single row by integer position, returned as a Series
get_value, set_value methods | Select or set a single value by row label and column label
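The selections in the table can be sketched with the label-based loc and position-based iloc indexers, assuming a current pandas release where ix is no longer available. The frame and its labels are made up for illustration:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

cols = data[["three", "one"]]    # obj[val]: a set of columns
rows = data[data["three"] > 5]   # boolean array: filter rows
row = data.loc["Colorado"]       # a single row by label
col = data.loc[:, "two"]         # a single column by label
cell = data.loc["Utah", "four"]  # row and column at the same time
by_pos = data.iloc[2]            # a single row by integer position
```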
III. Summarizing and computing descriptive statistics
Pandas objects are equipped with a set of common mathematical and statistical methods. Most of them are reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame.
Let's start with the descriptive statistics methods:
Method | Description
count | Number of non-NA values
describe | Compute a set of summary statistics for a Series or for each DataFrame column
min, max | Compute minimum and maximum values
argmin, argmax | Compute the index positions (integers) at which the minimum and maximum values are found
idxmin, idxmax | Compute the index labels at which the minimum and maximum values are found
quantile | Compute a sample quantile (from 0 to 1)
sum | Sum of values
mean | Mean of values
median | Arithmetic median (50% quantile) of values
mad | Mean absolute deviation from the mean
var | Sample variance of values
std | Sample standard deviation of values
skew | Sample skewness (third moment) of values
kurt | Sample kurtosis (fourth moment) of values
cumsum | Cumulative sum of values
cummin, cummax | Cumulative minimum and cumulative maximum of values
cumprod | Cumulative product of values
diff | Compute first-order differences (useful for time series)
pct_change | Compute percent changes
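A few of these methods applied to a small made-up frame that contains missing values (the data and names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_sums = df.sum()      # one value per column, NaN skipped
max_labels = df.idxmax() # index label of the maximum in each column
running = df.cumsum()    # cumulative sum down each column
summary = df.describe()  # count, mean, std, min, quartiles, max
```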
Next, here is a list of options common to the reduction methods:
Option | Description
axis | The axis to reduce over: 0 for DataFrame rows, 1 for columns
skipna | Exclude missing values; defaults to True
level | If the axis is hierarchically indexed (a MultiIndex), reduce grouped by level
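The axis and skipna options can be sketched on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

row_sums = df.sum(axis=1)                     # reduce across the columns of each row
strict_means = df.mean(axis=1, skipna=False)  # NaN if any value in the row is missing
```

With the default skipna=True, the all-NaN row "c" sums to 0.0; with skipna=False, any row containing a missing value yields NaN.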
Next, I will use some of the methods above to analyze stock prices and trading volumes for several companies from Yahoo! Finance. (The Shiyanlou lab environment has no network access, so students who are able to can try the following example on their own computer.)
(The output is long, so only a small part is shown. The data is stock trading information from January 1, 2010 to December 31, 2014. Students can type in the code themselves to view it; loading may take a while, so don't worry.)
With this data we can process and analyze it.
The corr method of Series computes the correlation coefficient of the overlapping, non-NA, index-aligned values in two Series. Similarly, cov computes the covariance:
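Because the Yahoo! Finance data needs network access, the sketch below uses randomly generated numbers as a stand-in for daily returns. The tickers AAA, BBB, and CCC are hypothetical:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for percent changes of three stocks (not real market data).
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(size=(250, 3)), columns=["AAA", "BBB", "CCC"])

r = returns["AAA"].corr(returns["BBB"])  # correlation between two Series
c = returns["AAA"].cov(returns["BBB"])   # covariance between two Series
corr_matrix = returns.corr()             # full correlation matrix of the DataFrame
```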
Using the corrwith method of DataFrame, you can compute the correlation coefficients between its columns or rows and another Series or DataFrame. Passing a Series returns a Series of correlation coefficients (one computed for each column), while passing a DataFrame computes the correlations of columns with matching names:
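A sketch of both forms of corrwith, again on synthetic numbers with hypothetical tickers rather than the real stock data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
cols = ["AAA", "BBB", "CCC"]
returns = pd.DataFrame(rng.normal(size=(250, 3)), columns=cols)
volume = pd.DataFrame(rng.normal(size=(250, 3)), columns=cols)

with_series = returns.corrwith(returns["AAA"])  # each column against one Series
with_frame = returns.corrwith(volume)           # columns paired by matching name
```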
There are many more statistical and summary methods; those not covered here will be introduced as they come up in later code. Data analysis also requires mathematical background such as linear algebra and probability and statistics. Mumu will not take time to explain this mathematics, so students who are unfamiliar with it should make time to review it on their own.
IV. Handling missing data
In data analysis, the data we obtain is often incomplete, so how should we deal with the missing values? Pandas uses the floating-point value NaN (not a number) to represent missing data in both floating-point and non-floating-point arrays. It is simply a sentinel that is easy to detect:
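A sketch with a made-up Series of strings, showing that the same sentinel works for non-numeric data and that Python's None is also treated as missing:

```python
import numpy as np
import pandas as pd

string_data = pd.Series(["aardvark", "artichoke", np.nan, "avocado"])
mask = string_data.isnull()  # True only at the NaN position

string_data.iloc[0] = None   # None is treated as NA as well
mask_after = string_data.isnull()
```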
Here are some of the NA handling methods:
Method | Description
dropna | Filter axis labels based on whether the values for each label contain missing data, with a threshold to adjust how much missing data is tolerated
fillna | Fill in missing data with a specified value or with an interpolation method such as ffill (propagate the previous value forward) or bfill (propagate the next value backward)
isnull | Return an object of boolean values indicating which values are missing/NA, of the same type as the source
notnull | The negation of isnull
Next, let's look at the code.
While writing the examples, Mumu used reset (IPython's %reset magic), which clears all objects and imported libraries, so they have to be imported again. Students just need to be aware of this.
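For a Series, filtering out missing data is straightforward. A minimal sketch with made-up values, showing dropna next to the equivalent boolean filter:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

cleaned = data.dropna()       # keeps only the non-missing values
same = data[data.notnull()]   # the same result via boolean indexing
```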
For DataFrame objects, things are a little more complicated. We may want to drop rows or columns that are entirely NA, or those that contain any NA. By default, dropna drops any row that contains a missing value:
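A sketch of the default behaviour on a small illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1.0, 6.5, 3.0],
    [1.0, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.0],
])

cleaned = df.dropna()  # only row 0 has no missing values
```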
Passing how='all' drops only the rows that are entirely NA:
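On an illustrative frame in which one row is entirely missing, only that row is removed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1.0, 6.5, 3.0],
    [1.0, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.0],
])

# Row 2 is all NaN and is dropped; rows with at least one value are kept.
kept = df.dropna(how="all")
```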
To drop columns in the same way, just pass axis=1 (axis=0 refers to rows). Sometimes we only want to keep rows that contain at least a certain number of observations; the thresh parameter does this:
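Both options can be sketched on a made-up frame whose last column is entirely missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1.0, 6.5, 3.0, np.nan],
    [1.0, np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.0, np.nan],
])

no_empty_cols = df.dropna(axis=1, how="all")  # drops column 3, which is all NaN
at_least_two = df.dropna(thresh=2)            # keeps rows with 2 or more non-NA values
```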
Sometimes we don't want to filter out missing data but would rather fill the gaps in some other way. I describe the main points in the code:
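A minimal sketch of three filling styles on made-up data. Forward filling is written with the ffill method here, on the assumption of a recent pandas release, where passing method= to fillna is deprecated:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.0, np.nan], [np.nan, 2.0], [np.nan, np.nan]],
    columns=["a", "b"],
)

zeros = df.fillna(0)                      # one scalar for every gap
per_col = df.fillna({"a": 0.5, "b": -1})  # a different fill value per column
forward = df.ffill(limit=1)               # carry the last value forward, at most one step
```

None of these calls changes df itself; each returns a new object.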
Parameters of the fillna function:
Parameter | Description
value | Scalar value or dictionary-like object used to fill missing values
method | Interpolation method; defaults to 'ffill' if the function is called with no other arguments
axis | Axis to fill along; defaults to 0
inplace | Modify the calling object without producing a copy
limit | For forward and backward filling, the maximum number of consecutive values to fill
V. Hierarchical indexing
Hierarchical indexing is an important feature of pandas that lets us have multiple (two or more) index levels on one axis. Put simply, it provides a way to work with higher-dimensional data in a lower-dimensional form.
An example makes this clear:
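A sketch with a made-up Series indexed on two levels. Selecting on the outer level alone returns the matching inner block, and unstack spreads the inner level into columns:

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(6.0),
    index=[["a", "a", "b", "b", "c", "c"], [1, 2, 1, 2, 1, 2]],
)

inner = data["b"]        # partial indexing on the outer level
cell = data.loc["b", 2]  # both levels at once
wide = data.unstack()    # rows: outer level, columns: inner level
```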
VI. Homework
Students have now been introduced to many basic pandas features, and I hope everyone will go back and review them. Data analysis also requires a certain mathematical foundation, so I hope students will review linear algebra and statistics in preparation for later lessons.
Python Data Analysis (I): this lab covers pandas basics, data loading, storage and file formats, data normalization, plotting, and visualization.