Pandas basics
Pandas is a data analysis package built on NumPy that provides more advanced data structures and tools.
Just as NumPy is centered on the ndarray, pandas is centered on two core data structures, Series and DataFrame, which correspond to one-dimensional sequences and two-dimensional tables respectively. The examples below import pandas as follows:
from pandas import Series, DataFrame
import pandas as pd
Series
Series can be seen as an ordered dictionary with a fixed length. Almost any one-dimensional data can be used to construct a Series object:
>>> s = Series([1,2,3.0,'abc'])
>>> s
0      1
1      2
2      3
3    abc
dtype: object
Although dtype: object can hold a mix of basic data types, it tends to hurt performance, so it is best to keep to a simple dtype.
The Series object has two main attributes, index and values, which are the left and right columns in the preceding example. Because a list was passed to the constructor, the index is a sequence of integers counting up from 0. If a dictionary-like key-value structure is passed in instead, a Series with the corresponding index-value pairs is generated. Alternatively, an index can be specified explicitly with a keyword parameter at initialization:
>>> s = Series(data=[1,3,5,7],index = ['a','b','x','y'])
>>> s
a    1
b    3
x    5
y    7
dtype: int64
>>> s.index
Index(['a', 'b', 'x', 'y'], dtype='object')
>>> s.values
array([1, 3, 5, 7], dtype=int64)
The elements of a Series are constructed strictly according to the given index. This means that if the data parameter consists of key-value pairs, only the keys contained in index are used; and if a key in index is missing from data, that key is still added, with NaN as its value.
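As a minimal sketch of this rule, the snippet below builds a Series from a dictionary together with an explicit index. The dictionary contents and the label 'z' are made up for illustration.

```python
import pandas as pd

# Hypothetical data: a dict with keys 'a', 'b', 'c'.
data = {'a': 1, 'b': 2, 'c': 3}

# Only keys listed in index are used: 'b' is dropped,
# and 'z' (absent from data) is added with NaN as its value.
s = pd.Series(data, index=['a', 'c', 'z'])
print(s)
```

Because NaN is a floating-point value, the resulting Series is stored as float64 even though the input values were integers.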
Note that although there is a correspondence between the index of a Series and its values, this is different from a dictionary mapping. index and values are actually independent arrays (ndarrays), so Series performance is perfectly fine.
The biggest benefit of a key-value structure like this is that indexes are aligned automatically during arithmetic operations between Series.
In addition, both the Series object and its index have a name attribute:
>>> s.name = 'a_series'
>>> s.index.name = 'the_index'
>>> s
the_index
a    1
b    3
x    5
y    7
Name: a_series, dtype: int64
DataFrame
DataFrame is a tabular data structure that contains an ordered collection of columns (which act like an index). Each column can hold a different value type (unlike an ndarray, which can have only one dtype). A DataFrame can be thought of as a collection of Series that share the same index.
The construction method of DataFrame is similar to that of Series, except that multiple one-dimensional data sources can be accepted at the same time. Each data source is a separate column:
>>> data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'],
...         'year':[2000,2001,2002,2001,2002],
...         'pop':[1.5,1.7,3.6,2.4,2.9]}
>>> df = DataFrame(data)
>>> df
   pop   state  year
0  1.5   Ohino  2000
1  1.7   Ohino  2001
2  3.6   Ohino  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

[5 rows x 3 columns]
Although the data parameter looks like a dictionary, the dictionary keys do not become the index of the DataFrame; they become the name attribute of each column's Series. The index generated here is still 0, 1, 2, 3, 4.
The main parameters of the DataFrame constructor are DataFrame(data=None, index=None, columns=None), where columns gives the column names:
>>> df = DataFrame(data,index=['one','two','three','four','five'],
...                columns=['year','state','pop','debt'])
>>> df
       year   state  pop debt
one    2000   Ohino  1.5  NaN
two    2001   Ohino  1.7  NaN
three  2002   Ohino  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

[5 rows x 4 columns]
Missing values are again filled with NaN. Let's look at the index, the columns, and the type of a single column:
>>> df.index
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
>>> df.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
>>> type(df['debt'])
<class 'pandas.core.series.Series'>
DataFrame treats row-oriented and column-oriented operations roughly symmetrically. Any single column that is extracted is a Series.
Re-indexing object properties
A Series object is re-indexed through its .reindex(index=None, **kwargs) method. The two commonly used parameters in **kwargs are method=None and fill_value=np.nan:
>>> ser = Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
>>> a = ['a','b','c','d','e']
>>> ser.reindex(a)
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
>>> ser.reindex(a,fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
>>> ser.reindex(a,method='ffill')
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64
>>> ser.reindex(a,fill_value=0,method='ffill')
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64
The .reindex() method returns a new object whose index strictly follows the given argument. The method parameter, one of {'backfill', 'bfill', 'pad', 'ffill', None}, specifies the interpolation (fill) method. When no fill method is given, missing entries are filled with fill_value, which defaults to NaN. (ffill is the same as pad and bfill the same as backfill; they fill with the preceding value and the following value respectively.)
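The fill options can be compared side by side in a short sketch. It assumes a Series whose index is already sorted, since recent pandas releases require a monotonic index for method-based filling. The labels and values are made up for illustration.

```python
import pandas as pd

# Hypothetical Series with a sorted (monotonic) index that has gaps.
ser = pd.Series([4.5, 7.2, -5.3], index=['a', 'c', 'e'])
new_index = ['a', 'b', 'c', 'd', 'e', 'f']

plain = ser.reindex(new_index)                    # new labels get NaN
zeros = ser.reindex(new_index, fill_value=0)      # new labels get 0
ffilled = ser.reindex(new_index, method='ffill')  # copy the preceding value
bfilled = ser.reindex(new_index, method='bfill')  # copy the following value

print(ffilled)
print(bfilled)
```

With bfill, the last label 'f' has no following value to copy, so it remains NaN.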
A DataFrame object is re-indexed with .reindex(index=None, columns=None, **kwargs). The only addition is an optional columns parameter for the column index. Usage is similar to the previous example, except that the interpolation parameter method applies only to rows, that is, axis 0.
>>> state = ['Texas','Utha','California']
>>> df.reindex(columns=state,method='ffill')
   Texas  Utha  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

[3 rows x 3 columns]
>>> df.reindex(index=['a','b','c','d'],columns=state,method='ffill')
   Texas  Utha  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8

[4 rows x 3 columns]
However, fill_value still works on columns. Sharp readers may already have thought of df.T.reindex(index, method='**').T, which is indeed a workable way to fill along columns. Note that with reindex(index, method='**') the index must be monotonic; otherwise the call fails with ValueError: Must be monotonic for forward fill. For example, index=['a','b','d','c'] will not work.
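The transpose trick can be sketched as follows. The frame below is a made-up stand-in for df, with column labels deliberately in sorted order so that the monotonic requirement is met, and 'Utah' is a hypothetical new column label.

```python
import pandas as pd

# Hypothetical frame whose column labels are already sorted.
df = pd.DataFrame([[2, 0, 1], [5, 3, 4], [8, 6, 7]],
                  index=['a', 'c', 'd'],
                  columns=['California', 'Ohio', 'Texas'])

new_columns = ['California', 'Ohio', 'Texas', 'Utah']

# Transpose so the columns become the row index, forward-fill there,
# then transpose back.
filled = df.T.reindex(new_columns, method='ffill').T
print(filled)
```

The new 'Utah' column takes its values from 'Texas', the nearest preceding label.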
Deleting items on a specified axis
To delete an element of a Series, or a row (column) of a DataFrame, use the .drop(labels, axis=0) method:
>>> ser
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> ser.drop('c')
d    4.5
b    7.2
a   -5.3
dtype: float64
>>> df.drop('a')
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.drop(['Ohio','Texas'],axis=1)
   California
a           2
c           5
d           8

[3 rows x 1 columns]
.drop() returns a new object; the original object is not changed.
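A self-contained sketch of the same calls, which also checks that the original frame is left untouched. The frame is rebuilt here with the values shown in the transcript.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

without_row = df.drop('a')                          # axis=0 (default): drop a row
without_cols = df.drop(['Ohio', 'Texas'], axis=1)   # axis=1: drop columns

print(df.shape)            # the original still has all rows and columns
print(without_row.shape)
print(without_cols.shape)
```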
Indexing and Slicing
Like NumPy, pandas supports slicing with obj[::] and filtering with a Boolean array.
Note that the index of a pandas object is not limited to integers, so when a non-integer label is used as a slice boundary, the end point is included.
>>> foo
a    4.5
b    7.2
c   -5.3
d    3.6
dtype: float64
>>> bar
0    4.5
1    7.2
2   -5.3
3    3.6
dtype: float64
>>> foo[:2]
a    4.5
b    7.2
dtype: float64
>>> bar[:2]
0    4.5
1    7.2
dtype: float64
>>> foo[:'c']
a    4.5
b    7.2
c   -5.3
dtype: float64
Here, foo and bar differ only in their index: the index of bar is a sequence of integers. As shown, slicing with integers gives the same result as a Python list or NumPy array by default, whereas with a string label such as 'c' the result includes the boundary element.
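The two slicing behaviors can be made explicit with .iloc (by position) and .loc (by label). This is a small sketch using the same values as foo above. The explicit accessors are used here only to make the distinction unambiguous.

```python
import pandas as pd

foo = pd.Series([4.5, 7.2, -5.3, 3.6], index=['a', 'b', 'c', 'd'])

by_position = foo.iloc[:2]   # positions 0 and 1; end position 2 is excluded
by_label = foo.loc[:'c']     # labels 'a' through 'c'; the end label is included

print(by_position)
print(by_label)
```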
DataFrame indexing is a special case, because a DataFrame has indexes along two axes.
The standard slicing syntax of a DataFrame can be thought of as .ix[::,::]. The ix object accepts two slices, one for the rows (axis=0) and one for the columns (axis=1):
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.ix[:2,:2]
   Ohio  Texas
a     0      1
c     3      4

[2 rows x 2 columns]
>>> df.ix['a','Ohio']
0
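The .ix accessor was deprecated in pandas 0.20 and removed in pandas 1.0, so the transcript above only runs on older versions. On current versions the same selections can be written with .iloc (positions) and .loc (labels), roughly as in this sketch:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

top_left = df.iloc[:2, :2]             # positional: first two rows and columns
cell = df.loc['a', 'Ohio']             # label-based: row 'a', column 'Ohio'
label_block = df.loc[:'c', :'Texas']   # label slices include both end points

print(top_left)
print(cell)
```

Here label_block selects the same 2 x 2 block as top_left, because the label slices stop at 'c' and 'Texas' inclusively.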
Without ix, direct [] access behaves in a special way:
- When indexing, columns are selected
- When slicing, rows are selected
This seems a bit illogical, but the author explains that "this syntax setting comes from practice" and we trust him.
>>> df['Ohio']
a    0
c    3
d    6
Name: Ohio, dtype: int32
>>> df[:'c']
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]
>>> df[:2]
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]
When using a Boolean array, note that rows and columns are selected differently (when selecting columns, the leading : cannot be omitted):
>>> df['Texas']>=4
a    False
c     True
d     True
Name: Texas, dtype: bool
>>> df[df['Texas']>=4]
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.ix[:,df.ix['c']>=4]
   Texas  California
a      1           2
c      4           5
d      7           8

[3 rows x 2 columns]
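For pandas versions without .ix, the row and column filters can be sketched with plain [] and .loc. The data matches the transcript; using .loc in place of .ix is a substitution, not the original author's code.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

# Row filter: a Boolean Series aligned with the row index.
rows = df[df['Texas'] >= 4]

# Column filter: a Boolean Series aligned with the columns;
# the leading ':' keeps every row.
cols = df.loc[:, df.loc['c'] >= 4]

print(rows)
print(cols)
```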
Arithmetic Operations and Data Alignment
One of the most important features of pandas is that it can perform arithmetic on objects with different indexes. When two objects are added, the index of the result is the union of the two indexes. Automatic data alignment introduces missing values (NaN by default) at labels that do not overlap.
>>> foo = Series({'a':1,'b':2})
>>> foo
a    1
b    2
dtype: int64
>>> bar = Series({'b':3,'d':4})
>>> bar
b    3
d    4
dtype: int64
>>> foo + bar
a   NaN
b     5
d   NaN
dtype: float64
The alignment operation of DataFrame occurs simultaneously on rows and columns.
If you do not want NA values in the result, you can use a fill_value parameter like the one described for reindex. To pass it, you have to call the arithmetic method on the object rather than use the operator: df1.add(df2, fill_value=0). The other arithmetic methods are sub(), div(), and mul().
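The difference between the operator and the method form can be seen in a small sketch that reuses the foo and bar values from the transcript above:

```python
import pandas as pd

foo = pd.Series({'a': 1, 'b': 2})
bar = pd.Series({'b': 3, 'd': 4})

with_operator = foo + bar                  # 'a' and 'd' have no partner -> NaN
with_method = foo.add(bar, fill_value=0)   # the missing side is treated as 0

print(with_operator)
print(with_method)
```

With fill_value=0, the labels 'a' and 'd' keep the value from whichever Series contains them.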
Arithmetic Operations between Series and DataFrame involve broadcasting.
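A brief sketch of what broadcasting means here, with illustrative data. By default the Series index is matched against the DataFrame columns and the operation is repeated for every row; the method form with axis=0 matches on the row index instead.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

first_row = df.loc['a']                   # Series indexed by column label
row_diff = df - first_row                 # subtracted from every row

col_diff = df.sub(df['Ohio'], axis=0)     # subtract a column from every column

print(row_diff)
print(col_diff)
```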
Function Application and Mapping
NumPy's ufuncs (element-wise array functions) can also be applied to pandas objects.
To apply a function to each row or column of a DataFrame, use the .apply(func, axis=0, args=(), **kwds) method:
>>> f = lambda x:x.max()-x.min()
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.apply(f)
Ohio          6
Texas         6
California    6
dtype: int64
>>> df.apply(f,axis=1)
a    2
c    2
d    2
dtype: int64
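The sketch below puts a ufunc and .apply() next to each other. The frame contains a few negative numbers (made up for illustration) so that np.abs has a visible effect, and value_range is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, -1, 2], [3, -4, 5], [6, 7, -8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

# A NumPy ufunc works element-wise and returns a DataFrame.
abs_df = np.abs(df)

def value_range(x):
    """Spread between the largest and smallest value of a Series."""
    return x.max() - x.min()

per_column = df.apply(value_range)          # axis=0: one result per column
per_row = df.apply(value_range, axis=1)     # axis=1: one result per row

print(abs_df)
print(per_column)
print(per_row)
```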
Sorting and ranking
The Series .sort_index(ascending=True) method sorts a Series by its index. The ascending parameter controls ascending or descending order; ascending is the default.
To sort a Series by value, use the .order() method. By default, all missing values are placed at the end of the Series.
On a DataFrame, the .sort_index(axis=0, by=None, ascending=True) method has an additional axis parameter and a by parameter. The by parameter sorts the rows by the values of one or more columns (by cannot be used to sort by rows):
>>> df.sort_index(by='Ohio')
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.sort_index(by=['California','Texas'])
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.sort_index(axis=1)
   California  Ohio  Texas
a           2     0      1
c           5     3      4
d           8     6      7

[3 rows x 3 columns]
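In later pandas releases, Series.order() and the by argument of sort_index() were dropped in favor of sort_values(). On a current version, the sorting operations described here would look roughly like the sketch below, using made-up, deliberately unsorted data.

```python
import numpy as np
import pandas as pd

ser = pd.Series([3.0, np.nan, 1.0, 2.0], index=['d', 'a', 'c', 'b'])
by_index = ser.sort_index()     # order by label: a, b, c, d
by_value = ser.sort_values()    # order by value; NaN is placed last

df = pd.DataFrame({'Ohio': [6, 0, 3], 'Texas': [1, 7, 4]},
                  index=['d', 'a', 'c'])
by_column = df.sort_values(by='Ohio')                # rows ordered by a column
by_labels = df.sort_index(axis=1, ascending=False)   # columns, descending labels

print(by_value)
print(by_column)
```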
Ranking, via Series.rank(method='average', ascending=True), replaces each value of an object with its rank (from 1 to n). The only question is how to handle ties, and the method parameter serves this purpose. It has four possible values: average, min, max, and first.
>>> ser=Series([3,2,0,3],index=list('abcd'))
>>> ser
a    3
b    2
c    0
d    3
dtype: int64
>>> ser.rank()
a    3.5
b    2.0
c    1.0
d    3.5
dtype: float64
>>> ser.rank(method='min')
a    3
b    2
c    1
d    3
dtype: float64
>>> ser.rank(method='max')
a    4
b    2
c    1
d    4
dtype: float64
>>> ser.rank(method='first')
a    3
b    2
c    1
d    4
dtype: float64
Note how the different method values rank the tied items ser[0] and ser[3].
The DataFrame .rank(axis=0, method='average', ascending=True) method has an additional axis parameter, which selects ranking by row or by column. At present there appears to be no way to rank across all elements at once.
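A compact sketch of tie handling and of the axis parameter. The Series repeats the one in the transcript, while the small DataFrame with columns 'x' and 'y' is made up for illustration.

```python
import pandas as pd

ser = pd.Series([3, 2, 0, 3], index=list('abcd'))

average = ser.rank()                 # tied items share the mean of their ranks
lowest = ser.rank(method='min')      # tied items all get the lowest rank
first = ser.rank(method='first')     # ties are broken by order of appearance

# Hypothetical frame: rank within each row instead of within each column.
df = pd.DataFrame({'x': [3, 1, 2], 'y': [1, 2, 3]})
across = df.rank(axis=1)

print(average)
print(across)
```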
Statistical methods
Pandas objects have a number of statistical methods. Most of them are reductions or summary statistics, which extract a single value from a Series, or a Series from the rows or columns of a DataFrame.
Take the DataFrame.mean(axis=0, skipna=True) method as an example. NA values in the data are skipped unless the entire slice (row or column) is NA. If you do not want this behavior, pass skipna=False to disable it:
>>> df
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

[4 rows x 2 columns]
>>> df.mean()
one    3.083333
two   -2.900000
dtype: float64
>>> df.mean(axis=1)
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
>>> df.mean(axis=1,skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
Other common statistical methods include:
Method | Description
count | Number of non-NA values
describe | Summary statistics for a Series or for each column of a DataFrame
min, max | Minimum and maximum
argmin, argmax | Integer index position of the minimum and maximum
idxmin, idxmax | Index label of the minimum and maximum
quantile | Sample quantile (0 to 1)
sum | Sum
mean | Mean
median | Median
mad | Mean absolute deviation from the mean
var | Variance
std | Standard deviation
skew | Skewness of the sample values (third moment)
kurt | Kurtosis of the sample values (fourth moment)
cumsum | Cumulative sum of the sample values
cummin, cummax | Cumulative minimum and cumulative maximum of the sample values
cumprod | Cumulative product of the sample values
diff | First-order difference (useful for time series)
pct_change | Percentage change
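A few of these methods are tried out below on the frame from the mean() example. This is only an illustrative selection, not a complete tour of the table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.4, 7.1, np.nan, 0.75],
                   'two': [np.nan, -4.5, np.nan, -1.3]},
                  index=['a', 'b', 'c', 'd'])

counts = df.count()             # non-NA values per column
largest = df.idxmax()           # index label of each column's maximum
running = df['one'].cumsum()    # NaN stays NaN; later values keep accumulating
summary = df.describe()         # several summary statistics at once

print(counts)
print(largest)
print(running)
print(summary)
```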
Handling Missing Data
In pandas, NA is mainly represented by np.nan. Python's built-in None is also treated as NA.
There are four methods for handling NA: dropna, fillna, isnull, and notnull.
is(not)null
These methods are applied to the object element-wise and return a Boolean array, which is typically used for Boolean indexing.
dropna
For a Series, dropna returns a Series containing only the non-null data and their index values.
The difficulty lies with DataFrame, because anything dropped has to be at least a whole row (or column). The answer, as with the earlier methods, is extra parameters: dropna(axis=0, how='any', thresh=None). The how parameter can be any or all; with all, a row (column) is discarded only when every element in it is NA. Another interesting parameter is thresh, an integer: with thresh=3, for example, a row is kept only if it has at least three non-NA values.
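The three variants can be compared on a small made-up frame in which each row has a different number of missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 6.5, 3.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.0]])

any_na = df.dropna()                  # how='any': drop rows with at least one NA
all_na = df.dropna(how='all')         # drop only rows that are entirely NA
at_least_two = df.dropna(thresh=2)    # keep rows with at least two non-NA values

print(any_na)
print(all_na)
print(at_least_two)
```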
fillna
The signature is fillna(value=None, method=None, axis=0). Besides a scalar of a basic type, value can also be a dictionary, which fills different columns with different values. The method parameter works the same way as in .reindex().
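A short sketch of scalar and dictionary filling, with made-up data. Recent pandas releases deprecate passing method to fillna, so forward filling is shown here through the separate ffill() method instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.4, np.nan, 0.75],
                   'two': [np.nan, -4.5, np.nan]})

scalar = df.fillna(0)                               # one value everywhere
per_column = df.fillna({'one': 0.0, 'two': -1.0})   # a different value per column
forward = df.ffill()                                # copy the preceding value down

print(per_column)
print(forward)
```

In forward, the first entry of column 'two' stays NaN because there is no earlier value to copy.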
Inplace Parameter
One point was left out earlier, and after writing up all the examples it turned out to be quite important. Among the methods of Series and DataFrame objects, any method that modifies the data and returns a new object usually has an inplace=False parameter. If it is set to True explicitly, the original object is modified in place instead.
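A minimal sketch of the difference, using drop() on a made-up frame. With inplace=True the call returns None, so its result should not be assigned back to the variable.

```python
import pandas as pd

df = pd.DataFrame({'Ohio': [0, 3, 6], 'Texas': [1, 4, 7]},
                  index=['a', 'c', 'd'])

new_df = df.drop('a')                   # default inplace=False: df is untouched
print(list(df.index))                   # still contains 'a'

result = df.drop('c', inplace=True)     # df itself loses row 'c'
print(result)                           # None
print(list(df.index))
```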