Pandas is a data analysis package built on NumPy that provides more advanced data structures and tools.
Just as NumPy revolves around the ndarray, pandas revolves around two core data structures: Series and DataFrame. Series and DataFrame correspond to a one-dimensional sequence and a two-dimensional table structure, respectively. The conventional way to import pandas is:
from pandas import Series, DataFrame
import pandas as pd
Series
A Series can be seen as a fixed-length ordered dictionary. Almost any one-dimensional data can be used to construct a Series object:
>>> s = Series([1, 2, 3.0, 'abc'])
>>> s
0      1
1      2
2    3.0
3    abc
dtype: object
Although a Series with dtype: object can hold a mix of basic data types, this tends to hurt performance, so it is best to keep a single simple dtype.
A Series object has two main attributes, index and values, shown as the left and right columns in the previous example. Because a list was passed to the constructor, the index is a sequence of integers increasing from 0. If a dictionary of key-value pairs is passed in, a Series with the corresponding index-value pairs is generated; alternatively, an index can be specified explicitly with a keyword argument at construction time:
>>> s = Series(data=[1, 3, 5, 7], index=['a', 'b', 'x', 'y'])
>>> s
a    1
b    3
x    5
y    7
dtype: int64
>>> s.index
Index(['a', 'b', 'x', 'y'], dtype='object')
>>> s.values
array([1, 3, 5, 7], dtype=int64)
The elements of a Series are constructed strictly according to the given index. This means that if the data argument consists of key-value pairs, only the keys contained in index are used; and if a key in index is missing from data, it is still added, with NaN as its value.
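The rule above can be seen in a small sketch. The keys and numbers are made-up illustration data, not taken from the original example:

```python
import pandas as pd

# Hypothetical data: the dict supplies values, the explicit index selects and orders them.
data = {'a': 1, 'b': 3, 'x': 5}
s = pd.Series(data, index=['b', 'a', 'z'])

# 'x' is dropped because it is not in the index;
# 'z' is kept with NaN because it is missing from the data.
print(s)
```

Because a NaN is introduced, the resulting dtype becomes a floating-point type even though the input values were integers.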
Note that although the index and the values of a Series correspond to each other, this is different from a dictionary mapping. The index and the values are still stored as separate ndarray arrays, so a Series performs perfectly well.
The greatest benefit of this key-value-like data structure is that indexes are aligned automatically when arithmetic is performed between Series.
In addition, both the Series object and its index have a name attribute:
>>> s.name = 'a_series'
>>> s.index.name = 'the_index'
>>> s
the_index
a    1
b    3
x    5
y    7
Name: a_series, dtype: int64
DataFrame
DataFrame is a tabular data structure that contains an ordered set of columns (similar to an index), each of which can hold a different value type (unlike an ndarray, which can have only one dtype). You can essentially think of a DataFrame as a collection of Series that share the same index.
DataFrame is constructed in a similar way to Series, except that it can accept multiple one-dimensional data sources at the same time, and each one becomes a separate column:
>>> data = {'state': ['Ohino', 'Ohino', 'Ohino', 'Nevada', 'Nevada'],
...         'year': [2000, 2001, 2002, 2001, 2002],
...         'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
>>> df = DataFrame(data)
>>> df
   pop   state  year
0  1.5   Ohino  2000
1  1.7   Ohino  2001
2  3.6   Ohino  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

[5 rows x 3 columns]
Although the data argument is a dictionary, its keys do not serve as the index of the DataFrame; they become the name of each column Series. The index generated here is still 0, 1, 2, 3, 4.
A fuller form of the DataFrame constructor is DataFrame(data=None, index=None, columns=None), where columns gives the column names:
>>> df = DataFrame(data, index=['one', 'two', 'three', 'four', 'five'],
...                columns=['year', 'state', 'pop', 'debt'])
>>> df
       year   state  pop debt
one    2000   Ohino  1.5  NaN
two    2001   Ohino  1.7  NaN
three  2002   Ohino  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

[5 rows x 4 columns]
Again, missing values are filled with NaN. Take a look at index, columns, and the type of a selected column:
>>> df.index
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
>>> df.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
>>> type(df['debt'])
<class 'pandas.core.series.Series'>
Row-oriented and column-oriented operations on a DataFrame are largely symmetric, and any column pulled out of it is a Series.
Object properties
Re-index
Re-indexing a Series object is done with its .reindex(index=None, **kwargs) method. Two parameters are commonly passed through **kwargs: method=None and fill_value=np.NaN.
>>> ser = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
>>> a = ['a', 'b', 'c', 'd', 'e']
>>> ser.reindex(a)
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
>>> ser.reindex(a, fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
>>> ser.reindex(a, method='ffill')
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64
>>> ser.reindex(a, fill_value=0, method='ffill')
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64
The .reindex() method returns a new object whose index strictly follows the given argument. The method: {'backfill', 'bfill', 'pad', 'ffill', None} parameter specifies the interpolation (fill) method; when it is not given, missing entries are filled with fill_value, which defaults to NaN. (ffill is the same as pad, and bfill is short for backfill; they take the fill value from the preceding or the following entry, respectively.)
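The fill options can be compared side by side in a minimal sketch. The values and the integer index are illustration data chosen so that the index is monotonic; they are not from the original example:

```python
import pandas as pd

# Illustrative data with a monotonic increasing integer index.
ser = pd.Series([10.0, 20.0, 30.0], index=[0, 2, 4])

plain = ser.reindex(range(6))                     # new labels get NaN
filled = ser.reindex(range(6), fill_value=0)      # new labels get the given constant
forward = ser.reindex(range(6), method='ffill')   # take the previous existing value
backward = ser.reindex(range(6), method='bfill')  # take the next existing value

print(forward.tolist())
print(backward.tolist())
```

With bfill, the last label has no following value to borrow from, so it stays NaN.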
The DataFrame re-indexing method is .reindex(index=None, columns=None, **kwargs). Compared with Series, it has one more optional parameter, columns, for the column index. Usage is similar to the previous example, except that the interpolation parameter method can only be applied to rows, that is, axis 0.
>>> state = ['Texas', 'Utha', 'California']
>>> df.reindex(columns=state, method='ffill')
   Texas  Utha  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

[3 rows x 3 columns]
>>> df.reindex(index=['a', 'b', 'c', 'd'], columns=state, method='ffill')
   Texas  Utha  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8

[4 rows x 3 columns]
However, fill_value still works. A clever reader might wonder whether df.T.reindex(index, method='**').T could be used to interpolate along the columns; the answer is that it can. Also note that when using reindex(index, method='**'), the index must be monotonic, otherwise ValueError: Must be monotonic for forward fill is raised. For example, the last call in the previous example would fail if index=['a', 'b', 'd', 'c'] were used.
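The transpose approach can be sketched as follows. The frame and the extra column label 'Utah' are illustration data, and the column labels are deliberately kept in sorted order so that the transposed index is monotonic:

```python
import pandas as pd

# Illustrative frame; the column labels are in sorted (monotonic) order.
df = pd.DataFrame([[2, 0, 1], [5, 3, 4], [8, 6, 7]],
                  index=['a', 'c', 'd'],
                  columns=['California', 'Ohio', 'Texas'])

new_cols = ['California', 'Ohio', 'Texas', 'Utah']

# Transpose so the columns become rows, forward-fill along them, transpose back.
filled = df.T.reindex(new_cols, method='ffill').T

print(filled)
```

Here the new 'Utah' column borrows its values from 'Texas', the nearest preceding label.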
Delete an item on a specified axis
That is, deleting an element of a Series or a row (column) of a DataFrame, using the object's .drop(labels, axis=0) method:
>>> ser
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> ser.drop('c')
d    4.5
b    7.2
a   -5.3
dtype: float64
>>> df.drop('a')
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.drop(['Ohio', 'Texas'], axis=1)
   California
a           2
c           5
d           8

[3 rows x 1 columns]
.drop() returns a new object; the original object is not changed.
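A short sketch makes the "returns a new object" behavior concrete. It reuses the values shown in the example output above:

```python
import pandas as pd

ser = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

ser_without_c = ser.drop('c')                    # drop one label from a Series
df_without_a = df.drop('a')                      # axis=0 (default): drop a row
df_one_col = df.drop(['Ohio', 'Texas'], axis=1)  # axis=1: drop columns

# The originals still have all their rows and columns.
print(len(ser), df.shape)
```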
Indexes and slices
Like NumPy, pandas supports indexing and slicing with obj[::], as well as filtering with a Boolean array.
However, because the index of a pandas object is not limited to integers, the end point is included when a non-integer label is used as a slice bound.
>>> foo
a    4.5
b    7.2
c   -5.3
d    3.6
dtype: float64
>>> bar
0    4.5
1    7.2
2   -5.3
3    3.6
dtype: float64
>>> foo[:2]
a    4.5
b    7.2
dtype: float64
>>> bar[:2]
0    4.5
1    7.2
dtype: float64
>>> foo[:'c']
a    4.5
b    7.2
c   -5.3
dtype: float64
Here foo and bar differ only in their index: bar's index is a sequence of integers. As shown, slicing with integer positions gives the same result as a Python list or NumPy array by default, whereas slicing with a string label such as 'c' includes the boundary element.
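The same distinction can be made explicit with the .iloc (position-based) and .loc (label-based) accessors. This is a minimal sketch using the foo values from the example above:

```python
import pandas as pd

foo = pd.Series([4.5, 7.2, -5.3, 3.6], index=['a', 'b', 'c', 'd'])

by_position = foo.iloc[:2]  # positions 0 and 1; the end position is excluded
by_label = foo.loc[:'c']    # labels 'a' through 'c'; the end label is included

print(list(by_position.index), list(by_label.index))
```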
Another special case is how DataFrame objects are indexed, because a DataFrame has two axes (it is doubly indexed).
You can think of it this way: the standard slicing syntax for DataFrame objects is .ix[::, ::]. The ix object accepts two sets of slices, one for the row direction (axis=0) and one for the column direction (axis=1):
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.ix[:2, :2]
   Ohio  Texas
a     0      1
c     3      4

[2 rows x 2 columns]
>>> df.ix['a', 'Ohio']
0
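The .ix indexer was deprecated and later removed from pandas (it is gone as of pandas 1.0). On a current installation, the two calls above can be written with .iloc and .loc instead. The sketch below assumes the same df as in the example:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

top_left = df.iloc[:2, :2]  # position-based, corresponds to df.ix[:2, :2]
cell = df.loc['a', 'Ohio']  # label-based, corresponds to df.ix['a', 'Ohio']

print(top_left)
print(cell)
```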
Without ix, direct indexing is a special case:
- With a single index label, a column is selected
- With a slice, rows are selected
This may seem illogical, but the author explains that "this syntax comes from practice", and we will take his word for it.
>>> df['Ohio']
a    0
c    3
d    6
Name: Ohio, dtype: int32
>>> df[:'c']
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]
>>> df[:2]
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]
When using a Boolean array, note the difference between selecting rows and selecting columns (when selecting columns, the : cannot be omitted):
>>> df['Texas'] >= 4
a    False
c     True
d     True
Name: Texas, dtype: bool
>>> df[df['Texas'] >= 4]
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.ix[:, df.ix['c'] >= 4]
   Texas  California
a      1           2
c      4           5
d      7           8

[3 rows x 2 columns]
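For reference, the same row and column filters can be sketched with .loc, which is available in current pandas releases. The df is assumed to be the same as in the example:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

# Boolean Series aligned with the rows: keeps rows where Texas >= 4.
rows = df[df['Texas'] >= 4]

# Boolean Series aligned with the columns: keeps columns where row 'c' >= 4.
# The leading ':' means "all rows" and cannot be left out.
cols = df.loc[:, df.loc['c'] >= 4]

print(rows)
print(cols)
```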
Arithmetic operations and data alignment
One of the most important features of pandas is that it can perform arithmetic on objects with different indexes. When objects are added, the index of the result is the union of the two indexes. Automatic data alignment introduces missing values, NaN by default, wherever the indexes do not overlap.
>>> foo = Series({'a': 1, 'b': 2})
>>> foo
a    1
b    2
dtype: int64
>>> bar = Series({'b': 3, 'd': 4})
>>> bar
b    3
d    4
dtype: int64
>>> foo + bar
a   NaN
b     5
d   NaN
dtype: float64
DataFrame alignment operations occur on both rows and columns.
When you do not want NA values in the result, you can use the fill_value argument mentioned for reindex above. To pass this argument, however, you need to use the object's method instead of the operator: df1.add(df2, fill_value=0). The other arithmetic methods are sub(), div(), and mul().
Arithmetic between a Series and a DataFrame involves broadcasting, which is not covered here for now.
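The difference between the operator and the method with fill_value can be sketched on two small Series. The same idea applies to DataFrame objects:

```python
import pandas as pd

foo = pd.Series({'a': 1, 'b': 2})
bar = pd.Series({'b': 3, 'd': 4})

with_nan = foo + bar                 # labels present on only one side become NaN
filled = foo.add(bar, fill_value=0)  # a missing operand is treated as 0 instead

print(with_nan)
print(filled)
```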
Function application and mapping
NumPy's ufuncs (element-wise array methods) can also be used on pandas objects.
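As a minimal sketch with made-up numbers, applying a NumPy ufunc to a DataFrame works element by element and keeps the labels:

```python
import numpy as np
import pandas as pd

# Illustrative data, not from the original example.
df = pd.DataFrame([[-1.5, 2.0], [3.0, -4.5]],
                  index=['a', 'b'], columns=['x', 'y'])

abs_df = np.abs(df)  # element-wise absolute value; the result is still a DataFrame

print(abs_df)
```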
To apply a function to each row or column of a DataFrame, use the .apply(func, axis=0, args=(), **kwds) method.
>>> f = lambda x: x.max() - x.min()
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.apply(f)
Ohio          6
Texas         6
California    6
dtype: int64
>>> df.apply(f, axis=1)
a    2
c    2
d    2
dtype: int64
Sort and rank
The Series sort_index(ascending=True) method sorts by index; the ascending parameter controls ascending or descending order and defaults to ascending.
To sort a Series by value, use the .order() method; any missing values are placed at the end of the Series by default.
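Series.order() was removed in later pandas releases in favor of sort_values(). The sketch below, on made-up data, shows value sorting and the placement of missing values with the newer method:

```python
import numpy as np
import pandas as pd

# Illustrative data containing one missing value.
ser = pd.Series([4.0, np.nan, 7.0, -3.0], index=['a', 'b', 'c', 'd'])

by_value = ser.sort_values()                      # ascending; NaN goes last by default
by_value_desc = ser.sort_values(ascending=False)  # descending; NaN still goes last
by_index = ser.sort_index(ascending=False)        # sort by label instead of by value

print(by_value)
```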
On a DataFrame, the .sort_index(axis=0, by=None, ascending=True) method has an additional axis parameter and a by parameter; by sorts according to one or more columns (by cannot be used on rows):
>>> df.sort_index(by='Ohio')
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.sort_index(by=['California', 'Texas'])
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.sort_index(axis=1)
   California  Ohio  Texas
a           2     0      1
c           5     3      4
d           8     6      7

[3 rows x 3 columns]
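In current pandas the by argument lives on sort_values() rather than sort_index(). The sketch below uses the same numbers as the example, with the rows deliberately shuffled so that the effect of sorting is visible:

```python
import pandas as pd

# Same values as the example, with the rows given in a shuffled order.
df = pd.DataFrame({'Ohio': [3, 0, 6], 'Texas': [4, 1, 7], 'California': [5, 2, 8]},
                  index=['c', 'a', 'd'])

by_col = df.sort_values(by='Ohio')                                   # one column
by_two = df.sort_values(by=['California', 'Texas'], ascending=False) # several columns
cols_sorted = df.sort_index(axis=1)                                  # sort column labels

print(by_col)
```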
The difference between ranking (Series.rank(method='average', ascending=True)) and sorting is that ranking replaces the values of the object with their ranks (from 1 to n). The only question is how to handle ties, which is what the method parameter is for. It has four possible values: average, min, max, first.
>>> ser = Series([3, 2, 0, 3], index=list('abcd'))
>>> ser
a    3
b    2
c    0
d    3
dtype: int64
>>> ser.rank()
a    3.5
b    2.0
c    1.0
d    3.5
dtype: float64
>>> ser.rank(method='min')
a    3
b    2
c    1
d    3
dtype: float64
>>> ser.rank(method='max')
a    4
b    2
c    1
d    4
dtype: float64
>>> ser.rank(method='first')
a    3
b    2
c    1
d    4
dtype: float64
Note how the different method values change the ranks assigned to the tied elements ser[0] == ser[3].
The DataFrame .rank(axis=0, method='average', ascending=True) method adds an axis parameter, so ranking can be done by row or by column; there appears to be no method for ranking across all elements at the moment.
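A small sketch of the axis parameter follows. The frame and its labels are made up for illustration:

```python
import pandas as pd

# Illustrative data; row 'r1' contains a tie between columns 'y' and 'z'.
df = pd.DataFrame({'x': [1, 3], 'y': [2, 1], 'z': [2, 5]}, index=['r1', 'r2'])

by_col = df.rank(axis=0)                    # rank the values within each column
by_row = df.rank(axis=1)                    # rank the values within each row
by_row_min = df.rank(axis=1, method='min')  # tied values share the lowest rank

print(by_row)
```

With the default method='average', the two tied values in row 'r1' each receive rank 2.5.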
Statistical methods
There are some statistical methods for pandas objects. Most of them are reduction and summary statistics, used to extract a single value from a series, or to extract a series from a DataFrame row or column.
For example, with DataFrame.mean(axis=0, skipna=True), NA values in the dataset are simply skipped unless the entire slice (row or column) is NA; if you do not want this, you can disable it with skipna=False:
>>> df
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

[4 rows x 2 columns]
>>> df.mean()
one    3.083333
two   -2.900000
dtype: float64
>>> df.mean(axis=1)
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
>>> df.mean(axis=1, skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
Other commonly used statistical methods are:
Method | Description
count | Number of non-NA values
describe | Summary statistics for a Series or for each column of a DataFrame
min, max | Minimum and maximum values
argmin, argmax | Index positions (integers) of the minimum and maximum values
idxmin, idxmax | Index labels of the minimum and maximum values
quantile | Sample quantile (from 0 to 1)
sum | Sum
mean | Mean
median | Median
mad | Mean absolute deviation from the mean
var | Variance
std | Standard deviation
skew | Skewness of the sample values (third moment)
kurt | Kurtosis of the sample values (fourth moment)
cumsum | Cumulative sum of the sample values
cummin, cummax | Cumulative minimum and cumulative maximum of the sample values
cumprod | Cumulative product of the sample values
diff | First-order difference (useful for time series)
pct_change | Percent change
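A few of these methods are sketched below on a made-up Series. Availability depends on the pandas version (mad, for instance, is no longer present in recent releases), so only widely available methods are used here:

```python
import pandas as pd

# Illustrative data, not from the original example.
ser = pd.Series([3.0, 1.0, 4.0, 2.0], index=['a', 'b', 'c', 'd'])

lowest_label = ser.idxmin()    # index label of the minimum value
highest_label = ser.idxmax()   # index label of the maximum value
running_total = ser.cumsum()   # cumulative sum, same length as the input
step = ser.diff()              # difference from the previous element (first is NaN)
change = ser.pct_change()      # relative change from the previous element
middle = ser.quantile(0.5)     # the 0.5 quantile, i.e. the median

print(ser.describe())
```

Reductions such as idxmin and quantile return a single value, while cumsum, diff, and pct_change return a Series of the same length as the input.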
Processing missing data
In pandas, NA is mainly represented by np.nan; Python's built-in None is also treated as NA.
There are four methods for handling NA: dropna, fillna, isnull, and notnull.
isnull and notnull
This pair of methods is applied element-wise to the object and returns a Boolean array, which is typically used for Boolean indexing.
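A minimal sketch of the pair on illustrative values shows both np.nan and None being treated as NA:

```python
import numpy as np
import pandas as pd

# Illustrative data: one np.nan and one None.
ser = pd.Series([1.0, np.nan, 3.0, None])

mask = ser.isnull()          # True where the value is NA
valid = ser[ser.notnull()]   # Boolean indexing keeps only the non-NA entries

print(mask.tolist())
print(valid.tolist())
```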
dropna
For a Series, dropna returns a Series that contains only the non-null data and their index values.
The question is how to handle a DataFrame, because dropping anything means losing at least one whole row (or column). The solution is similar to before, using some additional parameters: dropna(axis=0, how='any', thresh=None)
The how parameter can be either any or all; all drops a row (column) only if every element in the slice is NA. Another interesting parameter is thresh, an integer: with thresh=3, for example, a row is kept only if it has at least 3 non-NA values.
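The effect of how and thresh can be compared on a small made-up frame:

```python
import numpy as np
import pandas as pd

# Illustrative data: row 0 is complete, row 2 is entirely NA.
df = pd.DataFrame([[1.0, 6.5, 3.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.0]])

any_na = df.dropna()                  # how='any' (default): drop rows with any NA
all_na = df.dropna(how='all')         # drop only rows where every value is NA
at_least_two = df.dropna(thresh=2)    # keep rows with at least 2 non-NA values
cols = df.dropna(axis=1, how='all')   # same rule applied to columns

print(list(any_na.index), list(all_na.index), list(at_least_two.index))
```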
fillna
fillna(value=None, method=None, axis=0)
Besides a scalar, the value parameter can be a dictionary, which makes it possible to fill different columns with different values. The method parameter is used in the same way as in .reindex() above, so it is not repeated here.
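As a sketch on made-up data, a dictionary passed as value fills each column with its own constant. For forward filling, the example uses the ffill() method, which is assumed here as the equivalent of method='ffill':

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value per column.
df = pd.DataFrame({'one': [1.0, np.nan], 'two': [np.nan, 4.0]})

filled = df.fillna({'one': 0, 'two': -1})  # a different fill value per column
forward = df.ffill()                       # propagate the previous value downwards

print(filled)
print(forward)
```

In the forward-filled result, the first row of column 'two' stays NaN because there is no earlier value to propagate.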
The inplace parameter
One point was left out earlier, and after writing up all the examples it turns out to be quite important: the methods of Series and DataFrame objects that modify the data and return a new object usually have an optional parameter inplace=False. If it is manually set to True, the original object is replaced instead.
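The difference can be sketched with dropna on made-up data. Any of the methods above that accept inplace should behave the same way:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value.
ser = pd.Series([1.0, np.nan, 3.0])

cleaned = ser.dropna()       # default inplace=False: a new object is returned
length_before = len(ser)     # the original still contains the NaN here

ser.dropna(inplace=True)     # inplace=True: the original object itself is modified
length_after = len(ser)

print(length_before, length_after)
```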
Python Data Analysis Package: Pandas basics