Python pandas common functions

This article focuses on pandas common functions.

1. Import Statements

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import re
2. File Reading

df = pd.read_csv(path + 'file.csv')
Parameters:
header=None: do not take column names from the file; the default names 0, 1, 2, 3... are used
names=['A', 'B', 'C', ...]: custom column names
index_col='A' or ['A', 'B', ...]: name of the column(s) to use as the index; pass a list for a MultiIndex
skiprows=[0, 1, 2]: row numbers to skip, counted from 0 at the top of the file; skipfooter skips rows counted from the end of the file
nrows=N: number of rows to read, i.e. only the first N rows
chunksize=M: returns an iterator of type TextFileReader that yields M rows per iteration; useful when the data would take up too much memory (see the sketch at the end of this section)
sep=':': field separator, default ','; choose a separator that matches the file. If sep=None, the parser tries to infer it automatically
skip_blank_lines=True (default): skip blank lines; if set to False, blank lines are read in as rows of NaN
converters={'col1': func}: convert the selected column with the function func; commonly used for columns of numeric codes or IDs to keep them from being converted to int

dfjs = pd.read_json('file.json'): a JSON string can also be passed in instead of a file
dfex = pd.read_excel('file.xls', sheet_name=[0, 1, ...]): read several sheets at once and return a dict of DataFrames (older pandas versions call this parameter sheetname)
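
As a quick sketch of the chunked-reading pattern described above, the example below sums one column of a large CSV without loading the whole file; the file name sales.csv and its columns id and amount are invented for illustration.

import pandas as pd

total = 0.0
reader = pd.read_csv('sales.csv',              # hypothetical file
                     sep=',',                  # explicit separator
                     converters={'id': str},   # keep the id column as strings, not ints
                     chunksize=100000)         # TextFileReader: 100000 rows per iteration
for chunk in reader:
    total += chunk['amount'].sum()             # aggregate chunk by chunk
print(total)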

3. Data Preprocessing

df.duplicated(): returns, for each row, whether it is a duplicate of an earlier row
df.drop_duplicates(): drop duplicate rows; to check only certain columns, pass ['col1', 'col2', ...]
df.fillna(0): fill NA values with 0
df.dropna(): axis=0 drops rows, axis=1 drops columns;
how='all' drops a row/column only if all of its values are NA, how='any' drops it as soon as any value is NA
del df['col1']: delete a column directly
df.drop(['col1', ...], axis=1): drop the specified columns (or rows with axis=0)
df.columns = col_lst: assign new column names
df.rename(index={'row1': 'A'}, columns={'col1': 'a1'}): rename index labels and column names
df.replace(dict): replace values in the df using a mapping from old value to new value, e.g. {1: 'A', 2: 'B'}
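
A minimal sketch chaining the cleaning steps above; the small DataFrame is invented for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, np.nan],
                   'col2': ['a', 'a', 'b', None]})   # toy data

df = df.drop_duplicates()                 # drop rows that repeat earlier rows
df = df.dropna(how='all')                 # drop rows where every value is NA
df['col1'] = df['col1'].fillna(0)         # fill remaining NA in one column
df = df.rename(columns={'col1': 'a1'})    # rename a column
df = df.replace({'a': 'A', 'b': 'B'})     # map old values to new ones
print(df)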

def get_digits(s):
    # extract the leading numeric part (integer or decimal) of a string
    m = re.match(r'(\d+(\.\d+)?)', s)
    if m is not None:
        return float(m.groups()[0])
    else:
        return 0

df.apply(get_digits): DataFrame.apply, keeps only the numeric part of each value; the function is applied along columns or rows (use df.applymap for a purely element-wise conversion)
df['col1'].map(func): Series.map, applies the conversion function to a single column
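
A short usage sketch of the helper above with Series.map; the price column is invented for illustration.

import re
import pandas as pd

def get_digits(s):
    # extract the leading numeric part of a string, else return 0
    m = re.match(r'(\d+(\.\d+)?)', s)
    return float(m.groups()[0]) if m is not None else 0

df = pd.DataFrame({'price': ['12.5 yuan', '8 yuan', 'n/a']})   # toy column
df['price_num'] = df['price'].map(get_digits)                  # 12.5, 8.0, 0
print(df)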

pd.merge(df1, df2, on='col1',
         how='inner', sort=True): merge two DataFrames on a shared column with an inner join (intersection); how='outer' gives an outer join (union), and sort=True sorts the result

pd.merge(df1, df2, left_on='col1',
         right_on='col2'): df1 and df2 have no column name in common, so the key column on each side must be given explicitly
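
A hedged sketch of both merge forms; the users and orders tables are invented for illustration.

import pandas as pd

users  = pd.DataFrame({'uid': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cai']})
orders = pd.DataFrame({'user_id': [1, 1, 3], 'amount': [10, 20, 5]})

# inner join on differently named key columns
inner = pd.merge(users, orders, left_on='uid', right_on='user_id', how='inner')
# outer join keeps users with no orders (NaN in 'amount'), sorted by the keys
outer = pd.merge(users, orders, left_on='uid', right_on='user_id', how='outer', sort=True)
print(inner)
print(outer)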


pd.concat([sr1, sr2, sr3, ...], axis=0): stacks multiple Series vertically; the result is still a Series
pd.concat([sr1, sr2, sr3, ...], axis=1): places multiple Series side by side as columns; the result is a DataFrame whose rows are aligned by index, with NaN filled where the indexes do not overlap

df1.combine_first(df2): fills the NaN values of df1 with the corresponding values from df2; rows that exist only in df2 are appended
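
A small sketch of concat and combine_first; the Series and DataFrames are invented for illustration.

import pandas as pd

sr1 = pd.Series([1, 2], index=['a', 'b'])
sr2 = pd.Series([3, 4], index=['b', 'c'])
stacked = pd.concat([sr1, sr2], axis=0)   # one longer Series
side    = pd.concat([sr1, sr2], axis=1)   # DataFrame aligned by index, NaN where missing

df1 = pd.DataFrame({'x': [1.0, None]}, index=['a', 'b'])
df2 = pd.DataFrame({'x': [9.0, 8.0, 7.0]}, index=['a', 'b', 'c'])
patched = df1.combine_first(df2)          # df1's NaN filled from df2, extra row 'c' appended
print(stacked, side, patched, sep='\n')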

df.stack(): pivots columns into rows, i.e. the column names become an inner index level and the original index becomes part of a MultiIndex. The result is a Series with a MultiIndex; in effect the dataset is lengthened.

df.unstack(): converts a Series with a MultiIndex back into a DataFrame, pulling the index level with fewer categories out as columns. In effect the dataset is flattened (widened); this is the inverse application of stack.
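
A tiny sketch of stack and unstack on an invented score table.

import pandas as pd

df = pd.DataFrame({'math': [90, 80], 'english': [85, 95]},
                  index=['Ann', 'Bob'])

long_sr = df.stack()         # Series with a (name, subject) MultiIndex: the "long" form
wide_df = long_sr.unstack()  # back to a DataFrame with subjects as columns: the "wide" form
print(long_sr)
print(wide_df)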

pd.get_dummies(df['col1'], prefix='key'): one-hot encodes a column that takes a limited set of (usually string) values, such as country names. Borrowing the idea of a bitmap, the k distinct values are expanded into k indicator columns of 0s and 1s.
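
A one-hot encoding sketch; the country column is invented for illustration.

import pandas as pd

df = pd.DataFrame({'country': ['CN', 'US', 'CN', 'JP']})
dummies = pd.get_dummies(df['country'], prefix='country')
# 3 distinct values -> 3 indicator columns (0/1 or True/False depending on the pandas version)
print(dummies)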

4. Data Filtering

df.columns: the column labels, returned as an Index object
df.index: the row index, returned as an Index object
df.shape: returns a tuple of (rows, columns)
df.head(n=N): returns the first N rows
df.tail(n=M): returns the last M rows
df.values: the underlying two-dimensional array, returned as a numpy.ndarray
df.index: the DataFrame index; individual index values cannot be assigned to directly
df.reindex(index=['row1', 'row2', ...],
           columns=['col1', 'col2', ...]): reorder the data according to the new index and column labels
df[m:n]: slice, selects rows m to n-1
df[df['col1'] > 1]: select the rows that satisfy the condition
df.query('col1 > 1'): select the rows that satisfy the condition
df.query('col1 in [v1, v2, ...]'): select the rows where col1 takes one of the listed values
df.ix[:, 'col1']: select a column (.ix has been removed from recent pandas; use .loc/.iloc instead)
df.ix['row1', 'col2']: select a single element
df.ix[:, :'col2']: slice that selects all columns up to and including col2
df.loc[m:n]: rows with labels m through n, end label included (recommended)
df.iloc[m:n]: rows m through n-1 by position
df.loc[m:n, 'col1':'coln']: rows m through n and columns col1 through coln
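
A selection sketch covering the slicing and filtering idioms above; the frame and its labels are invented for illustration, and .loc/.iloc are used in place of the removed .ix.

import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2, 3], 'col2': list('abcd')},
                  index=['r0', 'r1', 'r2', 'r3'])

print(df[1:3])                           # positional slice: rows 1 and 2
print(df[df['col1'] > 1])                # boolean filter
print(df.query('col1 > 1'))              # same filter via query
print(df.query('col1 in [1, 3]'))        # membership test inside query
print(df.loc['r1':'r3', 'col1':'col2'])  # label-based, end label included
print(df.iloc[1:3])                      # position-based, end position excluded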


sr = df['col']: extract a single column, returned as a Series
sr.values: the Series values, returned as a numpy.ndarray
sr.index: the Series index, returned as an Index object

5. Data Operations and Sorting

df.T: DataFrame transpose
df1 + df2: added by aligning on index and columns; the result covers the union of labels, with NaN where either side is missing
df1.add(df2, fill_value=0): the same addition, but missing positions are filled with the given value instead of NaN
df1.add / sub / mul / div: the arithmetic operation methods
df - sr: subtracts the Series from every row of the DataFrame (aligned on the columns)
df * N: multiplies every element by N
df.sub(sr, axis=0): subtracts the Series from every column of the DataFrame (aligned on the row index)
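
A short sketch of aligned arithmetic and fill_value; both frames are invented for illustration.

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list('abc'))
df2 = pd.DataFrame(np.ones((2, 2)), columns=list('ab'))

print(df1 + df2)                    # union of columns, NaN in column 'c'
print(df1.add(df2, fill_value=0))   # missing positions treated as 0 instead of NaN

sr = df1.iloc[0]                    # first row as a Series
print(df1 - sr)                     # subtract from every row (aligned on columns)
print(df1.sub(df1['a'], axis=0))    # subtract a column from every column (aligned on index)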


sr.order(): sorts a Series in ascending order (removed in recent pandas; use sr.sort_values() instead)
df.sort_index(axis=0, ascending=True): sort by the row index in ascending order
df.sort_index(by=['col1', 'col2', ...]): sort by the specified columns in priority order (in recent pandas this is df.sort_values(by=[...]))
df.rank(): computes the rank of each value
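
A sorting sketch using the current sort_values/sort_index API; the frame is invented for illustration.

import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 2], 'col2': [9, 5, 3]}, index=['b', 'c', 'a'])

print(df.sort_index())                      # sort rows by index label
print(df.sort_values(by=['col1', 'col2']))  # sort by col1, then col2
print(df['col2'].sort_values())             # current replacement for Series.order()
print(df.rank())                            # rank of each value within its column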

6. Mathematical Statistics

sr.unique(): the distinct values of a Series
sr.value_counts(): counts how often each value occurs, sorted from most to least frequent (DataFrame only gained this method in later pandas versions)
sr.describe(): returns basic statistics and quantiles

df.describe(): returns basic statistics and quantiles for each column
df.count(): counts the non-NA values
df.max(): maximum value
df.min(): minimum value
df.sum(axis=0): sums each column
df.mean(): mean of each column
df.median(): median
df.var(): variance
df.std(): standard deviation
df.mad(): mean absolute deviation from the mean
df.cumsum(): cumulative sum
sr1.corr(sr2): correlation coefficient between two Series
df.cov(): covariance matrix
df1.corrwith(df2): column-wise correlation coefficients between two DataFrames
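
A small statistics sketch on synthetic data (invented for illustration) where y is built from x, so the correlation is close to 1.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=100)})
df['y'] = 2 * df['x'] + rng.normal(scale=0.1, size=100)

print(df.describe())           # count, mean, std, min, quartiles, max per column
print(df['x'].corr(df['y']))   # close to 1 for this synthetic data
print(df.cov())                # covariance matrix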

pd.cut(array1, bins): bins one-dimensional data into the given intervals
pd.qcut(array1, 4): bins by quantiles into 4 equal-sized groups; 4 can be replaced with a custom list of quantiles
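
A binning sketch with cut and qcut; the ages array is invented for illustration.

import numpy as np
import pandas as pd

ages = np.array([3, 17, 25, 38, 41, 62, 79])
print(pd.cut(ages, bins=[0, 18, 40, 65, 100]))   # fixed interval edges
print(pd.qcut(ages, 4))                          # quartile-based bins
print(pd.qcut(ages, [0, 0.5, 0.9, 1.0]))         # custom quantile edges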

df['col1'].groupby(df['col2']): group column 1 by column 2, i.e. column 2 is used as the key
df.groupby('col1'): group the DataFrame by column 1
grouped.aggregate(func): aggregate each group with the given function
grouped.aggregate([f1, f2, ...]): aggregate with several functions at once, producing one column per function; the function names become the column names
grouped.aggregate([('f1_name', f1), ('f2_name', f2)]): rename the aggregated columns
grouped.aggregate({'col1': f1, 'col2': f2, ...}): apply different aggregation functions to different columns; each entry may also be a list of functions
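
A grouping sketch showing the aggregation forms above; the frame, keys, and column names are invented for illustration (string function names are used here, but plain functions work as well).

import pandas as pd

df = pd.DataFrame({'key':  ['a', 'a', 'b', 'b'],
                   'val1': [1, 2, 3, 4],
                   'val2': [10.0, 20.0, 30.0, 40.0]})

grouped = df.groupby('key')
print(grouped.aggregate('sum'))                            # one function for all columns
print(grouped['val1'].aggregate(['mean', 'sum']))          # several functions -> several columns
print(grouped.aggregate({'val1': 'mean', 'val2': 'sum'}))  # different functions per column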


df.pivot_table(values=['col1', 'col2'],
               index=['row1', 'row2'],
               aggfunc=[np.mean, np.sum],
               fill_value=0,
               margins=True): group and aggregate col1 and col2 by row1 and row2; several aggregation functions can be given, fill_value replaces the missing cells, and margins=True adds subtotal rows/columns (older pandas versions used rows= instead of index=)
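
A pivot_table sketch using the current index=/columns= parameter names; the sales table is invented for illustration.

import pandas as pd

df = pd.DataFrame({'city':   ['BJ', 'BJ', 'SH', 'SH'],
                   'year':   [2023, 2024, 2023, 2024],
                   'amount': [10, 20, 30, 40]})

table = df.pivot_table(values='amount',
                       index='city',
                       columns='year',
                       aggfunc=['mean', 'sum'],
                       fill_value=0,
                       margins=True)    # adds an 'All' subtotal row and column
print(table)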


pd.crosstab(df['col1'], df['col2']): cross tabulation, counting the frequency of each group combination
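
A crosstab sketch on an invented two-column frame.

import pandas as pd

df = pd.DataFrame({'gender': ['M', 'F', 'M', 'F', 'M'],
                   'smoker': ['Y', 'N', 'N', 'N', 'Y']})
print(pd.crosstab(df['gender'], df['smoker']))   # counts for each (gender, smoker) pair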

Summary

That covers the common pandas functions discussed in this article; I hope it helps. If you are interested, you can continue with other related topics on this site. If anything is missing, please leave a comment. Thank you for your support!
