Introduction to the pandas library: basic DataFrame operations

Source: Internet
Author: User
How do I remove empty strings from a list?
Easiest way: new_list = [x for x in li if x != '']

Today is May 1.

This section covers the basic operations of pandas, building on the two data structures introduced previously (Series and DataFrame).

The DataFrame a used below contains the following data:

       a  b  c
one    4  1  1
two    6  2  0
three  6  1  6

I. Viewing data (these methods also apply to Series)

1. View the first or last XX rows of a DataFrame
a = DataFrame(data)
a.head(6) displays the first 6 rows of data; if head() is called with no argument, the first 5 rows are displayed.
a.tail(6) displays the last 6 rows of data; if tail() is called with no argument, the last 5 rows are displayed.

2. View a DataFrame's index, columns and values
a.index, a.columns and a.values return the row labels, the column labels and the underlying data array, respectively.

3. The describe() function gives a quick statistical summary of the data
a.describe() computes statistics for each column of data, including the count, mean, std, minimum, quartiles and maximum.

4. Transpose the data
a.T
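The viewing methods above can be tried on the small example frame from the start of this section. This is only a sketch: the frame is rebuilt by hand here, and the variable names (first_two, summary and so on) are made up for illustration.

```python
import pandas as pd

# Rebuild the example frame shown at the start of this section.
a = pd.DataFrame(
    {"a": [4, 6, 6], "b": [1, 2, 1], "c": [1, 0, 6]},
    index=["one", "two", "three"],
)

first_two = a.head(2)    # first 2 rows
last_two = a.tail(2)     # last 2 rows
summary = a.describe()   # count, mean, std, min, quartiles, max per column
transposed = a.T         # row labels become column labels and vice versa

print(first_two)
print(last_two)
print(a.index.tolist(), a.columns.tolist())
print(a.values)
print(summary)
print(transposed)
```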

5. Sort along an axis
a.sort_index(axis=1, ascending=False)
Here axis=1 sorts the columns by their labels, and the values in each column move together with the label. ascending=False means descending order; when the parameter is omitted, the default is ascending.

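6. Sort by the values in the DataFrame
a.sort_values(by='x')
This sorts the rows by the values in column x, from smallest to largest. Note that only column x determines the order, whereas sorting along an axis above works on the labels of all columns. (Older pandas versions wrote this as a.sort(columns='x'); that method has since been removed.)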
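A minimal sketch of both kinds of sorting, using the example frame from the start of this section. It uses sort_values, which is what current pandas provides for sorting by values; the variable names and the choice of column c are only for illustration.

```python
import pandas as pd

a = pd.DataFrame(
    {"a": [4, 6, 6], "b": [1, 2, 1], "c": [1, 0, 6]},
    index=["one", "two", "three"],
)

# Sort the column labels in descending order: c, b, a.
columns_desc = a.sort_index(axis=1, ascending=False)

# Sort the rows by the values in column c, smallest first.
by_c = a.sort_values(by="c")

print(columns_desc)
print(by_c)
```

Both calls return new objects, so a itself keeps its original order.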

II. Selecting data

1. Select data for specific columns and rows
a['x'] returns the column named x. Note that a single label returns one column at a time. a.x is the same as a['x'].

To fetch rows of data, slice with []
For example, a[0:3] returns the first three rows of data.

2. loc selects data by label
a.loc['one'] selects the row labeled 'one' by default;

a.loc[:, ['a', 'b']] selects all rows and the columns a and b;

a.loc[['one', 'two'], ['a', 'b']] selects the rows 'one' and 'two' and the columns a and b;

a.loc['one', 'a'] selects the same cell as a.loc[['one'], ['a']], but the former returns only the value, while the latter also displays the corresponding row and column labels.
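The label-based selections above can be sketched on the example frame as follows. The variable names are hypothetical; the point is mainly to show what type each expression gives back.

```python
import pandas as pd

a = pd.DataFrame(
    {"a": [4, 6, 6], "b": [1, 2, 1], "c": [1, 0, 6]},
    index=["one", "two", "three"],
)

col_a = a["a"]                               # one column, returned as a Series
first_rows = a[0:2]                          # slicing with [] selects rows
row_one = a.loc["one"]                       # one row, returned as a Series
sub = a.loc[["one", "two"], ["a", "b"]]      # 2 x 2 DataFrame
value = a.loc["one", "a"]                    # just the value in that cell
frame = a.loc[["one"], ["a"]]                # 1 x 1 DataFrame, labels kept

print(col_a)
print(first_rows)
print(row_one)
print(sub)
print(value)
print(frame)
```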

3. iloc selects data directly by position
This is similar to selecting by label, except that integer positions counted from 0 are used.
a.iloc[1:2, 1:2] displays the data at row position 1 and column position 1, that is, the second row and second column (the end of a slice is not included);

a.iloc[1:2] has no column part, so by default it selects all columns of the row at position 1;

a.iloc[[0, 2], [1, 2]] freely picks the rows at positions 0 and 2 and the columns at positions 1 and 2.
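Here is a short sketch of the three iloc forms on the example frame, assuming the usual zero-based positions. The variable names are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame(
    {"a": [4, 6, 6], "b": [1, 2, 1], "c": [1, 0, 6]},
    index=["one", "two", "three"],
)

cell = a.iloc[1:2, 1:2]          # row position 1, column position 1 (slice end excluded)
row = a.iloc[1:2]                # row position 1, all columns
picked = a.iloc[[0, 2], [1, 2]]  # rows at positions 0 and 2, columns at positions 1 and 2

print(cell)
print(row)
print(picked)
```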

4. Select using conditions
Use a single column to select data
a[a.c > 0] selects the rows whose value in column c is greater than 0

Use where-style selection
a[a > 0] selects all values in a that are greater than 0; the other cells are shown as NaN

Use isin() to select rows that contain specific values in a specific column
a1 = a.copy()
a1[a1['one'].isin(['2', '3'])] displays the rows that satisfy the condition: the value in column one is '2' or '3'.
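The three kinds of conditional selection can be sketched on the example frame like this. Since that frame has no column called one, the isin() part filters column a for the value 6 instead; that choice, like the variable names, is only an assumption for illustration.

```python
import pandas as pd

a = pd.DataFrame(
    {"a": [4, 6, 6], "b": [1, 2, 1], "c": [1, 0, 6]},
    index=["one", "two", "three"],
)

positive_c = a[a.c > 0]        # rows where column c is greater than 0
masked = a[a > 0]              # same shape as a; cells that fail the test become NaN
sixes = a[a["a"].isin([6])]    # rows whose value in column a is in the given list

print(positive_c)
print(masked)
print(sixes)
```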

III. Setting values (assignment)

Assignment is done directly on top of the selection operations above.
For example, a.loc[:, ['a', 'c']] = 9 sets the value in all rows of columns a and c to 9
a.iloc[:, [0, 2]] = 9 also sets the values in all rows of columns a and c to 9, because positions are counted from 0

You can also use conditions to assign values directly.
a[a > 0] = -a converts all numbers greater than 0 in a to negative values
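A sketch of the three assignments, each applied to its own copy of the example frame so that they do not interfere with one another. The copies and their names are only for illustration.

```python
import pandas as pd

a = pd.DataFrame(
    {"a": [4, 6, 6], "b": [1, 2, 1], "c": [1, 0, 6]},
    index=["one", "two", "three"],
)

by_label = a.copy()
by_label.loc[:, ["a", "c"]] = 9      # columns a and c, selected by label

by_position = a.copy()
by_position.iloc[:, [0, 2]] = 9      # the same columns, selected by position

negated = a.copy()
negated[negated > 0] = -negated      # flip the sign of every positive value

print(by_label)
print(by_position)
print(negated)
```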

IV. Handling missing values

pandas uses np.nan to represent missing values, which are excluded from calculations by default.

1. The reindex() method
Used to change, add or delete the index on a specified axis; it returns a copy of the original data.
a.reindex(index=list(a.index) + ['five'], columns=list(a.columns) + ['d'])

a.reindex(index=['one', 'five'], columns=list(a.columns) + ['d'])

That is, index=[] operates on the row index, and columns=[] operates on the columns. Labels that did not exist before are filled with NaN.

2. Fill in missing values
a.fillna(value=x)
Fills every missing value with the value x

3. Remove rows that contain missing values
a.dropna(how='any')
Removes every row that contains at least one missing value
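The following sketch puts reindex(), fillna() and dropna() together on the example frame. The fill value 0 and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

a = pd.DataFrame(
    {"a": [4, 6, 6], "b": [1, 2, 1], "c": [1, 0, 6]},
    index=["one", "two", "three"],
)

# Add a new row label and a new column label; the new cells are NaN.
expanded = a.reindex(index=list(a.index) + ["five"], columns=list(a.columns) + ["d"])

# Replace every NaN with 0.
filled = expanded.fillna(value=0)

# Add only a new row, then drop every row that contains a NaN.
rows_only = a.reindex(index=list(a.index) + ["five"])
dropped = rows_only.dropna(how="any")

# NaN is skipped by default, so this is the mean of 4, 6 and 6.
mean_a = rows_only["a"].mean()

print(expanded)
print(filled)
print(dropped)
print(mean_a)
```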

V. Merging

1. concat
pd.concat(a1, axis=0/1, keys=['xx', 'xx', 'xx', ...]), where a1 is the list of objects to be concatenated. axis=1 joins the data side by side; axis=0, or no axis argument, stacks the data vertically. The keys correspond one to one to the objects in a1 and are used to tell apart the data from each original object after concatenation.

Example: a1 = [b['a'], b['c']]
result = pd.concat(a1, axis=1, keys=['1', '2'])
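To see what the keys do in each direction, here is a sketch that concatenates two columns of the example frame. It uses that frame in place of b, and the names parts, side_by_side and stacked are hypothetical.

```python
import pandas as pd

a = pd.DataFrame(
    {"a": [4, 6, 6], "b": [1, 2, 1], "c": [1, 0, 6]},
    index=["one", "two", "three"],
)

parts = [a["a"], a["c"]]

# axis=1: the Series sit side by side and the keys become the column names.
side_by_side = pd.concat(parts, axis=1, keys=["1", "2"])

# axis=0: the Series are stacked and the keys become an outer index level.
stacked = pd.concat(parts, axis=0, keys=["1", "2"])

print(side_by_side)
print(stacked)
```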

2. append adds one or more rows of data to a DataFrame
a.append(a[2:], ignore_index=True)
This adds all rows of a from the third row onward to the end of a. If you do not specify the ignore_index parameter, the index of the added data is preserved; with ignore_index=True all rows are re-indexed automatically. DataFrame.append was removed in pandas 2.0; there, pd.concat([a, a[2:]], ignore_index=True) gives the same result.
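Because DataFrame.append is no longer available in pandas 2.0 and later, this sketch shows the same row-appending step written with pd.concat, which also works on older versions. The example frame and the variable names are only for illustration.

```python
import pandas as pd

a = pd.DataFrame(
    {"a": [4, 6, 6], "b": [1, 2, 1], "c": [1, 0, 6]},
    index=["one", "two", "three"],
)

# Append the rows from the third row onward, keeping their original labels.
kept = pd.concat([a, a[2:]])

# The same, but with the rows renumbered 0, 1, 2, ...
renumbered = pd.concat([a, a[2:]], ignore_index=True)

print(kept)
print(renumbered)
```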

3. merge is similar to join in SQL
Let a1 and a2 be two DataFrames that share the same key column. There are several ways to join the two objects:
(1) Inner join: pd.merge(a1, a2, on='key')
(2) Left join: pd.merge(a1, a2, on='key', how='left')
(3) Right join: pd.merge(a1, a2, on='key', how='right')
(4) Outer join: pd.merge(a1, a2, on='key', how='outer')
For the specific differences between the four, refer to the corresponding syntax in SQL.
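The difference between the four joins is easiest to see on two tiny frames whose keys only partly overlap. The frames below, including the column names left_val and right_val, are invented for this sketch.

```python
import pandas as pd

# Hypothetical frames: keys y and z appear in both, x only in a1, w only in a2.
a1 = pd.DataFrame({"key": ["x", "y", "z"], "left_val": [1, 2, 3]})
a2 = pd.DataFrame({"key": ["y", "z", "w"], "right_val": [20, 30, 40]})

inner = pd.merge(a1, a2, on="key")                # only keys found in both
left = pd.merge(a1, a2, on="key", how="left")     # every key from a1
right = pd.merge(a1, a2, on="key", how="right")   # every key from a2
outer = pd.merge(a1, a2, on="key", how="outer")   # every key from either side

print(inner)
print(left)
print(right)
print(outer)
```

Where one side has no matching key, its columns are filled with NaN in the result.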

VI. Grouping (groupby)

Use the pd.date_range function to generate dates for a specified number of consecutive days
pd.date_range('20000101', periods=10)

def shuju():
    data = {
        'date': pd.date_range('20000101', periods=10),
        'gender': np.random.randint(0, 2, size=10),
        'height': np.random.randint(40, 50, size=10),
        'weight': np.random.randint(150, 180, size=10)
    }
    a = DataFrame(data)
    print(a)

The printed result (the height values were lost from this listing):

        date  gender  height  weight
0 2000-01-01       0             165
1 2000-01-02       0             179
2 2000-01-03       1             172
3 2000-01-04       0             173
4 2000-01-05       1             151
5 2000-01-06       0             172
6 2000-01-07       0             167
7 2000-01-08       0             157
8 2000-01-09       1             157
9 2000-01-10       1             164

Using a.groupby('gender').sum(numeric_only=True), the result is:
# Note: in Python, sum() must be added after groupby('xx'); otherwise only a GroupBy object is returned and no data is displayed.

gender  height  weight
0                  989
1                  643

(numeric_only=True is needed in pandas 2.0 and later, where non-numeric columns such as date are no longer dropped automatically.)

You can also use a.groupby('gender').size() to count the number of rows for each gender.

So you can see that the groupby function works as follows:
the rows are grouped by gender, the numeric columns are summed automatically within each group, and non-numeric columns are left out. You can of course also group by several fields at once with groupby(['x1', 'x2', ...]), which works the same way.
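Since the data above is random, here is a sketch of the same grouping on a small fixed frame so that the sums can be checked by hand. The five rows of values are made up; only the column names follow the example above.

```python
import pandas as pd

# Made-up values, fixed so that the result is reproducible.
people = pd.DataFrame({
    "date": pd.date_range("20000101", periods=5),
    "gender": [0, 0, 1, 0, 1],
    "height": [41, 45, 48, 44, 49],
    "weight": [165, 179, 172, 173, 151],
})

# Sum the numeric columns within each gender; the date column is left out.
sums = people.groupby("gender").sum(numeric_only=True)

# Count how many rows fall into each gender.
sizes = people.groupby("gender").size()

print(sums)
print(sizes)
```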

VII. Recoding a column as a categorical

For example, to recode the gender column of a from section VI, turning 0 and 1 into male and female, proceed as follows:

a['gender1'] = a['gender'].astype('category')
a['gender1'] = a['gender1'].cat.rename_categories(['male', 'female'])  # convert 0, 1 to the category type first, then recode
print(a)

(Older code assigned the new names directly to a['gender1'].cat.categories; that no longer works in pandas 2.0 and later.)

The result (values that were lost from this listing are left blank):

        date  gender  height  weight gender1
0 2000-01-01       1             163  female
1 2000-01-02       0             177    male
2 2000-01-03       1             167  female
3 2000-01-04       0             161    male
4 2000-01-05       0             177    male
5 2000-01-06       1             179  female
6 2000-01-07       1             154  female
7 2000-01-08       1                  female
8 2000-01-09       0             158    male
9 2000-01-10       1             168  female

So you can see that the recoded values are automatically appended to the DataFrame as its last column.
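A minimal sketch of the recoding step on a small fixed frame. It uses rename_categories, which is available in current pandas; the frame and its values are made up for illustration.

```python
import pandas as pd

# Made-up values; gender uses the codes 0 and 1 as in the example above.
people = pd.DataFrame({
    "gender": [1, 0, 1, 0, 0],
    "weight": [163, 177, 167, 161, 177],
})

# Convert to the category type, then rename the categories 0 -> male, 1 -> female.
people["gender1"] = (
    people["gender"].astype("category").cat.rename_categories(["male", "female"])
)

print(people)
print(people["gender1"].cat.categories.tolist())
```

The categories of an integer column are sorted (0, then 1), which is why the new names are given in the order male, female.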

VIII. Related operations

Descriptive statistics:
1. a.mean() averages the data of each column by default; with an argument, a.mean(1) averages each row;

2. Count the occurrences of each value in a column x: a['x'].value_counts();

3. Applying functions to data
a.apply(lambda x: x.max() - x.min())
Returns the difference between the maximum and minimum value of every column.

4. String-related operations
a['gender1'].str.lower() converts all uppercase letters in gender1 to lowercase. Note that a DataFrame does not have a str attribute, only a Series does, so select the gender1 field of a first.
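These operations can be sketched on the example frame from section I. The small Series of strings used for str.lower() is made up, since that frame holds only numbers.

```python
import pandas as pd

a = pd.DataFrame(
    {"a": [4, 6, 6], "b": [1, 2, 1], "c": [1, 0, 6]},
    index=["one", "two", "three"],
)

column_means = a.mean()                            # one mean per column
row_means = a.mean(axis=1)                         # one mean per row
counts = a["a"].value_counts()                     # how often each value occurs in column a
ranges = a.apply(lambda x: x.max() - x.min())      # max minus min for every column

# str methods exist on a Series, not on a DataFrame.
labels = pd.Series(["Male", "FEMALE"])
lowered = labels.str.lower()

print(column_means)
print(row_means)
print(counts)
print(ranges)
print(lowered)
```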

IX. Time series

The pd.date_range('xxxx', periods=xx, freq='D/M/Y ...') function from section VI generates a list of dates for a specified number of consecutive periods.
For example, in pd.date_range('20000101', periods=10), periods is the number of dates to generate;
pd.date_range('20000201', '20000210', freq='D') shows that you can also leave out periods and specify the start and end dates instead.

Also, if you do not specify freq, the dates start from the start date and the default frequency is one day. Other frequencies are indicated as follows:


[Image: 1.png]
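The date_range calls described above can be sketched as follows. The third call, with freq='2D' for every second day, is an extra illustration that is not in the original text.

```python
import pandas as pd

# Ten consecutive days starting on 2000-01-01 (daily frequency by default).
ten_days = pd.date_range("20000101", periods=10)

# The same idea, given as a start date and an end date instead of periods.
span = pd.date_range("20000201", "20000210", freq="D")

# A different frequency: every second day.
every_other = pd.date_range("20000101", periods=3, freq="2D")

print(ten_days)
print(span)
print(every_other)
```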

X. Plotting (plot)

First, in PyCharm:

import matplotlib.pyplot as plt
a = Series(np.random.randn(1000), index=pd.date_range('20100101', periods=1000))
b = a.cumsum()
b.plot()
plt.show()  # plt.show() must be added at the end, or the graph will not appear.


[Image: 2.png]


You can also use the following code to generate multiple time series plots:

a = DataFrame(np.random.randn(1000, 4), index=pd.date_range('20100101', periods=1000), columns=list('ABCD'))
b = a.cumsum()
b.plot()
plt.show()


[Image: 3.png]

XI. Importing and exporting files

Writing and reading Excel files
Data can be written in two formats, XLS and CSV, but it is recommended to use CSV sparingly: otherwise, whenever you adjust the data format in the table, Excel keeps asking whether you want to save in the new format. In addition, if you specify which sheet to read, the data displayed in PyCharm may not be aligned.

When you write data to a table, the DataFrame index is written as an extra field at the front of the table, numbering the data rows.

a.to_excel(r'C:\\users\\guohuaiqi\\desktop\\2.xls', sheet_name='Sheet1')
a = pd.read_excel(r'C:\\users\\guohuaiqi\\desktop\\2.xls', sheet_name='Sheet1', na_values=['na'])

Note that the sheet name Sheet1 passed to sheet_name starts with a capital letter. When reading data, you can specify which sheet the data is read from, and na_values lists the strings to treat as missing values. pandas 2.0 and later can no longer write the old .xls format, so use an .xlsx file name there. Finally, here is the code for writing and reading in CSV format (to_csv takes no sheet_name argument, since a CSV file has no sheets):

a.to_csv(r'c:\\users\\guohuaiqi\\desktop\\1.csv')
a = pd.read_csv(r'c:\\users\\guohuaiqi\\desktop\\1.csv', na_values=['na'])
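To try the CSV round trip without touching the disk, the sketch below writes to an in-memory text buffer instead of a file path. The buffer, the example frame and the tiny CSV string with an 'na' entry are all assumptions made for illustration.

```python
import io

import pandas as pd

a = pd.DataFrame(
    {"a": [4, 6, 6], "b": [1, 2, 1], "c": [1, 0, 6]},
    index=["one", "two", "three"],
)

# Write the frame as CSV text; the index becomes the first, unnamed column.
buffer = io.StringIO()
a.to_csv(buffer)
csv_text = buffer.getvalue()

# Read it back, telling pandas that the first column holds the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["na"])

# Without index_col, the old index shows up as an ordinary column.
flat = pd.read_csv(io.StringIO(csv_text))

# na_values marks the string 'na' as a missing value while reading.
with_missing = pd.read_csv(io.StringIO("x,y\n1,na\n2,5\n"), na_values=["na"])

print(csv_text)
print(restored)
print(flat.columns.tolist())
print(with_missing)
```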
