Pandas common knowledge required for data analysis and mining in Python

Source: Internet
Author: User


Objective
Pandas is built around two data structures: Series and DataFrame.
A Series is a one-dimensional data type in which each element carries a label. It is similar to a labelled NumPy array, and the labels can be numbers or strings.
A DataFrame is a two-dimensional table structure. A pandas DataFrame can hold many different data types, and each axis has its own labels. You can think of it as a dictionary of Series.
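As a quick illustration of the two structures, here is a minimal sketch with made-up values:

import pandas as pd

# A Series: one-dimensional, every element carries a label (the index)
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# A DataFrame: a two-dimensional table; roughly a dictionary of Series
df = pd.DataFrame({'name': ['Tom', 'Jack'], 'age': [22, 25]})

print(s)
print(df)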

Pandas common knowledge

1. Reading a CSV file into a DataFrame
2. Overview of DataFrame data
3. Taking column data
4. Taking row data
5. Taking a single cell of data
6. Handling missing values
7. Normalization
8. Sorting
9. Re-numbering the index
10. Taking the mean
11. Vectorized operations (bulk operations)
12. Pivot tables

1. Reading a CSV file into a DataFrame

One nice thing about pandas is that it can work with table files directly, and the output comes back in DataFrame format. Use pandas.read_csv() to read a CSV file and get the data as a DataFrame. Here data.csv is a data set downloaded from Baidu Map.

import pandas as pd

filepath = r'C:/Users/lenovo/Desktop/20180108-Baidu Map/data.csv'
df = pd.read_csv(filepath)
# For convenience only three rows are shown here; the actual output is longer
print(df)

Check the data format

# Check whether the data format is a DataFrame
print(type(df))
# <class 'pandas.core.frame.DataFrame'>
2. Overview of DataFrame data

We want to know the following about the data:

    • Show the first and last few records of the DataFrame

    • Show the column names of the DataFrame

    • View the dimensions of the DataFrame, i.e. how many rows and columns it has (see the sketch after section 2.2)

2.1 Showing the first and last rows of the DataFrame
# Show the first two records (show as many as needed)
print(df.head(2))
# Show the last three records
print(df.tail(3))
2.2 Showing the DataFrame column names
col_names = df.columns
col_list = col_names.tolist()
print(col_list)
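For the third item above, the dimensions, the shape attribute gives the number of rows and columns; a minimal sketch (the exact numbers depend on the file):

# (number of rows, number of columns)
print(df.shape)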
3. Taking column data from a DataFrame

Use df[column_name] to get the data back in Series format. Series data is similar to a list and can be treated approximately as one; the only difference is that the returned data carries an extra index column, like the numbers on the left of the output below.

3.1 Taking a single column of data
# Take one column, for example the name column (only the first 5 values are shown)
print(df['name'][:5])
3.2 Taking multiple columns of data
# The returned data is still a DataFrame; for convenience only the first few records are shown
cols = ['name', 'province_name', 'city_name', 'city_code', 'area', 'addr']
print(df[cols])
4. Taking row data from a DataFrame (records)

df.ix[row, col]: the first parameter selects the rows of data you want, and the second parameter, col, selects the columns you want.

4.1 Taking a single row of data
# All columns of the first row
print(df.ix[0, :])
# Some columns of the first row
col = ['Survived', 'Pclass', 'Sex']
print(df.ix[0, col])
4.2 Taking multiple rows of data
# Take multiple rows, all columns. Here the first 5 rows and all columns are selected.
# Note how much this looks like slicing; Python basics matter.
print(df.ix[:5, :])
# Take multiple rows, some columns
print(df.ix[:5, col])
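Note that .ix has since been deprecated and removed in newer pandas releases; .iloc (selection by position) and .loc (selection by label) replace it. A rough equivalent of the calls above, assuming the same column names:

# All columns of the first row, by position
print(df.iloc[0, :])

# The first 5 rows of selected columns, by label (slicing with .loc is inclusive)
col = ['Survived', 'Pclass', 'Sex']
print(df.loc[:4, col])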
5. Taking a single cell of data

# The first row, first column
df.ix[0, 0]
# The third row, seventh column
df.ix[2, 6]

6. Handling missing values

Missing values are generally marked as NaN and are handled as follows:

df.dropna(axis)              axis=0 (the default) drops rows that contain missing values;
                             axis=1 drops columns that contain missing values
df.dropna(axis=0, subset)    subset is a list of one or more column names; only missing
                             values in those columns are considered when dropping rows
6.1 Handling missing values by row
# Drop rows that contain missing values (for display convenience, only the first 5 rows are shown)
df.dropna(axis=0)
# Drop rows with missing values in the specified columns
df.dropna(axis=0, subset=['Sex', 'Age'])
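Before dropping anything it is often worth checking how many values are actually missing. A minimal sketch, assuming the Titanic-style columns used above:

# Number of missing values in each column
print(df.isnull().sum())
# Number of missing values in a single column
print(df['Age'].isnull().sum())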
7. Normalization

In a data set, different columns may be on very different scales. If the data is analysed directly, the model will treat large numbers as having a large influence and small numbers as having a small influence, and a variable of small magnitude may end up being squeezed out of the model. The data therefore needs to be scaled to the same magnitude, which is what normalization does. The remaining columns need the same treatment, but for convenience only one column is normalized here.

Processing steps: 1. max_value = df[col].max()  2. Divide every value in the column by max_value.

It is worth noting that we use a pandas feature here, vectorized operations: the same operation can be applied to a whole column in bulk, as if it were a list.

# Here we choose the Fare column for normalization; first look at the Fare data
# (for display convenience, only the first 10 values are shown)
df['Fare'].head(10)
# Take the maximum of the Fare column
max_value = df['Fare'].max()
# Normalize and store the result in a new column New_fare
df['New_fare'] = df['Fare'] / max_value
df['New_fare']
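As noted above, the remaining columns need the same treatment; a sketch that normalizes every numeric column at once, assuming the same DataFrame:

# Divide every numeric column by its own maximum value
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols] / df[num_cols].max()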
8. Sorting
df.sort_values(col, inplace, ascending)
    col        the column to sort by
    inplace    boolean; whether to operate in place.
               True: the result overwrites and modifies the original data
               False: a new DataFrame is returned and the original data is left unchanged
    ascending  boolean; True for ascending order, False for descending

# Sort by the Age column in descending order without modifying the original data
df.sort_values('Age', inplace=False, ascending=False)
9. Re-numbering the index
Re-number the index after sorting:
df.reset_index(drop)
    drop    boolean; True discards the original index,
            False keeps the original index as an extra column

df.reset_index(drop=False)
10. Taking the mean

10.1 Mean of all columns
df.mean()
10.2 Average of a single column
df['Age'].mean()
11. Vectorized operations (bulk operations)

In general, batch operations on list-like data require writing loops, which is time-consuming and laborious. pandas is built on NumPy and supports vectorized operations: a single line can replace a complex loop, and it runs very efficiently.
# Add 10 to every value in the Age column
df['Age'] + 10
# Subtract 10 from every value in the Age column
df['Age'] - 10
12. Pivot tables
df.pivot_table(index=col1, values=col2, aggfunc=<NumPy function>)

The table is grouped by the index column col1, and values=col2 is aggregated within each group using aggfunc, which is a NumPy function; of course, aggfunc can also be a custom function.

# Analyse the effect of average age on the survival rate. 0 means died, 1 means survived.
# Here we find that age has an effect on the survival rate.
import numpy as np
df.pivot_table(index='Survived', values='Age', aggfunc=np.mean)

# Analyse the effect of passenger class on the survival rate. 0 means died, 1 means survived.
# Pclass takes the values 1, 2, 3; first class is the highest.
# We find that class has an effect on the survival rate.
df.pivot_table(index='Survived', values='Pclass', aggfunc=np.mean)
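As mentioned above, aggfunc can also be a custom function; a minimal sketch with a small helper that computes the spread of ages within each group:

# A custom aggregation: the range (max - min) of Age within each Survived group
def age_range(series):
    return series.max() - series.min()

df.pivot_table(index='Survived', values='Age', aggfunc=age_range)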
Extracting tabular data from HTML with pandas

pandas looks for anything on the web page that has the shape of an HTML table and converts it into a DataFrame object as the return value.

Code

How to use pd.read_html

import pandas as pd
# pd.read_html(url, header): header selects which row of the HTML table is used as the column names
data = pd.read_html(url, header=1)

Worked example

as "http://hz.house.ifeng.com/detail/2014_10_28/50087618_1.shtml"   = pd.read_html (url,header=1) print (data)

Note that the data you get back here is a list.

(Printed output: a list whose first element is a DataFrame listing each property's name, district, number of contracted units, contracted area in ㎡, and average contracted price in yuan/㎡, one row per property.)
DataFrame object: df.to_json()

Once you know the data is stored in a DataFrame, everything becomes simple. For example, if I want the data output in JSON format, it only takes one line of code.

data = pd.read_html(url, header=1)
# read_html returns a list of DataFrames; take the first table
df = pd.DataFrame(data[0])
df.to_json(orient='records')
df.to_csv()

With a DataFrame object you can also save the data as a CSV file.

data = pd.read_html(url, header=1)
df = pd.DataFrame(data[0])
# encoding='gbk' so that the Chinese characters display correctly in Office Excel
df.to_csv('data.csv', encoding='gbk')

Reposted from the author's Jianshu column: https://www.jianshu.com/u/c1ab741ef52e
