Using Python to work with Excel data

Source: Internet
Author: User
Contents
Reading data
Displaying data: display the number of rows and columns; view data formats (dtypes); display column names; add default column names; display the last 5 rows; display unique values; skip line i of the file; identify missing values
Data cleaning: handle null values; change data format; change column names; remove duplicate values; replace a value in a list
Data preprocessing: sort data; group data; split data
Data extraction: extract by label; extract by position; extract by label and position; extract by condition; extract a specified value from merged data
Data filtering: filter by condition
Data summary: subtotals
Data statistics: data sampling; descriptive statistics; covariance; correlation analysis

Reading data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from pandas import Series, DataFrame

Read the x.xlsx file

Method 1: import the xlrd module

import xlrd
# Read the data (xlrd 2.0 and later read only .xls files, so .xlsx needs an older xlrd)
workbook = xlrd.open_workbook('x.xlsx')

Method 2: use the pandas module

import pandas as pd
# Read the data
df = pd.read_excel('x.xlsx')

This article uses method 2.

Displaying data

Display the number of rows and columns

df.shape

(24247, 17)
The data has 24247 rows and 17 columns.

View data formats (dtypes)

df.dtypes

Display column names

df.columns

Add default column names
If the data has no header row, pandas can add default column names:

df = pd.read_excel('x.xlsx', header=None)
df.head(5)

This data already has column names, so there is no need to add them.

Display the last 5 rows of data

df.tail(5)
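The inspection calls in this section can be tried without the workbook. The following is a minimal sketch on a small hand-made frame; the table contents are hypothetical and only stand in for x.xlsx.

```python
import pandas as pd

# Hypothetical stand-in for the workbook; the values are illustrative only.
df = pd.DataFrame({
    "Broker level": ["A1", "A3", "M4", "A3", "A1", "M4"],
    "Broker response time": [12.0, 30.5, 8.0, 45.0, 22.0, 15.5],
    "Store": ["S1", "S2", "S1", "S3", "S2", "S1"],
})

shape = df.shape            # (number of rows, number of columns)
dtypes = df.dtypes          # one dtype per column
columns = list(df.columns)  # column names in order
last_two = df.tail(2)       # the last two rows

print(shape)
print(dtypes)
print(columns)
print(last_two)
```

Each of these calls only reads the frame; none of them changes df.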

Display unique values (unique function)

df['Broker level'].unique()

Level 0 appears because the missing values in the data have already been filled in.

Skip rows when reading
To leave a row unread, pass skiprows=[i] to skip line i of the file.

# Do not read the row for broker number 20124403
df = pd.read_excel('x.xlsx', skiprows=[2])
df.head()
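The line numbers given to skiprows are 0-based and count the header as line 0. A minimal sketch of that behaviour, assuming a hypothetical three-row table and using pd.read_csv on in-memory text (its skiprows takes the same list form) so that no workbook is needed:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the workbook; line 0 is the header.
text = "broker,level\n20124401,A1\n20124403,A3\n20124405,M4\n"

# skiprows=[2] drops the line with 0-based index 2, i.e. the second data row.
df = pd.read_csv(io.StringIO(text), skiprows=[2])

print(df)
```

The header line is still read, so the column names are unchanged.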

Identify missing values

# Every missing value is displayed as True
# df.isnull()
pd.isnull(df)

Data cleaning

Handle null values
Delete null values (dropna function)

df.dropna(how='any')

Fill null values (fillna function)

# Fill null values with 0
df.fillna(value=0)

Fill null values with the mean

# Use the mean broker response time to fill the missing values
df['Broker response time'].fillna(df['Broker response time'].mean())
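Note that fillna returns a new object and leaves the original untouched, so the result has to be assigned back to be kept. A minimal sketch with a hypothetical four-row column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values.
df = pd.DataFrame({"Broker response time": [10.0, np.nan, 30.0, np.nan]})

col = "Broker response time"
mean_value = df[col].mean()            # NaN is ignored: (10 + 30) / 2
df[col] = df[col].fillna(mean_value)   # assign back to keep the filled column

print(df[col].tolist())
```

The same applies to dropna: without an assignment the cleaned frame is only displayed, not stored.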

Change the data format

The dtypes function shows the large area (da qu) column as int64; change it to float:

df['da qu'].astype('float64')


Change a column name

df.rename(columns={'IM channel': 'Channel'})
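Both astype and rename return new objects rather than modifying the frame. A minimal sketch of storing the results, using a hypothetical two-column table:

```python
import pandas as pd

# Hypothetical data; only the column names follow the article.
df = pd.DataFrame({"da qu": [1, 2, 3], "IM channel": ["web", "app", "web"]})

df["da qu"] = df["da qu"].astype("float64")        # integer column to float
df = df.rename(columns={"IM channel": "Channel"})  # rename one column

print(df.dtypes)
```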

Delete duplicate values (drop_duplicates() function)
Keep the first occurrence and delete the rest:

# By default the first occurrence is kept and later occurrences are deleted
df['Store'].drop_duplicates()

Keep the last occurrence and delete the rest:

df['Store'].drop_duplicates(keep='last')
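The difference between the two keep settings is easiest to see from the index labels that survive. A minimal sketch on a hypothetical five-value column:

```python
import pandas as pd

# Hypothetical store column with repeated values.
stores = pd.Series(["S1", "S2", "S1", "S3", "S2"], name="Store")

first_kept = stores.drop_duplicates()             # keep='first' is the default
last_kept = stores.drop_duplicates(keep="last")   # keep the final occurrence

print(first_kept.index.tolist())
print(last_kept.index.tolist())
```

In both cases the surviving rows keep their original index labels and order.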

Replace a value in a column (replace function)
df['Customer ucid'].replace('10531975', '110')

Data preprocessing

Sort the data

df.sort_values(by=['Messages sent by customer on the day'])


Group the data (the counterpart of an Excel PivotTable)
Group the customer chat records

# If the customer sent more than 5 messages on the day, the group column shows High, otherwise low
df['group'] = np.where(df['Messages sent by customer on the day'] > 5, 'High', 'low')


Group on multiple conditions

# Where the broker level is A1 and the broker response time is >= 24, set the sign column to 1
df.loc[(df['Broker level'] == 'A1') & (df['Broker response time'] >= 24.0), 'sign'] = 1
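The two grouping approaches behave differently for rows that do not match: np.where always writes one of its two labels, while a .loc assignment leaves unmatched rows as NaN. A minimal sketch on a hypothetical three-row table:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names follow the article.
df = pd.DataFrame({
    "Broker level": ["A1", "A1", "M4"],
    "Broker response time": [30.0, 10.0, 50.0],
    "Messages sent by customer on the day": [8, 2, 6],
})

# Two-way label from a single condition.
df["group"] = np.where(df["Messages sent by customer on the day"] > 5, "High", "low")

# Flag rows that satisfy both conditions; all other rows stay NaN.
mask = (df["Broker level"] == "A1") & (df["Broker response time"] >= 24.0)
df.loc[mask, "sign"] = 1

print(df)
```

Each condition sits in its own parentheses because & binds more tightly than == and >=.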

Split the data
pd.DataFrame([x.split('net') for x in df['Customer registration channel']],
             index=df.index, columns=['Customer registration channel', 'Size'])
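An alternative to building the frame by hand is the vectorized Series.str.split with expand=True, which returns one column per piece. A minimal sketch, assuming hypothetical channel values joined by a "-" separator (not the article's data):

```python
import pandas as pd

# Hypothetical values; the "-" separator is an assumption for illustration.
df = pd.DataFrame({"Customer registration channel": ["web-large", "app-small", "web-small"]})

# Split once on the separator and spread the pieces over two columns.
parts = df["Customer registration channel"].str.split("-", n=1, expand=True)
parts.columns = ["Channel", "Size"]

print(parts)
```

The result keeps the index of df, so it can be joined back onto the original frame.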

Data extraction

Extract by label (loc function)

Extract rows 0-3
df.loc[0:3]

Extract by date
# Reset the index
df.reset_index()
df = df.set_index('date')
df['2016-11-02':'2016-11-02']

Extract by position (iloc function)
Extract by range
df.iloc[:4, :5]

Extract specific positions
# [0, 2, 5] are the specified rows, [4, 5] are the specified columns
df.iloc[[0, 2, 5], [4, 5]]

Extract by label and position (ix)
# Rows are selected by date label, columns by position
# df.ix['2016-11-03':'2016-11-03', 4:6] works only before pandas 1.0, where .ix was removed
df.loc['2016-11-03':'2016-11-03'].iloc[:, 4:6]
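Mixed label and position selection can be written as a label step with loc followed by a position step with iloc. A minimal sketch on a hypothetical date-indexed table:

```python
import pandas as pd

# Hypothetical data indexed by date; the column names are placeholders.
dates = pd.to_datetime(["2016-11-02", "2016-11-03", "2016-11-03", "2016-11-04"])
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]},
    index=dates,
)

# Rows by date label (both ends inclusive), then columns by position.
subset = df.loc["2016-11-03":"2016-11-03"].iloc[:, 1:3]

print(subset)
```

Label slices include their end point, while position slices exclude it, which is why the column slice 1:3 returns two columns.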

Extract by condition (loc and isin functions)
Test values with the isin function

# Check whether the broker level is A3
df['Broker level'].isin(['A3'])

Test first, then extract the rows where the result is True
# First check whether the broker level column contains A3 or M4, then extract the rows that meet the condition
df.loc[df['Broker level'].isin(['A3', 'M4'])]
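Because isin returns a Boolean Series, the same mask can select the matching rows or, negated with ~, the remaining ones. A minimal sketch on a hypothetical four-row table:

```python
import pandas as pd

# Hypothetical data; the column names follow the article.
df = pd.DataFrame({
    "Broker level": ["A3", "A1", "M4", "A3"],
    "Store": ["S1", "S2", "S3", "S4"],
})

mask = df["Broker level"].isin(["A3", "M4"])  # True where the level is A3 or M4
selected = df.loc[mask]                       # rows that match
excluded = df.loc[~mask]                      # rows that do not match

print(selected)
print(excluded)
```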


Extract a specified value from merged data

# Extract the first three characters, the "Chain Home network" prefix
data = df['Customer registration channel']
pd.DataFrame(data.str[:3])

Data filtering

Filter by condition
Meet several conditions at once and display the specified columns (and)

# Level is M4 and the number of messages sent is greater than 110
df.loc[(df['Messages sent by broker on the day'] > 110) & (df['Broker level'] == 'M4'),
       ['Broker response time', '5 minutes response', 'Broker system number']]

Meet any one of several conditions (or)
df.loc[(df['Messages sent by broker on the day'] > 400) | (df['Broker response time'] > 60000.0),
       ['Broker response time', 'Broker system number']].sort_values(['Broker response time'])

# Sum the broker response time of the matching rows
df.loc[(df['Messages sent by broker on the day'] > 400) | (df['Broker response time'] > 60000.0),
       ['Broker response time', 'Broker system number']].sort_values(['Broker response time'])['Broker response time'].sum()
686046.0

Not equal (!=)
# value stands for the comparison value, which is missing from the source
df.loc[(df['broker sends message number'] != value) & (df['Broker level'] == 'A4'),
       ['Broker response time', 'Broker system number']].sort_values(['Broker response time'])

# count() counts the matching rows
df.loc[(df['broker sends message number'] != value) & (df['Broker level'] == 'A4'),
       ['Broker response time', 'Broker system number']].sort_values(['Broker response time'])['Broker system number'].count()
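The and/or filters follow one pattern: build a Boolean mask with & or |, pass it to loc, then sort or aggregate the result. A minimal sketch on a hypothetical four-row table; the thresholds mirror the article, the data does not:

```python
import pandas as pd

# Hypothetical data; the column names follow the article.
df = pd.DataFrame({
    "Broker level": ["M4", "A4", "M4", "A4"],
    "Messages sent by broker on the day": [150, 450, 90, 20],
    "Broker response time": [70000.0, 100.0, 50.0, 65000.0],
})
sent = df["Messages sent by broker on the day"]
resp = df["Broker response time"]

# and: both conditions must hold; show one chosen column.
both = df.loc[(sent > 110) & (df["Broker level"] == "M4"), ["Broker response time"]]

# or: either condition is enough; then sort and total one column.
either = df.loc[(sent > 400) | (resp > 60000.0)].sort_values("Broker response time")
total = either["Broker response time"].sum()

print(both)
print(either)
print(total)
```

sort_values is the current method name; the older DataFrame.sort was removed from pandas.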

Use the query function

# Backticks quote a column name that contains a space
df.query("`Broker level` == ['A4', 'M4']")
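In a query string, comparing a column to a list with == acts as a membership test, so it selects the same rows as isin. A minimal sketch on a hypothetical three-row table:

```python
import pandas as pd

# Hypothetical data; "Messages" is a placeholder column.
df = pd.DataFrame({
    "Broker level": ["A4", "M4", "A1"],
    "Messages": [500, 120, 90],
})

# Backticks are needed because the column name contains a space.
by_query = df.query("`Broker level` == ['A4', 'M4']")
by_isin = df.loc[df["Broker level"].isin(["A4", "M4"])]

print(by_query)
```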

Data summary

Subtotals with groupby
Subtotals for all columns
df.groupby('Broker level').count()

Subtotals for a specific column
df.groupby('Broker level')['Broker response time'].count()

Add a second grouping key and summarize the response time column
df.groupby(['Broker level', 'broker responds'])['Broker response time'].count()

The counts are inaccurate because the missing values were filled earlier.

Group and compute a statistic
# Subtotal the broker response time by level and compute the mean
df.groupby('Broker level')['Broker response time'].agg(['mean'])
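Passing a list of function names to agg produces one result column per statistic. A minimal sketch on a hypothetical five-row table:

```python
import pandas as pd

# Hypothetical data; the column names follow the article.
df = pd.DataFrame({
    "Broker level": ["A1", "A1", "M4", "M4", "M4"],
    "Broker response time": [10.0, 20.0, 30.0, 60.0, 90.0],
})

grouped = df.groupby("Broker level")["Broker response time"]
counts = grouped.count()              # non-null values per level
stats = grouped.agg(["mean", "max"])  # one column per statistic

print(counts)
print(stats)
```

count() skips missing values, so a column filled beforehand and an unfilled one can give different counts.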

Pivot table
pd.pivot_table(df, index=['Messages sent by broker on the day'], values=['Broker response time'],
               columns=['Broker level'], aggfunc=[len, np.sum], fill_value=0, margins=True)
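The roles of fill_value and margins are easier to see on a tiny table. A minimal sketch, assuming hypothetical store data and a single sum aggregation instead of the article's two functions:

```python
import pandas as pd

# Hypothetical data; S2 has no M4 broker, so that cell is empty before filling.
df = pd.DataFrame({
    "Store": ["S1", "S1", "S2", "S2"],
    "Broker level": ["A1", "M4", "A1", "A1"],
    "Broker response time": [10.0, 20.0, 30.0, 40.0],
})

table = pd.pivot_table(
    df,
    index="Store",
    columns="Broker level",
    values="Broker response time",
    aggfunc="sum",
    fill_value=0,   # empty cells become 0 instead of NaN
    margins=True,   # adds an "All" row and an "All" column of totals
)

print(table)
```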

Data statistics

Data sampling
Simple random sampling
df.sample(n=3)

Set sampling weights
Every row needs a weight, which is workable for a table with few rows but not for one with many.
Suppose the data has 4 rows; set the sampling weights:
weights = [0, 0, 0.5, 0.5]
# Only two rows have a non-zero weight, so at most 2 rows can be drawn without replacement
df.sample(n=2, weights=weights)
Sample with or without replacement
# replace=True puts each row back after it is drawn
df.sample(n=6, replace=True)
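A minimal sketch of weighted sampling and sampling with replacement on a hypothetical four-row table; random_state is added here only to make the draw repeatable and is not part of the article's calls:

```python
import pandas as pd

# Hypothetical four-row table.
df = pd.DataFrame({"value": [10, 20, 30, 40]})

# Rows with weight 0 can never be drawn, so only rows 2 and 3 are eligible.
weights = [0, 0, 0.5, 0.5]
weighted = df.sample(n=2, weights=weights, random_state=0)

# With replacement, n may exceed the number of rows.
with_replacement = df.sample(n=6, replace=True, random_state=0)

print(weighted)
print(with_replacement)
```

The weights list must have one entry per row of the frame being sampled.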

Descriptive statistics (describe function)
Automatically generates the count, mean, standard deviation and other summary figures.

# round(2) shows 2 digits after the decimal point; T transposes the result
df.describe().round(2).T

Standard deviation (std)
df['Broker response time'].std()

Covariance (cov)

df['Messages sent by broker on the day'].cov(df['Messages sent by customer on the day'])


Correlation analysis (corr)
The correlation coefficient lies between -1 and 1: close to 1 is a positive correlation, close to -1 is a negative correlation, and 0 is no correlation.

df['Messages sent by customer on the day'].corr(df['Messages sent by broker on the day'])
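The range of the coefficient can be checked on two small series that are exactly linearly related. A minimal sketch with hypothetical message counts:

```python
import pandas as pd

# Hypothetical counts; broker is exactly twice customer.
customer = pd.Series([1.0, 2.0, 3.0, 4.0])
broker = pd.Series([2.0, 4.0, 6.0, 8.0])

cov = customer.cov(broker)              # sample covariance (divides by n - 1)
corr = customer.corr(broker)            # perfect positive relationship
corr_negative = customer.corr(-broker)  # perfect negative relationship

print(cov, corr, corr_negative)
```

Unlike covariance, the correlation coefficient does not depend on the scale of either series.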


Author: ghost-yu Scholar
Source: Jianshu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.
