Python for Data Analysis: Notes

Pandas Basics

Stream Processing

Stream processing sounds fancy, but it is really just chunked reading. A common case: you have a file of many gigabytes that cannot be processed at once, so you work in batches, handling one million rows at a time and then the next million, until the whole file is done.

# Read in chunks, iterator style (pandas is assumed imported as pd throughout)
import pandas as pd

data = pd.read_csv(file, chunksize=1000000)
for sub_df in data:
    print('do something with sub_df here')
Index

Series and DataFrame both carry an index. The advantage of an index is fast lookup, and when an operation involves two Series or DataFrames, they are automatically aligned on the index (dates align automatically, for example), which saves a lot of work.
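A minimal sketch of this alignment (values invented for illustration):

import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
s1 + s2  # 'b' and 'c' are summed; 'a' and 'd' become NaN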

Missing Values

pd.isnull(obj)
obj.isnull()
Turn a dictionary into a DataFrame, specifying the column names and index
pd.DataFrame(data, columns=['col1', 'col2', 'col3', ...],
             index=['i1', 'i2', 'i3', ...])
View column names

dataframe.columns

View the index

dataframe.index

Rebuild the index

obj.reindex(['a', 'b', 'c', 'd', 'e', ...], fill_value=0)
# Reorders by the given index order rather than replacing the index.
# Where an index has no value, fill with 0.

# Modify the index in place
data.index = data.index.map(str.upper)
Rearrange the column order (also via reindex)
dataframe.reindex(columns=['col1', 'col2', 'col3', ...])

# index and columns can also be rebuilt at the same time

dataframe.reindex(index=['a', 'b', 'c', ...], columns=['col1', 'col2', 'col3', ...])
Shortcut for rebuilding the index
dataframe.ix[['a', 'b', 'c', ...], ['col1', 'col2', 'col3', ...]]
Renaming an axis index
data.rename(index=str.title, columns=str.upper)

# To rename individual index or column names, pass a dictionary
data.rename(index={'old_index': 'new_index'},
            columns={'old_col': 'new_col'})
View a column
dataframe['state']  # or dataframe.state
View a row

This requires the index:

dataframe.ix['index_name']
Add or remove a column
dataframe['new_col_name'] = 'char_or_number'
# Delete rows
dataframe.drop(['index1', 'index2', ...])
# Delete columns
dataframe.drop(['col1', 'col2', ...], axis=1)
# or
del dataframe['col1']
DataFrame: selecting a subset

Type                  Description
obj[val]              Select one or more columns
obj.ix[val]           Select one or more rows
obj.ix[:, val]        Select one or more columns
obj.ix[val1, val2]    Select rows and columns at the same time
reindex               Re-index rows and columns
icol, irow            Select a single column or row by integer position
get_value, set_value  Select a single value by row and column labels
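Note: .ix (along with icol/irow) has been removed from modern pandas; .loc (label-based) and .iloc (position-based) cover the same ground. A rough equivalent sketch:

dataframe.loc['index_name']           # one row by label
dataframe.loc[:, ['col1', 'col2']]    # columns by label
dataframe.iloc[0:3, 0:5]              # rows and columns by integer position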

For a Series

obj[['a', 'b', 'c', ...]]
obj['b':'e'] = 5

For a DataFrame

# Select multiple columns
dataframe[['col1', 'col2', ...]]

# Select multiple rows
dataframe[m:n]

# Conditional filtering
dataframe[dataframe['col3'] > 5]

# Select a subset
dataframe.ix[0:3, 0:5]
Operations between DataFrame and Series

Operands are automatically aligned on index and columns before the operation is carried out, which is very convenient.

Method  Description
add     Addition
sub     Subtraction
div     Division
mul     Multiplication

# Fill spots where one operand has no data with 0
df1.add(df2, fill_value=0)

# DataFrame and Series operations
dataframe - series
# Rule: the Series index is matched to the DataFrame's columns,
# and the operation is broadcast down the rows.

# Specify the axis direction
dataframe.sub(series, axis=0)
# Rule: the Series index is matched to the DataFrame's index,
# and the operation is broadcast across the columns.
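A small sketch of both directions (values invented):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=['a', 'b', 'c'])
row = df.iloc[0]     # a Series whose index matches df's columns
col = df['a']        # a Series whose index matches df's index

df - row             # broadcast down the rows (default)
df.sub(col, axis=0)  # broadcast across the columns
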
Applying a function
f = lambda x: x.max() - x.min()

# Applied to each column by default
dataframe.apply(f)

# To apply it to each row instead
dataframe.apply(f, axis=1)
Sort and Rank
# Sorts by index by default; axis=1 sorts by column labels
dataframe.sort_index(axis=0, ascending=False)

# Sort by value (sort_values in modern pandas)
dataframe.sort_index(by=['col1', 'col2', ...])

# Ranking: if there are ties, the tied values
# are given the average of their ranks
series.rank(ascending=False)

# Rank along rows or columns
dataframe.rank(axis=0)
Descriptive Statistics

Method          Description
count           Count of non-NA values
describe        Summary statistics per column
min, max        Minimum, maximum
argmin, argmax  Integer index positions of the minimum, maximum
idxmin, idxmax  Index values of the minimum, maximum
quantile        Sample quantile
sum, mean       Column sums, column means
median          Median
mad             Mean absolute deviation from the mean
var, std        Variance, standard deviation
skew            Skewness (third moment)
kurt            Kurtosis (fourth moment)
cumsum          Cumulative sum
cummin, cummax  Cumulative minimum, cumulative maximum
cumprod         Cumulative product
diff            First-order difference
pct_change      Percent change
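A quick illustration of a few of these on a toy frame (values invented):

df = pd.DataFrame({'x': [1.0, 2.0, 4.0], 'y': [3.0, 1.0, 2.0]})
df.describe()  # count, mean, std, quartiles per column
df.idxmax()    # index label of each column's maximum
df.cumsum()    # running totals down each column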
Unique values, value counts, membership
obj.unique()
obj.value_counts()
obj.isin(['b', 'c'])
Handling Missing Values
# Filtering missing values

# Drop a row if it contains any missing value
dataframe.dropna()
# Drop a row only if all of its values are missing
dataframe.dropna(how='all')
# The same, applied to columns
dataframe.dropna(how='all', axis=1)

# Filling missing values

# 1. Fill with 0
df.fillna(0)

# 2. Fill different columns with different values
df.fillna({1: 0.5, 3: -1})

# 3. Fill with the mean
df.fillna(df.mean())

# The axis parameter works the same way as above

Turn columns into the row index
df.set_index(['col1', 'col2', ...])
Data Cleansing and Reshaping

Merging Data Sets
# Take the intersection of df1 and df2, discarding the rest
# The default is an inner join
pd.merge(df1, df2, how='inner')

# If the join field names differ between df1 and df2, specify them explicitly
pd.merge(df1, df2, left_on='l_key', right_on='r_key')

# Other join types: left, right, outer, etc.

# Merge on multiple keys
pd.merge(left, right, on=['key1', 'key2'], how='outer')

# If there are duplicate column names after merging, add suffixes
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
Merging on an index
# For the case where the join key in a DataFrame is not a column name but the index
pd.merge(left, right, left_on='col_key', right_index=True)
# i.e. the left key is a column name, the right key is the index

# Multiple index levels
pd.merge(left, right, left_on=['key1', 'key2'], right_index=True)
The join method of DataFrame
# Merges on the index.
# This join method can be understood just like a database join
left.join(right, how='outer')

# Merge several data frames at once
left.join([right1, right2], how='outer')
Axial Concatenation (more commonly used)

Also known as concatenation, binding, or stacking.

np.concatenate([df1, df2], axis=1)  # NumPy version
pd.concat([df1, df2], axis=1)       # pandas version

# With axis=1 this is the same as cbind in R;
# with axis=0 it is the same as rbind.
# Indexes are aligned; entries with no match become NaN.

# join='inner' takes the intersection
pd.concat([df1, df2], axis=1, join='inner')

# The keys parameter: see the sketch below

# ignore_index=True for a simple concatenation that ignores the indexes
pd.concat([df1, df2], ignore_index=True)
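A hedged sketch of the keys parameter: with the default axis=0 it labels each input frame with an extra outer index level (frames invented):

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})
pd.concat([df1, df2], keys=['first', 'second'])
# outer index level is 'first'/'second'; inner level is each frame's own index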
Merging Overlapping Data

For two datasets whose indexes overlap in full or in part

Fill the missing values that arise from index alignment when merging

The where function

# where is essentially an if-else
np.where(pd.isnull(a), b, a)

The combine_first method

# Where a value in a is null, fill it with the value from b
a[:-2].combine_first(b[2:])

# combine_first patches the data: fill missing values in df1 with data from df2
df1.combine_first(df2)
Reshaping with Hierarchical Indexes

stack: convert to long format, rotating columns into rows
unstack: convert to wide format, rotating rows into columns

result = data.stack()
result.unstack()

Long format to wide format
pivoted = data.pivot('date', 'item', 'value')

# The first two arguments are the row and column index names; the last is the data column used to fill the DataFrame. If you omit the last argument, the resulting DataFrame has hierarchical columns.
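A minimal sketch of both cases (column names invented; recent pandas requires keyword arguments here):

pivoted = data.pivot(index='date', columns='item', values='value')
# Omitting values= pivots every remaining data column, giving hierarchical columns:
pivoted_all = data.pivot(index='date', columns='item')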
Pivot Table
table = df.pivot_table(values=['Price', 'Quantity'],
                       index=['Manager', 'Rep'],
                       aggfunc=[np.sum, np.mean],
                       margins=True)

# values: the fields the function is applied to
# index: the row index of the pivot table
# columns: the column index of the pivot table
# aggfunc: the function(s) to apply
# fill_value: what to fill nulls with
# margins: add a totals row/column

# The pivot table can then be filtered
table.query('Manager == ["Debra Henley"]')
table.query('Status == ["pending", "won"]')
Removing Duplicate Data
# Check whether rows are duplicated
data.duplicated()

# Remove duplicate rows
data.drop_duplicates()

# Check only the specified columns for duplicates, then drop the duplicate rows
data.drop_duplicates(['key1'])
Cross Tabulation

A special pivot table for computing group frequencies.
Note that this only makes sense for discrete, categorical, or string data; frequencies are not meaningful for continuous data.

pd.crosstab(df.col1, df.col2, margins=True)
Similar to the VLOOKUP Function

Data conversion using a function or mapping

# 1. Define a dictionary
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'honey ham': 'cow'
}

# 2. Apply a function or a dictionary to a column, and create a new column from the result
data['new_col'] = data['food'].map(str.lower).map(meat_to_animal)
Replacing Values
data.replace(-999, np.nan)

# Replace multiple values
data.replace([-999, -1000], np.nan)

# Replace each value with its counterpart
data.replace([-999, -1000], [np.nan, 0])
# A dictionary also works
data.replace({-999: np.nan, -1000: 0})
Discretization
# Simple split (equal-width bins)
s = pd.Series(range(100))
pd.cut(s, bins=10, labels=range(10))

# Define the cut points
bins = [20, 40, 60, 80, 100]

# Cut
cats = pd.cut(series, bins)

# View the labels
cats.labels

# View the levels (factor)
cats.levels

# Count per interval
pd.value_counts(cats)

# Custom labels for the bins
group_names = ['Youth', 'YoungAdult', 'MiddleAge', 'Senior']
pd.cut(ages, bins, labels=group_names)

Quantile Division

data = np.random.randn(1000)
pd.qcut(data, 4)  # quartiles

# Custom quantiles, including the endpoints
pd.qcut(data, [0, 0.3, 0.5, 0.9, 1])
Outliers
# View the summary statistics
data.describe()

# For a single column
col = data[3]
col[np.abs(col) > 3]

# Select all rows containing a value whose absolute value exceeds 3
data[(np.abs(data) > 3).any(1)]

# Replace outliers: cap them at +/-3, keeping the sign
data[np.abs(data) > 3] = np.sign(data) * 3
Sampling
# Randomly sample k rows
df.take(np.random.permutation(len(df))[:k])

# Randomly sample k rows, where k may exceed the number of rows in df
# (this can be understood as oversampling)
df.take(np.random.randint(0, len(df), size=k))
Dummy Variables (flattening categorical data)

This is equivalent to converting a categorical attribute into indicator columns. For example, take a car-ownership field with 3 distinct values: has a car, has no car, will buy one later. It gets encoded into 3 fields (has a car, no car, will buy later), each filled with 0/1 values, turning the category into numeric columns.

# Add a prefix to the flattened columns
dummies = pd.get_dummies(df['key'], prefix='key')

# Join the generated columns back
df[['data1']].join(dummies)
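A concrete sketch of the car example above (field and category names invented):

cars = pd.DataFrame({'car': ['has', 'none', 'later', 'has']})
pd.get_dummies(cars['car'], prefix='car')
# -> 0/1 columns car_has, car_later, car_none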
String Manipulation
# Split
strings.split(',')

# Split by regular expression
re.split(r'\s+', strings)

# Concatenate
'a' + 'b' + 'c'
# or
','.join(strings)

# Membership test / locate a substring
's' in strings
strings.find('s')

# Count
strings.count(',')

# Replace
strings.replace('old', 'new')

# Strip whitespace
s.strip()
Regular Expressions

Compile the matching pattern before matching; this can save a lot of CPU time.

re.compile: compile a pattern
findall: match all occurrences
search: return only the start and end positions of the first match
match: match only at the beginning of the string
sub: replace every match found

# Raw string
strings = 'sdf@153.com,dste@qq.com,sor@gmail.com'

# Compile the matching pattern; IGNORECASE makes it case-insensitive
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

# Match all
regex.findall(strings)

# Using search
m = regex.search(strings)  # get the location of the first match
strings[m.start():m.end()]

# Replace matches
regex.sub('new_string', strings)
Splitting matches by pattern groups

The matches themselves can be segmented further, using parentheses (groups) in the pattern.

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(strings)

# Using match instead
m = regex.match(string)
m.groups()

# The effect looks like this:
# suzyu123@163.com --> [('suzyu123', '163', 'com')]

# Get one column of the list of tuples
matches = regex.findall(strings)
[m[i] for m in matches]
Grouping and Aggregation: the groupby Technique
# Group by multiple keys, then take the mean
means = df['data1'].groupby([df['index1'], df['index2']]).mean()

# Expand into pivot-table style
means.unstack()
Split the grouped data into a dictionary of pieces
pieces = dict(list(df.groupby('index1')))

pieces['b']

groupby groups along the rows by default (axis=0); axis=1 groups along the columns

groupby shortcut functions

df.groupby('index1')['col_names']
df.groupby('index1')[['col_names']]

# The above is syntactic sugar for
df['col_names'].groupby(df['index1'])

df.groupby(['index1', 'index2'])[['col_names']].mean()
Group by dictionary or series
People = Dataframe (Np.random.randn (5, 5),
                    columns=[' A ', ' B ', ' C ', ' d ', ' e '],
                    index=[' Joe ', ' Steve ', ' Wes ', ' Jim ' , ' Travis '])

# The selection section is set to Na
people.ix[2:3,[' b ', ' C ']]=np.na

mapping = {' A ': ' Red ', ' B ': ' Red ', ' C ': ' Blue ', c13/> ' d ': ' Blue ', ' e ': ' Red ', ' f ': ' orange '}

people.groupby (Mapping,axis=1). SUM ()
Grouping by function
# Group by the length of the index values
people.groupby(len).sum()
Data aggregation with custom functions
# Apply a custom function to all data columns
df.groupby('index1').agg(myfunc)

# Using a built-in function
df.groupby('index1')['data1'].describe()
Applying multiple functions to grouped columns
# Group
grouped = df.groupby(['col1', 'col2'])

# Select several columns and apply multiple functions to each
grouped['data1', 'data2', ...].agg(['mean', 'std', myfunc])
Using different functions for different columns
grouped = df.groupby(['col1', 'col2'])

# Pass a dictionary mapping columns to functions
# Different columns can receive different numbers of functions
grouped.agg({'data1': ['min', 'max', 'mean', 'std'],
             'data2': 'sum'})
Renaming columns after a grouped calculation
grouped = df.groupby(['col1', 'col2'])

# Pass (new_name, function) tuples
grouped.agg({'data1': [('d_min', 'min'), ('d_max', 'max'),
                       ('d_mean', 'mean'), ('d_std', 'std')],
             'data2': 'sum'})
Returning aggregated data without a group index
df.groupby(['sex', 'smoker'], as_index=False).mean()
Adding a prefix to grouped calculation results
# Add a prefix to the computed column names
df.groupby('index1').mean().add_prefix('mean_')
Substituting grouped results back into the original data frame
# Apply a function to each group and substitute the grouped result
# for the values of the original data frame
# Custom functions work too
df.groupby(['index1', 'index2', ...]).transform(np.mean)
The more general apply function
df.groupby(['col1', 'col2', ...]).apply(myfunc)

# (a list of functions goes to agg, not apply)
df.groupby(['col1', 'col2', ...]).agg(['min', 'max', 'mean', 'std'])
Disabling group keys

By default the group keys, together with the original object's index, form a hierarchical index in the result

df.groupby('smoker', group_keys=False).apply(np.mean)
Turning the grouped index into DataFrame columns

In some cases groupby returns a Series even though the as_index=False parameter was used: typically when, despite the grouping, the calculation involves several columns and ends up yielding a Series with a hierarchical index. The code below turns that Series into a DataFrame, with the levels of the hierarchical index becoming DataFrame columns.

def fmean(df):
    """Needs two columns to compute the final result"""
    skus = len(df['sku'].unique())
    sums = df['salecount'].sum()
    return sums / skus

# Despite disabling the group keys, we still get a Series
salemean = data.groupby(by=['season', 'syear', 'smonth'],
                        as_index=False).apply(fmean)

# Convert the Series to a DataFrame and set the index
sub_df = pd.DataFrame(salemean.index.tolist(),
                      columns=salemean.index.names,
                      index=salemean.index)
# Combine the groupby result with sub_df
sub_df['salemean'] = salemean
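In recent pandas the same flattening can be done directly with reset_index; a minimal sketch under the same assumptions:

# name the Series, then promote the index levels to columns
sub_df = salemean.rename('salemean').reset_index()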
Bucket Analysis and Quantiles

Slice the data into pieces, then apply a function to each piece

frame = pd.DataFrame({'col1': np.random.randn(1000),
                      'col2': np.random.randn(1000)})

# Segment the data, creating a bucketing factor
# Returns the interval each element falls into
factor = pd.cut(frame.col1, 4)

# Compute by group, then reshape into data frame form
grouped = frame.col2.groupby(factor)
grouped.apply(myfunc).unstack()
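For equal-size rather than equal-width buckets, qcut can stand in for cut; a sketch, with the mean as an example statistic:

factor = pd.qcut(frame.col1, 10)   # deciles: roughly 100 rows per bucket
frame.col2.groupby(factor).mean()  # mean of col2 within each bucket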
Filling missing values with the group mean
# Custom function
fill_mean = lambda x: x.fillna(x.mean())

# Fill within each group
df.groupby(group_key).apply(fill_mean)