Python for Data Analysis: Notes

Pandas Basics

Stream Processing

Stream processing sounds fancy, but it is really just chunked reading. A common case: you have a file of many gigabytes that cannot be processed at once, so you work in batches, handling one million rows at a time and then the next million, until the whole file is done.

# Read in chunks, iterator style (pandas is assumed imported as pd throughout)
import pandas as pd

data = pd.read_csv(file, chunksize=1000000)
for sub_df in data:
    print('do something with sub_df here')
Index

Series and DataFrame both carry an index. The advantage of an index is fast lookup, and when an operation involves two Series or DataFrames, they are automatically aligned on the index (dates align automatically, for example), which saves a lot of work.
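A minimal sketch of this alignment (values invented for illustration):

import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
s1 + s2  # 'b' and 'c' are summed; 'a' and 'd' become NaN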

Missing Values

pd.isnull(obj)
obj.isnull()
Turn a dictionary into a DataFrame, specifying the column names and index
pd.DataFrame(data, columns=['col1', 'col2', 'col3', ...],
             index=['i1', 'i2', 'i3', ...])
View column names

dataframe.columns

View the index

dataframe.index

Rebuild the index

obj.reindex(['a', 'b', 'c', 'd', 'e', ...], fill_value=0)
# Reorders by the given index order rather than replacing the index.
# Where an index has no value, fill with 0.

# Modify the index in place
data.index = data.index.map(str.upper)
Rearrange the column order (also via reindex)
dataframe.reindex(columns=['col1', 'col2', 'col3', ...])

# index and columns can also be rebuilt at the same time

dataframe.reindex(index=['a', 'b', 'c', ...], columns=['col1', 'col2', 'col3', ...])
Shortcut for rebuilding the index
dataframe.ix[['a', 'b', 'c', ...], ['col1', 'col2', 'col3', ...]]
Renaming an axis index
data.rename(index=str.title, columns=str.upper)

# To rename individual index or column names, pass a dictionary
data.rename(index={'old_index': 'new_index'},
            columns={'old_col': 'new_col'})
View a column
dataframe['state']  # or dataframe.state
View a row

This requires the index:

dataframe.ix['index_name']
Add or remove a column
dataframe['new_col_name'] = 'char_or_number'
# Delete rows
dataframe.drop(['index1', 'index2', ...])
# Delete columns
dataframe.drop(['col1', 'col2', ...], axis=1)
# or
del dataframe['col1']
DataFrame: selecting a subset

Type                  Description
obj[val]              Select one or more columns
obj.ix[val]           Select one or more rows
obj.ix[:, val]        Select one or more columns
obj.ix[val1, val2]    Select rows and columns at the same time
reindex               Re-index rows and columns
icol, irow            Select a single column or row by integer position
get_value, set_value  Select a single value by row and column labels
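Note: .ix (along with icol/irow) has been removed from modern pandas; .loc (label-based) and .iloc (position-based) cover the same ground. A rough equivalent sketch:

dataframe.loc['index_name']           # one row by label
dataframe.loc[:, ['col1', 'col2']]    # columns by label
dataframe.iloc[0:3, 0:5]              # rows and columns by integer position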

For a Series

obj[['a', 'b', 'c', ...]]
obj['b':'e'] = 5

For a DataFrame

# Select multiple columns
dataframe[['col1', 'col2', ...]]

# Select multiple rows
dataframe[m:n]

# Conditional filtering
dataframe[dataframe['col3'] > 5]

# Select a subset
dataframe.ix[0:3, 0:5]
Operations between DataFrame and Series

Operands are automatically aligned on index and columns before the operation is carried out, which is very convenient.

Method  Description
add     Addition
sub     Subtraction
div     Division
mul     Multiplication

# Fill spots where one operand has no data with 0
df1.add(df2, fill_value=0)

# DataFrame and Series operations
dataframe - series
# Rule: the Series index is matched to the DataFrame's columns,
# and the operation is broadcast down the rows.

# Specify the axis direction
dataframe.sub(series, axis=0)
# Rule: the Series index is matched to the DataFrame's index,
# and the operation is broadcast across the columns.
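A small sketch of both directions (values invented):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=['a', 'b', 'c'])
row = df.iloc[0]     # a Series whose index matches df's columns
col = df['a']        # a Series whose index matches df's index

df - row             # broadcast down the rows (default)
df.sub(col, axis=0)  # broadcast across the columns
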
Applying a function
f = lambda x: x.max() - x.min()

# Applied to each column by default
dataframe.apply(f)

# To apply it to each row instead
dataframe.apply(f, axis=1)
Sort and Rank
# Sorts by index by default; axis=1 sorts by column labels
dataframe.sort_index(axis=0, ascending=False)

# Sort by value (sort_values in modern pandas)
dataframe.sort_index(by=['col1', 'col2', ...])

# Ranking: if there are ties, the tied values
# are given the average of their ranks
series.rank(ascending=False)

# Rank along rows or columns
dataframe.rank(axis=0)
Descriptive Statistics

Method          Description
count           Count of non-NA values
describe        Summary statistics per column
min, max        Minimum, maximum
argmin, argmax  Integer index positions of the minimum, maximum
idxmin, idxmax  Index values of the minimum, maximum
quantile        Sample quantile
sum, mean       Column sums, column means
median          Median
mad             Mean absolute deviation from the mean
var, std        Variance, standard deviation
skew            Skewness (third moment)
kurt            Kurtosis (fourth moment)
cumsum          Cumulative sum
cummin, cummax  Cumulative minimum, cumulative maximum
cumprod         Cumulative product
diff            First-order difference
pct_change      Percent change
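A quick illustration of a few of these on a toy frame (values invented):

df = pd.DataFrame({'x': [1.0, 2.0, 4.0], 'y': [3.0, 1.0, 2.0]})
df.describe()  # count, mean, std, quartiles per column
df.idxmax()    # index label of each column's maximum
df.cumsum()    # running totals down each column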
Unique values, value counts, membership
obj.unique()
obj.value_counts()
obj.isin(['b', 'c'])
Handling Missing Values
# Filtering missing values

# Drop a row if it contains any missing value
dataframe.dropna()
# Drop a row only if all of its values are missing
dataframe.dropna(how='all')
# The same, applied to columns
dataframe.dropna(how='all', axis=1)

# Filling missing values

# 1. Fill with 0
df.fillna(0)

# 2. Fill different columns with different values
df.fillna({1: 0.5, 3: -1})

# 3. Fill with the mean
df.fillna(df.mean())

# The axis parameter works the same way as above

Turn columns into the row index
df.set_index(['col1', 'col2', ...])
Data Cleansing and Reshaping

Merging Data Sets
# Take the intersection of df1 and df2, discarding the rest
# The default is an inner join
pd.merge(df1, df2, how='inner')

# If the join field names differ between df1 and df2, specify them explicitly
pd.merge(df1, df2, left_on='l_key', right_on='r_key')

# Other join types: left, right, outer, etc.

# Merge on multiple keys
pd.merge(left, right, on=['key1', 'key2'], how='outer')

# If there are duplicate column names after merging, add suffixes
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
Merging on an index
# For the case where the join key in a DataFrame is not a column name but the index
pd.merge(left, right, left_on='col_key', right_index=True)
# i.e. the left key is a column name, the right key is the index

# Multiple index levels
pd.merge(left, right, left_on=['key1', 'key2'], right_index=True)
The join method of DataFrame
# Merges on the index.
# This join method can be understood just like a database join
left.join(right, how='outer')

# Merge several data frames at once
left.join([right1, right2], how='outer')
Axial Concatenation (more commonly used)

Also known as concatenation, binding, or stacking.

np.concatenate([df1, df2], axis=1)  # NumPy version
pd.concat([df1, df2], axis=1)       # pandas version

# With axis=1 this is the same as cbind in R;
# with axis=0 it is the same as rbind.
# Indexes are aligned; entries with no match become NaN.

# join='inner' takes the intersection
pd.concat([df1, df2], axis=1, join='inner')

# The keys parameter: see the sketch below

# ignore_index=True for a simple concatenation that ignores the indexes
pd.concat([df1, df2], ignore_index=True)
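A hedged sketch of the keys parameter: with the default axis=0 it labels each input frame with an extra outer index level (frames invented):

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})
pd.concat([df1, df2], keys=['first', 'second'])
# outer index level is 'first'/'second'; inner level is each frame's own index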
Merging Overlapping Data

For two datasets whose indexes overlap in full or in part

Fill the missing values that arise from index alignment when merging

The where function

# where is essentially an if-else
np.where(pd.isnull(a), b, a)

The combine_first method

# Where a value in a is null, fill it with the value from b
a[:-2].combine_first(b[2:])

# combine_first patches the data: fill missing values in df1 with data from df2
df1.combine_first(df2)
Reshaping with Hierarchical Indexes

stack: convert to long format, rotating columns into rows
unstack: convert to wide format, rotating rows into columns

result = data.stack()
result.unstack()

Long format to wide format
pivoted = data.pivot('date', 'item', 'value')

# The first two arguments are the row and column index names; the last is the data column used to fill the DataFrame. If you omit the last argument, the resulting DataFrame has hierarchical columns.
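A minimal sketch of both cases (column names invented; recent pandas requires keyword arguments here):

pivoted = data.pivot(index='date', columns='item', values='value')
# Omitting values= pivots every remaining data column, giving hierarchical columns:
pivoted_all = data.pivot(index='date', columns='item')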
Pivot Table
table = df.pivot_table(values=['Price', 'Quantity'],
                       index=['Manager', 'Rep'],
                       aggfunc=[np.sum, np.mean],
                       margins=True)

# values: the fields the function is applied to
# index: the row index of the pivot table
# columns: the column index of the pivot table
# aggfunc: the function(s) to apply
# fill_value: what to fill nulls with
# margins: add a totals row/column

# The pivot table can then be filtered
table.query('Manager == ["Debra Henley"]')
table.query('Status == ["pending", "won"]')
Removing Duplicate Data
# Check whether rows are duplicated
data.duplicated()

# Remove duplicate rows
data.drop_duplicates()

# Check only the specified columns for duplicates, then drop the duplicate rows
data.drop_duplicates(['key1'])
Cross Tabulation

A special pivot table for computing group frequencies.
Note that this only makes sense for discrete, categorical, or string data; frequencies are not meaningful for continuous data.

pd.crosstab(df.col1, df.col2, margins=True)
Similar to the VLOOKUP Function

Data conversion using a function or mapping

# 1. Define a dictionary
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'honey ham': 'cow'
}

# 2. Apply a function or a dictionary to a column, and create a new column from the result
data['new_col'] = data['food'].map(str.lower).map(meat_to_animal)
Replacing Values
data.replace(-999, np.nan)

# Replace multiple values
data.replace([-999, -1000], np.nan)

# Replace each value with its counterpart
data.replace([-999, -1000], [np.nan, 0])
# A dictionary also works
data.replace({-999: np.nan, -1000: 0})
Discretization
# Simple split (equal-width bins)
s = pd.Series(range(100))
pd.cut(s, bins=10, labels=range(10))

# Define the cut points
bins = [20, 40, 60, 80, 100]

# Cut
cats = pd.cut(series, bins)

# View the labels
cats.labels

# View the levels (factor)
cats.levels

# Count per interval
pd.value_counts(cats)

# Custom labels for the bins
group_names = ['Youth', 'YoungAdult', 'MiddleAge', 'Senior']
pd.cut(ages, bins, labels=group_names)

Quantile Division

data = np.random.randn(1000)
pd.qcut(data, 4)  # quartiles

# Custom quantiles, including the endpoints
pd.qcut(data, [0, 0.3, 0.5, 0.9, 1])
Outliers
# View the summary statistics
data.describe()

# For a single column
col = data[3]
col[np.abs(col) > 3]

# Select all rows containing a value whose absolute value exceeds 3
data[(np.abs(data) > 3).any(1)]

# Replace outliers: cap them at +/-3, keeping the sign
data[np.abs(data) > 3] = np.sign(data) * 3
Sampling
# Randomly sample k rows
df.take(np.random.permutation(len(df))[:k])

# Randomly sample k rows, where k may exceed the number of rows in df
# (this can be understood as oversampling)
df.take(np.random.randint(0, len(df), size=k))
Dummy Variables (flattening categorical data)

This is equivalent to converting a categorical attribute into indicator columns. For example, take a car-ownership field with 3 distinct values: has a car, has no car, will buy one later. It gets encoded into 3 fields (has a car, no car, will buy later), each filled with 0/1 values, turning the category into numeric columns.

# Add a prefix to the flattened columns
dummies = pd.get_dummies(df['key'], prefix='key')

# Join the generated columns back
df[['data1']].join(dummies)
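A concrete sketch of the car example above (field and category names invented):

cars = pd.DataFrame({'car': ['has', 'none', 'later', 'has']})
pd.get_dummies(cars['car'], prefix='car')
# -> 0/1 columns car_has, car_later, car_none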
String Manipulation
# Split
strings.split(',')

# Split by regular expression
re.split(r'\s+', strings)

# Concatenate
'a' + 'b' + 'c'
# or
','.join(strings)

# Membership test / locate a substring
's' in strings
strings.find('s')

# Count
strings.count(',')

# Replace
strings.replace('old', 'new')

# Strip whitespace
s.strip()
Regular Expressions

Compile the matching pattern before matching; this can save a lot of CPU time.

re.compile: compile a pattern
findall: match all occurrences
search: return only the start and end positions of the first match
match: match only at the beginning of the string
sub: replace every match found

# Raw string
strings = 'sdf@153.com,dste@qq.com,sor@gmail.com'

# Compile the matching pattern; IGNORECASE makes it case-insensitive
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

# Match all
regex.findall(strings)

# Using search
m = regex.search(strings)  # get the location of the first match
strings[m.start():m.end()]

# Replace matches
regex.sub('new_string', strings)
Splitting matches by pattern groups

The matches themselves can be segmented further, using parentheses (groups) in the pattern.

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(strings)

# Using match instead
m = regex.match(string)
m.groups()

# The effect looks like this:
# suzyu123@163.com --> [('suzyu123', '163', 'com')]

# Get one column of the list of tuples
matches = regex.findall(strings)
[m[i] for m in matches]
Grouping and Aggregation: the groupby Technique
# Group by multiple keys, then take the mean
means = df['data1'].groupby([df['index1'], df['index2']]).mean()

# Expand into pivot-table style
means.unstack()
Split the grouped data into a dictionary of pieces
pieces = dict(list(df.groupby('index1')))

pieces['b']

groupby groups along the rows by default (axis=0); axis=1 groups along the columns

groupby shortcut functions

df.groupby('index1')['col_names']
df.groupby('index1')[['col_names']]

# The above is syntactic sugar for
df['col_names'].groupby(df['index1'])

df.groupby(['index1', 'index2'])[['col_names']].mean()
Group by dictionary or series
People = Dataframe (Np.random.randn (5, 5),
                    columns=[' A ', ' B ', ' C ', ' d ', ' e '],
                    index=[' Joe ', ' Steve ', ' Wes ', ' Jim ' , ' Travis '])

# The selection section is set to Na
people.ix[2:3,[' b ', ' C ']]=np.na

mapping = {' A ': ' Red ', ' B ': ' Red ', ' C ': ' Blue ', c13/> ' d ': ' Blue ', ' e ': ' Red ', ' f ': ' orange '}

people.groupby (Mapping,axis=1). SUM ()
Grouping by function
# Group by the length of the index values
people.groupby(len).sum()
Data aggregation with custom functions
# Apply a custom function to all data columns
df.groupby('index1').agg(myfunc)

# Using a built-in function
df.groupby('index1')['data1'].describe()
Applying multiple functions to grouped columns
# Group
grouped = df.groupby(['col1', 'col2'])

# Select several columns and apply multiple functions to each
grouped['data1', 'data2', ...].agg(['mean', 'std', myfunc])
Using different functions for different columns
grouped = df.groupby(['col1', 'col2'])

# Pass a dictionary mapping columns to functions
# Different columns can receive different numbers of functions
grouped.agg({'data1': ['min', 'max', 'mean', 'std'],
             'data2': 'sum'})
Renaming columns after a grouped calculation
grouped = df.groupby(['col1', 'col2'])

# Pass (new_name, function) tuples
grouped.agg({'data1': [('d_min', 'min'), ('d_max', 'max'),
                       ('d_mean', 'mean'), ('d_std', 'std')],
             'data2': 'sum'})
Returning aggregated data without a group index
df.groupby(['sex', 'smoker'], as_index=False).mean()
Adding a prefix to grouped calculation results
# Add a prefix to the computed column names
df.groupby('index1').mean().add_prefix('mean_')
Substituting grouped results back into the original data frame
# Apply a function to each group and substitute the grouped result
# for the values of the original data frame
# Custom functions work too
df.groupby(['index1', 'index2', ...]).transform(np.mean)
The more general apply function
df.groupby(['col1', 'col2', ...]).apply(myfunc)

# (a list of functions goes to agg, not apply)
df.groupby(['col1', 'col2', ...]).agg(['min', 'max', 'mean', 'std'])
Disabling group keys

By default the group keys, together with the original object's index, form a hierarchical index in the result

df.groupby('smoker', group_keys=False).apply(np.mean)
Turning the grouped index into DataFrame columns

In some cases groupby returns a Series even though the as_index=False parameter was used: typically when, despite the grouping, the calculation involves several columns and ends up yielding a Series with a hierarchical index. The code below turns that Series into a DataFrame, with the levels of the hierarchical index becoming DataFrame columns.

def fmean(df):
    """Needs two columns to compute the final result"""
    skus = len(df['sku'].unique())
    sums = df['salecount'].sum()
    return sums / skus

# Despite disabling the group keys, we still get a Series
salemean = data.groupby(by=['season', 'syear', 'smonth'],
                        as_index=False).apply(fmean)

# Convert the Series to a DataFrame and set the index
sub_df = pd.DataFrame(salemean.index.tolist(),
                      columns=salemean.index.names,
                      index=salemean.index)
# Combine the groupby result with sub_df
sub_df['salemean'] = salemean
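In recent pandas the same flattening can be done directly with reset_index; a minimal sketch under the same assumptions:

# name the Series, then promote the index levels to columns
sub_df = salemean.rename('salemean').reset_index()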
Bucket Analysis and Quantiles

Slice the data into pieces, then apply a function to each piece

frame = pd.DataFrame({'col1': np.random.randn(1000),
                      'col2': np.random.randn(1000)})

# Segment the data, creating a bucketing factor
# Returns the interval each element falls into
factor = pd.cut(frame.col1, 4)

# Compute by group, then reshape into data frame form
grouped = frame.col2.groupby(factor)
grouped.apply(myfunc).unstack()
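For equal-size rather than equal-width buckets, qcut can stand in for cut; a sketch, with the mean as an example statistic:

factor = pd.qcut(frame.col1, 10)   # deciles: roughly 100 rows per bucket
frame.col2.groupby(factor).mean()  # mean of col2 within each bucket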
Filling missing values with the group mean
# Custom function
fill_mean = lambda x: x.fillna(x.mean())

# Fill within each group
df.groupby(group_key).apply(fill_mean)