"Data analysis using Python" reading notes--first to second chapter preparation and examples


Source: http://www.cnblogs.com/batteryhp/p/4868348.html

Chapter 1: Preparatory Work

Today I start the book "Python for Data Analysis". I need to use both R and Python, which is why I am working through this book's code. First, install things as the book says: I downloaded epd_free-7.3-1-win-x86.msi; the translator recommends matching the author's versions. EPDFree includes NumPy, SciPy, matplotlib, Chaco, and IPython. pandas has to be installed separately; the matching version is pandas-0.9.0.win32-py2.7.exe. Data: github.com/pydata/pydata-book. The documentation is here:

Welcome to Python for Data Analysis's documentation! http://pda.readthedocs.org/en/latest/

Chapter 2: Introduction

This chapter consists of a few worked examples.

1. 1.usa.gov data from bit.ly

The first problem I ran into was a Chinese-encoding issue in PyCharm: set the IDE encoding to UTF-8, add # -*- encoding: utf-8 -*- as the first line of the file, and remember to prefix string literals containing Chinese with u.

Here's the code:

# -*- encoding: utf-8 -*-
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
from collections import Counter

# Note the Chinese directory name in the path
path = u'd:\\Hello\\usagov_bitly_data2012-03-16-1331923249.txt'
print open(path).readline()

# json.loads turns each JSON string into a dict -- very useful!
# Note the compact list-comprehension loop
records = [json.loads(line) for line in open(path)]
print records[0]
print type(records)
print type(records[0])
print records[0]['tz']

# Note the condition guarding against records without a 'tz' key
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
print time_zones[:10]

# Count the time zones; note the counting style and how
# defaultdict(int) initializes the dictionary
def get_counts(sequence):
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

counts = get_counts(time_zones)
print counts['America/New_York']

def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    # Note the indexing: the last n pairs, i.e. the n largest counts
    return value_key_pairs[-n:]

# This prints the last ten entries, from tenth-largest up to largest
print top_counts(counts)

# Counter is a godsend -- the author really knows his tools
counts = Counter(time_zones)
print counts.most_common(10)

The above uses only functions from the Python standard library for the analysis. A few things to note:

1. A note on list indexing and slicing:

a = range(0, 10, 1)

a[0]        >>> 0
a[-1]       >>> 9
a[:5]       >>> [0, 1, 2, 3, 4]
a[0:2]      >>> [0, 1]
a[-3:-1]    >>> [7, 8]
a[-3:]      >>> [7, 8, 9]
a[-1:-3:-1] >>> [9, 8]
a[::2]      >>> [0, 2, 4, 6, 8]
Notes:
1. A slice includes the element at the first index but excludes the element at the index after the colon.
2. A minus sign means the position is counted from the end.
3. The value after the second colon is the step; a negative step counts from the back. For example, a[-1:-3] (with the default step of 1) yields an empty list.
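Rule 3 is easy to verify interactively; a minimal check in a Python 2 session:

a = range(10)
print a[-1:-3]       # [] -- with step 1 the start already lies past the stop
print a[-1:-3:-1]    # [9, 8] -- a negative step walks backwards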

2. On the use of the collections module, see the following address:

http://www.zlovezl.cn/articles/collections-in-python/

collections mainly provides the following data types: namedtuple() generates a tuple subclass whose elements can be accessed by name; deque is a double-ended queue whose greatest benefit is fast appends and pops at either end; Counter counts things and works on dicts, lists, and strings, which is very convenient; OrderedDict generates a dictionary that remembers insertion order; and defaultdict supplies default values, e.g. defaultdict(int) means every missing value defaults to an int, and defaultdict(list) means every missing value defaults to a list. For more detail, see:

https://docs.python.org/2/library/collections.html#module-collections
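A minimal sketch exercising each of these types (Python 2 syntax, to match the rest of these notes):

from collections import namedtuple, deque, Counter, OrderedDict, defaultdict

# namedtuple: access tuple elements by name
Point = namedtuple('Point', ['x', 'y'])
p = Point(1, 2)
print p.x, p.y                    # 1 2

# deque: fast appends and pops at both ends
d = deque([1, 2, 3])
d.appendleft(0)
d.pop()
print d                           # deque([0, 1, 2])

# Counter: count anything hashable
print Counter('abracadabra').most_common(1)   # [('a', 5)]

# OrderedDict: remembers insertion order
od = OrderedDict([('first', 1), ('second', 2)])
print od.keys()                   # ['first', 'second']

# defaultdict: missing keys get a default value
dd = defaultdict(list)
dd['evens'].append(2)
print dd['evens']                 # [2]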

Next, the same time-zone count done with pandas.

DataFrame is the most important data structure in pandas; it corresponds to the data frame in R. Let's see how to implement the count:

# -*- encoding: utf-8 -*-
import json
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt

# Note the Chinese directory name in the path
path = u'd:\\Hello\\usagov_bitly_data2012-03-16-1331923249.txt'
# json.loads turns each JSON string into a dict; note the compact loop
records = [json.loads(line) for line in open(path)]
# DataFrame automatically tidies a list of dicts into a data frame;
# each dict key becomes a column
frame = DataFrame(records)
# With this much data, printing shows only a summary view
#print frame
# The first 10 elements of the column named tz
#print frame['tz'][:10]
# value_counts tallies the distinct tz values -- so convenient!
#print type(frame['tz'])
tz_counts = frame['tz'].value_counts()
#print tz_counts[:10]
# To draw a horizontal bar chart, first fill in the missing values
clean_tz = frame['tz'].fillna('Missing')
# Replace empty strings through a boolean-array index; note that
# empty strings and NA missing values are not the same here as in R
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
print tz_counts[:10]
# The book says the next statement only works inside pylab;
# in fact adding plt.show() is enough
tz_counts[:10].plot(kind='barh', rot=0)
plt.show()
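The fillna/boolean-index pair above matters: fillna only touches real missing values (NaN), so empty strings have to be replaced in a separate step. A tiny sketch with made-up data:

import numpy as np
from pandas import Series

s = Series(['America/New_York', np.nan, '', 'Asia/Shanghai'])
s = s.fillna('Missing')    # replaces only the NaN entry
s[s == ''] = 'Unknown'     # empty strings need this separate assignment
print s.value_counts()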

Next, handling the strings and expressions in the data (a few days ago I learned that Beautiful Soup is a web-scraping package):

# -*- encoding: utf-8 -*-
import json
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt

# Note the Chinese directory name in the path
path = u'd:\\Hello\\usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
frame = DataFrame(records)
# For a Series, dropna returns a Series containing only the non-null
# data and their index values
results = Series([x.split()[0] for x in frame.a.dropna()])
#print results.value_counts()
cframe = frame[frame.a.notnull()]
# np.where works like a vectorized ifelse
operating_system = np.where(cframe['a'].str.contains('Windows'),
                            'Windows', 'Not Windows')
#print operating_system[:5]
# Group tz by operating_system, count, unstack, and fill NA with 0
by_tz_os = cframe.groupby(['tz', operating_system])
agg_counts = by_tz_os.size().unstack().fillna(0)
#print agg_counts
# Note: sum defaults to axis=0; axis=1 sums across rows.
# argsort sorts ascending and returns the integer positions
indexer = agg_counts.sum(1).argsort()
# Take the rows with the largest totals; note the take function
# and the [-10:] subscript
count_subset = agg_counts.take(indexer)[-10:]
print count_subset
# The figure below is nice -- a stacked bar chart
count_subset.plot(kind='barh', stacked=True)
plt.show()
# Now show proportions instead of counts
normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh', stacked=True)
plt.show()
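Since np.where does real work above, here is a standalone sketch of it as a vectorized if-else, with made-up agent strings:

import numpy as np

agents = np.array(['Windows NT 6.1', 'Mac OS X', 'Windows XP'])
is_windows = np.array(['Windows' in a for a in agents])
print np.where(is_windows, 'Windows', 'Not Windows')
# ['Windows' 'Not Windows' 'Windows']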

That completes this example; on to the next one.

GroupLens collected a set of movie rating data from MovieLens users, spanning the late 1990s to the early 2000s. The goal here is to slice and analyze this data.

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

path1 = 'e:\\Pyprojects\\usepython_2.2\\movielens\\users.dat'
path2 = 'e:\\Pyprojects\\usepython_2.2\\movielens\\ratings.dat'
path3 = 'e:\\Pyprojects\\usepython_2.2\\movielens\\movies.dat'
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table(path1, sep='::', header=None, names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(path2, sep='::', header=None, names=rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(path3, sep='::', header=None, names=mnames)
#print users.head()
# Merge the three data sets; the final row count is determined by
# ratings, for obvious reasons
data = pd.merge(pd.merge(ratings, users), movies)
#print data.ix[0]
# Mean rating of each film by gender; honestly, this pivot-table
# function is really easy to understand
mean_ratings = data.pivot_table('rating', rows='title',
                                cols='gender', aggfunc='mean')
#print mean_ratings.head()
# Group data by title and count the ratings each title received
ratings_by_title = data.groupby('title').size()
# index here returns the labels
active_titles = ratings_by_title.index[ratings_by_title >= 251]
# This works because groupby and the pivot table sort in the same order
mean_ratings = mean_ratings.ix[active_titles]
#print mean_ratings
top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
#print top_female_ratings.head()
# The films men and women disagree on most. Adding a diff column
# directly also gives the films women prefer; note how sort_index is used
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_index(by='diff')
# Reverse the rows of the data frame and take the first 15 rows
# (i.e. just read the rows in the opposite order)
#print sorted_by_diff[::-1][:15]
# The films with the most disagreement regardless of gender
rating_std_by_title = data.groupby('title')['rating'].std()
rating_std_by_title = rating_std_by_title.ix[active_titles]
# Sorting a Series object needs order
print rating_std_by_title.order(ascending=False)[:10]
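Two remarks. First, the rows= and cols= keywords of pivot_table, along with sort_index(by=...), .ix, and Series.order, are the pandas 0.9-era API used by the book; later pandas versions renamed them (index=/columns=, sort_values, .loc). Second, the comment about the merged row count is worth spelling out: pd.merge defaults to an inner join on the overlapping columns, so every rating row picks up its user's and movie's attributes, one output row per rating. A toy example with made-up frames:

import pandas as pd
from pandas import DataFrame

ratings = DataFrame({'user_id': [1, 1, 2], 'movie_id': [10, 20, 10],
                     'rating': [5, 3, 4]})
users = DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies = DataFrame({'movie_id': [10, 20], 'title': ['Toy Story', 'Heat']})

data = pd.merge(pd.merge(ratings, users), movies)
print len(data)    # 3 -- one row per rating, as in the MovieLens merge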

The example above has many points worth attention, and the amount of information is fairly large for a beginner. Here are some more examples, on the baby-names data set:

# -*- encoding: utf-8 -*-
import json
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
from collections import defaultdict
from collections import Counter

# Note the Chinese directory names in the path
path_base = u'e:\\baiduyun\\computer\\python\\data analysis using python\\pydata-book-master\\ch02\\names\\'
# Read several files into a single DataFrame
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = path_base + 'yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
# Note: pd.concat merges by row by default, as an outer join keyed on the index
names = pd.concat(pieces, ignore_index=True)
# Aggregate; note how useful pivot_table is here!
total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
#print total_births.tail()
#total_births.plot(title='Total births by sex and year')
#plt.show()
# Insert a column: each name's proportion of the births in its group
def add_prop(group):
    # Convert the data to float first
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group
names = names.groupby(['year', 'sex']).apply(add_prop)
# Check whether prop sums to 1 per group; since these are floats, use
# allclose to test whether the sums are close enough to 1
#print np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
# Now take a subset: the top 1000 names per year/sex group
def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]
grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
#print top1000.head()
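The allclose check is there because floating-point sums are rarely exactly 1. A quick illustration:

import numpy as np

props = [0.1, 0.2, 0.3, 0.4]
print sum(props) == 1.0             # False: sequential float addition rounds
print np.allclose(sum(props), 1)    # True -- close enough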

Here is the complete second half:

# -*- encoding: utf-8 -*-
import os
import json
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt

path_base = u'd:\\pydata-book-master\\ch02\\names\\'
# Read several files into a single DataFrame
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = path_base + 'yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
# Note: pd.concat merges by row by default, as an outer join keyed on the index
names = pd.concat(pieces, ignore_index=True)
# Aggregate; pivot_table really is useful here!
total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
#print total_births.tail()
#total_births.plot(title='Total births by sex and year')
#plt.show()
# Insert a column: each name's proportion of the births in its group
def add_prop(group):
    # Convert the data to float first
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group
names = names.groupby(['year', 'sex']).apply(add_prop)
# prop is floating point, so use allclose to check the group sums are
# close enough to 1
#print np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
# Take the top 1000 names per year/sex group
def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]
grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
#print top1000.head()
# Now analyze naming trends
boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']
# A pivot table of births by year and name
total_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=sum)
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
# subplots says whether to draw the plots separately; figsize sets the
# size; grid toggles grid lines
#subset.plot(subplots=True, figsize=(12, 10), grid=True,
#            title='Number of births per year')
#plt.show()
# Measure the growth in naming diversity: the proportion of births
# covered by the 1000 most popular names
#table = top1000.pivot_table('prop', rows='year', cols='sex', aggfunc=sum)
#table.plot(title='Sum of table1000.prop by year and sex',
#           yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))
#plt.show()
# Another way: count the distinct names making up the top 50% of births
#df = boys[boys.year == 2010]
# Find where the cumulative prop reaches 0.5. The book says a loop would
# work, but numpy has cumsum (as does R), which is of course better
#prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum()
#print prop_cumsum[:10]
# searchsorted is almost too convenient here
#print prop_cumsum.searchsorted(0.5)
# Now compute this for all years
def get_quantile_count(group, q=0.5):
    group = group.sort_index(by='prop', ascending=False)
    return group.prop.cumsum().searchsorted(q) + 1
diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')
#print diversity.head()
diversity.plot(title='Number of popular names in top 50%')
plt.show()
# The "last letter" revolution: take the last letter of the name column.
# Note how lambda creates an anonymous function
get_last_letter = lambda x: x[-1]
# map applies the function to every element of name, like a "parallel" function
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'
# The next statement surprised me: last_letters is not a column of names,
# yet the pivot table is built without trouble -- this blew my mind
table = names.pivot_table('births', rows=last_letters,
                          cols=['sex', 'year'], aggfunc=sum)
subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
#print subtable.head()
letter_prop = subtable / subtable.sum().astype(float)
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
plt.show()
letter_prop = table / table.sum().astype(float)
dny_ts = letter_prop.ix[['d', 'n', 'y'], 'M'].T
dny_ts.plot()
plt.show()
# Last item: boys' names that became girls' names (and vice versa)
all_names = top1000.name.unique()
# in does substring matching here; R has something similar
mask = np.array(['lesl' in x.lower() for x in all_names])
lesley_like = all_names[mask]
# Use this result to filter the other names and count births by name to
# see the relative frequencies; the isin function is very convenient
filtered = top1000[top1000.name.isin(lesley_like)]
filtered.groupby('name').births.sum()
table = filtered.pivot_table('births', rows='year', cols='sex', aggfunc='sum')
#print table.head()
# div here normalizes each row
table = table.div(table.sum(1), axis=0)
print table.head()
#print table.tail()
table.plot(style={'M': 'k-', 'F': 'k--'})
plt.show()
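The cumsum-plus-searchsorted trick in get_quantile_count deserves a standalone look: after sorting the proportions in descending order, cumsum gives the running share, and searchsorted finds how many names it takes to cross 50%. A small sketch with made-up proportions:

import numpy as np

# made-up proportions, already sorted in descending order
prop = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
running = prop.cumsum()              # [0.30, 0.55, 0.75, 0.90, 1.00]
print running.searchsorted(0.5)      # 1 -- first position where the sum reaches 0.5
print running.searchsorted(0.5) + 1  # so 2 names cover half of all births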

That finishes these two chapters. The next one covers IPython, and Chapter 3 is a bit shorter, which is a welcome break.

"Data analysis using Python" reading notes--first to second chapter preparation and examples

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.