The introductory chapter of the book presents an example that processes the MovieLens 1M dataset. The book gives GroupLens Research (http://www.grouplens.org/node/73) as the source; that link now redirects to https://grouplens.org/datasets/movielens/, which offers a variety of rating datasets collected from the MovieLens website as downloadable archives, including the MovieLens 1M dataset we need.
The archive's three .dat tables (users.dat, ratings.dat and movies.dat) are all used in the example. However, the Chinese edition of Python for Data Analysis (PDF) that I am reading is the first edition from 2014, and all of its examples are based on Python 2.7 and pandas 0.8.2. I have Python 3.5.2 and pandas 0.20.2 installed, and a lot has changed between those versions: some parameters have been renamed, and some old functions have been deprecated or removed. As a result, running the book's sample code as written produces errors and warnings. While testing the MovieLens 1M code, I ran into four such problems in this environment.
- When the .dat files are read into pandas DataFrame objects, the code given in the book is:
import pandas as pd

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames)
Running this directly produces warnings:
F:/python/helloworld/dataanalysisbypython-1.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames)
F:/python/helloworld/dataanalysisbypython-1.py:7: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames)
F:/python/helloworld/dataanalysisbypython-1.py:10: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames)
Although the code works, the perfectionist in me still wants to get rid of the warning. It says that the 'c' engine does not support regex separators (any separator longer than one character, other than '\s+', is treated as a regex), so pandas falls back to the 'python' engine. pandas.read_table has an engine parameter that selects the parsing engine, either 'c' or 'python'. Since the 'c' engine cannot handle the '::' separator, we just need to set engine to 'python' explicitly:
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')
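To check the fix without downloading the dataset, here is a minimal sketch that parses a few made-up rows in the same '::'-separated layout from an in-memory string. The sample rows are hypothetical, not taken from users.dat, and the filter that turns ParserWarning into an error is only there to show that no fallback warning is raised once engine='python' is given.

```python
import io
import warnings

import pandas as pd

# Hypothetical rows in the same '::'-separated layout as users.dat.
sample = "1::F::1::10::48067\n2::M::56::16::70072\n"
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

with warnings.catch_warnings():
    # Turn ParserWarning into an exception: if pandas had to fall back
    # from the 'c' engine, this block would fail instead of warning.
    warnings.simplefilter('error', pd.errors.ParserWarning)
    users = pd.read_table(io.StringIO(sample), sep='::', header=None,
                          names=unames, engine='python')

print(users)
```

If engine='python' is removed from the call, the same block should raise the ParserWarning as an exception instead of parsing quietly.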
To calculate the average rating of each movie by gender from the merged data, the book uses the pivot_table method; the code given in the book is:
mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
Running this directly raises an error, so the code cannot run:
Traceback (most recent call last):
  File "f:/python/helloworld/dataanalysisbypython-1.py", line 19, in <module>
    mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
TypeError: pivot_table() got an unexpected keyword argument 'rows'
The TypeError says that 'rows' is not a keyword argument this method accepts. So what should be used instead? I checked the pandas API documentation on the official site (http://pandas.pydata.org/pandas-docs/stable/api.html) and found that the keyword arguments of pivot_table in version 0.20.2 differ from those used in the book. To get the same result, replace rows with index; the cols parameter no longer exists either, so use columns instead:
mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
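As a small illustration of what the index and columns arguments do, the sketch below runs the same call on a tiny hand-made frame. The titles and ratings are invented for the example and merely stand in for the merged data frame from the book.

```python
import pandas as pd

# Hypothetical ratings, standing in for the merged `data` frame.
data = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'rating': [5, 3, 4, 2, 4],
})

# index -> one row per title, columns -> one column per gender.
mean_ratings = data.pivot_table(values='rating', index='title',
                                columns='gender', aggfunc='mean')
print(mean_ratings)
```

Each title becomes a row and each gender a column; a combination with no ratings (here, Movie B rated by men) shows up as NaN.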
To find the movies that female viewers like best, the F column is sorted in descending order with a DataFrame method; the sample code in the book is:
top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
This also only produces a warning and does not stop the program:
FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
  top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
This means that the by argument of sort_index is deprecated and may be removed in a future version of the library; sort_values is recommended instead. In the API documentation, pandas.DataFrame.sort_index is described as "Sort object by labels (along an axis)", while pandas.DataFrame.sort_values is described as "Sort by the values along either axis". Since sort_index with by sorts by the values of a column, sort_values gives the same result here, so I replaced the call directly. The sort_index call in the later "calculate the score divergence" section can be replaced with sort_values in the same way.
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
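The difference between the two methods is easy to see on a small frame. The following sketch uses invented mean ratings (not values from the dataset) to contrast sorting by a column's values with sorting by the row labels.

```python
import pandas as pd

# Hypothetical mean ratings, indexed by title.
mean_ratings = pd.DataFrame(
    {'F': [3.0, 4.5, 4.0], 'M': [3.5, 4.0, 2.5]},
    index=['Movie B', 'Movie C', 'Movie A'],
)

# sort_values orders rows by the values in column 'F', highest first.
by_value = mean_ratings.sort_values(by='F', ascending=False)

# sort_index orders rows by their labels (the titles).
by_label = mean_ratings.sort_index()

print(list(by_value.index))
print(list(by_label.index))
```

Here sort_values puts Movie C first because it has the highest F rating, while sort_index simply arranges the titles alphabetically.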
The last error is also related to sorting. In the "calculate the score divergence" section, after the standard deviation of the ratings has been computed and filtered, the resulting Series is sorted by value in descending order; the code in the book is:
print(rating_std_by_title.order(ascending=False)[:10])
The error here is:
Traceback (most recent call last):
  File "f:/python/helloworld/dataanalysisbypython-1.py", line 47, in <module>
    print(rating_std_by_title.order(ascending=False)[:10])
  File "E:\Program files\python35\lib\site-packages\pandas\core\generic.py", line 2970, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'order'
The order method no longer exists, so I had to look for a replacement in the API documentation. There are two candidates, sort_index and sort_values, which mirror the DataFrame methods of the same names. Since order sorted a Series by its values, sort_values is the matching replacement, and I chose it to be safe:
print(rating_std_by_title.sort_values(ascending=False)[:10])
The output matches the results shown in the book, so the replacement can be used with confidence.
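For a quick check of the Series version, here is a minimal sketch with invented standard deviations. The titles and numbers are hypothetical; only the sort_values call and the slice mirror the code above.

```python
import pandas as pd

# Hypothetical per-title standard deviations of ratings.
rating_std_by_title = pd.Series(
    {'Movie A': 0.8, 'Movie B': 1.3, 'Movie C': 1.1}
)

# Sort by value, largest first, then keep the top two entries.
top = rating_std_by_title.sort_values(ascending=False)[:2]
print(top)
```

The slice keeps the first entries of the sorted Series, so the titles with the largest spread in ratings come out on top.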
The differences between versions of a third-party library can be quite significant. I recommend using the latest version together with the API documentation on the official site, which makes problems like these easy to solve.