To read the .dat files into pandas DataFrame objects, the code given in the book is:
import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames)
Running this code directly produces the following warnings:
F:/python/helloworld/dataanalysisbypython-1.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames)
F:/python/helloworld/dataanalysisbypython-1.py:7: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames)
F:/python/helloworld/dataanalysisbypython-1.py:10: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames)
The code still works, but as a perfectionist I want to get rid of the warning. It says that the 'c' engine does not support regex separators, and a separator longer than one character such as '::' is treated as a regex, so pandas falls back to the 'python' engine. pandas.read_table has an engine parameter that selects which parser engine to use, with the options 'c' and 'python'. Since the 'c' engine cannot handle this separator, we just need to set the engine to 'python' explicitly.
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')
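To check the fix without the real dataset, here is a minimal sketch. The three sample rows are made up for illustration (they only mimic the '::' layout of users.dat), and recording warnings with the warnings module is my own addition, not something from the book.

```python
import io
import warnings

import pandas as pd

# Hypothetical sample rows in the same '::'-delimited layout as users.dat.
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "3::M::25::15::55117\n"
)
unames = ["user_id", "gender", "age", "occupation", "zip"]

with warnings.catch_warnings(record=True) as caught:
    # Record every warning so a silent fallback would be visible afterwards.
    warnings.simplefilter("always")
    users = pd.read_table(sample, sep="::", header=None, names=unames, engine="python")

# Keep only the parser warnings; with engine='python' there should be none.
parser_warnings = [w for w in caught if issubclass(w.category, pd.errors.ParserWarning)]

print(users)
print("ParserWarning count:", len(parser_warnings))
```

If engine='python' is left out, the same call should still parse the data but record the fallback warning quoted above.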
To calculate the average rating of each movie by gender on the merged data with the pivot_table method, the code given in the book is:
mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
Running it directly raises an error, so this code cannot be used as is:
Traceback (most recent call last):
  File "F:/python/helloworld/dataanalysisbypython-1.py", line, in <module>
    mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
TypeError: pivot_table() got an unexpected keyword argument 'rows'
The TypeError says that 'rows' is not a keyword argument accepted by this method. So what happened? I checked the official online pandas API documentation and found that the keyword arguments of pandas.pivot_table in version 0.20.2 differ from those used in the book. To get the same result, just change rows to index. There is no cols parameter either; use columns instead.
mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
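As a small illustration of the index/columns spelling, here is a sketch on a made-up frame. The five ratings below are hypothetical and only stand in for the merged data; the column names follow the ones used above.

```python
import pandas as pd

# Hypothetical ratings standing in for the merged `data` frame.
data = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B"],
    "gender": ["F", "M", "M", "F", "F"],
    "rating": [5, 3, 4, 2, 4],
})

# One row per title, one column per gender, each cell holding the mean rating.
mean_ratings = data.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)
print(mean_ratings)
```

A title with no ratings from one gender gets NaN in that cell, as "Movie B" does in the M column here.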
To find the movies that female viewers like best, the F column of the DataFrame is sorted in descending order. The sample code in the book is:
top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
This too only produces a warning and does not stop the program:
F:/python/helloworld/dataanalysisbypython-1.py:32: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
  top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
This means that the by argument of sort_index is deprecated and may stop working in a future version of the library, so sort_values is recommended instead. In the API documentation, pandas.DataFrame.sort_index is described as "Sort object by labels (along an axis)", while pandas.DataFrame.sort_values is described as "Sort by the values along either axis". Called with by=, sort_index gives the same result as sort_values here, so I simply switch to sort_values. The later "calculate the score divergence" step also uses sort_index, and it can be replaced with sort_values in the same way.
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
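To make the difference between the two methods concrete, here is a minimal sketch on a made-up table of mean ratings (the values are hypothetical): sort_values orders rows by a column's values, while sort_index without by orders them by the row labels.

```python
import pandas as pd

# Hypothetical mean ratings per title, with one column per gender.
mean_ratings = pd.DataFrame(
    {"F": [3.0, 4.5, 4.0], "M": [3.5, 4.0, 2.5]},
    index=pd.Index(["Movie B", "Movie C", "Movie A"], name="title"),
)

# Order rows by the values in column F, highest first.
by_value = mean_ratings.sort_values(by="F", ascending=False)

# Order rows by their labels (the titles), alphabetically.
by_label = mean_ratings.sort_index()

print(by_value)
print(by_label)
```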
The last error is also related to sorting. In "calculate the score divergence", after the standard deviation of the ratings has been computed, the filtered Series is sorted by value in descending order. The code in the book is:
print(rating_std_by_title.order(ascending=False)[:10])
The error here is:
Traceback (most recent call last):
  File "F:/python/helloworld/dataanalysisbypython-1.py", line, in <module>
    print(rating_std_by_title.order(ascending=False)[:10])
  File "E:\Program Files\Python35\lib\site-packages\pandas\core\generic.py", line 2970, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'order'
Series no longer has an order method, so I had to look through the API documentation for a replacement. There are two candidates, sort_index and sort_values, the same as for DataFrame. To be safe, I choose sort_values:
print(rating_std_by_title.sort_values(ascending=False)[:10])
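The same pattern on a tiny Series, as a sketch: the standard deviations below are invented for illustration, and I take the top 2 instead of the top 10 only because the sample has four entries.

```python
import pandas as pd

# Hypothetical per-title standard deviations of ratings.
rating_std_by_title = pd.Series(
    [0.9, 1.3, 1.1, 0.7],
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
)

# Sort by value, largest first, then keep the first two entries by position.
top = rating_std_by_title.sort_values(ascending=False)[:2]
print(top)
```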
The output matches the results shown in the book, so the replacement can be used with confidence.