Python for Data Analysis Learning Path

Last Update:2017-06-23 Source: Internet

Author: User

In the Introduction chapter, the book presents an example that processes the MovieLens 1M dataset. According to the book, the dataset comes from GroupLens Research, whose site provides several collections of rating data gathered from the MovieLens website. The corresponding archive can be downloaded from there, and it includes the MovieLens 1M dataset that we need.

The extracted folder includes three DAT files (users.dat, ratings.dat, and movies.dat), and all three are used in the example.

The Chinese edition of Python for Data Analysis (PDF) that I read is the first edition from 2014, and all of its examples are based on Python 2.7 and pandas 0.8.2. I installed Python 3.5.2 with pandas 0.20.2, where some functions and methods behave quite differently: some parameters have been renamed, and some older functions have been deprecated or removed. As a result, running the sample code from the book as written produces errors and warnings. While testing the MovieLens 1M code in this environment, I ran into several problems.
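Because the problems described here depend on which versions are installed, it can help to confirm the environment first. The following is a minimal sketch, not taken from the book, that prints the Python and pandas versions in use:

```python
import sys

import pandas as pd

# Report the interpreter and library versions that the examples run against.
python_version = sys.version.split()[0]
pandas_version = pd.__version__

print('Python', python_version)
print('pandas', pandas_version)
```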

When reading the DAT files into pandas DataFrame objects, the code given in the book is:

import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames)

Running it directly produces warnings:

F:/python/helloworld/dataanalysisbypython-1.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames)
F:/python/helloworld/dataanalysisbypython-1.py:7: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames)
F:/python/helloworld/dataanalysisbypython-1.py:10: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames)

Although the code works, as a perfectionist I still wanted to get rid of the warning. It says that the 'c' engine does not support regex separators, and a separator longer than one character, such as '::', is interpreted as a regex, so pandas falls back to the 'python' engine. The pandas.read_table function has an engine parameter that selects the parsing engine, with 'c' and 'python' as the options. Since the 'c' engine cannot handle this separator, we just need to set engine to 'python' explicitly:

users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')
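To try the engine setting without downloading the dataset, here is a minimal self-contained sketch. It assumes a few made-up rows in the same '::'-delimited layout and reads them from memory with pd.read_csv, which accepts the same sep and engine parameters as read_table. The sample values are hypothetical and not taken from the real users.dat.

```python
import io

import pandas as pd

# Hypothetical rows in the '::'-delimited layout; values are invented
# for illustration and do not come from the real dataset.
sample = io.StringIO(
    "1::F::25::3::12345\n"
    "2::M::35::7::23456\n"
    "3::M::18::4::34567\n"
)

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# '::' is longer than one character, so pandas treats it as a regex
# separator. Passing engine='python' selects the parser that supports it
# and avoids the ParserWarning about falling back from the 'c' engine.
users = pd.read_csv(sample, sep='::', header=None, names=unames,
                    engine='python')

print(users)
```

Reading from an in-memory buffer keeps the sketch independent of any file path; with the real data, the file name would take the place of the buffer.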

To calculate the average rating of each movie by gender from the merged data with the pivot_table method, the code given in the book is:

mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')

Running this directly raises an error, so the code does not run:

Traceback (most recent call last):
  File "f:/python/helloworld/dataanalysisbypython-1.py", line, in <module>
    mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
TypeError: pivot_table() got an unexpected keyword argument 'rows'

The TypeError says that 'rows' is not a keyword argument accepted by the method. So what happened? I checked the official online pandas API documentation and found that in version 0.20.2 the keyword arguments of pandas.pivot_table differ from those used in the book. To get the same result, change rows to index. Likewise, there is no cols parameter anymore, so use columns instead:

mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
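The renamed keywords can be checked on a tiny table. The sketch below assumes a hypothetical miniature stand-in for the merged data, with invented titles and ratings, and calls pivot_table with index and columns:

```python
import pandas as pd

# Hypothetical miniature stand-in for the merged ratings/users/movies data.
data = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'rating': [5, 3, 4, 2, 4],
})

# index= replaces the old rows= keyword, columns= replaces cols=.
mean_ratings = data.pivot_table(values='rating', index='title',
                                columns='gender', aggfunc='mean')

print(mean_ratings)
```

Each row of the result is a title and each column a gender. A title with no ratings from one gender gets NaN in that cell, as happens for Movie B in the M column here.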

To find the movies that female viewers like best, sort the DataFrame by the F column in descending order. The sample code in the book is:

top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)

This one only produces a warning and does not stop the program:

F:/python/helloworld/dataanalysisbypython-1.py:32: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
  top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)

This means that the by argument of sort_index is deprecated and may be removed in a future version of the library, and sort_values is recommended instead. In the API documentation, pandas.DataFrame.sort_index is described as "sort object by labels (along an axis)", while pandas.DataFrame.sort_values is described as "sort by the values along either axis". Sorting by the values of a column is what this example needs, so I replaced the call with sort_values directly. The later part on calculating the rating divergence also uses sort_index in this way and can be changed to sort_values as well.

top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
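The difference between the two methods is easy to see on a small frame. This is a minimal sketch with hypothetical mean ratings, invented for illustration: sort_values orders rows by the contents of a column, while sort_index orders them by the row labels.

```python
import pandas as pd

# Hypothetical mean ratings per title, split by gender.
mean_ratings = pd.DataFrame(
    {'F': [4.5, 3.0, 4.8], 'M': [4.0, 3.5, 4.1]},
    index=['Movie A', 'Movie B', 'Movie C'],
)

# Order rows by the values in column 'F', highest first.
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

# Order rows by their labels (the titles), in reverse alphabetical order.
by_label = mean_ratings.sort_index(ascending=False)

print(top_female_ratings)
print(by_label)
```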

The last error is also related to sorting. In the rating divergence calculation, after computing the standard deviation of the ratings, the filtered Series is sorted by value in descending order, and the code in the book is:

print(rating_std_by_title.order(ascending=False)[:10])

The error here is:

Traceback (most recent call last):
  File "f:/python/helloworld/dataanalysisbypython-1.py", line, in <module>
    print(rating_std_by_title.order(ascending=False)[:10])
  File "E:\Program files\python35\lib\site-packages\pandas\core\generic.py", line 2970, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'order'

The order method no longer exists, so I had to look in the API documentation for an alternative. There are two candidates, sort_index and sort_values, which work like the DataFrame methods of the same names. Since the Series should be sorted by its values, I chose sort_values to be safe:

print(rating_std_by_title.sort_values(ascending=False)[:10])
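As a rough end-to-end check of this step, the sketch below builds a Series of standard deviations and sorts it with sort_values. It assumes a hypothetical small ratings table and computes the deviations with groupby, which is an assumption about how rating_std_by_title is produced, since that code is not shown above.

```python
import pandas as pd

# Hypothetical ratings; titles and values are invented for illustration.
data = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie B', 'Movie B', 'Movie C', 'Movie C'],
    'rating': [1, 5, 3, 3, 2, 4],
})

# Standard deviation of the ratings for each title (assumed construction).
rating_std_by_title = data.groupby('title')['rating'].std()

# Series.sort_values replaces the removed Series.order method.
top = rating_std_by_title.sort_values(ascending=False)[:10]

print(top)
```

The titles whose ratings disagree the most come first, and a title whose ratings are all identical ends up last with a standard deviation of 0.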

The results match the output shown in the book, so the replacement can be used with confidence.

The differences between versions of a third-party library can be quite significant. I recommend using the latest version and consulting the official API documentation while working, which makes it easy to solve problems like these.

This article is an English version of an article originally written in Chinese on aliyun.com.
