Python for Data Analysis study notes (1)

Source: Internet
Author: User

This section describes how to process the MovieLens 1M dataset. The book introduces this dataset from GroupLens Research (http://www.groupLens.org/node/73); the link redirects to a page where the 1M dataset can also be found.

The downloaded and decompressed folder is as follows:

All three .dat tables are used in the example. The Chinese edition of Python for Data Analysis (PDF) that I read is the first edition, from 2014. All of its examples are based on Python 2.7 and pandas 0.8.2, whereas I installed Python 3.5.2 and pandas 0.20.2. Some functions and methods differ considerably between these versions: some parameters were renamed, and some functions from earlier versions were removed. As a result, the sample code in the book produces errors and warnings when I run it. The following problems came up in my environment when testing the MovieLens 1M dataset code.
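
Since every problem below comes from a version mismatch, it can help to confirm which versions are actually installed before comparing against the book. The following is a minimal sketch; the variable names are only illustrative.

```python
import sys

import pandas as pd

# Report the interpreter and pandas versions in use, so that differences
# from the book's environment (Python 2.7, pandas 0.8.2) are easy to spot.
python_version = "{}.{}.{}".format(*sys.version_info[:3])
pandas_version = pd.__version__

print("Python:", python_version)
print("pandas:", pandas_version)
```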

  • To read the .dat files into pandas DataFrame objects, the book provides the following code:

    import pandas as pd

    # Column names for users.dat, as listed in the book
    unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
    users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames)

    rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames)

    mnames = ['movie_id', 'title', 'genres']
    movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames)

    If you run this code as is, the following warnings appear:

    F:/python/HelloWorld/DataAnalysisByPython-1.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
      users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames)
    F:/python/HelloWorld/DataAnalysisByPython-1.py:7: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
      ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames)
    F:/python/HelloWorld/DataAnalysisByPython-1.py:10: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
      movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames)

    The code still runs, but as a perfectionist I wanted to get rid of the warning. It says that the 'c' engine does not support separators longer than one character (they are interpreted as regular expressions), so pandas falls back to the 'python' engine. The pandas.read_table method has an engine parameter that sets the parsing engine; the available options are 'c' and 'python'. Since the 'c' engine does not support this separator, we only need to set engine to 'python' explicitly:

    users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')

    rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')

    mnames = ['movie_id', 'title', 'genres']
    movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')

     

  • The pivot_table method is used on the merged data to compute the mean rating of each movie, grouped by gender. The code in the book is as follows:

    mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')

    Running this code as is raises an error, so it cannot be used unchanged:

    Traceback (most recent call last):
      File "F:/python/HelloWorld/DataAnalysisByPython-1.py", line 19, in <module>
        mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
    TypeError: pivot_table() got an unexpected keyword argument 'rows'

    The TypeError indicates that 'rows' is not a keyword argument accepted by the method. Is that really the case? Checking the pandas API reference on the official website (http://pandas.pydata.org/pandas-docs/stable/api.html) shows that in pandas 0.20.2 the keyword arguments of pivot_table differ from those used in the book. To get the same result, replace rows with index. There is no cols parameter either; replace it with columns.

    mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

     

  • To find the movies that female viewers rated highest, the DataFrame is sorted by the F column in descending order. The sample code in the book is as follows:

    top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)

    This only produces a warning, which does not stop the program:

    F:/python/HelloWorld/DataAnalysisByPython-1.py:32: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
      top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)

    The warning says that the by argument of sort_index is deprecated and may stop working in a future version of the library, and it recommends sort_values instead. In the API reference, DataFrame.sort_index is described as "Sort object by labels (along an axis)", while DataFrame.sort_values is described as "Sort by the values along either axis". Since sorting by the F column is a sort by values, sort_values gives the same result here. The rating disagreement section of the book also uses sort_index with by, and it can be replaced with sort_values in the same way.

    top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

     

  • The last error is also related to sorting. In the rating disagreement section, after computing the standard deviation of the ratings for each title and filtering the result, the Series is sorted by value in descending order. The code in the book is:

    print(rating_std_by_title.order(ascending=False)[:10])

    The error here is:

    Traceback (most recent call last):
      File "F:/python/HelloWorld/DataAnalysisByPython-1.py", line 47, in <module>
        print(rating_std_by_title.order(ascending=False)[:10])
      File "E:\Program Files\Python35\lib\site-packages\pandas\core\generic.py", line 2970, in __getattr__
        return object.__getattribute__(self, name)
    AttributeError: 'Series' object has no attribute 'order'

    The order method is no longer available, so an alternative has to be found in the API documentation. Series offers the same two methods as DataFrame, sort_index and sort_values. To be safe, I chose sort_values:

    print(rating_std_by_title.sort_values(ascending=False)[:10])

    The results match those shown in the book.
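
The four fixes above can be tried together without downloading the dataset. The following is a minimal sketch, assuming a recent pandas: the sample rows are made up (they are not real MovieLens data and use fewer columns than the real files), and the helper read_dat is a hypothetical name introduced only for this illustration.

```python
import io

import pandas as pd

# Made-up sample rows in the '::'-separated layout (NOT real MovieLens data).
ratings_text = "1::10::5::0\n2::10::3::0\n3::20::4::0\n4::20::1::0\n"
users_text = "1::F\n2::M\n3::F\n4::M\n"
movies_text = "10::Movie A\n20::Movie B\n"


def read_dat(text, names):
    """Hypothetical helper: parse '::'-separated text into a DataFrame."""
    # engine='python' is given explicitly because '::' is a
    # multi-character separator that the 'c' engine does not support.
    return pd.read_table(io.StringIO(text), sep="::", header=None,
                         names=names, engine="python")


ratings = read_dat(ratings_text, ["user_id", "movie_id", "rating", "timestamp"])
users = read_dat(users_text, ["user_id", "gender"])
movies = read_dat(movies_text, ["movie_id", "title"])

# Merge the three tables on their shared key columns.
data = pd.merge(pd.merge(ratings, users), movies)

# index/columns replace the old rows/cols keywords.
mean_ratings = data.pivot_table(values="rating", index="title",
                                columns="gender", aggfunc="mean")

# sort_values(by=...) replaces sort_index(by=...) on a DataFrame.
top_female_ratings = mean_ratings.sort_values(by="F", ascending=False)

# sort_values() replaces the removed Series.order().
rating_std_by_title = data.groupby("title")["rating"].std()
most_divisive = rating_std_by_title.sort_values(ascending=False).head(10)

print(top_female_ratings)
print(most_divisive)
```

With the real files, the same calls apply once the in-memory strings are replaced by the 'ml-1m/...' paths and the full column lists from the book.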

The differences between versions of third-party libraries can be significant. I recommend using the latest version and consulting the API documentation on the official website, which makes problems like these easy to solve.
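
One way to notice such version differences early is to record the warnings a snippet emits, using the standard warnings module. The following is a minimal sketch rather than a definitive recipe: the sample rows are made up, the function name parser_warnings is hypothetical, and whether the first call warns at all depends on the installed pandas version.

```python
import io
import warnings

import pandas as pd

sample = "1::F\n2::M\n"  # made-up rows, not real MovieLens data


def parser_warnings(**kwargs):
    """Return the ParserWarning messages emitted while reading the sample."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        pd.read_table(io.StringIO(sample), sep="::", header=None,
                      names=["user_id", "gender"], **kwargs)
    return [str(w.message) for w in caught
            if issubclass(w.category, pd.errors.ParserWarning)]


# Without an explicit engine, pandas may fall back to 'python' and warn.
without_engine = parser_warnings()
# With engine='python' stated explicitly, no fallback warning is expected.
with_engine = parser_warnings(engine="python")

print(len(without_engine), len(with_engine))
```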
