Pandas Quick Start (3)
This section introduces the main pandas data structures. It is based on: https://www.dataquest.io/mission/146/pandas-internals-series
The data used in this article comes from: https://github.com/fivethirtyeight/data/tree/master/fandango
The data mainly describes the Rotten Tomatoes ratings of a number of movies.
Data Structure
Pandas has three important data structures:
- Series (a set of values)
- DataFrame (a set of Series)
- Panel (a set of DataFrames; deprecated in pandas 0.20 and removed in 0.25)
A pandas Series is an enhanced version of the NumPy array. A NumPy array can only be indexed by integer position, but a Series can also be indexed by labels such as strings. A Series can also hold mixed data types and uses NaN to indicate missing values. A Series object can contain the following data types:
- float -- floating-point value
- int -- integer value
- bool -- Boolean value
- datetime64[ns] -- date and time (without time zone)
- datetime64[ns, tz] -- date and time (with time zone)
- timedelta64[ns] -- a difference between two times (in minutes, seconds, etc.)
- category -- categorical value
- object -- string value
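As a minimal sketch of the points above, the following builds a Series with string labels and a missing value, then inspects its dtype. The titles and scores are invented for illustration and are not taken from the Fandango data.

```python
import numpy as np
import pandas as pd

# Toy scores keyed by made-up movie titles (not taken from the Fandango data).
scores = pd.Series([74.0, np.nan, 85.0], index=["Movie A", "Movie B", "Movie C"])

print(scores["Movie C"])      # label-based lookup
print(scores.iloc[0])         # position-based lookup still works
print(scores.isna().sum())    # number of missing values
print(scores.dtype)           # float64, because NaN is a float

# Mixed value types fall back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```

Note that a single NaN is enough to make an otherwise integer column float, which is why missing scores often show up as floats.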
A DataFrame uses a Series object to represent the data in each column. Therefore, when you select a single column from a DataFrame, pandas returns a Series object representing that column. By default, the rows of the Series are indexed by integers starting from 0. You can also use slices to select multiple rows.
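The relationship between a DataFrame and its columns can be checked on a small table. This is a sketch only: the DataFrame below is a hypothetical stand-in for the real file, with invented titles and scores.

```python
import pandas as pd

# Hypothetical stand-in for the real table; titles and scores are invented.
df = pd.DataFrame({
    "FILM": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "RottenTomatoes": [74, 85, 80, 18],
})

column = df["FILM"]            # a single column comes back as a Series
print(type(column).__name__)
print(list(column.index))      # default integer index starting at 0

first_two = column[:2]         # slice covering rows 0 and 1
print(first_two.tolist())
```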
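# Select the FILM and RottenTomatoes columns and print the first five rows of each
import pandas as pd
from pandas import Series
fandango = pd.read_csv('fandango_score_comparison.csv')  # assumed file name from the repository above
series_film = fandango['FILM']
print(series_film.head(5))
series_rt = fandango['RottenTomatoes']
print(series_rt[:5])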
print(fandango[fandango['FILM'] == 'Minions (2015)']['RottenTomatoes'].values[0])
print(fandango[fandango['FILM'] == 'Leviathan (2014)']['RottenTomatoes'].values[0])
# Writing one statement per movie is tedious.
# A better way is to combine series_film and series_rt into a new Series that uses movie names as the index and scores as the values, which makes it easier to query multiple movies.
film_names = series_film.values
rt_scores = series_rt.values
series_custom = Series(rt_scores, index=film_names)  # pass the data and the index parameter to create the Series
# With this Series, querying several movies at once is easy.
series_custom[['Minions (2015)', 'Leviathan (2014)']]
# Use sort_index() to sort the Series alphabetically by movie name, and sort_values() to sort it by score.
sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values()
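The difference between the two sort methods is easiest to see on a tiny Series. The sketch below uses invented titles and scores, chosen so that the two orderings differ.

```python
import pandas as pd

# Invented titles and scores, used only to show the ordering.
toy = pd.Series([54, 99, 18], index=["Movie C", "Movie A", "Movie B"])

by_title = toy.sort_index()    # alphabetical by index label
by_score = toy.sort_values()   # ascending by value

print(by_title.index.tolist())
print(by_score.index.tolist())

# A list of labels returns a smaller Series in the requested order.
print(toy[["Movie B", "Movie C"]].tolist())
```

Both methods return a new Series and leave the original unchanged.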
# Arithmetic operations (addition, subtraction, multiplication, division) can be applied to a Series.
series_custom / 10  # divides every value in the Series by 10; the index is not affected
# NumPy functions can also be applied to a Series.
import numpy as np
np.max(series_custom)  # the highest score of any movie
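A short check, on invented values, that arithmetic touches only the values and that NumPy functions accept a Series directly. The idxmax() call is an addition not covered in the text above; it returns the label of the maximum rather than the value.

```python
import numpy as np
import pandas as pd

toy = pd.Series([50, 80, 100], index=["Movie A", "Movie B", "Movie C"])  # invented values

scaled = toy / 10              # divides every value; labels are unchanged
print(scaled.tolist())
print(scaled.index.equals(toy.index))

print(np.max(toy))             # same result as toy.max()
print(toy.idxmax())            # label of the highest score
```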
You can also compare and filter:
series_custom > 50  # returns a Series of Boolean values: True where the score is greater than 50
# The Boolean Series can be used to filter the data
series_greater_than_50 = series_custom[series_custom > 50]
# Use & (and) and | (or) to combine several conditions
series_greater_than_50_and_less_than_80 = series_custom[(series_custom > 50) & (series_custom < 80)]
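The following sketch shows the intermediate Boolean mask explicitly, using invented values. Each comparison needs its own parentheses because & and | bind more tightly than > and < in Python.

```python
import pandas as pd

toy = pd.Series([30, 60, 90], index=["Movie A", "Movie B", "Movie C"])  # invented values

mask = (toy > 50) & (toy < 80)   # parentheses are required around each comparison
print(mask.tolist())
print(toy[mask].index.tolist())

either = toy[(toy < 40) | (toy > 80)]   # | keeps rows matching either condition
print(either.index.tolist())
```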
Of course, you can perform operations on two Series directly.
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])  # critic rating
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])  # user rating
rt_mean = (rt_critics + rt_users) / 2  # average score
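Operations between two Series match values by index label, not by position. The sketch below illustrates this with invented ratings: the two Series list the movies in different orders, and one movie is missing from the second.

```python
import pandas as pd

# Invented ratings; the two Series list the movies in different orders,
# and "Movie C" has no user rating.
critics = pd.Series([80, 40, 60], index=["Movie A", "Movie B", "Movie C"])
users = pd.Series([50, 90], index=["Movie B", "Movie A"])

mean = (critics + users) / 2
print(mean)
```

A label present in only one of the two Series produces NaN in the result, so it is worth checking for missing values after combining Series built from different sources.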