This section introduces the data structures of pandas. It is based on: https://www.dataquest.io/mission/146/pandas-internals-series
The data used in this article comes from: https://github.com/fivethirtyeight/data/tree/master/fandango
The data set contains film ratings, including each film's Rotten Tomatoes scores.
Data structures
There are three major data structures in pandas:
- Series (a collection of values)
- DataFrame (a collection of Series)
- Panel (a collection of DataFrames; deprecated and removed in pandas 0.25)
A pandas Series is an enhanced version of a NumPy array. A NumPy array can only be indexed by integer position, whereas a Series can also be indexed by labels such as strings, can hold mixed data types, and uses NaN to represent missing values. A Series object can hold the following data types:
- float: a floating-point value
- int: an integer value
- bool: a Boolean value
- datetime64[ns]: a date and time (without a time zone)
- datetime64[ns, tz]: a date and time (with a time zone)
- timedelta[ns]: a difference between two times (a duration in minutes, seconds, etc.)
- category: a categorical value
- object: a string value
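The differences from a plain NumPy array can be tried out with a small sketch. The film labels and scores below are made up for illustration and are not taken from the Fandango data set.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, not taken from the Fandango data set
scores = pd.Series([74.0, np.nan, 99.0], index=["Film A", "Film B", "Film C"])

print(scores["Film C"])      # label-based lookup
print(scores.iloc[0])        # positional lookup still works
print(scores.dtype)          # float64
print(scores.isna().sum())   # number of missing values

# Mixing strings and numbers falls back to the generic object dtype
mixed = pd.Series(["Film A", 74, 3.5])
print(mixed.dtype)           # object
```

NaN is itself a floating-point value, which is why the Series with a missing entry has the float64 dtype.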
A DataFrame uses a Series object to represent the data in each column, so when you select a column from a DataFrame, pandas returns the Series object for that column. By default the rows of the Series are indexed with integers starting at 0, and you can use slices to select multiple rows.
# Select the FILM and RottenTomatoes columns and print the first 5 rows of each
import pandas as pd
fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']
print(series_film.head(5))
series_rt = fandango['RottenTomatoes']
print(series_rt[:5])
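If the CSV file is not at hand, the same behavior can be checked on a small in-memory DataFrame. This is only a sketch: `fandango_demo` and its values are hypothetical stand-ins, not rows from the real file.

```python
import pandas as pd

# Tiny stand-in for fandango_score_comparison.csv; the values are made up
fandango_demo = pd.DataFrame({
    "FILM": ["Film A", "Film B", "Film C", "Film D"],
    "RottenTomatoes": [74, 85, 80, 18],
})

film_column = fandango_demo["FILM"]
print(type(film_column).__name__)   # the selected column is a Series
print(list(film_column.index))      # default integer index starting at 0

first_two = fandango_demo["RottenTomatoes"][:2]   # slice selects the first two rows
print(first_two.tolist())
```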
Custom Indexes
The two Series above hold the film names (series_film) and the scores (series_rt). Suppose I now want to know the scores of two films, Minions (2015) and Leviathan (2014). The simplest way is:
print(fandango[fandango['FILM'] == 'Minions (2015)']['RottenTomatoes'].values[0])
print(fandango[fandango['FILM'] == 'Leviathan (2014)']['RottenTomatoes'].values[0])
# Writing one statement per film is tedious.
# A better way is to combine series_film and series_rt into a new Series,
# with the film name as the index and the score as the value,
# so that several films can be queried at once.
from pandas import Series
film_names = series_film.values
rt_scores = series_rt.values
series_custom = Series(rt_scores, index=film_names)  # pass the data and the index
# Querying several films is now easy
series_custom[['Minions (2015)', 'Leviathan (2014)']]
# Sort the Series by score
series_custom.sort_values()
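The custom-index idea can be sketched on a few made-up films. The names and scores below are hypothetical, and `sort_index` is shown alongside `sort_values` only as a related call that orders by label instead of by value.

```python
import pandas as pd

# Hypothetical films and scores, for illustration only
film_names = ["Film C", "Film A", "Film B"]
rt_scores = [80, 74, 85]
series_demo = pd.Series(rt_scores, index=film_names)

print(series_demo["Film A"])                       # single label gives a scalar
print(series_demo[["Film A", "Film B"]].tolist())  # list of labels gives a Series

by_score = series_demo.sort_values()   # ordered by value
by_name = series_demo.sort_index()     # ordered by label
print(by_score.index.tolist())
print(by_name.index.tolist())
```

Both sorting methods return a new Series and leave the original untouched.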
Vectorized operations
When you operate on a column of a data set, a Series can apply the operation to every value in the column at once (vectorization). pandas is built on NumPy, and NumPy runs the loop over the column's values in C, so it is very fast. If you instead loop over a Series with an explicit Python for loop, it becomes much slower.
Examples of vectorized operations
# Perform a division on a Series
series_custom / 10
# This divides every value in series_custom by 10; the index is not affected.
# You can also apply NumPy functions to a Series
import numpy as np
# Find the highest film score
np.max(series_custom)
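To make the contrast with an explicit loop concrete, here is a minimal sketch on made-up scores. It only checks that both approaches give the same result; it does not measure speed, which could be compared separately with the standard `timeit` module.

```python
import numpy as np
import pandas as pd

# Hypothetical scores from 0 to 99
scores = pd.Series(np.arange(0, 100, dtype="float64"))

# Vectorized: one expression, the loop runs inside NumPy
scaled_vectorized = scores / 10

# Explicit Python loop: same result, computed one element at a time
scaled_loop = pd.Series([value / 10 for value in scores], index=scores.index)

print(scaled_vectorized.equals(scaled_loop))
print(np.max(scores), scores.max())
```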
You can also compare and filter:
# Returns a Series of Boolean values: True where the score is greater than 50
series_custom > 50
# The Boolean Series can be used to filter the data
series_greater_than_50 = series_custom[series_custom > 50]
# Several conditions can be combined with & (and) and | (or)
series_greater_than_50_and_less_than_80 = series_custom[(series_custom > 50) & (series_custom < 80)]
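The filtering steps can be followed on a small made-up Series. The labels and values are hypothetical; the point is that the comparison produces a Boolean Series with the same index, which then selects the matching rows.

```python
import pandas as pd

# Hypothetical scores, for illustration only
scores = pd.Series([18, 55, 74, 85], index=["Film D", "Film E", "Film A", "Film B"])

mask = scores > 50                      # Boolean Series, same index as scores
print(mask.tolist())

above_50 = scores[mask]
between_50_and_80 = scores[(scores > 50) & (scores < 80)]   # parentheses are required
outside_range = scores[(scores <= 50) | (scores >= 80)]

print(above_50.index.tolist())
print(between_50_and_80.index.tolist())
print(outside_range.index.tolist())
```

The parentheses around each comparison matter because `&` and `|` bind more tightly than `>` and `<` in Python.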
You can also operate directly on two Series:
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])  # critic ratings
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])  # user ratings
# Average score
rt_mean = (rt_critics + rt_users) / 2
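One detail worth knowing when combining two Series is that pandas matches values by index label, not by position. The sketch below uses made-up critic and user scores whose labels only partly overlap, to show what happens in that case.

```python
import pandas as pd

# Hypothetical critic and user scores; note the different order and labels
critics = pd.Series([74, 85, 80], index=["Film A", "Film B", "Film C"])
users = pd.Series([90, 86, 70], index=["Film C", "Film A", "Film D"])

# Values are paired by label; labels present in only one Series give NaN
mean_score = (critics + users) / 2
print(mean_score)
```

In the Fandango example both Series share the same FILM index, so every film gets an average.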