A Simple Introduction to Pandas (III)

Source: Internet
Author: User

This section mainly introduces the data structures of pandas. It draws on: https://www.dataquest.io/mission/146/pandas-internals-series

The data used in this article comes from: https://github.com/fivethirtyeight/data/tree/master/fandango

The dataset mainly records how a number of films were scored on Rotten Tomatoes.

Data Structures

There are three major data structures in pandas:

    • Series (a collection of values)
    • DataFrame (a collection of Series)
    • Panel (a collection of DataFrames; deprecated and removed as of pandas 0.25)

A pandas Series is an enhanced version of a NumPy array. A NumPy array can only be indexed by integer position, whereas a Series can also be indexed by strings, can hold mixed data types, and uses NaN to represent missing values. A Series object can hold the following data types:

    • float--represents floating-point values
    • int--represents integer values
    • bool--represents Boolean values
    • datetime64[ns]--represents a date and time (without time zone)
    • datetime64[ns, tz]--represents a date and time (with time zone)
    • timedelta[ns]--represents a duration, i.e. the difference between two times (in days, minutes, seconds, etc.)
    • category--represents categorical values
    • object--represents string values
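As a small illustration of the points above, the following sketch builds a few Series by hand. The labels and values are invented for the example and do not come from the Fandango data.

```python
import numpy as np
import pandas as pd

# A Series indexed by strings instead of integer positions
# (labels and values are invented for illustration)
scores = pd.Series([7.5, np.nan, 9.0], index=["a", "b", "c"])
print(scores["c"])           # look up a value by its label
print(scores.dtype)          # a float dtype; NaN marks the missing value
print(scores.isna().sum())   # count the missing values

# Mixed Python values fall back to the object dtype
mixed = pd.Series(["text", 1, 2.5])
print(mixed.dtype)

# A few of the other dtypes from the list above
flags = pd.Series([True, False])
dates = pd.Series(pd.to_datetime(["2015-01-01", "2015-06-30"]))
gaps = dates - dates.iloc[0]                      # a timedelta dtype
kinds = pd.Series(["x", "y", "x"], dtype="category")
print(flags.dtype, dates.dtype, gaps.dtype, kinds.dtype)
```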

A DataFrame uses a Series object to represent the data in each column, so when you select a column from a DataFrame, pandas returns the Series object for that column. By default the rows of the Series are indexed by integers starting at 0, and you can use slices to select multiple rows.

    # Select the FILM and RottenTomatoes columns and print the first 5 rows of each
    import pandas as pd

    fandango = pd.read_csv('fandango_score_comparison.csv')
    series_film = fandango['FILM']
    print(series_film.head(5))
    series_rt = fandango['RottenTomatoes']
    print(series_rt[:5])
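If the CSV file is not at hand, the same behaviour can be tried on a small hand-built DataFrame. This is only a sketch: the film names and scores below are invented stand-ins, not values from the Fandango data.

```python
import pandas as pd

# A tiny stand-in for the CSV; film names and scores are invented
df = pd.DataFrame({
    "FILM": ["Film A (2015)", "Film B (2014)", "Film C (2015)"],
    "RottenTomatoes": [74, 85, 60],
})

col = df["RottenTomatoes"]     # selecting one column returns a Series
print(type(col).__name__)
print(list(col.index))         # default integer index starting at 0
print(col[:2])                 # slice to select the first two rows
```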


Custom Indexes

In the two Series above, series_film holds the film names and series_rt holds the scores. Suppose I now want the scores of two films, Minions (2015) and Leviathan (2014). The most direct way to get them is:

    print(fandango[fandango['FILM'] == 'Minions (2015)']['RottenTomatoes'].values[0])
    print(fandango[fandango['FILM'] == 'Leviathan (2014)']['RottenTomatoes'].values[0])

    # Writing a statement like this for every film is tedious.
    # A better way is to combine series_film and series_rt into a new Series,
    # with the film name as the index and the score as the value,
    # so that several films can be queried conveniently.
    from pandas import Series

    film_names = series_film.values
    rt_scores = series_rt.values
    # To create a Series, pass the data and the index parameter
    series_custom = Series(rt_scores, index=film_names)

    # Querying several films is now easy
    series_custom[['Minions (2015)', 'Leviathan (2014)']]
    # The Series can also be sorted by its values
    series_custom.sort_values()
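The idea of a custom index can be sketched without the CSV as follows. The names and scores are invented placeholders for series_film and series_rt; sort_index is included as a companion to sort_values and is not taken from the original text.

```python
import pandas as pd

# Invented names and scores standing in for series_film / series_rt
film_names = ["Film C (2015)", "Film A (2015)", "Film B (2014)"]
rt_scores = [60, 74, 85]
series_custom = pd.Series(rt_scores, index=film_names)

# Look up one film, or several at once, by label
print(series_custom["Film A (2015)"])
print(series_custom[["Film A (2015)", "Film B (2014)"]])

# Sort by score, or by film name
by_score = series_custom.sort_values()
by_name = series_custom.sort_index()
print(by_score)
print(by_name)
```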

Vectorized Operations

When you want to operate on a column of a dataset, a Series object lets you do so quickly through vectorization (the operation is applied automatically to every value in the column). pandas is built on NumPy, and NumPy loops over the values of an entire column in C, so the operation runs very fast. If you instead loop over a Series with an explicit Python for loop, it becomes much slower.
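A minimal sketch of the two styles, using invented values: both produce the same result, but the loop runs one interpreted step per element while the vectorized form runs a single expression. To compare speed on your own machine, each expression could be timed with the standard timeit module.

```python
import numpy as np
import pandas as pd

# 100,000 invented scores between 0 and 100
values = pd.Series(np.arange(100_000) % 101)

# Explicit Python loop: one interpreted step per element
looped = pd.Series([v / 10 for v in values])

# Vectorized: a single expression, the loop runs inside NumPy
vectorized = values / 10

# Both approaches give the same numbers
print(np.allclose(looped, vectorized))
```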

Examples of vectorized operations:

    import numpy as np

    # Perform a division on a Series
    series_custom / 10
    # This statement divides every value in series_custom by 10; the index is not affected.
    # NumPy functions can also be applied to a Series, for example to find the highest film score
    np.max(series_custom)
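The following self-contained sketch shows the same operations on invented scores. np.sqrt is used here only as an example of an element-wise NumPy function; it does not appear in the original text.

```python
import numpy as np
import pandas as pd

# Invented scores; not taken from the Fandango data
series_custom = pd.Series(
    [60, 74, 85],
    index=["Film C (2015)", "Film A (2015)", "Film B (2014)"],
)

scaled = series_custom / 10        # divides every value; the index is unchanged
roots = np.sqrt(series_custom)     # element-wise NumPy function, returns a Series
highest = np.max(series_custom)    # reduces the Series to a single value

print(scaled)
print(roots)
print(highest)
```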

You can also compare and filter:

    # series_custom > 50 returns a Boolean Series (True where the score is greater than 50),
    # which can be used to filter the data
    series_greater_than_50 = series_custom[series_custom > 50]
    # Several conditions can be combined with & (and) and | (or)
    series_greater_than_50_and_less_than_80 = series_custom[(series_custom > 50) & (series_custom < 80)]
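A short sketch of filtering on invented scores. Each comparison is wrapped in parentheses because & and | bind more tightly than > and < in Python.

```python
import pandas as pd

# Invented scores; not taken from the Fandango data
series_custom = pd.Series(
    [45, 60, 74, 85],
    index=["Film D", "Film C", "Film A", "Film B"],
)

mask = series_custom > 50                  # Boolean Series, one entry per film
above_50 = series_custom[mask]             # keep only the rows where mask is True
between = series_custom[(series_custom > 50) & (series_custom < 80)]
outside = series_custom[(series_custom <= 50) | (series_custom >= 80)]

print(mask)
print(above_50)
print(between)
print(outside)
```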

Of course, you can also operate directly on two Series:

    # Critic ratings
    rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
    # User ratings
    rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
    # Average score
    rt_mean = (rt_critics + rt_users) / 2
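One detail worth seeing in isolation is that arithmetic between two Series matches values by index label, not by position. In the sketch below (invented films and scores) the two Series list the films in a different order and do not cover the same films; labels present in only one of them produce NaN.

```python
import pandas as pd

# Invented ratings; the two Series differ in order and in coverage
rt_critics = pd.Series([74, 85, 60], index=["Film A", "Film B", "Film C"])
rt_users = pd.Series([90, 70, 80], index=["Film B", "Film A", "Film D"])

# Values are paired up by label before the arithmetic is applied
rt_mean = (rt_critics + rt_users) / 2
print(rt_mean)

# Drop the films that were missing from one of the two Series
print(rt_mean.dropna())
```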
