Data source acquisition:
https://www.kaggle.com/datasets
1.
Look at some basic stats for the 'imdb_score' column:
data.imdb_score.describe()
Select a column:
data['movie_title']
Select the first 10 rows of a column:
data['duration'][:10]
Select multiple columns:
data[['budget', 'gross']]
Select all movies over two hours long:
data[data['duration'] > 120]
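The selections above can be tried on a small frame before running them on the real file. The following is a minimal sketch; the toy DataFrame and all of its values are made up for illustration and only reuse the column names mentioned above.

```python
import pandas as pd

# Hypothetical toy data standing in for the movie dataset; values are invented.
data = pd.DataFrame({
    "movie_title": ["Alpha", "Beta", "Gamma"],
    "duration": [95, 130, 150],
    "budget": [1.0e6, 2.5e6, 4.0e6],
    "gross": [3.0e6, 1.5e6, 9.0e6],
    "imdb_score": [6.1, 7.4, 8.0],
})

# Summary statistics (count, mean, std, min, quartiles, max) for one column.
stats = data.imdb_score.describe()

# A single column name returns a Series.
titles = data["movie_title"]

# Slice the first rows of a column (2 here, because the toy frame is tiny).
first_two = data["duration"][:2]

# A list of column names returns a DataFrame.
money = data[["budget", "gross"]]

# A boolean mask keeps only the rows where the condition is True.
long_movies = data[data["duration"] > 120]
```

Note that attribute access such as data.imdb_score only works when the column name is a valid Python identifier; the bracket form data['imdb_score'] works for any name.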
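Fill missing values in the 'country' column with an empty string:
data.country = data.country.fillna('')
Fill missing values in the 'duration' column with the column mean:
data.duration = data.duration.fillna(data.duration.mean())
Read the 'title_year' column as a string when loading the file:
data = pd.read_csv('movie_metadata.csv', dtype={'title_year': str})
Convert the movie titles to upper case:
data['movie_title'].str.upper()
Similarly, to get rid of leading and trailing whitespace:
data['movie_title'].str.strip()
Rename columns:
data = data.rename(columns={'title_year': 'release_date', 'movie_facebook_likes': 'facebook_likes'})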
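As a rough end-to-end sketch of these cleaning steps, the example below reads a tiny in-memory CSV instead of movie_metadata.csv. The CSV text and its values are invented for illustration; only the column names come from the calls above.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for movie_metadata.csv; values are invented.
csv_text = (
    "movie_title,country,duration,title_year,movie_facebook_likes\n"
    "  Alpha ,USA,100,2009,10\n"
    "Beta,,,2012,20\n"
    "Gamma,UK,140,,30\n"
)

# dtype keys are column names, so they must be quoted strings.
data = pd.read_csv(io.StringIO(csv_text), dtype={"title_year": str})

# Replace missing countries with an empty string.
data["country"] = data["country"].fillna("")

# Replace missing durations with the mean of the known durations.
data["duration"] = data["duration"].fillna(data["duration"].mean())

# String methods return a new Series, so assign the result back to keep it.
data["movie_title"] = data["movie_title"].str.strip().str.upper()

# rename also returns a new DataFrame rather than modifying in place.
data = data.rename(columns={"title_year": "release_date",
                            "movie_facebook_likes": "facebook_likes"})
```

Calling .str.upper() or .str.strip() without assigning the result leaves the original column unchanged, which is why the sketch writes the cleaned titles back into the frame.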
Discard every row that contains any NaN:
data.dropna()
Discard only the rows in which all elements are NaN:
data.dropna(how='all')
Discard the columns in which all elements are NaN (axis=0 means rows, axis=1 means columns):
data.dropna(axis=1, how='all')
Retain only rows with at least three non-NaN values:
data.dropna(thresh=3)
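To make the difference between these dropna variants concrete, here is a small sketch on a made-up frame. The name df and all values are hypothetical; each call returns a new DataFrame and leaves df itself untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column "d" is entirely NaN, row 2 is entirely NaN.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [2.0, 5.0, np.nan, 6.0],
    "c": [3.0, np.nan, np.nan, 7.0],
    "d": [np.nan, np.nan, np.nan, np.nan],
})

# Default: drop every row containing at least one NaN.
# Because column "d" is all NaN, no row survives here.
any_nan = df.dropna()

# Drop only the rows in which every value is NaN (row 2).
all_nan_rows = df.dropna(how="all")

# axis=1 switches to columns: drop columns in which every value is NaN ("d").
all_nan_cols = df.dropna(axis=1, how="all")

# Keep rows that have at least three non-NaN values (rows 0 and 3).
at_least_three = df.dropna(thresh=3)
```

In practice it often helps to drop all-NaN columns first, since a single empty column makes the default dropna() discard every row, as it does in this toy frame.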
Pandas common data cleansing (1)