Original English: 01-lesson
Create Data - We start by creating our own dataset. This way, readers do not need to download any files to reproduce the results. We will export this data to a text file so that you can practice reading it back in.
Get Data - We will learn how to read data from a text file. The data consists of baby names and the number of babies born with each name in 1880.
Prepare Data - Here we look at the data and make sure it is clean, meaning we inspect the contents of the file for anomalies. These may include missing data, inconsistent data, or values that look out of place. If we find any, we then decide what to do with those records.
Analyze Data - We will simply find the most popular name in a given year.
Present Data - Through tabular data and a graph, we clearly show the end user the most popular name in a given year.
The pandas library is used for every section of this tutorial except for a small part of the data presentation. matplotlib is needed only in the Present Data section. The first step of the lesson is to import the required libraries.
# Import all required libraries
# The general way to import a specific function from a library:
## from (library) import (specific library function)
from pandas import DataFrame, read_csv
# The general way to import a library:
## import (library) as (give the library a nickname/alias)
import matplotlib.pyplot as plt
import pandas as pd  # the usual way to import pandas
import sys  # imported only to check the Python version
import matplotlib  # imported this way only to display its version number
# Enable inline plotting in the notebook
%matplotlib inline
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('matplotlib version ' + matplotlib.__version__)
Python version 3.6.1 | Packaged by Conda-forge | (Default, Mar 2017, 21:57:00)
[GCC 4.2.1 compatible Apple LLVM 6.1.0 (clang-602.0.53)]
Pandas version 0.19.2
matplotlib version 2.0.2
Creating Data
This simple dataset consists of five baby names and the number of births recorded for each name in 1880.
# The initial dataset: baby names and number of births
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
Use the zip function to merge the two lists together.
# Check the zip function's help
zip?
BabyDataSet = list(zip(names, births))
BabyDataSet
[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]
We have completed the creation of a basic dataset. We now use Pandas to export this data to a CSV file.
df is a DataFrame object. You can think of this object as holding the contents of BabyDataSet in a format similar to a SQL table or an Excel spreadsheet. Let's look at the contents of df.
df = pd.DataFrame(data=BabyDataSet, columns=['Names', 'births'])
df
     Names  births
0      Bob     968
1  Jessica     155
2     Mary      77
3     John     578
4      Mel     973
Export the dataframe to a CSV file. We name the exported file births1880.csv. The function that exports a CSV file is to_csv. Unless you specify a different directory, the file is saved in the same location as the notebook.
# Check the help for to_csv
df.to_csv?
The only parameters we will use are index and header. Setting both to False prevents the index and the column names (header) from being exported to the file. Try changing the values of these two parameters to get a better understanding of what they do.
df.to_csv('births1880.csv', index=False, header=False)
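As a rough illustration of what index and header control, the sketch below relies on the fact that to_csv returns the CSV text as a string when no file path is given. The two-row dataframe and the variable names with_both and without_both are made up for this example.

```python
import pandas as pd

# A shortened version of the tutorial's dataset (illustrative only)
df = pd.DataFrame({'Names': ['Bob', 'Jessica'], 'births': [968, 155]})

# With no path argument, to_csv returns the CSV text instead of writing a file
with_both = df.to_csv()                              # index and header are written
without_both = df.to_csv(index=False, header=False)  # only the data rows are written

print(with_both)
print(without_both)
```

Comparing the two printed results shows the extra first row (column names) and the extra first column (index labels) that disappear when both parameters are set to False.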
Get Data
We'll use the pandas read_csv function to get the data from the CSV file. Let's take a look at the help for this function and the parameters it takes.
read_csv?
This function has a lot of parameters, but we only need the location of the file at the moment.
Note: depending on where you keep the notebook, you may need to modify the location of the file.
Location = r'./births1880.csv'  # read the CSV file from the notebook's current location
df = pd.read_csv(Location)
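Note the r before the string. In a normal string, the backslash is a special character that starts an escape sequence. The leading r makes the string a raw string, so backslashes are kept literally. This particular path uses forward slashes, so the r prefix makes no difference here, but it is a useful habit for Windows-style paths.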
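A small sketch of the difference between a normal and a raw string. The Windows-style path is hypothetical and chosen only because it contains the escape sequences \n and \b.

```python
# Hypothetical Windows-style path, used only to show the effect of the r prefix
normal = 'C:\new\births1880.csv'   # \n becomes a newline, \b becomes a backspace
raw = r'C:\new\births1880.csv'     # backslashes are kept as literal characters

print(repr(normal))
print(repr(raw))

# With forward slashes there is nothing to escape, so both forms are identical
same = (r'./births1880.csv' == './births1880.csv')
print(same)
```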
df
       Bob  968
0  Jessica  155
1     Mary   77
2     John  578
3      Mel  973
There's a problem here. The read_csv function treated the first record in the CSV file as the column names (header). This is obviously wrong, because the file does not contain a header row.
To fix this, we pass the header parameter to read_csv and set it to None (Python's null value).
df = pd.read_csv(Location, header=None)
df
         0    1
0      Bob  968
1  Jessica  155
2     Mary   77
3     John  578
4      Mel  973
If we want to give the columns specific names, we can pass another parameter, names, and omit the header parameter.
df = pd.read_csv(Location, names=['Names', 'births'])
df
     Names  births
0      Bob     968
1  Jessica     155
2     Mary      77
3     John     578
4      Mel     973
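The three ways of calling read_csv can also be compared side by side without writing a file. The sketch below assumes an in-memory buffer (io.StringIO) as a stand-in for births1880.csv. The shortened csv_text and the variable names are illustrative only.

```python
import io
import pandas as pd

# In-memory stand-in for a header-less CSV file (illustrative data)
csv_text = "Bob,968\nJessica,155\nMary,77\n"

# Default: the first record is consumed as the column names
default = pd.read_csv(io.StringIO(csv_text))
# header=None: every record is data, columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
# names=[...]: every record is data, columns get the supplied names
named = pd.read_csv(io.StringIO(csv_text), names=['Names', 'births'])

print(list(default.columns), len(default))
print(list(no_header.columns), len(no_header))
print(list(named.columns), len(named))
```

Only the default call loses a record, because it uses the first line of the file as the header.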
You can think of the numbers [0, 1, 2, 3, 4] as the row numbers in an Excel file. In pandas, they are part of the index of the dataframe. You can think of the index as the primary key of a SQL table, except that an index is allowed to contain duplicates.
[Names, births] are the column names, similar to the column headers in a SQL table or an Excel spreadsheet.
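To make the point about duplicates concrete, here is a minimal sketch that builds a dataframe with a repeated index label on purpose. The index values are made up for the example and do not come from the CSV file.

```python
import pandas as pd

# The label 1 is deliberately used twice in the index
df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary'],
                   'births': [968, 155, 77]},
                  index=[0, 1, 1])

# Unlike a primary key, the index does not have to be unique
print(df.index.is_unique)

# Selecting by a duplicated label returns every matching row
rows_for_label_1 = df.loc[1]
print(rows_for_label_1)
```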
Now you can delete this CSV file.
import os
os.remove(Location)
Preparing Data
Our data consists of baby names and the number of births in the year 1880. We already know that we have 5 records and that none of them are missing (all values are non-null).
The Names column contains baby names, which are alphanumeric strings. There may be some bad data in this column, but we will not worry about that for now. The births column should contain integers representing the number of babies born with a given name in a given year. We can check that every value in this column is an integer; it would not make sense for this column to have a float data type. We will not worry about possible outliers at this point in the analysis.
Note that at this stage, simply looking at the data in the dataframe is sufficient to check the Names column. As the analysis continues, we will have many other opportunities to discover problems in the data.
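The two claims above (no missing values, integer birth counts) can also be checked programmatically. This is one possible sketch, not necessarily how the original lesson does it. It rebuilds the dataframe so the snippet runs on its own, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'births': [968, 155, 77, 578, 973]})

# Count the missing (null) values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Check that the births column has an integer data type
births_are_integers = pd.api.types.is_integer_dtype(df['births'])
print(births_are_integers)
```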
# View the data type of each column
df.dtypes
Names     object
births     int64
dtype: object
# View the data type of the births column
df.births.dtype
dtype('int64')
As you can see, the births column is of type int64, so no floats (decimal numbers) or strings are present in this column.
Analyze Data
To find the most popular baby name, that is, the name with the highest number of births, we can do either of the following:
Sort the dataframe and select the first row
Use the max() method to find the maximum value
# Method 1:
Sorted = df.sort_values(['births'], ascending=False)
Sorted.head(1)
# Method 2:
df['births'].max()
973
Present Data
We can plot the births column and label the plot to show the user the highest point. Together with the table, this gives the user a clear picture that Mel is the most popular baby name in this dataset.
pandas provides a convenient plot() method that makes it easy to plot the data in a dataframe. We have already found the maximum value in the births column, and now we need to find the baby name that goes with the value 973.
Explanation of each part:
df['Names'] - This is the entire list of baby names, the complete Names column
df['births'] - This is the entire list of births in the year 1880, the complete births column
df['births'].max() - This is the maximum value in the births column
[df['births'] == df['births'].max()] means [find all of the records in the births column whose value equals 973]
df['Names'][df['births'] == df['births'].max()] means: select all of the records in the Names column WHERE [the births column is equal to 973]
An alternative is to use the sorted dataframe: Sorted['Names'].head(1).values
str() converts an object to a string.
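The selection described above can be broken into separate steps, which may make the boolean mask easier to see. The sketch rebuilds the dataframe so it runs on its own. The names mask, top_names and top_name_alt are illustrative, and the idxmax variant is an alternative that the lesson itself does not use.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'births': [968, 155, 77, 578, 973]})

# Step 1: one True/False value per row, True where births equals the maximum
mask = df['births'] == df['births'].max()
print(mask.tolist())

# Step 2: keep only the names on the rows where the mask is True
top_names = df['Names'][mask]
print(top_names.values)

# Alternative (not from the lesson): idxmax gives the index label of the
# first row holding the maximum, which can then be looked up with .loc
top_name_alt = df.loc[df['births'].idxmax(), 'Names']
print(top_name_alt)
```

Note that the mask-based selection returns every row that ties for the maximum, while idxmax returns only the first one.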
# Draw the plot
# df['births'].plot()
df['births'].plot.bar()  # a bar chart is more intuitive here
# Maximum value in the births column
MaxValue = df['births'].max()
# Name associated with the maximum value (take the first match so Text is a plain string)
MaxName = df['Names'][df['births'] == df['births'].max()].values[0]
# Text to display on the plot
Text = str(MaxValue) + " - " + MaxName
# Add the text to the plot
plt.annotate(Text, xy=(1, MaxValue), xytext=(8, 0),
             xycoords=('axes fraction', 'data'), textcoords='offset points')
print("The most popular name")
df[df['births'] == df['births'].max()]
# Sorted.head(1) can also be used
The most popular name