1. What is a DataFrame
A DataFrame is a tabular data structure containing an ordered set of columns, each of which can hold a different value type (numeric, string, Boolean, etc.). It has both a row index and a column index, and can be thought of as a dictionary of Series that all share the same index.
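The "dictionary of Series" view can be checked directly. The following is a minimal sketch; the column names and values are hypothetical and only serve to illustrate the idea:

```python
import pandas as pd

# Two Series that share the same row labels (hypothetical data)
index = ["a", "b", "c"]
price = pd.Series([1.0, 2.0, 3.0], index=index)
label = pd.Series(["x", "y", "z"], index=index)

# Combining them yields a DataFrame with one column per Series
frame = pd.DataFrame({"price": price, "label": label})

# Each column is itself a Series carrying the shared row index
column = frame["price"]
print(type(column).__name__)  # Series
print(list(column.index))     # ['a', 'b', 'c']
```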
2. DataFrame features
Row-oriented and column-oriented operations on a DataFrame are treated roughly symmetrically.
The data in the DataFrame is stored in one or more two-dimensional blocks (instead of lists, dictionaries or other one-dimensional data structures).
3. Create DataFrame
The most common way is to pass in a dictionary of equal-length lists or NumPy arrays:
In [31]: import numpy as np
In [32]: from pandas import DataFrame, Series
In [33]: data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],'year':[2000,2001,2002,2001,2002], 'pop':[1.5,1.7,3.6,2.4,2.9]}
In [34]: frame=DataFrame(data)
# The resulting DataFrame is given an index automatically (as with Series). In the older pandas version shown here the columns come out sorted by name; recent versions keep the dictionary's insertion order:
In [35]: frame
Out[35]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
4. Specify column order
# Use columns to specify the column order:
In [36]: DataFrame(data,columns=['year','state','pop'])
Out[36]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
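For readers who want to run the two steps above as a script rather than in an interactive session, here is a minimal sketch. It assumes only that pandas is installed and imported as pd; the data is the same dictionary used above:

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}

# Passing columns fixes the column order regardless of pandas version
frame = pd.DataFrame(data, columns=["year", "state", "pop"])

print(list(frame.columns))  # ['year', 'state', 'pop']
print(list(frame.index))    # [0, 1, 2, 3, 4]
```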
5. NA value
As with Series, if a requested column is not present in the data, it is filled with NA values:
In [37]: frame2=DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five'])
In [38]: frame2
Out[38]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
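A quick way to confirm that a requested but missing column is entirely NA is to test it with isna. This is a small sketch on a shortened, hypothetical version of the data:

```python
import pandas as pd

data = {"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]}

# 'debt' is not a key of data, so the whole column is filled with NA
frame2 = pd.DataFrame(data, columns=["state", "pop", "debt"], index=["one", "two"])

print(frame2["debt"].isna().all())  # True
```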
6. Dictionary-style (or attribute) access
A column of the DataFrame can be retrieved as a Series using either dictionary-style notation or attribute access:
In [39]: frame['state'] #or frame.state
Out[39]:
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
Name: state, dtype: object
7. Row access with loc (formerly ix)
Note that the returned Series has the same index as the original DataFrame, and its name attribute is set accordingly. Rows can also be retrieved by label or by position. Older pandas versions used the ix indexer for this; it was removed in pandas 1.0, so use loc for labels and iloc for integer positions:
In [44]: frame2.loc['one']
Out[44]:
year 2000
state Ohio
pop 1.5
debt NaN
Name: one, dtype: object
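The difference between label-based and position-based row access can be sketched with loc and iloc on a small, hypothetical frame:

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"year": [2000, 2001], "state": ["Ohio", "Nevada"]},
    index=["one", "two"],
)

by_label = frame2.loc["one"]   # select the row whose label is 'one'
by_position = frame2.iloc[0]   # select the first row, whatever its label

# Both return the same row as a Series named after the row label
print(by_label.name)                 # one
print(by_label.equals(by_position))  # True
```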
8. Modify the column by assignment
Columns can be modified by assignment. For example, you can assign a scalar value or an array of values to the empty 'debt' column:
In [45]: frame2['debt']=16.5 # or frame2.debt=16.5
In [46]: frame2
Out[46]:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
In [50]: frame2.debt=np.arange(5.)
In [51]: frame2
Out[51]:
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2002 Nevada 2.9 4.0
When assigning a list or array to a column, its length must match the length of the DataFrame. If the assigned value is a Series, its labels are aligned to the DataFrame's index, and any row without a matching label is filled with a missing value:
In [52]: val=Series([-1.2,-1.5,-1.7],index=['two','four','five'])
In [53]: frame2['debt']=val
In [54]: frame2
Out[54]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
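Both rules, label alignment for a Series and the length check for a list, can be seen in a short sketch. The frame below is a reduced, hypothetical version of frame2:

```python
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["one", "two", "three"])

# A Series is aligned on labels: 'one' and 'three' have no match and become NaN,
# while the label 'four' does not exist in the frame and is ignored
val = pd.Series([-1.2, -1.5], index=["two", "four"])
frame2["debt"] = val
print(frame2)

# A plain list carries no labels, so its length must equal the number of rows
try:
    frame2["debt"] = [1.0, 2.0]
    length_error = False
except ValueError:
    length_error = True
print(length_error)  # True
```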
9. Deleting columns with del
Assigning a value to a non-existent column creates a new column. The keyword del is used to delete columns:
In [55]: frame2['eastern']=frame2.state=='Ohio'
In [56]: frame2
Out[56]:
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
In [57]: del frame2['eastern']
In [58]: frame2.columns
Out[58]: Index(['year','state','pop','debt'], dtype='object')
Warning: In the pandas version described here, a column returned by indexing is a view of the underlying data, not a copy, so in-place modifications to the returned Series are reflected in the source DataFrame. (With copy-on-write enabled in newer pandas versions, this behavior differs.) To get an independent column, copy it explicitly with the copy method of Series.
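To work on a column without risking changes to the source DataFrame, take an explicit copy first. A minimal sketch with hypothetical data:

```python
import pandas as pd

frame2 = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])

# copy() returns an independent Series, so edits do not reach frame2
pop_copy = frame2["pop"].copy()
pop_copy.loc["one"] = 99.0

print(frame2.loc["one", "pop"])  # 1.5
print(pop_copy.loc["one"])       # 99.0
```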
10. Nested dictionary
Another common form of input is a nested dictionary (that is, a dictionary of dictionaries):
In [62]: pop={'Nevada':{2001:2.4,2002:2.9},'Ohio':{2000:1.5,2001:17,2002:3.6}}
# If you pass it to DataFrame, the keys of the outer dictionary become the columns and the keys of the inner dictionaries become the row index:
In [63]: frame3=DataFrame(pop)
In [64]: frame3
Out[64]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 17.0
2002 2.9 3.6
The keys of the inner dictionaries are merged to form the final index (sorted in the output shown here; newer pandas versions may instead keep the order in which the keys first appear). If an index is specified explicitly, it is used as given:
In [66]: DataFrame(pop,index=[2001,2002,2003])
Out[66]:
Nevada Ohio
2001 2.4 17.0
2002 2.9 3.6
2003 NaN NaN
A dictionary of Series is handled in much the same way:
In [68]: pdata={'Ohio':frame3['Ohio'][:-1],'Nevada':frame3['Nevada'][:2]}
In [69]: DataFrame(pdata)
Out[69]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 17.0
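The three constructions in this section can be reproduced in one short script. This is a sketch using the same pop dictionary; sort_index is called explicitly so that the row order does not depend on the pandas version:

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9}, "Ohio": {2000: 1.5, 2001: 17, 2002: 3.6}}

# Outer keys become columns, inner keys become the row index
frame3 = pd.DataFrame(pop).sort_index()
print(list(frame3.index))  # [2000, 2001, 2002]

# An explicit index is used as given; labels missing from the data yield NaN rows
explicit = pd.DataFrame(pop, index=[2001, 2002, 2003])
print(explicit.loc[2003].isna().all())  # True

# A dictionary of Series: here each Series is cut down to the years 2000 and 2001
pdata = {"Ohio": frame3["Ohio"].iloc[:-1], "Nevada": frame3["Nevada"].iloc[:2]}
partial = pd.DataFrame(pdata).sort_index()
print(list(partial.index))  # [2000, 2001]
```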
11. Transpose
A DataFrame can be transposed, which swaps its rows and columns:
In [65]: frame3.T
Out[65]:
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 17.0 3.6
12. Index object
pandas Index objects are responsible for holding axis labels and other metadata (such as axis names).
Index objects are immutable: their elements cannot be modified in place.
Immutability is very important, because only in this way can Index objects be safely shared among multiple data structures.
Note: Although most users do not need to know much about the Index object, it is an important part of the pandas data model.
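Immutability and sharing can be demonstrated briefly. In this sketch the variable names and labels are hypothetical. Note that although individual elements cannot be changed, the whole index of an object can still be replaced by assigning a new one:

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

# Item assignment on an Index raises TypeError
try:
    index[1] = "d"
    mutated = True
except TypeError:
    mutated = False
print(mutated)  # False

# One Index can safely back several data structures
labels = pd.Index(["a", "b", "c"])
s1 = pd.Series([1, 2, 3], index=labels)
s2 = pd.Series([4, 5, 6], index=labels)
print(s1.index.equals(s2.index))  # True

# Replacing the entire index is allowed
obj.index = ["x", "y", "z"]
print(list(obj.index))  # ['x', 'y', 'z']
```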