Much of the programming work in data analysis and modeling goes into data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way data is stored in files or databases does not match the form your data processing application needs. Many people do ad hoc conversion of data from one form to another using a general-purpose programming language such as Python, Perl, R, or Java, or UNIX text-processing tools such as sed or awk. Fortunately, pandas together with the Python standard library provides a high-level, flexible, and efficient set of core operations and algorithms that make it easy to wrangle your data into the right form.
1. Merging data sets
Data contained in pandas objects can be combined in a number of built-in ways:
- pandas.merge connects rows in different DataFrame objects based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
- pandas.concat stacks multiple objects together along an axis.
- The combine_first instance method splices overlapping data together, filling in missing values in one object with values from another.
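As a quick orientation, the three approaches listed above can be sketched side by side. The frames and names here (left, right, a, b) are made up for illustration and do not come from the examples in this article.

```python
import numpy as np
import pandas as pd

# Illustrative data only; these frames are not part of the article's examples
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [10, 20, 30]})

# pandas.merge: link rows on the shared "key" column (inner join by default)
merged = pd.merge(left, right, on="key")

# pandas.concat: stack objects along an axis (rows here), renumbering the index
stacked = pd.concat([left, left], ignore_index=True)

# combine_first: fill the missing values in `a` with the values from `b`
a = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["x", "y", "z"])
patched = a.combine_first(b)

print(merged)
print(stacked)
print(patched)
```

Only the keys present in both frames ("a" and "b") survive the merge, the concatenation simply doubles the row count, and the NaN at label "y" is replaced by the value from b.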
2. Database-style DataFrame merges
A merge or join operation combines data sets by linking rows using one or more keys. These operations are central to relational databases. The merge function in pandas is the main entry point for applying these algorithms to your data.
In [4]: import pandas as pd

In [5]: import numpy as np

In [6]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
   ...:                     'data1': range(7)})

In [7]: df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
   ...:                     'data2': range(3)})

In [8]: df1
Out[8]:
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

[7 rows x 2 columns]

In [9]: df2
Out[9]:
   data2 key
0      0   a
1      1   b
2      2   d

[3 rows x 2 columns]
This is an example of a many-to-one merge. The data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling merge on these objects gives:
In [10]: pd.merge(df1, df2)
Out[10]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

[6 rows x 3 columns]
Note that I did not specify which column to join on. If it is not specified, merge uses the overlapping column names as the keys. It is good practice to specify them explicitly, though:
In [11]: pd.merge(df1, df2, on='key')
Out[11]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

[6 rows x 3 columns]
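If you want pandas to check the many-to-one assumption rather than taking it on trust, recent versions of merge accept a validate argument. The sketch below assumes a pandas version that supports validate; the df2_dup frame is a made-up counterexample with a repeated key.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": range(7)})
df2 = pd.DataFrame({"key": ["a", "b", "d"], "data2": range(3)})

# validate="many_to_one" asks pandas to confirm that the join keys
# are unique in the right-hand frame before merging
result = pd.merge(df1, df2, on="key", validate="many_to_one")
print(result)

# Hypothetical counterexample: "a" appears twice on the right,
# so the same check is expected to raise MergeError
df2_dup = pd.DataFrame({"key": ["a", "a", "b"], "data2": range(3)})
try:
    pd.merge(df1, df2_dup, on="key", validate="many_to_one")
    check_failed = False
except pd.errors.MergeError:
    check_failed = True

print("duplicate keys detected:", check_failed)
```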
If the key columns have different names in the two objects, you can specify them separately:
In [12]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
   ....:                     'data1': range(7)})

In [13]: df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
   ....:                     'data2': range(3)})

In [14]: pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[14]:
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a

[6 rows x 4 columns]
You may have noticed that the 'c' and 'd' values and their associated data are missing from the result. By default, merge does an "inner" join, so the keys in the result are the intersection of the keys in the two objects. The other options are "left", "right", and "outer". An outer join takes the union of the keys, combining the effects of the left and right joins:
In [16]: pd.merge(df1, df2, how='outer')
Out[16]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
6      3   c    NaN
7    NaN   d      2

[8 rows x 3 columns]
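To see how the four join types differ, it can help to list the key values each one keeps. The following sketch reuses the df1 and df2 data from above; the indicator=True argument is an addition not used elsewhere in this article, and it assumes a pandas version that supports it.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": range(7)})
df2 = pd.DataFrame({"key": ["a", "b", "d"], "data2": range(3)})

# Collect the distinct keys that survive each join type
keys_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(df1, df2, on="key", how=how)
    keys_by_how[how] = sorted(merged["key"].unique())
    print(how, keys_by_how[how])

# indicator=True adds a "_merge" column recording whether each row's key
# was found in the left frame only, the right frame only, or both
outer = pd.merge(df1, df2, on="key", how="outer", indicator=True)
origin = (outer.drop_duplicates("key")
               .set_index("key")["_merge"]
               .astype(str)
               .to_dict())
print(origin)
```

The inner join keeps the intersection of the keys, the left and right joins keep every key from their own side, and the outer join keeps the union.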
Many-to-many merges need no extra work on your part. Here is an example:
In [17]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
   ....:                     'data1': range(6)})

In [18]: df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
   ....:                     'data2': range(5)})

In [19]: df1
Out[19]:
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   b

[6 rows x 2 columns]

In [20]: df2
Out[20]:
   data2 key
0      0   a
1      1   b
2      2   a
3      3   b
4      4   d

[5 rows x 2 columns]

In [21]: pd.merge(df1, df2, on='key', how='left')
Out[21]:
    data1 key  data2
0       0   b      1
1       0   b      3
2       1   b      1
3       1   b      3
4       5   b      1
5       5   b      3
6       2   a      0
7       2   a      2
8       4   a      0
9       4   a      2
10      3   c    NaN

[11 rows x 3 columns]
A many-to-many join produces the Cartesian product of the matching rows. Because there are 3 'b' rows in the left DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method only affects which distinct key values appear in the result:
In [23]: pd.merge(df1, df2, on='key', how='inner')
Out[23]:
   data1 key  data2
0      0   b      1
1      0   b      3
2      1   b      1
3      1   b      3
4      5   b      1
5      5   b      3
6      2   a      0
7      2   a      2
8      4   a      0
9      4   a      2

[10 rows x 3 columns]
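The Cartesian-product rule can be checked with a little arithmetic: for each key, the number of rows in an inner join should equal the number of occurrences on the left times the number on the right. A minimal sketch using the same data (the variable names are chosen here for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                    "data1": range(6)})
df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                    "data2": range(5)})

left_counts = df1["key"].value_counts()
right_counts = df2["key"].value_counts()

# Rows per key in an inner join = left occurrences * right occurrences.
# Keys missing from either side come out as NaN and are dropped.
expected = (left_counts * right_counts).dropna().astype(int)

inner = pd.merge(df1, df2, on="key", how="inner")
actual = inner["key"].value_counts()

print(expected.sort_index())
print(actual.sort_index())
```

Both counts show 4 rows for 'a' (2 x 2) and 6 rows for 'b' (3 x 2), which accounts for the 10 rows of the inner join.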
To merge on multiple keys, pass a list of column names:
In [24]: left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
   ....:                      'key2': ['one', 'two', 'one'],
   ....:                      'lval': [1, 2, 3]})

In [25]: right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
   ....:                       'key2': ['one', 'one', 'one', 'two'],
   ....:                       'rval': [4, 5, 6, 7]})

In [26]: pd.merge(left, right, on=['key1', 'key2'], how='outer')
Out[26]:
  key1 key2  lval  rval
0  foo  one     1     4
1  foo  one     1     5
2  foo  two     2   NaN
3  bar  one     3     6
4  bar  two   NaN     7

[5 rows x 4 columns]
Which key combinations appear in the result depends on the merge method you choose. You can think of the multiple keys as forming an array of tuples that is used as a single join key (even though it is not actually implemented that way).
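The tuple mental model can be made concrete by building the (key1, key2) pairs by hand and comparing them with what each join type returns. This is only an illustration of the idea, not a description of how pandas implements multi-key merges; the helper result_pairs is a hypothetical name introduced here.

```python
import pandas as pd

left = pd.DataFrame({"key1": ["foo", "foo", "bar"],
                     "key2": ["one", "two", "one"],
                     "lval": [1, 2, 3]})
right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                      "key2": ["one", "one", "one", "two"],
                      "rval": [4, 5, 6, 7]})

# Treat each (key1, key2) combination as one composite key
left_pairs = set(zip(left["key1"], left["key2"]))
right_pairs = set(zip(right["key1"], right["key2"]))


def result_pairs(how):
    """Return the set of (key1, key2) pairs present after merging with `how`."""
    merged = pd.merge(left, right, on=["key1", "key2"], how=how)
    return set(zip(merged["key1"], merged["key2"]))


print(result_pairs("inner") == left_pairs & right_pairs)  # intersection
print(result_pairs("outer") == left_pairs | right_pairs)  # union
print(result_pairs("left") == left_pairs)
print(result_pairs("right") == right_pairs)
```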
Warning:
When you join columns on columns, the indexes of the DataFrame objects passed in are discarded.
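The effect of this warning is easy to see with frames that carry meaningful row labels. The sketch below uses made-up data and shows one possible workaround, assuming you want to keep the labels of the left frame: move the index into a column before merging and restore it afterwards.

```python
import pandas as pd

# Illustrative frames with non-default row labels
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]},
                    index=["r1", "r2", "r3"])
right = pd.DataFrame({"key": ["a", "b"], "rval": [10, 20]},
                     index=["x", "y"])

# Column-on-column merge: the result gets a fresh integer index
merged = pd.merge(left, right, on="key")
print(merged.index.tolist())

# Workaround sketch: reset_index() turns the unnamed index into a column
# called "index", which can be restored as the index after the merge
kept = pd.merge(left.reset_index(), right, on="key").set_index("index")
print(kept.index.tolist())
```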
A last issue to consider in merge operations is the treatment of overlapping column names. While you can deal with the overlap manually, merge has a suffixes option for specifying strings to append to the overlapping column names in the left and right DataFrame objects:
In [27]: pd.merge(left, right, on='key1')
Out[27]:
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7

[6 rows x 5 columns]

In [28]: pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
Out[28]:
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7

[6 rows x 5 columns]
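For comparison, the manual approach mentioned above amounts to renaming the overlapping column on each side before merging. A small sketch, assuming the same left and right frames, suggests that both routes produce the same result:

```python
import pandas as pd

left = pd.DataFrame({"key1": ["foo", "foo", "bar"],
                     "key2": ["one", "two", "one"],
                     "lval": [1, 2, 3]})
right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                      "key2": ["one", "one", "one", "two"],
                      "rval": [4, 5, 6, 7]})

# Let merge disambiguate the overlapping "key2" column
auto = pd.merge(left, right, on="key1", suffixes=("_left", "_right"))

# Manual alternative: rename the overlapping column on each side first
manual = pd.merge(left.rename(columns={"key2": "key2_left"}),
                  right.rename(columns={"key2": "key2_right"}),
                  on="key1")

print(list(auto.columns))
print(auto.equals(manual))
```

Note that only the overlapping non-key column (key2) receives a suffix; the join key key1 and the unique columns lval and rval keep their names.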
Data analysis using Python-data normalization: cleanup, transformation, merging, reshaping (vii) (1)