Data analysis using Python-data normalization: cleanup, transformation, merging, reshaping (vii) (1)

Source: Internet
Author: User
Tags: joins

A large share of the programming work in data analysis and modeling goes into data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way data is stored in files or databases is not what your data-processing application needs. Many people do this kind of ad hoc conversion from one form to another with a general-purpose programming language such as Python, Perl, R, or Java, or with UNIX text-processing tools such as sed or awk. Fortunately, pandas and the Python standard library provide a set of high-level, flexible, and efficient core functions and algorithms that make it easy to wrangle data into the right form.

1. Merging data sets

Data contained in pandas objects can be combined in a number of built-in ways:

    • pandas.merge connects rows in different DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
    • pandas.concat stacks multiple objects together along an axis.
    • The instance method combine_first splices overlapping data together, filling in missing values in one object with values from another.
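
As a rough orientation, the three approaches listed above can be sketched side by side. The frames and Series below (left, right, s1, s2) are illustrative data made up for this sketch, not taken from the original text.

```python
import numpy as np
import pandas as pd

# Illustrative data (hypothetical, not from the original article).
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [4, 5, 6]})

# 1) pd.merge: join rows on the shared "key" column (inner join by default).
merged = pd.merge(left, right, on="key")

# 2) pd.concat: stack objects along an axis (rows here);
#    ignore_index=True renumbers the resulting rows.
stacked = pd.concat([left, left], ignore_index=True)

# 3) combine_first: fill the missing values of s1 with the values of s2,
#    aligned on the index labels.
s1 = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["x", "y", "z"])
patched = s1.combine_first(s2)

print(merged)
print(stacked)
print(patched)
```

Only merge is covered in detail in the sections that follow.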


2. Database-style DataFrame merges

A merge or join operation combines data sets by linking rows using one or more keys. These operations are central to relational databases. The pandas merge function is the main entry point for applying these algorithms to your data.

In [4]: import pandas as pd

In [5]: import numpy as np

In [6]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
   ...:                     'data1': range(7)})

In [7]: df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
   ...:                     'data2': range(3)})

In [8]: df1
Out[8]:
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

[7 rows x 2 columns]

In [9]: df2
Out[9]:
   data2 key
0      0   a
1      1   b
2      2   d

[3 rows x 2 columns]

This is a many-to-one merge. The data in df1 has multiple rows labeled a and b, whereas each value in the key column of df2 appears in only one row. Calling merge on these objects gives:

In [10]: pd.merge(df1, df2)
Out[10]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

[6 rows x 3 columns]

Note that I did not specify which column to join on. If it is not specified, merge uses the overlapping column names as the keys. It is good practice to specify the key explicitly, though:

In [11]: pd.merge(df1, df2, on='key')
Out[11]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

[6 rows x 3 columns]

If the key columns have different names in the two objects, you can specify them separately:

In [12]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
   ....:                     'data1': range(7)})

In [13]: df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
   ....:                     'data2': range(3)})

In [14]: pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[14]:
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a

[6 rows x 4 columns]

You may have noticed that the 'c' and 'd' values and their associated data are missing from the result. By default, merge does an "inner" join, so the keys in the result are the intersection of the keys in the two objects. The other options are "left", "right", and "outer". An outer join takes the union of the keys, combining the effects of the left and right joins:

In [16]: pd.merge(df1, df2, how='outer')
Out[16]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
6      3   c    NaN
7    NaN   d      2

[8 rows x 3 columns]

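
For comparison, the four join types can be run against the same two frames and summarized by the distinct keys and row counts they produce. This is a minimal sketch that rebuilds df1 and df2 from the example above; the summary dictionary is a helper introduced only for this illustration, and indicator=True assumes a reasonably recent pandas version.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": range(7)})
df2 = pd.DataFrame({"key": ["a", "b", "d"],
                    "data2": range(3)})

# Collect the distinct keys and the row count produced by each join type.
summary = {}
for how in ["inner", "left", "right", "outer"]:
    result = pd.merge(df1, df2, on="key", how=how)
    summary[how] = (sorted(result["key"].unique()), len(result))
    print(how, summary[how])

# indicator=True adds a "_merge" column recording whether each row came
# from the left frame only, the right frame only, or both.
outer = pd.merge(df1, df2, on="key", how="outer", indicator=True)
print(outer["_merge"].value_counts())
```

The inner join keeps only a and b, the left join adds c, the right join adds d, and the outer join keeps all four keys.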
Many-to-many merges are straightforward and need no extra work. For example:

In [17]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
   ....:                     'data1': range(6)})

In [18]: df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
   ....:                     'data2': range(5)})

In [19]: df1
Out[19]:
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   b

[6 rows x 2 columns]

In [20]: df2
Out[20]:
   data2 key
0      0   a
1      1   b
2      2   a
3      3   b
4      4   d

[5 rows x 2 columns]

In [21]: pd.merge(df1, df2, on='key', how='left')
Out[21]:
    data1 key  data2
0       0   b      1
1       0   b      3
2       1   b      1
3       1   b      3
4       5   b      1
5       5   b      3
6       2   a      0
7       2   a      2
8       4   a      0
9       4   a      2
10      3   c    NaN

[11 rows x 3 columns]

A many-to-many join produces the Cartesian product of the rows. Because there are three "b" rows in the left DataFrame and two in the right one, the result has six "b" rows. The join method only affects which distinct key values appear in the result:

In [23]: pd.merge(df1, df2, on='key', how='inner')
Out[23]:
   data1 key  data2
0      0   b      1
1      0   b      3
2      1   b      1
3      1   b      3
4      5   b      1
5      5   b      3
6      2   a      0
7      2   a      2
8      4   a      0
9      4   a      2

[10 rows x 3 columns]

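
The Cartesian-product row counts can be checked by multiplying the per-key counts of the two frames. This is a small sketch that rebuilds the many-to-many df1 and df2; the names left_counts, right_counts, and expected are introduced only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                    "data1": range(6)})
df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                    "data2": range(5)})

# How many rows carry each key on each side.
left_counts = df1["key"].value_counts()
right_counts = df2["key"].value_counts()

# For keys present on both sides, the merged row count is the product of
# the two counts. Keys present on only one side become NaN and are dropped.
expected = (left_counts * right_counts).dropna().astype(int)

inner = pd.merge(df1, df2, on="key", how="inner")
actual = inner["key"].value_counts()

print(expected.sort_index())
print(actual.sort_index())
```

Here "b" contributes 3 x 2 rows and "a" contributes 2 x 2 rows, which accounts for the 10 rows of the inner join.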
To merge on multiple keys, pass a list of column names:

In [24]: left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
   ....:                      'key2': ['one', 'two', 'one'],
   ....:                      'lval': [1, 2, 3]})

In [25]: right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
   ....:                       'key2': ['one', 'one', 'one', 'two'],
   ....:                       'rval': [4, 5, 6, 7]})

In [26]: pd.merge(left, right, on=['key1', 'key2'], how='outer')
Out[26]:
  key1 key2  lval  rval
0  foo  one     1     4
1  foo  one     1     5
2  foo  two     2   NaN
3  bar  one     3     6
4  bar  two   NaN     7

[5 rows x 4 columns]
   
Which key combinations appear in the result depends on the merge method you choose. You can think of the multiple keys as forming an array of tuples that is used as a single join key (even though it is not actually implemented that way).
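
The tuple view of multiple keys can be made concrete by comparing sets of (key1, key2) pairs. This is only a sketch of the mental model, not of how pandas implements the join; key_tuples is a hypothetical helper written for this illustration.

```python
import pandas as pd

left = pd.DataFrame({"key1": ["foo", "foo", "bar"],
                     "key2": ["one", "two", "one"],
                     "lval": [1, 2, 3]})
right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                      "key2": ["one", "one", "one", "two"],
                      "rval": [4, 5, 6, 7]})


def key_tuples(df):
    """Return the set of distinct (key1, key2) pairs in df (hypothetical helper)."""
    return set(zip(df["key1"], df["key2"]))


left_keys = key_tuples(left)
right_keys = key_tuples(right)

inner = pd.merge(left, right, on=["key1", "key2"], how="inner")
outer = pd.merge(left, right, on=["key1", "key2"], how="outer")

# An inner join keeps the pairs found on both sides; an outer join keeps
# the pairs found on either side.
print(key_tuples(inner) == (left_keys & right_keys))
print(key_tuples(outer) == (left_keys | right_keys))
```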

Warning:

When you join columns on columns, the indexes of the DataFrame objects passed in are discarded.
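
A short sketch of what this warning means in practice, using made-up frames with string index labels. Moving the index into a column with reset_index before merging is one possible workaround, shown here as an assumption about what a reader might want rather than something the original text prescribes.

```python
import pandas as pd

# Illustrative frames with meaningful index labels (hypothetical data).
left = pd.DataFrame({"key": ["a", "b"], "lval": [1, 2]}, index=["r1", "r2"])
right = pd.DataFrame({"key": ["a", "b"], "rval": [3, 4]}, index=["s1", "s2"])

# Column-on-column merge: the original labels are replaced by a fresh
# integer index.
merged = pd.merge(left, right, on="key")
print(merged.index.tolist())

# One way to keep the left labels: turn the index into an ordinary column
# first (reset_index names it "index"), then restore it after the merge.
kept = pd.merge(left.reset_index(), right, on="key").set_index("index")
print(kept.index.tolist())
```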

The last issue to consider in merge operations is the treatment of overlapping column names. Although you can handle the overlap manually, merge has a suffixes option for specifying strings to append to the overlapping column names of the left and right DataFrame objects:

In [27]: pd.merge(left, right, on='key1')
Out[27]:
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7

[6 rows x 5 columns]

In [28]: pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
Out[28]:
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7

[6 rows x 5 columns]
