Python data table merging (usage of pandas join(), merge(), and concat())

Source: Internet
Author: User

merge

pandas provides a merge method similar to the join operation in a relational database: it combines rows from different DataFrames based on one or more keys. The syntax is:

    merge(left, right, how='inner', on=None, left_on=None, right_on=None,
          left_index=False, right_index=False, sort=True,
          suffixes=('_x', '_y'), copy=True, indicator=False)

The merge() function in Python's pandas library supports a variety of inner and outer joins.

    • left and right: the two DataFrames to merge
    • how: the join type: inner (inner join), left (left outer join), right (right outer join), or outer (full outer join); the default is inner
    • on: the column name(s) used as the join key. They must exist in both the left and the right DataFrame. If on is not given and no other key parameter is specified, the intersection of the two DataFrames' column names is used as the join key
    • left_on: the column name(s) in the left DataFrame used as the join key. Useful when the key columns have different names in the left and right DataFrames but represent the same thing
    • right_on: the column name(s) in the right DataFrame used as the join key
    • left_index: use the row index of the left DataFrame as the join key
    • right_index: use the row index of the right DataFrame as the join key
    • sort: sort the merged data by the join key. The signature above lists True, but current pandas releases default to False; leaving it False generally gives better performance
    • suffixes: a tuple of strings appended to column names that exist in both the left and right DataFrames; the default is ('_x', '_y')
    • copy: defaults to True, which always copies data into the resulting data structure; setting it to False can improve performance in some cases
    • indicator: added in 0.17.0; adds a column (_merge) showing the source of each merged row: left only (left_only), right only (right_only), or both (both)
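
The suffixes and indicator parameters described above can be sketched as follows. The two small frames and their column names are hypothetical sample data, not taken from the article; this is a minimal illustration rather than a complete tour of merge.

```python
import pandas as pd

# Hypothetical sample frames (assumed for illustration only).
left = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "value": [20, 30, 40]})

# Full outer join. The overlapping non-key column "value" receives the
# custom suffixes, and indicator=True adds a "_merge" column that names
# the origin of each row.
merged = pd.merge(
    left,
    right,
    on="key",
    how="outer",
    suffixes=("_left", "_right"),
    indicator=True,
)
print(merged)

# Map each key to the origin reported by the indicator column.
origin = dict(zip(merged["key"], merged["_merge"].astype(str)))
print(origin)
```

Rows that exist on only one side get NaN in the other side's columns, so the indicator column is a convenient way to tell them apart.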

In SQL:

    SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key;

or

    SELECT * FROM df1, df2 WHERE df1.key = df2.key;

In pandas:

    pd.merge(df1, df2, on='key')

Then there are the various outer joins:

    pd.merge(df1, df2, on='key', how='left')

Set how to 'left' or 'right' for a left or right outer join, and to 'outer' for a full outer join.
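
To make the effect of each how value concrete, here is a small sketch that runs the same merge four times and records which keys survive. The frames df1 and df2 and their contents are assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical frames sharing a "key" column; "b" and "c" appear in both.
df1 = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Run the same merge with each join type and collect the resulting keys.
keys_by_how = {}
for how in ("inner", "left", "right", "outer"):
    result = pd.merge(df1, df2, on="key", how=how)
    keys_by_how[how] = sorted(result["key"])
    print(how, keys_by_how[how])
```

The inner join keeps only keys present in both frames, left and right keep every key of the respective side, and outer keeps the union.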

Example:

    # coding=utf-8
    from pandas import DataFrame, merge

    data = DataFrame([
        {"id": 0, "name": 'lxh', "age": 20, "cp": 'lm'},
        # xiao's age is garbled in the source; 40 is an assumed placeholder
        {"id": 1, "name": 'xiao', "age": 40, "cp": 'ly'},
        {"id": 2, "name": 'hua', "age": 4, "cp": 'yry'},
        {"id": 3, "name": 'be', "age": 70, "cp": 'old'},
    ])
    # The cs value for hua2 is missing in the source; 50 is an assumed placeholder
    data1 = DataFrame([
        {"id": 100, "name": 'lxh', 'cs': 10},
        {"id": 101, "name": 'xiao', 'cs': 40},
        {"id": 102, "name": 'hua2', 'cs': 50},
    ])
    data2 = DataFrame([
        {"id": 0, "name": 'lxh', 'cs': 10},
        {"id": 101, "name": 'xiao', 'cs': 40},
        {"id": 102, "name": 'hua2', 'cs': 50},
    ])

    print("Inner join with a single column name as the join key",
          merge(data, data1, on="name", suffixes=('_a', '_b')), sep="\n")
    print("Inner join with multiple column names as the join key",
          merge(data, data2, on=["name", "id"]), sep="\n")
    # id and name are used as the key here
    print("Without on, the intersection of the column names is the join key",
          merge(data, data2), sep="\n")

    # Use the row index of the right DataFrame as the join key
    # Set the row index first
    indexed_data1 = data1.set_index("name")
    print("Row index of the right DataFrame as the join key",
          merge(data, indexed_data1, left_on='name', right_index=True), sep="\n")

    print("Left outer join",
          merge(data, data1, on="name", how="left", suffixes=('_a', '_b')), sep="\n")
    print("Left outer join 1",
          merge(data1, data, on="name", how="left"), sep="\n")
    print("Right outer join",
          merge(data, data1, on="name", how="right"), sep="\n")

    data3 = DataFrame([
        {"mid": 0, "mname": 'lxh', 'cs': 10},
        {"mid": 101, "mname": 'xiao', 'cs': 40},
        {"mid": 102, "mname": 'hua2', 'cs': 50},
    ])
    # When the key columns have different names in the left and right
    # DataFrames, use left_on and right_on to specify the join key
    print("left_on and right_on with differently named key columns",
          merge(data, data3, left_on=["name", "id"], right_on=["mname", "mid"]), sep="\n")


The output is (… marks a value that is garbled in the source):

    Inner join with a single column name as the join key
       age  cp  id_a  name  cs  id_b
    0   20  lm     0   lxh  10   100
    1    …  ly     1  xiao  40   101
    Inner join with multiple column names as the join key
       age  cp  id name  cs
    0   20  lm   0  lxh  10
    Without on, the intersection of the column names is the join key
       age  cp  id name  cs
    0   20  lm   0  lxh  10
    Row index of the right DataFrame as the join key
       age  cp  id_x  name  cs  id_y
    0   20  lm     0   lxh  10   100
    1    …  ly     1  xiao  40   101
    Left outer join
       age   cp  id_a  name   cs  id_b
    0   20   lm     0   lxh   10   100
    1    …   ly     1  xiao   40   101
    2    4  yry     2   hua  NaN   NaN
    3   70  old     3    be  NaN   NaN
    Left outer join 1
       cs  id_x  name  age   cp  id_y
    0  10   100   lxh   20   lm     0
    1  40   101  xiao    …   ly     1
    2   …   102  hua2  NaN  NaN   NaN
    Right outer join
       age   cp  id_x  name  cs  id_y
    0   20   lm     0   lxh  10   100
    1    …   ly     1  xiao  40   101
    2  NaN  NaN   NaN  hua2   …   102
    left_on and right_on with differently named key columns
       age  cp  id name  cs  mid mname
    0   20  lm   0  lxh  10    0   lxh

The join method provides a simple way to combine the columns of two DataFrames that have different column indexes into a single DataFrame.

Its parameters mean essentially the same as in merge, except that join matches rows on the index by default and defaults to a left outer join (how='left').
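
The relationship between join and merge can be sketched as below: with default arguments, join behaves like a left merge on both row indexes. The frames people and extra are hypothetical sample data introduced only for this comparison.

```python
import pandas as pd

# Hypothetical frames with partially overlapping row indexes.
people = pd.DataFrame({"name": ["lxh", "xiao", "hua"]}, index=["a", "b", "c"])
extra = pd.DataFrame({"sex": [0, 1]}, index=["a", "e"])

# join: left outer join on the row index by default.
joined = people.join(extra)

# The same result spelled out with merge.
merged = pd.merge(people, extra, left_index=True, right_index=True, how="left")

print(joined)
print(joined.equals(merged))
```

Index labels that exist only in the right frame ("e" here) are dropped, and left rows without a match get NaN in the joined columns.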

Example:

    # coding=utf-8
    from pandas import DataFrame

    data = DataFrame([
        {"id": 0, "name": 'lxh', "age": 20, "cp": 'lm'},
        # xiao's age is garbled in the source; 40 is an assumed placeholder
        {"id": 1, "name": 'xiao', "age": 40, "cp": 'ly'},
        {"id": 2, "name": 'hua', "age": 4, "cp": 'yry'},
        {"id": 3, "name": 'be', "age": 70, "cp": 'old'},
    ], index=['a', 'b', 'c', 'd'])
    data1 = DataFrame([{"sex": 0}, {"sex": 1}, {"sex": 2}], index=['a', 'b', 'e'])

    # The row with index e, which does not exist in data, is dropped automatically
    print('Default left join', data.join(data1), sep="\n")
    # Rows with index c and d, which do not exist in data1, are dropped;
    # the same rows are returned by data1.join(data)
    print('Right join', data.join(data1, how="right"), sep="\n")
    print('Inner join', data.join(data1, how='inner'), sep="\n")
    print('Full outer join', data.join(data1, how='outer'), sep="\n")

The result is:

    Default left join
       age   cp  id  name  sex
    a   20   lm   0   lxh    0
    b    …   ly   1  xiao    1
    c    4  yry   2   hua  NaN
    d   70  old   3    be  NaN
    Right join
       age   cp   id  name  sex
    a   20   lm    0   lxh    0
    b    …   ly    1  xiao    1
    e  NaN  NaN  NaN   NaN    2
    Inner join
       age  cp  id  name  sex
    a   20  lm   0   lxh    0
    b    …  ly   1  xiao    1
    Full outer join
       age   cp   id  name  sex
    a   20   lm    0   lxh    0
    b    …   ly    1  xiao    1
    c    4  yry    2   hua  NaN
    d   70  old    3    be  NaN
    e  NaN  NaN  NaN   NaN    2

(… marks a value that is garbled in the source.)

There is another way to combine data: concat

The concat method is equivalent to UNION ALL in a database. You can specify the axis to concatenate along, and you can also specify the join type (only outer and inner are available).

Like UNION ALL, and unlike a plain UNION, concat does not remove duplicate rows; to deduplicate, use the drop_duplicates method.

    concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
           keys=None, levels=None, names=None, verify_integrity=False, copy=True)

The join_axes parameter shown in this signature was removed in pandas 1.0.
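
The difference between stacking rows with concat and a deduplicated result can be sketched as follows. The two frames are hypothetical sample data, and the SQL analogies in the comments are loose comparisons rather than exact equivalences.

```python
import pandas as pd

# Hypothetical frames; the ("Chicago", 1) row appears in both.
df1 = pd.DataFrame({"city": ["Chicago", "Boston"], "rank": [1, 4]})
df2 = pd.DataFrame({"city": ["Chicago", "Denver"], "rank": [1, 7]})

# Stack the rows; duplicates are kept, roughly like UNION ALL.
stacked = pd.concat([df1, df2], ignore_index=True)

# Drop fully identical rows, roughly like UNION, and renumber the index.
unique = stacked.drop_duplicates().reset_index(drop=True)

print(len(stacked), len(unique))
print(unique)
```

Passing ignore_index=True gives the stacked frame a fresh 0..n-1 index, which avoids repeated index labels from the inputs.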

Example:

    # coding=utf-8
    from pandas import DataFrame, concat

    df1 = DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'], 'rank': range(1, 4)})
    df2 = DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'], 'rank': [1, 4, 5]})

    print('Inner join along axis 1', concat([df1, df2], join="inner", axis=1), sep="\n")
    # Duplicate rows remain in this result
    print('Outer join with keys (row index)', concat([df1, df2], keys=['a', 'b']), sep="\n")
    print('After deduplication', concat([df1, df2], ignore_index=True).drop_duplicates(), sep="\n")

The output is:

    Inner join along axis 1
                city  rank         city  rank
    0        Chicago     1      Chicago     1
    1  San Francisco     2       Boston     4
    2  New York City     3  Los Angeles     5
    Outer join with keys (row index)
                  city  rank
    a 0        Chicago     1
      1  San Francisco     2
      2  New York City     3
    b 0        Chicago     1
      1         Boston     4
      2    Los Angeles     5
    After deduplication
                city  rank
    0        Chicago     1
    1  San Francisco     2
    2  New York City     3
    4         Boston     4
    5    Los Angeles     5
