merge
Pandas provides a <strong>merge</strong> method similar to the join operation of a relational database. It combines rows from different DataFrames based on one or more keys, with the following syntax:
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False)
The merge() function in Python's pandas library is full-featured and supports a variety of inner and outer joins.
- left and right: The two DataFrames to merge
- how: The join type: inner (inner join), left (left outer join), right (right outer join), or outer (full outer join); the default is inner
- on: The name of the column used as the join key. It must exist in both the left and right DataFrame objects. If on is not specified and no other key parameter is given, the intersection of the two DataFrames' column names is used as the join key
- left_on: The column name used as the join key in the left DataFrame. This is useful when the key columns have different names in the left and right DataFrames but represent the same thing
- right_on: The column name used as the join key in the right DataFrame
- left_index: Use the row index of the left DataFrame as the join key
- right_index: Use the row index of the right DataFrame as the join key
- sort: Sorts the merged data by the join key. The default in the signature above is True (recent pandas releases default to False); setting it to False improves performance in most cases
- suffixes: A tuple of strings giving the suffixes appended to column names that appear in both the left and right DataFrames; the default is ('_x', '_y')
- copy: The default is True, which always copies the data into the resulting structure; setting it to False can improve performance in most cases
- indicator: Added in 0.17.0. Adds a column showing the source of each merged row: left only (left_only), right only (right_only), or both (both)
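As a rough illustration of the suffixes and indicator parameters described above, here is a minimal sketch. The frames `left` and `right` and their values are hypothetical sample data, not taken from the article.

```python
import pandas as pd

# Hypothetical sample data, not taken from the article.
left = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "value": [20, 30, 40]})

# Both frames have a non-key column named "value", so suffixes tells them apart.
# indicator=True adds a "_merge" column recording where each row came from.
merged = pd.merge(left, right, on="key", how="outer",
                  suffixes=("_left", "_right"), indicator=True)
print(merged)
```

Rows whose key appears on only one side should be marked left_only or right_only in the `_merge` column, and rows matched on both sides should be marked both.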
In SQL:
SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key; or SELECT * FROM df1, df2 WHERE df1.key = df2.key
In pandas:
pd.merge(df1, df2, on='key')
Then there are the various outer joins:
pd.merge(df1, df2, on='key', how='left')
Set how to 'left' or 'right' for a left or right outer join, and to 'outer' for a full outer join.
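To compare the four join types side by side, the following sketch runs the same merge with each value of how and records which keys survive. The contents of `df1` and `df2` are hypothetical; the article's SQL snippet does not define them.

```python
import pandas as pd

# Hypothetical frames standing in for the df1 and df2 of the SQL snippet.
df1 = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "y": [10, 20, 30]})

# Run the same merge with each join type and record the keys in the result.
keys_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    result = pd.merge(df1, df2, on="key", how=how)
    keys_by_how[how] = sorted(result["key"])
    print(how, keys_by_how[how])
```

With this data, only the keys shared by both frames survive the inner join, while the outer join keeps every key from either side.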
Example
# coding=utf-8
from pandas import DataFrame, merge

data = DataFrame([
    {"id": 0, "name": 'lxh', "age": 20, "cp": 'lm'},
    {"id": 1, "name": 'xiao', "age": 40, "cp": 'ly'},  # age is illegible in the source; 40 is an assumed value
    {"id": 2, "name": 'hua', "age": 4, "cp": 'yry'},
    {"id": 3, "name": 'is', "age": 70, "cp": 'old'},
])
data1 = DataFrame([
    {"id": 100, "name": 'lxh', 'cs': 10},
    {"id": 101, "name": 'xiao', 'cs': 40},
    {"id": 102, "name": 'hua2', 'cs': 50},  # cs is illegible in the source; 50 is an assumed value
])
data2 = DataFrame([
    {"id": 0, "name": 'lxh', 'cs': 10},
    {"id": 101, "name": 'xiao', 'cs': 40},
    {"id": 102, "name": 'hua2', 'cs': 50},  # cs is illegible in the source; 50 is an assumed value
])

print("Inner join with a single column name as the join key", merge(data, data1, on="name", suffixes=('_a', '_b')), sep="\n")
print("Inner join with multiple column names as the join key", merge(data, data2, on=["name", "id"]), sep="\n")
# id and name are used as the keys here
print("No on given: the intersection of the two DataFrames' column names is the join key", merge(data, data2), sep="\n")

# Use the row index of the right DataFrame as the join key
# Set the row index
indexed_data1 = data1.set_index("name")
print("Row index of the right DataFrame as the join key", merge(data, indexed_data1, left_on='name', right_index=True), sep="\n")

print("Left outer join", merge(data, data1, on="name", how="left", suffixes=('_a', '_b')), sep="\n")
print("Left outer join 1", merge(data1, data, on="name", how="left"), sep="\n")
print("Right outer join", merge(data, data1, on="name", how="right"), sep="\n")

data3 = DataFrame([
    {"mid": 0, "mname": 'lxh', 'cs': 10},
    {"mid": 101, "mname": 'xiao', 'cs': 40},
    {"mid": 102, "mname": 'hua2', 'cs': 50},  # cs is illegible in the source; 50 is an assumed value
])
# When the key columns have different names in the two DataFrames, use left_on and right_on to specify the join keys
print("left_on and right_on with differently named join keys", merge(data, data3, left_on=["name", "id"], right_on=["mname", "mid"]), sep="\n")
The output is as follows (… marks a value that is not legible in the source):
Inner join with a single column name as the join key
   age  cp  id_a  name  cs  id_b
0   20  lm     0   lxh  10   100
1    …  ly     1  xiao  40   101
Inner join with multiple column names as the join key
   age  cp  id name  cs
0   20  lm   0  lxh  10
No on given: the intersection of the two DataFrames' column names is the join key
   age  cp  id name  cs
0   20  lm   0  lxh  10
Row index of the right DataFrame as the join key
   age  cp  id_x  name  cs  id_y
0   20  lm     0   lxh  10   100
1    …  ly     1  xiao  40   101
Left outer join
   age   cp  id_a  name   cs  id_b
0   20   lm     0   lxh   10   100
1    …   ly     1  xiao   40   101
2    4  yry     2   hua  NaN   NaN
3   70  old     3    is  NaN   NaN
Left outer join 1
   cs  id_x  name  age   cp  id_y
0  10   100   lxh   20   lm     0
1  40   101  xiao    …   ly     1
2   …   102  hua2  NaN  NaN   NaN
Right outer join
   age   cp  id_x  name  cs  id_y
0   20   lm     0   lxh  10   100
1    …   ly     1  xiao  40   101
2  NaN  NaN   NaN  hua2   …   102
left_on and right_on with differently named join keys
   age  cp  id name  cs  mid mname
0   20  lm   0  lxh  10    0   lxh
The join method provides a convenient way to combine the columns of two DataFrames that have different column names into a single DataFrame, matching rows on the row index.
Its parameters mean essentially the same as those of the merge method, except that join defaults to a left outer join (how='left').
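To make the relationship with merge concrete, the sketch below checks that a default join gives the same result as a left merge on both row indexes. The frames `people` and `extra` are hypothetical and are not part of the article's data.

```python
import pandas as pd

# Hypothetical index-aligned frames, not from the article.
people = pd.DataFrame({"name": ["lxh", "xiao"]}, index=["a", "b"])
extra = pd.DataFrame({"sex": [0, 1]}, index=["a", "e"])

# join defaults to how="left" and matches rows on the row index.
joined = people.join(extra)

# The same operation spelled out with merge.
merged = pd.merge(people, extra, how="left", left_index=True, right_index=True)

print(joined)
print(joined.equals(merged))
```

Row b has no match in `extra`, so its sex value is missing, and index e from `extra` does not appear in the result.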
Example:
# coding=utf-8
from pandas import DataFrame

data = DataFrame([
    {"id": 0, "name": 'lxh', "age": 20, "cp": 'lm'},
    {"id": 1, "name": 'xiao', "age": 40, "cp": 'ly'},  # age is illegible in the source; 40 is an assumed value
    {"id": 2, "name": 'hua', "age": 4, "cp": 'yry'},
    {"id": 3, "name": 'is', "age": 70, "cp": 'old'},
], index=['a', 'b', 'c', 'd'])
data1 = DataFrame([{"sex": 0}, {"sex": 1}, {"sex": 2}], index=['a', 'b', 'e'])

# The row with index e, which does not exist in data, is dropped automatically
print("Default left join", data.join(data1), sep="\n")
# Rows c and d, which do not exist in data1, are dropped; equivalent to data1.join(data) apart from column order
print("Right join", data.join(data1, how="right"), sep="\n")
print("Inner join", data.join(data1, how='inner'), sep="\n")
print("Full outer join", data.join(data1, how='outer'), sep="\n")
The result is as follows (… marks a value that is not legible in the source):
Default left join
   age   cp  id  name  sex
a   20   lm   0   lxh    0
b    …   ly   1  xiao    1
c    4  yry   2   hua  NaN
d   70  old   3    is  NaN
Right join
   age   cp   id  name  sex
a   20   lm    0   lxh    0
b    …   ly    1  xiao    1
e  NaN  NaN  NaN   NaN    2
Inner join
   age  cp  id  name  sex
a   20  lm   0   lxh    0
b    …  ly   1  xiao    1
Full outer join
   age   cp   id  name  sex
a   20   lm    0   lxh    0
b    …   ly    1  xiao    1
c    4  yry    2   hua  NaN
d   70  old    3    is  NaN
e  NaN  NaN  NaN   NaN    2
There is another way to combine DataFrames: concat
The concat method is comparable to UNION ALL in a database. You can specify the axis along which to concatenate, and how the other axis is joined (only outer and inner are supported).
Unlike a database UNION, concat does not remove duplicates; to deduplicate, use the drop_duplicates method.
concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
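The effect of the join parameter when stacking rows (axis=0) can be sketched as follows. The frames `a` and `b` are hypothetical and share only the city column; outer keeps the union of the columns, while inner keeps only the shared ones.

```python
import pandas as pd

# Hypothetical frames that share only the "city" column.
a = pd.DataFrame({"city": ["Chicago", "Boston"], "rank": [1, 2]})
b = pd.DataFrame({"city": ["Chicago", "Denver"], "state": ["IL", "CO"]})

# Default join="outer": union of the columns, missing cells become NaN.
outer = pd.concat([a, b], ignore_index=True)

# join="inner": only the columns present in every frame are kept.
inner = pd.concat([a, b], join="inner", ignore_index=True)

# concat keeps duplicate rows; drop_duplicates removes the repeated "Chicago".
unique = inner.drop_duplicates()

print(outer, inner, unique, sep="\n\n")
```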
Example:
# coding=utf-8
from pandas import DataFrame, concat

df1 = DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'], 'rank': range(1, 4)})
df2 = DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'], 'rank': [1, 4, 5]})

print("Inner join along axis 1", concat([df1, df2], join="inner", axis=1), sep="\n")
# Duplicate rows remain here
print("Outer join with keys specified (row index)", concat([df1, df2], keys=['a', 'b']), sep="\n")
print("After deduplication", concat([df1, df2], ignore_index=True).drop_duplicates(), sep="\n")
The output is:
Inner join along axis 1
            city  rank         city  rank
0        Chicago     1      Chicago     1
1  San Francisco     2       Boston     4
2  New York City     3  Los Angeles     5
Outer join with keys specified (row index)
              city  rank
a 0        Chicago     1
  1  San Francisco     2
  2  New York City     3
b 0        Chicago     1
  1         Boston     4
  2    Los Angeles     5
After deduplication
            city  rank
0        Chicago     1
1  San Francisco     2
2  New York City     3
4         Boston     4
5    Los Angeles     5
Python data table merging (usage of pandas join(), merge(), and concat())