7. Combining Datasets: merge and join

2025. 6. 19. 00:11Python/Pandas

# In[1]
class display(object):
    """Display HTML representation of multiple objects"""
    template="""<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}{1}
    """
    def __init__(self,*args):
        self.args=args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a,eval(a)._repr_html_()) 
                         for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a+'\n'+repr(eval(a))
                         for a in self.args)

Relational Algebra

  • The behavior implemented in pd.merge is a subset of what is known as relational algebra, which is a formal set of rules for manipulating relational data that forms the conceptual foundation of operations available in most databases.
  • The strength of the relational algebra approach is that it proposes several fundamental operations, which become the building blocks of more complicated operations on any dataset.
  • Pandas implements several of these fundamental building blocks in the pd.merge function and the related join method of Series and DataFrame objects.

Categories of Joins

One-to-One Joins

  • The simplest type of merge.
# In[2]
df1=pd.DataFrame({'employee':['Bob','Jake','Lisa','Sue'],
'group':['Accounting','Engineering','Engineering','HR']})
df2=pd.DataFrame({'employee':['Lisa','Bob','Jake','Sue'],
'hire_data':[2004,2008,2012,2014]})
display('df1','df2')
# Out[2]
df1

  employee          group
0       Bob     Accounting
1      Jake    Engineering
2      Lisa    Engineering
3       Sue             HR

df2

  employee    hire_data
0      Lisa         2004
1       Bob         2008
2      Jake         2012
3       Sue         2014
  • To combine this information into a single DataFrame, we can use the pd.merge function.
# In[3]
df3=pd.merge(df1,df2)
df3
# Out[3]
  employee          group    hire_data
0       Bob     Accounting         2008
1      Jake    Engineering         2012
2      Lisa    Engineering         2004
3       Sue             HR         2014
  • pd.merge function recognizes that each DataFrame has an employee column, and automatically joins using this column as a key.
  • The order of entries in each column is not necessarily maintained: in this case, the order of employee column differs between df1 and df2, and the pd.merge function correctly accounts for this.
  • The merge in general discards the index, except in the special case of merges by index.

Many-to-One Joins

  • Many to One joins are joins in which one of the two key columns contains duplicate entries.
# In[4]
df4=pd.DataFrame({'group':['Accounting','Engineering','HR'],
'supervisor':['Carly','Guido','Steve']})
display('df3','df4','pd.merge(df3,df4)')
# Out[4]
df3

  employee          group    hire_data
0       Bob     Accounting         2008
1      Jake    Engineering         2012
2      Lisa    Engineering         2004
3       Sue             HR         2014

df4

          group    supervisor
0     Accounting         Carly
1    Engineering         Guido
2             HR         Steve

pd.merge(df3,df4)

  employee          group    hire_data    supervisor
0       Bob     Accounting         2008         Carly
1      Jake    Engineering         2012         Guido
2      Lisa    Engineering         2004         Guido
3       Sue             HR         2014         Steve

Many-to-Many Joins

  • If the key column in both the left and right arrays contains duplicates, then the result is a many to many merge.
# In[5]
df5=pd.DataFrame({'group':['Accounting','Accounting','Engineering','Engineering','HR','HR'],
'skills':['math','spreadsheets','software','math','spreadsheets','organization']})
display('df1','df5','pd.merge(df1,df5)')
# Out[5]
df1

  employee          group
0       Bob     Accounting
1      Jake    Engineering
2      Lisa    Engineering
3       Sue             HR

df5

          group          skills
0     Accounting            math
1     Accounting    spreadsheets
2    Engineering        software
3    Engineering            math
4             HR    spreadsheets
5             HR    organization

pd.merge(df1,df5)

  employee          group          skills
0       Bob     Accounting            math
1       Bob     Accounting    spreadsheets
2      Jake    Engineering        software
3      Jake    Engineering            math
4      Lisa    Engineering        software
5      Lisa    Engineering            math
6       Sue             HR    spreadsheets
7       Sue             HR    organization

Specification of the Merge Key

The on Keyword

  • You can explicitly specify the name of the key column using the on keyword, which takes a column name of a list of column names.
# In[6]
display('df1','df2',"pd.merge(df1,df2,on='employee')")
# Out[6]
df1

  employee          group
0       Bob     Accounting
1      Jake    Engineering
2      Lisa    Engineering
3       Sue             HR

df2

  employee    hire_data
0      Lisa         2004
1       Bob         2008
2      Jake         2012
3       Sue         2014

pd.merge(df1,df2,on='employee')

  employee          group    hire_data
0       Bob     Accounting         2008
1      Jake    Engineering         2012
2      Lisa    Engineering         2004
3       Sue             HR         2014
  • This option works only if both the left and right DataFrames have the specified column name.

The left_on and right_on Keywords

  • At times you may wish to merge two datasets with different column names.
# In[7]
df3=pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'],
'salary':[70000,80000,120000,90000]})
display('df1','df3',"pd.merge(df1,df3,left_on='employee',right_on='name')")
# Out[7]
df1

  employee          group
0       Bob     Accounting
1      Jake    Engineering
2      Lisa    Engineering
3       Sue             HR

df3

    name    salary
0     Bob     70000
1    Jake     80000
2    Lisa    120000
3     Sue     90000

pd.merge(df1,df3,left_on='employee',right_on='name')

  employee          group    name    salary
0       Bob     Accounting     Bob     70000
1      Jake    Engineering    Jake     80000
2      Lisa    Engineering    Lisa    120000
3       Sue             HR     Sue     90000
  • The result has a redundant column that we can drop if desired by using DataFrame.drop() method.
# In[8]
pd.merge(df1,df3,left_on='employee',right_on='name').drop('name',axis=1)
# Out[8]
  employee          group    salary
0       Bob     Accounting     70000
1      Jake    Engineering     80000
2      Lisa    Engineering    120000
3       Sue             HR     90000

The left_index and right_index Keywords

  • Sometimes, rather than merging on a column, you would instead like to merge on an index.
  • You can use the index as the key for merging by specifying the left_index and/or right_index flags in pd.merge()
# In[9]
df1a=df1.set_index('employee')
df2a=df2.set_index('employee')
display('df1a','df2a')
# Out[9]
df1a

              group
employee    
Bob         Accounting
Jake    Engineering
Lisa    Engineering
Sue                 HR

df2a

        hire_data
employee    
Lisa         2004
Bob             2008
Jake         2012
Sue             2014

# In[10]
display('df1a','df2a',"pd.merge(df1a,df2a,left_index=True,right_index=True)")
# Out[10]
df1a

              group
employee    
Bob         Accounting
Jake    Engineering
Lisa    Engineering
Sue                 HR

df2a

        hire_data
employee    
Lisa         2004
Bob             2008
Jake         2012
Sue             2014

pd.merge(df1a,df2a,left_index=True,right_index=True)

              group    hire_data
employee        
Bob         Accounting         2008
Jake    Engineering         2012
Lisa    Engineering         2004
Sue                 HR         2014
  • For convenience, Pandas includes the DataFrame.join() method, which performs an index-based merge without extra keywords.
# In[11]
df1a.join(df2a)
# Out[11]
              group    hire_data
employee        
Bob         Accounting         2008
Jake    Engineering         2012
Lisa    Engineering         2004
Sue                 HR         2014
  • If you'd like to mix indices and columns, you can combine left_index with right_on or left_on with right_index to get the desired behavior.
# In[12]
display('df1a','df3',"pd.merge(df1a,df3,left_index=True,right_on='name')")
# Out[12]
df1a

              group
employee    
Bob         Accounting
Jake    Engineering
Lisa    Engineering
Sue                 HR

df3

    name    salary
0     Bob     70000
1    Jake     80000
2    Lisa    120000
3     Sue     90000

pd.merge(df1a,df3,left_index=True,right_on='name')

          group    name    salary
0     Accounting     Bob     70000
1    Engineering    Jake     80000
2    Engineering    Lisa    120000
3             HR     Sue     90000

Specifying Set Arithmetic for Joins

# In[13]
df6=pd.DataFrame({'name':['Peter','Paul','Mary'],
'food':['fish','beans','bread']})
df7=pd.DataFrame({'name':['Mary','Joseph'],
'drink':['wine','beer']}, columns=['name','drink'])
display('df6','df7',"pd.merge(df6,df7)")
# Out[13]
df6

     name     food
0    Peter     fish
1     Paul    beans
2     Mary    bread

df7

      name    drink
0      Mary     wine
1    Joseph     beer

pd.merge(df6,df7)

    name     food    drink
0    Mary    bread     wine
  • By default, the result contains the intersection of the two sets of inputs; this is what is known as an inner join.
  • We can specify this explicitly using the how keyword, which defaults to inner
# In[14]
pd.merge(df6,df7,how='inner')
# Out[14]
    name     food    drink
0    Mary    bread     wine
  • Other options for the how keyword are outer, left, and right.
  • outer join returns a join over the union of the input columns and fills in missing values with NAs.
# In[15]
display('df6','df7',"pd.merge(df6,df7,how='outer')")
# Out[15]
df6

     name     food
0    Peter     fish
1     Paul    beans
2     Mary    bread

df7

      name    drink
0      Mary     wine
1    Joseph     beer

pd.merge(df6,df7,how='outer')

      name     food    drink
0     Peter     fish      NaN
1      Paul    beans      NaN
2      Mary    bread     wine
3    Joseph      NaN     beer
  • The left and right join return joins over the left entries and right entries, respectively.
# In[16]
display('df6','df7',"pd.merge(df6,df7,how='left')")
# Out[16]
df6

     name     food
0    Peter     fish
1     Paul    beans
2     Mary    bread

df7

      name    drink
0      Mary     wine
1    Joseph     beer

pd.merge(df6,df7,how='left')

     name     food    drink
0    Peter     fish      NaN
1     Paul    beans      NaN
2     Mary    bread     wine
  • The output rows now correspond to the entries in the left input.
# In[17]
display('df6','df7',"pd.merge(df6,df7,how='right')")
# Out[17]
df6

     name    food
0    Peter     fish
1     Paul    beans
2     Mary    bread

df7

      name    drink
0      Mary     wine
1    Joseph     beer

pd.merge(df6,df7,how='right')

      name     food    drink
0      Mary    bread     wine
1    Joseph      NaN     beer

Overlapping Column Names: The suffixes Keyword

# In[18]
df8=pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'],
'rank':[1,2,3,4]})
df9=pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'],
'rank':[3,1,4,2]})
display('df8','df9','pd.merge(df8,df9,on="name")')
# Out[18]
df8

    name rank
0     Bob    1
1    Jake    2
2    Lisa    3
3     Sue    4

df9

    name rank
0     Bob    3
1    Jake    1
2    Lisa    4
3     Sue    2

pd.merge(df8,df9,on="name")

    name    rank_x    rank_y
0     Bob         1         3
1    Jake         2         1
2    Lisa         3         4
3     Sue         4         2
  • Because the output would have two conflicting column names, the merge function automatically appends the suffixes _x and _y to make the output columns unique.
  • If these defaults are inappropriate, it is possible to specify a custom suffix using the suffixes keyword.
# In[19]
pd.merge(df8,df9,on="name",suffixes=["_L","_R"])
# Out[19]
    name    rank_L    rank_R
0     Bob         1         3
1    Jake         2         1
2    Lisa         3         4
3     Sue         4         2

Reference these urls if you want to know about join, merge and concat :

  1. Combining DataFrame
  2. About merge method

'Python > Pandas' 카테고리의 다른 글

9. Pivot Tables  (0) 2025.06.19
8. Aggregation and Grouping  (0) 2025.06.19
6. Combining Datasets: concat and append  (0) 2025.06.19
5. Hierarchical Indexing  (0) 2025.06.19
4. Handling Missing Data  (1) 2025.06.19