Joining DataFrames in Pandas
Joining DataFrames is a crucial operation in data analysis. It enables you to combine multiple data sources for deeper insights. Pandas provides several methods to join DataFrames flexibly. In this guide, we will explore the various ways to join DataFrames using merge()
, join()
, concat()
, and other advanced techniques.
Sample DataFrames
import pandas as pd
# Sample DataFrames
data1 = {'Name': ['John', 'Alice', 'Bob', 'Eve'],
'Age': [25, 30, 22, 35],
'Gender': ['Male', 'Female', 'Male', 'Female']}
df1 = pd.DataFrame(data1)
data2 = {'Name': ['John', 'Alice', 'Bob', 'Charlie'],
'Salary': [50000, 55000, 40000, 48000]}
df2 = pd.DataFrame(data2)
data3 = {'ID': [1, 2, 3, 4],
'Department': ['HR', 'IT', 'Finance', 'Marketing']}
df3 = pd.DataFrame(data3)
print(df1)
print(df2)
print(df3)
1. Joining DataFrames Using merge()
The merge()
function is versatile and offers SQL-like joins, making it ideal for combining DataFrames based on specific columns.
Inner Join
merged_df = pd.merge(df1, df2, on='Name', how='inner')
print(merged_df)
Left Join
left_joined_df = pd.merge(df1, df2, on='Name', how='left')
print(left_joined_df)
Right Join
right_joined_df = pd.merge(df1, df2, on='Name', how='right')
print(right_joined_df)
Outer Join
outer_joined_df = pd.merge(df1, df2, on='Name', how='outer')
print(outer_joined_df)
Multi-Key Join
data4 = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 22],
'Score': [85, 90, 78]}
df4 = pd.DataFrame(data4)
multi_key_merge = pd.merge(df1, df4, on=['Name', 'Age'], how='inner')
print(multi_key_merge)
2. Joining DataFrames Using join()
The join()
method is index-based, making it suitable for combining DataFrames with meaningful indices.
Basic Join
df1.set_index('Name', inplace=True)
df2.set_index('Name', inplace=True)
joined_df = df1.join(df2, how='inner')
print(joined_df)
Joining on Non-Index Columns
reset_df1 = df1.reset_index()
reset_df2 = df2.reset_index()
joined_on_column = reset_df1.join(reset_df2.set_index('Name'), on='Name')
print(joined_on_column)
3. Concatenating DataFrames Using concat()
Vertical Concatenation
vertical_concat = pd.concat([df1.reset_index(), df2.reset_index()], axis=0)
print(vertical_concat)
Horizontal Concatenation
horizontal_concat = pd.concat([df1.reset_index(), df2.reset_index()], axis=1)
print(horizontal_concat)
Concatenation with Keys
concat_with_keys = pd.concat([df1, df2], keys=['Dataset1', 'Dataset2'])
print(concat_with_keys)
4. Cross Join (Cartesian Product)
df1['key'] = 1
df3['key'] = 1
cross_joined_df = pd.merge(df1, df3, on='key').drop('key', axis=1)
print(cross_joined_df)
5. Updating DataFrames Using update()
df1.update(df2)
print(df1)
6. Merging with Conditions (Custom Join)
custom_join = df1[df1['Name'].isin(df2['Name'])]
print(custom_join)
7. Merging Using Indicator
merge_with_indicator = pd.merge(df1, df2, on='Name', how='outer', indicator=True)
print(merge_with_indicator)
8. Using combine_first()
for Hierarchical Data
combined_df = df1.combine_first(df2)
print(combined_df)
Summary
Pandas provides a rich set of tools to join and combine DataFrames, including:
merge()
: SQL-like joins (inner
,left
,right
,outer
,cross
).join()
: Index-based joins.concat()
: Concatenates DataFrames along rows or columns.- Custom Joins: Filter-based merging.
update()
: Updates a DataFrame with another.combine_first()
: Combines two DataFrames, prioritizing the first.