Data merging and joining are fundamental operations in data science that allow you to combine information from multiple datasets. This cheatsheet covers the essential techniques youβll need to merge DataFrames effectively using pandas.
Key Concepts:
Merge/Join: Combine two DataFrames based on common columns or indices
Concatenate: Stack DataFrames vertically (rows) or horizontally (columns)
Keys: The columns or indices used to match records between DataFrames
Join Types: Different ways to handle records that donβt have matches
Setup
Letβs start by importing pandas and creating sample datasets to work with:
Students DataFrame:
student_id name major
0 1 Alice Biology
1 2 Bob Chemistry
2 3 Charlie Physics
3 4 Diana Biology
4 5 Eve Chemistry
Grades DataFrame:
student_id course grade
0 1 Bio101 A
1 2 Chem201 B+
2 3 Phys301 A-
3 6 Math101 B
4 7 Stat201 A
Types of Joins
Understanding different join types is crucial for merging data correctly:
Inner Join (Default)
Returns only rows that have matching keys in both DataFrames.
Code
# Inner join - only students with gradesinner_result = pd.merge(students, grades, on='student_id')print("Inner Join Result:")print(inner_result)
Inner Join Result:
student_id name major course grade
0 1 Alice Biology Bio101 A
1 2 Bob Chemistry Chem201 B+
2 3 Charlie Physics Phys301 A-
Left Join
Returns all rows from the left DataFrame and matched rows from the right DataFrame.
Code
# Left join - all students, with grades where availableleft_result = pd.merge(students, grades, on='student_id', how='left')print("Left Join Result:")print(left_result)
Left Join Result:
student_id name major course grade
0 1 Alice Biology Bio101 A
1 2 Bob Chemistry Chem201 B+
2 3 Charlie Physics Phys301 A-
3 4 Diana Biology NaN NaN
4 5 Eve Chemistry NaN NaN
Right Join
Returns all rows from the right DataFrame and matched rows from the left DataFrame.
Code
# Right join - all grades, with student info where availableright_result = pd.merge(students, grades, on='student_id', how='right')print("Right Join Result:")print(right_result)
Right Join Result:
student_id name major course grade
0 1 Alice Biology Bio101 A
1 2 Bob Chemistry Chem201 B+
2 3 Charlie Physics Phys301 A-
3 6 NaN NaN Math101 B
4 7 NaN NaN Stat201 A
Outer Join
Returns all rows from both DataFrames, filling missing values with NaN.
Code
# Outer join - all students and all gradesouter_result = pd.merge(students, grades, on='student_id', how='outer')print("Outer Join Result:")print(outer_result)
Outer Join Result:
student_id name major course grade
0 1 Alice Biology Bio101 A
1 2 Bob Chemistry Chem201 B+
2 3 Charlie Physics Phys301 A-
3 4 Diana Biology NaN NaN
4 5 Eve Chemistry NaN NaN
5 6 NaN NaN Math101 B
6 7 NaN NaN Stat201 A
Merging with Different Column Names
When the key columns have different names in each DataFrame:
Code
# Create DataFrame with different column namepopulation_data = pd.DataFrame({'country_name': ['USA', 'Canada', 'Mexico', 'UK'],'population': [331000000, 38000000, 128000000, 67000000]})eurovision_data = pd.DataFrame({'to_country': ['USA', 'Canada', 'France', 'UK'], 'points': [250, 180, 300, 220]})# Merge with different column namesmerged = pd.merge(eurovision_data, population_data, left_on='to_country', right_on='country_name')print("Merged with different column names:")print(merged)
Merged with different column names:
to_country points country_name population
0 USA 250 USA 331000000
1 Canada 180 Canada 38000000
2 UK 220 UK 67000000
Joining on Index
When you want to combine DataFrames based on their index (row labels), use the .join() method - itβs much simpler than merging:
Code
# Create DataFrames with datetime indices (common in time series data)dates = pd.date_range('2023-01-01', periods=4, freq='D')temperature_df = pd.DataFrame({'temp_celsius': [2.5, 3.1, 4.2, 2.8]}, index=dates)precipitation_df = pd.DataFrame({'precip_mm': [0.0, 12.5, 8.2, 0.0]}, index=dates)print("Temperature data:")print(temperature_df)print("\nPrecipitation data:")print(precipitation_df)# Join DataFrames on their index - simple!weather_data = temperature_df.join(precipitation_df)print("\nJoined weather data:")print(weather_data)
Horizontally concatenated:
A B C D
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
Real-World Examples
Example 1: Eurovision Analysis (from eod-day6)
Code
# Simulate the Eurovision data merging scenarioeurovision_df = pd.DataFrame({'to_country': ['Sweden', 'Norway', 'Finland', 'Denmark'],'points_final': [280, 245, 190, 165],'year': [2023, 2023, 2023, 2023]})population_df = pd.DataFrame({'country_name': ['Sweden', 'Norway', 'Finland', 'Iceland'],'population': [10415000, 5380000, 5530000, 370000]})# Merge Eurovision data with population data merged_df = pd.merge(eurovision_df, population_df, left_on='to_country', right_on='country_name')print("Eurovision-Population merge:")print(merged_df)
Eurovision-Population merge:
to_country points_final year country_name population
0 Sweden 280 2023 Sweden 10415000
1 Norway 245 2023 Norway 5380000
2 Finland 190 2023 Finland 5530000
Example 2: Climate Data Analysis (from 6b_advanced_data_manipulation)
Code
# Create simple temperature and CO2 datatemp_df = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar', 'Apr'],'temperature': [0.8, 1.2, 0.9, 1.1]})co2_df = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar', 'Apr'], 'co2_level': [420, 422, 419, 421]})# Merge on the month columncombined_df = pd.merge(temp_df, co2_df, on='month')print("Climate data merge:")print(combined_df)
Climate data merge:
month temperature co2_level
0 Jan 0.8 420
1 Feb 1.2 422
2 Mar 0.9 419
3 Apr 1.1 421
Handling Common Issues
Missing Values After Merge
Code
# Create DataFrames with some non-matching recordsdf1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'val1': [1, 2, 3]})df2 = pd.DataFrame({'key': ['A', 'C', 'D'], 'val2': [10, 30, 40]})# Outer join creates NaN valuesresult = pd.merge(df1, df2, on='key', how='outer')print("Result with NaN values:")print(result)# Handle missing values - fill all NaN with 0result_filled = result.fillna(0)print("\nAfter filling NaN with 0:")print(result_filled)# Alternative: Drop rows with any NaN valuesresult_clean = result.dropna()print("\nAfter dropping rows with NaN:")print(result_clean)
Result with NaN values:
key val1 val2
0 A 1.0 10.0
1 B 2.0 NaN
2 C 3.0 30.0
3 D NaN 40.0
After filling NaN with 0:
key val1 val2
0 A 1.0 10.0
1 B 2.0 0.0
2 C 3.0 30.0
3 D 0.0 40.0
After dropping rows with NaN:
key val1 val2
0 A 1.0 10.0
2 C 3.0 30.0
Result with automatic suffixes:
key value_x extra value_y other
0 A 1 x 10 p
1 B 2 y 20 q
Result with custom suffixes:
key value_left extra value_right other
0 A 1 x 10 p
1 B 2 y 20 q
Tips and Best Practices
1. Always Inspect Your Data First
# Check the merge keys before mergingprint("Unique keys in df1:", df1['key'].unique())print("Unique keys in df2:", df2['key'].unique())# Check for duplicates and null valuesprint("Duplicates in df1:", df1['key'].duplicated().any())print("Null values in df1:", df1['key'].isnull().any())print("Null values in df2:", df2['key'].isnull().any())# Check data types to ensure compatibilityprint("df1 key dtype:", df1['key'].dtype)print("df2 key dtype:", df2['key'].dtype)
2. Validate Your Merge Results
# Check the shape before and after mergeprint(f"Before: df1 has {df1.shape[0]} rows, df2 has {df2.shape[0]} rows")print(f"After: merged has {merged.shape[0]} rows")# Check for unexpected NaN valuesprint("NaN count after merge:")print(merged.isnull().sum())# Check if merge worked as expectedprint("First few rows of result:")print(merged.head())
3. Start Simple, Then Explore
# For beginners: Start with basic merges and check your results# Once comfortable, you can explore more advanced options# Basic merge (most common)result = pd.merge(df1, df2, on='key')# Specify the type of join if neededresult = pd.merge(df1, df2, on='key', how='left')# Always look at your results to make sure they make sense!print(result.head())
# Add calculated columns after mergingmerged = pd.merge(eurovision_df, population_df, left_on='to_country', right_on='country_name')# Calculate new columns using simple operationsmerged['points_per_capita'] = merged['points_final'] / merged['population'] *1000000# Create categories using simple if-else logicdef categorize_year(year):if year >=2020:return'2020s'else:return'2010s'merged['year_category'] = merged['year'].apply(categorize_year)
Common Patterns in Environmental Data Science
Time Series Joining: Merge climate data from different sources by date
Spatial Joining: Combine location-based data (zip codes, coordinates)
Categorical Joining: Link datasets using country names, species codes, etc.
Multi-year Analysis: Concatenate data from different years for trend analysis
Sensor Data Integration: Combine readings from multiple environmental sensors
Common Troubleshooting
# If your merge doesn't work as expected, try these steps:# 1. Check what's in your key columnsprint("Keys in df1:", df1['key'].unique())print("Keys in df2:", df2['key'].unique())# 2. Make sure there are no missing values in your key columnsprint("Missing values in df1 key:", df1['key'].isnull().sum())print("Missing values in df2 key:", df2['key'].isnull().sum())# 3. If you get fewer rows than expected, try an outer join to see all dataresult_all = pd.merge(df1, df2, on='key', how='outer')print("All possible combinations:")print(result_all)