EDS 217 Cheatsheet

Introduction

Data merging and joining are fundamental operations in data science that allow you to combine information from multiple datasets. This cheatsheet covers the essential techniques you’ll need to merge DataFrames effectively using pandas.

Key Concepts:

Merge/Join: Combine two DataFrames based on common columns or indices
Concatenate: Stack DataFrames vertically (rows) or horizontally (columns)
Keys: The columns or indices used to match records between DataFrames
Join Types: Different ways to handle records that don’t have matches

Setup

Let’s start by importing pandas and creating sample datasets to work with:

Code

import pandas as pd

# Create sample DataFrames for examples
students = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'major': ['Biology', 'Chemistry', 'Physics', 'Biology', 'Chemistry']
})

grades = pd.DataFrame({
    'student_id': [1, 2, 3, 6, 7],
    'course': ['Bio101', 'Chem201', 'Phys301', 'Math101', 'Stat201'],
    'grade': ['A', 'B+', 'A-', 'B', 'A']
})

print("Students DataFrame:")
print(students)
print("\nGrades DataFrame:")
print(grades)

Students DataFrame:
   student_id     name      major
0           1    Alice    Biology
1           2      Bob  Chemistry
2           3  Charlie    Physics
3           4    Diana    Biology
4           5      Eve  Chemistry

Grades DataFrame:
   student_id   course grade
0           1   Bio101     A
1           2  Chem201    B+
2           3  Phys301    A-
3           6  Math101     B
4           7  Stat201     A

Types of Joins

Understanding different join types is crucial for merging data correctly:

Inner Join (Default)

Returns only rows that have matching keys in both DataFrames.

Code

# Inner join - only students with grades
inner_result = pd.merge(students, grades, on='student_id')
print("Inner Join Result:")
print(inner_result)

Inner Join Result:
   student_id     name      major   course grade
0           1    Alice    Biology   Bio101     A
1           2      Bob  Chemistry  Chem201    B+
2           3  Charlie    Physics  Phys301    A-

Left Join

Returns all rows from the left DataFrame and matched rows from the right DataFrame.

Code

# Left join - all students, with grades where available
left_result = pd.merge(students, grades, on='student_id', how='left')
print("Left Join Result:")
print(left_result)

Left Join Result:
   student_id     name      major   course grade
0           1    Alice    Biology   Bio101     A
1           2      Bob  Chemistry  Chem201    B+
2           3  Charlie    Physics  Phys301    A-
3           4    Diana    Biology      NaN   NaN
4           5      Eve  Chemistry      NaN   NaN

Right Join

Returns all rows from the right DataFrame and matched rows from the left DataFrame.

Code

# Right join - all grades, with student info where available
right_result = pd.merge(students, grades, on='student_id', how='right')
print("Right Join Result:")
print(right_result)

Right Join Result:
   student_id     name      major   course grade
0           1    Alice    Biology   Bio101     A
1           2      Bob  Chemistry  Chem201    B+
2           3  Charlie    Physics  Phys301    A-
3           6      NaN        NaN  Math101     B
4           7      NaN        NaN  Stat201     A

Outer Join

Returns all rows from both DataFrames, filling missing values with NaN.

Code

# Outer join - all students and all grades
outer_result = pd.merge(students, grades, on='student_id', how='outer')
print("Outer Join Result:")
print(outer_result)

Outer Join Result:
   student_id     name      major   course grade
0           1    Alice    Biology   Bio101     A
1           2      Bob  Chemistry  Chem201    B+
2           3  Charlie    Physics  Phys301    A-
3           4    Diana    Biology      NaN   NaN
4           5      Eve  Chemistry      NaN   NaN
5           6      NaN        NaN  Math101     B
6           7      NaN        NaN  Stat201     A

Merging with Different Column Names

When the key columns have different names in each DataFrame:

Code

# Create DataFrame with different column name
population_data = pd.DataFrame({
    'country_name': ['USA', 'Canada', 'Mexico', 'UK'],
    'population': [331000000, 38000000, 128000000, 67000000]
})

eurovision_data = pd.DataFrame({
    'to_country': ['USA', 'Canada', 'France', 'UK'],  
    'points': [250, 180, 300, 220]
})

# Merge with different column names
merged = pd.merge(eurovision_data, population_data, 
                  left_on='to_country', right_on='country_name')
print("Merged with different column names:")
print(merged)

Merged with different column names:
  to_country  points country_name  population
0        USA     250          USA   331000000
1     Canada     180       Canada    38000000
2         UK     220           UK    67000000

Joining on Index

When you want to combine DataFrames based on their index (row labels), use the .join() method - it’s much simpler than merging:

Code

# Create DataFrames with datetime indices (common in time series data)
dates = pd.date_range('2023-01-01', periods=4, freq='D')

temperature_df = pd.DataFrame({
    'temp_celsius': [2.5, 3.1, 4.2, 2.8]
}, index=dates)

precipitation_df = pd.DataFrame({
    'precip_mm': [0.0, 12.5, 8.2, 0.0]
}, index=dates)

print("Temperature data:")
print(temperature_df)
print("\nPrecipitation data:")
print(precipitation_df)

# Join DataFrames on their index - simple!
weather_data = temperature_df.join(precipitation_df)
print("\nJoined weather data:")
print(weather_data)

Temperature data:
            temp_celsius
2023-01-01           2.5
2023-01-02           3.1
2023-01-03           4.2
2023-01-04           2.8

Precipitation data:
            precip_mm
2023-01-01        0.0
2023-01-02       12.5
2023-01-03        8.2
2023-01-04        0.0

Joined weather data:
            temp_celsius  precip_mm
2023-01-01           2.5        0.0
2023-01-02           3.1       12.5
2023-01-03           4.2        8.2
2023-01-04           2.8        0.0

Concatenating DataFrames

Use pd.concat() to stack DataFrames vertically or horizontally:

Vertical Concatenation (Stacking Rows)

Code

# Create additional data for different years
df_2023 = pd.DataFrame({
    'zipcode': ['90210', '10001', '60601'],
    'temp_min': [-5, -10, -15],
    'year': [2023, 2023, 2023]
})

df_2012 = pd.DataFrame({
    'zipcode': ['90210', '10001', '60601'], 
    'temp_min': [-8, -12, -18],
    'year': [2012, 2012, 2012]
})

# Concatenate vertically (stack rows on top of each other)
combined_years = pd.concat([df_2023, df_2012])
print("Vertically concatenated:")
print(combined_years)

Vertically concatenated:
  zipcode  temp_min  year
0   90210        -5  2023
1   10001       -10  2023
2   60601       -15  2023
0   90210        -8  2012
1   10001       -12  2012
2   60601       -18  2012

Horizontal Concatenation (Adding Columns)

Code

# Create separate DataFrames with same index
df_left = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df_right = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})

# Concatenate horizontally (add columns)
combined_cols = pd.concat([df_left, df_right], axis=1)
print("Horizontally concatenated:")
print(combined_cols)

Horizontally concatenated:
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12

Real-World Examples

Example 1: Eurovision Analysis (from eod-day6)

Code

# Simulate the Eurovision data merging scenario
eurovision_df = pd.DataFrame({
    'to_country': ['Sweden', 'Norway', 'Finland', 'Denmark'],
    'points_final': [280, 245, 190, 165],
    'year': [2023, 2023, 2023, 2023]
})

population_df = pd.DataFrame({
    'country_name': ['Sweden', 'Norway', 'Finland', 'Iceland'],
    'population': [10415000, 5380000, 5530000, 370000]
})

# Merge Eurovision data with population data  
merged_df = pd.merge(eurovision_df, population_df, 
                     left_on='to_country', right_on='country_name')
print("Eurovision-Population merge:")
print(merged_df)

Eurovision-Population merge:
  to_country  points_final  year country_name  population
0     Sweden           280  2023       Sweden    10415000
1     Norway           245  2023       Norway     5380000
2    Finland           190  2023      Finland     5530000

Example 2: Climate Data Analysis (from 6b_advanced_data_manipulation)

Code

# Create simple temperature and CO2 data
temp_df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr'],
    'temperature': [0.8, 1.2, 0.9, 1.1]
})

co2_df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr'], 
    'co2_level': [420, 422, 419, 421]
})

# Merge on the month column
combined_df = pd.merge(temp_df, co2_df, on='month')
print("Climate data merge:")
print(combined_df)

Climate data merge:
  month  temperature  co2_level
0   Jan          0.8        420
1   Feb          1.2        422
2   Mar          0.9        419
3   Apr          1.1        421

Handling Common Issues

Missing Values After Merge

Code

# Create DataFrames with some non-matching records
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'C', 'D'], 'val2': [10, 30, 40]})

# Outer join creates NaN values
result = pd.merge(df1, df2, on='key', how='outer')
print("Result with NaN values:")
print(result)

# Handle missing values - fill all NaN with 0
result_filled = result.fillna(0)
print("\nAfter filling NaN with 0:")
print(result_filled)

# Alternative: Drop rows with any NaN values
result_clean = result.dropna()
print("\nAfter dropping rows with NaN:")
print(result_clean)

Result with NaN values:
  key  val1  val2
0   A   1.0  10.0
1   B   2.0   NaN
2   C   3.0  30.0
3   D   NaN  40.0

After filling NaN with 0:
  key  val1  val2
0   A   1.0  10.0
1   B   2.0   0.0
2   C   3.0  30.0
3   D   0.0  40.0

After dropping rows with NaN:
  key  val1  val2
0   A   1.0  10.0
2   C   3.0  30.0

Duplicate Column Names

Code

# DataFrames with overlapping column names  
df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2], 'extra': ['x', 'y']})
df2 = pd.DataFrame({'key': ['A', 'B'], 'value': [10, 20], 'other': ['p', 'q']})

# Pandas automatically adds suffixes
result = pd.merge(df1, df2, on='key')
print("Result with automatic suffixes:")
print(result)

# Custom suffixes
result_custom = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))
print("\nResult with custom suffixes:")
print(result_custom)

Result with automatic suffixes:
  key  value_x extra  value_y other
0   A        1     x       10     p
1   B        2     y       20     q

Result with custom suffixes:
  key  value_left extra  value_right other
0   A           1     x           10     p
1   B           2     y           20     q

Tips and Best Practices

1. Always Inspect Your Data First

# Check the merge keys before merging
print("Unique keys in df1:", df1['key'].unique())
print("Unique keys in df2:", df2['key'].unique())

# Check for duplicates and null values
print("Duplicates in df1:", df1['key'].duplicated().any())
print("Null values in df1:", df1['key'].isnull().any())
print("Null values in df2:", df2['key'].isnull().any())

# Check data types to ensure compatibility
print("df1 key dtype:", df1['key'].dtype)
print("df2 key dtype:", df2['key'].dtype)

2. Validate Your Merge Results

# Check the shape before and after merge
print(f"Before: df1 has {df1.shape[0]} rows, df2 has {df2.shape[0]} rows")
print(f"After: merged has {merged.shape[0]} rows")

# Check for unexpected NaN values
print("NaN count after merge:")
print(merged.isnull().sum())

# Check if merge worked as expected
print("First few rows of result:")
print(merged.head())

3. Start Simple, Then Explore

# For beginners: Start with basic merges and check your results
# Once comfortable, you can explore more advanced options

# Basic merge (most common)
result = pd.merge(df1, df2, on='key')

# Specify the type of join if needed
result = pd.merge(df1, df2, on='key', how='left')

# Always look at your results to make sure they make sense!
print(result.head())

Quick Reference

Operation	Syntax	Use Case
Inner join	`pd.merge(df1, df2, on='key')`	Only matching records
Left join	`pd.merge(df1, df2, on='key', how='left')`	All records from left DF
Right join	`pd.merge(df1, df2, on='key', how='right')`	All records from right DF
Outer join	`pd.merge(df1, df2, on='key', how='outer')`	All records from both DFs
Different keys	`pd.merge(df1, df2, left_on='key1', right_on='key2')`	Key columns have different names
Concatenate rows	`pd.concat([df1, df2])`	Stack DataFrames vertically
Concatenate columns	`pd.concat([df1, df2], axis=1)`	Add columns side by side

Adding New Columns After Merge

# Add calculated columns after merging
merged = pd.merge(eurovision_df, population_df, 
                  left_on='to_country', right_on='country_name')

# Calculate new columns using simple operations
merged['points_per_capita'] = merged['points_final'] / merged['population'] * 1000000

# Create categories using simple if-else logic
def categorize_year(year):
    if year >= 2020:
        return '2020s'
    else:
        return '2010s'

merged['year_category'] = merged['year'].apply(categorize_year)

Common Patterns in Environmental Data Science

Time Series Joining: Merge climate data from different sources by date
Spatial Joining: Combine location-based data (zip codes, coordinates)
Categorical Joining: Link datasets using country names, species codes, etc.
Multi-year Analysis: Concatenate data from different years for trend analysis
Sensor Data Integration: Combine readings from multiple environmental sensors

Common Troubleshooting

# If your merge doesn't work as expected, try these steps:

# 1. Check what's in your key columns
print("Keys in df1:", df1['key'].unique())
print("Keys in df2:", df2['key'].unique())

# 2. Make sure there are no missing values in your key columns
print("Missing values in df1 key:", df1['key'].isnull().sum())
print("Missing values in df2 key:", df2['key'].isnull().sum())

# 3. If you get fewer rows than expected, try an outer join to see all data
result_all = pd.merge(df1, df2, on='key', how='outer')
print("All possible combinations:")
print(result_all)