Interactive Session


Getting Started

Before we begin our interactive session, please follow these steps to set up your Jupyter Notebook:

Open JupyterLab and create a new notebook:
- Click on the + button in the top left corner
- Select Python 3.11.0 from the Notebook options
Rename your notebook:
- Right-click on the Untitled.ipynb tab
- Select “Rename”
- Name your notebook with the format: Session_XY_Topic.ipynb (Replace X with the day number and Y with the session number)
Add a title cell:
- In the first cell of your notebook, change the cell type to “Markdown”
- Add the following content (adjust the title, link, and date to match the current session):

# Day 4: Session A - Dataframes

[Link to session webpage](https://eds-217-essential-python.github.io/course-materials/interactive-sessions/4a_dataframes.html)

Date: 09/05/2025

Add a code cell:
- Below the title cell, add a new cell
- Ensure it’s set as a “Code” cell
- This will be where you start writing your Python code for the session
Throughout the session:
- Take notes in Markdown cells
- Copy or write code in Code cells
- Run cells to test your code
- Ask questions if you need clarification

Caution

Remember to save your work frequently by clicking the save icon or using the keyboard shortcut (Ctrl+S or Cmd+S).

Let’s begin our interactive session!

Introduction

In this interactive session, we’ll explore the basics of working with pandas DataFrames using a dataset of world cities. We’ll cover importing data, basic DataFrame operations, and essential methods for data exploration and manipulation. This session will prepare you for more advanced data analysis tasks and upcoming collaborative coding exercises.

Learning Objectives

By the end of this session, you will be able to:

- Import data into a pandas DataFrame
- Explore basic DataFrame properties and methods
- Perform simple data filtering and selection operations
- Use basic aggregation and grouping functions

Setting Up

Let’s start by importing the pandas library and loading our dataset.

Code

import pandas as pd
import numpy as np

1. Basic Data Importing

Code

url = "https://raw.githubusercontent.com/datasets/world-cities/master/data/world-cities.csv"
cities_df = pd.read_csv(url)
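
If the URL cannot be reached (for example, when working offline), the same call can be tried on an in-memory CSV. The sketch below is only an illustration: the name `sample_csv`, the three rows, and the placeholder `geonameid` values are made up for this example and are not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical sample with the same four columns as the world-cities file.
# The geonameid values are placeholders, not real identifiers.
sample_csv = io.StringIO(
    "name,country,subcountry,geonameid\n"
    "Toronto,Canada,Ontario,1\n"
    "Terrace,Canada,British Columbia,2\n"
    "Harare,Zimbabwe,,3\n"
)

# read_csv accepts a URL, a file path, or a file-like object
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)
print(sample_df.dtypes)
```

The empty `subcountry` field in the last row is read as a missing value (NaN), which is the kind of gap handled later in the session.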

2. Basic DataFrame Exploration

Viewing the Data

Let’s take a look at the first few rows of our DataFrame:

Code

print(cities_df.head())

                 name               country          subcountry  geonameid
0        les Escaldes               Andorra  Escaldes-Engordany    3040051
1    Andorra la Vella               Andorra    Andorra la Vella    3041563
2             Warīsān  United Arab Emirates               Dubai     290503
3          Umm Suqaym  United Arab Emirates               Dubai     290581
4  Umm Al Quwain City  United Arab Emirates        UmmalQaywayn     290594

To see the last few rows, we can use:

Code

print(cities_df.tail())

                         name   country                   subcountry  \
32395                 Bindura  Zimbabwe          Mashonaland Central   
32396              Beitbridge  Zimbabwe  Matabeleland South Province   
32397                 Epworth  Zimbabwe                       Harare   
32398             Chitungwiza  Zimbabwe                       Harare   
32399  Harare Western Suburbs  Zimbabwe             Mashonaland West   

       geonameid  
32395     895061  
32396     895269  
32397    1085510  
32398    1106542  
32399   13132735
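
Both `head()` and `tail()` take an optional number of rows, and `sample()` draws random rows. A minimal sketch, using a small made-up frame (`demo_df`) in place of `cities_df`:

```python
import pandas as pd

# Small made-up frame standing in for cities_df
demo_df = pd.DataFrame({"name": list("ABCDEFGH"), "row_id": range(8)})

first_three = demo_df.head(3)   # first 3 rows instead of the default 5
last_two = demo_df.tail(2)      # last 2 rows

# 3 random rows; random_state makes the draw repeatable
random_three = demo_df.sample(n=3, random_state=0)

print(first_three)
print(last_two)
print(random_three)
```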

DataFrame Properties

Now, let’s explore some basic properties of our DataFrame:

Code

# Number of rows and columns
print("Shape:", cities_df.shape)

# Column names
print("\nColumns:", cities_df.columns)

# Data types of each column
print("\nData types:\n", cities_df.dtypes)

# Summary statistics of numeric columns (if any)
print("\nSummary statistics:\n", cities_df.describe())

Shape: (32400, 4)

Columns: Index(['name', 'country', 'subcountry', 'geonameid'], dtype='object')

Data types:
 name          object
country       object
subcountry    object
geonameid      int64
dtype: object

Summary statistics:
           geonameid
count  3.240000e+04
mean   3.355243e+06
std    2.974148e+06
min    4.900000e+02
25%    1.277806e+06
50%    2.641902e+06
75%    3.846877e+06
max    1.351271e+07
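
By default `describe()` on a DataFrame summarizes only numeric columns, which is why only `geonameid` appears above. Calling it on a single text column gives a different summary. A small sketch with a made-up frame (`demo_df` is an assumption, not the course data):

```python
import pandas as pd

# Made-up frame with text columns only
demo_df = pd.DataFrame({
    "name": ["Toronto", "Terrace", "Harare"],
    "country": ["Canada", "Canada", "Zimbabwe"],
})

# For a text column, describe() reports count, unique, top, and freq
country_summary = demo_df["country"].describe()
print(country_summary)

# info() prints column dtypes and non-null counts in one overview
demo_df.info()
```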

Checking for Missing Values

It’s important to identify any missing data in your DataFrame:

Code

print(cities_df.isnull().sum())

name            0
country         0
subcountry    117
geonameid       0
dtype: int64
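
Beyond counting, it can help to look at the rows that contain missing values. The sketch below uses `isna()`, which is equivalent to `isnull()`, on a small made-up frame; the rows in `demo_df` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing subcountry
demo_df = pd.DataFrame({
    "name": ["Toronto", "Harare", "Monaco"],
    "subcountry": ["Ontario", "Harare", np.nan],
})

# Boolean mask: True where subcountry is missing
missing_mask = demo_df["subcountry"].isna()

# Rows that have no subcountry
missing_rows = demo_df[missing_mask]
print(missing_rows)

# Fraction of missing values per column (mean of True/False values)
missing_fraction = demo_df.isna().mean()
print(missing_fraction)
```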

3. Basic Cleaning

Remove rows with missing data in subcountry using dropna() and the subset argument.

Code

cities_df = cities_df.dropna(subset=['subcountry'])
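
One detail worth knowing: `dropna()` keeps the original row labels, so the index has gaps afterwards. A minimal sketch on a made-up frame (the names `demo_df`, `cleaned_df`, and `renumbered_df` are chosen for this example):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing subcountry
demo_df = pd.DataFrame({
    "name": ["Toronto", "Monaco", "Harare"],
    "subcountry": ["Ontario", np.nan, "Harare"],
})

cleaned_df = demo_df.dropna(subset=["subcountry"])

print(len(demo_df), "->", len(cleaned_df))  # one row removed
print(cleaned_df.index.tolist())            # original labels kept: [0, 2]

# Optional: renumber the rows 0..n-1 after dropping
renumbered_df = cleaned_df.reset_index(drop=True)
print(renumbered_df.index.tolist())         # [0, 1]
```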

4. Basic Data Selection and Filtering

Selecting Columns

To select specific columns:

Code

# Select a single column
print(cities_df['name'].head())

# Select multiple columns
print(cities_df[['name', 'country', 'subcountry']].head())

0          les Escaldes
1      Andorra la Vella
2               Warīsān
3            Umm Suqaym
4    Umm Al Quwain City
Name: name, dtype: object
                 name               country          subcountry
0        les Escaldes               Andorra  Escaldes-Engordany
1    Andorra la Vella               Andorra    Andorra la Vella
2             Warīsān  United Arab Emirates               Dubai
3          Umm Suqaym  United Arab Emirates               Dubai
4  Umm Al Quwain City  United Arab Emirates        UmmalQaywayn
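
The two selections above return different types: single brackets give a Series, while a list of column names gives a DataFrame. The sketch below shows this on a made-up frame and adds `.loc` as an alternative way to pick columns by name.

```python
import pandas as pd

# Made-up frame standing in for cities_df
demo_df = pd.DataFrame({
    "name": ["Toronto", "Terrace"],
    "country": ["Canada", "Canada"],
    "subcountry": ["Ontario", "British Columbia"],
})

one_col = demo_df["name"]        # single brackets -> Series
one_col_df = demo_df[["name"]]   # list of names -> one-column DataFrame
print(type(one_col).__name__, type(one_col_df).__name__)

# .loc selects by row labels and column names in one step
subset = demo_df.loc[:, ["name", "subcountry"]]
print(subset)
```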

Filtering Rows

We can filter rows based on conditions:

Code

# Cities in the United States
us_cities = cities_df[cities_df['country'] == 'United States']
print(us_cities[['name', 'country']].head())

# Cities in California
california_cities = cities_df[(cities_df['country'] == 'United States') & (cities_df['subcountry'] == 'California')]
print(california_cities[['name', 'country', 'subcountry']].head())

             name        country
28010   Fort Hunt  United States
28011    Bessemer  United States
28012     Paducah  United States
28013  Birmingham  United States
28014     Cordova  United States
                name        country  subcountry
30423       Fillmore  United States  California
30472       Adelanto  United States  California
30473         Agoura  United States  California
30474   Agoura Hills  United States  California
30475  Agua Caliente  United States  California
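
Two other common ways to filter rows are `isin()`, for matching against a list of values, and `query()`, which takes the condition as a string. A short sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame standing in for cities_df
demo_df = pd.DataFrame({
    "name": ["Toronto", "Fillmore", "Harare", "Bessemer"],
    "country": ["Canada", "United States", "Zimbabwe", "United States"],
})

# isin() keeps rows whose value appears in the list
north_america = demo_df[demo_df["country"].isin(["Canada", "United States"])]
print(north_america)

# query() expresses the same kind of filter as a string
us_only = demo_df.query("country == 'United States'")
print(us_only)
```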

Combining Conditions

We can use logical operators to combine multiple conditions:

Code

# Cities in Canada that start with the letter 'T'
canadian_t_cities = cities_df[(cities_df['country'] == 'Canada') & (cities_df['name'].str.startswith('T'))]
print(canadian_t_cities[['name', 'country', 'subcountry']])

                        name country        subcountry
4172  Tam O'Shanter-Sullivan  Canada           Ontario
4173                Tecumseh  Canada           Ontario
4174           Templeton-Est  Canada            Quebec
4175                 Terrace  Canada  British Columbia
4176              Terrebonne  Canada            Quebec
4177             The Beaches  Canada           Ontario
4178                 Thorold  Canada           Ontario
4179             Thunder Bay  Canada           Ontario
4180             Tillsonburg  Canada           Ontario
4181                 Timmins  Canada           Ontario
4182                 Toronto  Canada           Ontario
4183          Trois-Rivières  Canada            Quebec
4184              Tsawwassen  Canada  British Columbia
4228          Thetford-Mines  Canada            Quebec
4243       Trinity-Bellwoods  Canada           Ontario
4274           Taylor-Massey  Canada           Ontario
4288        Thorncliffe Park  Canada           Ontario
4333                Townline  Canada  British Columbia
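
Besides `&` (and), pandas conditions can be combined with `|` (or) and negated with `~` (not). Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, and each comparison needs its own parentheses when written inline. A minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up frame standing in for cities_df
demo_df = pd.DataFrame({
    "name": ["Toronto", "Terrace", "Ottawa", "Harare"],
    "country": ["Canada", "Canada", "Canada", "Zimbabwe"],
    "subcountry": ["Ontario", "British Columbia", "Ontario", "Harare"],
})

# Naming each condition keeps the combined filter readable
is_canada = demo_df["country"] == "Canada"
is_ontario = demo_df["subcountry"] == "Ontario"
starts_with_t = demo_df["name"].str.startswith("T")

# Canadian cities that are NOT in Ontario
canada_not_ontario = demo_df[is_canada & ~is_ontario]
print(canada_not_ontario)

# Cities that start with 'T' OR are in Zimbabwe
t_or_zimbabwe = demo_df[starts_with_t | (demo_df["country"] == "Zimbabwe")]
print(t_or_zimbabwe)
```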

5. Basic Sorting and Ranking

To sort the DataFrame based on one or more columns:

Code

# Sort cities alphabetically
sorted_cities = cities_df.sort_values('name')
print(sorted_cities[['name', 'country']].head())

# Sort cities by country, then by name
sorted_cities_by_country = cities_df.sort_values(['country', 'name'])
print(sorted_cities_by_country[['name', 'country']].head())

                      name      country
22330       's-Gravenzande  Netherlands
22329     's-Hertogenbosch  Netherlands
25815            'Ārdamatā        Sudan
9286   6th of October City        Egypt
9917              A Coruña        Spain
         name      country
112   Andkhōy  Afghanistan
111  Asadābād  Afghanistan
72      Aībak  Afghanistan
108   Baghlān  Afghanistan
107     Balkh  Afghanistan
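
The examples above sort in ascending order. The sketch below shows a descending sort and a simple ranking with `rank()`. The city counts are copied from the grouping output in this session; the column names `n_cities` and `count_rank` are chosen for this example.

```python
import pandas as pd

# City counts per country, copied from the grouping output in this session
demo_df = pd.DataFrame({
    "country": ["Japan", "India", "China", "United States", "Brazil"],
    "n_cities": [1293, 3767, 2012, 3368, 2269],
})

# Largest first
by_count = demo_df.sort_values("n_cities", ascending=False)
print(by_count)

# rank() assigns a position to each row; with ascending=False, 1 = largest.
# rank() returns floats, so convert to int (safe here: no ties or missing values)
demo_df["count_rank"] = demo_df["n_cities"].rank(ascending=False).astype(int)
print(demo_df)
```

Unlike `sort_values()`, `rank()` leaves the row order unchanged and only records each row's position.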

6. Basic Transformations

Creating New Columns

We can create new columns based on existing data:

Code

# Create a column for city name length
cities_df['name_length'] = cities_df['name'].str.len()

# Display the top 5 cities with the longest names
long_named_cities = cities_df.nlargest(5, 'name_length')
print(long_named_cities[['name', 'country', 'name_length']])

                                                    name        country  \
23661  Karachi University Employees Co-operative Hous...       Pakistan   
3811     Setor Complementar de Indústria e Abastecimento         Brazil   
31219      Diamond Head / Kapahulu / Saint Louis Heights  United States   
8342             Universitäts- und Hansestadt Greifswald        Germany   
31371             Aliamanu / Salt Lakes / Foster Village  United States   

       name_length  
23661           57  
3811            47  
31219           45  
8342            39  
31371           38
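
New columns can also hold boolean flags or combine several existing columns. The following is a small sketch on made-up rows; the column names `multi_word` and `label` are assumptions for illustration.

```python
import pandas as pd

# Made-up frame standing in for cities_df
demo_df = pd.DataFrame({
    "name": ["Toronto", "Thunder Bay", "Timmins"],
    "subcountry": ["Ontario", "Ontario", "Ontario"],
})

# Boolean column: does the name contain a space?
demo_df["multi_word"] = demo_df["name"].str.contains(" ")

# Combine two text columns into one label
demo_df["label"] = demo_df["name"] + ", " + demo_df["subcountry"]

# assign() returns a new DataFrame and leaves demo_df unchanged
with_length = demo_df.assign(name_length=demo_df["name"].str.len())
print(with_length)
```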

Adding Series as Columns: Index Alignment

When adding a Series as a new column, pandas aligns values by index label, not by position; labels with no match become NaN:

Code

# This works - Series index matches DataFrame index
sample_series = pd.Series([1, 2, 3], index=[0, 1, 2])
small_df = pd.DataFrame({'A': ['x', 'y', 'z']})
small_df['B'] = sample_series
print("Aligned correctly:")
print(small_df)

# This creates unexpected results - misaligned indices
misaligned_series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
small_df['C'] = misaligned_series  # Results in NaN values!
print("\nMisaligned indices:")
print(small_df)

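# .reindex() conforms the Series to the DataFrame's index; labels missing from the
# Series get fill_value, so the original values 10, 20, 30 are not carried over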
small_df['D'] = misaligned_series.reindex(small_df.index, fill_value=0)
print("\nCorrected alignment:")
print(small_df)

Aligned correctly:
   A  B
0  x  1
1  y  2
2  z  3

Misaligned indices:
   A  B   C
0  x  1 NaN
1  y  2 NaN
2  z  3 NaN

Corrected alignment:
   A  B   C  D
0  x  1 NaN  0
1  y  2 NaN  0
2  z  3 NaN  0
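
Index alignment matters in practice because a DataFrame's index can have gaps, for example after `dropna()`. If the intent is to assign values by position, one option is to strip the Series' index with `to_numpy()`. A minimal sketch; `gap_df` and `new_values` are made-up names for this example.

```python
import pandas as pd

# A frame whose index has a gap, as can happen after dropna()
gap_df = pd.DataFrame({"A": ["x", "y", "z"]}, index=[0, 2, 3])
new_values = pd.Series([10, 20, 30])  # default index 0, 1, 2

# Label alignment: only labels 0 and 2 exist in both, so label 3 gets NaN
gap_df["by_label"] = new_values

# Positional assignment: a NumPy array has no index to align on
gap_df["by_position"] = new_values.to_numpy()

print(gap_df)
```

Positional assignment requires the array length to match the number of rows; otherwise pandas raises an error.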

7-8. Basic Grouping and Aggregation

Grouping allows us to perform operations on subsets of the data:

Code

# Number of cities by country
cities_per_country = cities_df.groupby('country')['name'].count().sort_values(ascending=False)
print(cities_per_country.head())

# Number of subcountries (e.g., states, provinces) by country
subcountries_per_country = cities_df.groupby('country')['subcountry'].nunique().sort_values(ascending=False)
print(subcountries_per_country.head())

country
India            3767
United States    3368
Brazil           2269
China            2012
Japan            1293
Name: name, dtype: int64
country
Russian Federation    83
Türkiye               81
Thailand              75
Algeria               53
United States         51
Name: subcountry, dtype: int64
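
The two summaries above can also be computed in a single step with `agg()` and named aggregation. The sketch below uses a few made-up rows; the output column names `n_cities` and `n_subcountries` are chosen for this example.

```python
import pandas as pd

# Made-up frame standing in for cities_df
demo_df = pd.DataFrame({
    "name": ["Toronto", "Terrace", "Timmins", "Harare", "Bindura"],
    "country": ["Canada", "Canada", "Canada", "Zimbabwe", "Zimbabwe"],
    "subcountry": ["Ontario", "British Columbia", "Ontario",
                   "Harare", "Mashonaland Central"],
})

# Named aggregation: each keyword is output_column=(input_column, function)
summary = demo_df.groupby("country").agg(
    n_cities=("name", "count"),
    n_subcountries=("subcountry", "nunique"),
)
print(summary)
```

The result is a DataFrame indexed by country, so it can be sorted with `sort_values()` in the same way as the Series above.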

Conclusion

In this session, we’ve covered the basics of working with pandas DataFrames using a world cities dataset, including:

- Importing data
- Exploring DataFrame properties
- Checking for and removing missing values
- Selecting and filtering data
- Sorting and ranking
- Creating new columns
- Grouping and aggregation

These skills form the foundation of data analysis with pandas and will be essential for upcoming exercises and projects. Remember, pandas has many more functions and methods that we haven’t covered here. Don’t hesitate to explore the pandas documentation for more advanced features!
