import pandas as pd
import numpy as np
Introduction to Pandas DataFrames with World Cities Data
A cartoon panda in a frame shop. MidJourney 5
Before we begin our interactive session, please follow these steps to set up your Jupyter Notebook:
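Create a new notebook: click the + button in the top left corner and select Python 3.11.0 from the Notebook options.
Rename your notebook: rename the Untitled.ipynb tab to Session_XY_Topic.ipynb (replace X with the day number and Y with the session number).
Add a title cell: in the first cell, add the following Markdown content:
# Day 4: Session A - Dataframes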
[Link to session webpage](https://eds-217-essential-python.github.io/course-materials/interactive-sessions/4a_dataframes.html)
Date: 09/05/2025
Remember to save your work frequently by clicking the save icon or using the keyboard shortcut (Ctrl+S or Cmd+S).
Let’s begin our interactive session!
In this interactive session, we’ll explore the basics of working with pandas DataFrames using a dataset of world cities. We’ll cover importing data, basic DataFrame operations, and essential methods for data exploration and manipulation. This session will prepare you for more advanced data analysis tasks and upcoming collaborative coding exercises.
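By the end of this session, you will be able to load a dataset into a DataFrame, inspect its structure, check for and remove missing data, select columns, filter and sort rows, add new columns, and group data to compute summaries.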
Let’s start by importing the pandas library and loading our dataset.
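A minimal sketch of this step. The path to the course CSV is not shown in this text, so the block below reads a small inline sample instead (a hypothetical stand-in built from rows printed later in this session). With the real file, its path or URL would be passed to `pd.read_csv` in place of the `StringIO` buffer.

```python
import io
import pandas as pd

# Hypothetical stand-in for the course CSV: a few rows copied from the
# outputs shown in this session. With the real file, pass its path or URL
# to pd.read_csv instead of the StringIO buffer.
sample_csv = """name,country,subcountry,geonameid
les Escaldes,Andorra,Escaldes-Engordany,3040051
Andorra la Vella,Andorra,Andorra la Vella,3041563
Warīsān,United Arab Emirates,Dubai,290503
Umm Suqaym,United Arab Emirates,Dubai,290581
"""

cities_df = pd.read_csv(io.StringIO(sample_csv))
print(type(cities_df))
```

`read_csv` uses the first line as column names by default and infers a type for each column.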
Let’s take a look at the first few rows of our DataFrame:
                 name               country          subcountry  geonameid
0        les Escaldes               Andorra  Escaldes-Engordany    3040051
1    Andorra la Vella               Andorra    Andorra la Vella    3041563
2             Warīsān  United Arab Emirates               Dubai     290503
3          Umm Suqaym  United Arab Emirates               Dubai     290581
4  Umm Al Quwain City  United Arab Emirates        UmmalQaywayn     290594
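The table above is the kind of output `head()` produces. A minimal sketch on a small stand-in frame (rows copied from this session's outputs, not the full dataset):

```python
import pandas as pd

# Stand-in for cities_df: six rows copied from the outputs in this session.
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Andorra la Vella", "Warīsān",
             "Umm Suqaym", "Umm Al Quwain City", "Bindura"],
    "country": ["Andorra", "Andorra", "United Arab Emirates",
                "United Arab Emirates", "United Arab Emirates", "Zimbabwe"],
    "geonameid": [3040051, 3041563, 290503, 290581, 290594, 895061],
})

first_rows = cities_df.head()    # first 5 rows by default
first_two = cities_df.head(2)    # pass n to change the row count
print(first_rows)
```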
To see the last few rows, we can use:
                         name   country                   subcountry  \
32395                 Bindura  Zimbabwe          Mashonaland Central
32396              Beitbridge  Zimbabwe  Matabeleland South Province
32397                 Epworth  Zimbabwe                       Harare
32398             Chitungwiza  Zimbabwe                       Harare
32399  Harare Western Suburbs  Zimbabwe             Mashonaland West

       geonameid
32395     895061
32396     895269
32397    1085510
32398    1106542
32399   13132735
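The output above comes from `tail()`. The trailing backslash marks where pandas wrapped a wide table, so the geonameid column continues in a second block. A minimal sketch on a four-row stand-in frame (rows copied from the output above):

```python
import pandas as pd

# Stand-in for cities_df: four rows copied from the output above.
cities_df = pd.DataFrame({
    "name": ["Bindura", "Beitbridge", "Epworth", "Chitungwiza"],
    "country": ["Zimbabwe"] * 4,
    "subcountry": ["Mashonaland Central", "Matabeleland South Province",
                   "Harare", "Harare"],
    "geonameid": [895061, 895269, 1085510, 1106542],
})

last_rows = cities_df.tail()     # last 5 rows by default (all 4 rows here)
last_two = cities_df.tail(2)     # pass n to change the row count

# to_string() renders the frame as one block of text; by default it does
# not wrap columns onto a second block.
print(last_two.to_string())
```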
Now, let’s explore some basic properties of our DataFrame:
Shape: (32400, 4)
Columns: Index(['name', 'country', 'subcountry', 'geonameid'], dtype='object')
Data types:
name object
country object
subcountry object
geonameid int64
dtype: object
Summary statistics:
geonameid
count 3.240000e+04
mean 3.355243e+06
std 2.974148e+06
min 4.900000e+02
25% 1.277806e+06
50% 2.641902e+06
75% 3.846877e+06
max 1.351271e+07
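The labels in the output above ("Shape:", "Columns:", and so on) suggest a series of `print` calls on DataFrame attributes. A minimal sketch on a stand-in frame, assuming that is how the output was produced:

```python
import pandas as pd

# Stand-in for cities_df: four rows copied from the outputs in this session.
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Andorra la Vella", "Warīsān", "Umm Suqaym"],
    "country": ["Andorra", "Andorra", "United Arab Emirates",
                "United Arab Emirates"],
    "subcountry": ["Escaldes-Engordany", "Andorra la Vella", "Dubai", "Dubai"],
    "geonameid": [3040051, 3041563, 290503, 290581],
})

print("Shape:", cities_df.shape)        # (rows, columns) tuple
print("Columns:", cities_df.columns)    # Index of column labels
print("Data types:")
print(cities_df.dtypes)                 # one dtype per column
print("Summary statistics:")
summary = cities_df.describe()          # numeric columns only by default
print(summary)
```

`describe()` summarizes only numeric columns by default, which is why geonameid is the only column in the summary. Because geonameid is an identifier, its mean and standard deviation carry no real meaning; the count, minimum, and maximum are the useful parts.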
It’s important to identify any missing data in your DataFrame:
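One common way to do this is to count missing values per column with `isnull().sum()`. The sketch below uses a small stand-in frame in which one subcountry value has been blanked purely for illustration:

```python
import pandas as pd

# Stand-in for cities_df. The missing subcountry is artificial: it was
# blanked here only to have something to detect.
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Warīsān", "Bindura"],
    "country": ["Andorra", "United Arab Emirates", "Zimbabwe"],
    "subcountry": ["Escaldes-Engordany", None, "Mashonaland Central"],
})

# isnull() gives a True/False frame; summing counts the True values per column.
missing_per_column = cities_df.isnull().sum()
print(missing_per_column)

# Rows that have at least one missing value in any column.
rows_with_missing = cities_df[cities_df.isnull().any(axis=1)]
print(rows_with_missing)
```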
Remove rows with missing data in subcountry using dropna() and the subset argument.
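A minimal sketch of that call on a stand-in frame. The missing values are artificial, and `cleaned_df` is a name chosen here for illustration:

```python
import pandas as pd

# Stand-in for cities_df with two artificial gaps: one in subcountry and
# one in country.
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Warīsān", "Bindura"],
    "country": ["Andorra", "United Arab Emirates", None],
    "subcountry": ["Escaldes-Engordany", None, "Mashonaland Central"],
})

# subset limits the check to the listed columns, so the row with a missing
# country but a valid subcountry is kept.
cleaned_df = cities_df.dropna(subset=["subcountry"])
print(cleaned_df)
```

`dropna()` returns a new DataFrame and leaves the original unchanged, so the result needs to be assigned to a name to be kept.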
To select specific columns:
0          les Escaldes
1      Andorra la Vella
2               Warīsān
3            Umm Suqaym
4    Umm Al Quwain City
Name: name, dtype: object

                 name               country          subcountry
0        les Escaldes               Andorra  Escaldes-Engordany
1    Andorra la Vella               Andorra    Andorra la Vella
2             Warīsān  United Arab Emirates               Dubai
3          Umm Suqaym  United Arab Emirates               Dubai
4  Umm Al Quwain City  United Arab Emirates        UmmalQaywayn
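The two outputs above have different shapes: the first is a Series (one column), the second a DataFrame (several columns). A minimal sketch of the two selection forms that would produce them, on a stand-in frame:

```python
import pandas as pd

# Stand-in for cities_df: three rows copied from the outputs above.
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Andorra la Vella", "Warīsān"],
    "country": ["Andorra", "Andorra", "United Arab Emirates"],
    "subcountry": ["Escaldes-Engordany", "Andorra la Vella", "Dubai"],
    "geonameid": [3040051, 3041563, 290503],
})

# A single label in brackets returns a Series.
names = cities_df["name"]

# A list of labels in brackets returns a DataFrame with those columns.
subset = cities_df[["name", "country", "subcountry"]]

print(names.head())
print(subset.head())
```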
We can filter rows based on conditions:
# Cities in the United States
us_cities = cities_df[cities_df['country'] == 'United States']
print(us_cities[['name', 'country']].head())
# Cities in California
california_cities = cities_df[(cities_df['country'] == 'United States') & (cities_df['subcountry'] == 'California')]
print(california_cities[['name', 'country', 'subcountry']].head())
             name        country
28010   Fort Hunt  United States
28011    Bessemer  United States
28012     Paducah  United States
28013  Birmingham  United States
28014     Cordova  United States

                name        country  subcountry
30423       Fillmore  United States  California
30472       Adelanto  United States  California
30473         Agoura  United States  California
30474   Agoura Hills  United States  California
30475  Agua Caliente  United States  California
We can use logical operators to combine multiple conditions:
                        name  country        subcountry
4172  Tam O'Shanter-Sullivan   Canada           Ontario
4173                Tecumseh   Canada           Ontario
4174           Templeton-Est   Canada            Quebec
4175                 Terrace   Canada  British Columbia
4176              Terrebonne   Canada            Quebec
4177             The Beaches   Canada           Ontario
4178                 Thorold   Canada           Ontario
4179             Thunder Bay   Canada           Ontario
4180             Tillsonburg   Canada           Ontario
4181                 Timmins   Canada           Ontario
4182                 Toronto   Canada           Ontario
4183          Trois-Rivières   Canada            Quebec
4184              Tsawwassen   Canada  British Columbia
4228          Thetford-Mines   Canada            Quebec
4243       Trinity-Bellwoods   Canada           Ontario
4274           Taylor-Massey   Canada           Ontario
4288        Thorncliffe Park   Canada           Ontario
4333                Townline   Canada  British Columbia
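The code for this output is not shown in this text. Every row above is a Canadian city whose name starts with "T", so the output is consistent with a filter like the one sketched below; the exact condition is an assumption. The stand-in frame includes a Vancouver row that is not in the output above, added only for contrast.

```python
import pandas as pd

# Stand-in for cities_df. The Vancouver row is an added example so that
# the filters have something to exclude.
cities_df = pd.DataFrame({
    "name": ["Toronto", "Terrace", "Trois-Rivières", "Vancouver", "Fillmore"],
    "country": ["Canada", "Canada", "Canada", "Canada", "United States"],
    "subcountry": ["Ontario", "British Columbia", "Quebec",
                   "British Columbia", "California"],
})

# Each comparison yields a boolean Series with one value per row.
is_canada = cities_df["country"] == "Canada"
starts_with_t = cities_df["name"].str.startswith("T")

canada_t = cities_df[is_canada & starts_with_t]     # AND: both must hold
canada_or_t = cities_df[is_canada | starts_with_t]  # OR: either may hold
not_canada = cities_df[~is_canada]                  # NOT: invert the mask

print(canada_t)
```

pandas uses `&`, `|`, and `~` for element-wise logic rather than Python's `and`, `or`, and `not`. When conditions are written inline, each one needs its own parentheses because these operators bind more tightly than `==`.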
To sort the DataFrame based on one or more columns:
                     name      country
22330      's-Gravenzande  Netherlands
22329    's-Hertogenbosch  Netherlands
25815           'Ārdamatā        Sudan
9286   6th of October City        Egypt
9917             A Coruña        Spain

         name      country
112   Andkhōy  Afghanistan
111  Asadābād  Afghanistan
72      Aībak  Afghanistan
108   Baghlān  Afghanistan
107     Balkh  Afghanistan
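The first output above is ordered by name, and the second by country and then by name within each country. A minimal sketch of `sort_values` calls that give those orderings on a stand-in frame (the exact calls used in the session are an assumption):

```python
import pandas as pd

# Stand-in for cities_df: five rows copied from the outputs above.
cities_df = pd.DataFrame({
    "name": ["Balkh", "A Coruña", "Andkhōy", "6th of October City",
             "'s-Hertogenbosch"],
    "country": ["Afghanistan", "Spain", "Afghanistan", "Egypt", "Netherlands"],
})

# Sort by one column (ascending by default).
by_name = cities_df.sort_values("name")

# Sort by several columns: country first, then name within each country.
by_country_name = cities_df.sort_values(["country", "name"])

# Reverse the order with ascending=False.
by_name_desc = cities_df.sort_values("name", ascending=False)

print(by_name[["name", "country"]].head())
print(by_country_name[["name", "country"]].head())
```

String sorting follows character code order, so names that begin with an apostrophe or a digit come before names that begin with a letter, and accented letters such as "ī" sort after unaccented ones. Sorting keeps the original index labels, which is why the index column in the output above is out of order.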
We can create new columns based on existing data:
                                                    name        country  \
23661  Karachi University Employees Co-operative Hous...       Pakistan
3811     Setor Complementar de Indústria e Abastecimento         Brazil
31219      Diamond Head / Kapahulu / Saint Louis Heights  United States
8342             Universitäts- und Hansestadt Greifswald        Germany
31371             Aliamanu / Salt Lakes / Foster Village  United States

       name_length
23661           57
3811            47
31219           45
8342            39
31371           38
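The name_length column in the output above holds the number of characters in each city name, shown longest first. A minimal sketch of one way to build it, using `.str.len()` on a stand-in frame (the session's exact code is not shown here, so this is an assumption):

```python
import pandas as pd

# Stand-in for cities_df: two rows copied from the output above plus two
# short names from earlier outputs.
cities_df = pd.DataFrame({
    "name": ["Toronto",
             "Aliamanu / Salt Lakes / Foster Village",
             "Harare",
             "Diamond Head / Kapahulu / Saint Louis Heights"],
    "country": ["Canada", "United States", "Zimbabwe", "United States"],
})

# Assigning to a new column label adds the column to the DataFrame.
cities_df["name_length"] = cities_df["name"].str.len()

# Longest names first.
longest = cities_df.sort_values("name_length", ascending=False)
print(longest[["name", "country", "name_length"]].head())
```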
When adding a Series as a new column, pandas aligns data by index:
# This works - Series index matches DataFrame index
sample_series = pd.Series([1, 2, 3], index=[0, 1, 2])
small_df = pd.DataFrame({'A': ['x', 'y', 'z']})
small_df['B'] = sample_series
print("Aligned correctly:")
print(small_df)
# This creates unexpected results - misaligned indices
misaligned_series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
small_df['C'] = misaligned_series # Results in NaN values!
print("\nMisaligned indices:")
print(small_df)
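# One fix: use .reindex() to conform the Series to the DataFrame's index.
# Labels with no match get fill_value (0 here), so this removes the NaN values
# but does not carry over 10, 20, 30. To keep the values, the indices must match.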
small_df['D'] = misaligned_series.reindex(small_df.index, fill_value=0)
print("\nCorrected alignment:")
print(small_df)
Aligned correctly:
   A  B
0  x  1
1  y  2
2  z  3

Misaligned indices:
   A  B   C
0  x  1 NaN
1  y  2 NaN
2  z  3 NaN

Corrected alignment:
   A  B   C  D
0  x  1 NaN  0
1  y  2 NaN  0
2  z  3 NaN  0
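In the output above, column D contains only zeros because none of the labels 'a', 'b', 'c' exist in the DataFrame's index. If the intent is to assign the values 10, 20, 30 by position, one option is to drop the Series index before assigning. A minimal sketch, where column name 'E' is chosen here for illustration:

```python
import pandas as pd

small_df = pd.DataFrame({'A': ['x', 'y', 'z']})
misaligned_series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# to_numpy() returns the raw values without an index, so pandas assigns
# them by position instead of aligning on labels.
small_df['E'] = misaligned_series.to_numpy()
print(small_df)
```

Positional assignment requires the number of values to equal the number of rows, and it is only correct when the two objects are known to be in the same order.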
Grouping allows us to perform operations on subsets of the data:
# Number of cities by country
cities_per_country = cities_df.groupby('country')['name'].count().sort_values(ascending=False)
print(cities_per_country.head())
# Number of subcountries (e.g., states, provinces) by country
subcountries_per_country = cities_df.groupby('country')['subcountry'].nunique().sort_values(ascending=False)
print(subcountries_per_country.head())
country
India 3767
United States 3368
Brazil 2269
China 2012
Japan 1293
Name: name, dtype: int64
country
Russian Federation 83
Türkiye 81
Thailand 75
Algeria 53
United States 51
Name: subcountry, dtype: int64
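In this session, we’ve covered the basics of working with pandas DataFrames using a world cities dataset, including loading and inspecting data, checking for and removing missing values, selecting columns, filtering and sorting rows, creating new columns, and grouping data.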
These skills form the foundation of data analysis with pandas and will be essential for upcoming exercises and projects. Remember, pandas has many more functions and methods that we haven’t covered here. Don’t hesitate to explore the pandas documentation for more advanced features!