EDS 217 Cheatsheet

Selection vs. Filtering:

While filtering and selection are related concepts in data manipulation, they have distinct differences:

Selection:

Definition: Selection refers to choosing specific columns or rows from a DataFrame based on their labels or positions.
Purpose: It’s used to extract a subset of data you’re interested in, without necessarily applying any conditions.
Methods: In pandas, selection is typically done using methods like .loc[], .iloc[], or square brackets df[] for column selection.
Example: Selecting specific columns like df[['name', 'age']] or rows df.loc[0:5].

Filtering:

Definition: Filtering involves choosing rows that meet specific conditions based on the values in one or more columns.
Purpose: It’s used to extract data that satisfies certain criteria or conditions.
Methods: In pandas, filtering is often done using boolean indexing or the .query() method.
Example: Filtering rows where age is greater than 30: df[df['age'] > 30].

Key differences:

Scope:
- Selection typically deals with choosing columns or rows based on their labels or positions.
- Filtering typically deals with choosing rows based on conditions applied to the data values.
Condition-based:
- Selection doesn’t necessarily involve conditions (though it can with .loc)
- Filtering always involves a condition or criteria.
Output:
- Selection can result in both a subset of columns and/or rows.
- Filtering typically results in a subset of rows (though the number of columns can be affected if combined with selection).
Use cases:
- Selection is often used when you know exactly which columns or rows you want.
- Filtering is used when you want to find data that meets certain criteria.

It’s worth noting that in practice, these operations are often combined. For example:

# This combines filtering (age > 30) and selection (only 'name' and 'profession' columns)
result = df.loc[df['age'] > 30, ['name', 'profession']]

Understanding the distinction between filtering and selection helps in choosing the right methods for data manipulation tasks and in communicating clearly about data operations.

Setup

First, let’s import pandas and load our dataset.

Code

import pandas as pd

# Load the dataset
df = pd.read_csv('https://bit.ly/eds217-studentdata')

# Display the first few rows
print(df.head())

   student_id  age   gpa             major
0        1000   24  2.18       Mathematics
1        1001   21  2.39           Physics
2        1002   22  2.09           Physics
3        1003   24  2.65  Computer Science
4        1004   20  2.78         Chemistry

Basic Selection

Select a Single Column

Code

# Using square brackets
ages = df['age']

# Using dot notation (only works for valid Python identifiers)
ages = df.age

Select Multiple Columns

Code

# Select age and gpa columns
age_gpa = df[['age', 'gpa']]

Select Rows by Index

Code

# Select first 5 rows
first_five = df.iloc[0:5]

# Select specific rows by index
specific_rows = df.iloc[[0, 2, 4]]

Select Rows and Columns

Code

# Select first 3 rows and 'age', 'gpa' columns
subset = df.loc[0:2, ['age', 'gpa']]

Filtering

Filter by a Single Condition

Code

# Students with age greater than 21
older_students = df[df['age'] > 21]

Filter by Multiple Conditions

Code

# Students with age > 21 and gpa > 3.5
high_performing_older = df[(df['age'] > 21) & (df['gpa'] > 3.5)]

Filter Using .isin()

Code

# Students majoring in Computer Science or Biology
cs_bio_students = df[df['major'].isin(['Computer Science', 'Biology'])]

Filter Using String Methods

Code

# Majors starting with 'E'
e_majors = df[df['major'].str.startswith('E')]

Combining Selection and Filtering

Code

# Select 'age' and 'gpa' for students with gpa > 3.5
high_gpa_age = df.loc[df['gpa'] > 3.5, ['age', 'gpa']]

Useful Methods

.loc[] vs .iloc[]

Use .loc[] for label-based indexing
Use .iloc[] for integer-based indexing

.query() Method

Code

# Filter using query method
cs_students = df.query("major == 'Computer Science'")

.where() Method

Code

# Replace values not meeting the condition with NaN
high_gpa = df.where(df['gpa'] > 3.5)

Tips and Tricks

Chain methods for complex operations:

result = df[df['age'] > 21].groupby('major')['gpa'].mean()

Use & for AND, | for OR in multiple conditions:

df[(df['age'] > 21) & (df['gpa'] > 3.5) | (df['major'] == 'Computer Science')]

# Because the & operator takes precendence over the | operator, 
# the above expression evaluates the same as:
df[((df['age'] > 21) & (df['gpa'] > 3.5)) | 
   (df['major'] == 'Computer Science')]

# To group the second two conditions, use parentheses:
df[(df['age'] > 21) & 
   ((df['gpa'] > 3.5) | (df['major'] == 'Computer Science'))]

Reset index after filtering if needed:

filtered_df = df[df['age'] > 21].reset_index(drop=True)

Use ~ for negation:

not_cs = df[~(df['major'] == 'Computer Science')]

Remember: Always chain indexers [] or use .loc[]/.iloc[] to avoid the SettingWithCopyWarning when modifying DataFrames. Alternatively, you can assign the output of a filtering or selection to the original dataframe if you want to alter the dataframe itself (and not make a copy or view).