While filtering and selection are related concepts in data manipulation, they have distinct differences:
Selection:
Definition: Selection refers to choosing specific columns or rows from a DataFrame based on their labels or positions.
Purpose: Itβs used to extract a subset of data youβre interested in, without necessarily applying any conditions.
Methods: In pandas, selection is typically done using methods like .loc[], .iloc[], or square brackets df[] for column selection.
Example: Selecting specific columns like df[['name', 'age']] or rows df.loc[0:5].
Filtering:
Definition: Filtering involves choosing rows that meet specific conditions based on the values in one or more columns.
Purpose: Itβs used to extract data that satisfies certain criteria or conditions.
Methods: In pandas, filtering is often done using boolean indexing or the .query() method.
Example: Filtering rows where age is greater than 30: df[df['age'] > 30].
Key differences:
Scope:
Selection typically deals with choosing columns or rows based on their labels or positions.
Filtering typically deals with choosing rows based on conditions applied to the data values.
Condition-based:
Selection doesnβt necessarily involve conditions (though it can with .loc)
Filtering always involves a condition or criteria.
Output:
Selection can result in both a subset of columns and/or rows.
Filtering typically results in a subset of rows (though the number of columns can be affected if combined with selection).
Use cases:
Selection is often used when you know exactly which columns or rows you want.
Filtering is used when you want to find data that meets certain criteria.
Itβs worth noting that in practice, these operations are often combined. For example:
# This combines filtering (age > 30) and selection (only 'name' and 'profession' columns)result = df.loc[df['age'] >30, ['name', 'profession']]
Understanding the distinction between filtering and selection helps in choosing the right methods for data manipulation tasks and in communicating clearly about data operations.
Setup
First, letβs import pandas and load our dataset.
Code
import pandas as pd# Load the datasetdf = pd.read_csv('https://bit.ly/eds217-studentdata')# Display the first few rowsprint(df.head())
# Using square bracketsages = df['age']# Using dot notation (only works for valid Python identifiers)ages = df.age
Select Multiple Columns
Code
# Select age and gpa columnsage_gpa = df[['age', 'gpa']]
Select Rows by Index
Code
# Select first 5 rowsfirst_five = df.iloc[0:5]# Select specific rows by indexspecific_rows = df.iloc[[0, 2, 4]]
Select Rows and Columns
Code
# Select first 3 rows and 'age', 'gpa' columnssubset = df.loc[0:2, ['age', 'gpa']]
Filtering
Filter by a Single Condition
Code
# Students with age greater than 21older_students = df[df['age'] >21]
Filter by Multiple Conditions
Code
# Students with age > 21 and gpa > 3.5high_performing_older = df[(df['age'] >21) & (df['gpa'] >3.5)]
Filter Using .isin()
Code
# Students majoring in Computer Science or Biologycs_bio_students = df[df['major'].isin(['Computer Science', 'Biology'])]
Filter Using String Methods
Code
# Majors starting with 'E'e_majors = df[df['major'].str.startswith('E')]
Combining Selection and Filtering
Code
# Select 'age' and 'gpa' for students with gpa > 3.5high_gpa_age = df.loc[df['gpa'] >3.5, ['age', 'gpa']]
Useful Methods
.loc[] vs .iloc[]
Use .loc[] for label-based indexing
Use .iloc[] for integer-based indexing
.query() Method
Code
# Filter using query methodcs_students = df.query("major == 'Computer Science'")
.where() Method
Code
# Replace values not meeting the condition with NaNhigh_gpa = df.where(df['gpa'] >3.5)
Tips and Tricks
Chain methods for complex operations:
result = df[df['age'] >21].groupby('major')['gpa'].mean()
Use & for AND, | for OR in multiple conditions:
df[(df['age'] >21) & (df['gpa'] >3.5) | (df['major'] =='Computer Science')]# Because the & operator takes precendence over the | operator, # the above expression evaluates the same as:df[((df['age'] >21) & (df['gpa'] >3.5)) | (df['major'] =='Computer Science')]# To group the second two conditions, use parentheses:df[(df['age'] >21) & ((df['gpa'] >3.5) | (df['major'] =='Computer Science'))]
Remember: Always chain indexers [] or use .loc[]/.iloc[] to avoid the SettingWithCopyWarning when modifying DataFrames. Alternatively, you can assign the output of a filtering or selection to the original dataframe if you want to alter the dataframe itself (and not make a copy or view).