Selecting Columns in Pandas: A Complete Guide
When working with data in Pandas, selecting columns is one of the most common and essential operations. Whether you’re extracting a single column or multiple columns, Pandas provides flexible and efficient methods to perform this task. In this guide, we’ll cover the main ways to select columns from a Pandas DataFrame: bracket and dot notation, label-based and position-based indexing, and pattern-based filtering, each with a practical example.
Sample DataFrame
To demonstrate the various methods, we will use the following sample dataset:
import pandas as pd
# Create a sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Eve', 'Charlie'],
'Age': [25, 30, 22, 35, 28],
'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
'Salary': [50000, 55000, 40000, 70000, 48000],
'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Operations']
}
df = pd.DataFrame(data)
print(df)
      Name  Age  Gender  Salary  Department
0     John   25    Male   50000          HR
1    Alice   30  Female   55000          IT
2      Bob   22    Male   40000     Finance
3      Eve   35  Female   70000   Marketing
4  Charlie   28    Male   48000  Operations
1. Selecting a Single Column
A) Using Bracket Notation
The most common way to select a single column is using bracket notation. This returns the column as a Pandas Series.
# Select the 'Age' column
age_column = df['Age']
print(age_column)
0    25
1    30
2    22
3    35
4    28
Name: Age, dtype: int64
B) Using Dot Notation
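You can also use dot notation, but it has some limitations. It only works when the column name is a valid Python identifier (no spaces or special characters) and does not clash with an existing DataFrame attribute or method name such as count or size. When in doubt, use bracket notation.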
# Select the 'Salary' column
salary_column = df.Salary
print(salary_column)
0    50000
1    55000
2    40000
3    70000
4    48000
Name: Salary, dtype: int64
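To make the limitations of dot notation concrete, here is a minimal sketch. The DataFrame df_demo and its 'Annual Bonus' and 'count' columns are hypothetical, chosen only to trigger the two failure cases; they are not part of the sample dataset above.

```python
import pandas as pd

# Hypothetical data for illustration only
df_demo = pd.DataFrame({
    'Salary': [50000, 55000],
    'Annual Bonus': [2000, 2500],  # name contains a space
    'count': [1, 2],               # same name as the DataFrame.count() method
})

# Works: 'Salary' is a valid identifier and clashes with nothing
salary = df_demo.Salary

# 'Annual Bonus' cannot be written as an attribute, so use brackets
bonus = df_demo['Annual Bonus']

# df_demo.count resolves to the method, not the column
count_attr_is_method = callable(df_demo.count)
count_column = df_demo['count']

print(count_attr_is_method)   # True
print(count_column.tolist())  # [1, 2]
```

Bracket notation behaves the same way for every column name, which is why it is usually the safer default in reusable code.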
2. Selecting Multiple Columns
To select multiple columns, pass a list of column names to the bracket notation. This returns a new DataFrame.
# Select 'Name', 'Age', and 'Salary' columns
selected_columns = df[['Name', 'Age', 'Salary']]
print(selected_columns)
      Name  Age  Salary
0     John   25   50000
1    Alice   30   55000
2      Bob   22   40000
3      Eve   35   70000
4  Charlie   28   48000
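A detail that often trips people up is that the type of the result depends on whether you pass a single label or a list. The sketch below uses a small stand-in DataFrame (df_demo, a hypothetical name) to show the difference.

```python
import pandas as pd

# Small stand-in DataFrame for illustration
df_demo = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 30, 22],
    'Salary': [50000, 55000, 40000],
})

as_series = df_demo['Age']    # single label -> Series
as_frame = df_demo[['Age']]   # one-element list -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_series.shape)           # (3,)
print(as_frame.shape)            # (3, 1)
```

Use the double-bracket form when later code expects a two-dimensional object, even if you only need one column.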
3. Selecting Columns with loc
The loc[] indexer selects rows and columns by their labels. To select specific columns, use : for all rows and pass the column names.
# Select 'Gender' and 'Department' columns
selected_columns = df.loc[:, ['Gender', 'Department']]
print(selected_columns)
   Gender  Department
0    Male          HR
1  Female          IT
2    Male     Finance
3  Female   Marketing
4    Male  Operations
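loc[] also accepts label slices and can restrict rows and columns in one step. The following is a minimal sketch on a stand-in DataFrame (df_demo is a hypothetical name). Note that, unlike ordinary Python slices, a label slice includes both endpoints.

```python
import pandas as pd

# Small stand-in DataFrame for illustration
df_demo = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 30, 22],
    'Gender': ['Male', 'Female', 'Male'],
    'Salary': [50000, 55000, 40000],
})

# Label slice: both 'Age' and 'Salary' are included
age_to_salary = df_demo.loc[:, 'Age':'Salary']
print(age_to_salary.columns.tolist())  # ['Age', 'Gender', 'Salary']

# Rows chosen by a boolean condition, columns chosen by label
older = df_demo.loc[df_demo['Age'] > 23, ['Name', 'Salary']]
print(older['Name'].tolist())  # ['John', 'Alice']
```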
4. Selecting Columns with iloc
The iloc[] indexer selects columns by their integer positions, counting from 0. This is useful when you know the position of the columns but not their names.
# Select the first two columns (index positions 0 and 1)
first_two_columns = df.iloc[:, [0, 1]]
print(first_two_columns)
# Select a range of columns (positions 1 through 3; the slice end, 4, is excluded)
range_columns = df.iloc[:, 1:4]
print(range_columns)
# Output of first_two_columns:
      Name  Age
0     John   25
1    Alice   30
2      Bob   22
3      Eve   35
4  Charlie   28
# Output of range_columns:
   Age  Gender  Salary
0   25    Male   50000
1   30  Female   55000
2   22    Male   40000
3   35  Female   70000
4   28    Male   48000
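Because iloc[] follows ordinary Python slicing rules, negative positions and step values work too. Here is a short sketch on a stand-in DataFrame (df_demo is a hypothetical name) with the same five column names as the sample data.

```python
import pandas as pd

# Small stand-in DataFrame for illustration
df_demo = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 30, 22],
    'Gender': ['Male', 'Female', 'Male'],
    'Salary': [50000, 55000, 40000],
    'Department': ['HR', 'IT', 'Finance'],
})

last_column = df_demo.iloc[:, -1]   # single position -> Series
last_two = df_demo.iloc[:, -2:]     # last two columns -> DataFrame
every_other = df_demo.iloc[:, ::2]  # positions 0, 2, 4

print(last_column.name)              # Department
print(last_two.columns.tolist())     # ['Salary', 'Department']
print(every_other.columns.tolist())  # ['Name', 'Gender', 'Department']
```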
5. Selecting Columns Using filter
A) Select Columns by Name Containing a Substring
# Select columns that contain 'Age'
filtered_columns = df.filter(like='Age')
print(filtered_columns)
   Age
0   25
1   30
2   22
3   35
4   28
B) Select Columns by Regex Pattern
# Select columns that start with 'D'
filtered_columns = df.filter(regex='^D')
print(filtered_columns)
   Department
0          HR
1          IT
2     Finance
3   Marketing
4  Operations
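filter() matches column labels, not the values inside the columns. Besides like and regex, it takes an items argument for exact names. The sketch below uses a stand-in DataFrame (df_demo) and a deliberately missing column name ('Bonus'); both are hypothetical. With items, names that are not present are dropped from the result rather than raising an error.

```python
import pandas as pd

# Small stand-in DataFrame for illustration
df_demo = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 30, 22],
    'Gender': ['Male', 'Female', 'Male'],
    'Salary': [50000, 55000, 40000],
    'Department': ['HR', 'IT', 'Finance'],
})

# Exact names; 'Bonus' does not exist and is skipped
by_items = df_demo.filter(items=['Name', 'Salary', 'Bonus'])
print(sorted(by_items.columns))  # ['Name', 'Salary']

# Regex anchored at the end: column names ending in 'e'
ends_with_e = df_demo.filter(regex='e$')
print(ends_with_e.columns.tolist())  # ['Name', 'Age']
```

If a missing column should be treated as an error, df[['Name', 'Salary', 'Bonus']] is the stricter choice, since it raises a KeyError.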
6. Summary
Selecting columns is a fundamental operation when working with Pandas DataFrames.
Here’s a quick recap of the methods covered:
- Bracket Notation: Simple and versatile for single or multiple columns.
- Dot Notation: Concise but limited.
- loc[] and iloc[]: Powerful methods for label-based and position-based selection.
- filter(): Ideal for pattern-based selection.
- Advanced Techniques: Combine methods for complex selection tasks.
By mastering these techniques, you can efficiently manipulate your DataFrame to suit your data analysis needs.