Pandas – How to Select Columns


Selecting Columns in Pandas: A Complete Guide

When working with data in Pandas, selecting columns is one of the most common and essential operations. Whether you’re extracting a single column or multiple columns, Pandas provides flexible and efficient methods to perform this task. In this guide, we’ll cover all the ways to select columns from a Pandas DataFrame, including common use cases, advanced techniques, and practical examples.

Sample DataFrame

To demonstrate the various methods, we will use the following sample dataset:

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['John', 'Alice', 'Bob', 'Eve', 'Charlie'],
    'Age': [25, 30, 22, 35, 28],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Salary': [50000, 55000, 40000, 70000, 48000],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Operations']
}

df = pd.DataFrame(data)
print(df)
      Name  Age  Gender  Salary  Department
0     John   25    Male   50000          HR
1    Alice   30  Female   55000          IT
2      Bob   22    Male   40000     Finance
3      Eve   35  Female   70000   Marketing
4  Charlie   28    Male   48000  Operations

1. Selecting a Single Column

A) Using Bracket Notation

The most common way to select a single column is using bracket notation. This returns the column as a Pandas Series.

# Select the 'Age' column
age_column = df['Age']
print(age_column)
0    25
1    30
2    22
3    35
4    28
Name: Age, dtype: int64

B) Using Dot Notation

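You can also use dot notation, but it has limitations. It only works when the column name is a valid Python identifier (no spaces or special characters) and does not clash with an existing DataFrame attribute or method, such as count or size. It also cannot be used to create a new column.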

# Select the 'Salary' column
salary_column = df.Salary
print(salary_column)
0    50000
1    55000
2    40000
3    70000
4    48000
Name: Salary, dtype: int64
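
To make the limitations of dot notation concrete, here is a minimal sketch. The frame df_demo and its column names ('First Name' and 'count') are hypothetical and were chosen only to trigger the two problem cases; they are not part of the sample dataset above.

```python
import pandas as pd

# Hypothetical frame: 'First Name' contains a space,
# and 'count' has the same name as the DataFrame.count method
df_demo = pd.DataFrame({
    'First Name': ['John', 'Alice'],
    'count': [3, 7],
    'Salary': [50000, 55000],
})

# Works: 'Salary' is a valid identifier and not an existing attribute
salary = df_demo.Salary

# Does not return the column: the name resolves to the method instead
is_method = callable(df_demo.count)
print(is_method)  # True

# Bracket notation handles both awkward names
first_name = df_demo['First Name']
count_col = df_demo['count']
print(count_col.tolist())  # [3, 7]
```

Because bracket notation works for every column name, it is usually the safer default in scripts, with dot notation kept for quick interactive exploration.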

2. Selecting Multiple Columns

To select multiple columns, pass a list of column names to the bracket notation. This returns a new DataFrame.

# Select 'Name', 'Age', and 'Salary' columns
selected_columns = df[['Name', 'Age', 'Salary']]
print(selected_columns)
      Name  Age  Salary
0     John   25   50000
1    Alice   30   55000
2      Bob   22   40000
3      Eve   35   70000
4  Charlie   28   48000
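
The difference between single and double brackets is easy to miss, so here is a short sketch of it, using a trimmed-down copy of the sample data. A single label returns a Series, while a list of labels returns a DataFrame, even when the list holds only one name.

```python
import pandas as pd

# Trimmed-down copy of the sample data
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 30, 22],
})

age_series = df['Age']    # single label -> Series
age_frame = df[['Age']]   # list with one label -> DataFrame

print(type(age_series).__name__)  # Series
print(type(age_frame).__name__)   # DataFrame
print(age_frame.shape)            # (3, 1)
```

The double-bracket form is handy when later code expects a two-dimensional object rather than a Series.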

3. Selecting Columns with loc

The loc[] indexer selects rows and columns by label. To select specific columns, pass : to keep all rows, followed by a list of column names.

# Select 'Gender' and 'Department' columns
selected_columns = df.loc[:, ['Gender', 'Department']]
print(selected_columns)
   Gender  Department
0    Male          HR
1  Female          IT
2    Male     Finance
3  Female   Marketing
4    Male  Operations
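
Besides a list of names, loc[] also accepts a label slice. The sketch below rebuilds the sample DataFrame and selects every column from 'Age' through 'Salary'. Unlike ordinary Python slices, a label slice includes its end label.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Eve', 'Charlie'],
    'Age': [25, 30, 22, 35, 28],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Salary': [50000, 55000, 40000, 70000, 48000],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Operations'],
})

# Label slice: both 'Age' and 'Salary' are included
age_to_salary = df.loc[:, 'Age':'Salary']
print(list(age_to_salary.columns))  # ['Age', 'Gender', 'Salary']

# A single label (not wrapped in a list) returns a Series
age_only = df.loc[:, 'Age']
print(type(age_only).__name__)  # Series
```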

4. Selecting Columns with iloc

The iloc[] indexer selects columns by integer position, counting from 0. This is useful when you know the position of the columns but not their names.

# Select the first two columns (index positions 0 and 1)
first_two_columns = df.iloc[:, [0, 1]]
print(first_two_columns)

# Select a range of columns (positions 1 through 3; the end position 4 is excluded)
range_columns = df.iloc[:, 1:4]
print(range_columns)
# Output of first_two_columns:
      Name  Age
0     John   25
1    Alice   30
2      Bob   22
3      Eve   35
4  Charlie   28

# Output of range_columns:
   Age  Gender  Salary
0   25    Male   50000
1   30  Female   55000
2   22    Male   40000
3   35  Female   70000
4   28    Male   48000
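
Positional slices in iloc[] follow the usual Python rules, so negative positions and steps also work. The sketch below rebuilds the sample DataFrame to show both, and uses df.columns.get_loc() to look up a column's position when you only know its name.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Eve', 'Charlie'],
    'Age': [25, 30, 22, 35, 28],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Salary': [50000, 55000, 40000, 70000, 48000],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Operations'],
})

# Last two columns, counted from the end
last_two = df.iloc[:, -2:]
print(list(last_two.columns))  # ['Salary', 'Department']

# Every second column, starting at position 0
every_other = df.iloc[:, ::2]
print(list(every_other.columns))  # ['Name', 'Gender', 'Department']

# Look up the integer position of a column by its label
salary_pos = df.columns.get_loc('Salary')
print(salary_pos)  # 3
```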

5. Selecting Columns Using filter

The filter() method selects columns by matching on their names. It does not look at the values stored in them.

A) Select Columns by Name Containing a Substring

# Select columns that contain 'Age'
filtered_columns = df.filter(like='Age')
print(filtered_columns)
   Age
0   25
1   30
2   22
3   35
4   28

B) Select Columns by Regex Pattern

# Select columns that start with 'D'
filtered_columns = df.filter(regex='^D')
print(filtered_columns)
   Department
0          HR
1          IT
2     Finance
3   Marketing
4  Operations
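
In addition to like and regex, filter() accepts an items argument that takes exact column names. The sketch below uses a trimmed-down copy of the sample data. The name 'Bonus' is hypothetical and deliberately missing from the frame, to show that items skips unknown names instead of raising an error.

```python
import pandas as pd

# Trimmed-down copy of the sample data
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 30, 22],
    'Salary': [50000, 55000, 40000],
    'Department': ['HR', 'IT', 'Finance'],
})

# Exact names; 'Bonus' does not exist and is silently skipped
by_items = df.filter(items=['Name', 'Bonus'])
print(list(by_items.columns))  # ['Name']

# A regex with alternation matches several exact names at once
by_regex = df.filter(regex='^(Age|Salary)$')
print(list(by_regex.columns))  # ['Age', 'Salary']
```

This is a different trade-off from df[['Name', 'Bonus']], which raises a KeyError when a name is missing. Pick whichever behavior suits the task. Only one of items, like, and regex can be passed per call.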

6. Summary

Selecting columns is a fundamental operation when working with Pandas DataFrames.
Here’s a quick recap of the methods covered:

  • Bracket Notation: Simple and versatile for single or multiple columns.
  • Dot Notation: Concise but limited.
  • loc[] and iloc[]: Powerful methods for label-based and position-based selection.
  • filter(): Ideal for pattern-based selection.
  • Combining methods: these approaches can be mixed for more complex selection tasks.
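
As a rough illustration of combining approaches, the sketch below rebuilds the sample DataFrame and pairs a row condition with a column list in loc[], then uses a boolean mask built from the column names. The specific conditions (Age above 25, names ending in 'e') are arbitrary choices for the example.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Eve', 'Charlie'],
    'Age': [25, 30, 22, 35, 28],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Salary': [50000, 55000, 40000, 70000, 48000],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Operations'],
})

# Row condition plus column list in a single loc[] call
older_staff = df.loc[df['Age'] > 25, ['Name', 'Salary']]
print(older_staff['Name'].tolist())  # ['Alice', 'Eve', 'Charlie']

# Boolean mask over the column names: keep names ending in 'e'
ends_with_e = df.loc[:, df.columns.str.endswith('e')]
print(list(ends_with_e.columns))  # ['Name', 'Age']
```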

By mastering these techniques, you can efficiently manipulate your DataFrame to suit your data analysis needs.
