Feature engineering is the process of transforming raw data into meaningful features that better represent the underlying problem to predictive models, improving their performance. With Pandas, a powerful Python data manipulation library, you can efficiently perform various feature engineering tasks such as creating new features, encoding categorical variables, handling missing values, and more.
Here’s a breakdown of common feature engineering techniques using Pandas, along with examples:
You can derive new columns from existing ones to capture more information.
Example:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'date': ['2023-01-01', '2023-01-02', '2023-01-03'],
'price': [100, 150, 200],
'quantity': [2, 3, 1]
})
# Convert 'date' to datetime and extract features
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek # Monday=0, Sunday=6
# Create a new feature 'total_sales'
df['total_sales'] = df['price'] * df['quantity']
print(df)
Missing data can be filled or dropped based on the use case.
Example:
# Introduce some NaN values
df.loc[1, 'price'] = None
# Fill missing 'price' with the mean
df['price'] = df['price'].fillna(df['price'].mean())
# Alternatively, drop rows with any missing values
# df = df.dropna()
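Missingness itself can carry signal; a minimal sketch of recording it as a binary flag (this must be computed before the value is filled), shown in the same commented-out style as the alternative above:
# Optionally, flag which rows had a missing 'price' (compute before filling)
# df['price_missing'] = df['price'].isna().astype(int)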
Categorical data needs to be converted into numerical format for most machine learning models.
Label Encoding (for ordinal categories):
# Add a sample categorical column and encode it as integer codes
df['category'] = ['A', 'B', 'A']
df['category_label'] = df['category'].astype('category').cat.codes
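Note that .cat.codes assigns codes in the categories' default (alphabetical) order, which may not match the true ordinal order. For a genuinely ordinal variable, the order can be declared explicitly; a minimal sketch with a hypothetical 'size' variable:
# Hypothetical ordinal variable with an explicit category order
sizes = pd.Series(['small', 'large', 'medium'])
size_dtype = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
size_codes = sizes.astype(size_dtype).cat.codes  # small=0, medium=1, large=2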
One-Hot Encoding (for nominal categories):
# One-hot encode 'category': creates indicator columns (e.g. category_A, category_B) and drops the original column
df_onehot = pd.get_dummies(df, columns=['category'])
Convert continuous variables into discrete bins.
Example:
# Bin 'price' into labeled ranges with fixed edges
df['price_bin'] = pd.cut(df['price'], bins=[0, 120, 180, float('inf')], labels=['low', 'medium', 'high'])
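For equal-frequency rather than fixed-edge bins, pd.qcut is the quantile-based counterpart; a minimal sketch (the new column name is illustrative):
# Quantile-based binning: roughly equal numbers of rows per bin
df['price_qbin'] = pd.qcut(df['price'], q=3, labels=['low', 'medium', 'high'])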
Use groupby to create aggregate features.
Example:
# Assume we have a 'user_id' column and want to add average price per user
df['avg_price_per_user'] = df.groupby('user_id')['price'].transform('mean')
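Since the sample DataFrame above has no 'user_id' column, here is a self-contained sketch with a hypothetical purchases table:
# Hypothetical purchases table: broadcast each user's average price back to every row
purchases = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'price': [10.0, 20.0, 5.0, 15.0, 10.0]
})
purchases['avg_price_per_user'] = purchases.groupby('user_id')['price'].transform('mean')
# user 1 rows get 15.0, user 2 rows get 10.0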
Applying a log transform can stabilize variance and reduce skew, while normalization (scaling) puts features on a comparable range.
Example (Log Transform):
import numpy as np
df['log_price'] = np.log1p(df['price']) # log(1 + x) to handle zeros
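The normalization mentioned above can be done with plain Pandas arithmetic; a minimal min-max scaling sketch (column names are illustrative):
# Min-max scale 'price' into the [0, 1] range
price_min, price_max = df['price'].min(), df['price'].max()
df['price_minmax'] = (df['price'] - price_min) / (price_max - price_min)
# Or standardize to zero mean and unit variance:
# df['price_zscore'] = (df['price'] - df['price'].mean()) / df['price'].std()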
Extracting temporal patterns is key in time series or event-based data.
Example:
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
When deploying your data pipeline or training machine learning models at scale, Tencent Cloud TI-Platform (Tencent Intelligent Platform) provides managed services for data processing, feature store, model training, and deployment. It integrates well with big data tools and supports scalable feature engineering workflows. For storing large datasets, Tencent Cloud COS (Cloud Object Storage) is a reliable choice, and Tencent Cloud TDSQL or TencentDB can manage structured data efficiently.
Pandas is ideal for prototyping and small-to-medium datasets, but for larger-scale production environments, consider integrating with distributed computing frameworks like Spark or Tencent Cloud’s big data solutions.