Feature engineering is the process of transforming raw data into meaningful features that better represent the underlying problem to predictive models, improving their performance. With Pandas, a powerful Python data manipulation library, you can efficiently perform various feature engineering tasks such as creating new features, encoding categorical variables, handling missing values, and more.
Here’s a breakdown of common feature engineering techniques using Pandas, along with examples:
You can derive new columns from existing ones to capture more information.
Example:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'date': ['2023-01-01', '2023-01-02', '2023-01-03'],
'price': [100, 150, 200],
'quantity': [2, 3, 1]
})
# Convert 'date' to datetime and extract features
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek # Monday=0, Sunday=6
# Create a new feature 'total_sales'
df['total_sales'] = df['price'] * df['quantity']
print(df)
Missing data can be filled or dropped based on the use case.
Example:
# Introduce some NaN values
df.loc[1, 'price'] = None
# Fill missing 'price' with the mean
df['price'] = df['price'].fillna(df['price'].mean())
# Alternatively, drop rows with any missing values
# df = df.dropna()
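Missingness itself can carry signal; a minimal sketch of recording it as a binary flag (this must be computed before the value is filled), shown in the same commented-out style as the alternative above:
# Optionally, flag which rows had a missing 'price' (compute before filling)
# df['price_missing'] = df['price'].isna().astype(int)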
Categorical data needs to be converted into numerical format for most machine learning models.
Label Encoding (for ordinal categories):
# Add a sample categorical column and encode it as integer codes
df['category'] = ['A', 'B', 'A']
df['category_label'] = df['category'].astype('category').cat.codes
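Note that .cat.codes assigns codes in the categories' default (alphabetical) order, which may not match the true ordinal order. For a genuinely ordinal variable, the order can be declared explicitly; a minimal sketch with a hypothetical 'size' variable:
# Hypothetical ordinal variable with an explicit category order
sizes = pd.Series(['small', 'large', 'medium'])
size_dtype = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
size_codes = sizes.astype(size_dtype).cat.codes  # small=0, medium=1, large=2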
One-Hot Encoding (for nominal categories):
# One-hot encode 'category': creates indicator columns (e.g. category_A, category_B) and drops the original column
df_onehot = pd.get_dummies(df, columns=['category'])
Convert continuous variables into discrete bins.
Example:
# Bin 'price' into labeled ranges with fixed edges
df['price_bin'] = pd.cut(df['price'], bins=[0, 120, 180, float('inf')], labels=['low', 'medium', 'high'])
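For equal-frequency rather than fixed-edge bins, pd.qcut is the quantile-based counterpart; a minimal sketch (the new column name is illustrative):
# Quantile-based binning: roughly equal numbers of rows per bin
df['price_qbin'] = pd.qcut(df['price'], q=3, labels=['low', 'medium', 'high'])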
Use groupby to create aggregate features.
Example:
# Assume we have a 'user_id' column and want to add average price per user
df['avg_price_per_user'] = df.groupby('user_id')['price'].transform('mean')
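Since the sample DataFrame above has no 'user_id' column, here is a self-contained sketch with a hypothetical purchases table:
# Hypothetical purchases table: broadcast each user's average price back to every row
purchases = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'price': [10.0, 20.0, 5.0, 15.0, 10.0]
})
purchases['avg_price_per_user'] = purchases.groupby('user_id')['price'].transform('mean')
# user 1 rows get 15.0, user 2 rows get 10.0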
Applying a log transform can stabilize variance and reduce skew, while normalization (scaling) puts features on a comparable range.
Example (Log Transform):
import numpy as np
df['log_price'] = np.log1p(df['price']) # log(1 + x) to handle zeros
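The normalization mentioned above can be done with plain Pandas arithmetic; a minimal min-max scaling sketch (column names are illustrative):
# Min-max scale 'price' into the [0, 1] range
price_min, price_max = df['price'].min(), df['price'].max()
df['price_minmax'] = (df['price'] - price_min) / (price_max - price_min)
# Or standardize to zero mean and unit variance:
# df['price_zscore'] = (df['price'] - df['price'].mean()) / df['price'].std()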
Extracting temporal patterns is key in time series or event-based data.
Example:
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
When deploying your data pipeline or training machine learning models at scale, Tencent Cloud TI-Platform (Tencent Intelligent Platform) provides managed services for data processing, feature store, model training, and deployment. It integrates well with big data tools and supports scalable feature engineering workflows. For storing large datasets, Tencent Cloud COS (Cloud Object Storage) is a reliable choice, and Tencent Cloud TDSQL or TencentDB can manage structured data efficiently.
Pandas is ideal for prototyping and small-to-medium datasets, but for larger-scale production environments, consider integrating with distributed computing frameworks like Spark or Tencent Cloud’s big data solutions.