Wednesday, July 1, 2026
Feature Engineering for Machine Learning: Complete Python Guide (2026)

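Feature engineering, the process of transforming raw data into useful inputs for ML models, often has more impact on accuracy than algorithm choice. A well-engineered feature can boost performance by 10-30% in favorable cases, though the gain depends heavily on the dataset and the model. This guide covers the major techniques: encoding, scaling, datetime features, binning, interaction features, and feature selection.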

1. Encoding Categorical Variables

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'city': ['Mumbai','Delhi','Mumbai','Bangalore'],
                   'size': ['Small','Large','Medium','Large'],
                   'target': [1,0,1,1]})

# One-Hot (nominal, low cardinality)
df_ohe = pd.get_dummies(df, columns=['city'], drop_first=True)

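# Ordinal (when order matters): list the categories from lowest to highest
enc = OrdinalEncoder(categories=[['Small','Medium','Large']])
df['size_enc'] = enc.fit_transform(df[['size']]).ravel()  # flatten the (n, 1) output to one column

# Target encoding (high cardinality)
# Caution: means computed on the same rows they are applied to leak the target;
# in practice, compute them on training folds only
target_mean = df.groupby('city')['target'].mean()
df['city_target'] = df['city'].map(target_mean)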
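
The target-encoding snippet above computes each city's mean from the same rows it encodes, which lets the label leak into the feature. One common way around this is out-of-fold encoding, sketched below under some assumptions: the toy data, the `city_target_oof` column name, and the choice of three folds are illustrative and not from the original example.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy data with hypothetical values, for illustration only
df = pd.DataFrame({
    'city':   ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'target': [1, 0, 1, 1, 1, 0],
})

global_mean = df['target'].mean()      # fallback for cities missing from the fitting folds
df['city_target_oof'] = float('nan')   # filled in one fold at a time

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    # Means come only from rows outside the fold being encoded
    fold_means = df.iloc[fit_idx].groupby('city')['target'].mean()
    encoded = df.iloc[enc_idx]['city'].map(fold_means).fillna(global_mean)
    df.loc[df.index[enc_idx], 'city_target_oof'] = encoded.values

print(df)
```

Each row's encoded value now depends only on other rows' targets. Cities that do not appear in the fitting folds fall back to the global mean.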

2. Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

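# Assumes df has numeric columns 'sales', 'age', and 'income'
X = df[['sales','age','income']]

# In a real pipeline, fit each scaler on the training split only and reuse it on test data
X_standard = StandardScaler().fit_transform(X)   # mean=0, std=1
X_minmax   = MinMaxScaler().fit_transform(X)      # range [0,1]
X_robust   = RobustScaler().fit_transform(X)      # median/IQR, less sensitive to outliers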
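
The calls above fit and transform in one step, which is fine for exploration but lets test-set statistics influence the scaling when used before a train/test split. A minimal sketch of the split-aware pattern follows; the synthetic array stands in for real columns such as `sales`, `age`, and `income`, and is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three numeric feature columns (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Learn mean and std from the training split only
scaler = StandardScaler().fit(X_train)

X_train_s = scaler.transform(X_train)  # exactly mean 0, std 1 per column
X_test_s = scaler.transform(X_test)    # same parameters, so only approximately mean 0

print(X_train_s.mean(axis=0).round(6), X_test_s.mean(axis=0).round(3))
```

The same fit-then-transform pattern applies to `MinMaxScaler` and `RobustScaler`.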

3. DateTime Features

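# Assumes df has a 'date' column holding parseable date strings or timestamps
df['date'] = pd.to_datetime(df['date'])
df['year']         = df['date'].dt.year
df['month']        = df['date'].dt.month
df['day_of_week']  = df['date'].dt.dayofweek       # Monday=0 ... Sunday=6
df['is_weekend']   = df['date'].dt.dayofweek >= 5  # True for Saturday and Sunday
df['quarter']      = df['date'].dt.quarter
df['days_since']   = (df['date'] - pd.Timestamp('2020-01-01')).dt.days  # days elapsed since a fixed reference date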
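
To make the `dayofweek` convention concrete, here is a small self-contained check on three hand-picked dates (a Friday, a Saturday, and a Monday in January 2024). The `dates` frame is a hypothetical stand-in for the article's `df`.

```python
import pandas as pd

# Three sample dates chosen for illustration: Fri, Sat, Mon
dates = pd.DataFrame({'date': pd.to_datetime(['2024-01-05', '2024-01-06', '2024-01-08'])})

dates['day_of_week'] = dates['date'].dt.dayofweek      # Monday=0 ... Sunday=6
dates['is_weekend'] = dates['date'].dt.dayofweek >= 5  # Saturday=5, Sunday=6
dates['quarter'] = dates['date'].dt.quarter

print(dates)
# day_of_week is 4, 5, 0 and is_weekend is False, True, False
```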

4. Binning

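# Fixed-width bins: intervals are right-inclusive, e.g. (0, 18], so an age of exactly 0
# or any age above 100 becomes NaN
df['age_group']  = pd.cut(df['age'], bins=[0,18,35,50,65,100],
                          labels=['Teen','Young','Adult','Mid','Senior'])
# Quantile bins: four groups with roughly equal row counts
df['inc_quartile'] = pd.qcut(df['income'], q=4, labels=['Q1','Q2','Q3','Q4'])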
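
The edge behaviour of `pd.cut` is easy to get wrong, so a short runnable check may help. The sample ages and incomes below are made up for illustration.

```python
import pandas as pd

# Hypothetical ages chosen to hit the bin edges
ages = pd.Series([0, 18, 19, 35, 66, 120])
age_group = pd.cut(ages, bins=[0, 18, 35, 50, 65, 100],
                   labels=['Teen', 'Young', 'Adult', 'Mid', 'Senior'])
print(age_group.tolist())
# 0 and 120 fall outside every interval and become NaN; 18 lands in 'Teen', 35 in 'Young'

# Hypothetical incomes: qcut places the same number of rows in each quartile here
incomes = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])
inc_quartile = pd.qcut(incomes, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(inc_quartile.value_counts().sort_index())
```

If an age of 0 should count as the first group, passing `include_lowest=True` to `pd.cut` closes the first interval on the left.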

5. Interaction Features

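# Ratio features encode domain knowledge; a sqft of 0 would yield inf, so clean such rows first
df['price_per_sqft']  = df['price'] / df['sqft']
df['income_age_ratio'] = df['income'] / (df['age'] + 1)  # +1 guards against division by zero

from sklearn.preprocessing import PolynomialFeatures
# degree=2 adds squares and the pairwise product: age, income, age^2, age*income, income^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['age','income']])
print(f'2 features -> {X_poly.shape[1]} polynomial features')  # prints 5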

6. Feature Selection

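from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# Assumes X is a DataFrame of numeric features and y holds the class labels

# Univariate selection: keep the k features with the highest ANOVA F-scores
selector = SelectKBest(f_classif, k=min(10, X.shape[1]))  # k must not exceed the column count
X_best = selector.fit_transform(X, y)
print('Selected:', X.columns[selector.get_support()].tolist())

# Model-based ranking: impurity-based importances from a random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importance = pd.Series(rf.feature_importances_, index=X.columns).nlargest(10)
print(importance)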
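
The snippet above leaves `X` and `y` undefined. The sketch below runs the same two selection methods end to end on a synthetic dataset from `make_classification`. The dataset, the `f0`..`f19` column names, and the `overlap` comparison are assumptions added for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, of which 5 are informative (illustrative setup)
X_arr, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                               n_redundant=2, random_state=42)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(20)])

# Univariate filter: top 10 features by ANOVA F-score
selector = SelectKBest(f_classif, k=10).fit(X, y)
selected = X.columns[selector.get_support()].tolist()

# Model-based ranking: top 10 features by random forest importance
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importance = pd.Series(rf.feature_importances_, index=X.columns).nlargest(10)

# Features that both methods agree on
overlap = sorted(set(selected) & set(importance.index))
print('Selected by F-score:', selected)
print('Chosen by both methods:', overlap)
```

The two methods need not agree: F-scores judge each feature on its own, while forest importances also reflect interactions between features.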

FAQ

Does feature engineering matter for tree models?

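Less for scaling: tree splits depend only on the order of feature values, so rescaling and other monotonic transformations leave them essentially unchanged. Categorical columns still need a numeric encoding in most libraries, although the choice of scheme tends to matter less than it does for linear models. Domain-driven features like ratios, date components, and interaction terms can still improve accuracy noticeably.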
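
The scaling claim can be checked empirically. The sketch below trains one decision tree on raw features and one on standardized features, then compares their test predictions. The synthetic dataset and the `max_depth=3` setting are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)

# Same hyperparameters and seed, differing only in whether inputs are scaled
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    scaler.transform(X_train), y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

same = bool((pred_raw == pred_scaled).all())
print('Identical predictions:', same)
```

Standardization preserves the ordering of values within each feature, so both trees should partition the data the same way. A distance-based or linear model would generally not show this behaviour.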

How many features should I create?

As a rough starting point, create 10-30 candidates, then use feature selection to keep the best. More features are not always better: irrelevant features add noise and slow training.
