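Feature engineering, the process of transforming raw data into useful inputs for ML models, often has more impact on accuracy than algorithm choice. A well-engineered feature can boost performance by 10-30%, although the actual gain depends heavily on the dataset and the model. This guide covers six common techniques: categorical encoding, feature scaling, datetime features, binning, interaction features, and feature selection.
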
1. Encoding Categorical Variables
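Of the encodings in this section, target encoding is the easiest to misuse: if each category's mean is computed from the same rows the model later trains on, the feature leaks the target. One common remedy is out-of-fold encoding, where every row is encoded with means learned from the other folds. The sketch below is a minimal illustration, not a production encoder; the helper name `oof_target_encode`, the toy data, and the fold count are assumptions made for this example.

```python
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(frame, col, target, n_splits=3, seed=0):
    """Hypothetical helper: encode `col` using target means from the other folds only."""
    encoded = pd.Series(index=frame.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(frame):
        fit_part = frame.iloc[fit_idx]
        fold_means = fit_part.groupby(col)[target].mean()
        # Categories unseen in the fitting folds fall back to those folds' overall mean
        fallback = fit_part[target].mean()
        encoded.iloc[enc_idx] = (
            frame[col].iloc[enc_idx].map(fold_means).fillna(fallback).to_numpy()
        )
    return encoded

# Toy data, made up for illustration
toy = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'target': [1, 0, 1, 1, 0, 0],
})
toy['city_target_oof'] = oof_target_encode(toy, 'city', 'target')
print(toy)
```

With so few rows the encoded values are noisy; on real data, smoothing toward the global mean is often added as well.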
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'city': ['Mumbai','Delhi','Mumbai','Bangalore'],
                   'size': ['Small','Large','Medium','Large'],
                   'target': [1,0,1,1]})
# One-Hot (nominal, low cardinality)
df_ohe = pd.get_dummies(df, columns=['city'], drop_first=True)
# Ordinal (when order matters)
enc = OrdinalEncoder(categories=[['Small','Medium','Large']])
# fit_transform returns a 2-D array; flatten it to a single column
df['size_enc'] = enc.fit_transform(df[['size']]).ravel()
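# Target encoding (high cardinality)
# Caution: means computed on all rows leak the target; on real data, learn them from training rows only
target_mean = df.groupby('city')['target'].mean()
df['city_target'] = df['city'].map(target_mean)

2. Feature Scaling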
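Scalers learn statistics (mean, standard deviation, min, max, quantiles) from the data they are fit on, so fitting them on the full dataset lets test-set information influence training. A minimal sketch of the usual pattern, fitting on the training split and only transforming the test split, is shown here. The synthetic array and the names `X_demo`, `X_train`, and `X_test` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data standing in for real features
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))

X_train, X_test = train_test_split(X_demo, test_size=0.2, random_state=0)

# Learn mean and std from the training split only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# Reuse the training statistics on the test split
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```

The test split will generally not have exactly zero mean and unit variance after this, which is expected.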
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Assumes df has numeric columns 'sales', 'age', and 'income'
X = df[['sales','age','income']]
X_standard = StandardScaler().fit_transform(X)  # mean=0, std=1
X_minmax = MinMaxScaler().fit_transform(X)      # range [0,1]
X_robust = RobustScaler().fit_transform(X)      # centers on median, scales by IQR; less sensitive to outliers

3. DateTime Features
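The `.dt` accessor exposes the calendar components of a datetime column. pandas numbers weekdays from Monday=0 to Sunday=6, which is why a weekend check compares against 5. A small self-contained illustration follows; the dates and the frame name `demo` are made up.

```python
import pandas as pd

# Made-up dates: a Saturday, a Monday, and a Sunday
demo = pd.DataFrame({'date': ['2024-01-06', '2024-01-08', '2024-07-14']})
demo['date'] = pd.to_datetime(demo['date'])

demo['day_of_week'] = demo['date'].dt.dayofweek   # Monday=0 ... Sunday=6
demo['is_weekend'] = demo['day_of_week'] >= 5     # Saturday or Sunday
demo['quarter'] = demo['date'].dt.quarter
print(demo)
```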
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['date'].dt.dayofweek >= 5
df['quarter'] = df['date'].dt.quarter
df['days_since'] = (df['date'] - pd.Timestamp('2020-01-01')).dt.days

4. Binning
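With `pd.cut`, it is worth checking how values that sit exactly on a bin edge are handled. By default the intervals are right-closed, so the lowest edge itself is excluded and becomes NaN. The sketch below compares the default with `include_lowest=True` and `right=False`; the sample ages are made up.

```python
import pandas as pd

ages = pd.Series([0, 18, 19, 35, 70])
edges = [0, 18, 35, 50, 65, 100]
labels = ['Teen', 'Young', 'Adult', 'Mid', 'Senior']

# Default: right-closed intervals (0, 18], (18, 35], ... so age 0 falls outside
right_closed = pd.cut(ages, bins=edges, labels=labels)
# include_lowest=True also closes the first interval on the left
with_lowest = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)
# right=False switches to left-closed intervals [0, 18), [18, 35), ...
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False)

print(pd.DataFrame({'age': ages, 'right_closed': right_closed,
                    'with_lowest': with_lowest, 'left_closed': left_closed}))
```

Which convention is right depends on how the groups are defined in the domain; the point is to choose it deliberately.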
df['age_group'] = pd.cut(df['age'], bins=[0,18,35,50,65,100],
                         labels=['Teen','Young','Adult','Mid','Senior'])
df['inc_quartile'] = pd.qcut(df['income'], q=4, labels=['Q1','Q2','Q3','Q4'])

5. Interaction Features
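# Ratio features; a zero in 'sqft' would produce inf, so clean such rows first
df['price_per_sqft'] = df['price'] / df['sqft']
# The +1 in the denominator guards against division by zero
df['income_age_ratio'] = df['income'] / (df['age'] + 1)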
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['age','income']])
print(f'2 features -> {X_poly.shape[1]} polynomial features')

6. Feature Selection
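Univariate selection scores each feature against the target on its own and keeps the top `k`. To make the idea runnable end to end, the sketch below uses synthetic data from `make_classification`. The dataset shape, the column names `f0` to `f7`, and `k=3` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 8 features, 3 of which are informative
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(8)])

selector = SelectKBest(f_classif, k=3).fit(X_demo, y_demo)
# ANOVA F-score per feature, highest first
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
selected = X_demo.columns[selector.get_support()].tolist()

print(scores.round(1))
print('Selected:', selected)
```

Because each feature is scored in isolation, a feature that is only useful in combination with others can be missed by this kind of filter.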
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
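# Assumes X is a DataFrame of numeric features and y holds the class labels
# k cannot exceed the number of columns, so cap it
selector = SelectKBest(f_classif, k=min(10, X.shape[1]))
X_best = selector.fit_transform(X, y)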
print('Selected:', X.columns[selector.get_support()].tolist())
# Second view: impurity-based importances from a random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importance = pd.Series(rf.feature_importances_, index=X.columns).nlargest(10)
print(importance)

FAQ
Does feature engineering matter for tree models?
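Less so for scaling and encoding: tree splits depend only on the ordering of feature values, so trees are invariant to monotonic rescaling and usually cope well with simple ordinal codes. Domain-driven features such as ratios, date components, and interaction terms can still improve accuracy significantly, because a tree cannot easily construct them from axis-aligned splits.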
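The scaling part of this answer can be checked empirically. The sketch below fits the same decision tree on raw and on standardized versions of a synthetic dataset and compares the training predictions. The data, the column scale factors, and `max_depth=3` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_raw, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
# Put the columns on very different scales
X_raw = X_raw * np.array([1.0, 1000.0, 10.0, 50.0])
X_scaled = StandardScaler().fit_transform(X_raw)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_demo)

# Standardization preserves the ordering of values, so the splits partition the rows the same way
same = np.array_equal(tree_raw.predict(X_raw), tree_scaled.predict(X_scaled))
print('Identical training predictions:', same)
```

On this data the two trees are expected to agree. The same experiment with a distance-based or linear model would typically show a difference.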
How many features should I create?
As a rough starting point, create 10-30 candidate features, then use feature selection to keep those that improve validation performance. More features are not always better: irrelevant features add noise and slow training.
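One way to judge whether pruning candidates helps is to compare cross-validated scores with and without a selection step. Placing the selector inside a pipeline means it is re-fit on each training fold, so the held-out fold does not influence which features are kept. This is a sketch on synthetic data: the 30 candidate features, `k=10`, and the logistic regression model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for 30 candidate features, only 5 of which carry signal
X_cand, y_cand = make_classification(n_samples=400, n_features=30, n_informative=5,
                                     n_redundant=0, random_state=0)

all_features = make_pipeline(LogisticRegression(max_iter=1000))
top_k = make_pipeline(SelectKBest(f_classif, k=10),
                      LogisticRegression(max_iter=1000))

# Mean accuracy over 5 folds for each variant
score_all = cross_val_score(all_features, X_cand, y_cand, cv=5).mean()
score_top = cross_val_score(top_k, X_cand, y_cand, cv=5).mean()
print(f'all 30 features: {score_all:.3f}, top 10 selected per fold: {score_top:.3f}')
```

Which variant scores higher depends on the data and the model, so the comparison is worth rerunning whenever the candidate set changes.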



