Random Forest is one of the most widely used machine learning algorithms. It combines many decision trees (typically hundreds) into a single, more accurate predictor, and it usually works well out of the box with very little tuning. This guide covers how it works, how to train one with scikit-learn, which hyperparameters matter, and when to use it.
What is Random Forest?
Random Forest is an ensemble learning method that builds multiple decision trees during training and outputs the mode (classification) or mean (regression) of their individual predictions. The ‘random’ in the name comes from two sources of randomness: random subsets of training data (bootstrap sampling) and random subsets of features at each split.
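To make the aggregation step concrete, here is a minimal sketch that inspects the individual trees of a fitted scikit-learn forest. One detail worth knowing: scikit-learn's classifier averages each tree's predicted class probabilities rather than counting hard votes, which usually gives the same answer as the mode. The dataset and the choice of 25 trees are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

sample = X[:5]

# Each fitted tree is exposed through forest.estimators_
per_tree_proba = np.stack(
    [tree.predict_proba(sample) for tree in forest.estimators_]
)  # shape: (n_trees, n_samples, n_classes)

# Average the per-tree probabilities, then pick the most likely class
mean_proba = per_tree_proba.mean(axis=0)
aggregated = forest.classes_[mean_proba.argmax(axis=1)]

print("Averaged probabilities:\n", mean_proba)
print("Aggregated prediction: ", aggregated)
print("forest.predict:        ", forest.predict(sample))
```

For regression, `RandomForestRegressor` plays the same role and averages the trees' numeric predictions.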
How Random Forest Works
- Bootstrap sampling: Create N random subsets of training data (with replacement)
- Build N decision trees: Each tree trained on one bootstrap sample
- Random feature selection: At each split, consider only a random subset of the features (commonly sqrt(total_features) for classification)
- Aggregate predictions: Vote (classification) or average (regression) across all trees
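The four steps above can be sketched by hand with plain decision trees. This is an illustrative approximation, not how scikit-learn implements its forest internally; the tree count, seeds, and variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 50  # assumed value for the sketch
trees = []

for i in range(n_trees):
    # Step 1: bootstrap sample (row indices drawn with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Steps 2 and 3: one tree per sample, sqrt(n_features) candidates per split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 4: majority vote across trees for each test row
votes = np.stack([tree.predict(X_test) for tree in trees])  # (n_trees, n_test)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

accuracy = (majority == y_test).mean()
print(f"Hand-rolled forest accuracy: {accuracy:.3f}")
```

In practice you would use `RandomForestClassifier`, which handles all of this (plus parallelism and out-of-bag scoring) for you.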
Random Forest in Python (sklearn)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
import matplotlib.pyplot as plt
# Load data
data = load_breast_cancer()  # load once and reuse for features, target, and names
X, y = data.data, data.target
feature_names = data.feature_names
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,        # Number of trees
    max_depth=None,          # Trees grow until pure
    min_samples_split=2,     # Min samples to split a node
    min_samples_leaf=1,      # Min samples at leaf
    max_features='sqrt',     # Features per split
    random_state=42
)
rf.fit(X_train, y_train)
# Evaluate
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
Feature Importance
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances.nlargest(10).sort_values().plot(kind='barh')
plt.title('Top 10 Most Important Features')
plt.tight_layout()
plt.show()
Key Hyperparameters to Tune
| Parameter | Default | Effect | Tune range |
|---|---|---|---|
| n_estimators | 100 | More trees = better (diminishing returns) | 100-500 |
| max_depth | None | Limits tree depth, reduces overfitting | 5-30 |
| min_samples_leaf | 1 | Smooths predictions | 1-10 |
| max_features | sqrt | Diversity between trees | sqrt, log2, 0.5 |
| class_weight | None | Handle class imbalance | balanced |
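One way to explore the ranges in this table is a randomized search with cross-validation. The sketch below is a starting point under stated assumptions: the candidate values, `n_iter=6`, and `cv=3` are picked to keep the run short, not tuned recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate values drawn from the "Tune range" column (assumed grid)
param_distributions = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10, 20, 30],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["sqrt", "log2", 0.5],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=6,          # number of random combinations to try
    cv=3,              # 3-fold cross-validation on the training set
    random_state=42,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy:    {search.score(X_test, y_test):.3f}")
```

Randomized search samples combinations instead of trying all of them, which keeps the cost predictable as the grid grows. Increase `n_iter` when you have more compute to spend.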
Random Forest vs Decision Tree vs XGBoost
| Model | Training speed | Typical accuracy | Overfitting risk | Interpretability |
|---|---|---|---|---|
| Decision Tree | Fast | Low | High | High |
| Random Forest | Medium | High | Low | Medium |
| XGBoost | Slower | Higher | Medium | Low |
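The ratings above are rules of thumb, so it is worth checking them on your own data. A rough sketch using 5-fold cross-validation follows. Note the assumption: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost here so the example needs no extra dependency; it is a related boosting method, not the same library.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # Stand-in for XGBoost (assumption made for this sketch)
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Mean accuracy over 5 cross-validation folds for each model
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name:<18} mean CV accuracy: {score:.3f}")
```

Results vary by dataset, so treat a single comparison like this as a sanity check rather than a ranking.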
When to Use Random Forest
- ✅ You need a strong baseline quickly
- ✅ Dataset has mixed feature types
- ✅ You need feature importance scores
- ✅ Class imbalance is moderate
- ✅ Interpretability is somewhat important
FAQ
Does Random Forest need feature scaling?
No. Each tree split compares a single feature against a threshold, so only the ordering of values matters, not their magnitude. You don’t need to standardize or normalize features.
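A quick way to check this yourself is to train two forests with the same seed, one on the raw features and one on rescaled features, and compare their predictions. In this sketch the factor 1024 is an arbitrary assumption; the two models should agree on all or nearly all test rows.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scale = 1024.0  # arbitrary positive factor applied to every feature

rf_raw = RandomForestClassifier(n_estimators=100, random_state=42)
rf_raw.fit(X_train, y_train)

rf_scaled = RandomForestClassifier(n_estimators=100, random_state=42)
rf_scaled.fit(X_train * scale, y_train)

pred_raw = rf_raw.predict(X_test)
pred_scaled = rf_scaled.predict(X_test * scale)

# Fraction of test rows where both forests predict the same class
agreement = np.mean(pred_raw == pred_scaled)
print(f"Predictions that agree: {agreement:.1%}")
```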
How many trees should I use?
Start with 100. Accuracy usually improves with more trees but tends to plateau somewhere around 200-500, depending on the dataset. Beyond that, you’re mostly adding computation time.
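To see where the plateau falls on your data, you can track the out-of-bag (OOB) score as the number of trees grows. OOB scoring evaluates each row using only the trees that did not see it during training, so no separate validation set is needed. The tree counts below are assumed values for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

oob_scores = {}
for n_trees in [25, 50, 100, 200, 400]:
    rf = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,     # requires bootstrap=True, which is the default
        random_state=42,
    )
    rf.fit(X, y)
    oob_scores[n_trees] = rf.oob_score_

for n_trees, score in oob_scores.items():
    print(f"{n_trees:>4} trees -> OOB accuracy {score:.4f}")
```

Once the score stops moving by more than noise, extra trees mostly cost training and prediction time.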
How do I handle overfitting in Random Forest?
Set max_depth (try 10-20), increase min_samples_leaf (try 5-10), or reduce max_features. Also ensure you have enough training data.
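A simple diagnostic is to compare training and test accuracy: a large gap suggests overfitting. The sketch below contrasts the default settings with a more constrained forest. The specific values (`max_depth=10`, `min_samples_leaf=5`, `max_features="log2"`) are assumptions taken from the ranges suggested above, not tuned results.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

configs = {
    "default": {},
    "regularized": {
        "max_depth": 10,
        "min_samples_leaf": 5,
        "max_features": "log2",
    },
}

results = {}
for name, params in configs.items():
    rf = RandomForestClassifier(n_estimators=100, random_state=42, **params)
    rf.fit(X_train, y_train)
    results[name] = {
        "train": rf.score(X_train, y_train),
        "test": rf.score(X_test, y_test),
    }

for name, scores in results.items():
    gap = scores["train"] - scores["test"]
    print(f"{name:<12} train={scores['train']:.3f} "
          f"test={scores['test']:.3f} gap={gap:.3f}")
```

On a small, clean dataset like this one the gap may already be small; the comparison is more informative on noisy or high-dimensional data.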


