Saturday, June 27, 2026

Random Forest Algorithm: Complete Guide with Python Examples (2026)


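Random Forest is one of the most widely used machine learning algorithms. It combines many decision trees (100 by default in scikit-learn) into a single model that is usually more accurate and more stable than any one tree, and it often works well out of the box with very little tuning. This guide covers how it works, how to train one in Python, which hyperparameters matter, and when to choose it over alternatives.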

What is Random Forest?

Random Forest is an ensemble learning method that builds multiple decision trees during training and outputs the mode (classification) or mean (regression) of their individual predictions. The ‘random’ in the name comes from two sources of randomness: random subsets of training data (bootstrap sampling) and random subsets of features at each split.
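To see the aggregation step concretely, here is a small sketch using scikit-learn. One detail worth knowing: as far as its documentation describes, scikit-learn's classifier averages each tree's predicted class probabilities rather than counting hard votes, and the sketch checks that against the fitted trees in rf.estimators_. The choice of 25 trees and the breast cancer dataset are just illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities predicted by each individual tree.
tree_probas = np.stack([tree.predict_proba(X) for tree in rf.estimators_])
mean_proba = tree_probas.mean(axis=0)

# The forest's own probabilities should match the per-tree average.
matches_forest = np.allclose(mean_proba, rf.predict_proba(X))

# Hard majority vote over the trees, for comparison with rf.predict.
tree_votes = np.stack([tree.predict(X) for tree in rf.estimators_]).astype(int)
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)
agreement = np.mean(majority == rf.predict(X))

print("probabilities match per-tree average:", matches_forest)
print("share of samples where hard vote agrees:", agreement)
```

With fully grown trees the two aggregation styles tend to give the same labels, so the distinction mostly matters when you limit tree depth or rely on predict_proba.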

How Random Forest Works

  1. Bootstrap sampling: Draw N bootstrap samples from the training data, each sampled with replacement (typically the same size as the original training set)
  2. Build N decision trees: Each tree is trained on one bootstrap sample
  3. Random feature selection: At each split, consider only a random subset of features (sqrt(total_features) is the usual default for classification)
  4. Aggregate predictions: Vote (classification) or average (regression) across all trees
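The four steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn implements its forest internally; the tree count of 50, the seeds, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 50  # illustrative choice
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample, same size as the training set, with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Steps 2-3: one tree per sample; max_features="sqrt" makes each split
    # consider only a random subset of the features.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(0, 2**31 - 1))
    )
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 4: majority vote across trees (labels here are 0 and 1).
votes = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
accuracy = np.mean(y_pred == y_test)
print(f"hand-rolled forest accuracy: {accuracy:.3f}")
```

In practice you would use RandomForestClassifier, which does the same thing faster and adds extras such as out-of-bag scoring.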

Random Forest in Python (sklearn)

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
import matplotlib.pyplot as plt

# Load data
data = load_breast_cancer()          # load once, reuse for features, labels, and names
X, y = data.data, data.target
feature_names = data.feature_names
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,      # Number of trees
    max_depth=None,        # None: grow until leaves are pure or too small to split
    min_samples_split=2,   # Min samples to split a node
    min_samples_leaf=1,    # Min samples at leaf
    max_features='sqrt',   # Features per split
    random_state=42
)
rf.fit(X_train, y_train)

# Evaluate
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))

Feature Importance

importances = pd.Series(rf.feature_importances_, index=feature_names)
importances.nlargest(10).sort_values().plot(kind='barh')
plt.title('Top 10 Most Important Features')
plt.tight_layout()
plt.show()
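The feature_importances_ attribute is impurity-based and computed on the training data, which can overstate features with many distinct values. A common cross-check is permutation importance on held-out data. The sketch below uses sklearn.inspection.permutation_importance; the n_repeats value and the split are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle one feature at a time on the test set and measure the accuracy drop.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
perm = pd.Series(result.importances_mean, index=data.feature_names)
print(perm.nlargest(10))
```

If the two rankings disagree strongly, the permutation ranking is usually the safer one to report, though correlated features can split importance between them in both methods.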

Key Hyperparameters to Tune

Parameter | Default | Effect | Tune range
n_estimators | 100 | More trees = better (diminishing returns) | 100-500
max_depth | None | Limits tree depth, reduces overfitting | 5-30
min_samples_leaf | 1 | Smooths predictions | 1-10
max_features | sqrt | Diversity between trees | sqrt, log2, 0.5
class_weight | None | Handle class imbalance | balanced
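One way to search the ranges in the table is a randomized search with cross-validation. This is a sketch under assumptions: the candidate values are a small sample of the ranges above, and n_iter=6 with cv=3 is chosen only to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Candidate values taken from the tuning ranges in the table.
param_distributions = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10, 20, 30],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["sqrt", "log2", 0.5],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=6,          # number of random combinations to try
    cv=3,              # 3-fold cross-validation per combination
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

For a real project you would likely raise n_iter and cv, and pick a scoring metric that matches the problem (for example "f1" or "roc_auc" when classes are imbalanced).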

Random Forest vs Decision Tree vs XGBoost

Model | Speed | Accuracy | Overfitting | Interpretability
Decision Tree | Fast | Low | High | High
Random Forest | Medium | High | Low | Medium
XGBoost | Slower | Higher | Medium | Low
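Rankings like these depend on the dataset, so it is worth measuring on your own data. The sketch below compares the models with 5-fold cross-validation. Note the substitution: to avoid an extra dependency it uses scikit-learn's GradientBoostingClassifier as a stand-in for gradient boosting, not XGBoost itself, so its numbers say nothing definitive about XGBoost.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # Stand-in for boosted trees; swap in xgboost.XGBClassifier if installed.
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```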

When to Use Random Forest

  • ✅ You need a strong baseline quickly
  • ✅ Dataset has mixed feature types (in scikit-learn, categorical columns still need to be encoded as numbers first)
  • ✅ You need feature importance scores
  • ✅ Class imbalance is moderate
  • ✅ Interpretability is somewhat important
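For the mixed-feature-types case, scikit-learn's forest expects numeric input, so categorical columns need encoding. Below is a minimal sketch using a ColumnTransformer with an OrdinalEncoder. The data is synthetic and the column names (age, income, plan) are hypothetical, invented purely for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic toy data: two numeric columns and one categorical column.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "plan": rng.choice(["basic", "plus", "pro"], size=n),
})
# Synthetic target that depends on both a numeric and a categorical column.
y = ((df["plan"] == "pro") | (df["income"] > 60_000)).astype(int)

# Encode the categorical column as integers; pass numeric columns through.
preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["plan"])],
    remainder="passthrough",
)
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(df, y)
train_accuracy = model.score(df, y)
print(f"training accuracy on toy data: {train_accuracy:.3f}")
```

Ordinal codes are often good enough for tree models because splits only need an ordering to cut on; one-hot encoding is an alternative when a category has no sensible order and few levels.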

FAQ

Does Random Forest need feature scaling?

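No. Random Forest is tree-based: each split compares a single feature against a threshold, so rescaling a feature changes the threshold value but not which samples fall on each side. You don’t need to standardize or normalize features.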
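You can check this yourself by training the same forest on raw and standardized features and comparing predictions. This is a sketch, assuming the same random_state for both models; tiny floating-point effects could in principle cause an occasional difference, so it reports an agreement rate rather than claiming exact equality.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on training data only, then train two otherwise identical forests.
scaler = StandardScaler().fit(X_train)
rf_raw = RandomForestClassifier(n_estimators=100, random_state=42)
rf_raw.fit(X_train, y_train)
rf_scaled = RandomForestClassifier(n_estimators=100, random_state=42)
rf_scaled.fit(scaler.transform(X_train), y_train)

pred_raw = rf_raw.predict(X_test)
pred_scaled = rf_scaled.predict(scaler.transform(X_test))

# Fraction of test samples where both models predict the same label.
agreement = np.mean(pred_raw == pred_scaled)
print(f"prediction agreement: {agreement:.3f}")
```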

How many trees should I use?

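Start with 100, which is also scikit-learn’s default. Accuracy tends to improve as you add trees, but with diminishing returns: it often plateaus around 200-500, depending on the dataset. Beyond that, you’re mostly adding computation time.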
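A cheap way to find the plateau on your own data is the out-of-bag (OOB) score, which evaluates each tree on the samples left out of its bootstrap sample. The sketch below uses the oob_score=True option; the list of tree counts is an illustrative assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

oob_scores = {}
for n_trees in [50, 100, 200, 400]:
    rf = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,    # score each sample using only trees that did not see it
        random_state=42,
        n_jobs=-1,
    )
    rf.fit(X, y)
    oob_scores[n_trees] = rf.oob_score_
    print(f"{n_trees} trees: OOB accuracy = {rf.oob_score_:.3f}")
```

When the OOB score stops moving between consecutive settings, adding more trees is unlikely to help.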

How do I handle overfitting in Random Forest?

Set max_depth (try 10-20), increase min_samples_leaf (try 5-10), or reduce max_features. Also ensure you have enough training data.
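To judge whether these settings help, compare training and test accuracy before and after. A large gap suggests overfitting. The sketch below contrasts the defaults with one regularized configuration; the specific values (max_depth=10, min_samples_leaf=5) are simply picked from the ranges suggested above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

configs = {
    "unconstrained": {},
    "regularized": {"max_depth": 10, "min_samples_leaf": 5},
}

results = {}
for name, params in configs.items():
    rf = RandomForestClassifier(n_estimators=100, random_state=42, **params)
    rf.fit(X_train, y_train)
    train_acc = rf.score(X_train, y_train)
    test_acc = rf.score(X_test, y_test)
    results[name] = (train_acc, test_acc)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}")
```

A forest with near-perfect training accuracy is not necessarily a problem on its own; what matters is whether the regularized version does at least as well on held-out data.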
