Linear regression is one of the foundational techniques in machine learning. Understanding it well makes many other algorithms easier to learn. This guide covers simple regression, multiple regression, assumptions, regularisation, and evaluation in Python.
What is Linear Regression?
Linear regression models the linear relationship between a dependent variable (what you predict) and one or more independent variables (features). It fits a line (or, with several features, a hyperplane) that minimises the sum of squared prediction errors across all data points.
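To make "minimises prediction error" concrete, here is a minimal sketch of an ordinary least squares fit using NumPy's `np.linalg.lstsq`. The toy data and the variable names (`x`, `A`, `beta`) are illustrative assumptions, not part of the examples that follow.

```python
import numpy as np

# Hypothetical toy data: y is roughly 2 + 3x plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, 100)

# Design matrix: a column of ones (intercept) next to the feature
A = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the beta that minimises sum((y - A @ beta) ** 2)
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta

# Sum of squared errors at the fitted coefficients
residuals = y - A @ beta
sse = float(np.sum(residuals ** 2))
print(f'intercept={intercept:.3f}, slope={slope:.3f}, SSE={sse:.2f}')
```

Nudging either coefficient away from the fitted value can only increase the sum of squared errors, which is what "best fit" means here.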
Simple Linear Regression in Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Synthetic data: one feature with a linear signal plus Gaussian noise
np.random.seed(42)
X = np.random.uniform(500, 3000, 200).reshape(-1, 1)
y = 50 + 0.15 * X.flatten() + np.random.normal(0, 20, 200)
# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
# Predict once and reuse the result for the metrics and the plot
y_pred = model.predict(X_test)
print(f'Slope: {model.coef_[0]:.4f}')
print(f'Intercept: {model.intercept_:.2f}')
print(f'R-squared: {r2_score(y_test, y_pred):.4f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}')
plt.scatter(X_test, y_test, alpha=0.5, label='Actual')
plt.plot(X_test, y_pred, color='red', label='Predicted')
plt.legend()
plt.title('Linear Regression')
plt.show()
Multiple Linear Regression
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({
'size': [1000,1500,2000,1200,1800,900,2200,1600],
'bedrooms': [2,3,4,2,3,1,4,3],
'age_years': [10,5,2,15,8,20,1,6],
'price_lakh': [45,68,95,50,82,35,110,75]
})
X = df[['size','bedrooms','age_years']]
y = df['price_lakh']
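# Standardise features so coefficient magnitudes are comparable across features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Each coefficient is now the change in price per one standard deviation of the feature
model = LinearRegression().fit(X_scaled, y)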
for feat, coef in zip(X.columns, model.coef_):
    print(f'{feat:12}: {coef:+.4f}')
Key Assumptions
- Linearity — X and y have a linear relationship
- Independence — observations are independent
- Homoscedasticity — residuals have constant variance
- Normality — residuals are normally distributed
- No multicollinearity — features not highly correlated
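Some of these assumptions can be checked roughly in code. The sketch below uses made-up data in which one feature is deliberately a near-copy of another, then looks at the residuals and computes a variance inflation factor (VIF) per feature. The helper `vif` is a hypothetical function written for this illustration, and the thresholds people apply to VIF (values above roughly 5 to 10 are often treated as a warning sign) are rules of thumb rather than hard limits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: x3 is deliberately almost identical to x1
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0, 0.5, n)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# With an intercept, OLS residuals average to (numerically) zero
print(f'Mean residual: {residuals.mean():.6f}')

# Rough homoscedasticity check: residual spread for low vs high fitted values
order = np.argsort(fitted)
low, high = residuals[order[:n // 2]], residuals[order[n // 2:]]
print(f'Residual std, low/high fitted: {low.std():.3f} / {high.std():.3f}')

def vif(features, j):
    """Hypothetical helper: VIF_j = 1 / (1 - R^2) from regressing feature j on the others."""
    others = np.delete(features, j, axis=1)
    r2 = LinearRegression().fit(others, features[:, j]).score(others, features[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip(['x1', 'x2', 'x3'], vifs):
    print(f'VIF {name}: {value:.1f}')
```

Here `x1` and `x3` should show large VIF values while `x2` stays close to 1, which flags the correlated pair. A residual plot or a Q-Q plot would be the usual next step for linearity and normality.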
Ridge and Lasso (Regularised Regression)
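from sklearn.linear_model import Ridge, Lasso
# Reuses X_train, X_test, y_train, y_test from the simple regression split above
# alpha sets the penalty strength: larger values shrink coefficients more
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
print(f'Ridge R2: {r2_score(y_test, ridge.predict(X_test)):.4f}')
print(f'Lasso R2: {r2_score(y_test, lasso.predict(X_test)):.4f}')
FAQ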
When should I use Ridge vs Lasso?
Use Ridge (L2) when most features are expected to matter and you want to shrink coefficient magnitudes. Use Lasso (L1) when you want automatic feature selection: it can drive some coefficients to exactly zero.
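The difference is easy to see on synthetic data where only some features carry signal. In this sketch (the data, the `alpha` values, and the choice of two informative features out of five are all assumptions made for illustration), Ridge keeps every coefficient non-zero while Lasso sets the uninformative ones to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: only the first two of five features influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print('Ridge coefficients:', np.round(ridge.coef_, 3))
print('Lasso coefficients:', np.round(lasso.coef_, 3))
print('Features zeroed by Lasso:', int(np.sum(lasso.coef_ == 0)))
```

Note that Lasso also shrinks the coefficients it keeps, so the surviving values sit below the true 3 and -2 in magnitude. In practice `alpha` would normally be chosen by cross-validation rather than fixed by hand.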
What is a good R-squared?
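It depends on the domain. In the physical sciences, R-squared above 0.95 is often expected; in business and social science, 0.6 to 0.8 is often acceptable. Always benchmark against a simple baseline model, such as one that predicts the mean of the training targets.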
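One way to set up such a baseline is scikit-learn's `DummyRegressor`, which ignores the features and predicts a constant. The sketch below assumes synthetic data shaped like the simple regression example and compares a mean-predicting baseline with a fitted linear model on the same held-out split.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch): linear signal plus noise
rng = np.random.default_rng(42)
X = rng.uniform(500, 3000, 200).reshape(-1, 1)
y = 50 + 0.15 * X.ravel() + rng.normal(0, 20, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

baseline_r2 = r2_score(y_test, baseline.predict(X_test))
model_r2 = r2_score(y_test, model.predict(X_test))
print(f'Baseline R2: {baseline_r2:.4f}')
print(f'Model R2:    {model_r2:.4f}')
```

A mean-only baseline scores at or slightly below zero on held-out data, so the gap between the two numbers, rather than the model's R-squared alone, shows how much the features actually contribute.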



