Data scientists spend 60-80% of their time cleaning data. Messy, incomplete, inconsistent data is the norm in the real world. This guide covers every common scenario using Pandas — the most essential data cleaning tool in Python.
The Data Cleaning Checklist
- Inspect shape, dtypes, and missing values
- Handle missing values
- Remove duplicates
- Fix data types
- Handle outliers
- Standardise text and categories
- Validate business rules
Step 1: Initial Inspection
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
print(df.shape, df.dtypes)
print(df.info())
print(df.describe())
missing = pd.DataFrame({
'count': df.isnull().sum(),
'pct': (df.isnull().mean() * 100).round(2)
}).sort_values('pct', ascending=False)
print(missing[missing['count'] > 0])Step 2: Handle Missing Values
df = df.loc[:, df.isnull().mean() < 0.5] # drop cols > 50% missing
for col in df.select_dtypes(include=np.number).columns:
df[col] = df[col].fillna(df[col].median())
for col in df.select_dtypes(include='object').columns:
df[col] = df[col].fillna(df[col].mode()[0] if not df[col].mode().empty else 'Unknown')
print('Nulls remaining:', df.isnull().sum().sum())Step 3: Duplicates
print('Duplicates:', df.duplicated().sum())
df = df.drop_duplicates()
df = df.drop_duplicates(subset=['customer_id', 'order_date'], keep='last')Step 4: Fix Data Types
df['revenue'] = pd.to_numeric(df['revenue'].str.replace('[$,]', '', regex=True), errors='coerce')
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['is_active'] = df['is_active'].astype(bool)
df['category'] = df['category'].astype('category')Step 5: Outlier Removal
def remove_iqr_outliers(df, col):
Q1, Q3 = df[col].quantile([0.25, 0.75])
IQR = Q3 - Q1
mask = df[col].between(Q1 - 1.5*IQR, Q3 + 1.5*IQR)
print(f'{col}: removed {(~mask).sum()} outliers')
return df[mask]
df = remove_iqr_outliers(df, 'salary')Step 6: Standardise Text
df['city'] = df['city'].str.strip().str.title()
city_aliases = {'Bombay': 'Mumbai', 'Bangalore': 'Bengaluru', 'BLR': 'Bengaluru'}
df['city'] = df['city'].replace(city_aliases)
df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True).str[-10:]FAQ
Drop rows or impute missing values?
Drop when under 5% missing and rows are independent. Impute with median or mode for moderate missingness. Use KNN imputation for important features with high missingness.
How to handle outliers that are real data?
Cap them (winsorising) instead of removing — set values above the 99th percentile to the 99th percentile. This preserves the data point while reducing its disproportionate influence on models.



