Sunday, June 28, 2026
HomeData AnalyticsData Cleaning in Python: Complete Pandas Guide (2026)

Data Cleaning in Python: Complete Pandas Guide (2026)

Table of Content

Data scientists spend 60-80% of their time cleaning data. Messy, incomplete, inconsistent data is the norm in the real world. This guide covers every common scenario using Pandas — the most essential data cleaning tool in Python.

The Data Cleaning Checklist

  1. Inspect shape, dtypes, and missing values
  2. Handle missing values
  3. Remove duplicates
  4. Fix data types
  5. Handle outliers
  6. Standardise text and categories
  7. Validate business rules

Step 1: Initial Inspection

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')
print(df.shape, df.dtypes)
print(df.info())
print(df.describe())

missing = pd.DataFrame({
    'count': df.isnull().sum(),
    'pct': (df.isnull().mean() * 100).round(2)
}).sort_values('pct', ascending=False)
print(missing[missing['count'] > 0])

Step 2: Handle Missing Values

df = df.loc[:, df.isnull().mean() < 0.5]  # drop cols > 50% missing

for col in df.select_dtypes(include=np.number).columns:
    df[col] = df[col].fillna(df[col].median())

for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0] if not df[col].mode().empty else 'Unknown')

print('Nulls remaining:', df.isnull().sum().sum())

Step 3: Duplicates

print('Duplicates:', df.duplicated().sum())
df = df.drop_duplicates()
df = df.drop_duplicates(subset=['customer_id', 'order_date'], keep='last')

Step 4: Fix Data Types

df['revenue'] = pd.to_numeric(df['revenue'].str.replace('[$,]', '', regex=True), errors='coerce')
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['is_active'] = df['is_active'].astype(bool)
df['category'] = df['category'].astype('category')

Step 5: Outlier Removal

def remove_iqr_outliers(df, col):
    Q1, Q3 = df[col].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    mask = df[col].between(Q1 - 1.5*IQR, Q3 + 1.5*IQR)
    print(f'{col}: removed {(~mask).sum()} outliers')
    return df[mask]

df = remove_iqr_outliers(df, 'salary')

Step 6: Standardise Text

df['city'] = df['city'].str.strip().str.title()
city_aliases = {'Bombay': 'Mumbai', 'Bangalore': 'Bengaluru', 'BLR': 'Bengaluru'}
df['city'] = df['city'].replace(city_aliases)
df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True).str[-10:]

FAQ

Drop rows or impute missing values?

Drop when under 5% missing and rows are independent. Impute with median or mode for moderate missingness. Use KNN imputation for important features with high missingness.

How to handle outliers that are real data?

Cap them (winsorising) instead of removing — set values above the 99th percentile to the 99th percentile. This preserves the data point while reducing its disproportionate influence on models.

Leave feedback about this

  • Rating

Latest Posts

List of Categories