Saturday, June 27, 2026

K-Means Clustering: Step-by-Step Guide with Python Examples (2026)

K-Means is one of the most widely used unsupervised machine learning algorithms. It groups data into K clusters based on similarity, with no labels required. This guide walks through the algorithm, a Python implementation, and how to choose the right K.

How K-Means Works

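  1. Choose K (the number of clusters)
  2. Initialize K centroids, for example by picking K random data points
  3. Assign each point to its nearest centroid
  4. Recalculate each centroid as the mean of the points assigned to it
  5. Repeat steps 3-4 until the centroids stop moving or a maximum number of iterations is reached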
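
The steps above can be sketched in plain NumPy. This is a minimal illustration rather than what scikit-learn does internally: the function name kmeans_sketch, the toy data, and the choice to leave an empty cluster's centroid where it was are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Plain Lloyd's algorithm; illustrative only, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points;
        # an empty cluster keeps its previous centroid (assumption for this sketch)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy blobs (made-up data)
rng = np.random.default_rng(42)
X_toy = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
])
toy_labels, toy_centroids = kmeans_sketch(X_toy, k=2)
print(toy_centroids.round(2))
```

On data this cleanly separated, the two centroids should land near the blob centers. On harder data the result depends on the random initialization, which is one reason library implementations run several initializations and keep the best one.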

K-Means in Python

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np

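# Sample customer data: annual income and spending score.
# Five synthetic groups of 40 customers each, one (income, spending) center per group.
np.random.seed(42)
income_centers = [50, 80, 30, 60, 90]
spending_centers = [40, 85, 20, 70, 80]
X = np.column_stack([
    np.random.normal(income_centers, 10, (40, 5)).ravel(),
    np.random.normal(spending_centers, 15, (40, 5)).ravel()
])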

# Scale features (important for K-Means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-Means with K=5
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)

# Visualize in the original units; centroids are mapped back from the scaled space
centers = scaler.inverse_transform(kmeans.cluster_centers_)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1],
            c='red', marker='x', s=200, linewidths=3, label='Centroids')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Customer Segments (K-Means)')
plt.legend()
plt.show()

Choosing the Right K

Elbow Method

inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow Method — Choose K at the "elbow"')
plt.show()
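
The inertia plotted on the y-axis of the elbow chart is the sum of squared distances from each point to the centroid of its assigned cluster. As a quick sanity check, it can be recomputed by hand and compared with the inertia_ attribute. The data and the names X_demo, km_demo, and manual_inertia are made up for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 2))  # made-up data for the check

km_demo = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X_demo)

# Squared Euclidean distance from each point to the centroid of its assigned cluster
assigned_centroids = km_demo.cluster_centers_[km_demo.labels_]
manual_inertia = float(((X_demo - assigned_centroids) ** 2).sum())

print(f"manual: {manual_inertia:.4f}  sklearn: {km_demo.inertia_:.4f}")
```

Because adding clusters can only keep this sum the same or lower it at the optimum, inertia alone cannot pick K. The elbow method looks for the point where the decrease levels off.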

Silhouette Score Method

scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    k_labels = km.fit_predict(X_scaled)  # separate name so the K=5 labels above are not overwritten
    score = silhouette_score(X_scaled, k_labels)
    scores.append(score)
    print(f'K={k}: Silhouette={score:.4f}')

best_k = range(2, 11)[scores.index(max(scores))]
print(f'\nBest K: {best_k} (score: {max(scores):.4f})')

Real-World Application: Customer Segmentation

import pandas as pd

# Toy RFM data (Recency, Frequency, Monetary); in practice, load this from your own source
df = pd.DataFrame({
    'recency': [10, 200, 5, 150, 30, 180, 7, 120],
    'frequency': [50, 3, 80, 5, 40, 2, 90, 8],
    'monetary': [5000, 200, 8000, 300, 4000, 150, 9000, 500]
})

# Scale
X = StandardScaler().fit_transform(df)

# Cluster into 3 segments
km = KMeans(n_clusters=3, random_state=42, n_init=10)
df['segment'] = km.fit_predict(X)

# Profile each segment
print(df.groupby('segment').mean().round(2))

Limitations of K-Means

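  • Must specify K in advance
  • Assumes roughly spherical clusters of similar size and spread
  • Sensitive to outliers, because centroids are means and a few extreme points can pull them away
  • Not suitable for non-convex shapes (consider a density-based method such as DBSCAN instead)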
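
The non-convex limitation is easy to see on scikit-learn's two-moons toy dataset. The sketch below compares K-Means and DBSCAN against the known moon labels using the adjusted Rand index (1.0 means perfect agreement). The eps and min_samples values are hand-picked for this toy data and are not general recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a standard non-convex toy dataset
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X_moons)
# eps and min_samples are hand-picked for this toy data, not general defaults
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

ari_kmeans = adjusted_rand_score(y_moons, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_moons, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

K-Means cuts the plane with a straight boundary between its two centroids, so each cluster ends up containing parts of both moons. DBSCAN follows the dense arcs instead.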

FAQ

When should I use K-Means vs DBSCAN?

Use K-Means when clusters are roughly spherical and you have an approximate idea of K. Use DBSCAN when clusters have irregular shapes or when you want outliers flagged automatically as noise.

Does K-Means require feature scaling?

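In most cases, yes. K-Means is distance-based, so without scaling, features with larger magnitudes dominate the clustering. Standardizing features (for example with StandardScaler) puts them on a comparable footing before distances are computed.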
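
A small experiment can illustrate the effect. In this sketch the data is made up: two groups differ only in a hypothetical 0-1 engagement feature, while income in dollars is unrelated noise on a much larger scale. The helper cluster_agreement is likewise an invention for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100  # customers per group (made-up data)

# The two groups differ only in a 0-1 "engagement" feature (hypothetical);
# income in dollars is unrelated noise on a much larger scale.
true_group = np.repeat([0, 1], n)
engagement = np.concatenate([rng.normal(0.2, 0.05, n), rng.normal(0.8, 0.05, n)])
income = rng.normal(50_000, 15_000, 2 * n)
X_mixed = np.column_stack([income, engagement])


def cluster_agreement(features):
    """Fit K-Means with K=2 and compare its labels with the known groups."""
    pred = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(features)
    return adjusted_rand_score(true_group, pred)


ari_raw = cluster_agreement(X_mixed)
ari_scaled = cluster_agreement(StandardScaler().fit_transform(X_mixed))
print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling:    {ari_scaled:.2f}")
```

Without scaling, distances are driven almost entirely by income, so the clusters ignore the feature that actually separates the groups. After standardization, both features contribute on the same scale.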
