K-Means clustering is one of the most widely used unsupervised machine learning algorithms. It groups data into K clusters based on similarity, with no labels required. This guide walks through the algorithm, a Python implementation, and how to choose the right K.
How K-Means Works
- Choose K (number of clusters)
- Randomly initialize K centroids
- Assign each point to the nearest centroid
- Recalculate centroids as mean of assigned points
- Repeat the assignment and update steps until the centroids stop moving
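The loop above can be sketched in a few lines of NumPy. This is a minimal illustration rather than scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, the choice to initialize from randomly picked data points, and the rule of keeping a centroid in place when its cluster empties are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain Lloyd-style K-Means loop (illustrative, not sklearn's code)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: use k distinct data points as starting centroids (assumption)
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Assignment: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update: each centroid becomes the mean of its assigned points;
        # an empty cluster keeps its previous centroid (assumption)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids have effectively stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    # Final assignment against the final centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids


# Two well-separated toy blobs of 50 points each
rng = np.random.default_rng(1)
blob_a = rng.normal([0, 0], 0.5, size=(50, 2))
blob_b = rng.normal([10, 10], 0.5, size=(50, 2))
labels, centroids = kmeans_sketch(np.vstack([blob_a, blob_b]), k=2)
print(centroids.round(1))
```

A single run like this can settle in a poor local optimum when the starting centroids are unlucky, which is why scikit-learn's `KMeans` reruns the algorithm from several starting points (`n_init`) and keeps the best result.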
K-Means in Python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np
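# Sample customer data: annual income and spending score.
# Each draw has shape (200, 5) with one column per segment mean; flattening
# row by row and keeping the first 200 values gives 40 points per segment.
np.random.seed(42)
X = np.column_stack([
    np.random.normal([50, 80, 30, 60, 90], 10, (200, 5)).flatten()[:200],
    np.random.normal([40, 85, 20, 70, 80], 15, (200, 5)).flatten()[:200]
])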
# Scale features (important for K-Means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit K-Means with K=5
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
# Visualize clusters and centroids in the original (unscaled) units
centers = scaler.inverse_transform(kmeans.cluster_centers_)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1],
            c='red', marker='x', s=200, linewidths=3, label='Centroids')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Customer Segments (K-Means)')
plt.legend()
plt.show()
Choosing the Right K
Elbow Method
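The elbow method tracks inertia, the within-cluster sum of squared distances from each point to its assigned centroid. As a rough illustration of what that number measures, the sketch below computes it by hand with NumPy. The helper name `inertia_sketch` and the toy data are assumptions made for the example, not part of scikit-learn.

```python
import numpy as np


def inertia_sketch(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    diffs = X - centroids[np.asarray(labels)]
    return float((diffs ** 2).sum())


# Toy data: two tight pairs of points
X_toy = np.array([[0, 0], [0, 2], [10, 0], [10, 2]])

# K=2: one centroid per pair, every point is 1 unit from its centroid
two_clusters = inertia_sketch(X_toy, [0, 0, 1, 1], [[0, 1], [10, 1]])
# K=1: a single centroid at the overall mean
one_cluster = inertia_sketch(X_toy, [0, 0, 0, 0], [[5, 1]])

print(two_clusters)  # 4.0
print(one_cluster)   # 104.0
```

Inertia generally keeps falling as K grows, so the plot is read for the bend where extra clusters stop paying off rather than for its minimum.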
inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow Method — Choose K at the "elbow"')
plt.show()
Silhouette Score Method
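For a single point, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The overall score is the average over all points, and values near 1 indicate tight, well-separated clusters. The sketch below spells that out in NumPy for intuition only. The helper name `silhouette_sketch` and the toy data are illustrative assumptions, and scikit-learn's `silhouette_score` remains the function to use in practice.

```python
import numpy as np


def silhouette_sketch(X, labels):
    """Mean silhouette value; expects a 2-D array and at least two clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Euclidean distance between every pair of points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        if not same.any():
            continue  # singleton cluster: value left at 0
        a = dist[i, same].mean()
        # Mean distance to each other cluster; keep the nearest one
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())


# Four 1-D points forming two tight, well-separated pairs
X_pairs = np.array([[0.0], [1.0], [10.0], [11.0]])
score_pairs = silhouette_sketch(X_pairs, [0, 0, 1, 1])
print(round(score_pairs, 4))  # about 0.90
```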
scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    scores.append(score)
    print(f'K={k}: Silhouette={score:.4f}')
best_k = range(2, 11)[scores.index(max(scores))]
print(f'\nBest K: {best_k} (score: {max(scores):.4f})')
Real-World Application: Customer Segmentation
import pandas as pd
# Toy RFM data (Recency, Frequency, Monetary), one row per customer
df = pd.DataFrame({
    'recency': [10, 200, 5, 150, 30, 180, 7, 120],
    'frequency': [50, 3, 80, 5, 40, 2, 90, 8],
    'monetary': [5000, 200, 8000, 300, 4000, 150, 9000, 500]
})
# Scale
X = StandardScaler().fit_transform(df)
# Cluster into 3 segments
km = KMeans(n_clusters=3, random_state=42, n_init=10)
df['segment'] = km.fit_predict(X)
# Profile each segment
print(df.groupby('segment').mean().round(2))
Limitations of K-Means
- Must specify K in advance
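- Assumes roughly spherical clusters of similar size and spread
- Sensitive to outliers, because centroids are means and a few extreme points can pull them away from the bulk of a cluster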
- Not suitable for non-convex shapes (use DBSCAN instead)
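The non-convex case can be checked on scikit-learn's two-moons toy dataset. The sketch below is an illustration under assumptions: the `eps` and `min_samples` values are guesses that suit this particular dataset, and the adjusted Rand index is brought in here only to compare each clustering against the known moon labels.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a standard non-convex toy dataset
X_moons, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)

kmeans_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_moons)
# eps and min_samples are illustrative guesses for this dataset, not general defaults
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

# Adjusted Rand index: 1.0 means the clustering matches the true moons exactly
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f'K-Means ARI: {ari_kmeans:.2f}')
print(f'DBSCAN ARI:  {ari_dbscan:.2f}')
```

K-Means tends to cut the moons with a straight boundary, so its score should come out well below DBSCAN's on this data.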
FAQ
When should I use K-Means vs DBSCAN?
Use K-Means when clusters are roughly spherical and you have an approximate idea of K. Use DBSCAN when clusters have irregular shapes, when you do not know K, or when you want outliers flagged automatically as noise.
Does K-Means require feature scaling?
Yes, in almost all cases. K-Means is distance-based, so without scaling, features with larger magnitudes dominate the clustering. Scaling can be skipped only when all features already share the same units and range.
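A small worked example makes the effect concrete. The three customers below are hypothetical, and the helper `pairwise_distances_sketch` is a name made up for this illustration; the standardization is written out by hand and matches what `StandardScaler` computes.

```python
import numpy as np

# Hypothetical customers: [annual income in dollars, age in years]
customers = np.array([
    [30_000.0, 25.0],   # customer 0
    [30_500.0, 60.0],   # customer 1: similar income, very different age
    [60_000.0, 26.0],   # customer 2: similar age, very different income
])


def pairwise_distances_sketch(X):
    """Euclidean distance between every pair of rows."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)


# Standardize each column to zero mean and unit variance
scaled = (customers - customers.mean(axis=0)) / customers.std(axis=0)

raw_dist = pairwise_distances_sketch(customers)
scaled_dist = pairwise_distances_sketch(scaled)

print(f'raw:    d(0,1)={raw_dist[0, 1]:.1f}  d(0,2)={raw_dist[0, 2]:.1f}')
print(f'scaled: d(0,1)={scaled_dist[0, 1]:.2f}  d(0,2)={scaled_dist[0, 2]:.2f}')
```

On the raw values the income gap makes customer 2 look roughly sixty times farther from customer 0 than customer 1 is, even though customer 1 differs just as sharply in age. After standardization the two distances are of similar size, so both features influence the clustering.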



