
K-means clustering: Algorithm guide and applications

K-means clustering stands as one of the most popular and intuitive unsupervised machine learning algorithms. Whether you’re analyzing customer behavior, compressing images, or discovering patterns in complex datasets, understanding what k-means is and how the algorithm works is essential for any data scientist or AI practitioner. This comprehensive guide explores k-means clustering from fundamental concepts to practical implementations, helping you master this powerful clustering algorithm.


1. Understanding k-means clustering fundamentals

What is clustering in machine learning?

Clustering is an unsupervised learning technique that groups similar data points together based on their characteristics. Unlike supervised learning where we have labeled data, cluster analysis discovers hidden patterns without predefined categories. The goal is to maximize similarity within clusters while maximizing differences between clusters.

K-means clustering, often simply called kmeans, is a centroid-based algorithm that partitions data into k distinct clusters. Each cluster is represented by its centroid—the mean position of all points in that cluster. The algorithm iteratively refines these centroids until it achieves optimal grouping.

The k meaning in k-means

The “k” in k-means represents the number of clusters you want to identify in your dataset. This is a hyperparameter that you must specify before running the algorithm. For instance, if you’re segmenting customers into three groups, you would set k=3. Choosing the right k value is crucial for meaningful results, and we’ll explore techniques for this later.
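
As a minimal sketch, the snippet below shows where k enters in practice with scikit-learn; the randomly generated customer_features array is only a stand-in for real data:

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for real customer data: 200 customers, 2 numeric features
rng = np.random.default_rng(42)
customer_features = rng.normal(size=(200, 2))

# Segmenting customers into three groups means k=3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
segments = kmeans.fit_predict(customer_features)
print(np.bincount(segments))  # number of customers assigned to each segment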

How k-means differs from other clustering algorithms

While hierarchical clustering builds a tree-like structure of nested clusters, k-means assigns each data point to exactly one cluster. Hierarchical clustering doesn’t require you to specify the number of clusters upfront, but the k-means clustering algorithm is generally faster and more scalable for large datasets. Other clustering algorithms, such as DBSCAN, focus on density, whereas k-means uses distance from centroids as its primary metric.

2. The k-means clustering algorithm explained


Step-by-step algorithm breakdown

The k means algorithm follows an elegant iterative process:

  1. Initialization: Randomly select k data points as initial centroids
  2. Assignment: Assign each data point to the nearest centroid based on distance
  3. Update: Recalculate centroids as the mean of all points in each cluster
  4. Repeat: Continue steps 2-3 until centroids no longer change significantly or maximum iterations are reached

This process guarantees convergence, though not necessarily to the global optimum. The algorithm minimizes the within-cluster sum of squares (WCSS), also known as inertia.

Mathematical foundation

The objective function that k-means clustering minimizes is:

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2$$

Where:

  • \(k\) is the number of clusters
  • \(C_i\) represents cluster \(i\)
  • \(x\) is a data point in cluster \(C_i\)
  • \(\mu_i\) is the centroid of cluster \(C_i\)
  • \(||x - \mu_i||^2\) is the squared Euclidean distance

The “means” in k-means refers to exactly this averaging: each centroid is the mean position of the points currently assigned to its cluster. The centroid update formula is:

$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
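
To make these formulas concrete, here is a small, hedged numpy sketch that recomputes both quantities from a fitted scikit-learn model: the per-cluster means should reproduce cluster_centers_, and the objective J should match the model’s inertia_:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Centroid update: mean of the points assigned to each cluster
manual_centroids = np.array([X[kmeans.labels_ == k].mean(axis=0) for k in range(4)])

# Objective J: sum of squared distances from each point to its cluster centroid
J = sum(np.sum((X[kmeans.labels_ == k] - c) ** 2)
        for k, c in enumerate(kmeans.cluster_centers_))

print(np.allclose(manual_centroids, kmeans.cluster_centers_))  # True once converged
print(f"J = {J:.2f}, inertia_ = {kmeans.inertia_:.2f}")        # should match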

Distance metrics in k-means

While Euclidean distance is the default metric, understanding distance measurement is critical. Euclidean distance between points \(p\) and \(q\) in n-dimensional space is:

$$d(p,q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

Alternative distance metrics such as Manhattan distance or cosine similarity can be used for specific applications, though they require modifications to the standard k-means algorithm.
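
As a quick illustration of how the choice of metric changes the numbers, here is a small sketch comparing Euclidean and Manhattan distance for a single pair of points (note that scikit-learn’s KMeans itself only supports Euclidean distance):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.0])

# Euclidean: straight-line distance, sqrt(9 + 4 + 0)
euclidean = np.sqrt(np.sum((p - q) ** 2))
# Manhattan: sum of absolute coordinate differences, 3 + 2 + 0
manhattan = np.sum(np.abs(p - q))

print(f"Euclidean: {euclidean:.2f}")  # ~3.61
print(f"Manhattan: {manhattan:.2f}")  # 5.00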

3. Implementing k-means clustering in Python

Basic implementation with scikit-learn

Let’s start with a practical implementation using Python’s scikit-learn library:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4, 
                       cluster_std=0.60, random_state=42)

# Apply k-means clustering
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X)

# Visualize results
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], 
           kmeans.cluster_centers_[:, 1], 
           c='red', s=200, alpha=0.75, marker='X')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

print(f"Inertia: {kmeans.inertia_:.2f}")
print(f"Cluster centers:\n{kmeans.cluster_centers_}")

Building k-means from scratch

Understanding the algorithm deeply requires implementing it yourself:

class KMeansFromScratch:
    def __init__(self, n_clusters=3, max_iters=100, random_state=42):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.random_state = random_state
        
    def fit(self, X):
        np.random.seed(self.random_state)
        # Random initialization
        random_indices = np.random.choice(X.shape[0], 
                                         self.n_clusters, 
                                         replace=False)
        self.centroids = X[random_indices]
        
        for iteration in range(self.max_iters):
            # Assignment step
            distances = self._calculate_distances(X)
            labels = np.argmin(distances, axis=1)
            
            # Update step
            new_centroids = np.array([X[labels == k].mean(axis=0) 
                                     for k in range(self.n_clusters)])
            
            # Check convergence
            if np.allclose(self.centroids, new_centroids):
                break
                
            self.centroids = new_centroids
            
        return labels
    
    def _calculate_distances(self, X):
        distances = np.zeros((X.shape[0], self.n_clusters))
        for k in range(self.n_clusters):
            distances[:, k] = np.linalg.norm(X - self.centroids[k], axis=1)
        return distances

# Test the implementation
kmeans_custom = KMeansFromScratch(n_clusters=4, random_state=42)
labels_custom = kmeans_custom.fit(X)

Handling real-world data

When working with actual datasets, preprocessing is crucial:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

# Load and prepare data
def prepare_data_for_clustering(df, features):
    # Handle missing values
    df_clean = df[features].dropna()
    
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df_clean)
    
    return X_scaled, scaler

# Example with dimensionality reduction
def cluster_with_pca(X, n_clusters=3, n_components=2):
    # Apply PCA for visualization
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    
    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(X_pca)
    
    print(f"Explained variance: {pca.explained_variance_ratio_}")
    
    return labels, X_pca, kmeans

4. Determining the optimal number of clusters

The elbow method

The elbow method plots inertia against different k values. The “elbow” point, where the rate of decrease sharply shifts, indicates the optimal k:

def elbow_method(X, max_k=10):
    inertias = []
    K_range = range(1, max_k + 1)
    
    for k in K_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X)
        inertias.append(kmeans.inertia_)
    
    # Plot the elbow curve
    plt.figure(figsize=(10, 6))
    plt.plot(K_range, inertias, 'bo-')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia (WCSS)')
    plt.title('Elbow Method for Optimal k')
    plt.grid(True)
    plt.show()
    
    return inertias

# Apply elbow method
elbow_method(X, max_k=10)

Silhouette analysis

The silhouette score measures how similar a point is to its own cluster compared to other clusters. Values range from -1 to 1, where higher is better:

from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.cm as cm

def silhouette_analysis(X, k_range):
    silhouette_scores = []
    
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42)
        labels = kmeans.fit_predict(X)
        score = silhouette_score(X, labels)
        silhouette_scores.append(score)
        print(f"k={k}, Silhouette Score: {score:.3f}")
    
    # Plot silhouette scores
    plt.figure(figsize=(10, 6))
    plt.plot(k_range, silhouette_scores, 'go-')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Analysis')
    plt.grid(True)
    plt.show()
    
    return silhouette_scores

# Perform silhouette analysis
silhouette_analysis(X, range(2, 11))

Gap statistic method

The gap statistic compares the total intracluster variation for different k values with their expected values under null reference distribution:

def gap_statistic(X, k_range, n_refs=10):
    gaps = []
    
    for k in k_range:
        # Cluster original data
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(X)
        original_inertia = kmeans.inertia_
        
        # Generate reference datasets and cluster them
        ref_inertias = []
        for _ in range(n_refs):
            random_data = np.random.uniform(X.min(axis=0), 
                                           X.max(axis=0), 
                                           size=X.shape)
            kmeans_ref = KMeans(n_clusters=k, random_state=42)
            kmeans_ref.fit(random_data)
            ref_inertias.append(np.log(kmeans_ref.inertia_))
        
        gap = np.mean(ref_inertias) - np.log(original_inertia)
        gaps.append(gap)
    
    return gaps

5. Advanced techniques and optimizations

K-means++ initialization

Standard k-means can converge to poor local optima depending on initial centroid placement. K-means++ improves initialization by selecting centroids that are far apart:

from sklearn.cluster import KMeans

# K-means++ is the default in scikit-learn
kmeans_plus = KMeans(n_clusters=4, init='k-means++', 
                     n_init=10, random_state=42)
labels = kmeans_plus.fit_predict(X)

The k-means++ algorithm selects initial centroids with probability proportional to their squared distance from the nearest existing centroid, improving convergence speed and quality.
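
For intuition, here is a minimal from-scratch sketch of that D² weighting rule; it is a simplification of (not a substitute for) scikit-learn’s internal k-means++ routine, and assumes X is a 2-D numpy array such as the one generated earlier:

import numpy as np

def kmeans_plus_plus_init(X, k, seed=42):
    """Pick k initial centroids, each chosen with probability proportional
    to its squared distance from the nearest centroid selected so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(X.shape[0])]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest existing centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(X.shape[0], p=probs)])
    return np.array(centroids)

initial_centroids = kmeans_plus_plus_init(X, k=4)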

Mini-batch k-means for large datasets

When dealing with massive datasets, mini-batch k-means offers significant speedup:

from sklearn.cluster import MiniBatchKMeans
import time

# Generate large dataset
X_large, _ = make_blobs(n_samples=100000, centers=5, 
                        cluster_std=1.0, random_state=42)

# Standard k-means
start_time = time.time()
kmeans_standard = KMeans(n_clusters=5, random_state=42)
kmeans_standard.fit(X_large)
standard_time = time.time() - start_time

# Mini-batch k-means
start_time = time.time()
kmeans_minibatch = MiniBatchKMeans(n_clusters=5, 
                                   batch_size=1000, 
                                   random_state=42)
kmeans_minibatch.fit(X_large)
minibatch_time = time.time() - start_time

print(f"Standard K-Means: {standard_time:.2f}s")
print(f"Mini-Batch K-Means: {minibatch_time:.2f}s")
print(f"Speedup: {standard_time/minibatch_time:.2fx")

Handling categorical and mixed data

K-means works best with numerical data, but we can adapt it for categorical features:

from sklearn.preprocessing import OneHotEncoder

def kmeans_mixed_data(df, numerical_features, categorical_features, n_clusters):
    # One-hot encode categorical features
    # scikit-learn >= 1.2 uses sparse_output; older versions take sparse=False
    encoder = OneHotEncoder(sparse_output=False, drop='first')
    cat_encoded = encoder.fit_transform(df[categorical_features])
    
    # Standardize numerical features
    scaler = StandardScaler()
    num_scaled = scaler.fit_transform(df[numerical_features])
    
    # Combine features
    X_combined = np.hstack([num_scaled, cat_encoded])
    
    # Apply k-means
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(X_combined)
    
    return labels, kmeans

6. Real-world applications of k-means clustering

Customer segmentation in marketing

One of the most popular applications is segmenting customers based on purchasing behavior:

# Example: Customer segmentation
def customer_segmentation_analysis(customer_data):
    # Select relevant features
    features = ['annual_income', 'spending_score', 
                'age', 'purchase_frequency']
    
    # Prepare data
    X_scaled, scaler = prepare_data_for_clustering(customer_data, features)
    
    # Determine optimal clusters
    silhouette_scores = silhouette_analysis(X_scaled, range(2, 8))
    
    # Apply k-means with the k chosen from the silhouette analysis above (4 here as an example)
    optimal_k = 4
    kmeans = KMeans(n_clusters=optimal_k, random_state=42)
    customer_data['segment'] = kmeans.fit_predict(X_scaled)
    
    # Analyze segments
    segment_profiles = customer_data.groupby('segment')[features].mean()
    print("\nSegment Profiles:")
    print(segment_profiles)
    
    return customer_data, kmeans

Image compression and color quantization

K-means clustering can compress images by reducing the number of colors:

from sklearn.cluster import KMeans
from PIL import Image

def compress_image_kmeans(image_path, n_colors=16):
    # Load image
    img = Image.open(image_path)
    img_array = np.array(img)
    
    # Reshape to 2D array of pixels
    h, w, d = img_array.shape
    pixels = img_array.reshape(-1, d)
    
    # Apply k-means
    kmeans = KMeans(n_clusters=n_colors, random_state=42)
    labels = kmeans.fit_predict(pixels)
    
    # Replace each pixel with its cluster center
    compressed_pixels = kmeans.cluster_centers_[labels]
    compressed_img = compressed_pixels.reshape(h, w, d).astype(np.uint8)
    
    # Calculate compression ratio
    original_colors = len(np.unique(pixels, axis=0))
    print(f"Original colors: {original_colors}")
    print(f"Compressed to: {n_colors} colors")
    
    return compressed_img

Anomaly detection

Points far from any cluster center can be identified as anomalies:

def detect_anomalies_kmeans(X, contamination=0.1):
    # Apply k-means
    kmeans = KMeans(n_clusters=3, random_state=42)
    kmeans.fit(X)
    
    # Calculate distances to nearest centroid
    distances = np.min(kmeans.transform(X), axis=1)
    
    # Set threshold based on contamination
    threshold = np.percentile(distances, 100 * (1 - contamination))
    
    # Identify anomalies
    anomalies = distances > threshold
    
    print(f"Detected {np.sum(anomalies)} anomalies")
    
    return anomalies, distances

Document clustering and text analysis

K-means clustering helps organize large document collections:

from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_documents(documents, n_clusters=5):
    # Convert documents to TF-IDF vectors
    vectorizer = TfidfVectorizer(max_features=1000, 
                                 stop_words='english')
    X_tfidf = vectorizer.fit_transform(documents)
    
    # Apply k-means
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(X_tfidf)
    
    # Get top terms per cluster
    feature_names = vectorizer.get_feature_names_out()
    for i in range(n_clusters):
        center = kmeans.cluster_centers_[i]
        top_indices = center.argsort()[-10:][::-1]
        top_terms = [feature_names[idx] for idx in top_indices]
        print(f"\nCluster {i}: {', '.join(top_terms)}")
    
    return labels, kmeans

7. Limitations and best practices

Understanding k-means limitations

While powerful, kmeans clustering has important constraints:

  • Assumes spherical clusters: K-means struggles with elongated or irregularly shaped clusters
  • Sensitive to outliers: Extreme values can significantly skew centroids
  • Requires k specification: You must choose the number of clusters beforehand
  • Scale dependent: Features with larger ranges dominate distance calculations
  • Local optima: Different initializations can yield different results

When to use alternatives

Consider hierarchical clustering when you need a dendrogram showing relationships between clusters, or when you’re unsure about k. For density-based clusters of arbitrary shape, DBSCAN or HDBSCAN may be more appropriate. For very high-dimensional data, spectral clustering can be more effective.
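
As a small, hedged illustration, the sketch below contrasts k-means and DBSCAN on the classic two-moons dataset, whose clusters are non-spherical; the DBSCAN parameters are reasonable for this toy data but would need tuning on real datasets:

from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: non-spherical clusters that violate k-means' assumptions
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=42)

labels_km = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)
labels_db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

# Agreement with the true moon labels (1.0 means perfect recovery)
print(f"K-means ARI: {adjusted_rand_score(y_moons, labels_km):.2f}")
print(f"DBSCAN ARI:  {adjusted_rand_score(y_moons, labels_db):.2f}")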

Best practices for k-means clustering

  1. Always standardize features: Use StandardScaler to ensure all features contribute equally
  2. Try multiple initializations: Use n_init parameter to run k-means multiple times
  3. Validate results: Use multiple metrics (silhouette, elbow, gap statistic) to choose k
  4. Handle outliers: Consider removing or treating outliers before clustering
  5. Visualize results: Use dimensionality reduction (PCA, t-SNE) to visualize clusters
  6. Domain knowledge matters: Interpret clusters in the context of your problem
  7. Test sensitivity: Check how results change with different k values

The pipeline below ties these practices together:

# Complete pipeline with best practices
def robust_kmeans_pipeline(X, k_range=range(2, 11)):
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Remove outliers using IQR method
    Q1 = np.percentile(X_scaled, 25, axis=0)
    Q3 = np.percentile(X_scaled, 75, axis=0)
    IQR = Q3 - Q1
    outlier_mask = ~((X_scaled < (Q1 - 1.5 * IQR)) | 
                     (X_scaled > (Q3 + 1.5 * IQR))).any(axis=1)
    X_clean = X_scaled[outlier_mask]
    
    # Find optimal k
    silhouette_scores = []
    inertias = []
    
    for k in k_range:
        kmeans = KMeans(n_clusters=k, n_init=20, random_state=42)
        labels = kmeans.fit_predict(X_clean)
        silhouette_scores.append(silhouette_score(X_clean, labels))
        inertias.append(kmeans.inertia_)
    
    # Select optimal k based on silhouette score
    optimal_k = k_range[np.argmax(silhouette_scores)]
    print(f"Optimal k: {optimal_k}")
    
    # Final clustering
    final_kmeans = KMeans(n_clusters=optimal_k, n_init=20, random_state=42)
    final_labels = final_kmeans.fit_predict(X_clean)
    
    return final_labels, final_kmeans, scaler

8. Conclusion

K-means clustering remains one of the most valuable tools in machine learning for discovering patterns in unlabeled data. From customer segmentation to image compression, the k-means algorithm provides an efficient and interpretable approach to cluster analysis. By understanding the mathematical foundations, implementation details, and practical considerations covered in this guide, you’re now equipped to apply kmeans clustering effectively to your own projects.

Remember that successful clustering requires both technical knowledge and domain expertise. Always validate your results, try multiple approaches, and interpret clusters in the context of your specific problem. Whether you’re using the k-means clustering algorithm for business analytics, computer vision, or data exploration, following best practices and understanding its limitations will help you achieve meaningful insights from your data.
