14  K-Means Clustering

14.1 Principles of K-Means Clustering

K-means clustering is a popular unsupervised learning algorithm used to group similar data points into a fixed number of clusters. It is a centroid-based algorithm: each data point is assigned to the cluster whose central point, the centroid, is closest. In R, K-means clustering can be performed with the kmeans() function from the stats package:

# Load iris dataset
data(iris)

# Extract relevant columns for clustering
iris_data <- iris[, 1:4]

stats::kmeans(iris_data, centers = 3, nstart = 10)
#> K-means clustering with 3 clusters of sizes 50, 62, 38
#> 
#> Cluster means:
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.006000    3.428000     1.462000    0.246000
#> 2     5.901613    2.748387     4.393548    1.433871
#> 3     6.850000    3.073684     5.742105    2.071053
#> 
#> Clustering vector:
#>   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [37] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#>  [73] 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3
#> [109] 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3
#> [145] 3 3 2 3 3 2
#> 
#> Within cluster sum of squares by cluster:
#> [1] 15.15100 39.82097 23.87947
#>  (between_SS / total_SS =  88.4 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"    
#> [5] "tot.withinss" "betweenss"    "size"         "iter"        
#> [9] "ifault"

Here, the main arguments are as follows:

  • x: a numeric data matrix containing the observations
  • centers: the predefined number of clusters, \(k\)
  • nstart: the number of times the K-means algorithm is repeated to enhance the resulting model, as it has a random component
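The returned object is a list whose components are printed under “Available components” above and can be accessed with $. A minimal sketch, assuming the iris_data matrix from above (the seed value is arbitrary and used only for reproducibility):

# Store the fitted model to inspect its components
set.seed(42)
km <- kmeans(iris_data, centers = 3, nstart = 10)

head(km$cluster)  # cluster assignment of the first observations
km$centers        # coordinates of the three centroids
km$size           # number of observations in each cluster
km$tot.withinss   # total within-cluster sum of squares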

To learn about k-means, let’s apply it to the iris dataset and visualise the resulting clusters using the petal length and petal width variables only (to facilitate visualisation).

library(ggplot2)
library(plotly)
#> 
#> Attaching package: 'plotly'
#> The following object is masked from 'package:ggplot2':
#> 
#>     last_plot
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following object is masked from 'package:graphics':
#> 
#>     layout

# Perform K-Means clustering with 3 clusters
set.seed(123)
kmeans_result <- kmeans(iris_data, centers = 3)

# Visualize the clusters
p <- ggplot(iris, aes(Petal.Length, Petal.Width, color = factor(kmeans_result$cluster))) + 
  geom_point() + 
  labs(title = "K-Means Clustering of Iris Dataset",
       x = "Petal Length",
       y = "Petal Width",
       color = "Cluster") + 
  theme_minimal()
ggplotly(p)

14.2 Model selection

Since K-means clustering uses a random initialisation, repeated runs can produce different clustering outcomes. To improve the results, the algorithm is run multiple times (controlled by the nstart argument), and the best outcome is chosen based on the smallest total within-cluster sum of squares (SS):

\[SS = \sum_{i=1}^{k} \sum_{x \in C_i} \left\| x - \mu_i \right\|^2\]

where \(k\) is the number of clusters, \(C_i\) is the \(i\)th cluster, and \(\mu_i\) is the centroid of the \(i\)th cluster. The formula sums the squared distances between each data point in a cluster and its respective centroid.
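To make the formula concrete, the total within-cluster SS can be recomputed by hand from a fitted model and compared with the tot.withinss component returned by kmeans(). A minimal sketch, assuming the iris_data matrix from above:

# Recompute the total within-cluster SS from the formula
set.seed(123)
km <- kmeans(iris_data, centers = 3, nstart = 10)

ss_manual <- sum(sapply(1:3, function(i) {
  pts <- as.matrix(iris_data[km$cluster == i, ])  # observations in cluster i
  sum(sweep(pts, 2, km$centers[i, ])^2)           # squared distances to centroid i
}))

ss_manual         # same value as...
km$tot.withinss   # ...the component computed by kmeans()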

14.3 How to determine the number of clusters

Determining the appropriate number of clusters is an important step in the clustering process. One common approach is the elbow method, which involves running K-means clustering over a range of values of \(k\) and recording the total within-cluster sum of squares (WSS) for each value of \(k\).

To implement the elbow method, we perform the following steps:

  • Run k-means with \(k=1, k=2, \cdots, k=n\)
  • Record total within SS for each value of \(k\).
  • Choose \(k\) at the elbow position, as illustrated below.

library(ggplot2)
library(plotly)

# Run K-means for k = 1 to 5 and record the total within-cluster SS
set.seed(123)
ks <- 1:5
tot_within_ss <- sapply(ks, function(k) {
    cl <- kmeans(iris_data, k, nstart = 10)
    cl$tot.withinss
})
km_df <- data.frame(ks, tot_within_ss)
p <- ggplot(km_df) +
  aes(x = ks, y = tot_within_ss) +
  geom_line(colour = "red") +
  geom_point(size = 2) + 
  labs(title = "Elbow Method for K-Means",
       x = "Number of clusters",
       y = "Total within-cluster sum of squares") +
  geom_vline(xintercept = 2, colour = 'blue', linetype = 'dotted') +
  theme_minimal()

ggplotly(p)

One of the advantages of K-means clustering is its simplicity and ease of implementation. However, it also has some limitations, such as the sensitivity to the initial random placement of centroids, the need to specify the number of clusters in advance, and the inability to handle non-spherical clusters or clusters of varying sizes.
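The sensitivity to initialisation is easy to observe: single-start runs (nstart = 1) started from different random seeds can converge to different local optima with different total within-cluster SS. A small sketch, assuming iris_data as before (the seed values are arbitrary):

# Total within-cluster SS of five single-start runs;
# differing values indicate different local optima
sapply(1:5, function(seed) {
  set.seed(seed)
  kmeans(iris_data, centers = 3, nstart = 1)$tot.withinss
})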

Despite these limitations, K-means clustering is widely used in various fields, such as marketing, image processing, and bioinformatics, among others. It is often used as a first step in more complex data analysis pipelines or as a baseline for comparison with other clustering algorithms.