Kernel K-means

Smart Initialization - K-means++

The concept of K-means++ involves selecting centroids that are maximally distant from each other.

Step 1: Randomly select \(\boldsymbol{\mu} _1 ^0\) from the dataset.
Step 2: For \(l \in \{2, 3, \ldots, k\}\), choose \(\boldsymbol{\mu} _l ^0\) probabilistically proportional to the score(\(S\)), where \(S\) is defined as follows: \[ S(\mathbf{x}_i) = \min _{\{j=1, 2, \ldots, l-1\}} {|| \mathbf{x}_i - \boldsymbol{\mu} _{j} ^0 ||}^2 \quad \forall \mathbf{x}_i \in \mathbf{X} \] The probabilistic aspect of the algorithm provides an expected guarantee of optimal convergence in K-means. The guarantee is given by: \[ \mathbb{E} \left[ \sum _{i=1} ^{n} {|| \mathbf{x}_i - \boldsymbol{\mu} _{z_i} ||}^2 \right ] \le O(\log k) \left [ \min _{\{z_1, z_2, \ldots, z_n\}} \sum _{i=1} ^{n} {|| \mathbf{x}_i - \boldsymbol{\mu} _{z_i} ||}^2 \right ] \] where \(O(\log k)\) is a constant of order \(\log k\).
Step 3: Once the centroids are determined, we proceed with our usual Lloyd’s Algorithm.

Choice of K

A prerequisite for K-means is determining the number of clusters, denoted as \(k\). However, what if the value of \(k\) is unknown?

If we were to choose \(k\) to be equal to \(n\): \[ F(z_1, z_2, \dots, z_n) = \sum _{i=1} ^{n} {|| \mathbf{x}_i - \boldsymbol{\mu} _{z_i} ||} ^2 = 0 \] However, as having as many clusters as datapoints is undesirable, we aim to minimize \(k\) while penalizing large values of \(k\). \[ \underset{k}{\arg \min} \left [ \sum _{i=1} ^{n} {|| \mathbf{x}_i - \boldsymbol{\mu} _{z_i} ||} ^2 + \text{Penalty}(k) \right ] \] Two common criteria for making the above argument are:

\(\text{\textcolor{blue}{Akaike Information Criterion}}\): \(\left [ 2K - 2\ln(\hat{\mathcal{L}}(\theta ^*)) \right ]\)
\(\text{\textcolor{blue}{Bayesian Information Criterion}}\): \(\left [ K\ln(n) - 2\ln(\hat{\mathcal{L}}(\theta ^*)) \right ]\)

However, detailed elaboration of these criteria is not discussed here.