Expectation-maximization (EM) algorithm

Convexity and Jensen’s Inequality

Convexity is a property of a set or a function. A set is convex if the line segment joining any two of its points lies entirely within the set; a function is convex if the chord joining any two points on its graph lies on or above the graph, and concave if the chord lies on or below it. For a concave function \(f\), this property generalizes to \[ f \left (\sum _{k=1} ^K \lambda_k a_k \right ) \ge \sum _{k=1} ^K \lambda_k f(a_k) \] where \[ \sum _{k=1} ^K \lambda _k = 1, \quad \lambda_k \ge 0 \] \[ a_k \text{ are points in the domain of } f \] This is known as Jensen’s Inequality.
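As a quick numerical check of Jensen’s Inequality for the concave function \(f = \log\), the short sketch below evaluates both sides of the inequality; the particular points \(a_k\) and weights \(\lambda_k\) are arbitrary illustrative choices.

```python
import numpy as np

# Jensen's Inequality for the concave function f = log:
# log(sum_k lambda_k * a_k) >= sum_k lambda_k * log(a_k)

a = np.array([0.5, 2.0, 4.0])    # arbitrary positive points a_k
lam = np.array([0.2, 0.3, 0.5])  # weights lambda_k, non-negative and summing to 1

lhs = np.log(np.sum(lam * a))    # f applied to the convex combination
rhs = np.sum(lam * np.log(a))    # convex combination of f values

print(lhs, rhs, lhs >= rhs)      # lhs >= rhs prints True
```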

Estimating the Parameters

Since log is a concave function, we can use Jensen’s Inequality to lower-bound the log-likelihood function for GMMs. The log-likelihood is \[ \log\mathcal{L}(\boldsymbol{\theta}) = \sum _{i=1} ^n \log \left [ \sum _{k=1} ^K \pi_k * \frac{1}{\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ] \] For each data point \(\mathbf{x}_i\), we introduce parameters \(\{\lambda_1^i, \lambda_2^i, \ldots, \lambda_K^i\}\) such that \(\displaystyle \sum _{k=1} ^K \lambda _k ^i = 1\) and \(0 \le \lambda _k ^i \le 1\) for all \(i, k\). Multiplying and dividing each term of the inner sum by \(\lambda_k^i\), we obtain: \[ \log\mathcal{L}(\boldsymbol{\theta}) = \sum _{i=1} ^n \log \left [ \sum _{k=1} ^K \lambda _k ^i \left ( \pi_k * \frac{1}{\lambda _k ^i\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ) \right ] \] The inner sum is now a convex combination of the bracketed terms with weights \(\lambda_k^i\), so applying Jensen’s Inequality to the concave log gives: \[ \log\mathcal{L}(\boldsymbol{\theta}) \ge \text{modified\_log}\mathcal{L}(\boldsymbol{\theta}) \] \[\begin{align} \therefore \text{modified\_log}\mathcal{L}(\boldsymbol{\theta}) &= \sum _{i=1} ^n \sum _{k=1} ^K \lambda _k ^i \log \left ( \pi_k * \frac{1}{\lambda _k ^i\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ) \end{align}\] Note that the modified log-likelihood function gives a lower bound on the true log-likelihood function at \(\boldsymbol{\theta}\). Finally, to estimate the parameters, we alternate between the following two maximizations:

  • To get \(\boldsymbol{\theta}\): Fix \(\lambda\) and maximize over \(\boldsymbol{\theta}\). \[ \underset{\boldsymbol{\theta}} {\max} \sum _{i=1} ^n \sum _{k=1} ^K \lambda _k ^i \log \left ( \pi_k * \frac{1}{\lambda _k ^i\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ) \] \[ \text{Setting the derivatives w.r.t. }\boldsymbol{\mu}_k,\boldsymbol{\sigma}^2_k,\text{ and }\pi_k\text{ to zero (with the constraint } \textstyle\sum_{k=1}^K \pi_k = 1 \text{) gives the following} \] \[\begin{align*} \hat{\boldsymbol{\mu}}_k^{\text{MML}} &= \frac{\displaystyle \sum _{i=1} ^n \lambda_k^i \mathbf{x}_i}{\displaystyle \sum _{i=1} ^n \lambda_k^i} \\ \hat{\boldsymbol{\sigma}}_k^{2^{\text{MML}}} &= \frac{\displaystyle \sum _{i=1} ^n \lambda_k^i (\mathbf{x}_i-\hat{\boldsymbol{\mu}}_k^{\text{MML}})^2}{\displaystyle \sum _{i=1} ^n \lambda_k^i} \\ \hat{\pi}_k^{\text{MML}} &= \frac{\displaystyle \sum _{i=1} ^n \lambda_k^i}{n} \\ \end{align*}\]
  • To get \(\lambda\): Fix \(\boldsymbol{\theta}\) and maximize over \(\lambda\). For any \(i\): \[ \underset{\lambda_1^i, \lambda_2^i, \ldots, \lambda_K^i} {\max} \sum _{k=1} ^K \left [ \lambda _k ^i \log \left ( \pi_k * \frac{1}{\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ) - \lambda _k ^i \log( \lambda _k ^i) \right ] \hspace{1em} s.t. \hspace{1em} \sum _{k=1} ^K \lambda _k ^i = 1; 0 \le \lambda _k ^i \le 1 \] Solving this constrained optimization problem analytically, we get: \[ \hat{\lambda}_k^{i^{\text{MML}}} = \frac{\left ( \frac{1}{\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ) * \pi_k}{\displaystyle \sum _{j=1} ^K \left ( \frac{1}{\sqrt{2\pi}\boldsymbol{\sigma}_j} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_j)^2}{2\boldsymbol{\sigma}^2_j}} \right ) * \pi_j} \] A code sketch of these two updates follows this list.
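The two closed-form updates above map directly to code. The sketch below is a minimal NumPy version for the one-dimensional mixture used in this section; the names update_lambda, update_theta, x, lam, mu, sigma2, and pi are illustrative choices, not notation from the derivation.

```python
import numpy as np

def update_lambda(x, mu, sigma2, pi):
    """Fix theta, maximize over lambda: the responsibility of component k
    for each point x_i (the closed-form lambda update above)."""
    # densities[i, k] = N(x_i | mu_k, sigma2_k)
    densities = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2[None, :])) \
                / np.sqrt(2 * np.pi * sigma2[None, :])
    weighted = densities * pi[None, :]
    return weighted / weighted.sum(axis=1, keepdims=True)  # each row sums to 1

def update_theta(x, lam):
    """Fix lambda, maximize over theta: the weighted updates for
    mu_k, sigma2_k, and pi_k above."""
    nk = lam.sum(axis=0)                                    # effective count per component
    mu = (lam * x[:, None]).sum(axis=0) / nk
    sigma2 = (lam * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk
    pi = nk / x.shape[0]
    return mu, sigma2, pi
```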

EM Algorithm

The EM (Expectation-Maximization) algorithm is a popular method for estimating the parameters of statistical models with incomplete (latent) data. It iteratively alternates between an expectation step and a maximization step until it converges to a stable solution.

The algorithm is as follows:

  • Initialize \(\boldsymbol{\theta}^0 = \left \{ \begin{array}{cccc} \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_K \\ \boldsymbol{\sigma}^2_1, \boldsymbol{\sigma}^2_2, \ldots, \boldsymbol{\sigma}^2_K\\ \pi_1, \pi_2, \ldots, \pi_K \end{array} \right \}\) using Lloyd’s algorithm.
  • Until convergence (\(||\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t} || \le \epsilon\), where \(\epsilon\) is the tolerance parameter), do the following (a code sketch of this loop follows this list): \[\begin{align*} \lambda^{t+1} &= \underset{\lambda}{\arg \max} \text{ modified\_log}(\boldsymbol{\theta}^t, \textcolor{red}{\lambda}) &\rightarrow \text{ Expectation Step}\\ \boldsymbol{\theta}^{t+1} &= \underset{\boldsymbol{\theta}}{\arg \max} \text{ modified\_log}(\textcolor{red}{\boldsymbol{\theta}}, \lambda^{t+1}) &\rightarrow \text{ Maximization Step}\\ \end{align*}\]
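Putting the two steps together, a minimal EM loop might look like the sketch below. It reuses the hypothetical update_lambda and update_theta helpers from the earlier sketch, uses a simple random initialization in place of Lloyd’s algorithm for brevity, and adds a max_iter safety cap; tol plays the role of \(\epsilon\).

```python
import numpy as np

def em_gmm(x, K, tol=1e-6, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    # Simple random initialization (the notes initialize with Lloyd's algorithm instead).
    mu = rng.choice(x, size=K, replace=False)
    sigma2 = np.full(K, x.var())
    pi = np.full(K, 1.0 / K)

    for _ in range(max_iter):
        lam = update_lambda(x, mu, sigma2, pi)             # Expectation step
        new_mu, new_sigma2, new_pi = update_theta(x, lam)  # Maximization step

        # Convergence check: ||theta^{t+1} - theta^t|| <= tol
        shift = np.linalg.norm(np.concatenate([new_mu - mu,
                                               new_sigma2 - sigma2,
                                               new_pi - pi]))
        mu, sigma2, pi = new_mu, new_sigma2, new_pi
        if shift <= tol:
            break
    return mu, sigma2, pi, lam
```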

The EM algorithm produces a soft clustering. To obtain a hard clustering from EM, one further step is needed:

  • For a point \(\mathbf{x}_i\), assign it to a cluster using the following equation: \[ z_i = \underset{k}{\arg\max} \lambda_k^i \]
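As a usage illustration, the snippet below fits the sketched em_gmm to toy one-dimensional data and then hard-assigns each point via the arg max above; the data, the choice K = 2, and the em_gmm helper are all assumptions carried over from the earlier sketches.

```python
import numpy as np

# Toy 1-D data from two well-separated Gaussians (illustrative only).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])

mu, sigma2, pi, lam = em_gmm(x, K=2)  # soft clustering: lam[i, k] = lambda_k^i
z = lam.argmax(axis=1)                # hard clustering: z_i = argmax_k lambda_k^i
```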