Unsupervised Learning - Estimation - Recap of MLE + Bayesian estimation, Gaussian Mixture Model - EM algorithm


Introduction to Estimation in Machine Learning

Estimation in machine learning involves inferring unknown parameters or predicting outcomes from observed data. Estimators, often algorithms or models, are used for these tasks and to characterize the data’s underlying distribution.

Let \(\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}\) represent a dataset, where each data point \(\mathbf{x}_i\) is in the \(d\)-dimensional binary space \(\{0,1\}^d\). It is assumed that the data points are independent and identically distributed (i.i.d).

Independence means \(P(\mathbf{x}_i|\mathbf{x}_j) = P(\mathbf{x}_i)\) for \(i \neq j\); identically distributed means every \(\mathbf{x}_i\) follows the same distribution, i.e. \(P(\mathbf{x}_i)=P(\mathbf{x}_j)\) for all \(i, j\).

Maximum Likelihood Estimation

Fisher’s Principle of Maximum Likelihood

Fisher’s principle of maximum likelihood is a statistical method used to estimate parameters of a statistical model by selecting values that maximize the likelihood function. This function quantifies how well the model fits the observed data.

Likelihood Estimation for Bernoulli Distributions

Applying the likelihood function to the dataset above, we obtain: \[\begin{align*} \mathcal{L}(p;\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}) &= P(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n;p)\\ &= p(\mathbf{x}_1;p)p(\mathbf{x}_2;p)\ldots p(\mathbf{x}_n;p) \\ &=\prod _{i=1} ^n {p^{\mathbf{x}_i}(1-p)^{1-\mathbf{x}_i}} \end{align*}\] \[\begin{align*} \therefore \hat{p}_{\text{ML}} &=\underset{p} {\arg \max}\log \left ( \prod _{i=1} ^n {p^{\mathbf{x}_i}(1-p)^{1-\mathbf{x}_i}} \right ) \\ \text{Differentiating the log-likelihood wrt $p$ and setting it to zero, we get}\\ \hat{p}_{\text{ML}} &= \frac{1}{n}\sum _{i=1} ^n \mathbf{x}_i \end{align*}\]
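
As a sanity check, the closed-form estimate is just the sample mean of the binary observations. Below is a minimal numpy sketch; the data and function name are illustrative, not part of the original notes.

```python
import numpy as np

def bernoulli_mle(x):
    """ML estimate of p for i.i.d. Bernoulli data: the sample mean."""
    x = np.asarray(x)
    return x.mean()

# Example: 10 observations with 7 ones -> p_hat = 0.7
x = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
print(bernoulli_mle(x))  # 0.7
```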

Likelihood Estimation for Gaussian Distributions

Let \(\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}\) be a dataset where \(\mathbf{x}_i \sim \mathcal{N}(\boldsymbol{\mu},\boldsymbol{\sigma}^2)\). We assume that the data points are independent and identically distributed.

\[\begin{align*} \mathcal{L}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2;\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}) &= f_{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n;\boldsymbol{\mu}, \boldsymbol{\sigma}^2) \\ &=\prod _{i=1} ^n f_{\mathbf{x}_i}(\mathbf{x}_i;\boldsymbol{\mu}, \boldsymbol{\sigma}^2) \\ &=\prod _{i=1} ^n \left [ \frac{1}{\sqrt{2\pi}\boldsymbol{\sigma}} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu})^2}{2\boldsymbol{\sigma}^2}} \right ] \\ \therefore \log(\mathcal{L}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2;\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\})) &= \sum _{i=1} ^n \left[ \log \left (\frac{1}{\sqrt{2\pi}\boldsymbol{\sigma}} \right ) - \frac{(\mathbf{x}_i-\boldsymbol{\mu})^2}{2\boldsymbol{\sigma}^2} \right] \\ \end{align*}\] \[ \text{By differentiating with respect to $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ and setting the derivatives to zero, we get} \] \[\begin{align*} \hat{\boldsymbol{\mu}}_{\text{ML}} &= \frac{1}{n}\sum _{i=1} ^n \mathbf{x}_i \\ \hat{\boldsymbol{\sigma}}^2_{\text{ML}} &= \frac{1}{n}\sum _{i=1} ^n (\mathbf{x}_i-\hat{\boldsymbol{\mu}}_{\text{ML}})^T(\mathbf{x}_i-\hat{\boldsymbol{\mu}}_{\text{ML}}) \end{align*}\]
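
The Gaussian closed-form estimates can be checked in the same way. A minimal numpy sketch, assuming scalar (1-D) data and illustrative names:

```python
import numpy as np

def gaussian_mle(x):
    """ML estimates for an i.i.d. Gaussian sample:
    sample mean and (biased, 1/n) sample variance."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    sigma2_hat = ((x - mu_hat) ** 2).mean()
    return mu_hat, sigma2_hat

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)
print(gaussian_mle(x))  # roughly (2.0, 2.25)
```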

Bayesian Estimation

Bayesian estimation is a statistical method that updates parameter estimates by incorporating prior knowledge or beliefs along with observed data to calculate the posterior probability distribution of the parameters.

Let \(\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}\) be a dataset where \(\mathbf{x}_i\) follows a distribution with parameters \(\boldsymbol{\theta}\). We assume that the data points are independent and identically distributed, and we also consider \(\boldsymbol{\theta}\) as a random variable with its own probability distribution.

Our objective is to update the parameters using the available data.

i.e. \[ P(\boldsymbol{\theta})\Rightarrow P(\boldsymbol{\theta}|\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}) \] where, employing Bayes’ Law, we find \[ P(\boldsymbol{\theta}|\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\})=\left ( \frac{P(\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}|\boldsymbol{\theta})}{P(\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\})} \right )*P(\boldsymbol{\theta}) \]

Bayesian Estimation for a Bernoulli Distribution

Let \(\{x_1, x_2, \ldots, x_n\}\) be a dataset where each \(x_i \in \{0,1\}\) is drawn from a Bernoulli distribution with parameter \(\theta\). What distribution would be suitable as the prior \(P(\theta)\)?

A commonly used distribution for priors is the Beta Distribution. \[ f(p;\alpha,\beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{z} \hspace{2em} \forall p \in [0,1] \\ \] \[ \text{where $z$ is a normalizing factor} \]

Hence, utilizing the Beta distribution as the prior, we obtain, \[\begin{align*} P(\theta|\{x_1, x_2, \ldots, x_n\}) &\propto P(\{x_1, x_2, \ldots, x_n\}|\theta)*P(\theta) \\ f_{\theta|\{x_1, x_2, \ldots, x_n\}}(p) &\propto \left [ \prod _{i=1} ^n {p^{x_i}(1-p)^{1-x_i}} \right ]*\left [ p^{\alpha-1}(1-p)^{\beta-1} \right ] \\ f_{\theta|\{x_1, x_2, \ldots, x_n\}}(p) &\propto p^{\sum _{i=1} ^n x_i + \alpha - 1}(1-p)^{\sum _{i=1} ^n(1-x_i) + \beta - 1} \end{align*}\] i.e., writing \(n_h = \sum_{i=1}^n x_i\) (the number of ones) and \(n_t = \sum_{i=1}^n (1-x_i)\) (the number of zeros), we obtain, \[ \text{BETA PRIOR }(\alpha, \beta) \xrightarrow[\text{Bernoulli}]{\{x_1, x_2, \ldots, x_n\}} \text{BETA POSTERIOR }(\alpha + n_h, \beta + n_t) \] \[ \therefore \hat{p} = \mathbb{E}[\text{Posterior}]=\mathbb{E}[\text{Beta}(\alpha +n_h, \beta + n_t)]= \frac{\alpha + n_h}{\alpha + n_h + \beta + n_t} \]
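
A minimal sketch of this conjugate Beta–Bernoulli update, with illustrative names and data:

```python
import numpy as np

def beta_bernoulli_update(x, alpha, beta):
    """Update a Beta(alpha, beta) prior with i.i.d. Bernoulli data x."""
    x = np.asarray(x)
    n_h = int(x.sum())           # number of ones ("heads")
    n_t = int(len(x) - n_h)      # number of zeros ("tails")
    alpha_post, beta_post = alpha + n_h, beta + n_t
    posterior_mean = alpha_post / (alpha_post + beta_post)
    return alpha_post, beta_post, posterior_mean

# Example: Beta(2, 2) prior, 7 ones out of 10 observations
x = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
print(beta_bernoulli_update(x, alpha=2, beta=2))  # (9, 5, 0.642...)
```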

Gaussian Mixture Models

Gaussian Mixture Models are a type of probabilistic model used to represent complex data distributions by combining multiple Gaussian distributions.

The procedure is as follows:

  • Step 1: For each data point \(i\), generate a mixture component \(z_i \in \{1, 2, \ldots, K\}\) with \[ P(z_i=k) = \pi_k \hspace{2em} \left [ \sum _{k=1} ^K \pi_k = 1, \hspace{1em} 0 \le \pi_k \le 1 \hspace{1em} \forall k \right ] \]
  • Step 2: Generate \(\mathbf{x}_i \sim \mathcal{N}(\boldsymbol{\mu}_{z_i}, \boldsymbol{\sigma}^2_{z_i})\)

Hence, there are \(3K\) parameters. However, since \(\displaystyle \sum _{i=1} ^K \pi_i = 1\), the number of parameters to be estimated becomes \(3K-1\) for a GMM with \(K\) components.
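
The two-step generative procedure above can be written as a short sampling sketch, assuming 1-D components and illustrative parameter values:

```python
import numpy as np

def sample_gmm(n, pi, mu, sigma2, rng=None):
    """Sample n points from a 1-D Gaussian mixture:
    Step 1: draw the component z_i ~ Categorical(pi);
    Step 2: draw x_i ~ N(mu[z_i], sigma2[z_i])."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.choice(len(pi), size=n, p=pi)         # mixture components
    x = rng.normal(mu[z], np.sqrt(sigma2[z]))     # Gaussian draws
    return x, z

pi = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma2 = np.array([0.5, 1.0, 0.8])
x, z = sample_gmm(1000, pi, mu, sigma2, rng=np.random.default_rng(0))
```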

Likelihood of GMMs

\[\begin{align*} \mathcal{L}\left( \begin{array}{cccc} \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_K \\ \boldsymbol{\sigma}^2_1, \boldsymbol{\sigma}^2_2, \ldots, \boldsymbol{\sigma}^2_K\\ \pi_1, \pi_2, \ldots, \pi_K \end{array}; \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \right ) &= \prod _{i=1} ^n f_{\text{mix}} \left( \mathbf{x}_i; \begin{array}{cccc} \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_K \\ \boldsymbol{\sigma}^2_1, \boldsymbol{\sigma}^2_2, \ldots, \boldsymbol{\sigma}^2_K\\ \pi_1, \pi_2, \ldots, \pi_K \end{array} \right ) \\ &= \prod _{i=1} ^n \left [ \sum _{k=1} ^K \pi_k * f(\mathbf{x}_i; \boldsymbol{\mu}_k, \boldsymbol{\sigma}^2_k) \right ] \\ \therefore \log\mathcal{L}(\boldsymbol{\theta}) &= \sum _{i=1} ^n \log \left [ \sum _{k=1} ^K \pi_k * \frac{1}{\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ] \\ \end{align*}\] where \(f(\mathbf{x}_i; \boldsymbol{\mu}_k, \boldsymbol{\sigma}^2_k)\) is the density of the \(k\)-th Gaussian component. This log-likelihood has no closed-form maximizer; to make progress, we need to understand convexity.
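
For reference, the log-likelihood above can be evaluated numerically as follows; this is a minimal sketch assuming 1-D components, with illustrative names and data:

```python
import numpy as np

def gmm_log_likelihood(x, pi, mu, sigma2):
    """log L(theta) = sum_i log( sum_k pi_k * N(x_i; mu_k, sigma2_k) ) for 1-D data."""
    x = np.asarray(x, dtype=float)[:, None]                                      # shape (n, 1)
    dens = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)   # shape (n, K)
    return float(np.log(dens @ pi).sum())

x = np.array([-2.1, -1.9, 0.2, 0.1, 3.3])
print(gmm_log_likelihood(x, pi=np.array([0.3, 0.5, 0.2]),
                         mu=np.array([-2.0, 0.0, 3.0]),
                         sigma2=np.array([0.5, 1.0, 0.8])))
```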

Convexity and Jensen’s Inequality

A function is concave if every line segment joining two points on its graph lies on or below the graph (and convex if the segment lies on or above). For a concave function \(f\), this property generalizes to \[ f \left (\sum _{k=1} ^K \lambda_k a_k \right ) \ge \sum _{k=1} ^K \lambda_k f(a_k) \] where \[ \lambda_k \ge 0, \hspace{1em} \sum _{k=1} ^K \lambda _k = 1 \] \[ a_k \text{ are points in the domain of } f \] This is known as Jensen’s Inequality.
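
A quick numerical check of Jensen’s Inequality for the concave function \(\log\), with arbitrarily chosen weights and points:

```python
import numpy as np

# Concave f = log: f(sum_k lambda_k * a_k) >= sum_k lambda_k * f(a_k)
lam = np.array([0.2, 0.5, 0.3])   # non-negative weights summing to 1
a = np.array([1.0, 4.0, 9.0])     # points in the domain of log

lhs = np.log(np.dot(lam, a))      # log of the weighted average
rhs = np.dot(lam, np.log(a))      # weighted average of the logs
print(lhs, rhs, lhs >= rhs)       # ~1.589  ~1.352  True
```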

Estimating the Parameters

Since \(\log\) is a concave function, we can lower-bound the log-likelihood function for GMMs as follows. Starting from \[ \log\mathcal{L}(\boldsymbol{\theta}) = \sum _{i=1} ^n \log \left [ \sum _{k=1} ^K \pi_k * \frac{1}{\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ] \] and introducing parameters \(\{\lambda_1^i, \lambda_2^i, \ldots, \lambda_K^i\}\) for each data point \(\mathbf{x}_i\) such that \(\displaystyle \sum _{k=1} ^K \lambda _k ^i = 1\) and \(0 \le \lambda _k ^i \le 1\) for all \(i, k\), we obtain: \[ \log\mathcal{L}(\boldsymbol{\theta}) = \sum _{i=1} ^n \log \left [ \sum _{k=1} ^K \lambda _k ^i \left ( \pi_k * \frac{1}{\lambda _k ^i\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ) \right ] \] Using Jensen’s Inequality, we get: \[ \log\mathcal{L}(\boldsymbol{\theta}) \ge \text{modified\_log}\mathcal{L}(\boldsymbol{\theta}) \] \[\begin{align} \therefore \text{modified\_log}\mathcal{L}(\boldsymbol{\theta}) &= \sum _{i=1} ^n \sum _{k=1} ^K \lambda _k ^i \log \left ( \pi_k * \frac{1}{\lambda _k ^i\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ) \end{align}\] Note that the modified log-likelihood function gives a lower bound for the true log-likelihood function at \(\boldsymbol{\theta}\). Finally, to get the parameters, we alternate between the following two steps:

  • To get \(\boldsymbol{\theta}\): Fix \(\lambda\) and maximize over \(\boldsymbol{\theta}\). \[ \underset{\boldsymbol{\theta}} {\max} \sum _{i=1} ^n \sum _{k=1} ^K \lambda _k ^i \log \left ( \pi_k * \frac{1}{\lambda _k ^i\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ) \] \[ \text{Differentiate w.r.t. }\boldsymbol{\mu},\boldsymbol{\sigma}^2,\text{ and }\pi \text{ (the latter subject to } \textstyle\sum_{k=1}^K \pi_k = 1 \text{) to get the following} \] \[\begin{align*} \hat{\boldsymbol{\mu}}_k^{\text{MML}} &= \frac{\displaystyle \sum _{i=1} ^n \lambda_k^i \mathbf{x}_i}{\displaystyle \sum _{i=1} ^n \lambda_k^i} \\ \hat{\boldsymbol{\sigma}}_k^{2^{\text{MML}}} &= \frac{\displaystyle \sum _{i=1} ^n \lambda_k^i (\mathbf{x}_i-\hat{\boldsymbol{\mu}}_k^{\text{MML}})^2}{\displaystyle \sum _{i=1} ^n \lambda_k^i} \\ \hat{\pi}_k^{\text{MML}} &= \frac{\displaystyle \sum _{i=1} ^n \lambda_k^i}{n} \\ \end{align*}\]
  • To get \(\lambda\): Fix \(\boldsymbol{\theta}\) and maximize over \(\lambda\). For any \(i\): \[ \underset{\lambda_1^i, \lambda_2^i, \ldots, \lambda_K^i} {\max} \sum _{k=1} ^K \left [ \lambda _k ^i \log \left ( \pi_k * \frac{1}{\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ) - \lambda _k ^i \log( \lambda _k ^i) \right ] \hspace{1em} s.t. \hspace{1em} \sum _{k=1} ^K \lambda _k ^i = 1; 0 \le \lambda _k ^i \le 1 \] Solving this constrained optimization problem analytically, we get: \[ \hat{\lambda}_k^{i^{\text{MML}}} = \frac{\left ( \frac{1}{\sqrt{2\pi}\boldsymbol{\sigma}_k} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_k)^2}{2\boldsymbol{\sigma}^2_k}} \right ) * \pi_k}{\displaystyle \sum _{k'=1} ^K \left ( \frac{1}{\sqrt{2\pi}\boldsymbol{\sigma}_{k'}} e^{\frac{-(\mathbf{x}_i-\boldsymbol{\mu}_{k'})^2}{2\boldsymbol{\sigma}^2_{k'}}} \right ) * \pi_{k'}} \] Both update steps are sketched in code after this list.
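
A minimal numpy sketch of the two alternating updates above, assuming 1-D components; the function names are illustrative, and this is one possible implementation rather than reference code from the notes:

```python
import numpy as np

def _component_densities(x, mu, sigma2):
    """N(x_i; mu_k, sigma2_k) for all i, k; returns an array of shape (n, K)."""
    x = np.asarray(x, dtype=float)[:, None]
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def update_lambda(x, pi, mu, sigma2):
    """Fix theta, maximize over lambda: lambda_k^i proportional to pi_k * N(x_i; mu_k, sigma2_k)."""
    weighted = _component_densities(x, mu, sigma2) * pi        # shape (n, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

def update_theta(x, lam):
    """Fix lambda, maximize over theta: weighted means, variances, and mixing proportions."""
    x = np.asarray(x, dtype=float)
    nk = lam.sum(axis=0)                                       # effective count per component
    mu = (lam * x[:, None]).sum(axis=0) / nk
    sigma2 = (lam * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(x)
    return pi, mu, sigma2
```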

EM Algorithm

The EM (Expectation-Maximization) algorithm is a popular method for estimating the parameters of statistical models with incomplete data by iteratively alternating between expectation and maximization steps until convergence to a stable solution.

The algorithm is as follows:

  • Initialize \(\boldsymbol{\theta}^0 = \left \{ \begin{array}{cccc} \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_K \\ \boldsymbol{\sigma}^2_1, \boldsymbol{\sigma}^2_2, \ldots, \boldsymbol{\sigma}^2_K\\ \pi_1, \pi_2, \ldots, \pi_K \end{array} \right \}\) using Lloyd’s algorithm.
  • Until convergence (\(||\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t} || \le \epsilon\) where \(\epsilon\) is the tolerance parameter) do the following: \[\begin{align*} \lambda^{t+1} &= \underset{\lambda}{\arg \max} \text{ modified\_log}(\boldsymbol{\theta}^t, \textcolor{red}{\lambda}) &\rightarrow \text{ Expectation Step}\\ \boldsymbol{\theta}^{t+1} &= \underset{\boldsymbol{\theta}}{\arg \max} \text{ modified\_log}(\textcolor{red}{\boldsymbol{\theta}}, \lambda^{t+1}) &\rightarrow \text{ Maximization Step}\\ \end{align*}\]

The EM algorithm produces a soft clustering. To obtain a hard clustering from EM, one further step is needed:

  • For a point \(\mathbf{x}_i\), assign it to a cluster using the following equation: \[ z_i = \underset{k}{\arg\max} \lambda_k^i \]
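
Putting everything together, a compact, self-contained sketch of the full EM loop with the final hard-assignment step might look as follows. Random initialization is used here instead of Lloyd’s algorithm purely to keep the example short, and all names and data are illustrative:

```python
import numpy as np

def em_gmm(x, K, iters=100, tol=1e-6, seed=0):
    """EM for a 1-D Gaussian mixture; random initialization replaces Lloyd's
    algorithm only to keep this sketch self-contained."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)       # K distinct data points as initial means
    sigma2 = np.full(K, x.var())
    for _ in range(iters):
        mu_old = mu.copy()
        # Expectation step: lambda_k^i proportional to pi_k * N(x_i; mu_k, sigma2_k)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        lam = dens * pi
        lam /= lam.sum(axis=1, keepdims=True)
        # Maximization step: weighted means, variances, and mixing proportions
        nk = lam.sum(axis=0)
        mu = (lam * x[:, None]).sum(axis=0) / nk
        sigma2 = (lam * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / n
        if np.linalg.norm(mu - mu_old) <= tol:      # simple convergence check on the means
            break
    z = lam.argmax(axis=1)                          # hard assignment: z_i = argmax_k lambda_k^i
    return pi, mu, sigma2, lam, z

# Example on synthetic data from two well-separated components
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 200)])
pi, mu, sigma2, lam, z = em_gmm(x, K=2)
print(pi, mu, sigma2)   # should roughly recover the two generating components (in some order)
```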