Bayesian view of least squares regression

Probabilistic View of Linear Regression

Consider a dataset \(\mathbf{x}_1, \ldots, \mathbf{x}_n\) with \(\mathbf{x}_i \in \mathbb{R}^d\), and corresponding labels \(y_1, \ldots, y_n\) with \(y_i \in \mathbb{R}\). The probabilistic view of linear regression assumes that each target \(y_i\) is a linear combination of the input features \(\mathbf{x}_i\) plus a noise term \(\epsilon_i\) drawn independently from a zero-mean Gaussian distribution with variance \(\sigma^2\). Mathematically, this can be expressed as:

\[ y_i = \mathbf{w}^T\mathbf{x}_i + \epsilon_i \]

where \(\mathbf{w} \in \mathbb{R}^d\) represents the weight vector that captures the relationship between the inputs and the target variable.
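As a concrete illustration, the following minimal NumPy sketch simulates data from this model; the dimensions, noise level, and "true" weight vector are arbitrary choices made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 5                  # number of samples and feature dimension (illustrative)
sigma = 0.5                    # noise standard deviation (assumed for the example)

w_true = rng.normal(size=d)    # "true" weight vector, unknown in practice
X = rng.normal(size=(d, n))    # columns of X are the inputs x_i
eps = rng.normal(scale=sigma, size=n)
y = X.T @ w_true + eps         # y_i = w^T x_i + eps_i
```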

To estimate the weight vector \(\mathbf{w}\) that best fits the data, we can apply the principle of Maximum Likelihood (ML). The ML estimation seeks to find the parameter values that maximize the likelihood of observing the given data.

Assuming that the noise term \(\epsilon_i\) follows a zero-mean Gaussian distribution with variance \(\sigma^2\), we can express the likelihood function as:

\[\begin{align*} \mathcal{L}(\mathbf{w}; \mathbf{X}, \mathbf{y}) &= P(\mathbf{y}|\mathbf{X}; \mathbf{w}) \\ &= \prod_{i=1}^n P(y_i|\mathbf{x}_i; \mathbf{w}) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(\mathbf{w}^T\mathbf{x}_i - y_i)^2}{2\sigma^2}\right) \end{align*}\]

Taking the logarithm of the likelihood function, we have:

\[\begin{align*} \log \mathcal{L}(\mathbf{w}; \mathbf{X}, \mathbf{y}) &= \sum_{i=1}^n \left[ \log \left(\frac{1}{\sqrt{2\pi}\sigma}\right) - \frac{(\mathbf{w}^T\mathbf{x}_i - y_i)^2}{2\sigma^2} \right] \\ &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (\mathbf{w}^T\mathbf{x}_i - y_i)^2 \end{align*}\]

To find the maximum likelihood estimate \(\mathbf{w}_{\text{ML}}\), we want to maximize \(\log \mathcal{L}(\mathbf{w}; \mathbf{X}, \mathbf{y})\). Maximizing the likelihood is equivalent to minimizing the negative log-likelihood. Thus, we seek to minimize:

\[\begin{align*} -\log \mathcal{L}(\mathbf{w}; \mathbf{X}, \mathbf{y}) &= \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n (\mathbf{w}^T\mathbf{x}_i - y_i)^2 \end{align*}\]

The first term does not depend on \(\mathbf{w}\), and the second is, up to the positive factor \(\frac{1}{2\sigma^2}\), the sum-of-squared-errors objective used in linear regression. Therefore, finding the maximum likelihood estimate \(\mathbf{w}_{\text{ML}}\) is equivalent to solving the linear regression problem with the squared error loss.

To obtain the closed-form solution for \(\mathbf{w}_{\text{ML}}\), we differentiate the negative log-likelihood with respect to \(\mathbf{w}\) and set the derivative to zero:

\[\begin{align*} \nabla_{\mathbf{w}} \left(-\log \mathcal{L}(\mathbf{w}; \mathbf{X}, \mathbf{y})\right) &= \frac{1}{\sigma^2}\sum_{i=1}^n (\mathbf{w}^T\mathbf{x}_i - y_i)\mathbf{x}_i = \mathbf{0} \end{align*}\]

Collecting the inputs into the matrix \(\mathbf{X} \in \mathbb{R}^{d \times n}\) whose columns are the vectors \(\mathbf{x}_i\), and the labels into the column vector \(\mathbf{y} \in \mathbb{R}^n\), the sum \(\sum_{i=1}^n (\mathbf{w}^T\mathbf{x}_i - y_i)\mathbf{x}_i\) can be written as \(\mathbf{X}\mathbf{X}^T\mathbf{w} - \mathbf{X}\mathbf{y}\), so the stationarity condition becomes:

\[\begin{align*} \frac{1}{\sigma^2}\left(\mathbf{X}\mathbf{X}^T\mathbf{w} - \mathbf{X}\mathbf{y}\right) &= \mathbf{0} \end{align*}\]

Multiplying through by \(\sigma^2\) and rearranging, we have:

\[\begin{align*} \mathbf{X}\mathbf{X}^T\mathbf{w} &= \mathbf{X}\mathbf{y} \end{align*}\]

To obtain the closed-form solution for \(\mathbf{w}_{\text{ML}}\), we multiply both sides by \((\mathbf{X}\mathbf{X}^T)^{-1}\), assuming that \(\mathbf{X}\mathbf{X}^T \in \mathbb{R}^{d \times d}\) is invertible:

\[\begin{align*} \mathbf{w}_{\text{ML}} &= (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{y} \end{align*}\]

Thus, the maximum likelihood estimate coincides with the ordinary least squares solution: it is obtained by solving the linear system \(\mathbf{X}\mathbf{X}^T\mathbf{w} = \mathbf{X}\mathbf{y}\), which requires only matrix products and a single matrix inversion (or linear solve). This closed-form solution provides an efficient and direct way to estimate the weight vector \(\mathbf{w}\) from the given data.
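A minimal sketch of this closed-form solution, assuming NumPy and simulated data of the kind generated earlier (with the inputs stored as columns of \(\mathbf{X}\)); solving the linear system with np.linalg.solve is numerically preferable to forming \((\mathbf{X}\mathbf{X}^T)^{-1}\) explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 0.5
X = rng.normal(size=(d, n))                        # columns of X are the inputs x_i
w_true = rng.normal(size=d)
y = X.T @ w_true + rng.normal(scale=sigma, size=n)

# Maximum likelihood / least squares: solve (X X^T) w = X y
w_ml = np.linalg.solve(X @ X.T, X @ y)
print(np.linalg.norm(w_ml - w_true))               # should be small for n >> d
```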

Evaluation of the Maximum Likelihood Estimator for Linear Regression

Linear regression is a widely used technique for modeling the relationship between a dependent variable and one or more independent variables. The maximum likelihood estimator (MLE) is commonly employed to estimate the parameters of a linear regression model. Here, we discuss the goodness of the MLE for linear regression, explore cross-validation techniques for minimizing mean squared error (MSE), and examine Bayesian modeling as an alternative approach that leads to a ridge-style regularized estimator for mitigating overfitting.

Goodness of Maximum Likelihood Estimator for Linear Regression

Consider a dataset comprising input vectors \(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\), where each \(\mathbf{x}_i \in \mathbb{R}^d\), and corresponding labels \(\{y_1, \ldots, y_n\}\), with \(y_i \in \mathbb{R}\). We can express the relationship between the inputs and labels using the linear regression model:

\[ y_i = \mathbf{w}^T\mathbf{x}_i + \epsilon_i \]

Here, each \(\epsilon_i\) is random noise drawn from a normal distribution \(\mathcal{N}(0,\sigma^2)\), and \(\mathbf{w} \in \mathbb{R}^d\) denotes the vector of regression coefficients. The maximum likelihood parameter estimate for linear regression, denoted \(\hat{\mathbf{w}}_{\text{ML}}\), can be computed as:

\[ \hat{\mathbf{w}}_{\text{ML}} = \mathbf{w}^* = (\mathbf{X}\mathbf{X}^T)^+\mathbf{X}\mathbf{y} \]
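Here \((\cdot)^+\) denotes the Moore–Penrose pseudo-inverse, which reduces to the ordinary inverse when \(\mathbf{X}\mathbf{X}^T\) is invertible. As a small sketch (assuming NumPy, with illustrative shapes where \(n < d\) so that \(\mathbf{X}\mathbf{X}^T\) is singular), this formula can be checked against NumPy's minimum-norm least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 20, 50                          # fewer samples than features, so X X^T is singular
X = rng.normal(size=(d, n))            # columns of X are the inputs x_i
y = rng.normal(size=n)

w_ml = np.linalg.pinv(X @ X.T) @ X @ y              # pseudo-inverse form of the estimate
w_lstsq = np.linalg.lstsq(X.T, y, rcond=None)[0]    # minimum-norm least-squares solution

print(np.allclose(w_ml, w_lstsq))      # expected: True
```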

To evaluate the quality of the estimate, we measure the mean squared error (MSE) between the estimated parameters and the true parameters \(\mathbf{w}\), where the expectation is taken over the random noise. The MSE is given by:

\[ \mathbb{E}[|| \hat{\mathbf{w}}_{\text{ML}} - \mathbf{w} ||^2_2] \]

Interestingly, assuming \(\mathbf{X}\mathbf{X}^T\) is invertible, the MSE can be expressed as:

\[ \mathbb{E}[|| \hat{\mathbf{w}}_{\text{ML}} - \mathbf{w} ||^2_2] = \sigma^2 \cdot \text{trace}((\mathbf{X}\mathbf{X}^T)^{-1}) \]

This result provides a quantitative measure of the goodness of the maximum likelihood estimator for linear regression.
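This identity can also be checked numerically. The sketch below (assuming NumPy; the design matrix, noise level, and number of trials are arbitrary choices) holds \(\mathbf{X}\) fixed, repeatedly redraws the noise, and compares the empirical average of \(||\hat{\mathbf{w}}_{\text{ML}} - \mathbf{w}||^2_2\) with \(\sigma^2 \cdot \text{trace}((\mathbf{X}\mathbf{X}^T)^{-1})\).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 100, 4, 0.3                  # illustrative choices
X = rng.normal(size=(d, n))                # fixed design, columns are the x_i
w_true = rng.normal(size=d)

XXt_inv = np.linalg.inv(X @ X.T)
theory = sigma**2 * np.trace(XXt_inv)

errs = []
for _ in range(5000):                      # Monte Carlo over noise realizations
    y = X.T @ w_true + rng.normal(scale=sigma, size=n)
    w_ml = XXt_inv @ X @ y
    errs.append(np.sum((w_ml - w_true) ** 2))

print(theory, np.mean(errs))               # the two values should be close
```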

Cross-Validation for Minimizing MSE

In order to minimize the MSE, we can utilize cross-validation techniques. Let the eigenvalues of \(\mathbf{X}\mathbf{X}^T\) be denoted as \(\{\lambda_1, \ldots, \lambda_d\}\). Consequently, the eigenvalues of \((\mathbf{X}\mathbf{X}^T)^{-1}\) are given by \(\{\frac{1}{\lambda_1}, \ldots, \frac{1}{\lambda_d}\}\).

The MSE can be expressed as:

\[ \mathbb{E}[|| \hat{\mathbf{w}}_{\text{ML}} - \mathbf{w} ||^2_2] = \sigma^2 \sum_{i=1}^d \frac{1}{\lambda_i} \]

To improve the estimator, we introduce a modified estimator, denoted as \(\hat{\mathbf{w}}_{\text{new}}\), defined as:

\[ \hat{\mathbf{w}}_{\text{new}} = (\mathbf{X}\mathbf{X}^T + \lambda \mathbf{I})^{-1}\mathbf{X}\mathbf{y} \]

Here, \(\lambda \geq 0\) is a regularization parameter and \(\mathbf{I} \in \mathbb{R}^{d\times d}\) is the identity matrix. For this modified estimator, the relevant trace is:

\[ \text{trace}((\mathbf{X}\mathbf{X}^T + \lambda \mathbf{I})^{-1}) = \sum_{i=1}^d \frac{1}{\lambda_i + \lambda} \]

According to the Existence Theorem, there exists a value of \(\lambda\) such that \(\hat{\mathbf{w}}_{\text{new}}\) exhibits a lower mean squared error than \(\hat{\mathbf{w}}_{\text{ML}}\). In practice, the value for \(\lambda\) is determined using cross-validation techniques.
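As a sketch (assuming NumPy and an arbitrary choice of \(\lambda\)), the modified estimator and the trace identity above can be computed as follows; the eigenvalue-based sum and the direct trace should agree.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 100, 4, 5.0                    # illustrative sizes and regularization strength
X = rng.normal(size=(d, n))                # columns of X are the inputs x_i
y = rng.normal(size=n)                     # arbitrary labels, just for illustration

# modified (ridge-type) estimator
w_new = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# trace((X X^T + lam I)^{-1}) equals sum_i 1 / (lambda_i + lam)
eigvals = np.linalg.eigvalsh(X @ X.T)
print(np.trace(np.linalg.inv(X @ X.T + lam * np.eye(d))),
      np.sum(1.0 / (eigvals + lam)))
```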

Three commonly used techniques for cross-validation are as follows:

  1. Training-Validation Split: The available training data is randomly split into a training set and a validation set, typically in an 80:20 ratio. Among the candidate \(\lambda\) values, the one that yields the lowest validation error is selected.

  2. K-Fold Cross Validation: The training set is partitioned into K equally-sized parts. The model is trained K times, each time using K-1 parts as the training set and the remaining part as the validation set. The \(\lambda\) value that leads to the lowest average validation error is chosen.

  3. Leave-One-Out Cross Validation: The model is trained on all but one sample in the training set, and the left-out sample is used for validation. This process is repeated for each sample in the dataset. The optimal \(\lambda\) is the one with the lowest average error across all iterations.

By employing cross-validation techniques, we can enhance the performance of the linear regression model by selecting an appropriate value of \(\lambda\).
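A minimal sketch of K-fold cross validation for choosing \(\lambda\), assuming NumPy; the candidate grid, fold count, and data are illustrative, and the helper names (ridge_fit, kfold_mse) are made up for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form modified estimator; columns of X are the inputs x_i."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

def kfold_mse(X, y, lam, k=5, seed=0):
    """Average validation MSE of the estimator over k folds."""
    n = X.shape[1]
    idx = np.random.default_rng(seed).permutation(n)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        w = ridge_fit(X[:, train], y[train], lam)
        errs.append(np.mean((X[:, fold].T @ w - y[fold]) ** 2))
    return np.mean(errs)

# illustrative data
rng = np.random.default_rng(3)
n, d, sigma = 200, 10, 0.5
X = rng.normal(size=(d, n))
w_true = rng.normal(size=d)
y = X.T @ w_true + rng.normal(scale=sigma, size=n)

# pick the candidate lambda with the lowest cross-validated error
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best = min(candidates, key=lambda lam: kfold_mse(X, y, lam))
print("selected lambda:", best)
```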

Bayesian Modeling

Alternatively, we can understand the maximum likelihood estimator \(\hat{\mathbf{w}}_{\text{ML}}\) in the context of Bayesian modeling.

Assume that, given \(\mathbf{x}\) and \(\mathbf{w}\), the label \(y\) follows a normal distribution \(\mathcal{N}(\mathbf{w}^T\mathbf{x}, 1)\); that is, we take the noise variance to be \(1\) for simplicity.

For the prior distribution of \(\mathbf{w}\), a suitable choice is the normal distribution \(\mathcal{N}(0,\gamma^2\mathbf{I})\), where \(\gamma^2\mathbf{I} \in\mathbb{R}^{d\times d}\).

Thus, we can write:

\[\begin{align*} P(\mathbf{w}|\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n,y_n)\}) &\propto P(\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n,y_n)\}|\mathbf{w})\cdot P(\mathbf{w}) \\ &\propto \left ( \prod_{i=1}^n e^{-\frac{(y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2}} \right ) \cdot \left ( \prod_{j=1}^d e^{-\frac{(\mathbf{w}_j - 0)^2}{2\gamma^2}} \right ) \\ &\propto \left ( \prod_{i=1}^n e^{-\frac{(y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2}} \right ) \cdot e^{-\frac{||\mathbf{w}||^2}{2\gamma^2}} \end{align*}\]

Taking the logarithm, we obtain:

\[ \log P(\mathbf{w}|\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n,y_n)\}) = -\sum_{i=1}^n\frac{(y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2}-\frac{||\mathbf{w}||^2}{2\gamma^2} + \text{const} \]

Setting the gradient of the negative log posterior to zero at \(\hat{\mathbf{w}}_{\text{MAP}}\), we find:

\[ \mathbf{X}\mathbf{X}^T\hat{\mathbf{w}}_{\text{MAP}} - \mathbf{X}\mathbf{y} + \frac{\hat{\mathbf{w}}_{\text{MAP}}}{\gamma^2} = \mathbf{0} \]

Consequently, the maximum a posteriori (MAP) estimate of \(\mathbf{w}\) can be computed as:

\[ \hat{\mathbf{w}}_{\text{MAP}} = (\mathbf{X}\mathbf{X}^T + \frac{1}{\gamma^2} \mathbf{I})^{-1}\mathbf{X}\mathbf{y} \]

In practice, the value of \(\frac{1}{\gamma^2}\) is obtained using cross-validation. Remarkably, this maximum a posteriori estimate for linear regression with a Gaussian prior \(\mathcal{N}(\mathbf{0},\gamma^2\mathbf{I})\) on \(\mathbf{w}\) is exactly the modified estimator \(\hat{\mathbf{w}}_{\text{new}}\) discussed earlier, with \(\lambda = \frac{1}{\gamma^2}\).
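As a closing sanity check, the sketch below (assuming NumPy; the data, \(\gamma\), step size, and iteration count are illustrative) minimizes the negative log posterior by plain gradient descent and confirms that it converges to the closed-form \(\hat{\mathbf{w}}_{\text{MAP}}\) with \(\lambda = \frac{1}{\gamma^2}\).

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, gamma = 150, 6, 2.0                  # illustrative sizes; gamma is the prior scale
X = rng.normal(size=(d, n))                # columns of X are the inputs x_i
y = X.T @ rng.normal(size=d) + rng.normal(size=n)   # unit noise variance, as assumed above

lam = 1.0 / gamma**2

# closed-form MAP estimate
w_map = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# minimize the negative log posterior 0.5*||X^T w - y||^2 + 0.5*lam*||w||^2 by gradient descent
w = np.zeros(d)
step = 1.0 / (np.linalg.norm(X @ X.T, 2) + lam)     # conservative step size
for _ in range(5000):
    grad = X @ (X.T @ w - y) + lam * w              # gradient of the objective above
    w -= step * grad

print(np.allclose(w, w_map))               # expected: True
```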