Ridge regression

Ridge regression is a linear regression technique that addresses multicollinearity and overfitting by adding a penalty term to the ordinary least squares objective.

The objective function of ridge regression is given by:

\[ \min_{\mathbf{w} \in \mathbb{R}^d} \sum^n_{i=1}(\mathbf{w}^T\mathbf{x}_i-y_i)^2 + \lambda||\mathbf{w}||_2^2 \]

Here, \(\lambda||\mathbf{w}||_2^2\) serves as the regularization term, where \(||\mathbf{w}||_2^2\) is the squared L2 norm of \(\mathbf{w}\). In what follows, we write \(f(\mathbf{w}) = \sum^n_{i=1}(\mathbf{w}^T\mathbf{x}_i-y_i)^2\) for the unregularized squared-error part, so the full ridge objective is \(f(\mathbf{w}) + \lambda||\mathbf{w}||_2^2\).
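
As a concrete illustration, here is a minimal NumPy sketch of the ridge objective (the name `ridge_objective` and the argument `lam` are ours, not from the text; `X` is assumed to be a \(d \times n\) array with one data point \(\mathbf{x}_i\) per column, matching the notation above):

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """Ridge objective: sum_i (w^T x_i - y_i)^2 + lam * ||w||_2^2.

    X is d x n with one data point per column, so X.T @ w stacks the
    predictions w^T x_i into a length-n vector.
    """
    residuals = X.T @ w - y            # length-n vector of w^T x_i - y_i
    return residuals @ residuals + lam * (w @ w)
```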

Alternatively, we can express ridge regression as:

\[ \min_{\mathbf{w} \in \mathbb{R}^d} \sum^n_{i=1}(\mathbf{w}^T\mathbf{x}_i-y_i)^2 \hspace{1em}\text{s.t.}\hspace{1em}||\mathbf{w}||_2^2\leq\theta \]

Here, the value of \(\theta\) depends on \(\lambda\): for every choice of \(\lambda>0\) there is a corresponding \(\theta \geq 0\) such that the penalized and constrained formulations share the same optimal solution.
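
To make this correspondence concrete, the sketch below solves the penalized problem in closed form, \(\mathbf{w}_\lambda = (\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}\) (obtained by setting the gradient of the penalized objective to zero), and then takes \(\theta = ||\mathbf{w}_\lambda||_2^2\); the constrained problem with that \(\theta\) has the same minimizer. The data and variable names here are illustrative only.

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Closed-form minimizer of the penalized objective:
    (X X^T + lam * I) w = X y, with X of shape d x n (one point per column)."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# Illustration of the lambda <-> theta correspondence.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))      # d = 3 features, n = 50 data points
y = rng.normal(size=50)

w_lam = ridge_solution(X, y, lam=2.0)
theta = w_lam @ w_lam             # ||w_lambda||_2^2: a theta for which the
                                  # constrained problem has the same solution
```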

Evaluated at the maximum likelihood estimator \(\mathbf{w}_{\text{ML}}\) (the minimizer of the unregularized loss), this squared-error loss is:

\[ f(\mathbf{w}_{\text{ML}}) = \sum^n_{i=1}(\mathbf{w}_{\text{ML}}^T\mathbf{x}_i-y_i)^2 \]
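
For reference, \(\mathbf{w}_{\text{ML}}\) can be obtained with an ordinary least-squares solve. This is again an illustrative sketch; with data points stored as columns of `X`, the usual \(n \times d\) design matrix is `X.T`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))                    # d x n, one data point per column
y = rng.normal(size=50)

# Unregularized least-squares fit: w_ML minimizes sum_i (w^T x_i - y_i)^2.
w_ml, *_ = np.linalg.lstsq(X.T, y, rcond=None)
loss_ml = np.sum((X.T @ w_ml - y) ** 2)         # f(w_ML)
```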

Consider the set of all \(\mathbf{w}\) such that \(f(\mathbf{w}) = f(\mathbf{w}_{\text{ML}}) + c\) for some fixed \(c>0\) (since \(\mathbf{w}_{\text{ML}}\) minimizes the loss, any other \(\mathbf{w}\) has loss at least \(f(\mathbf{w}_{\text{ML}})\)). This set can be represented as:

\[ S_c = \left \{\mathbf{w}: f(\mathbf{w}) = f(\mathbf{w}_{\text{ML}}) + c \right \} \]

In other words, every \(\mathbf{w} \in S_c\) satisfies the equation:

\[ ||\mathbf{X}^T\mathbf{w}-\mathbf{y}||^2 = ||\mathbf{X}^T\mathbf{w}_{\text{ML}}-\mathbf{y}||^2 + c \]
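
Writing \(\mathbf{X}^T\mathbf{w}-\mathbf{y} = \mathbf{X}^T(\mathbf{w}-\mathbf{w}_{\text{ML}}) + (\mathbf{X}^T\mathbf{w}_{\text{ML}}-\mathbf{y})\) and expanding the squared norm gives

\[ ||\mathbf{X}^T\mathbf{w}-\mathbf{y}||^2 = (\mathbf{w}-\mathbf{w}_{\text{ML}})^T(\mathbf{X}\mathbf{X}^T)(\mathbf{w}-\mathbf{w}_{\text{ML}}) + 2(\mathbf{w}-\mathbf{w}_{\text{ML}})^T(\mathbf{X}\mathbf{X}^T\mathbf{w}_{\text{ML}}-\mathbf{X}\mathbf{y}) + ||\mathbf{X}^T\mathbf{w}_{\text{ML}}-\mathbf{y}||^2 \]

The cross term vanishes because \(\mathbf{w}_{\text{ML}}\) satisfies the normal equations \(\mathbf{X}\mathbf{X}^T\mathbf{w}_{\text{ML}} = \mathbf{X}\mathbf{y}\).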

Cancelling the common term \(||\mathbf{X}^T\mathbf{w}_{\text{ML}}-\mathbf{y}||^2\) from both sides then yields:

\[ (\mathbf{w}-\mathbf{w}_{\text{ML}})^T(\mathbf{X}\mathbf{X}^T)(\mathbf{w}-\mathbf{w}_{\text{ML}}) = c' \]

Since the cross term is zero, \(c' = c\); in particular, \(c'\) does not depend on \(\mathbf{w}\). The level set \(S_c\) is therefore an ellipsoid centered at \(\mathbf{w}_{\text{ML}}\), with shape determined by \(\mathbf{X}\mathbf{X}^T\).
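
This identity can be checked numerically (an illustrative sketch; all names and data here are ours): for any \(\mathbf{w}\), the excess loss \(f(\mathbf{w}) - f(\mathbf{w}_{\text{ML}})\) should match the quadratic form above.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 50))                    # d x n, one data point per column
y = rng.normal(size=50)
w_ml, *_ = np.linalg.lstsq(X.T, y, rcond=None)  # unregularized minimizer

w = rng.normal(size=3)                          # an arbitrary weight vector
c = np.sum((X.T @ w - y) ** 2) - np.sum((X.T @ w_ml - y) ** 2)  # f(w) - f(w_ML)
c_quad = (w - w_ml) @ (X @ X.T) @ (w - w_ml)    # (w - w_ML)^T X X^T (w - w_ML)
print(np.isclose(c, c_quad))                    # True
```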

[Figure: pictorial representation of what ridge regression does.]

In summary, ridge regression regularizes the learned weights, shrinking them towards zero, but not necessarily exactly to zero.