Kernel Least squares regression

What if the relationship between the data points and the labels is non-linear, so that no linear model fits well? As in non-linear clustering, kernel functions are used here too.

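The least squares solution \(\mathbf{w}^* = (\mathbf{X}\mathbf{X}^T)^+\mathbf{X}\mathbf{y}\) lies in the span of the data points, that is, the columns of \(\mathbf{X} \in \mathbb{R}^{d \times n}\). So let \(\mathbf{w}^* = \mathbf{X}\boldsymbol{\alpha}^*\), where \(\boldsymbol{\alpha}^* \in \mathbb{R}^n\), and write \(\mathbf{K} = \mathbf{X}^T\mathbf{X}\). \[\begin{align*} \mathbf{X}\boldsymbol{\alpha}^* &= \mathbf{w}^* \\ \therefore \mathbf{X}\boldsymbol{\alpha}^* &= (\mathbf{X}\mathbf{X}^T)^+\mathbf{X}\mathbf{y} \\ (\mathbf{X}\mathbf{X}^T)\mathbf{X}\boldsymbol{\alpha}^* &= (\mathbf{X}\mathbf{X}^T)(\mathbf{X}\mathbf{X}^T)^+\mathbf{X}\mathbf{y} \\ (\mathbf{X}\mathbf{X}^T)\mathbf{X}\boldsymbol{\alpha}^* &= \mathbf{X}\mathbf{y} \\ \mathbf{X}^T(\mathbf{X}\mathbf{X}^T)\mathbf{X}\boldsymbol{\alpha}^* &= \mathbf{X}^T\mathbf{X}\mathbf{y} \\ (\mathbf{X}^T\mathbf{X})^2\boldsymbol{\alpha}^* &= \mathbf{X}^T\mathbf{X}\mathbf{y} \\ \mathbf{K}^2\boldsymbol{\alpha}^* &= \mathbf{K}\mathbf{y} \\ \therefore \boldsymbol{\alpha}^* &= \mathbf{K}^{-1}\mathbf{y} \end{align*}\] The last step assumes \(\mathbf{K}\) is invertible; if it is not, \(\boldsymbol{\alpha}^* = \mathbf{K}^+\mathbf{y}\) still satisfies \(\mathbf{K}^2\boldsymbol{\alpha}^* = \mathbf{K}\mathbf{y}\).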

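Here, \(\mathbf{K} \in \mathbb{R}^{n \times n}\) is the kernel matrix. After mapping the data to a feature space with \(\phi\), its entries become \(K_{ij} = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j) = k(\mathbf{x}_i, \mathbf{x}_j)\), so it can be computed directly with a kernel function such as the polynomial kernel or the RBF kernel, without ever forming \(\phi\) explicitly.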
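The computation of \(\boldsymbol{\alpha}^*\) can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names `polynomial_kernel`, `rbf_kernel` and `fit_alpha`, the kernel parameters, and the random toy data are all assumptions made for the example. Data points are stored as columns, matching \(\mathbf{X} \in \mathbb{R}^{d \times n}\), and the pseudo-inverse is used so that a singular \(\mathbf{K}\) is also handled.

```python
import numpy as np

def polynomial_kernel(A, B, degree=2, c=1.0):
    """k(a, b) = (a^T b + c)^degree for every column pair of A (d x n) and B (d x m)."""
    return (A.T @ B + c) ** degree

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for every column pair of A and B."""
    sq_dists = (
        np.sum(A**2, axis=0)[:, None]
        + np.sum(B**2, axis=0)[None, :]
        - 2.0 * A.T @ B
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def fit_alpha(K, y):
    """Return alpha = K^+ y, which equals K^{-1} y whenever K is invertible."""
    return np.linalg.pinv(K, rcond=1e-10) @ y

# Hypothetical toy data: n = 20 points in d = 3 dimensions, stored as columns.
rng = np.random.default_rng(0)
d, n = 3, 20
X = rng.normal(size=(d, n))
y = rng.normal(size=n)

# Sanity check with the linear kernel K = X^T X:
# X @ alpha should reproduce the ordinary least squares solution w*.
K_lin = X.T @ X
alpha_lin = fit_alpha(K_lin, y)
w_from_alpha = X @ alpha_lin
w_direct = np.linalg.lstsq(X.T, y, rcond=None)[0]

# Non-linear kernels only change how K is built; the solve is identical.
alpha_poly = fit_alpha(polynomial_kernel(X, X), y)
alpha_rbf = fit_alpha(rbf_kernel(X, X), y)
```

With the linear kernel, `w_from_alpha` and `w_direct` should agree up to floating-point error, which is a quick way to check the derivation numerically.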

To predict using \(\boldsymbol{\alpha}\) and the kernel function, let \(\mathbf{X}_{\text{test}} \in \mathbb{R}^{d \times m}\) represent the test dataset. The prediction is made as follows:

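\[\begin{align*} \mathbf{w}^{*T}\phi(\mathbf{x}_{\text{test}_j}) &= \sum_{i=1}^n \alpha_i^* k(\mathbf{x}_i, \mathbf{x}_{\text{test}_j}), \quad j = 1, \dots, m \end{align*}\]

Here, \(\mathbf{x}_{\text{test}_j}\) is the \(j\)-th column of \(\mathbf{X}_{\text{test}}\), \(\alpha_i^*\) denotes the importance of the \(i\)-th training point in forming \(\mathbf{w}^*\), and \(k(\mathbf{x}_i, \mathbf{x}_{\text{test}_j})\) measures the similarity between \(\mathbf{x}_{\text{test}_j}\) and \(\mathbf{x}_i\). The sum runs over the training points \(i\), while the test index \(j\) stays fixed.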
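The prediction step can be sketched as below. This is an illustration under stated assumptions: the RBF kernel with `gamma=2.0`, the helper name `rbf_kernel`, and the toy data \(y = \sin(x)\) are choices made for the example, not part of the method itself. All test points are handled at once by building the \(n \times m\) matrix of kernel values between training and test points.

```python
import numpy as np

def rbf_kernel(A, B, gamma=2.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for every column pair of A (d x n) and B (d x m)."""
    sq_dists = (
        np.sum(A**2, axis=0)[:, None]
        + np.sum(B**2, axis=0)[None, :]
        - 2.0 * A.T @ B
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Hypothetical toy data: d = 1, n = 8 training points on y = sin(x).
X_train = np.linspace(0.0, 3.0, 8)[None, :]
y_train = np.sin(X_train[0])

# Fit: alpha* = K^{-1} y (the RBF kernel matrix is invertible for distinct points).
K = rbf_kernel(X_train, X_train)
alpha = np.linalg.solve(K, y_train)

# Predict: entry (i, j) of K_test is k(x_i, x_test_j), so the j-th prediction
# is sum_i alpha_i * k(x_i, x_test_j), i.e. the j-th entry of K_test^T alpha.
X_test = np.array([[0.5, 1.5, 2.5]])       # d = 1, m = 3
K_test = rbf_kernel(X_train, X_test)       # shape (n, m)
y_pred = K_test.T @ alpha                  # shape (m,)

# The same prediction for the first test point, written as the explicit sum.
y_pred_0 = sum(alpha[i] * K_test[i, 0] for i in range(X_train.shape[1]))
```

Note that \(\mathbf{w}^*\) is never formed: only kernel evaluations between training and test points are needed.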