Linear Regression

Linear regression is a supervised learning algorithm employed to predict a continuous output variable based on one or more input features, assuming a linear relationship between the input and output variables. The primary objective of linear regression is to determine the line of best fit that minimizes the sum of squared errors between the predicted and actual output values.

Given a dataset \(\mathbf{x}_1, \ldots, \mathbf{x}_n\) where each \(\mathbf{x}_i\) belongs to \(\mathbb{R}^d\), and the corresponding labels \(\mathbf{y}_1, \ldots, \mathbf{y}_n\) belong to \(\mathbb{R}\), the goal of linear regression is to find a mapping between the input and output variables, represented as follows:

\[ h: \mathbb{R}^d \rightarrow \mathbb{R} \]

The error for this mapping function can be quantified as:

\[ \text{error}(h) = \sum_{i=1}^n (h(\mathbf{x}_i) - \mathbf{y}_i)^2 \]
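To make this error concrete, here is a tiny sketch on synthetic data (the points, labels, and candidate mapping below are made up purely for illustration):

import numpy as np

# three data points in R^2 and their labels (illustrative values only)
x = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [0.0, 1.0]])
y = np.array([3.0, 2.5, 1.0])

# one candidate mapping, h(x) = x[0] + x[1] (not necessarily the best fit)
def h(xi):
    return xi[0] + xi[1]

# error(h) = sum of squared differences between predictions and labels
error = sum((h(xi) - yi) ** 2 for xi, yi in zip(x, y))
print(error)   # (3-3)^2 + (2-2.5)^2 + (1-1)^2 = 0.25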

Ideally, this error should be minimized, which occurs when \(h(\mathbf{x}_i) = \mathbf{y}_i\) for all \(i\). However, a mapping that achieves zero error in this way may simply be memorizing the data and its outputs, which is not a desired outcome.

To mitigate the memorization problem, introducing a structure to the mapping becomes necessary. The simplest and commonly used structure is linear, which we will adopt as the underlying structure for our data.

Let \(\mathcal{H}_{\text{linear}}\) denote the solution space for the mapping in the linear domain:

\[ \mathcal{H}_{\text{linear}} = \left\lbrace h_w: \mathbb{R}^d \rightarrow \mathbb{R} \ \text{s.t.} \ h_w(\mathbf{x}) = \mathbf{w}^T\mathbf{x}, \ \mathbf{w} \in \mathbb{R}^d \right\rbrace \]

Thus, our objective is to minimize:

\[\begin{align*} \min_{h \in \mathcal{H}_{\text{linear}}} \sum_{i=1}^n (h(\mathbf{x}_i) - \mathbf{y}_i)^2 \\ \text{Equivalently,} \\ \min_{\mathbf{w} \in \mathbb{R}^d} \sum_{i=1}^n (\mathbf{w}^T\mathbf{x}_i - \mathbf{y}_i)^2 \end{align*}\]

Optimizing the above objective is the main aim of the linear regression algorithm.

Optimizing the Error Function

The minimization problem can be expressed in vectorized form by stacking the data points as the columns of a matrix \(\mathbf{X} = [\mathbf{x}_1 \ \cdots \ \mathbf{x}_n] \in \mathbb{R}^{d \times n}\) and the labels as a vector \(\mathbf{y} \in \mathbb{R}^n\):

\[ \min_{\mathbf{w} \in \mathbb{R}^d} \|\mathbf{X}^T\mathbf{w} - \mathbf{y}\|_2^2 \]
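As a quick check that the vectorized form matches the sum written earlier, here is a small sketch on random data (the dimensions and values are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5
X = rng.normal(size=(d, n))     # columns are the data points x_i
y = rng.normal(size=n)
w = rng.normal(size=d)

# sum over data points vs. the vectorized squared norm
loop_form = sum((w @ X[:, i] - y[i]) ** 2 for i in range(n))
vector_form = np.linalg.norm(X.T @ w - y) ** 2

print(np.isclose(loop_form, vector_form))   # True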

Let \(f(\mathbf{w})\) denote the objective to be minimized. Expanding it and taking the gradient with respect to \(\mathbf{w}\), we have:

\[\begin{align*} f(\mathbf{w}) &= \|\mathbf{X}^T\mathbf{w} - \mathbf{y}\|_2^2 \\ &= (\mathbf{X}^T\mathbf{w} - \mathbf{y})^T(\mathbf{X}^T\mathbf{w} - \mathbf{y}) \\ &= \mathbf{w}^T\mathbf{X}\mathbf{X}^T\mathbf{w} - 2\mathbf{w}^T\mathbf{X}\mathbf{y} + \mathbf{y}^T\mathbf{y} \\ \therefore \nabla f(\mathbf{w}) &= 2(\mathbf{X}\mathbf{X}^T)\mathbf{w} - 2(\mathbf{X}\mathbf{y}) \end{align*}\]

Setting the gradient to zero, we obtain the normal equation and its solution:

\[\begin{align*} (\mathbf{X}\mathbf{X}^T)\mathbf{w}^* &= \mathbf{X}\mathbf{y} \\ \therefore \mathbf{w}^* &= (\mathbf{X}\mathbf{X}^T)^+\mathbf{X}\mathbf{y} \end{align*}\]

Here, \((\mathbf{X}\mathbf{X}^T)^+\) represents the pseudo-inverse of \(\mathbf{X}\mathbf{X}^T\).

Further analysis reveals that \(\mathbf{X}^T\mathbf{w}^*\) is the orthogonal projection of the label vector \(\mathbf{y}\) onto the subspace spanned by the feature columns (the columns of \(\mathbf{X}^T\)).
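To see why, substitute \(\mathbf{w}^*\) back into the prediction:

\[ \mathbf{X}^T\mathbf{w}^* = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^+\mathbf{X}\mathbf{y} = \mathbf{P}\mathbf{y} \]

The matrix \(\mathbf{P} = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^+\mathbf{X}\) is symmetric and idempotent (\(\mathbf{P}^T = \mathbf{P}\), \(\mathbf{P}^2 = \mathbf{P}\)), so it is exactly the orthogonal projection onto the column space of \(\mathbf{X}^T\).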

Implementing Linear Regression Algorithm (Closed Form) in Python

We will be using the FuelConsumption.csv dataset.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.read_csv("FuelConsumption.csv")

# take a look at the dataset
df.head()
MODELYEAR MAKE MODEL VEHICLECLASS ENGINESIZE CYLINDERS TRANSMISSION FUELTYPE FUELCONSUMPTION_CITY FUELCONSUMPTION_HWY FUELCONSUMPTION_COMB FUELCONSUMPTION_COMB_MPG CO2EMISSIONS
2014 ACURA ILX COMPACT 2.0 4 AS5 Z 9.9 6.7 8.5 33 196
2014 ACURA ILX COMPACT 2.4 4 M6 Z 11.2 7.7 9.6 29 221
2014 ACURA ILX HYBRID COMPACT 1.5 4 AV7 Z 6.0 5.8 5.9 48 136

Let’s select a few features to explore further.

cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head()
ENGINESIZE CYLINDERS FUELCONSUMPTION_COMB CO2EMISSIONS
2.0 4 8.5 196
2.4 4 9.6 221
1.5 4 5.9 136
3.5 6 11.1 255
3.5 6 10.6 244

Let’s split the dataset into features and labels and convert them into NumPy arrays for computation. We also append a column of ones to the features so that the last entry of \(\mathbf{w}\) serves as the intercept.

cdf = cdf.assign(Ones=1)  # constant feature for the intercept term
X = np.asanyarray(cdf[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB', 'Ones']]).T
y = np.asanyarray(cdf[['CO2EMISSIONS']])

X.shape, y.shape

((4, 825), (825, 1))

Let’s compute the \(w\) vector using the closed-form solution and print the coefficients and intercept.

# closed-form solution: w* = (X X^T)^+ X y
w = np.linalg.pinv(X @ X.T) @ X @ y

print ('Coefficients: ', w.reshape((-1,))[:3])
print ('Intercept: ', w.reshape((-1,))[-1])

Coefficients: [10.59881197, 7.4005805, 9.48207228]

Intercept: 67.82612794918894
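As a quick sanity check of the projection interpretation above, we can verify that the residual \(\mathbf{X}^T\mathbf{w} - \mathbf{y}\) is orthogonal to the rows of \(\mathbf{X}\) (up to floating-point error), using the arrays already defined:

# residual of the closed-form fit
residual = X.T @ w - y

# X @ residual should be numerically close to the zero vector, confirming that
# X^T w is the orthogonal projection of y onto the span of the feature columns
print('Max |X @ residual|: ', np.abs(X @ residual).max())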

Gradient Descent

The normal equation for linear regression, as shown above, involves computing \((\mathbf{X}\mathbf{X}^T)^+\); forming \(\mathbf{X}\mathbf{X}^T\) takes \(O(nd^2)\) operations and inverting it takes \(O(d^3)\), which becomes expensive as the number of features grows.

Since \(\mathbf{w}^*\) represents the solution of an unconstrained optimization problem, it can be solved using gradient descent. The iterative formula for gradient descent is:

\[\begin{align*} \mathbf{w}^{t+1} &= \mathbf{w}^t - \eta^t \nabla f(\mathbf{w}^t) \\ \therefore \mathbf{w}^{t+1} &= \mathbf{w}^t - \eta^t \left [ 2(\mathbf{X}\mathbf{X}^T)\mathbf{w}^t - 2(\mathbf{X}\mathbf{y}) \right ] \end{align*}\]

Here, \(\eta^t\) is a scalar step size that controls how far each update moves along the negative gradient, and \(t\) denotes the current iteration.

Implementing Gradient Descent in Python

Let’s compute the \(w\) vector using gradient descent and print the coefficients and intercept.

w = np.ones((X.shape[0], 1))   # initial guess for w
eta = 1e-6                     # step size
t = 100000                     # number of iterations

for i in range(t):
    # full-batch gradient: 2(X X^T)w - 2(X y)
    grad = 2*(X @ X.T @ w) - 2*(X @ y)
    w = w - eta*grad
    
print ('Coefficients: ', w.reshape((-1,))[:3])
print ('Intercept: ', w.reshape((-1,))[-1])

Coefficients: [10.85347843, 7.51750331, 9.59591605]

Intercept: 65.21825985969691
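Since gradient descent only approximates the minimizer, it is worth checking how far the final iterate is from the closed-form solution. A small sketch (the closed-form solution is recomputed here because the variable w was reused above):

# recompute the closed-form solution, since w now holds the gradient-descent iterate
w_closed = np.linalg.pinv(X @ X.T) @ X @ y

# distance between the gradient-descent iterate and the closed-form optimum
print('||w_gd - w_closed||: ', np.linalg.norm(w - w_closed))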

Even with gradient descent, every update requires a pass over the entire dataset to compute \(\mathbf{X}\mathbf{X}^T\mathbf{w}\), which remains computationally expensive for large datasets. Is there a way to further enhance this process?

Stochastic Gradient Descent

Stochastic gradient descent (SGD) is an optimization algorithm widely used in machine learning to minimize the loss function of a model. Unlike traditional gradient descent, which updates the model parameters using the entire dataset, SGD updates the parameters using a small, randomly selected subset of the data, known as a mini-batch. This leads to much cheaper updates and makes SGD particularly suitable for handling large datasets.

Instead of updating \(\mathbf{w}\) using the entire dataset at each step \(t\), SGD leverages a small randomly selected subset of \(k\) data points to update \(\mathbf{w}\). Consequently, the new gradient becomes \(2(\tilde{\mathbf{X}}\tilde{\mathbf{X}}^T\mathbf{w}^t - \tilde{\mathbf{X}}\tilde{\mathbf{y}})\), where \(\tilde{\mathbf{X}}\) and \(\tilde{\mathbf{y}}\) represent small samples randomly chosen from the dataset. This strategy is feasible since \(\tilde{\mathbf{X}} \in \mathbb{R}^{d \times k}\), which is considerably smaller compared to \(\mathbf{X}\).

After \(T\) rounds of training, the final estimate is obtained as follows:

\[ \mathbf{w}_{\text{SGD}}^T = \frac{1}{T} \sum_{i=1}^T \mathbf{w}^i \]

Averaging the iterates helps smooth out the noise introduced by the random sampling, so the averaged estimate typically ends up close to the optimum even though the individual iterates keep fluctuating.

Implementing Stochastic Gradient Descent in Python

Let’s compute the \(w\) vector using stochastic gradient descent and print the coefficients and intercept.

w = np.ones((X.shape[0], 1))   # initial guess for w
eta = 1e-4                     # step size
t = 100000                     # number of iterations
n = X.shape[1]                 # number of samples
b = 10                         # mini-batch size

for i in range(t):
    # randomly select a mini-batch of samples
    idx = np.random.choice(n, b, replace=False)
    X_b = X[:, idx]
    y_b = y[idx]
    # compute the gradient for the mini-batch
    grad = 2*(X_b @ X_b.T @ w) - 2*(X_b @ y_b)
    # update the weights (the final iterate is reported below;
    # averaging the iterates is sketched after the output)
    w = w - eta*grad
    
print ('Coefficients: ', w.reshape((-1,))[:3])
print ('Intercept: ', w.reshape((-1,))[-1])

Coefficients: [10.72000912, 8.3366805, 9.13970723]

Intercept: 65.2366350549217
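The code above reports the final iterate rather than the averaged estimate \(\mathbf{w}_{\text{SGD}}^T\) given by the formula earlier. A minimal sketch of how iterate averaging can be folded into the same loop, reusing X, y and the hyperparameters defined above:

w = np.ones((X.shape[0], 1))
w_sum = np.zeros_like(w)        # running sum of the iterates

for i in range(t):
    idx = np.random.choice(n, b, replace=False)
    X_b, y_b = X[:, idx], y[idx]
    w = w - eta*(2*(X_b @ X_b.T @ w) - 2*(X_b @ y_b))
    w_sum += w                  # accumulate the current iterate

w_sgd = w_sum / t               # averaged estimate, as in the formula above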