Linear Regression
Linear regression is a supervised learning algorithm employed to predict a continuous output variable based on one or more input features, assuming a linear relationship between the input and output variables. The primary objective of linear regression is to determine the line of best fit that minimizes the sum of squared errors between the predicted and actual output values.
Given a dataset \(\mathbf{x}_1, \ldots, \mathbf{x}_n\) where each \(\mathbf{x}_i\) belongs to \(\mathbb{R}^d\), and the corresponding labels \(\mathbf{y}_1, \ldots, \mathbf{y}_n\) belong to \(\mathbb{R}\), the goal of linear regression is to find a mapping between the input and output variables, represented as follows:
\[ h: \mathbb{R}^d \rightarrow \mathbb{R} \]
The error for this mapping function can be quantified as:
\[ \text{error}(h) = \sum_{i=1}^n (h(\mathbf{x}_i) - \mathbf{y}_i)^2 \]
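As a quick illustration, the error of any candidate mapping can be computed directly from this definition. The snippet below is a minimal sketch; the toy data and the candidate linear guess are made up purely for this example:

import numpy as np

# toy data: three points in R^2 with scalar labels (made up for illustration)
X_toy = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])   # each row is one x_i
y_toy = np.array([2.0, 2.5, 3.5])

# an arbitrary candidate mapping h(x); here a simple linear guess
def h(x):
    return 0.8 * x[0] + 0.4 * x[1]

# sum of squared errors over the dataset
error = sum((h(x_i) - y_i) ** 2 for x_i, y_i in zip(X_toy, y_toy))
print(error)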
Ideally, this error should be minimized; it is zero when \(h(\mathbf{x}_i) = \mathbf{y}_i\) for all \(i\). However, a mapping that achieves this by simply memorizing the data and its outputs is not a desired outcome, since it tells us nothing about unseen inputs.
To mitigate the memorization problem, we impose structure on the mapping. The simplest and most commonly used structure is linear, which we will adopt as the underlying structure for our data.
Let \(\mathcal{H}_{\text{linear}}\) denote the solution space for the mapping in the linear domain:
\[ \mathcal{H}_{\text{linear}} = \left\lbrace h_{\mathbf{w}}: \mathbb{R}^d \rightarrow \mathbb{R} \ \text{s.t.} \ h_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T\mathbf{x}, \ \mathbf{w} \in \mathbb{R}^d \right\rbrace \]
Thus, our objective is to minimize:
\[\begin{align*} \min_{h \in \mathcal{H}_{\text{linear}}} \sum_{i=1}^n (h(\mathbf{x}_i) - \mathbf{y}_i)^2 \\ \text{Equivalently,} \\ \min_{\mathbf{w} \in \mathbb{R}^d} \sum_{i=1}^n (\mathbf{w}^T\mathbf{x}_i - \mathbf{y}_i)^2 \end{align*}\]
Optimizing the above objective is the main aim of the linear regression algorithm.
Optimizing the Error Function
The minimization can be expressed in vectorized form. Let \(\mathbf{X} \in \mathbb{R}^{d \times n}\) denote the matrix whose \(i\)-th column is \(\mathbf{x}_i\), and \(\mathbf{y} \in \mathbb{R}^n\) the vector of labels. Then the objective becomes:
\[ \min_{\mathbf{w} \in \mathbb{R}^d} \|\mathbf{X}^T\mathbf{w} - \mathbf{y}\|_2^2 \]
Define \(f(\mathbf{w})\) as the objective of this minimization problem:
\[\begin{align*} f(\mathbf{w}) &= \|\mathbf{X}^T\mathbf{w} - \mathbf{y}\|_2^2 \\ &= (\mathbf{X}^T\mathbf{w} - \mathbf{y})^T(\mathbf{X}^T\mathbf{w} - \mathbf{y}) \\ \therefore \nabla f(\mathbf{w}) &= 2(\mathbf{X}\mathbf{X}^T)\mathbf{w} - 2(\mathbf{X}\mathbf{y}) \end{align*}\]
Setting the gradient to zero, we obtain:
\[\begin{align*} (\mathbf{X}\mathbf{X}^T)\mathbf{w}^* &= \mathbf{X}\mathbf{y} \\ \therefore \mathbf{w}^* &= (\mathbf{X}\mathbf{X}^T)^+\mathbf{X}\mathbf{y} \end{align*}\]
Here, \((\mathbf{X}\mathbf{X}^T)^+\) represents the pseudo-inverse of \(\mathbf{X}\mathbf{X}^T\).
Further analysis reveals that \(\mathbf{X}^T\mathbf{w}^*\) is the orthogonal projection of the label vector \(\mathbf{y}\) onto the subspace spanned by the features, i.e., the column space of \(\mathbf{X}^T\).
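Before moving to a real dataset, here is a small numerical check of the closed-form solution and the projection interpretation. The synthetic data and variable names below are made up purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
X_syn = rng.normal(size=(d, n))                          # columns are data points
w_true = np.array([[1.0], [-2.0], [0.5]])
y_syn = X_syn.T @ w_true + 0.1 * rng.normal(size=(n, 1))

# closed-form solution via the pseudo-inverse
w_star = np.linalg.pinv(X_syn @ X_syn.T) @ X_syn @ y_syn

# X^T w* should equal the orthogonal projection of y onto the column space of X^T
P = X_syn.T @ np.linalg.pinv(X_syn @ X_syn.T) @ X_syn    # projection matrix
print(np.allclose(X_syn.T @ w_star, P @ y_syn))          # expect True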
Implementing Linear Regression Algorithm (Closed Form) in Python
We will be using the FuelConsumption.csv dataset.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.read_csv("FuelConsumption.csv")
# take a look at the dataset
df.head()
MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
Let’s select some features to explore more.
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head()
ENGINESIZE | CYLINDERS | FUELCONSUMPTION_COMB | CO2EMISSIONS |
---|---|---|---|
2.0 | 4 | 8.5 | 196 |
2.4 | 4 | 9.6 | 221 |
1.5 | 4 | 5.9 | 136 |
3.5 | 6 | 11.1 | 255 |
3.5 | 6 | 10.6 | 244 |
Let’s add a column of ones to the features (to model the intercept term), split the dataset into features and labels, and convert them into numpy arrays for computation.
# add a column of ones to model the intercept term
cdf = cdf.assign(Ones=1)

X = np.asanyarray(cdf[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB', 'Ones']]).T
y = np.asanyarray(cdf[['CO2EMISSIONS']])

X.shape, y.shape
Let’s compute the \(\mathbf{w}\) vector and print the coefficients and intercept.
w = np.linalg.pinv(X @ X.T) @ X @ y
print ('Coefficients: ', w.reshape((-1,))[:3])
print ('Intercept: ', w.reshape((-1,))[-1])
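As an optional cross-check (not part of the derivation above, and assuming scikit-learn is installed), the same fit can be reproduced with LinearRegression, which should report matching coefficients and intercept:

from sklearn.linear_model import LinearRegression

# fit on the same three features; sklearn adds the intercept term itself
reg = LinearRegression()
reg.fit(cdf[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']], cdf['CO2EMISSIONS'])

print ('Coefficients: ', reg.coef_)
print ('Intercept: ', reg.intercept_)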
Gradient Descent
The normal equation for linear regression, as shown above, involves calculating \((\mathbf{X}\mathbf{X}^T)^+\), which can be computationally expensive with a complexity of \(O(d^3)\).
Since \(\mathbf{w}^*\) represents the solution of an unconstrained optimization problem, it can be solved using gradient descent. The iterative formula for gradient descent is:
\[\begin{align*} \mathbf{w}^{t+1} &= \mathbf{w}^t - \eta^t \nabla f(\mathbf{w}^t) \\ \therefore \mathbf{w}^{t+1} &= \mathbf{w}^t - \eta^t \left [ 2(\mathbf{X}\mathbf{X}^T)\mathbf{w}^t - 2(\mathbf{X}\mathbf{y}) \right ] \end{align*}\]
Here, \(\eta\) is a scalar that controls the step-size of the descent, and \(t\) represents the current iteration.
Implementing Gradient Descent in Python
Let’s compute the \(\mathbf{w}\) vector using gradient descent and print the coefficients and intercept.
w = np.ones((X.shape[0], 1))
eta = 1e-6
t = 100000

for i in range(t):
    grad = 2*(X @ X.T @ w) - 2*(X @ y)
    w = w - eta*grad
print ('Coefficients: ', w.reshape((-1,))[:3])
print ('Intercept: ', w.reshape((-1,))[-1])
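As a quick sanity check (a sketch added here for illustration), the gradient descent estimate can be compared against the closed-form solution; if the descent has converged, the difference should be small:

# recompute the closed-form solution and compare with the gradient descent result
w_closed = np.linalg.pinv(X @ X.T) @ X @ y
print ('Max absolute difference: ', np.max(np.abs(w - w_closed)))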
Even with this update rule, every iteration still requires computing \(\mathbf{X}\mathbf{X}^T\), which remains computationally expensive. Is there a way to further improve this?
Stochastic Gradient Descent
Stochastic gradient descent (SGD) is an optimization algorithm widely employed in machine learning to minimize the loss function of a model by determining the optimal parameters. Unlike traditional gradient descent, which updates the model parameters based on the entire dataset, SGD updates the parameters using a randomly selected subset of the data, known as a batch. This approach leads to faster training times and makes SGD particularly suitable for handling large datasets.
Instead of updating \(\mathbf{w}\) using the entire dataset at each step \(t\), SGD leverages a small randomly selected subset of \(k\) data points to update \(\mathbf{w}\). Consequently, the new gradient becomes \(2(\tilde{\mathbf{X}}\tilde{\mathbf{X}}^T\mathbf{w}^t - \tilde{\mathbf{X}}\tilde{\mathbf{y}})\), where \(\tilde{\mathbf{X}}\) and \(\tilde{\mathbf{y}}\) represent small samples randomly chosen from the dataset. This strategy is feasible since \(\tilde{\mathbf{X}} \in \mathbb{R}^{d \times k}\), which is considerably smaller compared to \(\mathbf{X}\).
After \(T\) rounds of training, the final estimate is obtained as follows:
\[ \mathbf{w}_{\text{SGD}}^T = \frac{1}{T} \sum_{i=1}^T \mathbf{w}^i \]
Since each mini-batch gradient is an unbiased estimate of the full gradient, the updates move toward the optimum on average, and averaging the iterates dampens the noise introduced by sampling.
Implementing Stochastic Gradient Descent in Python
Let’s compute the \(\mathbf{w}\) vector using stochastic gradient descent and print the coefficients and intercept.
w = np.ones((X.shape[0], 1))
eta = 1e-4
t = 100000
n = X.shape[1]
b = 10

for i in range(t):
    # randomly select a batch of samples
    idx = np.random.choice(n, b, replace=False)
    X_b = X[:, idx]
    y_b = y[idx]
    # compute the gradient for the batch
    grad = 2*(X_b @ X_b.T @ w) - 2*(X_b @ y_b)
    # update the weights
    w = w - eta*grad
print ('Coefficients: ', w.reshape((-1,))[:3])
print ('Intercept: ', w.reshape((-1,))[-1])
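The loop above simply reports the final iterate. A minimal sketch of the averaged estimate \(\mathbf{w}_{\text{SGD}}^T\) described earlier, reusing the same hyperparameters (eta, t, n, b) defined above, could look like this:

w = np.ones((X.shape[0], 1))
w_sum = np.zeros_like(w)

for i in range(t):
    idx = np.random.choice(n, b, replace=False)
    X_b, y_b = X[:, idx], y[idx]
    grad = 2*(X_b @ X_b.T @ w) - 2*(X_b @ y_b)
    w = w - eta*grad
    w_sum += w                      # accumulate the iterates

w_sgd = w_sum / t                   # average of the iterates

print ('Coefficients: ', w_sgd.reshape((-1,))[:3])
print ('Intercept: ', w_sgd.reshape((-1,))[-1])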