Backpropagation
Backpropagation is the process of computing the gradients of the loss function with respect to the weights and biases of a neural network by propagating the error backward through its layers; these gradients are then used to iteratively adjust the parameters and reduce the loss.
- Output Layer Error Derivative:
- \(\frac{\partial \mathcal{J}}{\partial z^{[L](i)}} = \frac{1}{m} (\hat{y}^{(i)} - y^{(i)}) \odot g'(z^{[L](i)})\)
- Calculate the derivative of the cost function with respect to the pre-activation value of the output layer for each training example.
- \(\odot\) denotes element-wise multiplication.
- \(g'(\cdot)\) represents the derivative of the activation function used in the output layer.
- Gradient Calculation for Output Layer Weights and Bias:
- \(\frac{\partial \mathcal{J}}{\partial W^{[L]}} = \sum_i \frac{\partial \mathcal{J}}{\partial z^{[L](i)}} \, a^{[L-1](i)\,T}\)
- Compute the gradient of the loss function with respect to the weights of the output layer.
- \(\frac{\partial \mathcal{J}}{\partial b^{[L]}} = \sum_i \frac{\partial \mathcal{J}}{\partial z^{[L](i)}}\)
- Compute the gradient of the loss function with respect to the bias of the output layer.
- Backpropagate the Error to Previous Layers:
- \(\frac{\partial \mathcal{J}}{\partial z^{[l](i)}} = W^{[l+1]\,T} \frac{\partial \mathcal{J}}{\partial z^{[l+1](i)}} \odot g'(z^{[l](i)})\)
- Propagate the error derivative from layer \(l+1\) back to layer \(l\) using the chain rule.
- Involves the transpose of the weight matrix \(W^{[l+1]}\) connecting layer \(l\) to layer \(l+1\).
- Involves the derivative of the activation function used in layer \(l\).
- Gradient Calculation for Weights and Bias of Hidden Layers:
- \(\frac{\partial \mathcal{J}}{\partial W^{[l]}} = \sum_i \frac{\partial \mathcal{J}}{\partial z^{[l](i)}} \, a^{[l-1](i)\,T}\)
- Compute the gradient of the loss function with respect to the weights of layer \(l\).
- \(\frac{\partial \mathcal{J}}{\partial b^{[l]}} = \sum_i \frac{\partial \mathcal{J}}{\partial z^{[l](i)}}\)
- Compute the gradient of the loss function with respect to the bias of layer \(l\).
These four steps form the core of backpropagation in a shallow neural network: they yield the gradients needed to iteratively adjust the weights and biases, minimizing the loss function and improving the model's predictions. A vectorized sketch of the steps is given below, followed by a two-layer implementation in the Code section.
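As a bridge between the per-example formulas above and the two-layer code below, here is a minimal vectorized NumPy sketch of the four steps for an arbitrary number of layers. It is an illustration under stated assumptions, not the original implementation: the function and argument names (`backprop_generic`, `weights`, `activations`) are hypothetical, the hidden layers are assumed to use tanh (so \(g'(z) = 1 - a^2\)), and the output layer is assumed to be a sigmoid trained with cross-entropy, so the output error reduces to \(A^{[L]} - Y\).

```python
import numpy as np

def backprop_generic(weights, activations, Y):
    """
    Vectorized sketch of the four steps above (hypothetical helper, not from the source).

    weights     -- list [W1, ..., WL], where Wl has shape (n_l, n_{l-1})
    activations -- list [A0 (= X), A1, ..., AL] from the forward pass, Al of shape (n_l, m)
    Y           -- true labels of shape (n_L, m)

    Assumes tanh hidden layers and a sigmoid output trained with cross-entropy,
    so the output-layer error is simply AL - Y.
    """
    m = Y.shape[1]
    L = len(weights)
    grads = {}

    # Step 1: output-layer error derivative.
    dZ = activations[L] - Y

    for l in range(L, 0, -1):
        # Steps 2 and 4: gradients for the weights and bias of layer l
        # (the matrix products sum over the training examples).
        grads[f"dW{l}"] = (1 / m) * dZ @ activations[l - 1].T
        grads[f"db{l}"] = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
        if l > 1:
            # Step 3: backpropagate the error to layer l-1; for tanh, g'(z) = 1 - a**2.
            dZ = weights[l - 1].T @ dZ * (1 - activations[l - 1] ** 2)

    return grads
```

Each column of the matrices holds one training example \(i\), so the matrix products carry out the sums over \(i\) that appear in the weight and bias gradients above.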
Code
```python
import numpy as np

def backward_propagation(parameters, cache, X, Y):
    """
    Implement the backward propagation using the instructions above.

    Arguments:
    parameters -- python dictionary containing our parameters
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    X -- input data of shape (2, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)

    Returns:
    grads -- python dictionary containing your gradients with respect to different parameters
    """
    m = X.shape[1]                                  # number of training examples
    W1, W2 = parameters["W1"], parameters["W2"]
    A1, A2 = cache["A1"], cache["A2"]

    # Output layer: with a sigmoid output and cross-entropy loss, g'(z) cancels and
    # the error simplifies to A2 - Y. The 1/m factor is applied to dW and db here
    # rather than folded into dZ as in the formulas above; the result is the same.
    dZ2 = A2 - Y
    dW2 = (1 / m) * dZ2 @ A1.T
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)

    # Hidden layer: tanh activation, so g'(z) = 1 - A1**2.
    dZ1 = W2.T @ dZ2 * (1 - A1 ** 2)
    dW1 = (1 / m) * dZ1 @ X.T
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}

    return grads
```
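The following usage sketch shows how `backward_propagation` is meant to be called. The forward pass, the layer sizes, the random initialization, and the learning rate are assumptions added for illustration and are not part of this section; the hidden layer uses tanh and the output a sigmoid, matching the `1 - A1**2` and `A2 - Y` terms in the code above.

```python
import numpy as np

def forward_propagation(parameters, X):
    """Hypothetical forward pass matching the backward step above: tanh hidden layer, sigmoid output."""
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = 1 / (1 + np.exp(-Z2))                      # sigmoid
    return {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}

# Hypothetical sizes: 2 input features, 4 hidden units, 1 output unit, 5 examples.
rng = np.random.default_rng(0)
parameters = {
    "W1": rng.standard_normal((4, 2)) * 0.01, "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)) * 0.01, "b2": np.zeros((1, 1)),
}
X = rng.standard_normal((2, 5))
Y = (rng.standard_normal((1, 5)) > 0).astype(float)

cache = forward_propagation(parameters, X)
grads = backward_propagation(parameters, cache, X, Y)
print({name: g.shape for name, g in grads.items()})
# -> {'dW1': (4, 2), 'db1': (4, 1), 'dW2': (1, 4), 'db2': (1, 1)}

# One gradient-descent step: the "iterative adjustment of weights and biases" described above.
learning_rate = 0.1                                 # assumed value
for name in ("W1", "b1", "W2", "b2"):
    parameters[name] = parameters[name] - learning_rate * grads["d" + name]
```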