Backpropagation

Backpropagation computes the gradients of the loss function with respect to every weight and bias in the network by propagating the error backward through the layers; these gradients are then used to iteratively adjust the parameters, minimizing the loss and improving the network’s predictions.
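
Throughout, \(z^{[l] (i)}\) and \(a^{[l] (i)}\) denote the pre-activation value and the activation of layer \(l\) for training example \(i\), produced by the forward pass \(z^{[l] (i)} = W^{[l]} a^{[l-1] (i)} + b^{[l]}\) and \(a^{[l] (i)} = g(z^{[l] (i)})\), with \(a^{[0] (i)} = x^{(i)}\) and \(\hat{y}^{(i)} = a^{[L] (i)}\); \(m\) is the number of training examples and \(\mathcal{J}\) is the loss averaged over them.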

  1. Output Layer Error Derivative:
    • \(\frac{\partial \mathcal{J} }{ \partial z^{[L] (i)} } = \frac{1}{m} (\hat{y}^{(i)} - y^{(i)}) \odot g'(z^{[L] (i)})\)
      • Calculate the derivative of the loss function with respect to the pre-activation value of the output layer for each training example.
      • \(\odot\) denotes element-wise multiplication.
      • \(g'(\cdot)\) denotes the derivative of the activation function used in the output layer.
      • For a sigmoid output unit trained with the binary cross-entropy loss, the \(g'\) term cancels and this reduces to \(\frac{1}{m} (\hat{y}^{(i)} - y^{(i)})\); the code below uses this simplified form.
  2. Gradient Calculation for Output Layer Weights and Bias:
    • \(\frac{\partial \mathcal{J} }{ \partial W^{[L]} } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z^{[L] (i)} } a^{[L-1] (i) T}}\)
      • Compute the gradient of the loss function with respect to the weights of the output layer, summing the per-example contributions.
    • \(\frac{\partial \mathcal{J} }{ \partial b^{[L]} } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z^{[L] (i)}}}\)
      • Compute the gradient of the loss function with respect to the bias of the output layer.
  3. Backpropagate the Error to Previous Layers:
    • \(\frac{\partial \mathcal{J} }{ \partial z^{[l] (i)} } = W^{[l+1] T} \frac{\partial \mathcal{J} }{ \partial z^{[l+1] (i)} } \odot g'(z^{[l] (i)})\)
      • Propagate the error derivative backward through the layers by applying the chain rule between consecutive layers.
      • Involves the transpose of the weight matrix \(W^{[l+1]}\) connecting layer \(l\) to layer \(l+1\).
      • Involves the derivative of the activation function used in layer \(l\).
  4. Gradient Calculation for Weights and Bias of Hidden Layers:
    • \(\frac{\partial \mathcal{J} }{ \partial W^{[l]} } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z^{[l] (i)} } a^{[l-1] (i) T}}\)
      • Compute the gradient of the loss function with respect to the weights of layer \(l\).
    • \(\frac{\partial \mathcal{J} }{ \partial b^{[l]} } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z^{[l] (i)}}}\)
      • Compute the gradient of the loss function with respect to the bias of layer \(l\). (A vectorized summary of all four steps follows this list.)
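
Stacking the \(m\) training examples as the columns of matrices \(Z^{[l]}\), \(A^{[l]}\), and \(Y\) (with \(A^{[0]} = X\)), the four steps above take the vectorized form implemented by the code below. Here the \(\frac{1}{m}\) factor is applied when computing the weight and bias gradients, and a sigmoid output with the binary cross-entropy loss is assumed, so the \(g'\) term at the output cancels:

  • \(dZ^{[L]} = A^{[L]} - Y\)
  • \(dZ^{[l]} = W^{[l+1] T} dZ^{[l+1]} \odot g'(Z^{[l]})\)
  • \(dW^{[l]} = \frac{1}{m} dZ^{[l]} A^{[l-1] T}\)
  • \(db^{[l]} = \frac{1}{m} \sum_i{dZ^{[l] (i)}}\)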

These steps form the core of backpropagation, enabling the iterative adjustment of weights and biases to minimize the loss function and improve the model’s predictions on given inputs. The implementation below specializes them to a shallow network with a single hidden layer.
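
Once the gradients are available, each parameter is typically updated with a gradient-descent step using a learning rate \(\alpha\):

  • \(W^{[l]} := W^{[l]} - \alpha \frac{\partial \mathcal{J} }{ \partial W^{[l]} }\)
  • \(b^{[l]} := b^{[l]} - \alpha \frac{\partial \mathcal{J} }{ \partial b^{[l]} }\)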

import numpy as np

def backward_propagation(parameters, cache, X, Y):
    """
    Implement the backward propagation using the instructions above.

    Arguments:
    parameters -- python dictionary containing the parameters "W1", "b1", "W2", "b2"
    cache -- python dictionary containing "Z1", "A1", "Z2" and "A2"
    X -- input data of shape (2, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)

    Returns:
    grads -- python dictionary containing the gradients with respect to the different parameters
    """
    m = X.shape[1]  # number of training examples

    W1, W2 = parameters["W1"], parameters["W2"]
    A1, A2 = cache["A1"], cache["A2"]

    # Step 1: output-layer error. For a sigmoid output with the binary
    # cross-entropy loss, the g' term cancels and dZ2 = A2 - Y.
    # The 1/m factor is applied below, when dW and db are computed.
    dZ2 = A2 - Y

    # Step 2: gradients of the output-layer weights and bias.
    dW2 = (1 / m) * dZ2 @ A1.T
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)

    # Step 3: backpropagate to the hidden layer. The hidden activation is
    # tanh, whose derivative expressed in terms of its output is 1 - A1**2.
    dZ1 = (W2.T @ dZ2) * (1 - A1**2)

    # Step 4: gradients of the hidden-layer weights and bias.
    dW1 = (1 / m) * dZ1 @ X.T
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}

    return grads
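
A minimal usage sketch follows, assuming the backward_propagation function above. The forward_propagation helper, the hidden-layer size of 4, and the random data are illustrative assumptions rather than part of the original exercise, but they match the tanh hidden layer and sigmoid output implied by the 1 - A1**2 and A2 - Y terms.

import numpy as np

def forward_propagation(parameters, X):
    # Illustrative forward pass: tanh hidden layer, sigmoid output.
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = 1 / (1 + np.exp(-Z2))  # sigmoid
    return {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))                  # 2 features, 5 examples
Y = (rng.random((1, 5)) > 0.5).astype(float)     # binary labels
parameters = {
    "W1": rng.standard_normal((4, 2)) * 0.01, "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)) * 0.01, "b2": np.zeros((1, 1)),
}

cache = forward_propagation(parameters, X)
grads = backward_propagation(parameters, cache, X, Y)
print({k: v.shape for k, v in grads.items()})
# Expected shapes: dW1 (4, 2), db1 (4, 1), dW2 (1, 4), db2 (1, 1)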