With the rise of deep learning in recent years, neural networks have become widely used to solve a wide variety of problems. This article introduces neural networks in detail, using a binary classification neural network as the example.
The complete code for this chapter can be found in .
Neural Networks
A neural network is built by connecting a large number of neurons. It consists of three kinds of layers: an input layer that receives the data, an output layer that produces the result, and hidden layers composed of a large number of neurons in between, as shown below.
The input layer and the output layer each appear only once, while there can be several hidden layers. In addition, when we say that the network in the figure below is a three-layer neural network, we are counting the hidden layers plus the output layer; the input layer is not included.
Each neuron has an input vector $\mathbf{x}$, a weight vector $\mathbf{w}$, a scalar bias $b$, and a non-linear function $g$. The neuron computes the inner product of $\mathbf{w}$ and $\mathbf{x}$ and adds $b$ to get $z = \mathbf{w} \cdot \mathbf{x} + b$, and then feeds $z$ into $g$ to get an output value $a = g(z)$. This output value $a$ becomes one of the input values $x_i$ of the neurons in the next layer. The non-linear function $g$ is called the activation function, and $a$ is called the activation value.
A group of neurons, from a few to a great many, forms a layer. Each layer takes the output of the previous layer as its input, and the values computed by its neurons become the input of the next layer. Connected layer by layer in this way, these layers form the hidden part of the network.
Activation Functions
If a neuron does not contain a non-linear function, it only performs linear operations, so even a large number of connected neurons amounts to nothing more than a multiple linear regression. Linear regression models cannot solve complex real-world problems, but such problems can be approximated with non-linear functions. The non-linear functions in neurons are called activation functions.
Below we introduce four commonly used activation functions. We also show how to compute their derivatives, because we will need them later in backpropagation.
Sigmoid Function and its Derivatives
The sigmoid function converts the input value into a value between 0 and 1, as shown below. It is often used as the activation function of the output layer in neural networks for binary classification. We can think of this output value as a probability. For example, suppose we want a neural network to determine whether there is a cat in a picture, where 1 means it is a cat and 0 means it is not. When the output value is greater than 0.5, the prediction is 1; when it is less than or equal to 0.5, the prediction is 0.
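For reference, the sigmoid function described above can be written as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$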
The implementation of the sigmoid function is as follows.
import numpy as np


def sigmoid(Z):
    """
    Implements the sigmoid function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the sigmoid function

    Returns
    -------
    A: (ndarray of same shape as Z) or (scalar) - output from the sigmoid function
    """
    A = 1 / (1 + np.exp(-Z))
    return A
The derivative of the sigmoid function is as follows.
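Applying the chain rule to $\sigma(z) = (1 + e^{-z})^{-1}$ gives the well-known result that sigmoid_derivative() below computes:

$$\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$$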
The derivative of the sigmoid function is implemented as follows.
def sigmoid_derivative(Z):
    """
    Implements the derivative of the sigmoid function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the sigmoid function

    Returns
    -------
    dZ: (ndarray of the same shape as Z) or (scalar) - derivative of the sigmoid function with respect to Z
    """
    g = 1 / (1 + np.exp(-Z))
    dZ = g * (1 - g)
    return dZ
Tanh Function and its Derivatives
Tanh function is very similar to the sigmoid function, but the output value of the tanh function is between -1 and 1, as shown below.
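For reference, the tanh function can be written as:

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$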
The implementation of the tanh function is as follows.
def tanh(Z):
    """
    Implements the tanh function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the tanh function

    Returns
    -------
    A: (ndarray of same shape as Z) or (scalar) - output from the tanh function
    """
    A = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))
    return A
The derivative of the tanh function is derived as follows:
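Differentiating the quotient above gives the result that tanh_derivative() below computes:

$$\tanh'(z) = 1 - \tanh^2(z)$$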
The derivative of the tanh function is implemented as follows.
def tanh_derivative(Z):
    """
    Implements the derivative of the tanh function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the tanh function

    Returns
    -------
    dZ: (ndarray of the same shape as Z) or (scalar) - derivative of the tanh function with respect to Z
    """
    g = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))
    dZ = 1 - g ** 2
    return dZ
ReLU Function and its Derivatives
The ReLU (rectified linear unit) function is widely used in neural networks. When $z$ is less than or equal to 0, it outputs 0; when $z$ is greater than 0, it outputs $z$. Because it only involves a comparison, ReLU is very fast to compute.
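For reference, the ReLU function can be written as:

$$\mathrm{ReLU}(z) = \max(0, z)$$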
The implementation of the ReLU function is as follows.
def relu(Z):
    """
    Implements the ReLU function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the ReLU function

    Returns
    -------
    A: (ndarray of same shape as Z) or (scalar) - output from the ReLU function
    """
    A = np.maximum(0, Z)
    return A
The derivative of ReLU is as follows. When $z$ is less than 0, the derivative is 0; when $z$ is greater than 0, the derivative is 1; when $z$ equals 0, the derivative is undefined. In practice, by convention, the derivative is set to 1 when $z$ equals 0.
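Written as a piecewise function, with the convention above applied at $z = 0$:

$$\mathrm{ReLU}'(z) = \begin{cases} 0 & z < 0 \\ 1 & z \ge 0 \end{cases}$$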
The derivative of the ReLU function is implemented as follows.
def relu_derivative(Z):
    """
    Implements the derivative of the ReLU function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the ReLU function

    Returns
    -------
    dZ: (ndarray of the same shape as Z) or (scalar) - derivative of the ReLU function with respect to Z
    """
    # The derivative is 1 where Z >= 0 and 0 where Z < 0.
    dZ = np.ones_like(Z, dtype=float)
    dZ[Z < 0] = 0
    return dZ
Leaky ReLU Function and its Derivatives
The leaky ReLU function is a variant of the ReLU function. When $z$ is less than 0, it outputs $\alpha z$, where $\alpha$ is a small value between 0 and 1 (the negative_slope parameter in the code below); when $z$ is greater than or equal to 0, it outputs $z$.
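Written as a formula, this is equivalent to the expression used in the code below:

$$\mathrm{LeakyReLU}(z) = \begin{cases} z & z \ge 0 \\ \alpha z & z < 0 \end{cases} = \max(0, z) + \alpha \min(0, z)$$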
The implementation of the leaky ReLU function is as follows.
def leaky_relu(Z, negative_slope=0.01):
    """
    Implements the leaky ReLU function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the leaky ReLU function
    negative_slope: (float) - the slope for negative values

    Returns
    -------
    A: (ndarray of same shape as Z) or (scalar) - output from the leaky ReLU function
    """
    A = np.maximum(0, Z) + negative_slope * np.minimum(0, Z)
    return A
The derivative of leaky ReLU is as follows. When $z$ is less than 0, the derivative is $\alpha$; when $z$ is greater than 0, the derivative is 1; when $z$ equals 0, the derivative is undefined. In practice, by convention, the derivative is set to 1 when $z$ equals 0.
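Written as a piecewise function, with the convention above applied at $z = 0$:

$$\mathrm{LeakyReLU}'(z) = \begin{cases} \alpha & z < 0 \\ 1 & z \ge 0 \end{cases}$$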
The derivative of the Leaky ReLU function is implemented as follows.
def leaky_relu_derivative(Z, negative_slope=0.01):
    """
    Implements the derivative of the leaky ReLU function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the leaky ReLU function
    negative_slope: (float) - the slope for negative values

    Returns
    -------
    dZ: (ndarray of the same shape as Z) or (scalar) - derivative of the leaky ReLU function with respect to Z
    """
    # The derivative is 1 where Z >= 0 and negative_slope where Z < 0.
    dZ = np.ones_like(Z, dtype=float)
    dZ[Z < 0] = negative_slope
    return dZ
Binary Classification
The figure below shows a neural network for binary classification. Because it is a binary classification problem, the activation function of its output layer is the sigmoid function $\sigma$. There are quite a few variables in the figure: each layer has its own activation function, and each neuron has its own parameters $w$ and $b$. Vectorizing these variables and representing them as matrices greatly simplifies the formulas, as shown in the yellow part of the figure.
Therefore, the vectorized formulas for each layer are $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ and $A^{[l]} = g^{[l]}(Z^{[l]})$, and the dimensions of the arrays are as follows, where $n^{[l]}$ is the number of neurons in layer $l$, $n^{[0]}$ is the input size, $A^{[0]} = X$, and $m$ is the number of examples.

Layer | Shape of W | Shape of X (= $A^{[l-1]}$) | Shape of b | Shape of Z | Shape of A
---|---|---|---|---|---
1 | $(n^{[1]}, n^{[0]})$ | $(n^{[0]}, m)$ | $(n^{[1]}, 1)$ | $(n^{[1]}, m)$ | $(n^{[1]}, m)$
2 | $(n^{[2]}, n^{[1]})$ | $(n^{[1]}, m)$ | $(n^{[2]}, 1)$ | $(n^{[2]}, m)$ | $(n^{[2]}, m)$
L | $(n^{[L]}, n^{[L-1]})$ | $(n^{[L-1]}, m)$ | $(n^{[L]}, 1)$ | $(n^{[L]}, m)$ | $(n^{[L]}, m)$
Gradient Descent
The gradient descent of a neural network is as follows. Since each layer has its own parameters $W^{[l]}$ and $b^{[l]}$, we must calculate the partial derivatives of $J$ with respect to the $W^{[l]}$ and $b^{[l]}$ of every layer.
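For reference, each parameter is then updated with the standard gradient descent rule, where $\alpha$ is the learning rate; this is what the update_parameters() function later in the article implements:

$$W^{[l]} := W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}, \qquad b^{[l]} := b^{[l]} - \alpha \frac{\partial J}{\partial b^{[l]}}$$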
The process of gradient descent is as shown below.
1. Initialize all parameters $W$ and $b$.
2. Calculate the output $A^{[L]}$, and store $W$, $b$, $A$, and $Z$ along the way, because we will need these values when calculating the partial derivatives in the next step. This part is forward propagation.
3. Compute the partial derivatives of $J$ with respect to each $W$ and $b$. This part is backward propagation.
4. Update all parameters $W$ and $b$.
5. Repeat steps 2 to 4 for a total of num_iterations times.
Cost Function
In a binary classification neural network, the activation function of the output layer is the sigmoid function. Therefore, we can use the cost function of logistic regression, the cross-entropy loss, as the cost function of the binary classification neural network.
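For reference, with $m$ examples, predictions $a^{[L](i)}$, and labels $y^{(i)}$, the cross-entropy cost that compute_cost() below implements is:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log a^{[L](i)} + \left(1 - y^{(i)}\right) \log\left(1 - a^{[L](i)}\right) \right]$$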
The following code implements the cost function.
def compute_cost(AL, Y):
    """
    Computes the cross-entropy loss.

    Parameters
    ----------
    AL: (ndarray (1, number of examples)) - the output of the last layer
    Y: (ndarray (1, number of examples)) - true labels

    Returns
    -------
    cost: (float) - the cross-entropy cost
    """
    m = Y.shape[1]
    cost = -(1 / m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL), axis=1, keepdims=True)
    cost = np.squeeze(cost)
    return cost
Parameter Initialization
We have previously listed the dimensions of all parameters $W$ and $b$. Their dimensions depend on the number of neurons in each layer, so when initializing the parameters, we must first decide the number of neurons in each layer.
In the following code, we initialize the parameters $W$ with scaled random numbers and the parameters $b$ with zeros.
def initialize_parameters(layer_dims):
    """
    Initializes parameters for a deep neural network.

    Parameters
    ----------
    layer_dims: (list) - the number of units of each layer in the network.

    Returns
    -------
    (dict) with keys where 1 <= l <= len(layer_dims) - 1:
        Wl: (ndarray (layer_dims[l], layer_dims[l-1])) - weight matrix for layer l
        bl: (ndarray (layer_dims[l], 1)) - bias vector for layer l
    """
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters[f'W{l}'] = np.random.randn(layer_dims[l], layer_dims[l - 1]) / np.sqrt(layer_dims[l - 1])
        parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))
    return parameters
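As a quick sanity check, we can verify that the returned shapes match the table above. The layer sizes here are made up purely for illustration and are not part of the later example:

# A tiny network: 4 inputs, one hidden layer with 3 neurons, 1 output neuron.
parameters = initialize_parameters([4, 3, 1])
print(parameters['W1'].shape, parameters['b1'].shape)  # (3, 4) (3, 1)
print(parameters['W2'].shape, parameters['b2'].shape)  # (1, 3) (1, 1)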
Forward Propagation
Forward propagation is the first half of gradient descent in a neural network. The output $A^{[l]}$ of each layer is the input of the next layer, so $A$ is passed along layer by layer, and each layer transforms its value. Finally, $A^{[L]}$ is the prediction $\hat{Y}$. Each layer stores the values it computes in caches, because they will be needed for backpropagation in the second half.
After we execute the entire gradient descent, we obtain the final parameters $W_{final}$ and $b_{final}$. Suppose we want to use this model to predict on new data $X_{new}$: we substitute the input $X_{new}$ and the parameters $W_{final}$ and $b_{final}$ into forward propagation, and the final result is the prediction for $X_{new}$.
In the following code, linear_forward() implements the linear forward part of each layer in the process. It not only returns $Z$, but also returns $A_{prev}$, $W$, and $b$ to the caller, which stores them in caches.
def linear_forward(A_prev, W, b):
    """
    Implements the linear part of a layer's forward propagation.

    Parameters
    ----------
    A_prev: (ndarray (size of previous layer, number of examples)) - activations from previous layer
    W: (ndarray (size of current layer, size of previous layer)) - weight matrix
    b: (ndarray (size of current layer, 1)) - bias vector

    Returns
    -------
    Z: (ndarray (size of current layer, number of examples)) - the input to the activation function
    cache: (tuple) - containing A_prev, W, b for backpropagation
    """
    Z = W @ A_prev + b
    cache = (A_prev, W, b)
    return Z, cache
In the following code, we implement the four activation functions again. These implementations are almost the same as the ones at the beginning of the article; the difference is that $Z$ is also returned to the caller, which stores it in caches.
def sigmoid(Z):
    """
    Implements the sigmoid activation.

    Parameters
    ----------
    Z: (ndarray of any shape) - input to the activation function

    Returns
    -------
    A: (ndarray of same shape as Z) - output of the activation function
    cache: (ndarray) - returning Z for backpropagation
    """
    A = 1 / (1 + np.exp(-Z))
    cache = Z
    return A, cache


def tanh(Z):
    """
    Implements the tanh activation.

    Parameters
    ----------
    Z: (ndarray of any shape) - input to the activation function

    Returns
    -------
    A: (ndarray of same shape as Z) - output of the activation function
    cache: (ndarray) - returning Z for backpropagation
    """
    A = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))
    cache = Z
    return A, cache


def relu(Z):
    """
    Implements the ReLU activation.

    Parameters
    ----------
    Z: (ndarray of any shape) - input to the activation function

    Returns
    -------
    A: (ndarray of same shape as Z) - output of the activation function
    cache: (ndarray) - returning Z for backpropagation
    """
    A = np.maximum(0, Z)
    cache = Z
    return A, cache


def leaky_relu(Z, negative_slope=0.01):
    """
    Implements the Leaky ReLU activation.

    Parameters
    ----------
    Z: (ndarray of any shape) - input to the activation function
    negative_slope: (float) - the slope for negative values

    Returns
    -------
    A: (ndarray of same shape as Z) - output of the activation function
    cache: (ndarray) - returning Z for backpropagation
    """
    A = np.maximum(0, Z) + negative_slope * np.minimum(0, Z)
    cache = Z
    return A, cache
In the following code, linear_activation_forward() implements one layer in the figure above. It first calls linear_forward() to obtain $Z$, and then passes $Z$ to an activation function to obtain $A$. Finally, $A$ and the cache are returned to the caller.
def linear_activation_forward(A_prev, W, b, activation_function):
    """
    Implements the forward propagation for the linear and activation layer.

    Parameters
    ----------
    A_prev: (ndarray (size of previous layer, number of examples)) - activations from previous layer
    W: (ndarray (size of current layer, size of previous layer)) - weight matrix
    b: (ndarray (size of current layer, 1)) - bias vector
    activation_function: (str) - the activation function to be used

    Returns
    -------
    A: (ndarray (size of current layer, number of examples)) - the output of the activation function
    cache: (tuple) - containing linear_cache (A_prev, W, b) and activation_cache (Z) for backpropagation
    """
    Z, linear_cache = linear_forward(A_prev, W, b)
    if activation_function == 'sigmoid':
        A, activation_cache = sigmoid(Z)
    elif activation_function == 'tanh':
        A, activation_cache = tanh(Z)
    elif activation_function == 'relu':
        A, activation_cache = relu(Z)
    elif activation_function == 'leaky_relu':
        A, activation_cache = leaky_relu(Z)
    else:
        raise ValueError(f'Activation function {activation_function} not supported.')
    cache = (linear_cache, activation_cache)
    return A, cache
In the following code, model_forward() implements the entire forward propagation. Finally, it returns $A^{[L]}$ and all the caches.
def model_forward(X, parameters, activation_functions):
    """
    Implements forward propagation for the entire network.

    Parameters
    ----------
    X: (ndarray (input size, number of examples)) - input data
    parameters: (dict) - output of initialize_parameters()
    activation_functions: (list) - the activation function for each layer. The first element is unused.

    Returns
    -------
    AL: (ndarray (output size, number of examples)) - the output of the last layer
    caches: (list of tuples) - containing caches for each layer
    """
    caches = []
    A = X
    L = len(activation_functions)
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters[f'W{l}'], parameters[f'b{l}'], activation_functions[l])
        caches.append(cache)
    return A, caches
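As a small illustration of how these pieces fit together, the following sketch runs a forward pass through a tiny two-layer network. The sizes and the random input here are made up for demonstration and are not part of the cat example later:

X = np.random.randn(4, 5)                  # 4 features, 5 examples
params = initialize_parameters([4, 3, 1])
activations = ['none', 'relu', 'sigmoid']  # the first entry is unused
AL, caches = model_forward(X, params, activations)
print(AL.shape)     # (1, 5), one prediction per example
print(len(caches))  # 2, one cache per layer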
Backpropagation or Backward Propagation
In gradient descent, we must calculate the partial derivatives of $J(W, b)$ with respect to the $W$ and $b$ of each layer in order to update them. When a neural network has many layers, calculating these partial derivatives takes a lot of time. Backpropagation speeds up this calculation. When computing the partial derivatives of a layer, we need some values that have already been computed in the following layer. If we computed from the first layer onwards, many values would have to be recomputed. If we instead compute from the last layer backwards, each layer can pass the values it has computed to the previous layer, which can use them directly without recalculating them, as shown below.
In fact, backpropagation is just the chain rule of differentiation.
According to the figure above, we first calculate $dA^{[L]} = \frac{\partial J}{\partial A^{[L]}}$ and then $dZ^{[L]} = \frac{\partial J}{\partial Z^{[L]}}$, where the activation function of the last layer is the sigmoid function $\sigma$:

$$dA^{[L]} = -\frac{1}{m}\left(\frac{Y}{A^{[L]}} - \frac{1 - Y}{1 - A^{[L]}}\right), \qquad dZ^{[L]} = dA^{[L]} \odot \sigma'\!\left(Z^{[L]}\right)$$

We can then use the above results to calculate $dW^{[L]} = \frac{\partial J}{\partial W^{[L]}}$ and $db^{[L]} = \frac{\partial J}{\partial b^{[L]}}$.

Finally, all the partial derivatives for a layer $l$ are calculated as follows:

$$dZ^{[l]} = dA^{[l]} \odot g^{[l]\prime}\!\left(Z^{[l]}\right), \quad dW^{[l]} = dZ^{[l]} A^{[l-1]T}, \quad db^{[l]} = \sum_{i=1}^{m} dZ^{[l](i)}, \quad dA^{[l-1]} = W^{[l]T} dZ^{[l]}$$
In the following code, linear_backward() implements the linear backward part of the figure.
def linear_backward(dZ, cache):
    """
    Implements the linear portion of backward propagation for a single layer.

    Parameters
    ----------
    dZ: (ndarray (size of current layer, number of examples)) - gradient of the cost with respect to the linear output
    cache: (tuple) - containing A_prev, W, b from the forward propagation

    Returns
    -------
    dA_prev: (ndarray (size of previous layer, number of examples)) - gradient of the cost with respect to the activation from the previous layer
    dW: (ndarray (size of current layer, size of previous layer)) - gradient of the cost with respect to W
    db: (ndarray (size of current layer, 1)) - gradient of the cost with respect to b
    """
    A_prev, W, b = cache
    dW = dZ @ A_prev.T
    db = np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db
The following code implements the derivatives of the four activation functions. Each one multiplies $dA^{[l]}$ by $g^{[l]\prime}(Z^{[l]})$ and returns $dZ^{[l]}$ to the caller, which is the activation backward part of the figure.
def sigmoid_backward(dA, cache):
    """
    Implements the backward propagation for a single sigmoid unit.

    Parameters
    ----------
    dA: (ndarray of any shape) - post-activation gradient
    cache: (ndarray) - Z from the forward propagation

    Returns
    -------
    dZ: (ndarray of the same shape as A) - gradient of the cost with respect to Z
    """
    Z = cache
    g = 1 / (1 + np.exp(-Z))
    g_prime = g * (1 - g)
    dZ = dA * g_prime
    return dZ


def tanh_backward(dA, cache):
    """
    Implements the backward propagation for a single tanh unit.

    Parameters
    ----------
    dA: (ndarray of any shape) - post-activation gradient
    cache: (ndarray) - Z from the forward propagation

    Returns
    -------
    dZ: (ndarray of the same shape as A) - gradient of the cost with respect to Z
    """
    Z = cache
    g = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))
    g_prime = 1 - g ** 2
    dZ = dA * g_prime
    return dZ


def relu_backward(dA, cache):
    """
    Implements the backward propagation for a single ReLU unit.

    Parameters
    ----------
    dA: (ndarray of any shape) - post-activation gradient
    cache: (ndarray) - Z from the forward propagation

    Returns
    -------
    dZ: (ndarray of the same shape as A) - gradient of the cost with respect to Z
    """
    Z = cache
    # The derivative of ReLU is 1 where Z >= 0, so dA passes through unchanged there.
    dZ = np.array(dA, copy=True)
    dZ[Z < 0] = 0
    return dZ


def leaky_relu_backward(dA, cache, negative_slope=0.01):
    """
    Implements the backward propagation for a single Leaky ReLU unit.

    Parameters
    ----------
    dA: (ndarray of any shape) - post-activation gradient
    cache: (ndarray) - Z from the forward propagation
    negative_slope: (float) - the slope for negative values

    Returns
    -------
    dZ: (ndarray of the same shape as A) - gradient of the cost with respect to Z
    """
    Z = cache
    # The derivative is negative_slope where Z < 0, so scale dA there instead of replacing it.
    dZ = np.array(dA, copy=True)
    dZ[Z < 0] *= negative_slope
    return dZ
The linear_activation_backward() in the following code implements the one-layer part of the diagram.
def linear_activation_backward(dA, cache, activation_function):
    """
    Implements the backward propagation for the linear and activation layer.

    Parameters
    ----------
    dA: (ndarray (size of current layer, number of examples)) - post-activation gradient for current layer
    cache: (tuple) - containing linear_cache (A_prev, W, b) and activation_cache (Z) for backpropagation
    activation_function: (str) - the activation function to be used

    Returns
    -------
    dA_prev: (ndarray (size of previous layer, number of examples)) - gradient of the cost with respect to the activation from the previous layer
    dW: (ndarray (size of current layer, size of previous layer)) - gradient of the cost with respect to W
    db: (ndarray (size of current layer, 1)) - gradient of the cost with respect to b
    """
    linear_cache, activation_cache = cache
    if activation_function == 'sigmoid':
        dZ = sigmoid_backward(dA, activation_cache)
    elif activation_function == 'tanh':
        dZ = tanh_backward(dA, activation_cache)
    elif activation_function == 'relu':
        dZ = relu_backward(dA, activation_cache)
    elif activation_function == 'leaky_relu':
        dZ = leaky_relu_backward(dA, activation_cache)
    else:
        raise ValueError(f'Activation function {activation_function} not supported.')
    dA_prev, dW, db = linear_backward(dZ, linear_cache)
    return dA_prev, dW, db
Finally, model_backward() in the following code implements the entire backpropagation.
def model_backward(AL, Y, caches, activation_functions):
    """
    Implements the backward propagation for the entire network.

    Parameters
    ----------
    AL: (ndarray (output size, number of examples)) - the output of the last layer
    Y: (ndarray (output size, number of examples)) - true labels
    caches: (list of tuples) - containing linear_cache (A_prev, W, b) and activation_cache (Z) for each layer
    activation_functions: (list) - the activation function for each layer. The first element is unused.

    Returns
    -------
    gradients: (dict) with keys where 0 <= l <= len(activation_functions) - 1:
        dA{l-1}: (ndarray (size of previous layer, number of examples)) - gradient of the cost with respect to the activation for previous layer l - 1
        dWl: (ndarray (size of current layer, size of previous layer)) - gradient of the cost with respect to W for layer l
        dbl: (ndarray (size of current layer, 1)) - gradient of the cost with respect to b for layer l
    """
    gradients = {}
    L = len(activation_functions)
    m = AL.shape[1]
    dAL = -(1 / m) * (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    dA_prev = dAL
    for l in reversed(range(1, L)):
        current_cache = caches[l - 1]
        dA_prev, dW, db = linear_activation_backward(dA_prev, current_cache, activation_functions[l])
        gradients[f'dA{l - 1}'] = dA_prev
        gradients[f'dW{l}'] = dW
        gradients[f'db{l}'] = db
    return gradients
Putting It All Together
After performing backpropagation, we obtain the partial derivatives of $J$ with respect to all parameters $W$ and $b$. Then we can call the following code to update all $W$ and $b$.
def update_parameters(parameters, gradients, learning_rate):
    """
    Updates parameters using the gradient descent update rule.

    Parameters
    ----------
    parameters: (dict) - containing the parameters
    gradients: (dict) - containing the gradients
    learning_rate: (float) - the learning rate

    Returns
    -------
    updated_parameters: (dict) - containing the updated parameters
    """
    updated_parameters = parameters.copy()
    L = len(updated_parameters) // 2
    for l in range(L):
        updated_parameters[f'W{l + 1}'] = parameters[f'W{l + 1}'] - learning_rate * gradients[f'dW{l + 1}']
        updated_parameters[f'b{l + 1}'] = parameters[f'b{l + 1}'] - learning_rate * gradients[f'db{l + 1}']
    return updated_parameters
nn_model() in the following code implements the entire model. In each iteration it performs forward propagation, computes the cost, performs backpropagation, and finally updates the parameters.
def nn_model(X, Y, init_parameters, layer_activation_functions, learning_rate, num_iterations):
    """
    Implements a neural network.

    Parameters
    ----------
    X: (ndarray (input size, number of examples)) - input data
    Y: (ndarray (output size, number of examples)) - true labels
    init_parameters: (dict) - the initial parameters for the network
    layer_activation_functions: (list) - the activation function for each layer. The first element is unused.
    learning_rate: (float) - the learning rate
    num_iterations: (int) - the number of iterations

    Returns
    -------
    parameters: (dict) - the learned parameters
    costs: (list) - the costs at every 100th iteration
    """
    costs = []
    parameters = init_parameters.copy()
    for i in range(num_iterations):
        AL, caches = model_forward(X, parameters, layer_activation_functions)
        cost = compute_cost(AL, Y)
        gradients = model_backward(AL, Y, caches, layer_activation_functions)
        parameters = update_parameters(parameters, gradients, learning_rate)
        # Record the cost every 100 iterations and at the last iteration.
        if i % 100 == 0 or i == num_iterations - 1:
            costs.append(cost)
    return parameters, costs
After training the parameters, we can use the following nn_model_predict() to make predictions.
def nn_model_predict(X, parameters, activation_functions):
    """
    Predicts the output of the neural network.

    Parameters
    ----------
    X: (ndarray (input size, number of examples)) - input data
    parameters: (dict) - the learned parameters
    activation_functions: (list) - the activation function for each layer. The first element is unused.

    Returns
    -------
    predictions: (ndarray (1, number of examples)) - the predicted labels
    """
    probabilities, _ = model_forward(X, parameters, activation_functions)
    predictions = probabilities.copy()
    predictions[predictions > 0.5] = 1
    predictions[predictions <= 0.5] = 0
    return predictions
Example
We will use an example to show how to use our model. First, we load the training data x_orig and y. x_orig is an array containing 100 images; each image is 64 x 64 pixels with three channels. y is an array of 0s and 1s, where 1 means there is a cat in the picture and 0 means there is not.
x_orig, y = load_data()
print(f'x_orig shape: {x_orig.shape}')
print(f'y shape: {y.shape}')

# Output
# x_orig shape: (100, 64, 64, 3)
# y shape: (1, 100)
Previously we listed the dimensions of $X$ as $(n^{[0]}, m)$, so each picture must be flattened into a column vector. Below we reshape x_orig accordingly and convert the values from the range 0 to 255 into values from 0 to 1. We do not need to reshape y, because its dimensions are already $(1, m)$.
x_flatten = x_orig.reshape(x_orig.shape[0], -1).T
x = x_flatten / 255.
print("x shape: " + str(x.shape))

# Output
# x shape: (12288, 100)
First, we need to decide the number of layers in the model and the number of neurons in each layer. Below we set up the model with an input layer, three hidden layers, and an output layer. We also need to decide the activation function of each layer; layer_activation_functions[0] corresponds to the input layer, so it is not used.
After these decisions are made, we can initialize all parameters $W$ and $b$, and then call nn_model() to train the model and obtain the trained parameters.
layer_dims = [12288, 20, 7, 10, 1]
init_parameters = initialize_parameters(layer_dims)
layer_activation_functions = ['none', 'relu', 'relu', 'relu', 'sigmoid']
learning_rate = 0.0075
parameters, costs = nn_model(x, y, init_parameters, layer_activation_functions, learning_rate, 3000)
With the trained parameters, we can use them to predict other pictures.
x_new_orig = load_new_data()
x_new_flatten = x_new_orig.reshape(x_new_orig.shape[0], -1).T
x_new = x_new_flatten / 255.
y_new = nn_model_predict(x_new, parameters, layer_activation_functions)
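If the new pictures also come with ground-truth labels, we can compare the predictions against them to estimate accuracy. The variable y_new_true below is hypothetical and not part of the original example; it stands for labels with the same (1, m) shape as y_new:

# y_new_true is an assumed (1, m) array of true labels for the new pictures.
accuracy = np.mean(y_new == y_new_true)
print(f'accuracy: {accuracy:.2%}')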
Multi-class Classification
For information about multi-class classification neural networks, please refer to the following article.
Conclusion
The backpropagation of a neural network involves calculating partial derivatives, so it is difficult to understand. Nowadays we no longer need to implement backpropagation ourselves; instead we use machine learning libraries such as PyTorch or TensorFlow. However, understanding these details helps us better understand how neural networks work.