Neural Networks and Binary Classification

Photo by Henrique Ferreira on Unsplash
Deep learning has flourished in recent years, which has made neural networks popular. They have been used to solve a wide variety of problems. This article uses a binary classification neural network to explain neural networks in detail.

The complete code can be downloaded at .

Neural Networks

Neural networks are formed by connecting large numbers of neurons. A neural network consists of three kinds of layers: the input layer that receives the data, the output layer that produces the result, and the hidden layers in between, made up of several layers of many interconnected neurons, as shown in the figure below.

The input layer and the output layer are each a single layer, while there can be several hidden layers. Also, when we say the network in the figure below is a three-layer neural network, we mean the number of hidden layers plus the output layer; the input layer is not counted.

A neural network with 3 layers (2 hidden layers and 1 output layer).

Each neuron has an input vector \vec{x}, a weight vector \vec{w}, a scalar bias b, and a non-linear function g. A neuron therefore computes the inner product of \vec{x} and \vec{w}, adds b to obtain z, and then feeds z into g to obtain an output value a. This output value a becomes one of the input values x_i of the neurons in the next layer. The non-linear function g is called the activation function, and a is called the activation value.

A neuron with sigmoid function as its activation function.

So, several to many neurons form a layer; the layer takes the previous layer's outputs as its inputs, and the values computed by its neurons become the inputs of the next layer. Connecting layers together in this way forms the hidden layers that contain a large number of neurons.
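
To make the computation concrete, here is a minimal sketch of a single neuron with a sigmoid activation; the input values, weights, bias, and the helper name single_neuron are made up for this illustration.

import numpy as np

def single_neuron(x, w, b):
    # z = w . x + b, then a = sigmoid(z)
    z = np.dot(w, x) + b
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # inputs coming from the previous layer
w = np.array([0.1, 0.4, -0.3])   # this neuron's weights
b = 0.2                          # this neuron's bias
a = single_neuron(x, w, b)       # activation value passed on to the next layer
print(a)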

Activation Functions

If a neuron contained no non-linear function, that is, if it only performed linear operations, then even connecting a large number of neurons together would still amount to a multiple linear regression. Linear regression models cannot solve the complex problems of the real world, but complex problems can be approximated with non-linear functions. The non-linear function inside a neuron is called the activation function.

Below we introduce four commonly used activation functions. We also show how to derive their derivatives, because we will need them later in backpropagation.

The Sigmoid Function and Its Derivative

The sigmoid function maps its input to a value between 0 and 1, as shown below. It is often used in the output layer of a binary classification neural network, and we can interpret its output as a probability. For example, suppose we want a neural network that determines whether a picture contains a cat, where 1 means cat and 0 means not a cat. When the output of the output layer is greater than 0.5, we predict 1; when it is less than or equal to 0.5, we predict 0.

\sigma(z)=\frac{1}{1+e^{-z}}, 0<\sigma(z)<1

Sigmoid function.

The sigmoid function is implemented as follows.

import numpy as np


def sigmoid(Z):
    """
    Implements the sigmoid function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the sigmoid function

    Returns
    -------
    A: (ndarray of same shape as Z) or (scalar) - output from the sigmoid function
    """

    A = 1 / (1 + np.exp(-Z))
    return A

The derivative of the sigmoid function is derived as follows.

\frac{d}{dz}\sigma(z) =\frac{d}{dz}\frac{1}{1+e^{-z}} \\\\ \hphantom{\frac{d}{dz}\sigma(z)}=-(1+e^{-z})^{-2}\cdot(-e^{-z}) \\\\ \hphantom{\frac{d}{dz}\sigma(z)}=\frac{1}{1+e^{-z}}\frac{e^{-z}}{1+e^{-z}} \\\\ \hphantom{\frac{d}{dz}\sigma(z)}=\frac{1}{1+e^{-z}}\frac{1+e^{-z}-1}{1+e^{-z}} \\\\ \hphantom{\frac{d}{dz}\sigma(z)}=\frac{1}{1+e^{-z}}(1-\frac{1}{1+e^{-z}}) \\\\ \hphantom{\frac{d}{dz}\sigma(z)}=\sigma(z)(1-\sigma(z))

The derivative of the sigmoid function is implemented as follows.

def sigmoid_derivative(Z):
    """
    Implements the derivative of the sigmoid function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the sigmoid function

    Returns
    --------
    dZ: (ndarray of the same shape as Z) or (scalar) - derivative of the sigmoid function with respect to Z
    """

    g = 1 / (1 + np.exp(-Z))
    dZ = g * (1 - g)
    return dZ

The Tanh Function and Its Derivative

The tanh function is similar to the sigmoid function, but its output lies between -1 and 1, as shown below.

tanh(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}, -1<tanh(z)<1

Tanh function.

The tanh function is implemented as follows.

def tanh(Z):
    """
    Implements the tanh function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the tanh function

    Returns
    -------
    A: (ndarray of same shape as Z) or (scalar) - output from the tanh function
    """

    A = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))
    return A

The derivative of the tanh function is derived as follows:

\frac{d}{dz}tanh(z)=\frac{d}{dz}\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \\\\ \hphantom{\frac{d}{dz}tanh(z)}=\frac{(e^{z}+e^{-z})\cdot\frac{d}{dz}(e^{z}-e^{-z})-(e^{z}-e^{-z})\cdot\frac{d}{dz}(e^{z}+e^{-z})}{(e^{z}+e^{-z})^2} \\\\ \hphantom{\frac{d}{dz}tanh(z)}=\frac{(e^{z}+e^{-z})(e^{z}+e^{-z})-(e^{z}-e^{-z})(e^{z}-e^{-z})}{(e^{z}+e^{-z})^2} \\\\ \hphantom{\frac{d}{dz}tanh(z)}=\frac{(e^{z}+e^{-z})^2}{(e^{z}+e^{-z})^2}-\frac{(e^{z}-e^{-z})^2}{(e^{z}+e^{-z})^2} \\\\ \hphantom{\frac{d}{dz}tanh(z)}=1-tanh^{2}(z)

The derivative of the tanh function is implemented as follows.

def tanh_derivative(Z):
    """
    Implements the derivative of the tanh function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the tanh function

    Returns
    -------
    dZ: (ndarray of the same shape as Z) or (scalar) - derivative of the tanh function with respect to Z
    """

    g = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))
    dZ = 1 - g ** 2
    return dZ

The ReLU Function and Its Derivative

The ReLU (rectified linear unit) function is widely used in neural networks. It outputs 0 when z is less than or equal to 0, and outputs z when z is greater than 0. As you can see, ReLU is very fast to compute.

relu(z)=max(0,z)

ReLU function.

The ReLU function is implemented as follows.

def relu(Z):
    """
    Implements the ReLU function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the ReLU function

    Returns
    -------
    A: (ndarray of same shape as Z) or (scalar) - output from the ReLU function
    """

    A = np.maximum(0, Z)
    return A

The derivative of ReLU is as follows. When z is less than 0, the derivative is 0; when z is greater than 0, the derivative is 1; when z equals 0, the derivative is undefined. In practice, the convention is to set the derivative at z = 0 to 1.

\frac{d}{dz}relu(z)=\begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z<0  \end{cases}

The derivative of the ReLU function is implemented as follows.

def relu_derivative(Z):
    """
    Implements the derivative of the ReLU function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the ReLU function

    Returns
    -------
    dZ: (ndarray of the same shape as Z) or (scalar) - derivative of the ReLU function with respect to Z
    """

    dZ = np.ones_like(Z, dtype=float)
    dZ[Z < 0] = 0
    return dZ

The Leaky ReLU Function and Its Derivative

The leaky ReLU function is a variant of the ReLU function. When z is less than 0, it outputs \lambda z, where \lambda is a value between 0 and 1.

leaky\_relu(z)=max(\lambda z,z) \text{, where }0<\lambda<1

Leaky ReLU function.

The leaky ReLU function is implemented as follows.

def leaky_relu(Z, negative_slope=0.01):
    """
    Implements the leaky ReLU function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the leaky ReLU function
    negative_slope: (float) - the slope for negative values

    Returns
    -------
    A: (ndarray of same shape as Z) or (scalar) - output from the leaky ReLU function
    """

    A = np.maximum(0, Z) + negative_slope * np.minimum(0, Z)
    return A

The derivative of leaky ReLU is as follows. When z is less than 0, the derivative is \lambda; when z is greater than 0, the derivative is 1; when z equals 0, the derivative is undefined. In practice, the convention is to set the derivative at z = 0 to 1.

\frac{d}{dz}leaky\_relu(z)=\begin{cases} 1 & \text{if } z \ge 0 \\ \lambda & \text{if } z<0, \text{where } 0<\lambda<1 \end{cases}

The derivative of the leaky ReLU function is implemented as follows.

def leaky_relu_derivative(Z, negative_slope=0.01):
    """
    Implements the derivative of the leaky ReLU function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the leaky ReLU function
    negative_slope: (float) - the slope for negative values

    Returns
    -------
    dZ: (ndarray of the same shape as Z) or (scalar) - derivative of the leaky ReLU function with respect to Z
    """

    dZ = np.ones_like(Z, dtype=float)
    dZ[Z < 0] = negative_slope
    return dZ

Binary Classification

The figure below shows a binary classification neural network. Because it performs binary classification, the activation function in its output layer is the sigmoid function \sigma. The figure contains quite a few variables: each layer has its own activation function, and each neuron has its own parameters w and b. Vectorizing these variables and expressing them as matrices greatly simplifies the equations, as shown in the yellow part of the figure.

Binary classification with neural network.

The vectorized formula for each layer and the shapes of the arrays are as follows.

Z^{[\ell]}=W^{[\ell]}A^{[\ell-1]}+b^{[\ell]} \\\\ A^{[\ell]}=g^{[\ell]}(Z^{[\ell]}) \\\\ \ell: \ell \text{-th layer} \\\\ m: \text{number of examples} \\\\ L: \text{number of layers} \\\\ n_h:\text{number of inputs} \\\\ n^{[\ell]}:\text{number of units in } \ell \text{-th layer}

| Layer | Shape of W | Shape of X or A^{[\ell-1]} | Shape of b | Shape of Z | Shape of A |
| --- | --- | --- | --- | --- | --- |
| 1 | W^{[1]}:(n^{[1]},n_h) | X:(n_h,m) | b^{[1]}:(n^{[1]},1) | Z^{[1]}:(n^{[1]},m) | A^{[1]}:(n^{[1]},m) |
| 2 | W^{[2]}:(n^{[2]},n^{[1]}) | A^{[1]}:(n^{[1]},m) | b^{[2]}:(n^{[2]},1) | Z^{[2]}:(n^{[2]},m) | A^{[2]}:(n^{[2]},m) |
| \vdots | \vdots | \vdots | \vdots | \vdots | \vdots |
| L | W^{[L]}:(n^{[L]},n^{[L-1]}) | A^{[L-1]}:(n^{[L-1]},m) | b^{[L]}:(n^{[L]},1) | Z^{[L]}:(n^{[L]},m) | A^{[L]}:(n^{[L]},m) |
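
To make these shapes concrete, here is a minimal sketch that checks the dimensions of a single layer with arbitrarily chosen sizes; the numbers n_prev, n_curr, and m are made up for this illustration.

import numpy as np

n_prev, n_curr, m = 4, 3, 5          # units in previous layer, units in current layer, examples

A_prev = np.random.randn(n_prev, m)  # (n^[l-1], m)
W = np.random.randn(n_curr, n_prev)  # (n^[l],   n^[l-1])
b = np.zeros((n_curr, 1))            # (n^[l],   1), broadcast across the m columns

Z = W @ A_prev + b                   # (n^[l], m)
A = 1 / (1 + np.exp(-Z))             # same shape as Z

print(Z.shape, A.shape)              # (3, 5) (3, 5)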

Gradient Descent

Gradient descent for a neural network is as follows. Because every layer has its own parameters W and b, we must compute the partial derivatives of J with respect to W and b for every layer.

L:\text{number of layers} \\\\ m:\text{number of examples} \\\\ \alpha:\text{learning rate} \\\\ \text{Parameters}: W^{[\ell]},b^{[\ell]},\ell=1,...,L \\\\ \text{Cost function}: J(W,b)=\frac{1}{m}\displaystyle\sum_{i=1}^{m}\mathcal{L}(a^{[L](i)},y^{(i)}) \\\\ \text{repeat until convergence \{} \\\\ \hphantom{xxxx}\text{Compute predictions }(\hat{y}^{(i)},i=1,...,m) \\\\ \hphantom{xxxx}W^{[\ell]}:=W^{[\ell]}-\alpha\frac{\partial J}{\partial W^{[\ell]}}, \ell=1,...,L \\\\ \hphantom{xxxx}b^{[\ell]}:=b^{[\ell]}-\alpha\frac{\partial J}{\partial b^{[\ell]}}, \ell=1,...,L \\\\ \text{\}}

The gradient descent flow is shown in the figure below.

  1. Initialize all parameters W and b.
  2. Compute \hat{y}, and store the intermediate W, b, A, and Z, because we will need these values when computing the partial derivatives in the next step. This part is forward propagation.
  3. Compute the partial derivatives of J with respect to every W and b. This part is backward propagation.
  4. Update all parameters W and b.
  5. Repeat steps 2 through 4 for num_iterations iterations.
Gradient descent of binary classification neural network.

Cost Function

In a binary classification neural network, the activation function of the output layer is the sigmoid function. We can therefore use the cost function of logistic regression as the cost function of a binary classification neural network. Logistic regression uses the cross-entropy loss as its cost function.

J(W,b)=\frac{1}{m}\displaystyle\sum_{i=1}^{m}\mathcal{L}(a^{[L](i)},y^{(i)}) \\\\ \mathcal{L}(a^{[L](i)},y^{(i)})=-y^{(i)}\log a^{[L](i)}-(1-y^{(i)})\log (1-a^{[L](i)}) \\\\ L: \text{number of layers}\\\\ a^{[L](i)}:\text{the activation in the last layer for } i \text{-th example}

The following code implements the cost function.

def compute_cost(AL, Y):
    """
    Computes the cross-entropy loss.

    Parameters
    ----------
    AL: (ndarray (1, number of examples)) - the output of the last layer
    Y: (ndarray (1, number of examples)) - true labels

    Returns
    -------
    cost: (float) - the cross-entropy cost
    """

    m = Y.shape[1]
    cost = -(1 / m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL), axis=1, keepdims=True)
    cost = np.squeeze(cost)
    return cost

Parameter Initialization

We listed the shapes of all parameters W and b earlier. Their shapes depend on the number of neurons in each layer, so before initializing the parameters we must first decide how many neurons each layer has.

In the following code, we initialize each weight matrix W with scaled random values and each bias vector b with zeros.

def initialize_parameters(layer_dims):
    """
    Initializes parameters for a deep neural network.

    Parameters
    ----------
    layer_dims: (list) - the number of units of each layer in the network.

    Returns
    -------
    (dict) with keys where 1 <= l <= len(layer_dims) - 1:
        Wl: (ndarray (layer_dims[l], layer_dims[l-1])) - weight matrix for layer l
        bl: (ndarray (layer_dims[l], 1)) - bias vector for layer l
    """

    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters[f'W{l}'] = np.random.randn(layer_dims[l], layer_dims[l - 1]) / np.sqrt(layer_dims[l - 1])
        parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))
    return parameters

Forward Propagation

Forward propagation is the first half of gradient descent in a neural network. The output A of each layer is the input of the next layer, so A is passed along layer by layer, with each layer transforming its value. The final A^{[L]} is \hat{y}. Each layer stores its computed values in caches, because the second half, backpropagation, will need them.

After running the full gradient descent, we obtain the final parameters W_{final} and b_{final}. Suppose we want to use this model to predict on X_{new}: we feed the input X_{new} together with the parameters W_{final} and b_{final} into forward propagation, and the resulting A_{new}^{[L]}=\hat{y}_{new} is the prediction for X_{new}.

Forward propagation of binary classification neural network.

In the following code, linear_forward() implements the linear-forward part of each layer in the flow. Besides returning Z, linear_forward() also returns A_prev, W, and b to the caller, which stores them in caches.

def linear_forward(A_prev, W, b):
    """
    Implements the linear part of a layer's forward propagation.

    Parameters
    ----------
    A_prev: (ndarray (size of previous layer, number of examples)) - activations from previous layer
    W: (ndarray (size of current layer, size of previous layer)) - weight matrix
    b: (ndarray (size of current layer, 1)) - bias vector

    Returns
    -------
    Z: (ndarray (size of current layer, number of examples)) - the input to the activation function
    cache: (tuple) - containing A_prev, W, b for backpropagation
    """

    Z = W @ A_prev + b
    cache = (A_prev, W, b)
    return Z, cache

The following code implements the four activation functions. These implementations are almost identical to the ones at the beginning of the article, except that here each also returns Z to the caller, which stores it in caches.

def sigmoid(Z):
    """
    Implements the sigmoid activation.

    Parameters
    ----------
    Z: (ndarray of any shape) - input to the activation function

    Returns
    -------
    A: (ndarray of same shape as Z) - output of the activation function
    cache: (ndarray) - returning Z for backpropagation
    """

    A = 1 / (1 + np.exp(-Z))
    cache = Z
    return A, cache


def tanh(Z):
    """
    Implements the tanh activation.

    Parameters
    ----------
    Z: (ndarray of any shape) - input to the activation function

    Returns
    -------
    A: (ndarray of same shape as Z) - output of the activation function
    cache: (ndarray) - returning Z for backpropagation
    """

    A = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))
    cache = Z
    return A, cache


def relu(Z):
    """
    Implements the ReLU activation.

    Parameters
    ----------
    Z: (ndarray of any shape) - input to the activation function

    Returns
    -------
    A: (ndarray of same shape as Z) - output of the activation function
    cache: (ndarray) - returning Z for backpropagation
    """

    A = np.maximum(0, Z)
    cache = Z
    return A, cache


def leaky_relu(Z, negative_slope=0.01):
    """
    Implements the Leaky ReLU activation.

    Parameters
    ----------
    Z: (ndarray of any shape) - input to the activation function
    negative_slope: (float) - the slope for negative values

    Returns
    -------
    A: (ndarray of same shape as Z) - output of the activation function
    cache: (ndarray) - returning Z for backpropagation
    """

    A = np.maximum(0, Z) + negative_slope * np.minimum(0, Z)
    cache = Z
    return A, cache

In the following code, linear_activation_forward() implements one layer in the figure above. It first calls linear_forward() to obtain Z, then passes Z to an activation function to obtain A. Finally, it returns A and the cache to the caller.

def linear_activation_forward(A_prev, W, b, activation_function):
    """
    Implements the forward propagation for the linear and activation layer.

    Parameters
    ----------
    A_prev: (ndarray (size of previous layer, number of examples)) - activations from previous layer
    W: (ndarray (size of current layer, size of previous layer)) - weight matrix
    b: (ndarray (size of current layer, 1)) - bias vector
    activation_function: (str) - the activation function to be used

    Returns
    -------
    A: (ndarray (size of current layer, number of examples)) - the output of the activation function
    cache: (tuple) - containing linear_cache (A_prev, W, b) and activation_cache (Z) for backpropagation
    """

    Z, linear_cache = linear_forward(A_prev, W, b)
    if activation_function == 'sigmoid':
        A, activation_cache = sigmoid(Z)
    elif activation_function == 'tanh':
        A, activation_cache = tanh(Z)
    elif activation_function == 'relu':
        A, activation_cache = relu(Z)
    elif activation_function == 'leaky_relu':
        A, activation_cache = leaky_relu(Z)
    else:
        raise ValueError(f'Activation function {activation_function} not supported.')
    cache = (linear_cache, activation_cache)
    return A, cache

In the following code, model_forward() implements the entire forward propagation. At the end, it returns A^{[L]} and all of the caches.

def model_forward(X, parameters, activation_functions):
    """
    Implements forward propagation for the entire network.

    Parameters
    ----------
    X: (ndarray (input size, number of examples)) - input data
    parameters: (dict) - output of initialize_parameters()
    activation_functions: (list) - the activation function for each layer. The first element is unused.

    Returns
    -------
    AL: (ndarray (output size, number of examples)) - the output of the last layer
    caches: (list of tuples) - containing caches for each layer
    """

    caches = []
    A = X
    L = len(activation_functions)
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters[f'W{l}'], parameters[f'b{l}'], activation_functions[l])
        caches.append(cache)
    return A, caches

Backpropagation (Backward Propagation)

In gradient descent, we must compute the partial derivatives of J(W, b) with respect to W and b in every layer in order to update them. When a neural network has many layers, computing these partial derivatives can take a large amount of time. Backpropagation speeds up this computation. Computing the partial derivatives of a layer requires values that have already been computed in the next layer. If we computed from the first layer onward, many values would have to be recomputed over and over. By instead computing from the last layer backward, each layer can pass the values it has computed to the previous layer, which can use them directly without recomputing them, as shown in the figure below.

Backpropagation of binary classification neural network.

Backpropagation is essentially the chain rule of differentiation.

\frac{dy}{dx}=\frac{du}{dx}\cdot\frac{dy}{du}

Following the figure above, we first compute \frac{\partial J}{\partial A^{[L]}} and then \frac{\partial J}{\partial Z^{[L]}}. Here the activation function of the last layer is the sigmoid function \sigma.

\frac{\partial J}{\partial A^{[L]}}=\frac{1}{m}\frac{\partial}{\partial A^{[L]}}[-Y\log{A^{[L]}}-(1-Y)\log{(1-A^{[L]})}] \\\\ \hphantom{\frac{\partial J}{\partial A^{[L]}}}=-\frac{1}{m}(\frac{Y}{A^{[L]}}-\frac{1-Y}{1-A^{[L]}}) \\\\ \frac{\partial A^{[L]}}{\partial Z^{[L]}}=\frac{\partial}{\partial Z^{[L]}}\sigma(Z^{[L]}) \\\\ \hphantom{\frac{\partial A^{[L]}}{\partial Z^{[L]}}}=\sigma(Z^{[L]})(1-\sigma(Z^{[L]})) \\\\ \hphantom{\frac{\partial A^{[L]}}{\partial Z^{[L]}}}=A^{[L]}(1-A^{[L]}) \\\\ \frac{\partial J}{\partial Z^{[L]}}=\frac{\partial A^{[L]}}{\partial Z^{[L]}}\frac{\partial J}{\partial A^{[L]}}

Next, we can use the results above to compute \frac{\partial J}{\partial W^{[L]}} and \frac{\partial J}{\partial b^{[L]}}.

Z^{[L]}=W^{[L]}A^{[L-1]}+b^{[L]} \\\\ \frac{\partial Z^{[L]}}{\partial W^{[L]}}=A^{[L-1]} \\\\ \frac{\partial J}{\partial W^{[L]}}=\frac{\partial Z^{[L]}}{\partial W^{[L]}}\frac{\partial J}{\partial Z^{[L]}}=\frac{\partial J}{\partial Z^{[L]}}A^{[L-1]T} \\\\ \frac{\partial J}{\partial b^{[L]}}=\frac{\partial Z^{[L]}}{\partial b^{[L]}}\frac{\partial J}{\partial Z^{[L]}}=\frac{\partial J}{\partial Z^{[L]}}

Finally, all of the partial derivatives are computed as follows.

L:\text{number of layers} \\\\ \ell=1,...,L \\\\ A^{[0]}=X \\\\ \frac{\partial J}{\partial Z^{[\ell]}}=\frac{\partial A^{[\ell]}}{\partial Z^{[\ell]}}\frac{\partial J}{\partial A^{[\ell]}}=\frac{\partial A^{[\ell]}}{\partial Z^{[\ell]}}\frac{\partial Z^{[\ell+1]}}{\partial A^{[\ell]}}\frac{\partial J}{\partial Z^{[\ell+1]}} \\\\ \frac{\partial J}{\partial A^{[\ell]}}=\frac{\partial Z^{[\ell+1]}}{\partial A^{[\ell]}}\frac{\partial J}{\partial Z^{[\ell+1]}}=W^{[\ell+1]T}\frac{\partial J}{\partial Z^{[\ell+1]}} \\\\ \frac{\partial J}{\partial W^{[\ell]}}=\frac{\partial J}{\partial Z^{[\ell]}}A^{[\ell-1]T} \\\\ \frac{\partial J}{\partial b^{[\ell]}}=\displaystyle\sum_{i=1}^{m}\frac{\partial J}{\partial Z^{[\ell]}}

In the following code, linear_backward() implements the linear-backward part of the figure.

def linear_backward(dZ, cache):
    """
    Implements the linear portion of backward propagation for a single layer.

    Parameters
    ----------
    dZ: (ndarray (size of current layer, number of examples)) - gradient of the cost with respect to the linear output
    cache: (tuple) - containing A_prev, W, b from the forward propagation

    Returns
    -------
    dA_prev: (ndarray (size of previous layer, number of examples)) - gradient of the cost with respect to the activation from the previous layer
    dW: (ndarray (size of current layer, size of previous layer)) - gradient of the cost with respect to W
    db: (ndarray (size of current layer, 1)) - gradient of the cost with respect to b
    """

    A_prev, W, b = cache
    dW = dZ @ A_prev.T
    db = np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

The following code implements the derivatives g' of the four activation functions. Each one multiplies g' by \frac{\partial J}{\partial A^{[\ell]}} and returns \frac{\partial J}{\partial Z^{[\ell]}}, which is the activation-backward part of the figure.

def sigmoid_backward(dA, cache):
    """
    Implements the backward propagation for a single sigmoid unit.

    Parameters
    ----------
    dA: (ndarray of any shape) - post-activation gradient
    cache: (ndarray) - Z from the forward propagation

    Returns
    --------
    dZ: (ndarray of the same shape as A) - gradient of the cost with respect to Z
    """

    Z = cache
    g = 1 / (1 + np.exp(-Z))
    g_prime = g * (1 - g)
    dZ = dA * g_prime
    return dZ


def tanh_backward(dA, cache):
    """
    Implements the backward propagation for a single tanh unit.

    Parameters
    ----------
    dA: (ndarray of any shape) - post-activation gradient
    cache: (ndarray) - Z from the forward propagation

    Returns
    -------
    dZ: (ndarray of the same shape as A) - gradient of the cost with respect to Z
    """

    Z = cache
    g = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))
    g_prime = 1 - g ** 2
    dZ = dA * g_prime
    return dZ


def relu_backward(dA, cache):
    """
    Implements the backward propagation for a single ReLU unit.

    Parameters
    ----------
    dA: (ndarray of any shape) - post-activation gradient
    cache: (ndarray) - Z from the forward propagation

    Returns
    -------
    dZ: (ndarray of the same shape as A) - gradient of the cost with respect to Z
    """

    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z < 0] = 0
    return dZ


def leaky_relu_backward(dA, cache, negative_slope=0.01):
    """
    Implements the backward propagation for a single Leaky ReLU unit.

    Parameters
    ----------
    dA: (ndarray of any shape) - post-activation gradient
    cache: (ndarray) - Z from the forward propagation
    negative_slope: (float) - the slope for negative values

    Returns
    -------
    dZ: (ndarray of the same shape as A) - gradient of the cost with respect to Z
    """

    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z < 0] *= negative_slope
    return dZ

In the following code, linear_activation_backward() implements one layer in the figure.

def linear_activation_backward(dA, cache, activation_function):
    """
    Implements the backward propagation for the linear and activation layer.

    Parameters
    ----------
    dA: (ndarray (size of current layer, number of examples)) - post-activation gradient for current layer
    cache: (tuple) - containing linear_cache (A_prev, W, b) and activation_cache (Z) for backpropagation
    activation_function: (str) - the activation function to be used

    Returns
    -------
    dA_prev: (ndarray (size of previous layer, number of examples)) - gradient of the cost with respect to the activation from the previous layer
    dW: (ndarray (size of current layer, size of previous layer)) - gradient of the cost with respect to W
    db: (ndarray (size of current layer, 1)) - gradient of the cost with respect to b
    """

    linear_cache, activation_cache = cache
    if activation_function == 'sigmoid':
        dZ = sigmoid_backward(dA, activation_cache)
    elif activation_function == 'tanh':
        dZ = tanh_backward(dA, activation_cache)
    elif activation_function == 'relu':
        dZ = relu_backward(dA, activation_cache)
    elif activation_function == 'leaky_relu':
        dZ = leaky_relu_backward(dA, activation_cache)
    else:
        raise ValueError(f'Activation function {activation_function} not supported.')
    dA_prev, dW, db = linear_backward(dZ, linear_cache)
    return dA_prev, dW, db

Finally, model_backward() in the following code implements the entire backpropagation.

def model_backward(AL, Y, caches, activation_functions):
    """
    Implements the backward propagation for the entire network.

    Parameters
    ----------
    AL: (ndarray (output size, number of examples)) - the output of the last layer
    Y: (ndarray (output size, number of examples)) - true labels
    caches: (list of tuples) - containing linear_cache (A_prev, W, b) and activation_cache (Z) for each layer
    activation_functions: (list) - the activation function for each layer. The first element is unused.

    Returns
    -------
    gradients: (dict) with keys where 0 <= l <= len(activation_functions) - 1:
        dA{l-1}: (ndarray (size of previous layer, number of examples)) - gradient of the cost with respect to the activation for previous layer l - 1
        dWl: (ndarray (size of current layer, size of previous layer)) - gradient of the cost with respect to W for layer l
        dbl: (ndarray (size of current layer, 1)) - gradient of the cost with respect to b for layer l
    """

    gradients = {}
    L = len(activation_functions)
    m = AL.shape[1]
    dAL = -(1 / m) * (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    dA_prev = dAL
    for l in reversed(range(1, L)):
        current_cache = caches[l - 1]
        dA_prev, dW, db = linear_activation_backward(dA_prev, current_cache, activation_functions[l])
        gradients[f'dA{l - 1}'] = dA_prev
        gradients[f'dW{l}'] = dW
        gradients[f'db{l}'] = db
    return gradients

Putting It All Together

After running backpropagation, we have the partial derivatives of J with respect to all parameters W and b. We can then call the following code to update all W and b.

def update_parameters(parameters, gradients, learning_rate):
    """
    Updates parameters using the gradient descent update rule.

    Parameters
    ----------
    parameters: (dict) - containing the parameters
    gradients: (dict) - containing the gradients
    learning_rate: (float) - the learning rate

    Returns
    -------
    params: (dict) - containing the updated parameters
    """

    updated_parameters = parameters.copy()
    L = len(updated_parameters) // 2
    for l in range(L):
        updated_parameters[f'W{l + 1}'] = parameters[f'W{l + 1}'] - learning_rate * gradients[f'dW{l + 1}']
        updated_parameters[f'b{l + 1}'] = parameters[f'b{l + 1}'] - learning_rate * gradients[f'db{l + 1}']
    return updated_parameters

nn_model() in the following code implements the entire model. It first runs forward propagation, then backpropagation, and finally updates the parameters.

def nn_model(X, Y, init_parameters, layer_activation_functions, learning_rate, num_iterations):
    """
    Implements a neural network.

    Parameters
    ----------
    X: (ndarray (input size, number of examples)) - input data
    Y: (ndarray (output size, number of examples)) - true labels
    init_parameters: (dict) - the initial parameters for the network
    layer_activation_functions: (list) - the activation function for each layer. The first element is unused.
    learning_rate: (float) - the learning rate
    num_iterations: (int) - the number of iterations

    Returns
    -------
    parameters: (dict) - the learned parameters
    costs: (list) - the costs at every 100th iteration
    """

    costs = []
    parameters = init_parameters.copy()

    for i in range(num_iterations):
        AL, caches = model_forward(X, parameters, layer_activation_functions)
        cost = compute_cost(AL, Y)
        gradients = model_backward(AL, Y, caches, layer_activation_functions)
        parameters = update_parameters(parameters, gradients, learning_rate)

        if i % 100 == 0 or i == num_iterations - 1:
            costs.append(cost)

    return parameters, costs

Once the parameters are trained, we can use nn_model_predict() below to make predictions.

def nn_model_predict(X, parameters, activation_functions):
    """
    Predicts the output of the neural network.

    Parameters
    ----------
    X: (ndarray (input size, number of examples)) - input data
    parameters: (dict) - the learned parameters
    activation_functions: (list) - the activation function for each layer. The first element is unused.

    Returns
    -------
    predictions: (ndarray (1, number of examples)) - the predicted labels
    """

    probabilities, _ = model_forward(X, parameters, activation_functions)
    predictions = probabilities.copy()
    predictions[predictions > 0.5] = 1
    predictions[predictions <= 0.5] = 0
    return predictions

Example

We will use an example to show how to use our model. First, we load the training data x_orig and y. x_orig is an array containing 100 images; each image is 64 x 64 with three channels. y is an array of 0s and 1s, where 1 means the image contains a cat and 0 means it does not.

x_orig, y = load_data()

print(f'x_orig shape: {x_orig.shape}')
print(f'y shape: {y.shape}')

# Output
x_orig shape: (100, 64, 64, 3)
y shape: (1, 100)
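
load_data() itself is not shown in this article. The following is a minimal stand-in that generates random data with the same shapes, just so the example can be run end to end; the function body and the synthetic data are assumptions for illustration only.

import numpy as np

def load_data(m=100, size=64, channels=3, seed=0):
    # Stand-in for the article's load_data(): random "images" and random 0/1 labels.
    rng = np.random.default_rng(seed)
    x_orig = rng.integers(0, 256, size=(m, size, size, channels))
    y = rng.integers(0, 2, size=(1, m))
    return x_orig, y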

We listed earlier that X has shape (n_h, m), so each image becomes a column vector. Below we reshape x_orig and scale the pixel values from the range 0 to 255 down to 0 to 1. We do not need to reshape y, because its shape already matches that of A^{[L]}.

x_flatten = x_orig.reshape(x_orig.shape[0], -1).T
x = x_flatten / 255.

print("x shape: " + str(x.shape))

# Output
x shape: (12288, 100)

First, we decide the number of layers in the model and the number of neurons in each layer. Below we configure a model with one input layer, three hidden layers, and one output layer. We also choose an activation function for each layer; layer_activation_functions[0] corresponds to the input layer, so it is never used.

Once these are decided, we can initialize all parameters W and b and then call nn_model() to train the model. Finally, we obtain the trained parameters.

layer_dims = [12288, 20, 7, 10, 1]
init_parameters = initialize_parameters(layer_dims)
layer_activation_functions = ['none', 'relu', 'relu', 'relu', 'sigmoid']
learning_rate = 0.0075
parameters, costs = nn_model(x, y, init_parameters, layer_activation_functions, learning_rate, 3000)

With the trained parameters, we can use the model to predict on other images.

x_new_orig = load_new_data()
x_new_flatten = x_new_orig.reshape(x_new_orig.shape[0], -1).T
x_new = x_new_flatten / 255.
y_new = nn_model_predict(x_new, parameters, layer_activation_functions)

Multi-class Classification

For multi-class classification neural networks, please refer to the following article.

Conclusion

Backpropagation in a neural network involves computing partial derivatives, which makes it harder to understand. These days we no longer need to implement backpropagation ourselves; instead we use machine learning libraries such as PyTorch or TensorFlow. Still, understanding these details gives us a much deeper understanding of how neural networks work.
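
For comparison, here is a rough sketch of how a similar binary classifier could be set up in PyTorch; the layer sizes mirror the example above, while the optimizer choice and training-step structure are arbitrary choices for illustration, not part of the original article.

import torch
import torch.nn as nn

# Roughly the same architecture as the example above (12288 -> 20 -> 7 -> 10 -> 1).
model = nn.Sequential(
    nn.Linear(12288, 20), nn.ReLU(),
    nn.Linear(20, 7), nn.ReLU(),
    nn.Linear(7, 10), nn.ReLU(),
    nn.Linear(10, 1), nn.Sigmoid(),
)
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0075)

# x: (number of examples, 12288), y: (number of examples, 1), both float tensors.
def train_step(x, y):
    optimizer.zero_grad()
    y_hat = model(x)           # forward propagation
    loss = loss_fn(y_hat, y)   # cross-entropy cost
    loss.backward()            # backpropagation handled by autograd
    optimizer.step()           # parameter update
    return loss.item()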
