Neural Networks and Binary Classification

Photo by Henrique Ferreira on Unsplash
Deep learning has flourished in recent years, which has made neural networks popular. They have been used to solve a wide variety of problems. This article uses a binary classification neural network to explain neural networks in detail.

The complete code can be downloaded at .

Neural Networks

Neural networks are formed by connecting large numbers of neurons. A neural network consists of three kinds of layers: the input layer that receives the data, the output layer that produces the result, and the hidden layers in between, made up of several layers of many interconnected neurons, as shown in the figure below.

The input layer and the output layer are each a single layer, while there can be several hidden layers. Also, when we say the network in the figure below is a three-layer neural network, we mean the number of hidden layers plus the output layer; the input layer is not counted.

A neural network with 3 layers (2 hidden layers and 1 output layer).

Each neuron has an input vector \vec{x}, a weight vector \vec{w}, a scalar bias b, and a non-linear function g. A neuron therefore computes the inner product of \vec{x} and \vec{w}, adds b to obtain z, and then feeds z into g to obtain an output value a. This output value a becomes one of the input values x_i of the neurons in the next layer. The non-linear function g is called the activation function, and a is called the activation value.

A neuron with sigmoid function as its activation function.

So, several to many neurons form a layer; the layer takes the previous layer's outputs as its inputs, and the values computed by its neurons become the inputs of the next layer. Connecting layers together in this way forms the hidden layers that contain a large number of neurons.
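
To make the computation concrete, here is a minimal sketch of a single neuron with a sigmoid activation; the input values, weights, bias, and the helper name single_neuron are made up for this illustration.

import numpy as np

def single_neuron(x, w, b):
    # z = w . x + b, then a = sigmoid(z)
    z = np.dot(w, x) + b
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # inputs coming from the previous layer
w = np.array([0.1, 0.4, -0.3])   # this neuron's weights
b = 0.2                          # this neuron's bias
a = single_neuron(x, w, b)       # activation value passed on to the next layer
print(a)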

Activation Functions

If a neuron contained no non-linear function, that is, if it only performed linear operations, then even connecting a large number of neurons together would still amount to a multiple linear regression. Linear regression models cannot solve the complex problems of the real world, but complex problems can be approximated with non-linear functions. The non-linear function inside a neuron is called the activation function.

Below we introduce four commonly used activation functions. We also show how to derive their derivatives, because we will need them later in backpropagation.

The Sigmoid Function and Its Derivative

The sigmoid function maps its input to a value between 0 and 1, as shown below. It is often used in the output layer of a binary classification neural network, and we can interpret its output as a probability. For example, suppose we want a neural network that determines whether a picture contains a cat, where 1 means cat and 0 means not a cat. When the output of the output layer is greater than 0.5, we predict 1; when it is less than or equal to 0.5, we predict 0.

\sigma(z)=\frac{1}{1+e^{-z}}, 0<\sigma(z)<1

Sigmoid function.

The sigmoid function is implemented as follows.

import numpy as np


def sigmoid(Z):
    """
    Implements the sigmoid function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the sigmoid function

    Returns
    -------
    A: (ndarray of same shape as Z) or (scalar) - output from the sigmoid function
    """

    A = 1 / (1 + np.exp(-Z))
    return A

The derivative of the sigmoid function is derived as follows.

\frac{d}{dz}\sigma(z) =\frac{d}{dz}\frac{1}{1+e^{-z}} \\\\ \hphantom{\frac{d}{dz}\sigma(z)}=-(1+e^{-z})^{-2}\cdot(-e^{-z}) \\\\ \hphantom{\frac{d}{dz}\sigma(z)}=\frac{1}{1+e^{-z}}\frac{e^{-z}}{1+e^{-z}} \\\\ \hphantom{\frac{d}{dz}\sigma(z)}=\frac{1}{1+e^{-z}}\frac{1+e^{-z}-1}{1+e^{-z}} \\\\ \hphantom{\frac{d}{dz}\sigma(z)}=\frac{1}{1+e^{-z}}(1-\frac{1}{1+e^{-z}}) \\\\ \hphantom{\frac{d}{dz}\sigma(z)}=\sigma(z)(1-\sigma(z))

The derivative of the sigmoid function is implemented as follows.

def sigmoid_derivative(Z):
    """
    Implements the derivative of the sigmoid function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the sigmoid function

    Returns
    --------
    dZ: (ndarray of the same shape as Z) or (scalar) - derivative of the sigmoid function with respect to Z
    """

    g = 1 / (1 + np.exp(-Z))
    dZ = g * (1 - g)
    return dZ

The Tanh Function and Its Derivative

The tanh function is similar to the sigmoid function, but its output lies between -1 and 1, as shown below.

tanh(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}, -1<tanh(z)<1

Tanh function.

The tanh function is implemented as follows.

def tanh(Z):
    """
    Implements the tanh function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the tanh function

    Returns
    -------
    A: (ndarray of same shape as Z) or (scalar) - output from the tanh function
    """

    A = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))
    return A

The derivative of the tanh function is derived as follows:

\frac{d}{dz}tanh(z)=\frac{d}{dz}\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \\\\ \hphantom{\frac{d}{dz}tanh(z)}=\frac{(e^{z}+e^{-z})\cdot\frac{d}{dz}(e^{z}-e^{-z})-(e^{z}-e^{-z})\cdot\frac{d}{dz}(e^{z}+e^{-z})}{(e^{z}+e^{-z})^2} \\\\ \hphantom{\frac{d}{dz}tanh(z)}=\frac{(e^{z}+e^{-z})(e^{z}+e^{-z})-(e^{z}-e^{-z})(e^{z}-e^{-z})}{(e^{z}+e^{-z})^2} \\\\ \hphantom{\frac{d}{dz}tanh(z)}=\frac{(e^{z}+e^{-z})^2}{(e^{z}+e^{-z})^2}-\frac{(e^{z}-e^{-z})^2}{(e^{z}+e^{-z})^2} \\\\ \hphantom{\frac{d}{dz}tanh(z)}=1-tanh^{2}(z)

The derivative of the tanh function is implemented as follows.

def tanh_derivative(Z):
    """
    Implements the derivative of the tanh function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the tanh function

    Returns
    -------
    dZ: (ndarray of the same shape as Z) or (scalar) - derivative of the tanh function with respect to Z
    """

    g = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))
    dZ = 1 - g ** 2
    return dZ

The ReLU Function and Its Derivative

The ReLU (rectified linear unit) function is widely used in neural networks. It outputs 0 when z is less than or equal to 0, and outputs z when z is greater than 0. As you can see, ReLU is very fast to compute.

relu(z)=max(0,z)

ReLU function.

The ReLU function is implemented as follows.

def relu(Z):
    """
    Implements the ReLU function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the ReLU function

    Returns
    -------
    A: (ndarray of same shape as Z) or (scalar) - output from the ReLU function
    """

    A = np.maximum(0, Z)
    return A

The derivative of ReLU is as follows. When z is less than 0, the derivative is 0; when z is greater than 0, the derivative is 1; when z equals 0, the derivative is undefined. In practice, the convention is to set the derivative at z = 0 to 1.

\frac{d}{dz}relu(z)=\begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z<0  \end{cases}

The derivative of the ReLU function is implemented as follows.

def relu_derivative(Z):
    """
    Implements the derivative of the ReLU function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the ReLU function

    Returns
    -------
    dZ: (ndarray of the same shape as Z) or (scalar) - derivative of the ReLU function with respect to Z
    """

    dZ = np.ones_like(Z, dtype=float)
    dZ[Z < 0] = 0
    return dZ

The Leaky ReLU Function and Its Derivative

The leaky ReLU function is a variant of the ReLU function. When z is less than 0, it outputs \lambda z, where \lambda is a value between 0 and 1.

leaky\_relu(z)=max(\lambda z,z) \text{, where }0<\lambda<1

Leaky ReLU function.

The leaky ReLU function is implemented as follows.

def leaky_relu(Z, negative_slope=0.01):
    """
    Implements the leaky ReLU function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the leaky ReLU function
    negative_slope: (float) - the slope for negative values

    Returns
    -------
    A: (ndarray of same shape as Z) or (scalar) - output from the leaky ReLU function
    """

    A = np.maximum(0, Z) + negative_slope * np.minimum(0, Z)
    return A

The derivative of leaky ReLU is as follows. When z is less than 0, the derivative is \lambda; when z is greater than 0, the derivative is 1; when z equals 0, the derivative is undefined. In practice, the convention is to set the derivative at z = 0 to 1.

\frac{d}{dz}leaky\_relu(z)=\begin{cases} 1 & \text{if } z \ge 0 \\ \lambda & \text{if } z<0, \text{where } 0<\lambda<1 \end{cases}

The derivative of the leaky ReLU function is implemented as follows.

def leaky_relu_derivative(Z, negative_slope=0.01):
    """
    Implements the derivative of the leaky ReLU function.

    Parameters
    ----------
    Z: (ndarray of any shape) or (scalar) - input to the leaky ReLU function
    negative_slope: (float) - the slope for negative values

    Returns
    -------
    dZ: (ndarray of the same shape as Z) or (scalar) - derivative of the leaky ReLU function with respect to Z
    """

    dZ = np.ones_like(Z, dtype=float)
    dZ[Z < 0] = negative_slope
    return dZ

Binary Classification

The figure below shows a binary classification neural network. Because it performs binary classification, the activation function in its output layer is the sigmoid function \sigma. The figure contains quite a few variables: each layer has its own activation function, and each neuron has its own parameters w and b. Vectorizing these variables and expressing them as matrices greatly simplifies the equations, as shown in the yellow part of the figure.

Binary classification with neural network.

The vectorized formula for each layer and the shapes of the arrays are as follows.

Z^{[\ell]}=W^{[\ell]}A^{[\ell-1]}+b^{[\ell]} \\\\ A^{[\ell]}=g^{[\ell]}(Z^{[\ell]}) \\\\ \ell: \ell \text{-th layer} \\\\ m: \text{number of examples} \\\\ L: \text{number of layers} \\\\ n_h:\text{number of inputs} \\\\ n^{[\ell]}:\text{number of units in } \ell \text{-th layer}

| Layer | Shape of W | Shape of X or A^{[\ell-1]} | Shape of b | Shape of Z | Shape of A |
| --- | --- | --- | --- | --- | --- |
| 1 | W^{[1]}:(n^{[1]},n_h) | X:(n_h,m) | b^{[1]}:(n^{[1]},1) | Z^{[1]}:(n^{[1]},m) | A^{[1]}:(n^{[1]},m) |
| 2 | W^{[2]}:(n^{[2]},n^{[1]}) | A^{[1]}:(n^{[1]},m) | b^{[2]}:(n^{[2]},1) | Z^{[2]}:(n^{[2]},m) | A^{[2]}:(n^{[2]},m) |
| \vdots | \vdots | \vdots | \vdots | \vdots | \vdots |
| L | W^{[L]}:(n^{[L]},n^{[L-1]}) | A^{[L-1]}:(n^{[L-1]},m) | b^{[L]}:(n^{[L]},1) | Z^{[L]}:(n^{[L]},m) | A^{[L]}:(n^{[L]},m) |
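
To make these shapes concrete, here is a minimal sketch that checks the dimensions of a single layer with arbitrarily chosen sizes; the numbers n_prev, n_curr, and m are made up for this illustration.

import numpy as np

n_prev, n_curr, m = 4, 3, 5          # units in previous layer, units in current layer, examples

A_prev = np.random.randn(n_prev, m)  # (n^[l-1], m)
W = np.random.randn(n_curr, n_prev)  # (n^[l],   n^[l-1])
b = np.zeros((n_curr, 1))            # (n^[l],   1), broadcast across the m columns

Z = W @ A_prev + b                   # (n^[l], m)
A = 1 / (1 + np.exp(-Z))             # same shape as Z

print(Z.shape, A.shape)              # (3, 5) (3, 5)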

Gradient Descent

Gradient descent for a neural network is as follows. Because every layer has its own parameters W and b, we must compute the partial derivatives of J with respect to W and b for every layer.

L:\text{number of layers} \\\\ m:\text{number of examples} \\\\ \alpha:\text{learning rate} \\\\ \text{Parameters}: W^{[\ell]},b^{[\ell]},\ell=1,...,L \\\\ \text{Cost function}: J(W,b)=\frac{1}{m}\displaystyle\sum_{i=1}^{m}\mathcal{L}(a^{[L](i)},y^{(i)}) \\\\ \text{repeat until convergence \{} \\\\ \hphantom{xxxx}\text{Compute predictions }(\hat{y}^{(i)},i=1,...,m) \\\\ \hphantom{xxxx}W^{[\ell]}:=W^{[\ell]}-\alpha\frac{\partial J}{\partial W^{[\ell]}}, \ell=1,...,L \\\\ \hphantom{xxxx}b^{[\ell]}:=b^{[\ell]}-\alpha\frac{\partial J}{\partial b^{[\ell]}}, \ell=1,...,L \\\\ \text{\}}

The gradient descent flow is shown in the figure below.

  1. Initialize all parameters W and b.
  2. Compute \hat{y}, and store the intermediate W, b, A, and Z, because we will need these values when computing the partial derivatives in the next step. This part is forward propagation.
  3. Compute the partial derivatives of J with respect to every W and b. This part is backward propagation.
  4. Update all parameters W and b.
  5. Repeat steps 2 through 4 for num_iterations iterations.
Gradient descent of binary classification neural network.

Cost Function

In a binary classification neural network, the activation function of the output layer is the sigmoid function. We can therefore use the cost function of logistic regression as the cost function of a binary classification neural network. Logistic regression uses the cross-entropy loss as its cost function.

J(W,b)=\frac{1}{m}\displaystyle\sum_{i=1}^{m}\mathcal{L}(a^{[L](i)},y^{(i)}) \\\\ \mathcal{L}(a^{[L](i)},y^{(i)})=-y^{(i)}\log a^{[L](i)}-(1-y^{(i)})\log (1-a^{[L](i)}) \\\\ L: \text{number of layers}\\\\ a^{[L](i)}:\text{the activation in the last layer for } i \text{-th example}

The following code implements the cost function.

def compute_cost(AL, Y):
    """
    Computes the cross-entropy loss.

    Parameters
    ----------
    AL: (ndarray (1, number of examples)) - the output of the last layer
    Y: (ndarray (1, number of examples)) - true labels

    Returns
    -------
    cost: (float) - the cross-entropy cost
    """

    m = Y.shape[1]
    cost = -(1 / m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL), axis=1, keepdims=True)
    cost = np.squeeze(cost)
    return cost

Parameter Initialization

We listed the shapes of all parameters W and b earlier. Their shapes depend on the number of neurons in each layer, so before initializing the parameters we must first decide how many neurons each layer has.

In the following code, we initialize each weight matrix W with scaled random values and each bias vector b with zeros.

def initialize_parameters(layer_dims):
    """
    Initializes parameters for a deep neural network.

    Parameters
    ----------
    layer_dims: (list) - the number of units of each layer in the network.

    Returns
    -------
    (dict) with keys where 1 <= l <= len(layer_dims) - 1:
        Wl: (ndarray (layer_dims[l], layer_dims[l-1])) - weight matrix for layer l
        bl: (ndarray (layer_dims[l], 1)) - bias vector for layer l
    """

    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters[f'W{l}'] = np.random.randn(layer_dims[l], layer_dims[l - 1]) / np.sqrt(layer_dims[l - 1])
        parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))
    return parameters

Forward Propagation

Forward propagation is the first half of gradient descent in a neural network. The output A of each layer is the input of the next layer, so A is passed along layer by layer, with each layer transforming its value. The final A^{[L]} is \hat{y}. Each layer stores its computed values in caches, because the second half, backpropagation, will need them.

After running the full gradient descent, we obtain the final parameters W_{final} and b_{final}. Suppose we want to use this model to predict on X_{new}: we feed the input X_{new} together with the parameters W_{final} and b_{final} into forward propagation, and the resulting A_{new}^{[L]}=\hat{y}_{new} is the prediction for X_{new}.

Forward propagation of binary classification neural network.

In the following code, linear_forward() implements the linear-forward part of each layer in the flow. Besides returning Z, linear_forward() also returns A_prev, W, and b to the caller, which stores them in caches.

def linear_forward(A_prev, W, b):
    """
    Implements the linear part of a layer's forward propagation.

    Parameters
    ----------
    A_prev: (ndarray (size of previous layer, number of examples)) - activations from previous layer
    W: (ndarray (size of current layer, size of previous layer)) - weight matrix
    b: (ndarray (size of current layer, 1)) - bias vector

    Returns
    -------
    Z: (ndarray (size of current layer, number of examples)) - the input to the activation function
    cache: (tuple) - containing A_prev, W, b for backpropagation
    """

    Z = W @ A_prev + b
    cache = (A_prev, W, b)
    return Z, cache

The following code implements the four activation functions. These implementations are almost identical to the ones at the beginning of the article, except that here each also returns Z to the caller, which stores it in caches.

def sigmoid(Z):
    """
    Implements the sigmoid activation.

    Parameters
    ----------
    Z: (ndarray of any shape) - input to the activation function

    Returns
    -------
    A: (ndarray of same shape as Z) - output of the activation function
    cache: (ndarray) - returning Z for backpropagation
    """

    A = 1 / (1 + np.exp(-Z))
    cache = Z
    return A, cache


def tanh(Z):
    """
    Implements the tanh activation.

    Parameters
    ----------
    Z: (ndarray of any shape) - input to the activation function

    Returns
    -------
    A: (ndarray of same shape as Z) - output of the activation function
    cache: (ndarray) - returning Z for backpropagation
    """

    A = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))
    cache = Z
    return A, cache


def relu(Z):
    """
    Implements the ReLU activation.

    Parameters
    ----------
    Z: (ndarray of any shape) - input to the activation function

    Returns
    -------
    A: (ndarray of same shape as Z) - output of the activation function
    cache: (ndarray) - returning Z for backpropagation
    """

    A = np.maximum(0, Z)
    cache = Z
    return A, cache


def leaky_relu(Z, negative_slope=0.01):
    """
    Implements the Leaky ReLU activation.

    Parameters
    ----------
    Z: (ndarray of any shape) - input to the activation function
    negative_slope: (float) - the slope for negative values

    Returns
    -------
    A: (ndarray of same shape as Z) - output of the activation function
    cache: (ndarray) - returning Z for backpropagation
    """

    A = np.maximum(0, Z) + negative_slope * np.minimum(0, Z)
    cache = Z
    return A, cache

In the following code, linear_activation_forward() implements one layer in the figure above. It first calls linear_forward() to obtain Z, then passes Z to an activation function to obtain A. Finally, it returns A and the cache to the caller.

def linear_activation_forward(A_prev, W, b, activation_function):
    """
    Implements the forward propagation for the linear and activation layer.

    Parameters
    ----------
    A_prev: (ndarray (size of previous layer, number of examples)) - activations from previous layer
    W: (ndarray (size of current layer, size of previous layer)) - weight matrix
    b: (ndarray (size of current layer, 1)) - bias vector
    activation_function: (str) - the activation function to be used

    Returns
    -------
    A: (ndarray (size of current layer, number of examples)) - the output of the activation function
    cache: (tuple) - containing linear_cache (A_prev, W, b) and activation_cache (Z) for backpropagation
    """

    Z, linear_cache = linear_forward(A_prev, W, b)
    if activation_function == 'sigmoid':
        A, activation_cache = sigmoid(Z)
    elif activation_function == 'tanh':
        A, activation_cache = tanh(Z)
    elif activation_function == 'relu':
        A, activation_cache = relu(Z)
    elif activation_function == 'leaky_relu':
        A, activation_cache = leaky_relu(Z)
    else:
        raise ValueError(f'Activation function {activation_function} not supported.')
    cache = (linear_cache, activation_cache)
    return A, cache

In the following code, model_forward() implements the entire forward propagation. At the end, it returns A^{[L]} and all of the caches.

def model_forward(X, parameters, activation_functions):
    """
    Implements forward propagation for the entire network.

    Parameters
    ----------
    X: (ndarray (input size, number of examples)) - input data
    parameters: (dict) - output of initialize_parameters()
    activation_functions: (list) - the activation function for each layer. The first element is unused.

    Returns
    -------
    AL: (ndarray (output size, number of examples)) - the output of the last layer
    caches: (list of tuples) - containing caches for each layer
    """

    caches = []
    A = X
    L = len(activation_functions)
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters[f'W{l}'], parameters[f'b{l}'], activation_functions[l])
        caches.append(cache)
    return A, caches

Backpropagation (Backward Propagation)

In gradient descent, we must compute the partial derivatives of J(W, b) with respect to W and b in every layer in order to update them. When a neural network has many layers, computing these partial derivatives can take a large amount of time. Backpropagation speeds up this computation. Computing the partial derivatives of a layer requires values that have already been computed in the next layer. If we computed from the first layer onward, many values would have to be recomputed over and over. By instead computing from the last layer backward, each layer can pass the values it has computed to the previous layer, which can use them directly without recomputing them, as shown in the figure below.

Backpropagation of binary classification neural network.

Backpropagation is essentially the chain rule of differentiation.

\frac{dy}{dx}=\frac{du}{dx}\cdot\frac{dy}{du}

Following the figure above, we first compute \frac{\partial J}{\partial A^{[L]}} and then \frac{\partial J}{\partial Z^{[L]}}. Here the activation function of the last layer is the sigmoid function \sigma.

\frac{\partial J}{\partial A^{[L]}}=\frac{1}{m}\frac{\partial}{\partial A^{[L]}}[-Y\log{A^{[L]}}-(1-Y)\log{(1-A^{[L]})}] \\\\ \hphantom{\frac{\partial J}{\partial A^{[L]}}}=-\frac{1}{m}(\frac{Y}{A^{[L]}}-\frac{1-Y}{1-A^{[L]}}) \\\\ \frac{\partial A^{[L]}}{\partial Z^{[L]}}=\frac{\partial}{\partial Z^{[L]}}\sigma(Z^{[L]}) \\\\ \hphantom{\frac{\partial A^{[L]}}{\partial Z^{[L]}}}=\sigma(Z^{[L]})(1-\sigma(Z^{[L]})) \\\\ \hphantom{\frac{\partial A^{[L]}}{\partial Z^{[L]}}}=A^{[L]}(1-A^{[L]}) \\\\ \frac{\partial J}{\partial Z^{[L]}}=\frac{\partial A^{[L]}}{\partial Z^{[L]}}\frac{\partial J}{\partial A^{[L]}}

Next, we can use the results above to compute \frac{\partial J}{\partial W^{[L]}} and \frac{\partial J}{\partial b^{[L]}}.

Z^{[L]}=W^{[L]}A^{[L-1]}+b^{[L]} \\\\ \frac{\partial Z^{[L]}}{\partial W^{[L]}}=A^{[L-1]} \\\\ \frac{\partial J}{\partial W^{[L]}}=\frac{\partial Z^{[L]}}{\partial W^{[L]}}\frac{\partial J}{\partial Z^{[L]}}=\frac{\partial J}{\partial Z^{[L]}}A^{[L-1]T} \\\\ \frac{\partial J}{\partial b^{[L]}}=\frac{\partial Z^{[L]}}{\partial b^{[L]}}\frac{\partial J}{\partial Z^{[L]}}=\frac{\partial J}{\partial Z^{[L]}}

Finally, all of the partial derivatives are computed as follows.

L:\text{number of layers} \\\\ \ell=1,...,L \\\\ A^{[0]}=X \\\\ \frac{\partial J}{\partial Z^{[\ell]}}=\frac{\partial A^{[\ell]}}{\partial Z^{[\ell]}}\frac{\partial J}{\partial A^{[\ell]}}=\frac{\partial A^{[\ell]}}{\partial Z^{[\ell]}}\frac{\partial Z^{[\ell+1]}}{\partial A^{[\ell]}}\frac{\partial J}{\partial Z^{[\ell+1]}} \\\\ \frac{\partial J}{\partial A^{[\ell]}}=\frac{\partial Z^{[\ell+1]}}{\partial A^{[\ell]}}\frac{\partial J}{\partial Z^{[\ell+1]}}=W^{[\ell+1]T}\frac{\partial J}{\partial Z^{[\ell+1]}} \\\\ \frac{\partial J}{\partial W^{[\ell]}}=\frac{\partial J}{\partial Z^{[\ell]}}A^{[\ell-1]T} \\\\ \frac{\partial J}{\partial b^{[\ell]}}=\displaystyle\sum_{i=1}^{m}\frac{\partial J}{\partial Z^{[\ell]}}

In the following code, linear_backward() implements the linear-backward part of the figure.

def linear_backward(dZ, cache):
    """
    Implements the linear portion of backward propagation for a single layer.

    Parameters
    ----------
    dZ: (ndarray (size of current layer, number of examples)) - gradient of the cost with respect to the linear output
    cache: (tuple) - containing A_prev, W, b from the forward propagation

    Returns
    -------
    dA_prev: (ndarray (size of previous layer, number of examples)) - gradient of the cost with respect to the activation from the previous layer
    dW: (ndarray (size of current layer, size of previous layer)) - gradient of the cost with respect to W
    db: (ndarray (size of current layer, 1)) - gradient of the cost with respect to b
    """

    A_prev, W, b = cache
    dW = dZ @ A_prev.T
    db = np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

The following code implements the derivatives g' of the four activation functions. Each one multiplies g' by \frac{\partial J}{\partial A^{[\ell]}} and returns \frac{\partial J}{\partial Z^{[\ell]}}, which is the activation-backward part of the figure.

def sigmoid_backward(dA, cache):
    """
    Implements the backward propagation for a single sigmoid unit.

    Parameters
    ----------
    dA: (ndarray of any shape) - post-activation gradient
    cache: (ndarray) - Z from the forward propagation

    Returns
    --------
    dZ: (ndarray of the same shape as A) - gradient of the cost with respect to Z
    """

    Z = cache
    g = 1 / (1 + np.exp(-Z))
    g_prime = g * (1 - g)
    dZ = dA * g_prime
    return dZ


def tanh_backward(dA, cache):
    """
    Implements the backward propagation for a single tanh unit.

    Parameters
    ----------
    dA: (ndarray of any shape) - post-activation gradient
    cache: (ndarray) - Z from the forward propagation

    Returns
    -------
    dZ: (ndarray of the same shape as A) - gradient of the cost with respect to Z
    """

    Z = cache
    g = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))
    g_prime = 1 - g ** 2
    dZ = dA * g_prime
    return dZ


def relu_backward(dA, cache):
    """
    Implements the backward propagation for a single ReLU unit.

    Parameters
    ----------
    dA: (ndarray of any shape) - post-activation gradient
    cache: (ndarray) - Z from the forward propagation

    Returns
    -------
    dZ: (ndarray of the same shape as A) - gradient of the cost with respect to Z
    """

    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z < 0] = 0
    return dZ


def leaky_relu_backward(dA, cache, negative_slope=0.01):
    """
    Implements the backward propagation for a single Leaky ReLU unit.

    Parameters
    ----------
    dA: (ndarray of any shape) - post-activation gradient
    cache: (ndarray) - Z from the forward propagation
    negative_slope: (float) - the slope for negative values

    Returns
    -------
    dZ: (ndarray of the same shape as A) - gradient of the cost with respect to Z
    """

    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z < 0] *= negative_slope
    return dZ

In the following code, linear_activation_backward() implements one layer in the figure.

def linear_activation_backward(dA, cache, activation_function):
    """
    Implements the backward propagation for the linear and activation layer.

    Parameters
    ----------
    dA: (ndarray (size of current layer, number of examples)) - post-activation gradient for current layer
    cache: (tuple) - containing linear_cache (A_prev, W, b) and activation_cache (Z) for backpropagation
    activation_function: (str) - the activation function to be used

    Returns
    -------
    dA_prev: (ndarray (size of previous layer, number of examples)) - gradient of the cost with respect to the activation from the previous layer
    dW: (ndarray (size of current layer, size of previous layer)) - gradient of the cost with respect to W
    db: (ndarray (size of current layer, 1)) - gradient of the cost with respect to b
    """

    linear_cache, activation_cache = cache
    if activation_function == 'sigmoid':
        dZ = sigmoid_backward(dA, activation_cache)
    elif activation_function == 'tanh':
        dZ = tanh_backward(dA, activation_cache)
    elif activation_function == 'relu':
        dZ = relu_backward(dA, activation_cache)
    elif activation_function == 'leaky_relu':
        dZ = leaky_relu_backward(dA, activation_cache)
    else:
        raise ValueError(f'Activation function {activation_function} not supported.')
    dA_prev, dW, db = linear_backward(dZ, linear_cache)
    return dA_prev, dW, db

Finally, model_backward() in the following code implements the entire backpropagation.

def model_backward(AL, Y, caches, activation_functions):
    """
    Implements the backward propagation for the entire network.

    Parameters
    ----------
    AL: (ndarray (output size, number of examples)) - the output of the last layer
    Y: (ndarray (output size, number of examples)) - true labels
    caches: (list of tuples) - containing linear_cache (A_prev, W, b) and activation_cache (Z) for each layer
    activation_functions: (list) - the activation function for each layer. The first element is unused.

    Returns
    -------
    gradients: (dict) with keys where 0 <= l <= len(activation_functions) - 1:
        dA{l-1}: (ndarray (size of previous layer, number of examples)) - gradient of the cost with respect to the activation for previous layer l - 1
        dWl: (ndarray (size of current layer, size of previous layer)) - gradient of the cost with respect to W for layer l
        dbl: (ndarray (size of current layer, 1)) - gradient of the cost with respect to b for layer l
    """

    gradients = {}
    L = len(activation_functions)
    m = AL.shape[1]
    dAL = -(1 / m) * (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    dA_prev = dAL
    for l in reversed(range(1, L)):
        current_cache = caches[l - 1]
        dA_prev, dW, db = linear_activation_backward(dA_prev, current_cache, activation_functions[l])
        gradients[f'dA{l - 1}'] = dA_prev
        gradients[f'dW{l}'] = dW
        gradients[f'db{l}'] = db
    return gradients

Putting It All Together

After running backpropagation, we have the partial derivatives of J with respect to all parameters W and b. We can then call the following code to update all W and b.

def update_parameters(parameters, gradients, learning_rate):
    """
    Updates parameters using the gradient descent update rule.

    Parameters
    ----------
    parameters: (dict) - containing the parameters
    gradients: (dict) - containing the gradients
    learning_rate: (float) - the learning rate

    Returns
    -------
    params: (dict) - containing the updated parameters
    """

    updated_parameters = parameters.copy()
    L = len(updated_parameters) // 2
    for l in range(L):
        updated_parameters[f'W{l + 1}'] = parameters[f'W{l + 1}'] - learning_rate * gradients[f'dW{l + 1}']
        updated_parameters[f'b{l + 1}'] = parameters[f'b{l + 1}'] - learning_rate * gradients[f'db{l + 1}']
    return updated_parameters

nn_model() in the following code implements the entire model. It first runs forward propagation, then backpropagation, and finally updates the parameters.

def nn_model(X, Y, init_parameters, layer_activation_functions, learning_rate, num_iterations):
    """
    Implements a neural network.

    Parameters
    ----------
    X: (ndarray (input size, number of examples)) - input data
    Y: (ndarray (output size, number of examples)) - true labels
    init_parameters: (dict) - the initial parameters for the network
    layer_activation_functions: (list) - the activation function for each layer. The first element is unused.
    learning_rate: (float) - the learning rate
    num_iterations: (int) - the number of iterations

    Returns
    -------
    parameters: (dict) - the learned parameters
    costs: (list) - the costs at every 100th iteration
    """

    costs = []
    parameters = init_parameters.copy()

    for i in range(num_iterations):
        AL, caches = model_forward(X, parameters, layer_activation_functions)
        cost = compute_cost(AL, Y)
        gradients = model_backward(AL, Y, caches, layer_activation_functions)
        parameters = update_parameters(parameters, gradients, learning_rate)

        if i % 100 == 0 or i == num_iterations - 1:
            costs.append(cost)

    return parameters, costs

Once the parameters are trained, we can use nn_model_predict() below to make predictions.

def nn_model_predict(X, parameters, activation_functions):
    """
    Predicts the output of the neural network.

    Parameters
    ----------
    X: (ndarray (input size, number of examples)) - input data
    parameters: (dict) - the learned parameters
    activation_functions: (list) - the activation function for each layer. The first element is unused.

    Returns
    -------
    predictions: (ndarray (1, number of examples)) - the predicted labels
    """

    probabilities, _ = model_forward(X, parameters, activation_functions)
    predictions = probabilities.copy()
    predictions[predictions > 0.5] = 1
    predictions[predictions <= 0.5] = 0
    return predictions

Example

We will use an example to show how to use our model. First, we load the training data x_orig and y. x_orig is an array containing 100 images; each image is 64 x 64 with three channels. y is an array of 0s and 1s, where 1 means the image contains a cat and 0 means it does not.

x_orig, y = load_data()

print(f'x_orig shape: {x_orig.shape}')
print(f'y shape: {y.shape}')

# Output
x_orig shape: (100, 64, 64, 3)
y shape: (1, 100)
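
load_data() itself is not shown in this article. The following is a minimal stand-in that generates random data with the same shapes, just so the example can be run end to end; the function body and the synthetic data are assumptions for illustration only.

import numpy as np

def load_data(m=100, size=64, channels=3, seed=0):
    # Stand-in for the article's load_data(): random "images" and random 0/1 labels.
    rng = np.random.default_rng(seed)
    x_orig = rng.integers(0, 256, size=(m, size, size, channels))
    y = rng.integers(0, 2, size=(1, m))
    return x_orig, y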

We listed earlier that X has shape (n_h, m), so each image becomes a column vector. Below we reshape x_orig and scale the pixel values from the range 0 to 255 down to 0 to 1. We do not need to reshape y, because its shape already matches that of A^{[L]}.

x_flatten = x_orig.reshape(x_orig.shape[0], -1).T
x = x_flatten / 255.

print("x shape: " + str(x.shape))

# Output
x shape: (12288, 100)

First, we decide the number of layers in the model and the number of neurons in each layer. Below we configure a model with one input layer, three hidden layers, and one output layer. We also choose an activation function for each layer; layer_activation_functions[0] corresponds to the input layer, so it is never used.

Once these are decided, we can initialize all parameters W and b and then call nn_model() to train the model. Finally, we obtain the trained parameters.

layer_dims = [12288, 20, 7, 10, 1]
init_parameters = initialize_parameters(layer_dims)
layer_activation_functions = ['none', 'relu', 'relu', 'relu', 'sigmoid']
learning_rate = 0.0075
parameters, costs = nn_model(x, y, init_parameters, layer_activation_functions, learning_rate, 3000)

With the trained parameters, we can use the model to predict on other images.

x_new_orig = load_new_data()
x_new_flatten = x_new_orig.reshape(x_new_orig.shape[0], -1).T
x_new = x_new_flatten / 255.
y_new = nn_model_predict(x_new, parameters, layer_activation_functions)

Multi-class Classification

For multi-class classification neural networks, please refer to the following article.

Conclusion

Backpropagation in a neural network involves computing partial derivatives, which makes it harder to understand. These days we no longer need to implement backpropagation ourselves; instead we use machine learning libraries such as PyTorch or TensorFlow. Still, understanding these details gives us a much deeper understanding of how neural networks work.
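
For comparison, here is a rough sketch of how a similar binary classifier could be set up in PyTorch; the layer sizes mirror the example above, while the optimizer choice and training-step structure are arbitrary choices for illustration, not part of the original article.

import torch
import torch.nn as nn

# Roughly the same architecture as the example above (12288 -> 20 -> 7 -> 10 -> 1).
model = nn.Sequential(
    nn.Linear(12288, 20), nn.ReLU(),
    nn.Linear(20, 7), nn.ReLU(),
    nn.Linear(7, 10), nn.ReLU(),
    nn.Linear(10, 1), nn.Sigmoid(),
)
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0075)

# x: (number of examples, 12288), y: (number of examples, 1), both float tensors.
def train_step(x, y):
    optimizer.zero_grad()
    y_hat = model(x)           # forward propagation
    loss = loss_fn(y_hat, y)   # cross-entropy cost
    loss.backward()            # backpropagation handled by autograd
    optimizer.step()           # parameter update
    return loss.item()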
