Convolutional Neural Networks (CNN)

A convolutional neural network (CNN) is a neural-network-based method for computer vision and image processing. In this article, we introduce how each type of layer in a CNN works.

The complete code is available for download.

Convolutions
Convolutional Neural Networks
LeNet-5
Conclusion
References

Convolutions

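In image processing, convolution is a mathematical operation used to modify or analyze an image. It combines two functions (the image and a kernel) to produce a third function, which expresses how the shape or features of one function are modified by the other. Convolution is widely used in computer vision for tasks such as edge detection, blurring, sharpening, noise reduction, and feature extraction.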

The mathematical definition is as follows. The convolution operation is denoted by $\ast$.

$\displaystyle C(x,y)=(I\ast K)(x,y)=\sum_{a=-k}^{k}\sum_{b=-k}^{k}K(a,b)\cdot I(x-a,y-b) \\\\ I(x,y):\text{The intensity of the pixel at position }(x,y)\text{ in the image.} \\\\ K(a,b):\text{The kernel value at position }(a, b). \\\\ k:\text{The radius of the kernel (e.g., for a }3\times3\text{ kernel, }k=1\text{; for a }5\times5\text{ kernel, }k=2\text{).}$

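We use the example in the figure below to explain the convolution operation. The figure shows a 5 x 5 input image, a 3 x 3 kernel, and a 3 x 3 output image. Imagine a sliding window of the same size as the kernel placed over the input image. The sliding window starts at the top-left corner of the input image. The values inside the window are multiplied element-wise with the kernel and summed, which gives the top-left value of the output image (shown in blue in the figure). The window then moves one cell to the right and the same multiply-and-sum is performed, which gives the second value of the output image (shown in green). The window moves one more cell to the right and repeats the computation. Because it can then move no further to the right, it returns to the left edge and moves down one row. This repeats until the window can move neither right nor down.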
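
The sliding-window procedure can be sketched in plain Python. This is only an illustration under stated assumptions, not the article's original code: the function name `slide_and_sum` and the sample pixel values are made up for the example. As in the description, the kernel is not flipped, which is what deep learning libraries compute under the name convolution (strictly speaking, cross-correlation); the formula with $I(x-a,y-b)$ flips the kernel, and the two agree whenever the kernel is symmetric.

```python
def slide_and_sum(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the output."""
    h, w = len(image), len(image[0])
    f = len(kernel)  # the kernel is assumed to be square: f x f
    output = []
    for top in range(h - f + 1):        # move the window down one row at a time
        row = []
        for left in range(w - f + 1):   # move the window right one cell at a time
            total = 0
            for a in range(f):
                for b in range(f):
                    total += image[top + a][left + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 5 x 5 input whose pixel value is row_index * 5 + column_index.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
# A 3 x 3 kernel of ones simply sums every window.
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

result = slide_and_sum(image, kernel)
for row in result:
    print(row)
# [54, 63, 72]
# [99, 108, 117]
# [144, 153, 162]
```

The 5 x 5 input and 3 x 3 kernel give a 3 x 3 output, matching the sizes in the figure.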

Kernel

Different kernels are used for different tasks. The kernel in the figure below is the Prewitt operator, which is used for edge detection.
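
To see why such a kernel responds to edges, here is a small sketch. It assumes the common vertical-edge form of the Prewitt kernel (the sign convention varies between references, and the figure's exact kernel is not reproduced here); the helper `apply_kernel` and the test image are hypothetical.

```python
# Prewitt kernel that responds to vertical edges (horizontal intensity changes).
PREWITT_X = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]


def apply_kernel(image, kernel):
    """Sliding-window multiply-and-sum with stride 1 and no padding."""
    f = len(kernel)
    out_h = len(image) - f + 1
    out_w = len(image[0]) - f + 1
    return [
        [
            sum(image[y + a][x + b] * kernel[a][b] for a in range(f) for b in range(f))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# Hypothetical 6 x 6 image: dark (0) on the left half, bright (10) on the right half.
image = [[0, 0, 0, 10, 10, 10] for _ in range(6)]

edges = apply_kernel(image, PREWITT_X)
for row in edges:
    print(row)
# Every row prints [0, 30, 30, 0]
```

The response is zero over the flat regions and large only where the window straddles the boundary between dark and bright pixels.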

The size of the output image can be computed from the sizes of the input image and the kernel.

$\text{Input image}:w\times h \\\\ \text{Kernel}:f\times f \\\\ \text{Output image}:(w-f+1)\times(h-f+1)$

Padding

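Notice that every convolution operation makes the output image smaller than its input. To counter this, we can pad the input image to enlarge it before running the convolution operation. With a padding of one, a single ring of zeros is added around the border of the input image, as shown below.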

So when we use padding, the size of the output image can be computed as follows.

$\text{Input image}:w\times h \\\\ \text{Kernel}:f\times f \\\\ \text{Padding}:p \\\\ \text{Output image}:(w+2p-f+1)\times(h+2p-f+1)$
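
A minimal sketch of zero padding, assuming a plain list-of-lists image; the helper name `zero_pad` and the 3 x 3 sample image are hypothetical. It shows that with $p=1$ and a 3 x 3 kernel, the output keeps the same size as the original input.

```python
def zero_pad(image, p):
    """Return a copy of `image` surrounded by `p` rings of zeros."""
    padded_width = len(image[0]) + 2 * p
    top = [[0] * padded_width for _ in range(p)]
    bottom = [[0] * padded_width for _ in range(p)]
    middle = [[0] * p + list(row) + [0] * p for row in image]
    return top + middle + bottom


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
padded = zero_pad(image, p=1)
for row in padded:
    print(row)
# [0, 0, 0, 0, 0]
# [0, 1, 2, 3, 0]
# [0, 4, 5, 6, 0]
# [0, 7, 8, 9, 0]
# [0, 0, 0, 0, 0]

# Output width for a 3 x 3 kernel: w + 2p - f + 1
w, p, f = 3, 1, 3
out_size = w + 2 * p - f + 1
print(out_size)  # 3, the same as the unpadded input width
```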

Stride

The number of cells the sliding window moves at each step is called the stride. The figure below shows how the sliding window moves when the stride is two.

So when we use a stride, the size of the output image can be computed as follows.

$\text{Input image}:w\times h \\\\ \text{Kernel}:f\times f \\\\ \text{Padding}:p \\\\ \text{Stride}:s \\\\ \text{Output image}:\lfloor\frac{w+2p-f}{s}+1\rfloor \times \lfloor\frac{h+2p-f}{s}+1\rfloor$
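
The formula is easy to check with a few numbers. The helper name `conv_output_size` below is an assumption made for this sketch; it computes the output length along one dimension, using integer division for the floor.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output length along one dimension: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


no_padding = conv_output_size(5, 3)          # 5 x 5 input, 3 x 3 kernel
with_padding = conv_output_size(5, 3, p=1)   # padding of one keeps the size
with_stride = conv_output_size(7, 3, s=2)    # stride of two
uneven = conv_output_size(8, 3, s=2)         # (8 - 3) / 2 = 2.5 is floored to 2

print(no_padding, with_padding, with_stride, uneven)  # 3 5 3 3
```

The last case shows the effect of the floor: when the window cannot complete another full step, the leftover cells at the edge are ignored.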

Convolutional Neural Networks

Now that we understand convolutions, a convolutional neural network (CNN) is simply a neural network that applies convolution. Like any neural network, a CNN has many layers, as shown in the figure below. A CNN has three main types of layers, which we introduce one by one.

If you are not yet familiar with neural networks, please read the following article first.

- Deep Learning
- Neural Networks

Neural Networks and Binary Classification

By Wayne
12/01/2025

Convolution Layers

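A convolution layer contains several kernels, which we will call filters from here on. The input may be three-dimensional; convolving it with a single filter produces a two-dimensional output. Because a convolution layer can contain several filters, convolving the input with each of them yields several two-dimensional outputs. Stacking these outputs gives the three-dimensional output of the convolution layer, as shown in the figure below.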

In a convolution layer, the trainable parameters are the filters, which correspond to W and b in a neural network. We also have to set the layer's hyperparameters, namely the filter size, padding, and stride introduced earlier. Because a convolution layer involves both trainable parameters and hyperparameters, we must keep careful track of the dimensions of each part.

$f^{[\ell]}:\text{filter size} \\\\ p^{[\ell]}:\text{padding} \\\\ s^{[\ell]}:\text{stride} \\\\ n_c^{[\ell]}:\text{number of filters} \\\\ \text{Each filter}:f^{[\ell]}\times f^{[\ell]}\times n_c^{[\ell-1]} \\\\ \text{Activations }a^{[\ell]}:n_h^{[\ell]}\times n_w^{[\ell]}\times n_c^{[\ell]} \\\\ \text{Weights }W^{[\ell]}:f^{[\ell]}\times f^{[\ell]}\times n_c^{[\ell-1]}\times n_c^{[\ell]} \\\\ \text{bias }b^{[\ell]}:1\times1\times1\times n_c^{[\ell]} \\\\ \text{Input }a^{[\ell-1]}:n_h^{[\ell-1]}\times n_w^{[\ell-1]}\times n_c^{[\ell-1]} \\\\ \text{Output }a^{[\ell]}:n_h^{[\ell]}\times n_w^{[\ell]}\times n_c^{[\ell]} \\\\ n_h^{[\ell]}=\lfloor\frac{n_h^{[\ell-1]}+2p^{[\ell]}-f^{[\ell]}}{s^{[\ell]}}+1\rfloor \\\\ n_w^{[\ell]}=\lfloor\frac{n_w^{[\ell-1]}+2p^{[\ell]}-f^{[\ell]}}{s^{[\ell]}}+1\rfloor$
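
As a sanity check on these dimensions, the sketch below computes the shapes and the number of trainable parameters for one convolution layer. The function name `conv_layer_summary` and the sample layer (a 32 x 32 x 1 input with six 5 x 5 filters) are assumptions chosen for illustration.

```python
def conv_layer_summary(n_h_prev, n_w_prev, n_c_prev, f, n_c, p=0, s=1):
    """Return the output shape, weight shape, bias shape, and parameter count."""
    n_h = (n_h_prev + 2 * p - f) // s + 1
    n_w = (n_w_prev + 2 * p - f) // s + 1
    return {
        "output": (n_h, n_w, n_c),
        "weights": (f, f, n_c_prev, n_c),
        "bias": (1, 1, 1, n_c),
        # Each filter has f * f * n_c_prev weights plus one bias.
        "parameters": f * f * n_c_prev * n_c + n_c,
    }


summary = conv_layer_summary(32, 32, 1, f=5, n_c=6)
for name, value in summary.items():
    print(f"{name}: {value}")
# output: (28, 28, 6)
# weights: (5, 5, 1, 6)
# bias: (1, 1, 1, 6)
# parameters: 156
```

Note that the parameter count does not depend on the input height or width, only on the filter size and the channel counts. PyTorch stores the same weights in a different axis order (output channels, input channels, height, width).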

Pooling Layers

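Compared with convolution layers, pooling layers are much simpler. For a pooling layer, we set two hyperparameters (the filter size and the stride) and choose which kind of pooling to use. In the figure below, the filter size is 2 and the stride is 2. Imagine a sliding window whose size equals the filter size, 2 x 2. This pooling layer is a max pooling layer, so it outputs the largest value inside the sliding window. The window then moves right by the stride, that is, two cells, and outputs the largest value in the new window.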

So a pooling layer has only hyperparameters and no trainable parameters. In other words, there is nothing to train in a pooling layer.
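
Max pooling can be sketched in a few lines of plain Python. The function name `max_pool` and the 4 x 4 sample values are hypothetical; the filter size and stride follow the example described above.

```python
def max_pool(image, f=2, s=2):
    """Max pooling with an f x f window and stride s (no padding)."""
    out_h = (len(image) - f) // s + 1
    out_w = (len(image[0]) - f) // s + 1
    return [
        [
            max(image[y * s + a][x * s + b] for a in range(f) for b in range(f))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


image = [
    [1, 3, 2, 1],
    [2, 9, 1, 1],
    [1, 3, 2, 3],
    [5, 6, 1, 2],
]

pooled = max_pool(image)
for row in pooled:
    print(row)
# [9, 2]
# [6, 3]
```

The function contains no weights at all, which is why a pooling layer adds nothing to the number of trainable parameters.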

Fully Connected Layers (FC Layers)

Fully connected layers work like a traditional neural network, as shown in the figure below. They therefore have the trainable parameters W and b.

The following expressions may make this easier to understand.

$a^{[\ell-1]}:n_h^{[\ell-1]}\times1 \\\\ W^{[\ell]}:n_h^{[\ell]}\times n_h^{[\ell-1]} \\\\ b^{[\ell]}:n_h^{[\ell]}\times1 \\\\ a^{[\ell]}:n_h^{[\ell]}\times1 \\\\ a^{[\ell]}=\sigma(W^{[\ell]}\cdot a^{[\ell-1]}+b^{[\ell]})$
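
The forward computation of a single fully connected layer can be written out directly. This sketch assumes that $\sigma$ is the sigmoid function (the formula allows any activation function); the function names and the sample weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fc_forward(W, a_prev, b):
    """Compute a = sigmoid(W . a_prev + b) for one fully connected layer."""
    outputs = []
    for weights_row, bias in zip(W, b):
        z = sum(w * x for w, x in zip(weights_row, a_prev)) + bias
        outputs.append(sigmoid(z))
    return outputs


# Hypothetical layer with 3 inputs and 2 outputs, so W is 2 x 3 and b has 2 entries.
W = [
    [1.0, -1.0, 0.0],
    [0.5, 0.5, 0.5],
]
b = [0.0, 0.0]
a_prev = [2.0, 2.0, 1.0]

a = fc_forward(W, a_prev, b)
print(a)  # the first entry is sigmoid(0) = 0.5
```

Each row of W produces one output value, so the number of rows equals the output size and the number of columns equals the input size.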

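Below is an example of a fully connected layer.

In a CNN, before entering the fully connected layers, the multi-dimensional data is first converted into an n x 1 vector, a step called flatten. As a result, the parameters W and b of the fully connected layers are very large. In a small CNN such as LeNet-5, most of the trainable parameters are in the fully connected layers, while only a small portion are in the filters of the convolution layers.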
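
A quick count makes this concrete. The sketch below uses the layer sizes of the LeNet-5 configuration described later in this article; the helper names are assumptions, and the count ignores the batch normalization parameters that the PyTorch implementation adds.

```python
def conv_params(f, n_c_prev, n_c):
    """Weights plus one bias per filter."""
    return (f * f * n_c_prev + 1) * n_c


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out


conv_total = conv_params(5, 1, 6) + conv_params(5, 6, 16)
fc_total = fc_params(400, 120) + fc_params(120, 84) + fc_params(84, 10)
fc_share = fc_total / (conv_total + fc_total)

print(f"Convolution layers: {conv_total}")        # 2572
print(f"Fully connected layers: {fc_total}")      # 59134
print(f"Share in FC layers: {fc_share:.1%}")      # 95.8%
```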

LeNet-5

LeNet-5 is a CNN architecture proposed by Yann LeCun et al. in 1998. It was trained on the MNIST dataset and achieved good results. Its architecture and the hyperparameters of each layer are as follows.

Input: a 32 x 32 x 1 grayscale image.
Conv layer: filter size 5, 6 filters, stride 1, padding 0.
Pooling layer: filter size 2, stride 2.
Conv layer: filter size 5, 16 filters, stride 1, padding 0.
Pooling layer: filter size 2, stride 2.
FC layer: input size 400 (16 x 5 x 5), output size 120.
FC layer: input size 120, output size 84.
FC layer: input size 84, output size 10.

LeNet-5 (source: Gradient-Based Learning Applied to Document Recognition).
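
The input size of the first FC layer follows from the output-size formula given earlier. The sketch below traces the shape through the convolution and pooling layers listed above; the helper `out_size` and the `layers` table are assumptions written for this illustration.

```python
def out_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1 along one dimension."""
    return (n + 2 * p - f) // s + 1


# (name, filter size, stride, number of filters or None for pooling)
layers = [
    ("conv1", 5, 1, 6),
    ("pool1", 2, 2, None),
    ("conv2", 5, 1, 16),
    ("pool2", 2, 2, None),
]

size, channels = 32, 1  # 32 x 32 x 1 grayscale input
shapes = []
for name, f, s, n_c in layers:
    size = out_size(size, f, s=s)
    if n_c is not None:  # pooling keeps the number of channels
        channels = n_c
    shapes.append((name, size, size, channels))
    print(f"{name}: {size} x {size} x {channels}")
# conv1: 28 x 28 x 6
# pool1: 14 x 14 x 6
# conv2: 10 x 10 x 16
# pool2: 5 x 5 x 16

flattened = size * size * channels
print(f"flatten: {flattened}")  # flatten: 400
```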

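Next, we implement LeNet-5 in PyTorch. First, we build each layer according to the hyperparameters above. For the pooling layers, we choose max pooling. After each conv layer, we normalize the output with batch normalization. We use ReLU as the activation function. Before entering the FC layers, we flatten the output of the previous layer. The PyTorch implementation is quite compact.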

import torch
from torch import nn


class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(num_features=6),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(num_features=16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features=16 * 5 * 5, out_features=120),
            nn.ReLU(),
            nn.Linear(in_features=120, out_features=84),
            nn.ReLU(),
            nn.Linear(in_features=84, out_features=10),
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.fc(x)
        return x

PyTorch lets us print the architecture of the model.

model = LeNet5()
print(model)

# Output
LeNet5(
  (conv1): Sequential(
    (0): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
    (1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (conv2): Sequential(
    (0): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
    (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=400, out_features=120, bias=True)
    (2): ReLU()
    (3): Linear(in_features=120, out_features=84, bias=True)
    (4): ReLU()
    (5): Linear(in_features=84, out_features=10, bias=True)
  )
)

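Next, we train our LeNet-5 on the MNIST dataset. PyTorch provides a built-in function for loading MNIST. Because LeNet-5 takes a 32 x 32 grayscale image as input, we build a transform that resizes the loaded MNIST images to 32 x 32. The MNIST images are already grayscale. We then normalize the loaded images. You could use mean=0.5 and std=0.5 here. However, the values used below, mean=0.1307 and std=0.3081, were computed from the MNIST dataset itself, so normalizing MNIST images with them tends to give better results.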

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose(
    [
        transforms.Resize((32, 32)),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.1307,), std=(0.3081,)),
    ]
)
train_data = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST(root='./data', train=False, download=True, transform=transform)
print(f'Train data: {len(train_data)}')
print(f'Test data: {len(test_data)}')
print(f'Image shape: {train_data[0][0].shape}')
print(f'Classes: {train_data.classes}')

train_loader = DataLoader(dataset=train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(dataset=test_data, batch_size=64, shuffle=False)

# Output
Train data: 60000
Test data: 10000
Image shape: torch.Size([1, 32, 32])
Classes: ['0 - zero', '1 - one', '2 - two', '3 - three', '4 - four', '5 - five', '6 - six', '7 - seven', '8 - eight', '9 - nine']

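With the dataset ready, we train the model. We use cross-entropy loss as the cost function and the Adam optimizer with a learning rate of 0.001. When loading the dataset, we grouped it into batches of 64 images. During training, we pass one batch through the model's forward propagation and then run back propagation. After all batches have been processed, we repeat the pass over the whole dataset, for 10 epochs in total.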

from torch import optim

cost = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

model.train()
epochs = 10
for epoch in range(epochs):
    running_loss = 0.0

    for batch, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = cost(outputs, targets)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    print(f"Epoch [{epoch + 1}/{epochs}], Loss: {running_loss / len(train_loader):.4f}")

# Output
Epoch [1/10], Loss: 0.1646
Epoch [2/10], Loss: 0.0545
Epoch [3/10], Loss: 0.0421
Epoch [4/10], Loss: 0.0341
Epoch [5/10], Loss: 0.0287
Epoch [6/10], Loss: 0.0251
Epoch [7/10], Loss: 0.0230
Epoch [8/10], Loss: 0.0186
Epoch [9/10], Loss: 0.0170
Epoch [10/10], Loss: 0.0157
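
To put these loss values in perspective, here is a plain-Python sketch of what cross-entropy measures for a single sample, computed from the raw scores (logits) that the last FC layer outputs. The function name `cross_entropy` and the sample scores are hypothetical. `nn.CrossEntropyLoss` takes logits in the same way and, by default, averages the per-sample values over the batch.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    largest = max(logits)  # subtracting the max keeps exp() from overflowing
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target]


# Hypothetical raw scores for the 10 digit classes.
undecided = [0.0] * 10   # every class is equally likely
confident = [0.0] * 10
confident[3] = 10.0      # strongly favors class 3

loss_undecided = cross_entropy(undecided, target=3)
loss_confident = cross_entropy(confident, target=3)
print(f"undecided: {loss_undecided:.4f}")  # ln(10), about 2.3026
print(f"confident: {loss_confident:.4f}")
```

A model that guesses uniformly among 10 classes has a loss of about 2.30, so an average loss of 0.16 after the first epoch already indicates that most training samples are classified with high confidence.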

The model is now trained. We use the test dataset to measure the model's accuracy.

model.eval()
correct = 0
total = 0

with torch.no_grad():
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        total += targets.size(0)
        correct += (predicted == targets).sum().item()

accuracy = correct / total * 100
print(f"Test Accuracy: {accuracy:.2f}%")

# Output
Test Accuracy: 98.97%

PyTorch lets us save the model to a file.

torch.save(model, 'lenet_mnist.pt')

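Finally, we can load the model from the file and use it to predict other images.

_model = torch.load('lenet_mnist.pt', weights_only=False)
_model.to(device)
_model.eval()  # make sure batch normalization uses its running statistics

# Reset the counters so the accuracy reflects only this evaluation pass.
correct = 0
total = 0

with torch.no_grad():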
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = _model(inputs)
        _, predicted = torch.max(outputs, 1)
        total += targets.size(0)
        correct += (predicted == targets).sum().item()

accuracy = correct / total * 100
print(f"Test Accuracy: {accuracy:.2f}%")

# Output
Test Accuracy: 98.97%

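Conclusion

This article gave a brief introduction to CNNs and their layer types. We also implemented a simple LeNet-5. You should now have a conceptual understanding of CNNs. CNNs are used in areas such as image recognition and video analysis, where they perform considerably better than traditional methods.

References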

Andrew Ng, Deep Learning Specialization, Coursera.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
