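Vision Transformer Model

For years, image recognition has been dominated by convolutional neural networks (CNNs). More recently, the Transformer has achieved remarkable results in natural language processing (NLP), which prompted the idea of applying the Transformer architecture to image processing as well. The Vision Transformer is an image-processing model built on the Transformer.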

The complete code is available for download.

Vision Transformer (ViT) Architecture
Experiments
1. Model Variants
2. Performance
Implementation
Conclusion
References

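Vision Transformer (ViT) Architecture

The Vision Transformer (ViT) was proposed by Google Research in 2020. It builds on the Transformer's success in NLP and applies it to vision. Although CNNs perform well on images, they are built by stacking local operations, which limits their ability to model long-range features. The researchers wanted to use the Transformer's global attention mechanism to capture dependencies between distant regions of an image and thereby improve recognition. However, the Transformer was designed for text sequences, so the first challenge ViT has to solve is how to turn an image into a sequence that a Transformer can process.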

The overall architecture of ViT is the Transformer encoder. If you are not yet familiar with the Transformer, please read the following article first.

- Transformer Model (by Wayne, 03/04/2025)

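The authors wanted ViT to use the original Transformer design with as few changes as possible. A standard Transformer, however, takes a sequence of tokens as input, so the input image must first be converted into a form the Transformer can accept.

The figure below gives an overview of the ViT model.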

Vision Transformer (source: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale).

Patch Embeddings

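The input image is $x \in \mathbb{R}^{H\times W\times C}$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels (for example, 3 for RGB). The image $x$ is split into fixed-size patches of $P\times P$ pixels; counting the channels, each patch has shape $P\times P\times C$. Each patch is then flattened into a $1\times (P^2\cdot C)$ vector. In other words, the image $x$ is reshaped into $x_p \in \mathbb{R}^{N\times(P^2\cdot C)}$, where $N=HW/P^2$ is the number of patches.

From the point of view of a standard Transformer, a patch plays the role of a token. The input image $x$ has $N$ patches, so it yields $N$ tokens.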

$x \in \mathbb{R}^{H\times W\times C} \\\\ x_p \in \mathbb{R}^{N\times(P^2\cdot C)} \\\\ x_p^n \in \mathbb{R}^{1\times(P^2\cdot C)}, \quad n=1,...,N \\\\ N=HW/P^2 \\\\ H: \text{ Image height} \\\\ W: \text{ Image width} \\\\ C: \text{ The number of channels} \\\\ P: \text{ Patch height/width} \\\\ N: \text{ The number of patches}$
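
The reshaping from $x$ to $x_p$ can be sketched in a few lines of PyTorch. This is only an illustration: the helper name `patchify` and the $224\times 224\times 3$ example image are assumptions made here, not something taken from the paper.

```python
import torch


def patchify(x, patch_size):
    """Split an image of shape (H, W, C) into flattened patches of shape (N, P*P*C)."""
    H, W, C = x.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by the patch size"
    x = x.reshape(H // P, P, W // P, P, C)  # split both spatial axes into (grid, within-patch)
    x = x.permute(0, 2, 1, 3, 4)            # (H/P, W/P, P, P, C)
    return x.reshape(-1, P * P * C)         # (N, P*P*C), patches in row-major order


image = torch.arange(224 * 224 * 3, dtype=torch.float32).reshape(224, 224, 3)
patches = patchify(image, 16)
print(patches.shape)  # torch.Size([196, 768])
```

With $H=W=224$, $C=3$, and $P=16$, this gives $N=224\cdot 224/16^2=196$ patches, each of length $P^2\cdot C=768$.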

Next, every patch in $x_p$ is mapped to dimension $D$, the hidden size used inside the Transformer. A trainable linear projection maps the dimension $P^2\cdot C$ to $D$. The output of this trainable linear projection is called the patch embeddings.

From the point of view of a standard Transformer, the trainable linear projection corresponds to the input embedding.

$x_p^n \in \mathbb{R}^{1\times(P^2\cdot C)} \\\\ E \in \mathbb{R}^{(P^2\cdot C)\times D} \\\\ x_p^nE \in \mathbb{R}^{1\times D} \\\\ z_0=\begin{bmatrix} x_p^1E \\ x_p^2E \\ \cdots \\ x_p^NE \end{bmatrix}$

Classification Head

Similar to BERT's [CLS] token, a learnable embedding $z_0^0=x_{class}$ is prepended to the sequence of patch embeddings. In the final hidden states output by the Transformer encoder, the first state $z_L^0$ is the one corresponding to $z_0^0$. $z_L^0$ serves as the representation $y$ of the whole image.

During both pre-training and fine-tuning, the learnable embedding $x_{class}$ is inserted at position $z_0^0$, and a classification head is attached to the corresponding output $z_L^0$. The classification head is implemented by an MLP with one hidden layer during pre-training and by a single linear layer during fine-tuning.

$z_0=\begin{bmatrix} x_{class} \\ x_p^1E \\ x_p^2E \\ \cdots \\ x_p^NE \end{bmatrix}$

Position Embeddings

To retain position information, position embeddings $E_{pos}$ are added to the patch embeddings. ViT uses standard learnable 1D position embeddings, because the authors did not observe significant performance gains from more advanced 2D-aware position embeddings. The resulting sequence of embedding vectors serves as the input to the Transformer encoder.

$z_0=\begin{bmatrix} x_{class} \\ x_p^1E \\ x_p^2E \\ \cdots \\ x_p^NE \end{bmatrix} + E_{pos}, \quad E_{pos} \in \mathbb{R}^{(N+1)\times D}$
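
The construction of $z_0$ can be written as a small module. The following is a minimal sketch, not the reference implementation: the class name `ViTEmbeddings`, the bias-free projection (chosen to match $x_p^nE$ literally), and the initialization are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ViTEmbeddings(nn.Module):
    """Builds z_0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos from flattened patches."""

    def __init__(self, num_patches, patch_dim, embed_dim):
        super().__init__()
        # E: trainable linear projection (bias omitted here to mirror the formula; an assumption)
        self.proj = nn.Linear(patch_dim, embed_dim, bias=False)
        # x_class: learnable class embedding, shared across the batch
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # E_pos: learnable 1D position embeddings for N patches plus the class token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)  # initialization choice is an assumption

    def forward(self, x_p):
        # x_p: (B, N, P*P*C)
        batch_size = x_p.shape[0]
        tokens = self.proj(x_p)                          # (B, N, D)
        cls = self.cls_token.expand(batch_size, -1, -1)  # (B, 1, D)
        z0 = torch.cat([cls, tokens], dim=1)             # (B, N + 1, D)
        return z0 + self.pos_embed                       # broadcast over the batch


embeddings = ViTEmbeddings(num_patches=196, patch_dim=768, embed_dim=64)
z0 = embeddings(torch.randn(2, 196, 768))
print(z0.shape)  # torch.Size([2, 197, 64])
```

Note that the position embeddings are summed element-wise with the token sequence rather than concatenated, so the sequence length stays at $N+1$.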

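Putting It Together with the Transformer Encoder

The image $x$ is split into $N$ patches $x_p$, which are mapped to patch embeddings. The learnable class embedding $x_{class}$ is then prepended, and the position embeddings $E_{pos}$ are added. The resulting embedding vectors are the input to the Transformer encoder. In the output $z_L$ of the encoder's last block, the first hidden state $z_L^0$ serves as the image representation of the input image.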

$x_p=\begin{bmatrix} x_p^1 \\ x_p^2 \\ \cdots \\ x_p^N \end{bmatrix}, \quad \begin{matrix} x_p^n \in \mathbb{R}^{1\times(P^2\cdot C)}, & n=1,...,N \\ x_p \in \mathbb{R}^{N\times(P^2\cdot C)} \end{matrix}\\\\ z_0=\begin{bmatrix} x_{class} \\ x_p^1E \\ x_p^2E \\ \cdots \\ x_p^NE \end{bmatrix} + E_{pos}, \quad \begin{matrix} E \in \mathbb{R}^{(P^2\cdot C)\times D}, & x_p^nE \in \mathbb{R}^{1\times D} \\ E_{pos} \in \mathbb{R}^{(N+1)\times D}, & z_0 \in \mathbb{R}^{(N+1)\times D}, \end{matrix} \\\\ z_\ell^\prime=\text{MSA}(\text{LN}(z_{\ell-1}))+z_{\ell-1}, \quad \ell=1,...,L \\\\ z_\ell=\text{MLP}(\text{LN}(z_\ell^\prime))+z_\ell^\prime, \quad \ell=1,...,L \\\\ y=\text{LN}(z_L^0) \\\\ \text{MSA}:\text{ Multiheaded self-attention layers} \\\\ \text{MLP}:\text{ MLP blocks} \\\\ \text{LN}:\text{ Layer normalization}$
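
The encoder equations above, from $z_0$ to $y$, map almost line by line onto code. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for MSA; the class names `PreNormBlock` and `Encoder` and the small default sizes are assumptions chosen to keep the example quick to run.

```python
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """One encoder block: z' = MSA(LN(z)) + z, then z = MLP(LN(z')) + z'."""

    def __init__(self, dim, num_heads, mlp_dim):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        attn_out, _ = self.msa(h, h, h, need_weights=False)  # self-attention: query = key = value
        z = attn_out + z                                      # first residual connection
        z = self.mlp(self.ln2(z)) + z                         # second residual connection
        return z


class Encoder(nn.Module):
    """Stack of L blocks followed by y = LN(z_L^0)."""

    def __init__(self, dim=64, depth=2, num_heads=4, mlp_dim=128):
        super().__init__()
        self.blocks = nn.ModuleList([PreNormBlock(dim, num_heads, mlp_dim) for _ in range(depth)])
        self.ln = nn.LayerNorm(dim)

    def forward(self, z0):
        z = z0
        for block in self.blocks:
            z = block(z)
        return self.ln(z[:, 0])  # keep only the state of the class token


encoder = Encoder()
y = encoder(torch.randn(2, 17, 64))  # 16 patch tokens plus 1 class token
print(y.shape)  # torch.Size([2, 64])
```

Layer normalization is applied before each sub-layer, and only the first position of the final output is used as the image representation.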

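Inductive Bias

In learning theory, a model can generalize from limited data to unseen data only by relying on some prior assumptions. These assumptions make up the model's inductive bias, which helps the model converge to a reasonable solution even when data is scarce or noisy.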

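A CNN assumes that images exhibit locality and translation equivariance, and it uses convolution kernels to extract local features, share weights, and preserve spatial structure. In other words, neighboring pixels form local features and an object may appear anywhere in the image, so sliding a kernel over the whole image is enough to capture the pattern. These inductive biases help CNNs learn effectively even from small datasets.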
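
Translation equivariance can be checked numerically: shifting the input and then convolving should give the same result as convolving and then shifting. The sketch below assumes circular padding so that the property holds exactly at the image borders; with the more common zero padding it holds only away from the edges.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Circular padding is an assumption made so that shifts wrap around cleanly.
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 1, 8, 8)
shift = (2, 3)  # shift by 2 rows and 3 columns

with torch.no_grad():
    shift_then_conv = conv(torch.roll(x, shifts=shift, dims=(2, 3)))
    conv_then_shift = torch.roll(conv(x), shifts=shift, dims=(2, 3))

is_equivariant = torch.allclose(shift_then_conv, conv_then_shift, atol=1e-6)
print(is_equivariant)
```

The same kernel weights are applied at every location, which is why the feature map simply moves along with the input.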

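ViT turns the image into a sequence of patches and leaves it to the Transformer to model the relationships between features. The design philosophy is that, given enough data and model capacity, the model should be able to learn the most suitable representation by itself rather than rely on hand-crafted priors. In ViT, only the MLP layers are local and translationally equivariant, while the self-attention layers are global. ViT therefore has much less image-specific inductive bias than a CNN and needs a large amount of data to learn these patterns on its own. As a result, ViT performs poorly when data is limited, but when trained on large-scale data it shows greater flexibility and generalization ability.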

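Experiments

Model Variants

The authors trained several model variants, whose configurations are listed in the table below. A ViT model name usually encodes the model size and the input patch size. For example, ViT-L/16 denotes the Large variant with a $16\times 16$ input patch size.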

Model	Layers	Hidden size D	MLP size	Heads	Params
ViT-Base	12	768	3072	12	86M
ViT-Large	24	1024	4096	16	307M
ViT-Huge	32	1280	5120	16	632M

Details of Vision Transformer model variants (from An Image Is Worth 16×16 Words: Transformers For Image Recognition at Scale)
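
The parameter counts in the table can be roughly reproduced from the other columns. The following is a back-of-the-envelope sketch under stated assumptions: $224\times 224$ RGB input, patch size 16 for every variant, biases on all linear layers, and no classification head. The function name `vit_param_count` is hypothetical.

```python
def vit_param_count(layers, hidden, mlp, patch=16, image=224, channels=3):
    """Approximate number of parameters of a ViT backbone (classification head excluded)."""
    attn = 4 * hidden * hidden + 4 * hidden   # Q, K, V and output projections, with biases
    ffn = 2 * hidden * mlp + mlp + hidden     # two linear layers of the MLP block, with biases
    norms = 2 * 2 * hidden                    # two LayerNorms per block (scale and shift)
    per_layer = attn + ffn + norms

    num_patches = (image // patch) ** 2
    patch_proj = patch * patch * channels * hidden + hidden     # linear projection E plus bias
    extras = hidden + (num_patches + 1) * hidden + 2 * hidden   # class token, E_pos, final LayerNorm
    return layers * per_layer + patch_proj + extras


variants = {
    "ViT-Base": (12, 768, 3072),
    "ViT-Large": (24, 1024, 4096),
    "ViT-Huge": (32, 1280, 5120),
}
for name, (num_layers, hidden_size, mlp_size) in variants.items():
    print(f"{name}: {vit_param_count(num_layers, hidden_size, mlp_size) / 1e6:.1f}M")
# ViT-Base: 85.8M
# ViT-Large: 303.3M
# ViT-Huge: 630.9M
```

These estimates land within about 2% of the table. The remaining gap presumably comes from components that this count leaves out, such as the classification head.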

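Performance

When ViT is pre-trained on a mid-sized dataset such as ImageNet-1k, it performs worse than a ResNet of comparable size. This is because CNNs have strong built-in inductive biases, such as translation equivariance and locality, that let them learn effectively from limited data. ViT has to learn these properties from the data itself, which makes it less data-efficient.

However, when ViT is pre-trained on a large dataset such as ImageNet-21k or JFT-300M and then transferred to the target task, it can outperform ResNet.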

Comparison to State of the Art (from An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale).

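Implementation

Patch Embedding

This part takes an image as input and outputs patch embeddings. So we first split the image into $N$ patches, flatten each patch into a one-dimensional vector, and finally map it to dimension $D$ through a learnable linear projection.

This procedure is equivalent to what a convolutional layer does, as shown below.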

Learnable linear projection can be implemented by a convolutional layer.

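So we can use a convolutional layer whose kernel size and stride both equal $P$ to transform the image from $C\times H\times W$ to $D\times \frac{H}{P}\times \frac{W}{P}$, flatten it to $D\times (\frac{H}{P}\cdot\frac{W}{P})$, that is, $D\times N$, and finally swap the two dimensions to obtain $N\times D$.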

import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        """
        Patch Embedding Layer for Vision Transformer.

        Args:
            patch_size (int): Size of the patches to be extracted from the input image.
            in_channels (int): Number of input channels in the image (e.g., 3 for RGB).
            embed_dim (int): Dimension of the embedding space to which each patch will be projected.
        """

        super(PatchEmbedding, self).__init__()
        self.patch_size = patch_size
        self.proj = nn.Conv2d(
            in_channels=in_channels,
            out_channels=embed_dim,
            kernel_size=patch_size,
            stride=patch_size,
        )

    def forward(self, x):
        """
        Forward pass of the Patch Embedding Layer.

        Args:
            x (torch.Tensor): Input tensor of shape (B, C, H, W) where
                                - B is batch size
                                - C is number of channels
                                - H is height
                                - W is width

        Returns:
            x (torch.Tensor): Output tensor of shape (B, N, D) where
                                - N = (H/P) * (W/P) is the number of patches
                                - D is the embedding dimension
        """

        x = self.proj(x)  # (B, D, H/P, W/P)
        x = x.flatten(2)  # (B, D, H/P * W/P)
        x = x.transpose(1, 2)  # (B, H/P * W/P, D)
        return x
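
The claim that a strided convolution equals a per-patch linear projection can be verified numerically. The sketch below is an illustration with small, arbitrary sizes; it extracts patches with `torch.nn.functional.unfold`, which flattens each patch in (C, P, P) order rather than the (P, P, C) order used in the text, a difference that only permutes the rows of $E$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, C, H, W, P, D = 2, 3, 32, 32, 8, 16  # small sizes chosen only for the demonstration

conv = nn.Conv2d(C, D, kernel_size=P, stride=P)
x = torch.randn(B, C, H, W)

with torch.no_grad():
    # Path 1: strided convolution, then flatten and transpose to (B, N, D)
    via_conv = conv(x).flatten(2).transpose(1, 2)

    # Path 2: explicit patch extraction followed by a matrix multiplication
    patches = F.unfold(x, kernel_size=P, stride=P).transpose(1, 2)  # (B, N, C*P*P)
    E = conv.weight.reshape(D, -1).t()                              # (C*P*P, D)
    via_linear = patches @ E + conv.bias                            # (B, N, D)

print(via_conv.shape)  # torch.Size([2, 16, 16])
print(torch.allclose(via_conv, via_linear, atol=1e-4))
```

Because the kernel never overlaps neighboring patches, each output position depends on exactly one patch, which is what makes the two paths agree up to floating-point error.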

Transformer Encoder

The following is the implementation of each part of the Transformer encoder.

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, num_heads=12, qkv_bias=True, dropout=0.1, attention_dropout=0.1):
        """
        Multi-Head Self-Attention Layer.

        Args:
            dim (int): Dimension of the input features.
            num_heads (int): Number of attention heads.
            qkv_bias (bool): Whether to add a bias term to the query, key, and value projections.
            dropout (float): Dropout rate applied after the output projection.
            attention_dropout (float): Dropout rate applied to the attention weights.
        """

        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5  # 1 / sqrt(d_k) factor of scaled dot-product attention

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attention_dropout = nn.Dropout(attention_dropout)
        self.projection = nn.Linear(dim, dim)
        self.projection_dropout = nn.Dropout(dropout)

    def forward(self, x):
        """
        Forward pass of the Multi-Head Self-Attention Layer.

        Args:
            x (torch.Tensor): Input tensor of shape (B, N, D) where
                                - B is batch size
                                - N is the number of patches (or tokens)
                                - D is the embedding dimension

        Returns:
            out (torch.Tensor): Output tensor of shape (B, N, D) after applying multi-head self-attention.
        """

        B, N, C = x.shape

        qkv = self.qkv(x)  # (B, N, 3 * C)
        qkv = qkv.reshape(B, N, 3, self.num_heads, self.head_dim)  # (B, N, 3, heads, head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, B, heads, N, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]

        # Scaled Dot-Product Attention
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        attn = self.attention_dropout(attn)

        out = (attn @ v)  # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, C)  # (B, N, C), heads concatenated
        out = self.projection(out)
        out = self.projection_dropout(out)
        return out

class MLPBlock(nn.Module):
    def __init__(self, in_dim, hidden_dim, dropout=0.1):
        """
        MLP Block for Transformer Encoder.

        Args:
            in_dim (int): Input dimension of the features.
            hidden_dim (int): Hidden dimension of the MLP.
            dropout (float): Dropout rate applied to the output of the MLP.
        """

        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, in_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        """
        Forward pass of the MLP Block.

        Args:
            x (torch.Tensor): Input tensor of shape (B, N, D) where
                                - B is batch size
                                - N is the number of patches (or tokens)
                                - D is the embedding dimension

        Returns:
            x (torch.Tensor): Output tensor of the same shape as input, after applying MLP.
        """

        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x

class EncoderBlock(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4.0, qkv_bias=True, dropout=0.1, attention_dropout=0.1):
        """
        Transformer Encoder Block.

        Args:
            dim (int): Dimension of the input features.
            num_heads (int): Number of attention heads.
            mlp_ratio (float): Ratio of the hidden dimension in the MLP block to the embedding dimension.
            qkv_bias (bool): Whether to add a bias term to the query, key, and value projections.
            dropout (float): Dropout rate applied to the output of the MLP and attention layers.
            attention_dropout (float): Dropout rate applied to the attention weights.
        """

        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attention = MultiHeadSelfAttention(
            dim=dim,
            num_heads=num_heads,
            qkv_bias=qkv_bias,
            dropout=dropout,
            attention_dropout=attention_dropout,
        )
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = MLPBlock(
            in_dim=dim,
            hidden_dim=int(dim * mlp_ratio),
            dropout=dropout,
        )

    def forward(self, x):
        """
        Forward pass of the Transformer Encoder Block.

        Args:
            x (torch.Tensor): Input tensor of shape (B, N, D) where
                                - B is batch size
                                - N is the number of patches (or tokens)
                                - D is the embedding dimension

        Returns:
            x (torch.Tensor): Output tensor of the same shape as input, after applying self-attention and MLP.
        """

        x = x + self.self_attention(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x