Bidirectional Encoder Representations from Transformers (BERT)

Bidirectional Encoder Representations from Transformers (BERT) is a pre-training technique for natural language processing introduced by Google AI in 2018. By giving models a deeper, context-aware understanding of language, BERT significantly advanced the field of natural language processing.

The complete code is available for download.

BERT Architecture
Input/Output Representations
BERT Implementation
Pre-training BERT
Fine-tuning BERT
Conclusion
References

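BERT Architecture

BERT stands for Bidirectional Encoder Representations from Transformers. As the name suggests, its core architecture is the Transformer encoder. If you are not yet familiar with the Transformer, please read the following article first.

- Transformer Model, by Wayne, 03/04/2025 (tags: Artificial Intelligence, Natural Language Processing)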

The figure below shows the architectures of BERT and the Transformer. BERT is clearly the Transformer encoder plus an output layer. BERT pre-trains deep bidirectional representations from an unlabeled corpus by jointly conditioning on both left and right context in all layers, which is exactly what the Transformer encoder does. These representations capture different semantic aspects of the input sequence. In the full Transformer, they are passed to the decoder's cross multi-head attention to predict the next output.

The pre-trained BERT model, however, is only the Transformer encoder, and it outputs only these representations. We can use them for downstream tasks such as question answering and language inference. We simply add one extra output layer to the pre-trained BERT model and fine-tune it, which yields a model for the downstream task without substantial task-specific architecture changes. BERT therefore involves two stages: pre-training and fine-tuning.

Transformer vs. BERT (source: the Transformer paper).

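Input/Output Representations

To let BERT handle a variety of downstream tasks, its input sequence can be either a single sentence or a pair of sentences (e.g., <question, answer>). BERT uses the WordPiece embeddings proposed by Google in 2016, with a vocabulary of 30,000 tokens. The first token of every input sequence is always a special classification token ([CLS]). The final hidden state corresponding to [CLS] is used as the aggregate sequence representation for classification tasks.

When the input is a pair of sentences, the two sentences are packed into a single sequence and distinguished in two ways:

First, a special [SEP] token separates them.
Second, a learned embedding is added to every token to indicate whether it belongs to sentence A or sentence B.

For a given token, its input representation is the sum of three embeddings: the token embedding, the segment embedding, and the position embedding, as shown in the figure below.

BERT input representation (source: the BERT paper).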
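
As a minimal sketch of the packing described above, the following plain-Python snippet builds the token sequence together with the segment and position indices from which the three embeddings would be looked up. The helper name build_pair_inputs and the example tokens are illustrative assumptions, not part of the original code.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Pack two token lists into one BERT-style sequence (hypothetical helper)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Positions simply count from 0 across the whole packed sequence.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_pair_inputs(["my", "dog"], ["he", "barks"])
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"position={pos} segment={seg} token={tok}")
```

Each row corresponds to one input position; in the model, the three indices select one vector each from the token, segment, and position embedding tables, and those vectors are added element-wise.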

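BERT Implementation

The following is an implementation of the BERT model. If you are not yet familiar with the Transformer, or have trouble following the implementation, please read the Transformer Model article mentioned above first.

This implementation is essentially the Transformer encoder, with two differences:

The Embeddings module has an additional token_type_embeddings, which distinguishes sentence A from sentence B.
The output has an additional pooling layer. It takes the first token of the output representations, which corresponds to the [CLS] token of the input sequence. As mentioned earlier, it serves as the aggregate sequence representation in classification tasks.

Finally, the BERT model outputs both the per-token representations and the aggregate sequence representation.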

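import math
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader


class Embeddings(nn.Module):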
    def __init__(self, vocab_size, token_type_size, max_position_embeddings, hidden_dim, dropout_prob):
        super(Embeddings, self).__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_dim)
        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_dim)
        self.token_type_embeddings = nn.Embedding(token_type_size, hidden_dim)

        self.norm = nn.LayerNorm(hidden_dim, eps=1e-12)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, input_ids, token_type_ids=None):
        """
        Compute the embeddings for the input tokens.

        Args
            input_ids: (batch_size, seq_len)
            token_type_ids: (batch_size, seq_len)

        Returns
            embeddings: (batch_size, seq_len, hidden_dim)
        """

        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        seq_len = input_ids.size(1)
        position_ids = torch.arange(seq_len, dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)  # (1, seq_len) -> (batch_size, seq_len)

        word_embeddings = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = word_embeddings + position_embeddings + token_type_embeddings
        embeddings = self.norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings


class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, hidden_dim, dropout_prob):
        super(MultiHeadAttention, self).__init__()
        assert hidden_dim % num_heads == 0, "hidden_dim must be divisible by num_heads"

        self.num_heads = num_heads
        self.head_size = hidden_dim // num_heads
        self.all_head_size = hidden_dim

        self.query = nn.Linear(hidden_dim, self.all_head_size, bias=False)
        self.key = nn.Linear(hidden_dim, self.all_head_size, bias=False)
        self.value = nn.Linear(hidden_dim, self.all_head_size, bias=False)

        self.output = nn.Linear(hidden_dim, hidden_dim, bias=False)

        self.norm = nn.LayerNorm(hidden_dim, eps=1e-12)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, hidden_states, mask=None):
        """
        Multi-head attention forward pass.

        Args
            hidden_states: (batch_size, seq_len, hidden_dim)
            mask: (batch_size, 1, 1, seq_len)
                  0 for real tokens, a large negative value (-10000.0) for padding tokens

        Returns
            hidden_states: (batch_size, seq_len, hidden_dim)
        """

        query = self.transpose_for_scores(self.query(hidden_states))  # (batch_size, num_heads, seq_len, head_size)
        key = self.transpose_for_scores(self.key(hidden_states))  # (batch_size, num_heads, seq_len, head_size)
        value = self.transpose_for_scores(self.value(hidden_states))  # (batch_size, num_heads, seq_len, head_size)

        # Scaled dot-product attention
        scores = query @ key.transpose(-2, -1) / math.sqrt(self.head_size)  # (batch_size, num_heads, seq_len, seq_len)
        if mask is not None:
            scores = scores + mask
        attention_weights = F.softmax(scores, dim=-1)  # (batch_size, num_heads, seq_len, seq_len)
        attention_weights = self.dropout(attention_weights)
        attention = attention_weights @ value  # (batch_size, num_heads, seq_len, head_size)

        # Concatenate heads
        attention = attention.transpose(1, 2).contiguous()  # (batch_size, seq_len, num_heads, head_size)
        new_shape = attention.size()[:-2] + (self.all_head_size,)
        attention = attention.view(*new_shape)  # (batch_size, seq_len, all_head_size)

        # Linear projection
        projection_output = self.output(attention)  # (batch_size, seq_len, hidden_dim)
        projection_output = self.dropout(projection_output)

        hidden_states = self.norm(hidden_states + projection_output)
        return hidden_states

    def transpose_for_scores(self, x):
        """
        Args
            x: (batch_size, seq_len, all_head_size)
        Returns
            (batch_size, num_heads, seq_len, head_size)
        """

        new_x_shape = x.size()[:-1] + (self.num_heads, self.head_size)
        x = x.view(*new_x_shape)  # (batch_size, seq_len, num_heads, head_size)
        return x.permute(0, 2, 1, 3)  # (batch_size, num_heads, seq_len, head_size)


class PositionwiseFeedForward(nn.Module):
    def __init__(self, hidden_dim, d_ff):
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(hidden_dim, d_ff, bias=True)
        self.linear2 = nn.Linear(d_ff, hidden_dim, bias=True)
        self.activation = nn.GELU()

    def forward(self, hidden_states):
        """
        Feed-forward network forward pass.

        Args
            hidden_states: (batch_size, seq_len, hidden_dim)

        Returns
            hidden_states: (batch_size, seq_len, hidden_dim)
        """

        hidden_states = self.linear2(self.activation(self.linear1(hidden_states)))
        return hidden_states


class EncoderLayer(nn.Module):
    def __init__(self, num_heads, hidden_dim, d_ff, dropout_prob):
        super(EncoderLayer, self).__init__()
        self.multi_head_attention = MultiHeadAttention(num_heads, hidden_dim, dropout_prob)
        self.ffn = PositionwiseFeedForward(hidden_dim, d_ff)
        self.norm = nn.LayerNorm(hidden_dim, eps=1e-12)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, hidden_states, mask=None):
        """
        Encoder layer forward pass.

        Args
            hidden_states: (batch_size, seq_len, hidden_dim)
            mask: (batch_size, 1, 1, seq_len)
                  0 for real tokens, a large negative value (-10000.0) for padding tokens

        Returns
            hidden_states: (batch_size, seq_len, hidden_dim)
        """

        # Multi-head attention
        attention_output = self.multi_head_attention(hidden_states, mask=mask)

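        # Feed-forward network; the residual connection wraps the FFN sub-layer,
        # so it adds the FFN input (attention_output), not the layer input
        ffn_output = self.ffn(attention_output)
        ffn_output = self.dropout(ffn_output)
        hidden_states = self.norm(attention_output + ffn_output)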

        return hidden_states


class Encoder(nn.Module):
    def __init__(self, hidden_dim, num_layers, num_heads, d_ff, dropout_prob):
        super(Encoder, self).__init__()
        self.layers = nn.ModuleList(
            [EncoderLayer(num_heads, hidden_dim, d_ff, dropout_prob) for _ in range(num_layers)]
        )

    def forward(self, hidden_states, mask=None):
        """
        Encoder forward pass.

        Args
            hidden_states: (batch_size, seq_len, hidden_dim)
            mask: (batch_size, 1, 1, seq_len)
                  0 for real tokens, a large negative value (-10000.0) for padding tokens

        Returns
            hidden_states: (batch_size, seq_len, hidden_dim)
        """

        for layer in self.layers:
            hidden_states = layer(hidden_states, mask=mask)
        return hidden_states


class Pooler(nn.Module):
    def __init__(self, hidden_dim):
        super(Pooler, self).__init__()
        self.linear = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, hidden_states):
        """
        Pooler forward pass.

        Args
            hidden_states: (batch_size, seq_len, hidden_dim)

        Returns
            pooled_output: (batch_size, hidden_dim)
        """

        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.linear(first_token_tensor)
        pooled_output = torch.tanh(pooled_output)  # F.tanh is deprecated
        return pooled_output


class Bert(nn.Module):
    def __init__(
        self, vocab_size, token_type_size, max_position_embeddings, hidden_dim, num_layers, num_heads, d_ff,
        dropout_prob
    ):
        super(Bert, self).__init__()
        self.embeddings = Embeddings(vocab_size, token_type_size, max_position_embeddings, hidden_dim, dropout_prob)
        self.encoder = Encoder(hidden_dim, num_layers, num_heads, d_ff, dropout_prob)
        self.pooler = Pooler(hidden_dim)

    def forward(self, input_ids, token_type_ids=None, mask=None):
        """
        Forward pass for the BERT model.

        Args
            input_ids: (batch_size, seq_len)
            token_type_ids: (batch_size, seq_len)
            mask: (batch_size, seq_len)

        Returns
            encoder_output: (batch_size, seq_len, hidden_dim)
            pooled_output: (batch_size, hidden_dim)
        """

        if mask is not None:
            extended_mask = mask.unsqueeze(1).unsqueeze(2)
            extended_mask = extended_mask.to(dtype=torch.float32)
            # Convert 1 -> 0, 0 -> large negative (mask out)
            extended_mask = (1.0 - extended_mask) * -10000.0
        else:
            extended_mask = None

        embedding_output = self.embeddings(input_ids, token_type_ids)
        encoder_output = self.encoder(embedding_output, mask=extended_mask)
        pooled_output = self.pooler(encoder_output)

        return encoder_output, pooled_output
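
To make the mask handling in Bert.forward() more concrete, here is a small plain-Python sketch (no PyTorch) of the same idea: a 1/0 padding mask is turned into an additive mask, which pushes the attention weight of padded positions to effectively zero after the softmax. The function names and the example scores are illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def to_additive_mask(mask, neg=-10000.0):
    """Map 1 (real token) -> 0.0 and 0 (padding) -> a large negative value."""
    return [(1.0 - m) * neg for m in mask]


scores = [2.0, 1.0, 0.5, 3.0]   # raw attention scores of one query over 4 keys
mask = [1, 1, 1, 0]             # the last position is padding
additive = to_additive_mask(mask)
weights = softmax([s + a for s, a in zip(scores, additive)])
print(weights)
```

Even though the padded position has the highest raw score, its weight becomes negligible, and the remaining weights still sum to 1.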

Pre-training BERT

Pre-training BERT involves two tasks: masked language modeling (MLM) and next sentence prediction (NSP). We describe each in detail below.

Masked Language Modeling (MLM)

It is reasonable to expect a deep bidirectional model to be more powerful than either a left-to-right model or a shallow concatenation of a left-to-right and a right-to-left model. However, standard conditional language models can only be trained left-to-right or right-to-left, because bidirectional conditioning would let each word indirectly "see itself", so the model could trivially predict the target word from the multi-layer context.

To train deep bidirectional representations, we simply mask some of the input tokens at random and have the model predict those masked tokens. This procedure is called the masked language model (Masked LM, MLM) task. The final hidden states corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard language model.

Although this gives us a bidirectional pre-trained model, it creates a mismatch between pre-training and fine-tuning, because the [MASK] token never appears during fine-tuning. To mitigate this, we do not always replace the tokens chosen for masking with the [MASK] token.

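When generating training data, we randomly choose 15% of the token positions as prediction targets. If the i-th token is chosen, we:

Keep the i-th token unchanged with 10% probability.
Replace the i-th token with the [MASK] token with 80% probability.
Replace the i-th token with a random token with 10% probability.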
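
To see what these percentages mean for the corpus as a whole, the short simulation below estimates the share of all tokens that end up as [MASK], as a random token, or selected but unchanged. Analytically these are 15% x 80% = 12%, 15% x 10% = 1.5%, and 15% x 10% = 1.5%. This is only a sketch: it selects each position independently with probability 0.15, and the seed and token count are arbitrary assumptions.

```python
import random

SELECT_PROB = 0.15          # chance that a position becomes a prediction target
rng = random.Random(0)      # fixed seed so the run is repeatable

counts = {"mask": 0, "random": 0, "keep": 0, "untouched": 0}
num_tokens = 200_000
for _ in range(num_tokens):
    if rng.random() >= SELECT_PROB:
        counts["untouched"] += 1   # not a prediction target
        continue
    r = rng.random()
    if r < 0.8:
        counts["mask"] += 1        # replaced with [MASK]
    elif r < 0.9:
        counts["random"] += 1      # replaced with a random token
    else:
        counts["keep"] += 1        # left unchanged, but still predicted

fractions = {name: count / num_tokens for name, count in counts.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.4f}")
```

So only about 12% of all input tokens are actually replaced by [MASK], while the model still has to predict the original token at every selected position.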

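The figure below shows how two sentences are combined and how a training example for MLM is generated using the procedure above.

Next Sentence Prediction (NSP)

Many downstream tasks, such as question answering and language inference, depend on understanding the relationship between two sentences, which language-model pre-training does not capture directly. To train a model that understands sentence relationships, we also pre-train on the next sentence prediction (NSP) task.

When choosing sentences A and B for each pre-training example, 50% of the time B is the sentence that actually follows A in the corpus (labeled IsNext), and 50% of the time B is a random sentence from the corpus (labeled NotNext). The final hidden state corresponding to [CLS] (that is, the aggregate sequence representation) is used for NSP.

In the figure below, we pick a next sentence for each first sentence and combine them into training examples. The upper half of the figure picks the sentence that actually follows in the corpus, so it is labeled IsNext. The lower half picks a random sentence from the corpus, so it is labeled NotNext.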
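
The 50/50 sampling can be sketched in a few lines of plain Python. This is one possible variant rather than the procedure from the paper: as an assumption for illustration, it never picks the last sentence as A (it has no real successor) and it excludes the true next sentence when drawing a NotNext example. The function name sample_nsp_pair and the toy document are hypothetical.

```python
import random


def sample_nsp_pair(sentences, rng):
    """Return (sentence_a, sentence_b, label); label 1 = IsNext, 0 = NotNext."""
    idx_a = rng.randrange(len(sentences) - 1)  # A must have a real successor
    if rng.random() < 0.5:
        return sentences[idx_a], sentences[idx_a + 1], 1
    # Negative example: any sentence except A itself and A's true successor.
    candidates = [i for i in range(len(sentences)) if i not in (idx_a, idx_a + 1)]
    idx_b = rng.choice(candidates)
    return sentences[idx_a], sentences[idx_b], 0


document = ["i like dogs", "they are playful", "i like cats", "they are cute"]
rng = random.Random(0)
pairs = [sample_nsp_pair(document, rng) for _ in range(1000)]
num_is_next = sum(label for _, _, label in pairs)
print(f"IsNext: {num_is_next}, NotNext: {len(pairs) - num_is_next}")
```

The sketch needs at least three sentences so that a negative candidate always exists.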

Implementation

BERT uses WordPiece, but to keep the example code simple we tokenize by whitespace-separated words and use a very small vocabulary, as follows.

tkn2idx = {
    "[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3,
    "i": 4, "like": 5, "dogs": 6, "cats": 7,
    "they": 8, "are": 9, "playful": 10,
    "[UNK]": 11,
}

idx2tkn = {v: k for k, v in tkn2idx.items()}


def tokenize(text):
    tokens = text.split()
    token_ids = [tkn2idx.get(t, tkn2idx["[UNK]"]) for t in tokens]
    return token_ids
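
For comparison with the whitespace tokenizer above, here is a rough sketch of WordPiece-style tokenization of a single word, using greedy longest-match-first lookup with a "##" prefix on continuation pieces. The toy vocabulary and the function name wordpiece_tokenize are assumptions for illustration; a real WordPiece vocabulary is learned from data.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"play", "##ful", "##ing", "dog", "##s", "cat"}
print(wordpiece_tokenize("playful", toy_vocab))  # ['play', '##ful']
print(wordpiece_tokenize("dogs", toy_vocab))     # ['dog', '##s']
print(wordpiece_tokenize("cute", toy_vocab))     # ['[UNK]']
```

Splitting into subwords lets a fixed vocabulary cover many word forms, whereas the word-level tokenizer in this post maps every unseen word to [UNK].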

Next, we use the following corpus.

corpus = [
    "i like dogs",
    "they are playful",
    "i like cats",
    "they are cute"
]

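We then build the pre-training dataset with the code below. When choosing a sentence pair, we pick the actual next sentence 50% of the time and a random sentence 50% of the time. In token_type_ids, 0 means the token at that position in input_ids belongs to sentence A, and 1 means it belongs to sentence B. In mlm_labels, -100 means the token at that position was not selected for prediction; for a selected position, mlm_labels holds the original token ID, while input_ids holds [MASK], a random token, or the unchanged token.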

def create_example_for_mlm_nsp(sentence_a, sentence_b, is_next, max_seq_len=12, mask_prob=0.15):
    cls_id = tkn2idx["[CLS]"]
    sep_id = tkn2idx["[SEP]"]
    mask_id = tkn2idx["[MASK]"]

    tokens_a = tokenize(sentence_a)
    tokens_b = tokenize(sentence_b)

    input_ids = [cls_id] + tokens_a + [sep_id] + tokens_b + [sep_id]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    if len(input_ids) > max_seq_len:
        input_ids = input_ids[:max_seq_len]
        token_type_ids = token_type_ids[:max_seq_len]

    # -100 for non-masked positions, and the original token for masked positions
    mlm_labels = [-100] * len(input_ids)

    num_to_mask = max(1, int((len(input_ids) - 3) * mask_prob))  # 3 for [CLS], [SEP], [SEP]
    candidate_mask_positions = [i for i, tid in enumerate(input_ids) if tid not in [cls_id, sep_id]]
    random.shuffle(candidate_mask_positions)
    mask_positions = candidate_mask_positions[:num_to_mask]

    for pos in mask_positions:
        mlm_labels[pos] = input_ids[pos]

        # BERT strategy: 80% replace with [MASK], 10% random, 10% keep
        r = random.random()
        if r < 0.8:
            input_ids[pos] = mask_id
        elif r < 0.9:
            input_ids[pos] = random.randint(4, len(tkn2idx) - 2)  # exclude special tokens
        else:
            pass

    nsp_label = 1 if is_next else 0
    return input_ids, token_type_ids, mlm_labels, nsp_label


def build_pretraining_dataset(corpus, num_examples):
    dataset = []
    n = len(corpus)
    for _ in range(num_examples):
        idx_a = random.randint(0, n - 1)
        sentence_a = corpus[idx_a]

        # 50%: pick a real next sentence; 50%: pick a random sentence
        if random.random() < 0.5:
            idx_b = (idx_a + 1) % n
            sentence_b = corpus[idx_b]
            is_next = True
        else:
            idx_b = random.randint(0, n - 1)
            while idx_b == idx_a:
                idx_b = random.randint(0, n - 1)
            sentence_b = corpus[idx_b]
            is_next = False

        input_ids, token_type_ids, mlm_labels, nsp_label = create_example_for_mlm_nsp(sentence_a, sentence_b, is_next)
        dataset.append((input_ids, token_type_ids, mlm_labels, nsp_label))

    return dataset

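In the earlier Bert code, Bert.forward() returns the output representations and the pooled final hidden state corresponding to [CLS]. For the MLM task, we want the model to predict the tokens at the masked positions. For NSP, we want the model to predict whether the second sentence is the actual next sentence.

So in the code below, we add heads on top of the output: the output representations are projected to predict the masked tokens, and the final hidden state corresponding to [CLS] is used to predict whether the second sentence is the next one.

In addition, because the output representations must be mapped back to vocabulary tokens, we assign the model's bert.embeddings.word_embeddings.weight to predictions.weight (weight tying).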

class BertForPreTraining(nn.Module):
    def __init__(
        self, vocab_size, token_type_size, max_position_embeddings, hidden_dim, num_layers, num_heads, d_ff,
        dropout_prob
    ):
        super(BertForPreTraining, self).__init__()
        self.bert = Bert(
            vocab_size, token_type_size, max_position_embeddings, hidden_dim, num_layers, num_heads, d_ff, dropout_prob
        )
        # Tying the MLM head's weight to the word embedding
        self.cls = PreTrainingHeads(vocab_size, hidden_dim, self.bert.embeddings.word_embeddings.weight)

    def forward(self, input_ids, token_type_ids=None, mask=None):
        """
        Pre-training BERT

        Args
            input_ids: (batch_size, seq_len)
            token_type_ids: (batch_size, seq_len)
            mask: (batch_size, seq_len)

        Returns
            prediction_scores: (batch_size, seq_len, vocab_size)
            seq_relationship_scores: (batch_size, 2)
        """

        sequence_output, pooled_output = self.bert(input_ids, token_type_ids, mask=mask)
        prediction_scores, seq_relationship_scores = self.cls(sequence_output, pooled_output)
        return prediction_scores, seq_relationship_scores


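# NOTE: The original post does not show PreTrainingHeads at this point (the
# sequence-classification class was pasted here instead). The class below is an
# assumed, minimal version that matches how BertForPreTraining uses it; the
# official BERT head additionally applies a dense + GELU + LayerNorm transform
# before the vocabulary projection.
class PreTrainingHeads(nn.Module):
    def __init__(self, vocab_size, hidden_dim, word_embedding_weight):
        super(PreTrainingHeads, self).__init__()
        # MLM head: project hidden states onto the vocabulary, sharing the embedding matrix
        self.predictions = nn.Linear(hidden_dim, vocab_size, bias=False)
        self.predictions.weight = word_embedding_weight  # (vocab_size, hidden_dim)
        self.predictions_bias = nn.Parameter(torch.zeros(vocab_size))
        # NSP head: binary classifier on the pooled [CLS] representation
        self.seq_relationship = nn.Linear(hidden_dim, 2)

    def forward(self, sequence_output, pooled_output):
        """
        Pre-training heads forward pass.

        Args
            sequence_output: (batch_size, seq_len, hidden_dim)
            pooled_output: (batch_size, hidden_dim)

        Returns
            prediction_scores: (batch_size, seq_len, vocab_size)
            seq_relationship_scores: (batch_size, 2)
        """

        prediction_scores = self.predictions(sequence_output) + self.predictions_bias
        seq_relationship_scores = self.seq_relationship(pooled_output)
        return prediction_scores, seq_relationship_scores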
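
The weight tying mentioned above can be illustrated without PyTorch. In the toy sketch below, a single table is used both to look up input embeddings and to compute output logits as dot products with every row. The table values and function names are made up for illustration.

```python
# Toy embedding table: one row per vocabulary entry (vocab_size=3, hidden_dim=2).
embedding_table = [
    [1.0, 0.0],    # token 0
    [0.0, 1.0],    # token 1
    [-1.0, -1.0],  # token 2
]


def embed(token_id):
    """Input side: look up the embedding row of a token."""
    return embedding_table[token_id]


def output_logits(hidden):
    """Output side (tied): logits = hidden @ embedding_table^T."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in embedding_table]


logits = output_logits(embed(2))
predicted = max(range(len(logits)), key=lambda i: logits[i])
print(logits, predicted)
```

Because both sides share one matrix, the MLM head adds no second vocab_size x hidden_dim matrix, and a hidden state that resembles a token's embedding scores highly for that token.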

We run pre-training with the following code.

def collate_pretraining_batch(examples):
    pad_id = tkn2idx["[PAD]"]
    max_len = max(len(ex[0]) for ex in examples)

    batch_input_ids = []
    batch_token_type_ids = []
    batch_mlm_labels = []
    batch_nsp_labels = []
    batch_mask = []

    for (input_ids, token_type_ids, mlm_labels, nsp_label) in examples:
        seq_len = len(input_ids)
        pad_len = max_len - seq_len
        batch_input_ids.append(input_ids + [pad_id] * pad_len)
        batch_token_type_ids.append(token_type_ids + [0] * pad_len)
        batch_mlm_labels.append(mlm_labels + [-100] * pad_len)
        batch_nsp_labels.append(nsp_label)
        batch_mask.append([1] * seq_len + [0] * pad_len)

    batch_input_ids = torch.tensor(batch_input_ids, dtype=torch.long)
    batch_token_type_ids = torch.tensor(batch_token_type_ids, dtype=torch.long)
    batch_mlm_labels = torch.tensor(batch_mlm_labels, dtype=torch.long)
    batch_nsp_labels = torch.tensor(batch_nsp_labels, dtype=torch.long)
    batch_mask = torch.tensor(batch_mask, dtype=torch.long)
    return batch_input_ids, batch_token_type_ids, batch_mlm_labels, batch_nsp_labels, batch_mask


def pretrain_bert():
    dataset = build_pretraining_dataset(corpus, num_examples=32)
    dataloader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_pretraining_batch)

    model = BertForPreTraining(
        vocab_size=len(tkn2idx),
        token_type_size=2,
        max_position_embeddings=64,
        hidden_dim=32,
        num_layers=2,
        num_heads=2,
        d_ff=64,
        dropout_prob=0.1,
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    model.train()
    EPOCHS = 100

    for epoch in range(EPOCHS):
        total_loss = 0
        for batch in dataloader:
            input_ids, token_type_ids, mlm_labels, nsp_labels, mask = batch
            optimizer.zero_grad()

            prediction_scores, seq_relationship_scores = model(input_ids, token_type_ids, mask)

            mlm_loss = F.cross_entropy(prediction_scores.view(-1, len(tkn2idx)), mlm_labels.view(-1), ignore_index=-100)
            nsp_loss = F.cross_entropy(seq_relationship_scores.view(-1, 2), nsp_labels.view(-1))
            loss = mlm_loss + nsp_loss

            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch + 1}/{EPOCHS}, Loss: {avg_loss:.4f}")

    return model
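
The MLM loss above relies on ignore_index=-100, so that only the selected positions contribute. The plain-Python sketch below shows the same computation under mean reduction: positions labeled -100 are skipped, and the loss is averaged over the remaining ones. The function name masked_cross_entropy and the example numbers are illustrative assumptions, and the sketch assumes at least one position is not ignored.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position was not selected for prediction
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_sum_exp - row[label])  # -log softmax(row)[label]
    return sum(losses) / len(losses)


# Three positions over a 3-token vocabulary; only the middle one is a target.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [-100, 2, -100]
loss = masked_cross_entropy(logits, labels)
print(loss)
```

With uniform logits at the only target position, the loss equals ln(3); the two ignored positions have no effect on it.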

The following code tests the pre-trained BERT.

def test_pretrain_bert(model):
    sent_a = "i like [MASK]"
    sent_b = "they are playful"

    input_ids, token_type_ids, mlm_labels, nsp_label = create_example_for_mlm_nsp(sent_a, sent_b, is_next=True)
    test_batch = collate_pretraining_batch([(input_ids, token_type_ids, mlm_labels, nsp_label)])
    input_ids_batch, token_type_ids_batch, mlm_labels_batch, nsp_labels_batch, mask_batch = test_batch

    model.eval()
    with torch.no_grad():
        prediction_scores, seq_relationship_scores = model(input_ids_batch, token_type_ids_batch, mask_batch)

    masked_index = (torch.tensor(input_ids) == tkn2idx["[MASK]"]).nonzero(as_tuple=True)[0]
    if len(masked_index) > 0:
        # We'll just look at the first masked token
        mask_position = masked_index[0].item()
        logits = prediction_scores[0, mask_position]  # shape [vocab_size]
        probs = F.softmax(logits, dim=-1)
        top5 = torch.topk(probs, 5)
        print("Top 5 predictions for [MASK]:")
        for prob, idx in zip(top5.values, top5.indices):
            print(f"  Token='{idx2tkn[idx.item()]}' prob={prob.item():.4f}")

    nsp_prob = F.softmax(seq_relationship_scores[0], dim=-1)
    print("NSP probabilities =", nsp_prob)

Fine-tuning BERT

In the fine-tuning stage, we fine-tune a pre-trained BERT, for example the one we just pre-trained, or Google's pre-trained bert-base-uncased or bert-large-uncased. A pre-trained BERT can be fine-tuned for a specific downstream task. Next, we show how to fine-tune BERT into a sentiment classification model.

The following is the data used for fine-tuning.

# 1: positive, 0: negative
sentiment_data = [
    ("i like dogs", 1),
    ("i like cats", 1),
    ("they are playful", 1),
    ("they are bad", 0),  # 'bad' not in vocab, will become [UNK]
    ("i like [UNK]", 0),  # random negative label
]

We then build the fine-tuning dataset with the code below. It is similar to building the pre-training dataset, except that here we use a single sentence rather than a sentence pair.

def create_example_for_classification(sentence):
    cls_id = tkn2idx["[CLS]"]
    sep_id = tkn2idx["[SEP]"]

    tokens = tokenize(sentence)

    input_ids = [cls_id] + tokens + [sep_id]
    token_type_ids = [0] * (len(tokens) + 2)

    return input_ids, token_type_ids


def build_sentiment_dataset(data):
    examples = []
    for sentence, label in data:
        input_ids, token_type_ids = create_example_for_classification(sentence)
        examples.append((input_ids, token_type_ids, label))
    return examples

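As with the pre-training tasks, sentiment classification needs a task-specific layer on top of BERT's output. Our sentiment classification model predicts whether a sentence is positive or negative, so it makes one prediction for the whole sentence. We therefore use the output corresponding to [CLS] (that is, the aggregate sequence representation) for the prediction.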

class BertForSequenceClassification(nn.Module):
    def __init__(self, bert, num_labels, hidden_dim):
        super(BertForSequenceClassification, self).__init__()
        self.bert = bert
        # A classification head: we typically use the [CLS] pooled output
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, input_ids, token_type_ids=None, mask=None, labels=None):
        """
        Sequence classification with BERT

        Args
            input_ids: (batch_size, seq_len)
            token_type_ids: (batch_size, seq_len)
            mask: (batch_size, seq_len)
            labels: (batch_size)

        Returns
            logits: (batch_size, num_classes)
            loss: (optional) Cross entropy loss
        """

        sequence_output, pooled_output = self.bert(input_ids, token_type_ids=token_type_ids, mask=mask)
        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            loss = F.cross_entropy(logits, labels)

        return logits, loss

We fine-tune a pre-trained BERT with the following code.

# NOTE: The fine-tuning code is missing from the original post at this point (the
# pre-training code was repeated instead). The two functions below are an assumed,
# minimal reconstruction that matches the fine_tune_bert(pretrain_model.bert) call
# in the main block.
def collate_classification_batch(examples):
    pad_id = tkn2idx["[PAD]"]
    max_len = max(len(ex[0]) for ex in examples)

    batch_input_ids = []
    batch_token_type_ids = []
    batch_labels = []
    batch_mask = []

    for (input_ids, token_type_ids, label) in examples:
        seq_len = len(input_ids)
        pad_len = max_len - seq_len
        batch_input_ids.append(input_ids + [pad_id] * pad_len)
        batch_token_type_ids.append(token_type_ids + [0] * pad_len)
        batch_labels.append(label)
        batch_mask.append([1] * seq_len + [0] * pad_len)

    batch_input_ids = torch.tensor(batch_input_ids, dtype=torch.long)
    batch_token_type_ids = torch.tensor(batch_token_type_ids, dtype=torch.long)
    batch_labels = torch.tensor(batch_labels, dtype=torch.long)
    batch_mask = torch.tensor(batch_mask, dtype=torch.long)
    return batch_input_ids, batch_token_type_ids, batch_labels, batch_mask


def fine_tune_bert(pretrained_bert):
    dataset = build_sentiment_dataset(sentiment_data)
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_classification_batch)

    # hidden_dim must match the hidden_dim used in pretrain_bert()
    model = BertForSequenceClassification(pretrained_bert, num_labels=2, hidden_dim=32)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    model.train()
    EPOCHS = 100

    for epoch in range(EPOCHS):
        total_loss = 0
        for batch in dataloader:
            input_ids, token_type_ids, labels, mask = batch
            optimizer.zero_grad()

            logits, loss = model(input_ids, token_type_ids, mask, labels)

            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch + 1}/{EPOCHS}, Loss: {avg_loss:.4f}")

    return model

Finally, we can test the sentiment classification model we just fine-tuned with the following code.

def test_fine_tune_bert(model):
    text = "i like dogs"

    input_ids, token_type_ids = create_example_for_classification(text)
    mask = [1] * len(input_ids)

    input_ids_tensor = torch.tensor([input_ids], dtype=torch.long)
    token_type_ids_tensor = torch.tensor([token_type_ids], dtype=torch.long)
    mask_tensor = torch.tensor([mask], dtype=torch.long)

    model.eval()
    with torch.no_grad():
        logits, loss = model(input_ids_tensor, token_type_ids_tensor, mask_tensor)

    probs = F.softmax(logits, dim=-1)
    predicted_label = torch.argmax(probs, dim=-1).item()

    print("Probabilities =", probs)
    print("Predicted label =", predicted_label)

You can run pre-training and fine-tuning with the following code.

if __name__ == "__main__":
    pretrain_model = pretrain_bert()
    test_pretrain_bert(pretrain_model)
    fine_tune_model = fine_tune_bert(pretrain_model.bert)
    test_fine_tune_bert(fine_tune_model)

Conclusion

BERT is not only a technical innovation in NLP but also an important milestone for language understanding in AI more broadly. Through its bidirectional Transformer architecture, pre-training, and fine-tuning, BERT brought a new level of accuracy and flexibility to a wide range of language tasks.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics (NAACL).
BERT Source Code: https://github.com/google-research/bert.
