A Sequence to Sequence (Seq2Seq) model is a neural network model that maps one sequence to another sequence. It revolutionized the field of natural language processing (NLP), bringing dramatic improvements to tasks such as translation, text summarization, and chatbots. This article takes an in-depth look at how the Seq2Seq model works.
Seq2Seq Model
A Seq2Seq model is essentially a neural network that transforms an input sequence into an output sequence, for example, turning an English sentence into its French translation. Moreover, the input sequence and the output sequence may have different lengths, which corresponds to the many-to-many type of RNN.
The architecture of a Seq2Seq model consists of two main parts:
- Encoder: processes the input sequence; its final hidden state is called the encoder state. We can think of this encoder state as the encoder compressing the information of the whole input sequence into a single vector. For this reason, the encoder state is also called the context vector or thought vector.
- Decoder: uses the context vector to generate the target sequence.
As the equation below shows, the encoder computes the context vector $v$ from the input sequence $(x_1, \dots, x_T)$, while the decoder computes the conditional probability of the output sequence $(y_1, \dots, y_{T'})$:

$$p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \dots, y_{t-1})$$
The figure below shows how the encoder and the decoder work together. The encoder and decoder each contain a single-layer or multi-layer LSTM. We could replace the LSTMs in the figure with vanilla RNNs, but vanilla RNNs suffer from the vanishing gradient problem; using GRUs or LSTMs instead alleviates it.
In addition, special <SOS> and <EOS> tokens are added at the beginning and the end of a sequence to mark where the sequence starts and ends. For example, the target sentence "hola mundo" is fed to the decoder as <SOS> hola mundo <EOS>.

Implementation
The following is the implementation of the encoder. As described above, the encoder is essentially a single-layer or multi-layer RNN; here we use an LSTM. In addition, we need an Embedding layer to convert tokens into word embeddings. The encoder's input_dim is the vocabulary size of the input sequences.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, num_layers=1):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx=PAD_INDEX)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)

    def forward(self, input):
        """
        Args:
            input: (batch_size, seq_len)
        Returns:
            output: (batch_size, seq_len, hidden_dim)
            hidden: (num_layers, batch_size, hidden_dim)
            cell: (num_layers, batch_size, hidden_dim)
        """
        embedding = self.embedding(input)
        output, (hidden, cell) = self.lstm(embedding)
        return output, hidden, cell
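As a quick sanity check, we can push a batch of random token indices through the encoder and inspect the shapes. This is just a sketch; it assumes PAD_INDEX has been defined (it is set to 2 in the training section below), and the dimensions here are arbitrary toy values.

PAD_INDEX = 2  # assumed here; defined for real in the training section below
enc = Encoder(input_dim=20, embedding_dim=16, hidden_dim=32, num_layers=2)
src = torch.randint(0, 20, (4, 7))  # a batch of 4 sequences, each 7 tokens long
output, hidden, cell = enc(src)
print(output.shape)  # torch.Size([4, 7, 32])
print(hidden.shape)  # torch.Size([2, 4, 32])
print(cell.shape)    # torch.Size([2, 4, 32])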
The following is the implementation of the decoder. The decoder's output_dim is the vocabulary size of the output sequences. Like the encoder, the decoder has an Embedding layer and an LSTM. In addition, it needs a fully connected layer to project the LSTM's output from hidden_dim to output_dim.
class Decoder(nn.Module):
    def __init__(self, output_dim, embedding_dim, hidden_dim, num_layers=1):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(output_dim, embedding_dim, padding_idx=PAD_INDEX)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, output_dim)

    def forward(self, input, hidden, cell):
        """
        Args:
            input: (batch_size, 1) - one token per batch element
            hidden: (num_layers, batch_size, hidden_dim)
            cell: (num_layers, batch_size, hidden_dim)
        Returns:
            prediction: (batch_size, output_dim)
            hidden: (num_layers, batch_size, hidden_dim)
            cell: (num_layers, batch_size, hidden_dim)
        """
        embedding = self.embedding(input)
        output, (hidden, cell) = self.lstm(embedding, (hidden, cell))
        prediction = self.fc_out(output.squeeze(1))
        return prediction, hidden, cell
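Continuing the sketch above, a single decoding step consumes one token per batch element together with the encoder's final hidden and cell states (the shapes below reuse hidden and cell from the encoder sketch):

dec = Decoder(output_dim=15, embedding_dim=16, hidden_dim=32, num_layers=2)
tok = torch.zeros(4, 1, dtype=torch.long)  # e.g. the <SOS> token for each batch element
prediction, hidden, cell = dec(tok, hidden, cell)  # hidden/cell from the encoder sketch
print(prediction.shape)  # torch.Size([4, 15]) - unnormalized scores over the vocabulary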
The following is the implementation of the Seq2Seq model. In the forward pass, it first feeds the input sequence into the encoder to obtain hidden and cell; these two values are the so-called context vector or thought vector. The decoder then predicts the output sequence token by token.
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        """
        Args:
            src: (batch_size, src_len)
            tgt: (batch_size, tgt_len)
            teacher_forcing_ratio: float - probability of using teacher forcing
        Returns:
            outputs: (batch_size, tgt_len, tgt_vocab_size)
        """
        batch_size = src.shape[0]
        tgt_len = tgt.shape[1]
        tgt_vocab_size = self.decoder.fc_out.out_features
        outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size)
        _, hidden, cell = self.encoder(src)
        input = tgt[:, 0]  # the <SOS> token
        for t in range(1, tgt_len):
            input = input.unsqueeze(1)  # (batch_size,) -> (batch_size, 1)
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[:, t, :] = output
            # Teacher forcing: with some probability, feed the ground-truth token
            # instead of the model's own prediction as the next input.
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            input = tgt[:, t] if teacher_force else output.argmax(1)
        return outputs
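A rough end-to-end shape check, again with the toy dimensions from the earlier sketches:

model_sketch = Seq2Seq(Encoder(20, 16, 32, 2), Decoder(15, 16, 32, 2))
src = torch.randint(0, 20, (4, 7))
tgt = torch.randint(0, 15, (4, 6))
outputs = model_sketch(src, tgt, teacher_forcing_ratio=0.5)
print(outputs.shape)  # torch.Size([4, 6, 15])

With teacher_forcing_ratio=0.5, roughly half of the decoding steps are fed the ground-truth token, which stabilizes early training, while the other half are fed the model's own prediction, which matches the conditions at inference time.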
The following is the implementation of the training setup. For the input and output sequences of each training example, we convert every word to its corresponding index and wrap each sentence with <SOS> and <EOS>. We also append <PAD> tokens so that all input sequences share one length and all output sequences share another.
SOS_TOKEN = "<sos>"
EOS_TOKEN = "<eos>"
PAD_TOKEN = "<pad>"
SOS_INDEX = 0
EOS_INDEX = 1
PAD_INDEX = 2
english_sentences = [
    "hello world",
    "good morning",
    "i love you",
    "cat",
    "dog",
    "go home",
]
spanish_sentences = [
    "hola mundo",
    "buenos dias",
    "te amo",
    "gato",
    "perro",
    "ve a casa",
]
def build_vocab(sentences):
    vocab = list(set([word for sentence in sentences for word in sentence.split(" ")]))
    # Special tokens come first so that their indices match SOS_INDEX, EOS_INDEX, PAD_INDEX.
    vocab = [SOS_TOKEN, EOS_TOKEN, PAD_TOKEN] + vocab
    tkn2idx = {tkn: i for i, tkn in enumerate(vocab)}
    idx2tkn = {i: tkn for tkn, i in tkn2idx.items()}
    return tkn2idx, idx2tkn

def convert_sentences_to_idx(sentences, tkn2idx):
    sentences_idx = [[tkn2idx[tkn] for tkn in sentence.split(" ")] for sentence in sentences]
    for sentence_idx in sentences_idx:
        sentence_idx.insert(0, tkn2idx[SOS_TOKEN])
        sentence_idx.append(tkn2idx[EOS_TOKEN])
    return sentences_idx
src_tkn2idx, src_idx2tkn = build_vocab(english_sentences)
tgt_tkn2idx, tgt_idx2tkn = build_vocab(spanish_sentences)
src_data = convert_sentences_to_idx(english_sentences, src_tkn2idx)
tgt_data = convert_sentences_to_idx(spanish_sentences, tgt_tkn2idx)
max_src_len = max([len(sentence) for sentence in src_data])
max_tgt_len = max([len(sentence) for sentence in tgt_data])
pair = []
for src, tgt in zip(src_data, tgt_data):
    src += [src_tkn2idx[PAD_TOKEN]] * (max_src_len - len(src))
    src_tensor = torch.tensor(src, dtype=torch.long)
    tgt += [tgt_tkn2idx[PAD_TOKEN]] * (max_tgt_len - len(tgt))
    tgt_tensor = torch.tensor(tgt, dtype=torch.long)
    pair.append((src_tensor, tgt_tensor))
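To see what a training pair looks like, we can print one (a sketch; the exact word indices vary between runs because build_vocab builds the vocabulary from an unordered set):

src_tensor, tgt_tensor = pair[0]
print(src_tensor)  # e.g. tensor([0, 7, 3, 1, 2]) - <sos> hello world <eos> <pad>; word indices vary
print(tgt_tensor)  # e.g. tensor([0, 5, 9, 1, 2]) - <sos> hola mundo <eos> <pad>; word indices vary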
EMBEDDING_DIM = 16
HIDDEN_DIM = 32
NUM_LAYERS = 4
LEARNING_RATE = 0.01
EPOCHS = 50
encoder = Encoder(len(src_idx2tkn), EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS)
decoder = Decoder(len(tgt_idx2tkn), EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS)
model = Seq2Seq(encoder, decoder)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
# ignore_index=PAD_INDEX: padded target positions contribute nothing to the loss.
criterion = nn.CrossEntropyLoss(ignore_index=PAD_INDEX)
def train():
    model.train()
    for epoch in range(EPOCHS):
        total_loss = 0
        for src_tensor, tgt_tensor in pair:
            src_tensor = src_tensor.unsqueeze(0)  # add a batch dimension
            tgt_tensor = tgt_tensor.unsqueeze(0)
            optimizer.zero_grad()
            output = model(src_tensor, tgt_tensor, teacher_forcing_ratio=0.5)
            output_dim = output.shape[-1]
            # Skip position 0 (<sos>), which the model never predicts.
            output = output[:, 1:, :].reshape(-1, output_dim)
            tgt = tgt_tensor[:, 1:].reshape(-1)
            loss = criterion(output, tgt)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        if epoch % 10 == 0:
            print(f"epoch {epoch}, loss {total_loss / len(pair)}")
Once the model is trained, we can use it to translate English into Spanish. In the code below, we use beam search to generate the translations.
def translate_beam_search(sentence, beam_width=3, max_length=10):
    model.eval()
    src_idx = convert_sentences_to_idx([sentence], src_tkn2idx)[0]
    src_idx += [src_tkn2idx[PAD_TOKEN]] * (max_src_len - len(src_idx))
    src_tensor = torch.tensor(src_idx, dtype=torch.long).unsqueeze(0)  # (1, max_src_len)
    with torch.no_grad():
        _, hidden, cell = model.encoder(src_tensor)
    # Each hypothesis is (tokens, hidden, cell, cumulative log probability).
    beam = [([SOS_INDEX], hidden, cell, 0.0)]
    completed_sentences = []
    for _ in range(max_length):
        new_beam = []
        for tokens, hidden, cell, score in beam:
            if tokens[-1] == EOS_INDEX:
                completed_sentences.append((tokens, score))
                new_beam.append((tokens, hidden, cell, score))
                continue
            input_index = torch.tensor([[tokens[-1]]], dtype=torch.long)  # (1, 1)
            with torch.no_grad():
                output, hidden, cell = model.decoder(input_index, hidden, cell)
            log_probs = torch.log_softmax(output, dim=1).squeeze(0)
            topk = torch.topk(log_probs, beam_width)
            for tkn_idx, tkn_score in zip(topk.indices.tolist(), topk.values.tolist()):
                new_tokens = tokens + [tkn_idx]
                new_score = score + tkn_score
                new_beam.append((new_tokens, hidden, cell, new_score))
        # Keep only the beam_width highest-scoring hypotheses.
        new_beam.sort(key=lambda x: x[3], reverse=True)
        beam = new_beam[:beam_width]
    for tokens, hidden, cell, score in beam:
        if tokens[-1] != EOS_INDEX:
            completed_sentences.append((tokens, score))
    completed_sentences.sort(key=lambda x: x[1], reverse=True)
    best_tokens, best_score = completed_sentences[0]
    # Strip <sos>, and cut the sentence at <eos> if present.
    if best_tokens[0] == SOS_INDEX:
        best_tokens = best_tokens[1:]
    if EOS_INDEX in best_tokens:
        best_tokens = best_tokens[:best_tokens.index(EOS_INDEX)]
    return " ".join([tgt_idx2tkn[idx] for idx in best_tokens])
def test():
    test_sentences = [
        "hello world",
        "i love you",
        "cat",
        "go home",
    ]
    for sentence in test_sentences:
        translation = translate_beam_search(sentence)
        print(f"src: {sentence}, tgt: {translation}")
Conclusion
By now you should have a solid understanding of the Seq2Seq model. Although Seq2Seq models work well, they still face challenges in handling long sequences, maintaining context effectively, and computational efficiency. Later techniques such as attention models and Transformer networks were invented to address these problems.
References
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems – Volume 2 (NIPS'14), pages 3104-3112.
- Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, Dzmitry Bahdanau, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724-1734.