Sequence to Sequence Model (Seq2Seq)

Photo by Léonard Cotte on Unsplash
The Sequence to Sequence (Seq2Seq) model is a neural network architecture that maps one sequence to another. It has revolutionized the field of Natural Language Processing (NLP), significantly enhancing the performance of tasks such as machine translation, text summarization, and chatbots. This article dives deeply into the principles behind the Seq2Seq model.

The complete code for this chapter can be found in .

Seq2Seq Model

The Seq2Seq model is essentially a neural network designed to convert an input sequence into an output sequence, for example translating an English sentence into French. Furthermore, the input and output sequences can have different lengths, which makes Seq2Seq a “many-to-many” type of RNN architecture.

The Seq2Seq model architecture primarily consists of two components:

  • Encoder: Processes the input sequence and generates a final hidden state known as the encoder state. We can think of this encoder state as the encoder compressing all the information of the input sequence into a single vector. Hence, this encoder state is also called a context vector or thought vector.
  • Decoder: Uses the context vector to generate the target sequence.


As shown in the following equation, the encoder is responsible for computing the context vector v, while the decoder calculates the conditional probability p(y_1,\cdots,y_{T^\prime}|x_1,\cdots,x_{T}).

p(y_1,\cdots,y_{T^\prime}|x_1,\cdots,x_{T})=\displaystyle\prod_{t=1}^{T^\prime}p(y_t|v,y_1,\cdots,y_{t-1}) \\\\ (x_1,\cdots,x_{T}):\text{input sequence} \\\\ (y_1,\cdots,y_{T^\prime}):\text{output sequence} \\\\ T\text{ may differ from }T^\prime \\\\ v:\text{context vector}
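
For example, for an output sequence of only two tokens, this factorization expands to:

p(y_1,y_2|x_1,\cdots,x_{T})=p(y_1|v)\,p(y_2|v,y_1)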

The figure below illustrates the workflow between the encoder and decoder. Both encoder and decoder consist of single-layer or multi-layer LSTMs. Although general RNNs could replace the LSTMs shown in the figure, doing so may lead to the problem of vanishing gradients. Therefore, using GRU or LSTM cells is preferable to mitigate this issue.

Additionally, special tokens such as <SOS> and <EOS> are added at the beginning and end of sequences to indicate the start and end, respectively.

Seq2Seq Model – Encoder and Decoder.

Implementation

The following is the implementation of the encoder. As we already know, the encoder mainly consists of single-layer or multi-layer RNNs; here, we use LSTM. Additionally, we need an Embedding layer to convert tokens into word embeddings. The encoder’s input_dim represents the vocabulary size of the input sequence.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, num_layers=1):
        super(Encoder, self).__init__()
        # input_dim is the source vocabulary size; PAD_INDEX (defined in the training section below) marks padding tokens.
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx=PAD_INDEX)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)

    def forward(self, input):
        """
        Args:
            input: (batch_size, seq_len)

        Returns:
            output: (batch_size, seq_len, hidden_dim)
            hidden: (num_layers, batch_size, hidden_dim)
            cell: (num_layers, batch_size, hidden_dim)
        """
        embedding = self.embedding(input)
        output, (hidden, cell) = self.lstm(embedding)
        return output, hidden, cell
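
As a quick sanity check, which is not part of the original code, we can push a dummy batch through the encoder. The dimensions below are arbitrary assumptions: a source vocabulary of 10 tokens, embedding size 16, hidden size 32, and PAD_INDEX set to 2 to match the constant defined later in the training section.

# Illustrative shape check only; all dimensions here are assumptions.
PAD_INDEX = 2  # same value as the constant defined later in the training code
encoder_check = Encoder(input_dim=10, embedding_dim=16, hidden_dim=32, num_layers=1)
dummy_src = torch.randint(0, 10, (4, 7))   # (batch_size=4, seq_len=7)
enc_out, enc_hidden, enc_cell = encoder_check(dummy_src)
print(enc_out.shape)     # torch.Size([4, 7, 32])
print(enc_hidden.shape)  # torch.Size([1, 4, 32])
print(enc_cell.shape)    # torch.Size([1, 4, 32])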

The following is the implementation of the decoder. The decoder’s output_dim is the vocabulary size of the output sequence. Similar to the encoder, the decoder also has an Embedding and an LSTM layer. Additionally, it includes a fully connected layer that maps the output dimension from hidden_dim to output_dim.

class Decoder(nn.Module):
    def __init__(self, output_dim, embedding_dim, hidden_dim, num_layers=1):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(output_dim, embedding_dim, padding_idx=PAD_INDEX)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, output_dim)

    def forward(self, input, hidden, cell):
        """
        Args
            input: (batch_size,)
            hidden: (num_layers, batch_size, hidden_dim)
            cell: (num_layers, batch_size, hidden_dim)
        """
        embedding = self.embedding(input)
        output, (hidden, cell) = self.lstm(embedding, (hidden, cell))
        prediction = self.fc_out(output.squeeze(1))
        return prediction, hidden, cell
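
A single decoding step can be checked in the same illustrative way. The target vocabulary size of 12 is again an assumption, and enc_hidden / enc_cell come from the encoder check above.

# Illustrative single decoding step; dimensions are assumptions.
decoder_check = Decoder(output_dim=12, embedding_dim=16, hidden_dim=32, num_layers=1)
step_input = torch.zeros(4, 1, dtype=torch.long)  # (batch_size=4, 1); token 0 stands in for <SOS>
pred, dec_hidden, dec_cell = decoder_check(step_input, enc_hidden, enc_cell)
print(pred.shape)  # torch.Size([4, 12])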

The following is the implementation of the Seq2Seq model. During the forward pass, it first passes the input sequence through the encoder to obtain the hidden and cell states. These two values form the so-called context vector or thought vector. Then, the decoder uses this context vector to generate the output sequence one token at a time. During training, we also apply teacher forcing: with probability teacher_forcing_ratio, the ground-truth target token is fed to the decoder as the next input instead of the decoder's own prediction, which typically speeds up convergence.

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        """
        Args:
            src: (batch_size, src_len)
            tgt: (batch_size, tgt_len)
            teacher_forcing_ratio: float - probability to use teacher forcing
        """

        batch_size = src.shape[0]
        tgt_len = tgt.shape[1]
        tgt_vocab_size = self.decoder.fc_out.out_features
        outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size)

        _, hidden, cell = self.encoder(src)
        input = tgt[:, 0]
        for t in range(1, tgt_len):
            input = input.unsqueeze(1)
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[:, t, :] = output

            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            input = tgt[:, t] if teacher_force else output.argmax(1)

        return outputs
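
Putting the pieces together, one forward pass returns one score vector over the target vocabulary per target position. The snippet below is a sketch under the same assumed toy dimensions as the earlier checks, and it relies on PAD_INDEX already being defined.

# Illustrative end-to-end shape check; all dimensions are assumptions.
toy_model = Seq2Seq(Encoder(10, 16, 32), Decoder(12, 16, 32))
toy_src = torch.randint(0, 10, (4, 7))  # (batch_size=4, src_len=7)
toy_tgt = torch.randint(0, 12, (4, 5))  # (batch_size=4, tgt_len=5)
toy_out = toy_model(toy_src, toy_tgt, teacher_forcing_ratio=0.5)
print(toy_out.shape)  # torch.Size([4, 5, 12])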

Below is the implementation of the training procedure. For each pair of input-output sequences in the training data, we convert each word to its corresponding index. We add <SOS> and <EOS> tokens at the beginning and end of each sentence. Additionally, we append <PAD> tokens to ensure all input sequences have the same length, and all output sequences also share a consistent length.

SOS_TOKEN = "<sos>"
EOS_TOKEN = "<eos>"
PAD_TOKEN = "<pad>"

SOS_INDEX = 0
EOS_INDEX = 1
PAD_INDEX = 2

english_sentences = [
    "hello world",
    "good morning",
    "i love you",
    "cat",
    "dog",
    "go home",
]

spanish_sentences = [
    "hola mundo",
    "buenos dias",
    "te amo",
    "gato",
    "perro",
    "ve a casa",
]


def build_vocab(sentences):
    vocab = list(set([word for sentence in sentences for word in sentence.split(" ")]))
    vocab = [SOS_TOKEN, EOS_TOKEN, PAD_TOKEN] + vocab
    tkn2idx = {tkn: i for i, tkn in enumerate(vocab)}
    idx2tkn = {i: tkn for tkn, i in tkn2idx.items()}
    return tkn2idx, idx2tkn


def convert_sentences_to_idx(sentences, tkn2idx):
    sentences_idx = [[tkn2idx[tkn] for tkn in sentence.split(" ")] for sentence in sentences]
    for sentence_idx in sentences_idx:
        sentence_idx.insert(0, tkn2idx[SOS_TOKEN])
        sentence_idx.append(tkn2idx[EOS_TOKEN])
    return sentences_idx


src_tkn2idx, src_idx2tkn = build_vocab(english_sentences)
tgt_tkn2idx, tgt_idx2tkn = build_vocab(spanish_sentences)

src_data = convert_sentences_to_idx(english_sentences, src_tkn2idx)
tgt_data = convert_sentences_to_idx(spanish_sentences, tgt_tkn2idx)
max_src_len = max([len(sentence) for sentence in src_data])
max_tgt_len = max([len(sentence) for sentence in tgt_data])
pair = []
for src, tgt in zip(src_data, tgt_data):
    src += [src_tkn2idx[PAD_TOKEN]] * (max_src_len - len(src))
    src_tensor = torch.tensor(src, dtype=torch.long)
    tgt += [tgt_tkn2idx[PAD_TOKEN]] * (max_tgt_len - len(tgt))
    tgt_tensor = torch.tensor(tgt, dtype=torch.long)
    pair.append((src_tensor, tgt_tensor))

EMBEDDING_DIM = 16
HIDDEN_DIM = 32
NUM_LAYERS = 4
LEARNING_RATE = 0.01
EPOCHS = 50

encoder = Encoder(len(src_idx2tkn), EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS)
decoder = Decoder(len(tgt_idx2tkn), EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS)
model = Seq2Seq(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_INDEX)


def train():
    model.train()

    for epoch in range(EPOCHS):
        total_loss = 0
        for src_tensor, tgt_tensor in pair:
            src_tensor = src_tensor.unsqueeze(0)
            tgt_tensor = tgt_tensor.unsqueeze(0)

            optimizer.zero_grad()
            output = model(src_tensor, tgt_tensor, teacher_forcing_ratio=0.5)

            output_dim = output.shape[-1]
            output = output[:, 1:, :].reshape(-1, output_dim)
            tgt = tgt_tensor[:, 1:].reshape(-1)
            loss = criterion(output, tgt)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        if epoch % 10 == 0:
            print(f"epoch {epoch}, loss {total_loss / len(pair)}")

After training, we can use the model to translate English sentences into Spanish. In the following code, we apply Beam Search to generate translations: instead of greedily committing to the single most likely token at each step, it keeps the beam_width highest-scoring partial translations, ranked by their accumulated log-probability, and expands each of them at the next step.

def translate_beam_search(sentence, beam_width=3, max_length=10):
    model.eval()

    src_idx = convert_sentences_to_idx([sentence], src_tkn2idx)[0]
    src_idx += [src_tkn2idx[PAD_TOKEN]] * (max_src_len - len(src_idx))
    src_tensor = torch.tensor(src_idx, dtype=torch.long)

    with torch.no_grad():
        _, hidden, cell = model.encoder(src_tensor)

    beam = [([SOS_INDEX], hidden, cell, 0.0)]
    completed_sentences = []

    for _ in range(max_length):
        new_beam = []
        for tokens, hidden, cell, score in beam:
            if tokens[-1] == EOS_INDEX:
                completed_sentences.append((tokens, score))
                new_beam.append((tokens, hidden, cell, score))
                continue

            input_index = torch.tensor([tokens[-1]], dtype=torch.long)
            with torch.no_grad():
                output, hidden, cell = model.decoder(input_index, hidden, cell)
                log_probs = torch.log_softmax(output, dim=1).squeeze(0)

            topk = torch.topk(log_probs, beam_width)
            for tkn_idx, tkn_score in zip(topk.indices.tolist(), topk.values.tolist()):
                new_tokens = tokens + [tkn_idx]
                new_score = score + tkn_score
                new_beam.append((new_tokens, hidden, cell, new_score))

        new_beam.sort(key=lambda x: x[3], reverse=True)
        beam = new_beam[:beam_width]

    for tokens, hidden, cell, score in beam:
        if tokens[-1] != EOS_INDEX:
            completed_sentences.append((tokens, score))

    completed_sentences.sort(key=lambda x: x[1], reverse=True)
    best_tokens, best_score = completed_sentences[0]

    if best_tokens[0] == SOS_INDEX:
        best_tokens = best_tokens[1:]
    if EOS_INDEX in best_tokens:
        best_tokens = best_tokens[:best_tokens.index(EOS_INDEX)]

    return " ".join([tgt_idx2tkn[idx] for idx in best_tokens])


def test():
    test_sentences = [
        "hello world",
        "i love you",
        "cat",
        "go home",
    ]
    for sentence in test_sentences:
        translation = translate_beam_search(sentence)
        print(f"src: {sentence}, tgt: {translation}")

Conclusion

By now, you should have a solid understanding of the Seq2Seq model. Although the Seq2Seq model performs well, it still faces challenges such as handling long sequences, maintaining context effectively, and high computational cost. Later, new techniques such as attention mechanisms and Transformer networks were introduced to address these issues.
