GloVe Word Embeddings

GloVe is a word embedding model that constructs word vectors based on global co-occurrence statistics. Unlike Word2Vec, which relies on local context windows, GloVe captures the overall statistical relationships between words through matrix factorization. This approach enables GloVe to generate high-quality word representations that effectively encode semantic and syntactic relationships. This article will introduce the principles and training methods of GloVe.

The complete code for this chapter can be found in .

GloVe Model

GloVe is a word embedding model proposed by J. Pennington et al. at Stanford in 2014. Unlike Word2Vec, which uses a local context window, GloVe uses global matrix factorization: it constructs word embeddings from statistical information extracted from the word co-occurrence matrix. As a result, GloVe can capture word relationships across the entire corpus.

Building the Co-occurrence Matrix

A key idea in GloVe is that word meanings can be inferred from their co-occurrence probabilities. The co-occurrence matrix is therefore central to GloVe. Co-occurrence refers to the number of times word j appears in the context of word i.

Suppose we have a corpus containing the following three sentences:

  1. I like deep learning.
  2. I like machine learning.
  3. Deep learning is powerful.

Assume that the size of the context window is 1, which means one word before and one word after. Then we can build the following co-occurrence matrix X, where row i corresponds to the center word and column j to the context word. X_{ij} is the number of times word j appears in the context of word i.

          I  like  deep  learning  machine  is  powerful
I         0     2     0         0        0   0         0
like      2     0     1         0        1   0         0
deep      0     1     0         2        0   0         0
learning  0     0     2         0        1   1         0
machine   0     1     0         1        0   0         0
is        0     0     0         1        0   0         1
powerful  0     0     0         0        0   1         0

We use the following notation to define the co-occurrence matrix.

X:\text{word-word co-occurrence matrix} \\\\ X_{ij}:\text{the number of times word }j\text{ occurs in the context of word }i \\\\ X_i=\displaystyle\sum_kX_{ik}
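
To make this concrete, the following minimal sketch builds the same co-occurrence matrix for the toy corpus above. The tokenized sentences and the build_co_occurrence helper are illustrative assumptions, not part of the original article.

from collections import defaultdict

# Toy corpus from the example above (lowercased, punctuation removed).
toy_corpus = [
    ["i", "like", "deep", "learning"],
    ["i", "like", "machine", "learning"],
    ["deep", "learning", "is", "powerful"],
]

def build_co_occurrence(sentences, window_size=1):
    # X[(i, j)] counts how often word j appears within window_size words of word i.
    counts = defaultdict(float)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start = max(0, i - window_size)
            end = min(len(sentence), i + window_size + 1)
            for j in range(start, end):
                if i != j:
                    counts[(word, sentence[j])] += 1
    return counts

X = build_co_occurrence(toy_corpus)
print(X[("deep", "learning")])  # 2.0, matching the table above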

Probability Ratios for Word Relationships

With the co-occurrence matrix, we can use it to define some word relationships. Let P_{ij} be the probability that word j appears in the context of word i.

P_{ij}=P(j|i)=\frac{X_{ij}}{X_i}

For example, in the table above, since X_{deep,learning}=2 and X_{deep}=1+2=3, the probability that “learning” appears in the context of “deep” is:

P(learning|deep)=\frac{2}{3}
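
Continuing the toy sketch above and reusing the X dictionary built there (an illustrative aside, not from the original article), we can verify this value in code:

# X_deep is the total number of co-occurrences with "deep" as the center word.
X_deep = sum(count for (center, _), count in X.items() if center == "deep")
print(X[("deep", "learning")] / X_deep)  # 0.666..., i.e. 2/3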

However, P_{ij} alone does not tell us how word i and word j relate to each other compared with other words. To do that, we introduce a probe word k and compare how often k co-occurs with each of them:

  • When k appears more frequently with i than with j, then \frac{P_{ik}}{P_{jk}} > 1.
  • When k appears more frequently with j than with i, then \frac{P_{ik}}{P_{jk}} < 1.
  • When k appears with similar frequency with both i and j, then \frac{P_{ik}}{P_{jk}} \approx 1.

The ratio of co-occurrence probabilities \frac{P_{ik}}{P_{jk}} therefore measures how words i and j differ relative to k. For example, in the toy corpus above, \frac{P(learning|deep)}{P(learning|machine)}=\frac{2/3}{1/2}=\frac{4}{3}>1, because “learning” co-occurs relatively more often with “deep” than with “machine”. GloVe learns word embeddings from these ratios rather than from the raw co-occurrence probabilities, and the ratio can be expressed by the following model:

F(w_i,w_j,\tilde{w}_k)=\frac{P_{ik}}{P_{jk}} \\\\ w\in\mathbb{R}^d:\text{word vectors} \\\\ \tilde{w}\in\mathbb{R}^d:\text{separate context word vectors}

We want F to encode the information in the ratio \frac{P_{ik}}{P_{jk}} in the word vector space. Since vector spaces are linear, the most natural way is to use the difference of the two word vectors. The formula above can therefore be modified as follows:

F(w_i-w_j,\tilde{w}_k)=\frac{P_{ik}}{P_{jk}}

The right-hand side of the equation is a scalar, while the arguments of F on the left-hand side are vectors. To avoid mixing the two, we take the dot product of the arguments:

F((w_i-w_j)^T\tilde{w}_k)=\frac{P_{ik}}{P_{jk}}

Next, we require F to be a homomorphism between addition and multiplication, so that:

F((w_i-w_j)^T\tilde{w}_k)=\frac{F(w_i^T\tilde{w}_k)}{F(w_j^T\tilde{w}_k)}

Then, we can derive the following formula.

F(w_i^T\tilde{w}_k)=P_{ik}=\frac{X_{ik}}{X_i} \\\\ F=\exp \\\\ w_i^T\tilde{w}_k=\log{P_{ik}}=\log{X_{ik}}-\log{X_i}
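
As a quick check, F=\exp indeed satisfies the required property, since:

F((w_i-w_j)^T\tilde{w}_k)=\exp(w_i^T\tilde{w}_k-w_j^T\tilde{w}_k)=\frac{\exp(w_i^T\tilde{w}_k)}{\exp(w_j^T\tilde{w}_k)}=\frac{F(w_i^T\tilde{w}_k)}{F(w_j^T\tilde{w}_k)}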

The term \log{X_i} is independent of k, so it can be absorbed into a bias b_i for word i. Adding another bias \tilde{b}_k for the context word to keep the equation symmetric, we get:

w_i^T\tilde{w}_k+b_i+\tilde{b}_k=\log{X_{ik}}

Loss Function

The last equation above has the form of a least squares problem. Based on it, GloVe proposes a weighted least squares loss:

\displaystyle J=\sum_i^V\sum_j^Vf(X_{ij})(w_i^T\tilde{w}_j+b_i+\tilde{b}_j-\log{X_{ij}})^2

Where f is a weighting function, which is defined as follows:

f(x)=\begin{cases} (\frac{x}{x_{max}})^\alpha &\text{if } x < x_{max} \\ 1 &\text{otherwise} \end{cases}

J. Pennington et al. used x_{max}=100 and \alpha=\frac{3}{4} in the paper.
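
To get a feel for this weighting, the short snippet below (a standalone sketch, not part of the original code) evaluates f(x) for a few co-occurrence counts and shows how rare pairs are down-weighted:

def weighting(x, x_max=100, alpha=0.75):
    # Down-weights rare co-occurrences; the weight is capped at 1 for frequent ones.
    return (x / x_max) ** alpha if x < x_max else 1.0

for count in [1, 10, 50, 100, 500]:
    print(f"f({count}) = {weighting(count):.3f}")

# Output (approximately)
# f(1) = 0.032
# f(10) = 0.178
# f(50) = 0.595
# f(100) = 1.000
# f(500) = 1.000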

Final Word Embeddings

GloVe treats the co-occurrence matrix X as symmetric, that is, X_{ij}=X_{ji}. In theory, the learned matrices W and \tilde{W} are therefore equivalent; they differ slightly only because of their different random initializations. The final word embeddings are the average of the two.

Implementation and Examples

We will use the Wikipedia article on Oolong as our corpus. In the following code, we fetch the article, split it into sentences, and tokenize each sentence into lowercase alphabetic words.

import time

import nltk
import torch
import torch.nn as nn
import torch.optim as optim
import wikipediaapi

wiki = wikipediaapi.Wikipedia(user_agent="waynestalk/1.0", language="en")
page = wiki.page("Oolong")
corpus = page.text

nltk.download("punkt")
sentences = nltk.sent_tokenize(corpus)
tokenized_corpus = [[word.lower() for word in nltk.word_tokenize(sentence) if word.isalpha()] for sentence in sentences]

vocab = set([word for sentence in tokenized_corpus for word in sentence])
word_to_index = {word: i for i, word in enumerate(vocab)}
index_to_word = {i: word for i, word in enumerate(vocab)}
len(vocab)

# Output
580

Next, we build the co-occurrence matrix with the context window size set to 2.

window_size = 2
vocab_size = len(vocab)
co_occurrence_matrix = torch.zeros((vocab_size, vocab_size))

for sentence in tokenized_corpus:
    for i, word in enumerate(sentence):
        word_index = word_to_index[word]
        for j in range(max(0, i - window_size), min(i + window_size + 1, len(sentence))):
            if i != j:
                context_index = word_to_index[sentence[j]]
                co_occurrence_matrix[word_index, context_index] += 1
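
Since the window is applied symmetrically in both directions, the resulting matrix should be symmetric. The quick sanity check below is an added aside, not part of the original article:

# X_ij should equal X_ji because the window looks both forwards and backwards.
print(torch.allclose(co_occurrence_matrix, co_occurrence_matrix.t()))  # True
print(f"Non-zero entries: {int((co_occurrence_matrix > 0).sum().item())}")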

Below is the implementation of the GloVe model.

class GloVe(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(GloVe, self).__init__()
        self.word_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.context_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.word_bias = nn.Embedding(vocab_size, 1)
        self.context_bias = nn.Embedding(vocab_size, 1)

        nn.init.uniform_(self.word_embedding.weight, a=-0.5, b=0.5)
        nn.init.uniform_(self.context_embedding.weight, a=-0.5, b=0.5)
        nn.init.zeros_(self.word_bias.weight)
        nn.init.zeros_(self.context_bias.weight)

    def forward(self, word_index, context_index, co_occurrence):
        word_emb = self.word_embedding(word_index)
        context_emb = self.context_embedding(context_index)
        word_b = self.word_bias(word_index).squeeze()
        context_b = self.context_bias(context_index).squeeze()

        # Weighted squared error: f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
        weighting = self.weighting_function(co_occurrence)
        log_co_occurrence = torch.log(co_occurrence)
        dot = (word_emb * context_emb).sum(dim=1)
        loss = weighting * (dot + word_b + context_b - log_co_occurrence) ** 2
        return loss.sum()

    def weighting_function(self, x, x_max=100, alpha=0.75):
        # f(x) from the paper: down-weight rare co-occurrences and cap the weight at 1.
        return torch.where(x < x_max, (x / x_max) ** alpha, torch.ones_like(x))
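
As a quick smoke test (an illustrative addition, not part of the original training code), we can run a forward pass on a couple of dummy entries and confirm that the loss is a non-negative scalar:

# Hypothetical sanity check with two fake co-occurrence entries.
test_model = GloVe(vocab_size=10, embedding_dim=8)
test_loss = test_model(
    torch.tensor([0, 1]),  # word indices
    torch.tensor([2, 3]),  # context indices
    torch.tensor([1.0, 5.0]),  # co-occurrence counts (must be positive for the log)
)
print(test_loss.shape, test_loss.item())  # a 0-dim tensor holding the summed loss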

Before training, we flatten the non-zero entries of the co-occurrence matrix into parallel lists of word indices, context indices, and co-occurrence counts.

word_indices = []
context_indices = []
co_occurrences = []

for i in range(vocab_size):
    for j in range(vocab_size):
        if co_occurrence_matrix[i, j] > 0:
            word_indices.append(i)
            context_indices.append(j)
            co_occurrences.append(co_occurrence_matrix[i, j].item())

word_indices = torch.tensor(word_indices, dtype=torch.long)
context_indices = torch.tensor(context_indices, dtype=torch.long)
co_occurrences = torch.tensor(co_occurrences, dtype=torch.float)

Now we can train the model as follows.

embedding_dim = 1000
model = GloVe(vocab_size, embedding_dim)
optimizer = optim.Adam(model.parameters(), lr=0.01)
epochs = 500

start_time = time.time()

for epoch in range(epochs):
    loss = model(word_indices, context_indices, co_occurrences)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch == 0 or (epoch + 1) % 100 == 0:
        print(f"Epoch: {epoch}, Loss: {loss.item()}")

end_time = time.time()
print(f"Training time: {end_time - start_time} seconds")

# Output
Epoch: 0, Loss: 1147.0933837890625
Epoch: 99, Loss: 0.01006692461669445
Epoch: 199, Loss: 0.0013765881303697824
Epoch: 299, Loss: 0.007692785933613777
Epoch: 399, Loss: 0.031206317245960236
Epoch: 499, Loss: 0.027982018887996674
Training time: 2.2056429386138916 seconds
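
Because the vocabulary is small, the entire list of non-zero co-occurrences fits in a single batch, so full-batch updates work fine here. For a larger corpus, training would typically use mini-batches instead; the sketch below shows one way this could look with a DataLoader (the batch size and variable names are illustrative assumptions):

from torch.utils.data import DataLoader, TensorDataset

# Illustrative mini-batch variant of the training loop above.
dataset = TensorDataset(word_indices, context_indices, co_occurrences)
loader = DataLoader(dataset, batch_size=1024, shuffle=True)

for epoch in range(epochs):
    total_loss = 0.0
    for batch_words, batch_contexts, batch_counts in loader:
        loss = model(batch_words, batch_contexts, batch_counts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()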

The final word embedding of each word is the average of its word_embedding and context_embedding vectors.

def get_final_embedding(word):
    word_index = torch.tensor(word_to_index[word], dtype=torch.long)
    w_vec = model.word_embedding(word_index).detach()
    c_vec = model.context_embedding(word_index).detach()
    return (w_vec + c_vec) / 2.0
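
With the final embeddings, we can also look up a word’s nearest neighbors by cosine similarity. The most_similar helper below is an illustrative addition, not part of the original article:

def most_similar(word, top_k=5):
    # Rank every vocabulary word by cosine similarity to the query word.
    query = get_final_embedding(word)
    all_vectors = torch.stack([get_final_embedding(index_to_word[i]) for i in range(vocab_size)])
    similarities = nn.functional.cosine_similarity(query.unsqueeze(0), all_vectors, dim=1)
    top_indices = similarities.argsort(descending=True)[1:top_k + 1]  # skip the query word itself
    return [(index_to_word[i.item()], similarities[i].item()) for i in top_indices]

print(most_similar("oolong"))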

In the following code, we use the trained word embeddings to calculate the similarity between two sentences.

sentence1 = "tea is popular in taiwan".split()
sentence2 = "oolong is famous in taiwan".split()
sentence1_embeddings = [get_final_embedding(word) for word in sentence1]
sentence2_embeddings = [get_final_embedding(word) for word in sentence2]
vector1 = torch.stack(sentence1_embeddings).mean(dim=0)
vector2 = torch.stack(sentence2_embeddings).mean(dim=0)
cosine_sim = nn.CosineSimilarity(dim=0)
similarity = cosine_sim(vector1, vector2).item()
print(f"Sentence 1: {sentence1}")
print(f"Sentence 2: {sentence2}")
print(f"Similarity between sentences: {similarity}")

# Output
Sentence 1: ['tea', 'is', 'popular', 'in', 'taiwan']
Sentence 2: ['oolong', 'is', 'famous', 'in', 'taiwan']
Similarity between sentences: 0.6013368964195251

Conclusion

GloVe is a powerful word embedding model that can effectively capture global word relationships through statistical co-occurrence analysis. Its ability to produce meaningful vector representations makes it a valuable tool in NLP applications such as text classification, sentiment analysis, and machine translation.

Reference

J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word Representation,” Proceedings of EMNLP, 2014.
