Word2Vec: A Deep Dive into Word Embeddings and How They Are Trained

Photo by Barcs Tamás on Unsplash
Word2Vec is a model for learning word embeddings: it uses a neural network to encode words and their semantics as vectors. Word2Vec offers two training methods, CBOW and Skip-gram, and improves training efficiency with Negative Sampling and Subsampling. This article introduces the basic principles of Word2Vec and its training methods.

The complete code for this article is available for download.

Word Embeddings

In natural language processing (NLP), word embeddings are a way of representing words: each word and its semantics are encoded as a real-valued vector, so that words that are close to each other in the vector space also have similar meanings. In such a space, given the vectors for Germany, its capital Berlin, and France, we can derive the vector for France's capital, Paris, with the following equation.

vec("Paris") \approx vec("Berlin")-vec("Germany")+vec("France")

Each word is represented as a vector, and a collection of words forms a word embedding matrix. In the word embedding matrix below, each word is represented as a 300-dimensional vector.

Word | Feature 1 | Feature 2 | … | Feature 300
cat | 0.12 | -0.45 | … | 1.24
dog | 0.10 | -0.50 | … | 1.30
king | 0.80 | -0.22 | … | 0.65
queen | 0.85 | -0.20 | … | 0.70

The Word2Vec Model

Word2Vec is a word embedding learning model proposed by Tomas Mikolov et al. at Google in 2013. Given a corpus, the model learns a word embedding matrix. It comes in two architectures: the Continuous Bag-of-Words model (CBOW) and the Continuous Skip-gram model.

Continuous Bag-of-Words Model (CBOW)

CBOW learns by predicting the center word from its context words. In the figure below, the center word is fox and the context window size is 2, so the context consists of the two words before and the two words after fox. During training, the average word embedding of the context words is used as the input, and the true label is the center word fox.

CBOW.

Forward Propagation

CBOW uses a neural network with a single hidden layer to learn word embeddings. After training, the weights of that hidden layer are the word embedding matrix. The figure below shows the forward propagation of CBOW.

CBOW Forward Propagation.

The formulas for CBOW's forward propagation are as follows:

\displaystyle h=\frac{1}{C}\sum_{j=1}^{C}W^Tx_j \\\\ z=Uh+b \\\\ P(w_t|context)=\frac{e^{z_t}}{\sum_{v=1}^V e^{z_v}} \\\\ C:\text{number of context words} \\\\ V: \text{size of vocabulary} \\\\ d: \text{embedding dimension}

The dimensions of the variables are:

W | x_j | h | U | b | z
(V, d) | (V, 1) | (d, 1) | (V, d) | (V, 1) | (V, 1)
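
The shapes above can be checked with a small NumPy sketch. This is only an illustration with random numbers, assuming a tiny vocabulary V = 6, embedding dimension d = 4, and C = 2 context words.

import numpy as np

V, d, C = 6, 4, 2  # vocabulary size, embedding dimension, number of context words

W = np.random.randn(V, d)  # hidden-layer weights (the word embedding matrix)
U = np.random.randn(V, d)  # output-layer weights
b = np.random.randn(V, 1)  # output-layer bias

# One-hot vectors for the C context words (word indices 1 and 4).
x = [np.eye(V)[[1]].T, np.eye(V)[[4]].T]  # each has shape (V, 1)

h = sum(W.T @ x_j for x_j in x) / C  # (d, 1) average embedding of the context words
z = U @ h + b                        # (V, 1) raw scores
p = np.exp(z) / np.exp(z).sum()      # (V, 1) softmax probabilities P(w_t | context)

print(h.shape, z.shape, p.sum())  # (4, 1) (6, 1) and the probabilities sum to 1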

Loss Function

The output of CBOW is a probability for every word, so softmax is used as the activation function of the output layer, and cross-entropy is used as the loss function.

\mathcal{L}=-\displaystyle\sum_{(context,w_t)\in corpus}\log P(w_t|context)
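
Because softmax followed by cross-entropy is just -\log P(w_t|context), PyTorch's nn.CrossEntropyLoss can be applied directly to the raw scores z, which is what the implementation later in this article does. Below is a minimal sketch of this equivalence, using random logits and an arbitrary target index.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
z = torch.randn(1, 580)      # raw output scores for one example, V = 580
target = torch.tensor([42])  # arbitrary index of the true center word

loss_a = nn.CrossEntropyLoss()(z, target)     # softmax + negative log-likelihood in one step
loss_b = -F.log_softmax(z, dim=1)[0, target]  # -log P(w_t | context) computed explicitly

print(torch.allclose(loss_a, loss_b))  # True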

Backward Propagation

The formulas for CBOW's backpropagation are as follows:

\frac{\partial\mathcal{L}}{\partial z}=\hat{y}-y \\\\ \frac{\partial\mathcal{L}}{\partial U}=\frac{\partial\mathcal{L}}{\partial z}\frac{\partial z}{\partial U}=\frac{\partial\mathcal{L}}{\partial z}h^T \\\\ \frac{\partial\mathcal{L}}{\partial b}=\frac{\partial\mathcal{L}}{\partial z}\frac{\partial z}{\partial b}=\frac{\partial\mathcal{L}}{\partial z} \\\\ \frac{\partial\mathcal{L}}{\partial h}=\frac{\partial\mathcal{L}}{\partial z}\frac{\partial z}{\partial h}=U^T\frac{\partial\mathcal{L}}{\partial z} \\\\ \frac{\partial\mathcal{L}}{\partial h_j}=\frac{1}{C}\frac{\partial\mathcal{L}}{\partial h} \\\\ \frac{\partial\mathcal{L}}{\partial W}=\displaystyle\sum_{j=1}^{C}x_j\Big(\frac{\partial\mathcal{L}}{\partial h_j}\Big)^T
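
A minimal NumPy sketch of these gradients, continuing the toy setup from the forward-pass sketch above (V = 6, d = 4, C = 2; y_hat is the softmax output and y the one-hot center word):

import numpy as np

V, d, C = 6, 4, 2
W, U, b = np.random.randn(V, d), np.random.randn(V, d), np.random.randn(V, 1)
x = [np.eye(V)[[1]].T, np.eye(V)[[4]].T]  # one-hot context words, each (V, 1)

h = sum(W.T @ x_j for x_j in x) / C
z = U @ h + b
y_hat = np.exp(z) / np.exp(z).sum()
y = np.eye(V)[[3]].T  # one-hot vector of the true center word

dz = y_hat - y                           # (V, 1)
dU = dz @ h.T                            # (V, d)
db = dz                                  # (V, 1)
dh = U.T @ dz                            # (d, 1)
dW = sum(x_j @ (dh / C).T for x_j in x)  # (V, d), one row updated per context word

print(dU.shape, dW.shape)  # (6, 4) (6, 4)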

Implementation

We use the Wikipedia article Oolong as the corpus to train the model. In the code below, we download the Oolong article from Wikipedia, split it into sentences, and then split each sentence into words.

import wikipediaapi
import nltk

wiki = wikipediaapi.Wikipedia(user_agent="waynestalk/1.0", language="en")
page = wiki.page("Oolong")
corpus = page.text

nltk.download("punkt")
sentences = nltk.sent_tokenize(corpus)
tokenized_corpus = [[word.lower() for word in nltk.word_tokenize(sentence) if word.isalpha()] for sentence in sentences]

vocab = set([word for sentence in tokenized_corpus for word in sentence])
word_to_index = {word: i for i, word in enumerate(vocab)}
index_to_word = {i: word for i, word in enumerate(vocab)}
len(vocab)

# Output
580

We implement CBOW with PyTorch; as you can see, the implementation is quite simple. Here, embedding is the hidden layer and linear is the output layer. In forward(), we convert the input words into vectors, take the average of those vectors, and pass the average to the output layer.

import torch
import torch.nn as nn


class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_size):
        super(CBOW, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.linear = nn.Linear(embedding_size, vocab_size)

    def forward(self, context_words):
        vectors = self.embedding(context_words)
        average_vector = torch.mean(vectors, dim=0)
        return self.linear(average_vector)
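
A quick sanity check of the module's input and output shapes, using hypothetical word indices (the indices below are made up and not taken from the corpus):

cbow = CBOW(vocab_size=580, embedding_size=1000)
context_words = torch.tensor([3, 17, 42, 99], dtype=torch.long)  # indices of 4 context words
logits = cbow(context_words)
print(logits.shape)  # torch.Size([580]): one raw score per word in the vocabulary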

Next, we prepare the training data. For each word in each sentence, the word together with the two words before and after it forms one training example.

window_size = 2
training_pairs = []

for sentence in tokenized_corpus:
    for i in range(window_size, len(sentence) - window_size):
        # Take the window_size words on each side of the center word as the context.
        context = [sentence[j] for j in range(i - window_size, i + window_size + 1) if j != i]
        training_pairs.append((context, sentence[i]))
training_pairs[:5]

# Output
[(['oolong', 'uk', 'simplified'], 'us'),
 (['uk', 'us', 'chinese'], 'simplified'),
 (['us', 'simplified', '乌龙茶'], 'chinese'),
 (['simplified', 'chinese', 'traditional'], '乌龙茶'),
 (['chinese', '乌龙茶', 'chinese'], 'traditional')]

In the following code, the CBOW model learns word embeddings from the training data. We set the embedding dimension to 1000, so the embedding matrix is 580 \times 1000.

import time
import torch.optim as optim

model = CBOW(len(vocab), 1000)

loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
epochs = 500

start_time = time.time()

for epoch in range(epochs):
    total_loss = 0
    for context, target in training_pairs:
        context_tensor = torch.tensor([word_to_index[word] for word in context], dtype=torch.long)
        target_tensor = torch.tensor([word_to_index[target]], dtype=torch.long)
        output = model(context_tensor)
        loss = loss_function(output.unsqueeze(0), target_tensor)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    if epoch == 0 or (epoch + 1) % 100 == 0:
        print(f"Epoch: {epoch}, Loss: {total_loss}")

end_time = time.time()
print(f"Training time: {end_time - start_time} seconds")

# Output
Epoch: 0, Loss: 6856.648630566895
Epoch: 99, Loss: 65.55343400325
Epoch: 199, Loss: 58.981725521754925
Epoch: 299, Loss: 55.93582288858761
Epoch: 399, Loss: 53.808607134100384
Epoch: 499, Loss: 52.07664276908599
Training time: 304.94574093818665 seconds

The following code prints the word embedding of oolong.

word = "oolong"
word_index_tensor = torch.tensor(word_to_index[word], dtype=torch.long)
embedding_vector = model.embedding(word_index_tensor).detach().numpy()
print(f"Embedding {embedding_vector.shape} for '{word}': {embedding_vector}")

# Output
Embedding (1000,) for 'oolong': [ 1.41568875e+00 -3.54769737e-01 -1.37265265e+00 -6.58394694e-01
  8.31549525e-01 -9.42143202e-01  9.70315874e-01 -5.99202693e-01
  1.84273362e+00  9.20817614e-01 -5.58760583e-01  1.00353360e+00
 -2.15644687e-01 -4.58650626e-02 -2.28673637e-01  1.86233068e+00
...

The following code shows how we can use the learned embedding matrix and cosine similarity to measure the similarity between two sentences.

sentence1 = "tea is popular in taiwan".split()
sentence2 = "oolong is famous in taiwan".split()
sentence1_embeddings = [model.embedding(torch.tensor(word_to_index[word], dtype=torch.long)).detach() for word in sentence1]
sentence2_embeddings = [model.embedding(torch.tensor(word_to_index[word], dtype=torch.long)).detach() for word in sentence2]
vector1 = torch.stack(sentence1_embeddings).mean(dim=0)
vector2 = torch.stack(sentence2_embeddings).mean(dim=0)
cosine_sim = nn.CosineSimilarity(dim=0)
similarity = cosine_sim(vector1, vector2).item()
print(f"Sentence 1: {sentence1}")
print(f"Sentence 2: {sentence2}")
print(f"Similarity between sentences: {similarity}")

# Output
Sentence 1: ['tea', 'is', 'popular', 'in', 'taiwan']
Sentence 2: ['oolong', 'is', 'famous', 'in', 'taiwan']
Similarity between sentences: 0.6053189635276794

We can also compute the cosine similarity ourselves, as follows.

import numpy as np

sentence1_embeddings = [model.embedding(torch.tensor(word_to_index[word], dtype=torch.long)).detach().numpy() for word in sentence1]
sentence2_embeddings = [model.embedding(torch.tensor(word_to_index[word], dtype=torch.long)).detach().numpy() for word in sentence2]
vector1 = np.mean(sentence1_embeddings, axis=0)
vector2 = np.mean(sentence2_embeddings, axis=0)
similarity = (np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2)))

In the following code, we use principal component analysis (PCA) to visualize the distribution of the word embeddings.

import matplotlib.pyplot as plt

word_embeddings_tensor = model.embedding.weight.detach()
U, S, V = torch.pca_lowrank(word_embeddings_tensor, q=2)
reduced_embeddings = U.numpy()

plt.figure(figsize=(8, 6))

for word, index in word_to_index.items():
    x, y = reduced_embeddings[index]
    plt.scatter(x, y, marker='o', color='blue')

tea_index = word_to_index['tea']
x, y = reduced_embeddings[tea_index]
plt.scatter(x, y, marker='o', color='red')
plt.text(x, y, 'tea', fontsize=8)

oolong_index = word_to_index['oolong']
x, y = reduced_embeddings[oolong_index]
plt.scatter(x, y, marker='o', color='red')
plt.text(x, y, 'oolong', fontsize=8)

plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("CBOW Word Embeddings Visualization of Corpus 'Oolong`")
plt.grid(True)
plt.show()
CBOW word embeddings visualization (PCA).

Continuous Skip-gram Model

Skip-gram learns in the opposite way from CBOW: it uses the center word to predict the context words. In the figure below, the center word is fox and the context window size is 2, so the context consists of the two words before and the two words after fox. This yields four training examples: the center word is the input, and the output of each example is one of the context words.

Skip-gram.

Forward Propagation

Skip-gram uses a neural network with a single hidden layer to learn word embeddings. After training, the weights of that hidden layer are the word embedding matrix. The figure below shows the forward propagation of Skip-gram.

Skip-gram Forward Propagation.

The formulas for Skip-gram's forward propagation are as follows:

h=W^Tx \\\\ z=Uh+b \\\\ P(w_c|w_t)=\frac{e^{z_c}}{\sum_{v=1}^V e^{z_v}} \\\\ V: \text{size of vocabulary} \\\\ d: \text{embedding dimension}

The dimensions of the variables are:

W | x | h | U | b | z
(V, d) | (V, 1) | (d, 1) | (V, d) | (V, 1) | (V, 1)
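
Because x is a one-hot vector, h = W^T x simply selects one row of W; this is why the implementation below can use nn.Embedding instead of an explicit matrix multiplication. A minimal NumPy sketch, assuming V = 6 and d = 4:

import numpy as np

V, d = 6, 4
W = np.random.randn(V, d)

x = np.eye(V)[[2]].T  # one-hot vector for word index 2, shape (V, 1)
h = W.T @ x           # (d, 1)

# Multiplying by a one-hot vector is just a row lookup.
print(np.allclose(h[:, 0], W[2]))  # True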

Loss Function

The output of Skip-gram is a probability for every word, so softmax is used as the activation function of the output layer, and cross-entropy is used as the loss function.

\mathcal{L}=-\displaystyle\sum_{(w_t,w_c)\in corpus}\log P(w_c|w_t)

Backward Propagation

The formulas for Skip-gram's backpropagation are as follows:

\frac{\partial\mathcal{L}}{\partial z}=\hat{y}-y \\\\ \frac{\partial\mathcal{L}}{\partial U}=\frac{\partial\mathcal{L}}{\partial z}\frac{\partial z}{\partial U}=\frac{\partial\mathcal{L}}{\partial z}h^T \\\\ \frac{\partial\mathcal{L}}{\partial b}=\frac{\partial\mathcal{L}}{\partial z}\frac{\partial z}{\partial b}=\frac{\partial\mathcal{L}}{\partial z} \\\\ \frac{\partial\mathcal{L}}{\partial h}=\frac{\partial\mathcal{L}}{\partial z}\frac{\partial z}{\partial h}=U^T\frac{\partial\mathcal{L}}{\partial z} \\\\ \frac{\partial\mathcal{L}}{\partial W}=x\Big(\frac{\partial\mathcal{L}}{\partial h}\Big)^T=x\Big(U^T\frac{\partial\mathcal{L}}{\partial z}\Big)^T

Implementation

We implement Skip-gram with PyTorch as follows.

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGram, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, word):
        vector = self.embedding(word)
        return self.linear(vector)

We again use the Wikipedia article Oolong as the corpus. The context of each word in a sentence is the two words before and after it, so each word has several context words. We treat each word paired with one of its context words as one training example.

window_size = 2
training_pairs = []

for sentence in tokenized_corpus:
    for i, target_word in enumerate(sentence):
        # Context indices: up to window_size words before and after the target word.
        context_indices = (list(range(max(i - window_size, 0), i)) +
                           list(range(i + 1, min(i + window_size + 1, len(sentence)))))
        for context_index in context_indices:
            training_pairs.append((target_word, sentence[context_index]))
training_pairs[:5]

In the following code, the Skip-gram model learns word embeddings from the training data. We set the embedding dimension to 1000, so the embedding matrix is 580 \times 1000.

model = SkipGram(len(vocab), 1000)

loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
epochs = 500

start_time = time.time()

for epoch in range(epochs):
    total_loss = 0
    for target, context in training_pairs:
        target_tensor = torch.tensor([word_to_index[target]], dtype=torch.long)
        context_tensor = torch.tensor([word_to_index[context]], dtype=torch.long)
        output = model(target_tensor)
        loss = loss_function(output, context_tensor)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    if epoch == 0 or (epoch + 1) % 100 == 0:
        print(f"Epoch: {epoch}, Loss: {total_loss}")

end_time = time.time()
print(f"Training time: {end_time - start_time} seconds")

# Output
Epoch: 0, Loss: 39987.801153186476
Epoch: 99, Loss: 30384.4197357425
Epoch: 199, Loss: 26906.040692283364
Epoch: 299, Loss: 24592.846442646056
Epoch: 399, Loss: 22867.418499057472
Epoch: 499, Loss: 21502.15208007075
Training time: 1074.5490338802338 seconds

In the figure below, we use principal component analysis (PCA) to visualize the distribution of the word embeddings.

Skip-gram word embeddings visualization (PCA).

Comparing CBOW and Skip-gram

From the results above, we can compare the training time of CBOW and Skip-gram on the same corpus: about 304 seconds for CBOW versus about 1074 seconds for Skip-gram. In CBOW, each center word together with all of its context words forms a single training example, whereas in Skip-gram, each center word paired with each of its context words forms a separate training example, so Skip-gram's training set is much larger.

The PCA plots show that with the word embeddings learned by Skip-gram, related words lie closer together, so Skip-gram captures finer-grained relationships between words.

Improving Word2Vec Training Efficiency

Skip-gram can learn high-quality word embeddings that capture syntactic and semantic relationships between words. However, when training on a large corpus, we need to improve Skip-gram's efficiency to speed up training.

Negative Sampling

In Skip-gram, given a target word w_t, we predict its context word w_c with the softmax probability below. The denominator sums over every word in the vocabulary, which takes considerable time when V is large.

P(w_c|w_t)=\frac{e^{(v_{w_c}\cdot v_{w_t})}}{\sum_{w\in V}e^{(v_{w}\cdot v_{w_t})}} \\\\ v_{w_t}:\text{the vector of the target word} \\\\ v_{w_c}:\text{the vector of the context word} \\\\ V:\text{the size of vocabulary}

Instead of computing probabilities over the entire vocabulary, negative sampling simplifies the problem as follows:

  • For each (target word, context word) pair, we want to maximize their similarity.
  • For each (target word, random word) pair, we want to minimize their similarity.

For each training example (w_t,w_c), negative sampling picks k words from the vocabulary that are not among the context words of the target word w_t. It then computes the loss with the following objective: we maximize the first term, the probability of the target word with its context word, and minimize the second term, the probability of the target word with the negative words.

\log\sigma((v^{\prime}_{w_c})^Tv_{w_t})+\displaystyle\sum_{i=1}^k\mathbb{E}_{w_i\sim P_n(w)}\Big[\log\sigma(-(v^{\prime}_{w_i})^Tv_{w_t})\Big] \\\\ v_w:\text{the input vector representation of }w \\\\ v^{\prime}_w:\text{the output vector representation of }w \\\\ k:\text{the number of negative words,} \\\\ \hphantom{k:}\text{5-20 for small training datasets,} \\\\ \hphantom{k:}\text{2-5 for large training datasets} \\\\ \mathbb{E}:\text{the expectation operator} \\\\ w_i\sim P_n(w):\text{means }w_i\text{ is sampled from the negative sampling distribution }P_n(w)
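
Below is a minimal PyTorch sketch of this objective for a single (target, context) pair with k hypothetical negative samples; the vectors are random and only illustrate the computation. torch.nn.functional.logsigmoid gives a numerically stable \log\sigma(\cdot); the class implemented later in this section computes the same quantity with torch.sigmoid(...).log().

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, k = 8, 5

v_target = torch.randn(d)        # input vector v_{w_t} of the target word
v_context = torch.randn(d)       # output vector v'_{w_c} of the context word
v_negatives = torch.randn(k, d)  # output vectors of k sampled negative words

pos_term = F.logsigmoid(v_context @ v_target)             # log sigma(v'_{w_c}^T v_{w_t})
neg_term = F.logsigmoid(-(v_negatives @ v_target)).sum()  # sum of log sigma(-v'_{w_i}^T v_{w_t})

objective = pos_term + neg_term  # this is maximized; the training loss is its negative
print(-objective)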

How do we choose the negative words? Negative sampling draws negative samples from the following distribution.

P_n(w)=\frac{U(w)^{\frac{3}{4}}}{\sum_{w^\prime\in V}U(w^\prime)^{\frac{3}{4}}} \\\\ V:\text{the size of vocabulary} \\\\ U(w):\text{the unigram probability of word }w \\\\ \frac{3}{4}:\text{an exponent found by T. Mikolov et al. to significantly outperform the raw unigram distribution}

In practice, the expectation in the second term is approximated by sampling k negative words from P_n(w) and summing their contributions:

\displaystyle\sum_{i=1}^k\mathbb{E}_{w_i\sim P_n(w)}\Big[\log\sigma(-(v^{\prime}_{w_i})^Tv_{w_t})\Big]\approx\displaystyle\sum_{i=1}^k\log\sigma(-(v^{\prime}_{w_i})^Tv_{w_t}),\quad w_i\sim P_n(w)

Implementation

First, prepare the training data as in the Skip-gram section. Then we compute the negative sampling distribution P_n(w).

word_frequency = np.zeros(len(vocab))
for sentence in tokenized_corpus:
    for word in sentence:
        word_frequency[word_to_index[word]] += 1
word_distribution = word_frequency / word_frequency.sum()

unigram_distribution = word_distribution ** (3 / 4)
unigram_distribution = unigram_distribution / unigram_distribution.sum()
print(f"Unigram distribution: {unigram_distribution[:5]}")


def get_negative_samples(num_samples, context_index):
    negative_samples = []
    while len(negative_samples) < num_samples:
        sample_index = np.random.choice(len(vocab), p=unigram_distribution)
        if sample_index != context_index:
            negative_samples.append(sample_index)
    return negative_samples

Next, we change the SkipGram model to the following. In the code, input_embedding corresponds to v and output_embedding to v^\prime.

class SkipGramWithNegativeSampling(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGramWithNegativeSampling, self).__init__()
        self.input_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.output_embedding = nn.Embedding(vocab_size, embedding_dim)
        nn.init.uniform_(self.input_embedding.weight, a=-0.5, b=0.5)
        nn.init.uniform_(self.output_embedding.weight, a=-0.5, b=0.5)

    def forward(self, target_word, context_word, negative_samples):
        target_embedding = self.input_embedding(target_word)
        context_embedding = self.output_embedding(context_word)
        negative_samples_embeddings = self.output_embedding(negative_samples)

        pos_score = (target_embedding * context_embedding).sum(dim=1)
        pos_loss = -torch.sigmoid(pos_score).log()

        neg_score = torch.bmm(negative_samples_embeddings, target_embedding.unsqueeze(2)).squeeze(2)
        neg_loss = -torch.sigmoid(-neg_score).log()
        neg_loss = neg_loss.sum(dim=1)

        return (pos_loss + neg_loss).mean()

Then, we train the Skip-gram model with the following code.

model = SkipGramWithNegativeSampling(len(vocab), 1000)

num_negative_samples = 5
optimizer = optim.SGD(model.parameters(), lr=0.01)
epochs = 500

start_time = time.time()

for epoch in range(epochs):
    total_loss = 0
    for target, context in training_pairs:
        target_tensor = torch.tensor([word_to_index[target]], dtype=torch.long)
        context_index = word_to_index[context]
        context_tensor = torch.tensor([context_index], dtype=torch.long)
        negative_samples_tensor = torch.tensor(
            [get_negative_samples(num_negative_samples, context_index)], dtype=torch.long
        )
        loss = model(target_tensor, context_tensor, negative_samples_tensor)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    if epoch == 0 or (epoch + 1) % 100 == 0:
        print(f"Epoch: {epoch}, Loss: {total_loss}")

end_time = time.time()
print(f"Training time: {end_time - start_time} seconds")

# Output
Epoch: 0, Loss: 28936.812749773264
Epoch: 99, Loss: 3952.563685086556
Epoch: 199, Loss: 3894.3744740486145
Epoch: 299, Loss: 3726.271819884423
Epoch: 399, Loss: 3714.307072382886
Epoch: 499, Loss: 3639.7701731920242
Training time: 981.5462839603424 seconds

Subsampling of Frequent Words

In a large corpus, the most frequent words, such as in, the, and a, can easily appear millions of times. These words usually carry little information, while rarer words often carry much more; because the high-frequency words occur so many times, they also slow down training. Subsampling of frequent words therefore reduces the influence of these very frequent words during training.

Subsampling computes the following probability for each word to decide whether to discard it.

P(w_i)=1-\sqrt{\frac{t}{f(w_i)}} \\\\ f(w_i):\text{the relative frequency of word }w_i\text{ in the corpus} \\\\ t:\text{a chosen threshold, typically around }10^{-5}
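
For example, with t = 10^{-5}, a very frequent word with relative frequency f(w_i) = 0.05 is dropped with probability 1-\sqrt{10^{-5}/0.05}\approx 0.986, while a word with relative frequency at or below 10^{-5} is never dropped. A small sketch with assumed frequencies:

import numpy as np

t = 1e-5
for f in [0.05, 1e-3, 1e-5]:               # assumed relative frequencies of three words
    p_drop = max(0.0, 1 - np.sqrt(t / f))  # a negative value means the word is always kept
    print(f, round(p_drop, 3))
# 0.05 0.986
# 0.001 0.9
# 1e-05 0.0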

Implementation

After splitting the corpus into sentences and then into words, we use subsampling to remove some words directly, as follows.

subsampling_threshold = 1e-5

subsampled_tokenized_corpus = []
for sentence in tokenized_corpus:
    new_sentence = []
    for word in sentence:
        index = word_to_index[word]
        frequency = word_frequency[index] / word_frequency.sum()  # relative frequency f(w_i)
        if frequency > subsampling_threshold:
            drop_probability = 1 - np.sqrt(subsampling_threshold / frequency)
        else:
            drop_probability = 0

        if np.random.rand() > drop_probability:
            new_sentence.append(word)

    if len(new_sentence) > 0:
        subsampled_tokenized_corpus.append(new_sentence)

print(subsampled_tokenized_corpus[:5])

Skip-gram can use negative sampling and subsampling together, so the model and training code are the same as in the negative sampling section. You can find this part in the complete code for this article.

Finally, we learn word embeddings with a Skip-gram model that uses both negative sampling and subsampling. After some high-frequency words are removed, the training time is much shorter.

model = SkipGramWithNegativeSamplingAndSubsamplingOfFrequentWords(len(vocab), 1000)

num_negative_samples = 5
optimizer = optim.SGD(model.parameters(), lr=0.01)
epochs = 500

start_time = time.time()

for epoch in range(epochs):
    total_loss = 0
    for target, context in training_pairs:
        target_tensor = torch.tensor([word_to_index[target]], dtype=torch.long)
        context_index = word_to_index[context]
        context_tensor = torch.tensor([context_index], dtype=torch.long)
        negative_samples_tensor = torch.tensor(
            [get_negative_samples(num_negative_samples, context_index)], dtype=torch.long
        )
        loss = model(target_tensor, context_tensor, negative_samples_tensor)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    if epoch == 0 or (epoch + 1) % 100 == 0:
        print(f"Epoch: {epoch}, Loss: {total_loss}")

end_time = time.time()
print(f"Training time: {end_time - start_time} seconds")

# Output
Epoch: 0, Loss: 953.0108479261398
Epoch: 99, Loss: 56.98700549826026
Epoch: 199, Loss: 25.619574746116996
Epoch: 299, Loss: 28.435157721862197
Epoch: 399, Loss: 14.342244805768132
Epoch: 499, Loss: 15.597246480174363
Training time: 26.50890588760376 seconds

Conclusion

Word2Vec is one of the most influential word embedding techniques in NLP. It learns semantically rich word embeddings through CBOW and Skip-gram, and techniques such as Negative Sampling and Subsampling further improve training efficiency, making Word2Vec a foundation for many NLP applications.

References

  • Andrew Ng, Deep Learning Specialization, Coursera.
  • T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space. ICLR, 2013.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean. Distributed Representations of Words and Phrases and their Compositionality. NIPS, 2013.
