Attention Mechanisms

Attention models have become a central concept in modern neural networks. Notably, popular architectures such as GPT models and Vision Transformers (ViT) are representative applications of attention models. This article will delve into the key attention mechanisms that underlie these models.

Attention Mechanisms

Attention models were first introduced by Bahdanau et al. in 2015 for the task of machine translation. The attention mechanism proposed by Bahdanau is known as additive attention. In the same year, Luong et al. proposed three variants of attention mechanisms, among which dot product attention became the most widely used.

Later, in 2017, Vaswani et al. introduced the Transformer model. The self-attention and multi-head attention mechanisms in the Transformer are both based on scaled dot product attention.

In these three seminal papers, while the authors proposed formulas for computing attention, they rarely explained the motivation or mathematical foundations behind them in detail. This article focuses on analyzing the design logic and underlying principles of these formulas. As such, readers are expected to have basic familiarity with Bahdanau attention, Luong attention, and the Transformer architecture. If not, it is recommended to review the following articles before proceeding.

Queries, Keys, and Values

Suppose we have a database \mathcal{D} consisting of m entries, where each entry is a pair of a key and a value. We define the database as:

\mathcal{D}=\{(k_1,v_1),(k_2,v_2),\cdots,(k_m,v_m)\}

For example, consider a phone book as a database, represented as follows:

{("Smith","John"),("Johnson","Emily"),("Williams","David"),
 ("Brown","Sarah"),("Jones","Michael"),("Miller","Laura")}

Now, if we issue a query to this phone book with the input q=\text{Jones}, the system returns the corresponding value “Michael.” If the system allows fuzzy matching (e.g., prefix matching), a query like q=\text{Will} may return “David.”

In other words, our goal is to predict a target value \hat{v} given a query q. A naïve estimator would be to simply take the average of all target values in the training data:

\hat{v} = \frac{1}{m} \displaystyle\sum_{i=1}^{m} v_i

Nadaraya-Watson Estimator

In 1964, Nadaraya and Watson proposed a non-parametric regression model that estimates conditional expectations by performing a weighted average over sample data, without requiring knowledge of the underlying data distribution or model form:

\hat{y}(x) \approx \mathbb{E}[Y \mid X = x]

The corresponding estimator is given as follows, where K(x, x_i) is a kernel function that measures the similarity between the query point x and the sample point x_i, thereby determining the weight assigned to the corresponding y_i:

\hat{y}(x) = \frac{\displaystyle\sum_{i=1}^n K(x, x_i) \cdot y_i}{\displaystyle\sum_{i=1}^n K(x, x_i)}

A common choice for the kernel function is the Gaussian kernel, defined as:

K(x, x_i) = \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right)

By using the Gaussian kernel as K(x, x_i), sample points closer to the query receive higher weights, enabling a smooth and localized weighted estimation. Substituting this into the Nadaraya-Watson estimator gives the following formula:

\hat{y}(q) = \frac{\displaystyle\sum_{i=1}^n \exp\left( -\frac{\|q - x_i\|^2}{2\sigma^2} \right) \cdot y_i}{\displaystyle\sum_{i=1}^n \exp\left( -\frac{\|q - x_i\|^2}{2\sigma^2} \right)}
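The estimator above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name and the toy samples are my own:

```python
import numpy as np

def nadaraya_watson(q, x, y, sigma=1.0):
    """Nadaraya-Watson estimate of y at query q, using a Gaussian kernel."""
    # Kernel weight for each sample point x_i: closer points get larger weights.
    weights = np.exp(-((q - x) ** 2) / (2 * sigma ** 2))
    # Weighted average of the y_i, normalized by the total weight.
    return np.sum(weights * y) / np.sum(weights)

# Toy 1-D regression: samples drawn from y = x^2.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = x ** 2
# The estimate at q = 1.1 is dominated by the nearby samples at x = 1 and x = 2.
print(nadaraya_watson(1.1, x, y, sigma=0.5))
```

Note that the bandwidth sigma plays the role of a smoothing parameter: a small sigma makes the estimate nearly a nearest-neighbor lookup, while a large sigma pushes it toward the global average.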

Foundations of Attention

Attention mechanisms borrow the core idea from the Nadaraya-Watson estimator: using a weighted average to estimate a target value. We can incorporate this idea into a basic estimator as follows:

\hat{v}=\displaystyle\sum_{i=1}^{m} \alpha(q,k_i)v_i

Here, the weighting function \alpha(q,k_i) encodes the relevance between the query q and each sample key k_i, and then uses that to compute a weighted combination of the corresponding values v_i.

This forms the central intuition behind attention mechanisms: given a set of input Key-Value pairs, the model selectively attends to different keys based on a given Query (i.e., the current token of focus), and aggregates the values according to the resulting attention weights, producing a new semantic representation through weighted averaging.

We can therefore define attention over a database \mathcal{D} as:

\text{Attention}(q,\mathcal{D}) = \displaystyle\sum_{i=1}^{m} \alpha(q,k_i)v_i

Here, \alpha(q,k_i) \in \mathbb{R} represents the scalar attention weight assigned to the value v_i. This operation is also known as attention pooling, where the term “attention” reflects the model’s focus on items with higher weights \alpha. In other words, attention is a weighted linear combination over all the value vectors in the database \mathcal{D}.

In the earlier phone book example, a traditional query would assign a weight of 1 to a single entry and 0 to all others. However, in deep learning, it is more common to allow all weights to be non-negative and sum to 1:

\displaystyle \sum_{i=1}^{m} \alpha(q,k_i) = 1, \quad \alpha(q,k_i) \geq 0

To achieve such normalization, we introduce a scoring function a(q,k) that computes unnormalized relevance scores, and a distribution function p that normalizes these scores to produce the final attention weights:

\alpha(q,k_i) = p(a(q,k_i))

In deep learning, both the scoring function a and distribution function p are typically chosen to be differentiable, so the entire model can be trained via backpropagation. One of the most common choices for p is the softmax function, defined as:

\alpha(q,k_i) = \frac{\exp(a(q,k_i))}{\displaystyle\sum_{j=1}^{m}\exp(a(q,k_j))}

Substituting this weighting function into the attention pooling formula yields the standard attention expression:

\text{Attention}(q,\mathcal{D})=\displaystyle\sum_{i=1}^{m}\alpha(q,k_i)v_i \\ \hphantom{\text{Attention}(q,\mathcal{D})}=\displaystyle\sum_{i=1}^{m}p(a(q,k_i))v_i \\ \hphantom{\text{Attention}(q,\mathcal{D})}=\displaystyle\sum_{i=1}^{m}\frac{\exp(a(q,k_i))}{\sum_{j=1}^{m}\exp(a(q,k_j))}v_i
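The pipeline of scoring, softmax normalization, and value aggregation can be sketched as follows. This is a minimal NumPy sketch with made-up scores; the function names are my own:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_pool(scores, values):
    """Attention pooling: normalize raw scores, then average the values."""
    alpha = softmax(scores)  # attention weights, non-negative and summing to 1
    return alpha @ values    # weighted linear combination of the value vectors

scores = np.array([2.0, 0.5, -1.0])   # a(q, k_i) for m = 3 keys
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
print(attention_pool(scores, values))
```

The output is a single vector that leans toward the values whose keys scored highest, which is exactly the "selective focus" described above.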

Since the database \mathcal{D} consists of Key-Value pairs, this is often written more explicitly as:

\text{Attention}(q, \mathcal{D}) = \text{Attention}(q, K, V)

Most attention mechanisms in current use are variations of this formulation, differing mainly in the choice of attention scoring function a(q,k). That is, the method used to measure the relevance between the query q and each key k.

Additive Attention

In 2015, Bahdanau et al. proposed additive attention, in which the attention scoring function a(q, k) is defined as follows. Here, W_q, W_k, and w_v are learnable parameters, d_q and d_k denote the dimensionalities of q and k, and h is the size of the attention hidden layer:

a(q,k)=w_v^\top \tanh(W_q q + W_k k) \in \mathbb{R} \\ W_q \in \mathbb{R}^{h \times d_q}, \quad W_k\in\mathbb{R}^{h \times d_k}, \quad w_v \in \mathbb{R}^h

The purpose of this scoring function is to measure the relevance (or more precisely, the compatibility) between a query q and a key k. In the context of Bahdanau attention, the key k corresponds to the encoder hidden state h_j, while the query q corresponds to the decoder’s previous hidden state s_{i-1}. This yields the following formulation:

e_{ij} = a(s_{i-1}, h_j) = w_v^\top \tanh(W_q s_{i-1} + W_k h_j)

The resulting score e_{ij} represents the compatibility between the decoder’s previous hidden state s_{i-1} (at time step i) and the encoder hidden state h_j.

Notably, this form of attention is conditional on the decoder’s prior hidden state s_{i-1}, and aggregates representations across all positions in the encoder output.

It is important to observe that additive attention allows the model to learn the attention scoring function directly through its learnable parameters.
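The additive scoring function is straightforward to write down directly. The sketch below uses randomly initialized parameters purely for illustration (in a real model they would be learned); all names and dimensions are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
h, dq, dk = 8, 4, 6      # hidden size and (illustrative) query/key dimensions

# Learnable parameters, randomly initialized here for demonstration only.
W_q = rng.normal(size=(h, dq))
W_k = rng.normal(size=(h, dk))
w_v = rng.normal(size=h)

def additive_score(q, k):
    """Additive attention score: w_v^T tanh(W_q q + W_k k), a scalar."""
    return w_v @ np.tanh(W_q @ q + W_k @ k)

q = rng.normal(size=dq)              # plays the role of s_{i-1}
keys = rng.normal(size=(5, dk))      # play the role of h_1, ..., h_5
scores = np.array([additive_score(q, k) for k in keys])
print(scores.shape)  # one scalar score per encoder position
```

Because the query and key pass through separate projection matrices, additive attention can compare vectors of different dimensionalities, which dot product attention cannot do without an extra projection.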

Additionally, although the terms relevance and compatibility are closely related, there is a subtle distinction in usage. When discussing the overall process involving queries, keys, and values, the term relevance is often used to describe whether a piece of information matches the query intent. In contrast, within the context of attention scoring functions, compatibility is more commonly used to describe the degree of alignment between the query q and the key k.

Dot Product Attention

Following additive attention, Luong et al. (2015) proposed dot product attention, where the attention scoring function a(q,k) is defined as:

e_{ij} = a(s_i, h_j) = s_i^\top h_j

In Luong attention, the key k corresponds to the encoder hidden state h_j, while the query q corresponds to the decoder hidden state s_i. Thus, the general form of dot product attention is:

a(q, k) = q^\top k

Unlike Bahdanau attention, dot product attention removes all learnable parameters and directly uses the dot product between the query q and the key k as the compatibility score. This design greatly simplifies computation and leads to significantly improved training efficiency.
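Concretely, scoring a query against a whole set of keys reduces to a single matrix-vector product. A minimal sketch with hand-picked vectors (my own, for illustration):

```python
import numpy as np

def dot_product_scores(q, K):
    """Dot product attention scores: a(q, k_i) = q^T k_i for every key row of K."""
    return K @ q   # one score per key, with no learnable parameters

q = np.array([1.0, 0.0, 1.0])
K = np.array([[1.0, 0.0, 1.0],     # identical to q  -> highest score
              [0.0, 1.0, 0.0],     # orthogonal to q -> score 0
              [-1.0, 0.0, -1.0]])  # opposite to q   -> negative score
print(dot_product_scores(q, K))    # [ 2.  0. -2.]
```

The absence of learnable parameters is also what makes this form so efficient in practice: scoring all queries against all keys is one matrix multiplication.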

However, a natural question arises: Why is q^\top k a valid measure of compatibility between q and k? This design motivation can be understood from two perspectives:

  • The connection between dot product and the Gaussian kernel.
  • The connection between dot product and cosine similarity.

Dot Product and Gaussian Kernel

Suppose we adopt a Gaussian kernel as the attention scoring function a(q, k), and expand the squared distance term \|q - k_i\|^2 as follows:

a(q, k_i)=-\frac{1}{2} \|q - k_i\|^2=q^\top k_i - \frac{1}{2} \|k_i\|^2 - \frac{1}{2} \|q\|^2

The last term, -\frac{1}{2} \|q\|^2, depends only on the query q, and is therefore constant across all (q, k_i) pairs. This constant term is canceled out during normalization (e.g., via softmax), and can be safely ignored.

Additionally, layer normalization is often applied to the key vectors k_i, which constrains their L2 norm \|k_i\| to a narrow range, often approximately constant. Hence, the second term -\frac{1}{2} \|k_i\|^2 can also be neglected with minimal impact on the final attention scores.

With both constant terms removed, the attention scoring function simplifies to:

a(q, k_i) = q^\top k_i

This is precisely the formulation used in dot product attention. Therefore, dot product attention can be interpreted as a simplified version of Gaussian kernel attention.
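This equivalence is easy to verify numerically: when the keys all have the same L2 norm (here enforced by explicitly normalizing them, as a stand-in for layer normalization), the Gaussian-kernel scores and the dot product scores differ only by terms that are constant across keys, so their softmax weights coincide. The setup below is my own illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
q = rng.normal(size=16)
# Normalize every key to unit L2 norm, so -||k_i||^2 / 2 is the same for all i.
K = rng.normal(size=(10, 16))
K /= np.linalg.norm(K, axis=1, keepdims=True)

gauss_scores = -0.5 * np.sum((q - K) ** 2, axis=1)  # Gaussian-kernel log-weights
dot_scores = K @ q                                  # dot product scores

# The scores differ only by constants in ||q|| and ||k_i||, which softmax cancels.
print(np.allclose(softmax(gauss_scores), softmax(dot_scores)))  # True
```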

For completeness, the L2 norm of a vector is defined as:

\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}

And layer normalization standardizes a feature vector to have zero mean and unit variance, via:

\text{LayerNorm}(x) = \frac{x - \mu}{\sigma}

As a result, the output vectors have a stable distribution, and their L2 norms become nearly constant. This justifies the simplification of the Gaussian kernel into a dot product.

If you are not familiar with layer normalization, it is recommended to review the following article.

Dot Product and Cosine Similarity

We can also understand dot product attention through the lens of cosine similarity.

The angle \theta between two vectors q and k can be measured by the cosine of their angle:

\cos(\theta)=\frac{q \cdot k}{\|q\| \|k\|}

When the angle is close to 0°, \cos(\theta) \approx 1; when the angle is 90°, \cos(\theta) = 0; when the angle is 180°, \cos(\theta) < 0. This measure, known as cosine similarity, is commonly used to quantify the directional similarity between semantic vectors.

By multiplying both sides of the cosine similarity equation by the denominator \|q\| \|k\|, we recover the dot product:

q \cdot k = \cos(\theta) \|q\| \|k\|

Thus, the dot product reflects both the directional similarity (through \cos(\theta)) and the magnitudes of the vectors. That is, when the vectors are aligned and have large magnitudes, the dot product becomes large.

This raises a natural concern: Can vector magnitudes mislead the attention mechanism? In other words, does dot product sometimes assign higher attention weights due to larger magnitudes rather than true semantic similarity?

For instance, if the vector for orange is [1, 0] and the vector for lemon is [2, 0], then:

\text{orange} \cdot \text{orange} < \text{orange} \cdot \text{lemon}

Even though the meanings are similar, the difference in magnitude causes a skew in dot product attention.
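The orange/lemon example can be checked directly. The two vectors point in the same direction, so their cosine similarity is identical, yet the raw dot product is inflated by the larger magnitude of the second vector (the vectors are the article's toy values, not real embeddings):

```python
import numpy as np

orange = np.array([1.0, 0.0])
lemon = np.array([2.0, 0.0])

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Same direction, so cosine similarity is identical for both pairs...
print(cosine(orange, orange), cosine(orange, lemon))   # 1.0 1.0
# ...but the larger magnitude of `lemon` inflates the raw dot product.
print(orange @ orange, orange @ lemon)                 # 1.0 2.0
```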

In high-dimensional space, random vectors are nearly orthogonal, so it’s unlikely for two unrelated vectors to be perfectly aligned. In practice, embedding vectors are learned through a global loss function, and the model’s learnable parameters adjust both direction and magnitude to reflect semantic relationships more accurately. Moreover, many models apply layer normalization or project vectors onto the unit sphere to mitigate the effect of magnitude.

Scaled Dot Product Attention

In 2017, Vaswani et al. proposed scaled dot product attention, where the attention scoring function a(q, k) is defined as follows, with d denoting the dimensionality of the query vector q:

a(q, k) = \frac{q^\top k}{\sqrt{d}}

Compared to the original dot product attention, this formulation introduces a scaling factor of \frac{1}{\sqrt{d}}. The purpose of this scaling is to prevent the dot products from growing excessively large in high-dimensional spaces, which would push the softmax into a saturated, nearly one-hot regime where its gradients vanish and learning stalls.

But why divide specifically by \sqrt{d}?

Suppose we have two d-dimensional vectors \vec{q} and \vec{k}, and compute their dot product:

\vec{q} \cdot \vec{k} = \displaystyle \sum_{i=1}^{d} q_i k_i

Assume that each element q_i and k_i is an independent random variable with mean zero and variance one (e.g., drawn from a Gaussian distribution N(0, 1)). Then the product q_i \cdot k_i has mean 0 and variance 1: since the factors are independent with zero mean, \text{Var}(q_i k_i) = \mathbb{E}[q_i^2 k_i^2] = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2], which gives:

\text{Var}(q_i \cdot k_i) = \text{Var}(q_i) \cdot \text{Var}(k_i) = 1 \cdot 1 = 1

Since the overall dot product is a sum of d independent terms, its variance is:

\text{Var}(q \cdot k) = \displaystyle \sum_{i=1}^{d} \text{Var}(q_i \cdot k_i) = d

In other words, the variance of q \cdot k grows linearly with dimensionality d. Without adjustment, this causes the inputs to the softmax function to have increasingly large magnitude in higher dimensions, which leads to sharp outputs. That is, the softmax becomes heavily peaked, assigning nearly all probability mass to a single element. This undermines gradient flow and impedes learning.

To stabilize the variance of the dot product to approximately 1, we simply divide by its standard deviation \sqrt{d}:

\text{Var}(q \cdot k)=d \\ \text{Var}(\frac{q \cdot k}{\sqrt{d}})=\frac{1}{d} \cdot \text{Var}(q \cdot k)=\frac{1}{d} \cdot d=1
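The variance argument above is easy to confirm with a quick Monte Carlo experiment (dimension and sample count are my own choices):

```python
import numpy as np

rng = np.random.default_rng(42)
d, trials = 256, 20_000

# Draw many independent (q, k) pairs with i.i.d. N(0, 1) entries.
q = rng.normal(size=(trials, d))
k = rng.normal(size=(trials, d))

dots = np.sum(q * k, axis=1)   # raw dot products: variance grows with d
scaled = dots / np.sqrt(d)     # scaled dot products: variance stays near 1

print(dots.var())    # roughly d = 256
print(scaled.var())  # roughly 1.0
```

Empirically the raw dot products have variance close to d while the scaled ones sit near 1, matching the derivation.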

By scaling in this way, the values passed into the softmax function are normalized, which keeps the numerical range stable and improves the efficiency of learning the attention weights.

Therefore, the core motivation behind scaled dot product attention is to maintain a consistent variance of dot products across different dimensionalities, so that the softmax function remains effective and well-behaved during training.

Conclusion

Although the attention scoring functions in early research were often guided by intuition and lacked rigorous mathematical justification, revisiting their connections to kernel methods and similarity measures allows us to gain a deeper understanding of the underlying principles and rationale behind these formulations.
