Normalization is a data transformation technique originating from statistics. It adjusts the mean and variance of data to make it more stable and predictable. In deep learning, normalization is widely used to improve the stability and efficiency of model training. This article explains the original concept of normalization, introduces the design and limitations of batch normalization, and explores how layer normalization addresses these issues to become a standard component in modern language models.
Normalization
Normalization is a fundamental technique from statistics and data preprocessing. Its goal is to adjust the distribution of data so that it becomes more consistent and predictable in terms of numerical scale. In other words, normalization modifies the center and scale of the data without altering the relative relationships between data points, so that the data becomes more suitable for modeling and computation. In practical applications, normalization helps models learn more easily and allows features from different sources or scales to be compared and integrated effectively.
Here, we introduce two of the most common normalization methods.
Z-score Normalization
Z-score Normalization transforms the data into a distribution with a mean of 0 and a standard deviation of 1:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean of the data and $\sigma$ is its standard deviation.
This approach prevents the model from being biased by differences in the numerical scale of features and contributes to the stability of numerical optimization methods such as gradient descent.
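As a quick illustration, here is a minimal sketch of z-score normalization applied per feature with PyTorch; the tensor values are made up purely for demonstration.

```python
import torch

# Hypothetical data: 4 samples with 3 features on very different scales.
x = torch.tensor([[1.0, 100.0, 0.001],
                  [2.0, 150.0, 0.002],
                  [3.0, 120.0, 0.004],
                  [4.0, 130.0, 0.003]])

# Z-score normalization: subtract the per-feature mean, divide by the per-feature std.
mean = x.mean(dim=0, keepdim=True)
std = x.std(dim=0, unbiased=False, keepdim=True)
z = (x - mean) / std

print(z.mean(dim=0))                   # approximately 0 for every feature
print(z.std(dim=0, unbiased=False))    # approximately 1 for every feature
```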
Min-max Normalization
Min-max Normalization scales the data into a fixed range, such as [0, 1]:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
This method is commonly used in applications like image processing, where features need to be compressed within a specific range.
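Similarly, here is a minimal sketch of min-max normalization applied per feature, again with made-up values:

```python
import torch

# Hypothetical pixel-like values with different ranges per feature (column).
x = torch.tensor([[0.0, 50.0, 255.0],
                  [10.0, 200.0, 128.0],
                  [5.0, 125.0, 0.0]])

# Min-max normalization: rescale each feature into [0, 1].
x_min = x.min(dim=0, keepdim=True).values
x_max = x.max(dim=0, keepdim=True).values
x_scaled = (x - x_min) / (x_max - x_min)

print(x_scaled)  # every value now lies in [0, 1]
```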
Deep Learning and Normalization
In deep neural networks, data passes through multiple layers of nonlinear transformations, each of which may alter the data distribution. This continual shift in distribution (known as internal covariate shift) makes model training more difficult. In particular, when the input distribution changes throughout training, each layer must constantly adapt to these changes.
Normalization is introduced to mitigate this problem. It stabilizes the distribution of signals and prevents extremely large or small values from disrupting the gradients. It also reduces the model’s sensitivity to weight initialization and learning rate settings. As a result, normalization accelerates convergence and can even improve generalization performance.
Batch Normalization
In 2015, Sergey Ioffe and Christian Szegedy introduced Batch Normalization. The core idea is to normalize each feature (channel) of a layer's output using statistics computed over the mini-batch, so that each channel's output has a mean of 0 and a standard deviation of 1.
For the input to a neuron, the normalization process over a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$ is defined as follows:

$$\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2$$

$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

The trainable parameters $\gamma$ and $\beta$ allow the model to restore representational capacity after normalization. Batch Normalization significantly improves training stability, accelerates convergence, and makes training deeper networks more feasible.
However, Batch Normalization also has its limitations. It depends on batch size because small batches may result in unstable statistics. Moreover, it is generally not suitable for RNNs. In addition, there is a discrepancy between training and inference behavior, requiring extra handling such as maintaining a moving average of the statistics.
Below is an implementation of Batch Normalization.
```python
import torch
import torch.nn as nn


class BatchNorm(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        # Trainable scale (gamma) and shift (beta) parameters.
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        # Running statistics used at inference time.
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))

    def forward(self, x):
        if self.training:
            # Normalize with the statistics of the current mini-batch.
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)
            # Update the running statistics with an exponential moving average.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean.detach()
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var.detach()
            x_hat = (x - mean[None, :]) / torch.sqrt(var[None, :] + self.eps)
        else:
            # Normalize with the running statistics accumulated during training.
            x_hat = (x - self.running_mean[None, :]) / torch.sqrt(self.running_var[None, :] + self.eps)
        return self.gamma[None, :] * x_hat + self.beta[None, :]


if __name__ == "__main__":
    torch.manual_seed(42)
    batch_size = 6
    feature_dim = 8
    x = torch.randn(batch_size, feature_dim)

    bn = BatchNorm(feature_dim)

    print("=== Training Mode ===")
    bn.train()
    y_bn_train = bn(x)
    print("BatchNorm mean/std:", y_bn_train.mean().item(), y_bn_train.std().item())

    print("\n=== Eval Mode ===")
    bn.eval()
    x_new = torch.randn(batch_size, feature_dim)
    y_bn_eval = bn(x_new)
    print("BatchNorm mean/std:", y_bn_eval.mean().item(), y_bn_eval.std().item())
```
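As an optional sanity check (not part of the implementation above), the custom module can be compared against PyTorch's built-in nn.BatchNorm1d, which uses the same defaults for eps and momentum. This sketch assumes it runs in the same file as the BatchNorm class defined above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(6, 8)

custom_bn = BatchNorm(8)          # the module defined above
builtin_bn = nn.BatchNorm1d(8)    # PyTorch's reference implementation

custom_bn.train()
builtin_bn.train()

# In training mode both modules normalize with the current batch statistics,
# so their outputs should match up to numerical precision.
print(torch.allclose(custom_bn(x), builtin_bn(x), atol=1e-6))  # True
```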
Layer Normalization
In 2016, Jimmy Lei Ba et al. proposed Layer Normalization, which addresses the limitations of Batch Normalization, particularly in scenarios involving language models and recurrent networks. The key difference between the two lies in how the statistics are computed: Layer Normalization normalizes over all hidden units within a single sample, rather than across samples in a batch. For a sample with $H$ hidden units $x_1, \dots, x_H$, the formulation is as follows:

$$\mu = \frac{1}{H} \sum_{i=1}^{H} x_i, \qquad \sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma_i \hat{x}_i + \beta_i$$
Since the mean and variance are computed independently for each sample, Layer Normalization does not rely on the batch dimension. As a result, it exhibits consistent behavior during both training and inference and is well-suited for use in RNNs.
Below is an implementation of Layer Normalization.
```python
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        if isinstance(normalized_shape, int):
            normalized_shape = (normalized_shape,)
        self.normalized_shape = normalized_shape
        self.eps = eps
        # Trainable scale (gamma) and shift (beta) parameters.
        self.gamma = nn.Parameter(torch.ones(*normalized_shape))
        self.beta = nn.Parameter(torch.zeros(*normalized_shape))

    def forward(self, x):
        # Normalize over the last len(normalized_shape) dimensions of each sample.
        dims = [-i for i in range(1, len(self.normalized_shape) + 1)]
        mean = x.mean(dim=dims, keepdim=True)
        var = x.var(dim=dims, keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta


if __name__ == "__main__":
    torch.manual_seed(42)
    batch_size = 6
    feature_dim = 8
    x = torch.randn(batch_size, feature_dim)

    ln = LayerNorm(feature_dim)

    print("=== Training Mode ===")
    ln.train()
    y_ln_train = ln(x)
    print("LayerNorm mean/std:", y_ln_train.mean().item(), y_ln_train.std().item())

    print("\n=== Eval Mode ===")
    ln.eval()
    x_new = torch.randn(batch_size, feature_dim)
    y_ln_eval = ln(x_new)
    print("LayerNorm mean/std:", y_ln_eval.mean().item(), y_ln_eval.std().item())
```
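As with Batch Normalization, the custom module can be checked against PyTorch's built-in nn.LayerNorm; this sketch assumes it runs in the same file as the LayerNorm class defined above. It also normalizes a batch of size 1, a case where batch statistics would be unusable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(6, 8)

custom_ln = LayerNorm(8)        # the module defined above
builtin_ln = nn.LayerNorm(8)    # PyTorch's reference implementation

# Layer Normalization behaves identically in train and eval mode,
# so a single comparison suffices.
print(torch.allclose(custom_ln(x), builtin_ln(x), atol=1e-6))  # True

# It also works for a batch of size 1, where Batch Normalization breaks down.
x_single = torch.randn(1, 8)
print(torch.allclose(custom_ln(x_single), builtin_ln(x_single), atol=1e-6))  # True
```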
Conclusion
Modern NLP models, such as the Transformer, BERT, the GPT series, T5, and LLaMA, almost universally adopt Layer Normalization, or close variants of it such as RMSNorm, as a standard component. It is commonly applied before or after the attention mechanism and feed-forward layers, typically in conjunction with residual connections. In these architectures, Layer Normalization plays a crucial role in stabilizing the learning process for long sequences. In particular, it helps maintain the stability of output signals across layers within self-attention structures.
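To make this placement concrete, below is a minimal sketch of a pre-norm Transformer block; the dimensions and module names are hypothetical and do not mirror any specific model. LayerNorm is applied before each sublayer, and each sublayer is wrapped in a residual connection.

```python
import torch
import torch.nn as nn


class PreNormTransformerBlock(nn.Module):
    """Minimal pre-norm Transformer block: LayerNorm before the attention
    and feed-forward sublayers, each wrapped in a residual connection."""

    def __init__(self, d_model=64, num_heads=4, d_ff=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Self-attention sublayer with pre-normalization and residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Feed-forward sublayer with pre-normalization and residual connection.
        x = x + self.ffn(self.norm2(x))
        return x


if __name__ == "__main__":
    block = PreNormTransformerBlock()
    x = torch.randn(2, 10, 64)    # (batch, sequence length, d_model)
    print(block(x).shape)         # torch.Size([2, 10, 64])
```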
References
- Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. In NIPS 2016 Deep Learning Symposium.
- Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015.