A Recurrent Neural Network (RNN) is a neural network architecture that specializes in processing sequential data. An RNN uses recurrent connections to retain information from previous steps, allowing the network to take past states into account while processing the current step. This article introduces the basic principles of RNNs.
The complete code for this chapter can be found in .
RNN Architecture
A Recurrent Neural Network keeps a hidden state at each time step and uses it as an input to the next time step. This allows the RNN to capture information from previous time steps. Compared with a traditional feedforward neural network, it is therefore better suited to sequential or context-dependent problems.
The figure below shows the architecture of the basic RNN (vanilla RNN), with the folded representation on one side and the unfolded representation on the other. The unfolded view makes clear that the network applied at each time step is exactly the same.
In this article, we will use a character-level RNN model to explain how an RNN works. The input at each time step $x^{\langle t \rangle}$ is one character of the input text. The RNN cell receives $x^{\langle t \rangle}$ together with the previous hidden state $a^{\langle t-1 \rangle}$ and produces a new hidden state $a^{\langle t \rangle}$. On the one hand, this hidden state is used to output the prediction $\hat{y}^{\langle t \rangle}$; on the other hand, it is passed to the RNN cell of the next time step. Each time step therefore affects the next one. In other words, $a^{\langle t \rangle}$ captures some information from the previous time steps, so there is a dependency between time steps.
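To make the recurrence concrete, the toy sketch below (not part of the model built later in this article, and with arbitrarily chosen sizes) unrolls a hidden state over a few one-hot "characters"; the update rule matches the definitions used throughout this article.

import numpy as np

# Toy illustration: the hidden state produced at one time step is fed into the next one.
n_x, n_a = 5, 3                      # hypothetical input (vocabulary) size and hidden size
Wax = np.random.randn(n_a, n_x) * 0.01
Waa = np.random.randn(n_a, n_a) * 0.01
ba = np.zeros((n_a, 1))

a = np.zeros((n_a, 1))               # initial hidden state
for t in range(4):                   # four time steps, one character each
    x_t = np.zeros((n_x, 1))
    x_t[t % n_x] = 1                 # pretend one-hot character at time step t
    a = np.tanh(Waa @ a + Wax @ x_t + ba)   # the new hidden state depends on the previous one
    print(f"t={t}, hidden state: {a.ravel()}")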
RNN Types
The number of inputs $T_x$ and the number of outputs $T_y$ of an RNN are not always the same, so there are five different types of RNNs: one-to-one, one-to-many, many-to-one, many-to-many with $T_x = T_y$, and many-to-many with $T_x \neq T_y$. Each type is suitable for different applications, as shown below. The vanilla RNN in this article is the many-to-many type on the bottom left.
Forward Propagation
If you are not familiar with the forward propagation of neural networks, please refer to the following article first.
The following figure shows our RNN cell. Its inputs are $x^{\langle t \rangle}$ and the hidden state of the previous time step $a^{\langle t-1 \rangle}$. These two inputs, combined with the parameters $W_{ax}$, $W_{aa}$, and $b_a$, are passed through the tanh activation function to produce the hidden state $a^{\langle t \rangle}$, which is used as an input to the RNN cell of the next time step. In addition, $a^{\langle t \rangle}$ is combined with the parameters $W_{ya}$ and $b_y$ and passed through the softmax activation function to produce $\hat{y}^{\langle t \rangle}$, the output of the current time step.
It is worth noting that the parameters are shared: every RNN cell at every time step uses the same set of parameters.
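Because the parameters are shared, the number of trainable parameters does not grow with the sequence length. As a quick sanity check (a sketch with hypothetical sizes), the count below depends only on $n_x$, $n_a$, and $n_y$:

# Parameter count of a vanilla RNN: independent of the number of time steps T_x.
n_x, n_a, n_y = 27, 64, 27   # hypothetical sizes: vocabulary of 27, hidden state of 64
num_params = (
    n_a * n_x    # Wax
    + n_a * n_a  # Waa
    + n_y * n_a  # Wya
    + n_a        # ba
    + n_y        # by
)
print(num_params)  # same value whether the sequence has 10 or 10,000 characters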
The formulas in the RNN cell are as follows:

$$a^{\langle t \rangle} = \tanh\left(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a\right)$$

$$\hat{y}^{\langle t \rangle} = \mathrm{softmax}\left(W_{ya} a^{\langle t \rangle} + b_y\right)$$
The dimensions of the RNN input $X$ and true labels $Y$ are as follows, where $n_x$ and $n_y$ are the input and output sizes (here both equal to the vocabulary size), $m$ is the batch size, and $T_x$ is the sequence length:

$$X \in \mathbb{R}^{n_x \times m \times T_x}, \qquad Y \in \mathbb{R}^{n_y \times m \times T_x}$$

In the RNN cell, the dimensions of each variable are as follows:

$$x^{\langle t \rangle} \in \mathbb{R}^{n_x \times m}, \qquad a^{\langle t \rangle} \in \mathbb{R}^{n_a \times m}, \qquad \hat{y}^{\langle t \rangle} \in \mathbb{R}^{n_y \times m}$$

$$W_{ax} \in \mathbb{R}^{n_a \times n_x}, \qquad W_{aa} \in \mathbb{R}^{n_a \times n_a}, \qquad W_{ya} \in \mathbb{R}^{n_y \times n_a}, \qquad b_a \in \mathbb{R}^{n_a \times 1}, \qquad b_y \in \mathbb{R}^{n_y \times 1}$$
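As a quick check of these shapes (a sketch with hypothetical sizes), the snippet below builds arrays with the dimensions listed above and verifies that the cell equations produce outputs of the expected shape:

import numpy as np

# Hypothetical sizes: vocabulary of 27 characters, hidden state of 64 units,
# batch of 1 sequence, 10 time steps.
n_x = n_y = 27
n_a, m, T_x = 64, 1, 10

X = np.zeros((n_x, m, T_x))          # inputs for every time step
Y = np.zeros((n_y, m, T_x))          # true labels for every time step
Wax = np.zeros((n_a, n_x))
Waa = np.zeros((n_a, n_a))
Wya = np.zeros((n_y, n_a))
ba = np.zeros((n_a, 1))
by = np.zeros((n_y, 1))

at = np.tanh(Waa @ np.zeros((n_a, m)) + Wax @ X[:, :, 0] + ba)
print(at.shape)                      # (64, 1)  -> (n_a, m)
print((Wya @ at + by).shape)         # (27, 1)  -> (n_y, m)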
The following is the implementation of forward propagation of a vanilla RNN.
import numpy as np


def softmax(z):
    # Softmax over the class axis (rows). This helper is assumed here because the
    # original code calls softmax() without showing its definition.
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)


class VanillaRNN:
    def cell_forward(self, xt, at_prev, parameters):
        """
        Implements a single forward step of the RNN-cell.

        Parameters
        ----------
        xt: (ndarray (n_x, m)) - input data at timestep "t"
        at_prev: (ndarray (n_a, m)) - hidden state at timestep "t-1"
        parameters:
            Waa: (ndarray (n_a, n_a)) - weight matrix multiplying the hidden state at_prev
            Wax: (ndarray (n_a, n_x)) - weight matrix multiplying the input xt
            Wya: (ndarray (n_y, n_a)) - weight matrix relating the hidden-state to the output
            ba: (ndarray (n_a, 1)) - bias
            by: (ndarray (n_y, 1)) - bias

        Returns
        -------
        at: (ndarray (n_a, m)) - hidden state at timestep "t"
        y_hat_t: (ndarray (n_y, m)) - prediction at timestep "t"
        cache: (tuple) - (at, at_prev, xt, zxt, y_hat_t, zyt) for the backpropagation
        """
        Wax, Waa, ba = parameters["Wax"], parameters["Waa"], parameters["ba"]
        Wya, by = parameters["Wya"], parameters["by"]

        zxt = Waa @ at_prev + Wax @ xt + ba
        at = np.tanh(zxt)
        zyt = Wya @ at + by
        y_hat_t = softmax(zyt)

        cache = (at, at_prev, xt, zxt, y_hat_t, zyt)
        return at, y_hat_t, cache

    def forward(self, X, a0, parameters):
        """
        Implements the forward propagation of the RNN.

        Parameters
        ----------
        X: (ndarray (n_x, m, T_x)) - input data for each timestep
        a0: (ndarray (n_a, m)) - initial hidden state
        parameters:
            Waa: (ndarray (n_a, n_a)) - weight matrix multiplying the hidden state at_prev
            Wax: (ndarray (n_a, n_x)) - weight matrix multiplying the input xt
            Wya: (ndarray (n_y, n_a)) - weight matrix relating the hidden-state to the output
            ba: (ndarray (n_a, 1)) - bias
            by: (ndarray (n_y, 1)) - bias

        Returns
        -------
        A: (ndarray (n_a, m, T_x)) - hidden states for each timestep
        Y_hat: (ndarray (n_y, m, T_x)) - predictions for each timestep
        caches: (list) - list of the per-timestep caches for the backpropagation
        """
        caches = []
        n_x, m, T_x = X.shape
        n_y, n_a = parameters["Wya"].shape

        A = np.zeros((n_a, m, T_x))
        Y_hat = np.zeros((n_y, m, T_x))

        at_prev = a0
        for t in range(T_x):
            at_prev, y_hat_t, cache = self.cell_forward(X[:, :, t], at_prev, parameters)
            A[:, :, t] = at_prev
            Y_hat[:, :, t] = y_hat_t
            caches.append(cache)

        return A, Y_hat, caches
Loss Function
Since our vanilla RNN is many-to-many, that is, it has an output at every time step, the total loss is the sum of the losses over all time steps, as shown in the following figure.
Our vanilla RNN uses a softmax output and therefore uses cross-entropy as its loss function:

$$\mathcal{L} = -\sum_{t=1}^{T_x} \sum_{i=1}^{n_y} y_i^{\langle t \rangle} \log \hat{y}_i^{\langle t \rangle}$$
The following is the implementation of the loss function.
class VanillaRNN:
    def compute_loss(self, Y_hat, Y):
        """
        Computes the cross-entropy loss.

        Parameters
        ----------
        Y_hat: (ndarray (n_y, m, T_x)) - predictions for each timestep
        Y: (ndarray (n_y, m, T_x)) - true labels

        Returns
        -------
        loss: (float) - the cross-entropy loss
        """
        return -np.sum(Y * np.log(Y_hat))
Backward Propagation
Compared with traditional neural networks, RNN backpropagation is more complicated. As shown in the figure below, at each time step we must take the partial derivatives with respect to all of the parameters ($W_{ya}$, $b_y$, $W_{aa}$, $W_{ax}$, $b_a$) as well as the hidden state.
In addition, each time step is affected by the hidden state of the previous time step. Therefore, when calculating the gradients, the partial derivatives obtained at each time step must be summed. This is backpropagation through time (BPTT). If we only calculated the gradient at the last time step, we would ignore the impact of the intermediate time steps on the final result, and the weights would not learn the correct adjustments over the entire sequence.
The following are the partial derivatives in the output layer at a time step, where $z_y^{\langle t \rangle} = W_{ya} a^{\langle t \rangle} + b_y$ and the sums run over the examples in the batch:

$$dz_y^{\langle t \rangle} = \hat{y}^{\langle t \rangle} - y^{\langle t \rangle}$$

$$dW_{ya} = dz_y^{\langle t \rangle} \, a^{\langle t \rangle \top}, \qquad db_y = \sum_{\text{batch}} dz_y^{\langle t \rangle}$$
The following is the calculation of the remaining partial derivatives at a time step, where $z_a^{\langle t \rangle} = W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a$ and $\odot$ denotes element-wise multiplication:

$$dz_a^{\langle t \rangle} = \left(1 - (a^{\langle t \rangle})^2\right) \odot da^{\langle t \rangle}$$

$$dW_{ax} = dz_a^{\langle t \rangle} \, x^{\langle t \rangle \top}, \qquad dW_{aa} = dz_a^{\langle t \rangle} \, a^{\langle t-1 \rangle \top}, \qquad db_a = \sum_{\text{batch}} dz_a^{\langle t \rangle}$$
It is worth noting that when calculating the partial derivative of the hidden state $a^{\langle t \rangle}$, we must first calculate the partial derivative coming from the output layer, $W_{ya}^\top dz_y^{\langle t \rangle}$. Then, we add the partial derivative with respect to $a^{\langle t \rangle}$ that was computed at the following time step (which is processed earlier in the backward pass), as shown below:

$$da^{\langle t \rangle} = W_{ya}^\top dz_y^{\langle t \rangle} + W_{aa}^\top dz_a^{\langle t+1 \rangle}$$
The above shows how to find all the partial derivatives at a single time step. As mentioned for backpropagation through time, we then have to add up the partial derivatives computed at every time step.
The following is the implementation of backward propagation of a vanilla RNN.
class VanillaRNN:
    def cell_backward(self, y, dat, cache, parameters):
        """
        Implements a single backward step of the RNN-cell.

        Parameters
        ----------
        y: (ndarray (n_y, m)) - true labels at timestep "t"
        dat: (ndarray (n_a, m)) - gradient of the hidden state at timestep "t"
        cache: (tuple) - (at, at_prev, xt, zxt, y_hat_t, zyt)
        parameters:
            Waa: (ndarray (n_a, n_a)) - weight matrix multiplying the hidden state at_prev
            Wax: (ndarray (n_a, n_x)) - weight matrix multiplying the input xt
            Wya: (ndarray (n_y, n_a)) - weight matrix relating the hidden-state to the output
            ba: (ndarray (n_a, 1)) - bias
            by: (ndarray (n_y, 1)) - bias

        Returns
        -------
        gradients: (dict) - the gradients
            dWaa: (ndarray (n_a, n_a)) - gradient of the weight matrix multiplying the hidden state at_prev
            dWax: (ndarray (n_a, n_x)) - gradient of the weight matrix multiplying the input xt
            dWya: (ndarray (n_y, n_a)) - gradient of the weight matrix relating the hidden-state to the output
            dba: (ndarray (n_a, 1)) - gradient of the bias
            dby: (ndarray (n_y, 1)) - gradient of the bias
            dat: (ndarray (n_a, m)) - gradient of the hidden state passed to timestep "t-1"
        """
        at, at_prev, xt, zxt, y_hat_t, zyt = cache

        dy = y_hat_t - y
        gradients = {
            "dWya": dy @ at.T,
            "dby": np.sum(dy, axis=1, keepdims=True),
        }

        dat = parameters["Wya"].T @ dy + dat
        dz = (1 - at ** 2) * dat
        gradients["dba"] = np.sum(dz, axis=1, keepdims=True)
        gradients["dWax"] = dz @ xt.T
        gradients["dWaa"] = dz @ at_prev.T
        gradients["dat"] = parameters["Waa"].T @ dz

        return gradients

    def backward(self, X, Y, parameters, caches):
        """
        Implements the backward propagation of the RNN.

        Parameters
        ----------
        X: (ndarray (n_x, m, T_x)) - input data
        Y: (ndarray (n_y, m, T_x)) - true labels
        parameters:
            Waa: (ndarray (n_a, n_a)) - weight matrix multiplying the hidden state at_prev
            Wax: (ndarray (n_a, n_x)) - weight matrix multiplying the input xt
            Wya: (ndarray (n_y, n_a)) - weight matrix relating the hidden-state to the output
            ba: (ndarray (n_a, 1)) - bias
            by: (ndarray (n_y, 1)) - bias
        caches: (list) - list of caches from forward

        Returns
        -------
        gradients: (dict) - the gradients summed over all timesteps
            dWaa: (ndarray (n_a, n_a)) - gradient of the weight matrix multiplying the hidden state at_prev
            dWax: (ndarray (n_a, n_x)) - gradient of the weight matrix multiplying the input xt
            dWya: (ndarray (n_y, n_a)) - gradient of the weight matrix relating the hidden-state to the output
            dba: (ndarray (n_a, 1)) - gradient of the bias
            dby: (ndarray (n_y, 1)) - gradient of the bias
        """
        n_x, m, T_x = X.shape
        a1, a0, x1, zx1, y_hat1, zy1 = caches[0]

        Waa, Wax, ba = parameters["Waa"], parameters["Wax"], parameters["ba"]
        Wya, by = parameters["Wya"], parameters["by"]

        gradients = {
            "dWaa": np.zeros_like(Waa), "dWax": np.zeros_like(Wax), "dba": np.zeros_like(ba),
            "dWya": np.zeros_like(Wya), "dby": np.zeros_like(by),
        }
        dat = np.zeros_like(a0)

        for t in reversed(range(T_x)):
            grads = self.cell_backward(Y[:, :, t], dat, caches[t], parameters)
            gradients["dWaa"] += grads["dWaa"]
            gradients["dWax"] += grads["dWax"]
            gradients["dWya"] += grads["dWya"]
            gradients["dba"] += grads["dba"]
            gradients["dby"] += grads["dby"]
            dat = grads["dat"]

        return gradients
Exploding Gradients
In deep RNNs, that is, over many time steps, gradients are obtained through repeated multiplications and additions during backpropagation. If the weights are large, then even when each step magnifies the gradient by a factor only slightly greater than 1, the result can explode as the number of time steps grows. This causes the loss to reach a very large value or become NaN.
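A minimal numeric illustration (not tied to the model above): repeatedly multiplying a gradient by a factor only slightly greater than 1 already blows up over many time steps.

grad = 1.0
for t in range(200):       # 200 time steps
    grad *= 1.1            # each step scales the gradient by just 1.1
print(grad)                # roughly 1.9e8 -- the gradient has exploded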
Therefore, before updating the parameters, we will do gradient clipping, which is actually done as follows.
class VanillaRNN:
    def clip(self, gradients, max_value):
        """
        Clips the gradients to a maximum value.

        Parameters
        ----------
        gradients: (dict) - the gradients
            dWaa: (ndarray (n_a, n_a)) - gradient of the weight matrix multiplying the hidden state at_prev
            dWax: (ndarray (n_a, n_x)) - gradient of the weight matrix multiplying the input xt
            dWya: (ndarray (n_y, n_a)) - gradient of the weight matrix relating the hidden-state to the output
            dba: (ndarray (n_a, 1)) - gradient of the bias
            dby: (ndarray (n_y, 1)) - gradient of the bias
        max_value: (float) - the maximum value to clip the gradients

        Returns
        -------
        gradients: (dict) - the clipped gradients
        """
        dWaa, dWax, dWya = gradients["dWaa"], gradients["dWax"], gradients["dWya"]
        dba, dby = gradients["dba"], gradients["dby"]

        for gradient in [dWax, dWaa, dWya, dba, dby]:
            np.clip(gradient, -max_value, max_value, out=gradient)

        gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "dba": dba, "dby": dby}
        return gradients
Parameter Initialization and Update
Initializing the parameters is not difficult, but you need to make sure that the dimensions of each parameter are correct. The following is the implementation of parameter initialization, along with the constructor that creates the parameters and the initial hidden state used in the later example.
class VanillaRNN:
    def __init__(self, n_a, n_x, n_y):
        # Constructor inferred from how VanillaRNN is used later in this article:
        # it creates the parameters and the initial hidden state used by train() and sample().
        self.parameters = self.initialize_parameters(n_a, n_x, n_y)
        self.a0 = np.zeros((n_a, 1))

    def initialize_parameters(self, n_a, n_x, n_y):
        """
        Initializes the parameters for the RNN.

        Parameters
        ----------
        n_a: (int) - number of units in the hidden state
        n_x: (int) - number of units in the input data
        n_y: (int) - number of units in the output data

        Returns
        -------
        parameters: (dict) - initialized parameters
            Wax: (ndarray (n_a, n_x)) - weight matrix multiplying the input
            Waa: (ndarray (n_a, n_a)) - weight matrix multiplying the hidden state
            Wya: (ndarray (n_y, n_a)) - weight matrix relating the hidden-state to the output
            ba: (ndarray (n_a, 1)) - bias
            by: (ndarray (n_y, 1)) - bias
        """
        Wax = np.random.randn(n_a, n_x) * 0.01
        Waa = np.random.randn(n_a, n_a) * 0.01
        Wya = np.random.randn(n_y, n_a) * 0.01
        ba = np.zeros((n_a, 1))
        by = np.zeros((n_y, 1))

        parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "ba": ba, "by": by}
        return parameters
Once we have the gradients, we use them to update the parameters, which is done as follows.
class VanillaRNN:
    def update_parameters(self, parameters, gradients, learning_rate):
        """
        Updates the parameters using the gradients.

        Parameters
        ----------
        parameters: (dict) - the parameters
            Waa: (ndarray (n_a, n_a)) - weight matrix multiplying the hidden state at_prev
            Wax: (ndarray (n_a, n_x)) - weight matrix multiplying the input xt
            Wya: (ndarray (n_y, n_a)) - weight matrix relating the hidden-state to the output
            ba: (ndarray (n_a, 1)) - bias
            by: (ndarray (n_y, 1)) - bias
        gradients: (dict) - the gradients
            dWaa: (ndarray (n_a, n_a)) - gradient of the weight matrix multiplying the hidden state at_prev
            dWax: (ndarray (n_a, n_x)) - gradient of the weight matrix multiplying the input xt
            dWya: (ndarray (n_y, n_a)) - gradient of the weight matrix relating the hidden-state to the output
            dba: (ndarray (n_a, 1)) - gradient of the bias
            dby: (ndarray (n_y, 1)) - gradient of the bias
        learning_rate: (float) - the learning rate
        """
        parameters["Waa"] -= learning_rate * gradients["dWaa"]
        parameters["Wax"] -= learning_rate * gradients["dWax"]
        parameters["Wya"] -= learning_rate * gradients["dWya"]
        parameters["ba"] -= learning_rate * gradients["dba"]
        parameters["by"] -= learning_rate * gradients["dby"]
Putting It All Together
The following code implements a complete training step. First, we pass the training data through forward propagation and calculate the loss; then we run backward propagation to obtain the gradients. To prevent exploding gradients, we clip the gradients and finally use them to update the parameters. This is one complete training iteration.
class VanillaRNN:
    def optimize(self, X, Y, a_prev, parameters, learning_rate, clip_value):
        """
        Implements the forward and backward propagation of the RNN and updates the parameters.

        Parameters
        ----------
        X: (ndarray (n_x, m, T_x)) - the input data
        Y: (ndarray (n_y, m, T_x)) - true labels
        a_prev: (ndarray (n_a, m)) - the initial hidden state
        parameters: (dict) - the initial parameters
            Waa: (ndarray (n_a, n_a)) - weight matrix multiplying the hidden state at_prev
            Wax: (ndarray (n_a, n_x)) - weight matrix multiplying the input xt
            Wya: (ndarray (n_y, n_a)) - weight matrix relating the hidden-state to the output
            ba: (ndarray (n_a, 1)) - bias
            by: (ndarray (n_y, 1)) - bias
        learning_rate: (float) - the learning rate
        clip_value: (float) - maximum absolute value for the clipped gradients

        Returns
        -------
        at: (ndarray (n_a, m)) - the hidden state at the last timestep
        loss: (float) - the cross-entropy loss
        """
        A, Y_hat, caches = self.forward(X, a_prev, parameters)
        loss = self.compute_loss(Y_hat, Y)
        gradients = self.backward(X, Y, parameters, caches)
        gradients = self.clip(gradients, clip_value)
        self.update_parameters(parameters, gradients, learning_rate)

        at = A[:, :, -1]
        return at, loss
Example
So far, we have implemented optimize() in VanillaRNN, which performs one complete training iteration. Next, we use VanillaRNN as a character-level language model. The training data is a passage from Shakespeare. The model is trained one character at a time, so the sequence length is the length of the input text minus one: the input at each time step is a character and the label is the next character. Then, one-hot encoding is used to encode each character.
class VanillaRNN:
    def preprocess_inputs(self, inputs, char_to_idx):
        """
        Preprocess the input text data.

        Parameters
        ----------
        inputs: (str) - input text data
        char_to_idx: (dict) - dictionary mapping characters to indices

        Returns
        -------
        X: (ndarray (n_x, 1, T_x)) - input data for each time step
        Y: (ndarray (n_y, 1, T_x)) - true labels for each time step
        """
        n_x = n_y = len(char_to_idx)
        T_x = T_y = len(inputs) - 1

        X = np.zeros((n_x, 1, T_x))
        Y = np.zeros((n_y, 1, T_y))
        for t in range(T_x):
            X[char_to_idx[inputs[t]], 0, t] = 1
            Y[char_to_idx[inputs[t + 1]], 0, t] = 1

        return X, Y

    def train(self, inputs, char_to_idx, num_iterations=100, learning_rate=0.01, clip_value=5):
        """
        Train the RNN model on the given text.

        Parameters
        ----------
        inputs: (str) - input text data
        char_to_idx: (dict) - dictionary mapping characters to indices
        num_iterations: (int) - number of iterations for the optimization loop
        learning_rate: (float) - learning rate for the optimization algorithm
        clip_value: (float) - maximum value for the gradients

        Returns
        -------
        losses: (list) - cross-entropy loss at each iteration
        """
        losses = []
        a_prev = self.a0
        X, Y = self.preprocess_inputs(inputs, char_to_idx)

        for i in range(num_iterations):
            a_prev, loss = self.optimize(X, Y, a_prev, self.parameters, learning_rate, clip_value)
            losses.append(loss)
            print(f"Iteration {i}, Loss: {loss}")

        return losses
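As a quick usage sketch (with a made-up mini vocabulary), preprocess_inputs shifts the text by one character so that the label at each time step is the next character:

text = "hello"
chars = sorted(set(text))                       # ['e', 'h', 'l', 'o']
char_to_idx = {ch: i for i, ch in enumerate(chars)}

rnn = VanillaRNN(8, len(chars), len(chars))     # hypothetical hidden size of 8
X, Y = rnn.preprocess_inputs(text, char_to_idx)
print(X.shape, Y.shape)                         # (4, 1, 4) (4, 1, 4)
print(X[:, 0, 0])                               # one-hot vector for 'h'
print(Y[:, 0, 0])                               # one-hot vector for 'e' (the next character)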
Once trained, given a starting character, the VanillaRNN will generate Shakespearean-like strings.
class VanillaRNN:
    def sample(self, start_char, char_to_idx, idx_to_char, num_chars=100):
        """
        Generate text using the RNN model.

        Parameters
        ----------
        start_char: (str) - starting character for the text generation
        char_to_idx: (dict) - dictionary mapping characters to indices
        idx_to_char: (dict) - dictionary mapping indices to characters
        num_chars: (int) - number of characters to generate

        Returns
        -------
        text: (str) - generated text
        """
        x = np.zeros((len(char_to_idx), 1))
        a_prev = np.zeros_like(self.a0)

        idx = char_to_idx[start_char]
        x[idx] = 1
        indices = [idx]

        while len(indices) < num_chars:
            a_prev, y_hat, cache = self.cell_forward(x, a_prev, self.parameters)
            idx = np.random.choice(range(len(char_to_idx)), p=y_hat.ravel())
            indices.append(idx)

            x = np.zeros_like(x)
            x[idx] = 1

        text = "".join([idx_to_char[idx] for idx in indices])
        return text
The following is an example of using VanillaRNN.
if __name__ == "__main__": with open("shakespeare.txt", "r") as file: text = file.read() chars = sorted(list(set(text))) vocab_size = len(chars) char_to_idx = {ch: i for i, ch in enumerate(chars)} idx_to_char = {i: ch for i, ch in enumerate(chars)} rnn = VanillaRNN(64, vocab_size, vocab_size) losses = rnn.train(text, char_to_idx, num_iterations=100, learning_rate=0.01, clip_value=5) generated_text = rnn.sample("T", char_to_idx, idx_to_char, num_chars=100) print(generated_text)
Vanishing Gradients
The longer the sequence, the longer the backpropagation paths, and each time step contributes a factor to the gradient. When these factors are multiplied together, values greater than 1 easily lead to exploding gradients, while values less than 1 easily cause vanishing gradients. If inputs early in the sequence have an important influence on the final output, but the gradients are already very weak by the time they are propagated back that far, the model will not be able to learn from this early information.
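Analogous to the exploding-gradient illustration earlier, a minimal numeric sketch shows how factors smaller than 1 shrink the gradient toward zero over a long sequence:

grad = 1.0
for t in range(200):       # 200 time steps
    grad *= 0.9            # each step scales the gradient by 0.9 (e.g. a small tanh derivative)
print(grad)                # roughly 7e-10 -- the gradient has effectively vanished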
Imagine you have a lengthy article whose very first sentence states the core topic, for example:
“This article is about the impact of social media on teenagers’ mental health. We’ll start by discussing how screen time correlates with well-being…”
Then, there are many descriptive details at the end of the article, which may include case analysis, data reports, etc. The entire article may have thousands of words. For models that want to do “topic classification” or “sentiment analysis”, the beginning of the article is likely to point out the main idea of the article or the author’s attitude. Ideally, the model should remember this core message because everything that follows will relate to this topic.
When the RNN processes this article, it feeds information into the model word by word or sentence by sentence and updates the hidden state. By the time backpropagation reaches the first half of the article, the gradients have been shrinking continuously and are nearly zero, so the model does not remember the topic stated at the beginning. Finally, when the model outputs the topic classification or sentiment analysis result for the entire article, it lacks sufficient awareness of the early key information and may make wrong or inconsistent judgments.
RNN variants such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are better at retaining the key information at the beginning of the sequence.
Conclusion
Through this article, I believe you now have a more comprehensive understanding of how RNNs operate. We have seen the power of RNNs for processing sequential data, and also the challenges they face with long sequences, such as vanishing gradients. To address this problem, models such as LSTM and GRU can be used to retain long-term information more effectively.