The gated recurrent unit (GRU) is a type of RNN specifically designed to process sequential data. Like the long short-term memory (LSTM), it addresses the long-term dependency problem of the standard RNN.
The complete code for this chapter can be found in .
GRU
Standard RNNs suffer from the problem of vanishing gradients. Therefore, if a sequence is very long, a standard RNN cannot effectively learn from the early inputs. In other words, the long-term memory capacity of a standard RNN is quite weak. For more details about vanishing gradients, please refer to the following article. In addition, if you are not familiar with RNNs, please refer to the following article first.
Compared with LSTM, the GRU has a simpler structure and is more computationally efficient. For details about LSTM, please refer to the following article.
The figure below is a GRU cell. It is more complex than a standard RNN cell but simpler than an LSTM cell. In addition to the hidden state $a^{\langle t \rangle}$, the GRU also needs to calculate the reset gate $\Gamma_r^{\langle t \rangle}$, the update gate $\Gamma_u^{\langle t \rangle}$, and the candidate hidden state $\tilde{c}^{\langle t \rangle}$.
Here’s what each of them means.
- Reset gate $\Gamma_r^{\langle t \rangle}$: This gate determines how much of the previous information to forget when forming the candidate hidden state.
- Update gate $\Gamma_u^{\langle t \rangle}$: This gate determines how much of the previous information to keep in the new hidden state.
- Candidate hidden state $\tilde{c}^{\langle t \rangle}$: Generates a new candidate state from the current input and part of the past state.
Forward Propagation
The figure below shows GRU forward propagation. Each gate and the candidate hidden state has its corresponding parameters and activation function. It is worth noting that we stack $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ vertically.
The formula in the GRU cell is as follows:
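$$
\begin{aligned}
\Gamma_r^{\langle t \rangle} &= \sigma\!\left(W_r\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_r\right) \\
\Gamma_u^{\langle t \rangle} &= \sigma\!\left(W_u\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_u\right) \\
\tilde{c}^{\langle t \rangle} &= \tanh\!\left(W_c\left[\Gamma_r^{\langle t \rangle} \odot a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_c\right) \\
a^{\langle t \rangle} &= \Gamma_u^{\langle t \rangle} \odot \tilde{c}^{\langle t \rangle} + \left(1 - \Gamma_u^{\langle t \rangle}\right) \odot a^{\langle t-1 \rangle} \\
\hat{y}^{\langle t \rangle} &= \mathrm{softmax}\!\left(W_y a^{\langle t \rangle} + b_y\right)
\end{aligned}
$$

where $\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right]$ denotes the vertical stacking of $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication.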
The GRU input $X$ and the true labels $Y$ have the following dimensions, where $m$ is the batch size and $T_x$ is the sequence length:

- $X$: $(n_x, m, T_x)$
- $Y$: $(n_y, m, T_x)$

In the GRU cell, the dimensions of each variable are as follows:

- $x^{\langle t \rangle}$: $(n_x, m)$, $a^{\langle t \rangle}$: $(n_a, m)$, $\hat{y}^{\langle t \rangle}$: $(n_y, m)$
- $\Gamma_r^{\langle t \rangle}$, $\Gamma_u^{\langle t \rangle}$, $\tilde{c}^{\langle t \rangle}$: $(n_a, m)$
- $W_r$, $W_u$, $W_c$: $(n_a, n_a + n_x)$ and $b_r$, $b_u$, $b_c$: $(n_a, 1)$
- $W_y$: $(n_y, n_a)$ and $b_y$: $(n_y, 1)$
The following is the implementation of GRU’s forward propagation.
```python
import numpy as np

# sigmoid, tanh, and softmax are assumed to be defined elsewhere
# (e.g., as simple NumPy-based activation helpers).


class GRU:
    def cell_forward(self, xt, at_prev, parameters):
        """
        Implements a single forward step for the GRU-cell.

        Parameters
        ----------
        xt: (ndarray (n_x, m)) - input data for the current timestep
        at_prev: (ndarray (n_a, m)) - hidden state from the previous timestep
        parameters: (dict) - dictionary containing the weights and biases of the GRU network
            Wu: (ndarray (n_a, n_a + n_x)) - weights of the update gate
            bu: (ndarray (n_a, 1)) - biases of the update gate
            Wr: (ndarray (n_a, n_a + n_x)) - weights of the reset gate
            br: (ndarray (n_a, 1)) - biases of the reset gate
            Wc: (ndarray (n_a, n_a + n_x)) - weights of the candidate value
            bc: (ndarray (n_a, 1)) - biases of the candidate value
            Wy: (ndarray (n_y, n_a)) - weights of the output layer
            by: (ndarray (n_y, 1)) - biases of the output layer

        Returns
        -------
        at: (ndarray (n_a, m)) - hidden state for the current timestep
        y_hat_t: (ndarray (n_y, m)) - prediction for the current timestep
        cache: (tuple) - values needed for the backward pass
        """

        Wu, bu = parameters["Wu"], parameters["bu"]  # update gate weights and biases
        Wr, br = parameters["Wr"], parameters["br"]  # reset gate weights and biases
        Wc, bc = parameters["Wc"], parameters["bc"]  # candidate value weights and biases
        Wy, by = parameters["Wy"], parameters["by"]  # prediction weights and biases

        concat = np.concatenate((at_prev, xt), axis=0)
        ut = sigmoid(Wu @ concat + bu)  # update gate
        rt = sigmoid(Wr @ concat + br)  # reset gate
        cct = tanh(Wc @ np.concatenate((rt * at_prev, xt), axis=0) + bc)  # candidate value
        at = ut * cct + (1 - ut) * at_prev  # hidden state
        zyt = Wy @ at + by
        y_hat_t = softmax(zyt)
        cache = (at, at_prev, ut, rt, cct, xt, y_hat_t, zyt)
        return at, y_hat_t, cache

    def forward(self, X, a0, parameters):
        """
        Implements the forward pass of the GRU network.

        Parameters
        ----------
        X: (ndarray (n_x, m, T_x)) - input data for each time step
        a0: (ndarray (n_a, m)) - initial hidden state
        parameters: (dict) - dictionary containing the weights and biases of the GRU network
            Wu: (ndarray (n_a, n_a + n_x)) - weights of the update gate
            bu: (ndarray (n_a, 1)) - biases of the update gate
            Wr: (ndarray (n_a, n_a + n_x)) - weights of the reset gate
            br: (ndarray (n_a, 1)) - biases of the reset gate
            Wc: (ndarray (n_a, n_a + n_x)) - weights of the candidate value
            bc: (ndarray (n_a, 1)) - biases of the candidate value
            Wy: (ndarray (n_y, n_a)) - weights of the output layer
            by: (ndarray (n_y, 1)) - biases of the output layer

        Returns
        -------
        A: (ndarray (n_a, m, T_x)) - hidden states for each timestep
        Y_hat: (ndarray (n_y, m, T_x)) - predictions for each timestep
        caches: (list) - values needed for the backward pass
        """

        caches = []
        Wy = parameters["Wy"]
        n_x, m, T_x = X.shape
        n_y, n_a = Wy.shape

        A = np.zeros((n_a, m, T_x))
        Y_hat = np.zeros((n_y, m, T_x))
        at_prev = a0
        for t in range(T_x):
            at_prev, y_hat_t, cache = self.cell_forward(X[:, :, t], at_prev, parameters)
            A[:, :, t] = at_prev
            Y_hat[:, :, t] = y_hat_t
            caches.append(cache)
        return A, Y_hat, caches
```
Loss Function
In this article, we use softmax to output $\hat{y}^{\langle t \rangle}$, so we use cross-entropy loss as the loss function. For the formula and implementation of cross-entropy loss, please refer to the following article.
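The `compute_loss` method used in the training code later follows that article. As a rough sketch (not necessarily the exact implementation), a cross-entropy loss summed over all timesteps and averaged over the batch could look like this, assuming one-hot labels and the shapes defined above:

```python
class GRU:
    def compute_loss(self, Y_hat, Y):
        """
        Cross-entropy loss, summed over timesteps and averaged over the batch.

        Y_hat: (ndarray (n_y, m, T_x)) - softmax predictions for each timestep
        Y: (ndarray (n_y, m, T_x)) - one-hot true labels for each timestep
        """
        m = Y.shape[1]
        # A small epsilon avoids log(0) when a predicted probability is exactly zero.
        return -np.sum(Y * np.log(Y_hat + 1e-12)) / m
```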
Backward Propagation
The backpropagation of GRU is a bit complicated. We have to calculate the partial derivatives with respect to each parameter. In particular, when calculating the gradient of the hidden state $a^{\langle t \rangle}$, we also need to account for the gradient flowing back from the following timestep, that is, backpropagation through time (BPTT).
The following calculates the partial derivatives in the output layer.
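Here $z_y^{\langle t \rangle} = W_y a^{\langle t \rangle} + b_y$, and $da_{\text{next}}^{\langle t \rangle}$ is the hidden-state gradient passed back from timestep $t+1$ (zero at the last timestep):

$$
\begin{aligned}
dz_y^{\langle t \rangle} &= \hat{y}^{\langle t \rangle} - y^{\langle t \rangle} \\
dW_y^{\langle t \rangle} &= dz_y^{\langle t \rangle} \, {a^{\langle t \rangle}}^\top \\
db_y^{\langle t \rangle} &= \sum_{\text{batch}} dz_y^{\langle t \rangle} \\
da^{\langle t \rangle} &= W_y^\top \, dz_y^{\langle t \rangle} + da_{\text{next}}^{\langle t \rangle}
\end{aligned}
$$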
The following is to calculate the partial derivatives of the reset gate, update gate, and candidate hidden state.
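$$
\begin{aligned}
d\tilde{c}^{\langle t \rangle} &= da^{\langle t \rangle} \odot \Gamma_u^{\langle t \rangle} \odot \left(1 - \left(\tilde{c}^{\langle t \rangle}\right)^2\right) \\
d\Gamma_u^{\langle t \rangle} &= da^{\langle t \rangle} \odot \left(\tilde{c}^{\langle t \rangle} - a^{\langle t-1 \rangle}\right) \odot \Gamma_u^{\langle t \rangle} \odot \left(1 - \Gamma_u^{\langle t \rangle}\right) \\
d\Gamma_r^{\langle t \rangle} &= \left(\left[W_c^\top \, d\tilde{c}^{\langle t \rangle}\right]_{1:n_a} \odot a^{\langle t-1 \rangle}\right) \odot \Gamma_r^{\langle t \rangle} \odot \left(1 - \Gamma_r^{\langle t \rangle}\right)
\end{aligned}
$$

Here $\left[\cdot\right]_{1:n_a}$ takes the first $n_a$ rows (the part that multiplies $\Gamma_r^{\langle t \rangle} \odot a^{\langle t-1 \rangle}$), and the sigmoid and tanh derivatives are already folded into these expressions, matching the implementation below.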
The following calculates the partial derivatives of the parameters $W_c$, $b_c$, $W_r$, $b_r$, $W_u$, and $b_u$.
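$$
\begin{aligned}
dW_c^{\langle t \rangle} &= d\tilde{c}^{\langle t \rangle} \left[\Gamma_r^{\langle t \rangle} \odot a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right]^\top, & db_c^{\langle t \rangle} &= \sum_{\text{batch}} d\tilde{c}^{\langle t \rangle} \\
dW_r^{\langle t \rangle} &= d\Gamma_r^{\langle t \rangle} \left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right]^\top, & db_r^{\langle t \rangle} &= \sum_{\text{batch}} d\Gamma_r^{\langle t \rangle} \\
dW_u^{\langle t \rangle} &= d\Gamma_u^{\langle t \rangle} \left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right]^\top, & db_u^{\langle t \rangle} &= \sum_{\text{batch}} d\Gamma_u^{\langle t \rangle}
\end{aligned}
$$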
The following calculates the remaining partial derivatives.
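$$
\begin{aligned}
da^{\langle t-1 \rangle} &= da^{\langle t \rangle} \odot \left(1 - \Gamma_u^{\langle t \rangle}\right) + \left[W_c^\top \, d\tilde{c}^{\langle t \rangle}\right]_{1:n_a} \odot \Gamma_r^{\langle t \rangle} + {W_r[:, :n_a]}^\top \, d\Gamma_r^{\langle t \rangle} + {W_u[:, :n_a]}^\top \, d\Gamma_u^{\langle t \rangle} \\
dx^{\langle t \rangle} &= \left[W_c^\top \, d\tilde{c}^{\langle t \rangle}\right]_{n_a+1:} + {W_r[:, n_a:]}^\top \, d\Gamma_r^{\langle t \rangle} + {W_u[:, n_a:]}^\top \, d\Gamma_u^{\langle t \rangle}
\end{aligned}
$$

where $W[:, :n_a]$ and $W[:, n_a:]$ denote the columns of a weight matrix that multiply $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$, respectively, and $da^{\langle t-1 \rangle}$ becomes $da_{\text{next}}^{\langle t-1 \rangle}$ for the previous timestep.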
The above shows how to calculate all the partial derivatives at a single timestep. Finally, we sum the parameter gradients over all timesteps.
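$$
dW_\ast = \sum_{t=1}^{T_x} dW_\ast^{\langle t \rangle}, \qquad db_\ast = \sum_{t=1}^{T_x} db_\ast^{\langle t \rangle}, \qquad \ast \in \{u, r, c, y\}
$$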
The following is the implementation of GRU’s backward propagation.
```python
class GRU:
    def cell_backward(self, y, dat, cache, parameters):
        """
        Implements a single backward step for the GRU-cell.

        Parameters
        ----------
        y: (ndarray (n_y, m)) - true labels for the current timestep
        dat: (ndarray (n_a, m)) - gradient of the hidden state for the current timestep
        cache: (tuple) - values needed for the backward pass
        parameters: (dict) - dictionary containing the weights and biases of the GRU network
            Wu: (ndarray (n_a, n_a + n_x)) - weights of the update gate
            bu: (ndarray (n_a, 1)) - biases of the update gate
            Wr: (ndarray (n_a, n_a + n_x)) - weights of the reset gate
            br: (ndarray (n_a, 1)) - biases of the reset gate
            Wc: (ndarray (n_a, n_a + n_x)) - weights of the candidate value
            bc: (ndarray (n_a, 1)) - biases of the candidate value
            Wy: (ndarray (n_y, n_a)) - weights of the output layer
            by: (ndarray (n_y, 1)) - biases of the output layer

        Returns
        -------
        gradients: (dict) - dictionary containing the gradients of the weights and biases of the GRU network
            dWu: (ndarray (n_a, n_a + n_x)) - gradients of the weights of the update gate
            dbu: (ndarray (n_a, 1)) - gradients of the biases of the update gate
            dWr: (ndarray (n_a, n_a + n_x)) - gradients of the weights of the reset gate
            dbr: (ndarray (n_a, 1)) - gradients of the biases of the reset gate
            dWc: (ndarray (n_a, n_a + n_x)) - gradients of the weights of the candidate value
            dbc: (ndarray (n_a, 1)) - gradients of the biases of the candidate value
            dWy: (ndarray (n_y, n_a)) - gradients of the weights of the output layer
            dby: (ndarray (n_y, 1)) - gradients of the biases of the output layer
            dat_prev: (ndarray (n_a, m)) - gradient of the previous hidden state
            dxt: (ndarray (n_x, m)) - gradient of the input at the current timestep
        """

        at, at_prev, ut, rt, cct, xt, y_hat_t, zyt = cache
        n_a, m = at.shape

        # Output layer gradients
        dzy = y_hat_t - y
        dWy = dzy @ at.T
        dby = np.sum(dzy, axis=1, keepdims=True)

        # Add the gradient flowing back from the next timestep
        dat = parameters["Wy"].T @ dzy + dat
        dcct = dat * ut * (1 - cct ** 2)  # candidate value gradient (through tanh)
        dut = dat * (cct - at_prev) * ut * (1 - ut)
        dat_prev = dat * (1 - ut)

        dcct_ra_x = parameters["Wc"].T @ dcct
        dcct_r_at_prev = dcct_ra_x[:n_a, :]
        dcct_xt = dcct_ra_x[n_a:, :]
        drt = (dcct_r_at_prev * at_prev) * rt * (1 - rt)

        concat = np.concatenate((at_prev, xt), axis=0)
        dWc = dcct @ np.concatenate((rt * at_prev, xt), axis=0).T
        dbc = np.sum(dcct, axis=1, keepdims=True)
        dWr = drt @ concat.T
        dbr = np.sum(drt, axis=1, keepdims=True)
        dWu = dut @ concat.T
        dbu = np.sum(dut, axis=1, keepdims=True)

        dat_prev = (
            dat_prev
            + dcct_r_at_prev * rt
            + parameters["Wr"][:, :n_a].T @ drt
            + parameters["Wu"][:, :n_a].T @ dut
        )
        dxt = (
            dcct_xt
            + parameters["Wr"][:, n_a:].T @ drt
            + parameters["Wu"][:, n_a:].T @ dut
        )
        gradients = {
            "dWu": dWu, "dbu": dbu, "dWr": dWr, "dbr": dbr, "dWc": dWc, "dbc": dbc,
            "dWy": dWy, "dby": dby, "dat_prev": dat_prev, "dxt": dxt,
        }
        return gradients

    def backward(self, X, Y, parameters, caches):
        """
        Implements the backward pass of the GRU network.

        Parameters
        ----------
        X: (ndarray (n_x, m, T_x)) - input data for each time step
        Y: (ndarray (n_y, m, T_x)) - true labels for each time step
        parameters: (dict) - dictionary containing the weights and biases of the GRU network
            Wu: (ndarray (n_a, n_a + n_x)) - weights of the update gate
            bu: (ndarray (n_a, 1)) - biases of the update gate
            Wr: (ndarray (n_a, n_a + n_x)) - weights of the reset gate
            br: (ndarray (n_a, 1)) - biases of the reset gate
            Wc: (ndarray (n_a, n_a + n_x)) - weights of the candidate value
            bc: (ndarray (n_a, 1)) - biases of the candidate value
            Wy: (ndarray (n_y, n_a)) - weights of the output layer
            by: (ndarray (n_y, 1)) - biases of the output layer
        caches: (list) - values needed for the backward pass

        Returns
        -------
        gradients: (dict) - dictionary containing the gradients of the weights and biases of the GRU network
            dWu: (ndarray (n_a, n_a + n_x)) - gradients of the weights of the update gate
            dbu: (ndarray (n_a, 1)) - gradients of the biases of the update gate
            dWr: (ndarray (n_a, n_a + n_x)) - gradients of the weights of the reset gate
            dbr: (ndarray (n_a, 1)) - gradients of the biases of the reset gate
            dWc: (ndarray (n_a, n_a + n_x)) - gradients of the weights of the candidate value
            dbc: (ndarray (n_a, 1)) - gradients of the biases of the candidate value
            dWy: (ndarray (n_y, n_a)) - gradients of the weights of the output layer
            dby: (ndarray (n_y, 1)) - gradients of the biases of the output layer
        """

        n_x, m, T_x = X.shape
        a1, a0, u1, r1, cc1, x1, y_hat_1, zy1 = caches[0]
        Wu, Wr, Wc, Wy = parameters["Wu"], parameters["Wr"], parameters["Wc"], parameters["Wy"]
        bu, br, bc, by = parameters["bu"], parameters["br"], parameters["bc"], parameters["by"]

        gradients = {
            "dWu": np.zeros_like(Wu), "dbu": np.zeros_like(bu),
            "dWr": np.zeros_like(Wr), "dbr": np.zeros_like(br),
            "dWc": np.zeros_like(Wc), "dbc": np.zeros_like(bc),
            "dWy": np.zeros_like(Wy), "dby": np.zeros_like(by),
        }
        dat = np.zeros_like(a0)
        for t in reversed(range(T_x)):
            grads = self.cell_backward(Y[:, :, t], dat, caches[t], parameters)
            gradients["dWu"] += grads["dWu"]
            gradients["dbu"] += grads["dbu"]
            gradients["dWr"] += grads["dWr"]
            gradients["dbr"] += grads["dbr"]
            gradients["dWc"] += grads["dWc"]
            gradients["dbc"] += grads["dbc"]
            gradients["dWy"] += grads["dWy"]
            gradients["dby"] += grads["dby"]
            dat = grads["dat_prev"]
        return gradients
```
Putting It All Together
The following code implements a complete training step. First, we pass the training data through forward propagation and compute the loss, then run backward propagation to obtain the gradients. To prevent exploding gradients, we clip the gradients and then use them to update the parameters. This completes one training iteration.
```python
class GRU:
    def optimize(self, X, Y, a_prev, parameters, learning_rate, clip_value):
        """
        Implements the forward and backward pass of the GRU network.

        Parameters
        ----------
        X: (ndarray (n_x, m, T_x)) - input data for each time step
        Y: (ndarray (n_y, m, T_x)) - true labels for each time step
        a_prev: (ndarray (n_a, m)) - initial hidden state
        parameters: (dict) - dictionary containing the weights and biases of the GRU network
            Wu: (ndarray (n_a, n_a + n_x)) - weights of the update gate
            bu: (ndarray (n_a, 1)) - biases of the update gate
            Wr: (ndarray (n_a, n_a + n_x)) - weights of the reset gate
            br: (ndarray (n_a, 1)) - biases of the reset gate
            Wc: (ndarray (n_a, n_a + n_x)) - weights of the candidate value
            bc: (ndarray (n_a, 1)) - biases of the candidate value
            Wy: (ndarray (n_y, n_a)) - weights of the output layer
            by: (ndarray (n_y, 1)) - biases of the output layer
        learning_rate: (float) - learning rate
        clip_value: (float) - maximum value to clip the gradients

        Returns
        -------
        at: (ndarray (n_a, m)) - hidden state for the last time step
        loss: (float) - the cross-entropy loss
        """

        A, Y_hat, caches = self.forward(X, a_prev, parameters)
        loss = self.compute_loss(Y_hat, Y)
        gradients = self.backward(X, Y, parameters, caches)
        gradients = self.clip(gradients, clip_value)
        self.update_parameters(parameters, gradients, learning_rate)
        at = A[:, :, -1]
        return at, loss
```
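The optimize method also calls clip and update_parameters, which are not listed in this article. A minimal sketch of what they could look like (the exact implementation may differ) is as follows:

```python
class GRU:
    def clip(self, gradients, clip_value):
        """Clips each gradient element-wise to the range [-clip_value, clip_value]."""
        for key in ("dWu", "dbu", "dWr", "dbr", "dWc", "dbc", "dWy", "dby"):
            np.clip(gradients[key], -clip_value, clip_value, out=gradients[key])
        return gradients

    def update_parameters(self, parameters, gradients, learning_rate):
        """Applies one gradient descent step to every parameter."""
        for param in ("Wu", "bu", "Wr", "br", "Wc", "bc", "Wy", "by"):
            parameters[param] -= learning_rate * gradients["d" + param]
```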
Example
Next, we design the GRU as a character-level language model. The training material is a passage from Shakespeare. The model is trained one character at a time, so the sequence length equals the length of the input text, and each character is encoded with one-hot encoding. Please refer to the following article for the details of this part, because this article and the following article use the same example.
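For illustration only (this helper is not part of the article's code; the actual preprocessing lives inside the train method), a string could be turned into the one-hot tensor that forward expects roughly like this:

```python
def one_hot_encode(text, char_to_idx):
    """Encodes a string as a one-hot tensor of shape (vocab_size, 1, len(text))."""
    vocab_size = len(char_to_idx)
    X = np.zeros((vocab_size, 1, len(text)))
    for t, ch in enumerate(text):
        X[char_to_idx[ch], 0, t] = 1.0
    return X
```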
An example of using this GRU is as follows:
```python
if __name__ == "__main__":
    with open("shakespeare.txt", "r") as file:
        text = file.read()

    # Build the character vocabulary and the lookup tables.
    chars = sorted(list(set(text)))
    vocab_size = len(chars)
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}

    # Create the GRU and train it on the text as a character-level language model.
    gru = GRU(64, vocab_size, vocab_size)
    losses = gru.train(text, char_to_idx, num_iterations=100, learning_rate=0.01, clip_value=5)

    # Generate 100 characters starting from "T".
    generated_text = gru.sample("T", char_to_idx, idx_to_char, num_chars=100)
    print(generated_text)
```
Conclusion
GRU learns long-term dependencies better than a standard RNN, while having a simpler structure and lower computational cost than LSTM. However, because of its simpler structure, GRU may not capture long-term dependencies as well as LSTM. Therefore, we have to decide whether to use LSTM or GRU based on the usage scenario and the length of the sequences.
Reference
- Andrew Ng, Deep Learning Specialization, Coursera.