Machine Learning Glossary: Sequence Models

This page contains Sequence Models glossary terms.




bigram


An N-gram in which N=2.


exploding gradient problem


The tendency for gradients in deep neural networks (especially recurrent neural networks) to become surprisingly steep (high). Steep gradients often cause very large updates to the weights of each node in a deep neural network.

Models suffering from the exploding gradient problem become difficult or impossible to train. Gradient clipping can mitigate this problem.

Compare to vanishing gradient problem.
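
As a rough illustration of why gradients can explode in recurrent networks, the sketch below backpropagates through a toy linear recurrence h_t = w * h_(t-1). The scalar weight w, the timestep counts, and the function name are assumptions made for this example only; real networks use weight matrices and nonlinear activations.

```python
def gradient_through_time(w: float, timesteps: int) -> float:
    """Return d h_T / d h_0 for the toy recurrence h_t = w * h_(t-1).

    Each step backward multiplies the gradient by w, so the result
    equals w ** timesteps.
    """
    grad = 1.0
    for _ in range(timesteps):
        grad *= w  # chain rule: one factor of w per timestep
    return grad


# With |w| > 1, the gradient grows exponentially with sequence length.
for steps in (10, 50, 100):
    print(steps, gradient_through_time(1.5, steps))
```

In this simplified setting, any weight with magnitude above 1 produces a gradient that grows without bound as the sequence gets longer.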


forget gate


The portion of a Long Short-Term Memory cell that regulates the flow of information through the cell. Forget gates maintain context by deciding which information to discard from the cell state.
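
A minimal sketch of how a forget gate might act on the cell state, assuming the common LSTM formulation in which the gate is a sigmoid activation between 0 and 1 that is multiplied elementwise into the previous cell state. In a real cell the gate's pre-activations are computed from the current input and the previous hidden state; here they are hand-picked, and all names are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def apply_forget_gate(prev_cell_state, gate_preactivations):
    """Scale each cell-state entry by its forget-gate activation."""
    gate = [sigmoid(z) for z in gate_preactivations]
    return [f * c for f, c in zip(gate, prev_cell_state)]


prev_cell_state = [2.0, -1.0, 0.5]
# Large positive pre-activation -> gate near 1 (keep the entry);
# large negative -> gate near 0 (discard it); zero -> gate of 0.5.
gate_preactivations = [10.0, -10.0, 0.0]
new_state = apply_forget_gate(prev_cell_state, gate_preactivations)
print(new_state)
```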


gradient clipping


A commonly used mechanism to mitigate the exploding gradient problem by artificially limiting (clipping) the maximum value of gradients when using gradient descent to train a model.
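
The sketch below shows two common ways to clip, under the assumption that gradients are held in a plain Python list. Clipping by value limits each component separately, while clipping by norm rescales the whole vector; the function names and the thresholds are illustrative, not a specific library's API.

```python
import math


def clip_by_value(grads, limit):
    """Clamp every gradient component to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


grads = [3.0, -4.0]                # L2 norm is 5.0
print(clip_by_value(grads, 1.0))   # each component limited separately
print(clip_by_norm(grads, 1.0))    # direction preserved, norm reduced
```

Clipping by norm keeps the direction of the update, which is one reason it is often preferred; clipping by value is simpler but can change the direction.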


Long Short-Term Memory (LSTM)


A type of cell in a recurrent neural network used to process sequences of data in applications such as handwriting recognition, machine translation, and image captioning. LSTMs address the vanishing gradient problem that occurs when training RNNs on long data sequences. They do so by maintaining history in an internal memory state that is updated from new input and from context passed along by previous cells in the RNN.
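
To make the internal memory state concrete, here is a sketch of one LSTM timestep using the commonly cited gate equations, reduced to scalars so it fits in a few lines. The parameter layout, gate names, and hand-picked weights are assumptions for illustration; a trained model would learn the weights, and real cells operate on vectors.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One timestep of a scalar LSTM cell.

    params maps each gate name to (w_x, w_h, b): input weight,
    recurrent weight, and bias.
    """
    def preact(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(preact("forget"))        # how much old memory to keep
    i = sigmoid(preact("input"))         # how much new content to write
    o = sigmoid(preact("output"))        # how much memory to expose
    g = math.tanh(preact("candidate"))   # proposed new memory content

    c = f * c_prev + i * g               # updated internal memory state
    h = o * math.tanh(c)                 # output passed to the next timestep
    return h, c


# Hypothetical, hand-picked weights for illustration only.
params = {
    "forget": (0.5, 0.5, 1.0),
    "input": (0.5, 0.5, 0.0),
    "output": (0.5, 0.5, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5, 0.0]:
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

Because the cell state is updated additively (old memory scaled by the forget gate plus gated new content), gradients can flow across many timesteps without being squashed at every step.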



LSTM


Abbreviation for Long Short-Term Memory.




N-gram


An ordered sequence of N words. For example, truly madly is a 2-gram. Because order is relevant, madly truly is a different 2-gram than truly madly.

N    Name(s) for this kind of N-gram    Examples
2    bigram or 2-gram                   to go, go to, eat lunch, eat dinner
3    trigram or 3-gram                  ate too much, three blind mice, the bell tolls
4    4-gram                             walk in the park, dust in the wind, the boy ate lentils
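
As a small illustration of the definition, the following sketch extracts every N-gram from a list of words. The helper name ngrams and the choice to represent each N-gram as a tuple are assumptions for this example.

```python
def ngrams(words, n):
    """Return every ordered run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "the boy ate lentils".split()
print(ngrams(words, 2))  # all bigrams
print(ngrams(words, 3))  # all trigrams

# Order matters: these two bigrams are not equal.
print(("truly", "madly") == ("madly", "truly"))
```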

Many natural language understanding models rely on N-grams to predict the next word that the user will type or say. For example, suppose a user typed three blind. An NLU model based on trigrams would likely predict that the user will next type mice.
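
The prediction described above can be sketched with simple counting: record which word follows each pair of words, then return the most frequent follower. This is only a toy, assuming a tiny made-up corpus and hypothetical function names; production models use far larger corpora and smoothing.

```python
from collections import Counter, defaultdict


def train_trigram_model(sentences):
    """Count which word follows each pair of consecutive words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for first, second, third in zip(words, words[1:], words[2:]):
            counts[(first, second)][third] += 1
    return counts


def predict_next(counts, first, second):
    """Return the most frequent follower of (first, second), or None if unseen."""
    followers = counts.get((first, second))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Tiny made-up corpus for illustration only.
corpus = [
    "three blind mice see how they run",
    "three blind mice ran up the clock",
    "three blind men described an elephant",
]
model = train_trigram_model(corpus)
print(predict_next(model, "three", "blind"))
```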

Contrast N-grams with a bag of words, which is an unordered set of words.


recurrent neural network


A neural network that is intentionally run multiple times, where parts of each run feed into the next run. Specifically, hidden layers from the previous run provide part of the input to the same hidden layer in the next run. Recurrent neural networks are particularly useful for evaluating sequences, so that the hidden layers can learn from previous runs of the neural network on earlier parts of the sequence.

For example, the following figure shows a recurrent neural network that runs four times. Notice that the values learned in the hidden layers from the first run become part of the input to the same hidden layers in the second run. Similarly, the values learned in the hidden layers on the second run become part of the input to the same hidden layers in the third run. In this way, the recurrent neural network gradually trains on and predicts the meaning of the entire sequence rather than just the meaning of individual words.

[Figure: An RNN that runs four times to process four input words.]
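
The four runs described above can be sketched as a loop in which the hidden value from one run feeds into the next. This sketch assumes a single hidden unit with a tanh activation and made-up numeric inputs standing in for encoded words; the function name and weights are illustrative.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a one-unit recurrent layer over a sequence.

    The same weights are reused on every run; only the hidden state changes.
    """
    h = h0
    hidden_states = []
    for x in inputs:
        # The previous run's hidden value is part of this run's input.
        h = math.tanh(w_x * x + w_h * h + b)
        hidden_states.append(h)
    return hidden_states


# Four inputs standing in for four encoded words (made-up numbers).
inputs = [0.5, -1.0, 0.25, 1.0]
states = rnn_forward(inputs, w_x=1.0, w_h=0.8, b=0.0)
print(states)

# Changing only the first input also changes the final hidden state.
changed = rnn_forward([0.9] + inputs[1:], w_x=1.0, w_h=0.8, b=0.0)
print(changed[-1] != states[-1])
```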



RNN


Abbreviation for recurrent neural networks.


sequence model


A model whose inputs have a sequential dependence. For example, predicting the next video watched from a sequence of previously watched videos.
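
One way to frame the video example as training data is to pair each prefix of the watch history with the item that came next. The sketch below assumes a history stored as a Python list; the video names and the helper function are hypothetical.

```python
def next_item_examples(history):
    """Turn one ordered history into (previous items, next item) pairs."""
    return [(history[:i], history[i]) for i in range(1, len(history))]


# Hypothetical watch history, oldest first.
watched = ["intro_to_ml", "linear_regression", "neural_networks", "rnn_basics"]
for context, target in next_item_examples(watched):
    print(context, "->", target)
```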




timestep


One "unrolled" cell within a recurrent neural network. For example, the following figure shows three timesteps (labeled with the subscripts t-1, t, and t+1):

[Figure: Three timesteps in a recurrent neural network. The output of the first timestep becomes input to the second timestep. The output of the second timestep becomes input to the third timestep.]



trigram


An N-gram in which N=3.


vanishing gradient problem


The tendency for the gradients of early hidden layers of some deep neural networks to become surprisingly flat (low). Increasingly lower gradients result in increasingly smaller changes to the weights on nodes in a deep neural network, leading to little or no learning. Models suffering from the vanishing gradient problem become difficult or impossible to train. Long Short-Term Memory cells address this issue.

Compare to exploding gradient problem.
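
A rough numeric sketch of why early-layer gradients can shrink: during backpropagation each layer contributes a multiplicative factor, and for a sigmoid activation that factor's derivative part never exceeds 0.25. The toy below assumes a chain of one-unit sigmoid layers with identical weights and pre-activations, which is a simplification chosen for illustration.

```python
import math


def sigmoid_derivative(z: float) -> float:
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # peaks at 0.25 when z = 0


def gradient_scale(num_layers: int, z: float = 0.0, weight: float = 1.0) -> float:
    """Product of the per-layer factors weight * sigmoid'(z)."""
    return (weight * sigmoid_derivative(z)) ** num_layers


# The factor reaching the earliest layer shrinks exponentially with depth.
for layers in (1, 5, 10, 20):
    print(layers, gradient_scale(layers))
```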