Machine Learning Glossary: Reinforcement Learning

This page contains Reinforcement Learning glossary terms.




action


In reinforcement learning, the mechanism by which the agent transitions between states of the environment. The agent chooses the action by using a policy.



agent


In reinforcement learning, the entity that uses a policy to maximize the expected return gained from transitioning between states of the environment.

More generally, an agent is software that autonomously plans and executes a series of actions in pursuit of a goal, with the ability to adapt to changes in its environment. For example, LLM-based agents might use the LLM to generate a plan, rather than applying a reinforcement learning policy.


Bellman equation


In reinforcement learning, the following identity satisfied by the optimal Q-function:

\[Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s'|s,a} \max_{a'} Q(s', a')\]

Reinforcement learning algorithms apply this identity to create Q-learning via the following update rule:

\[Q(s,a) \gets Q(s,a) + \alpha \left[r(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \]

Beyond reinforcement learning, the Bellman equation has applications to dynamic programming. See the Wikipedia entry for Bellman equation.
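The update rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `q_update`, the dictionary-based table, the state and action labels, and the default values of `alpha` and `gamma` are all assumptions made for the example.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to the table Q in place; return the new Q(s, a)."""
    # max over a' of Q(s', a'): the best value available from the next state.
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    # Bracketed term of the update rule: target minus current estimate.
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]


# Hypothetical two-state example; unseen (state, action) pairs default to 0.0.
Q = defaultdict(float)
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 2.0
new_value = q_update(Q, "s0", "right", r=1.0, s_next="s1", actions=["left", "right"])
```

Here the learning rate \(\alpha\) controls how far the stored estimate moves toward the target on each update.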




critic


Synonym for Deep Q-Network.


Deep Q-Network (DQN)


In Q-learning, a deep neural network that predicts Q-functions.

Critic is a synonym for Deep Q-Network.



DQN


Abbreviation for Deep Q-Network.




environment


In reinforcement learning, the world that contains the agent and allows the agent to observe that world's state. For example, the represented world can be a game like chess, or a physical world like a maze. When the agent applies an action to the environment, then the environment transitions between states.



episode


In reinforcement learning, each of the repeated attempts by the agent to learn an environment.

epsilon greedy policy


In reinforcement learning, a policy that follows a random policy with probability epsilon and a greedy policy otherwise. For example, if epsilon is 0.9, then the policy follows a random policy 90% of the time and a greedy policy 10% of the time.

Over successive episodes, the algorithm reduces epsilon's value in order to shift from following a random policy to following a greedy policy. By shifting the policy, the agent first randomly explores the environment and then greedily exploits the results of random exploration.
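As a rough sketch of the behavior described above, the following Python picks an action from a list of Q-values and lowers epsilon across episodes. The function names, the exponential decay schedule, and its `start`, `end`, and `decay` values are illustrative assumptions; the glossary does not prescribe a particular schedule.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        # Explore: choose any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Assumed schedule: shrink epsilon geometrically per episode, never below `end`."""
    return max(end, start * decay ** episode)
```

With `epsilon=0.0` the function reduces to a greedy policy, and with `epsilon=1.0` it reduces to a random policy.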

experience replay


In reinforcement learning, a DQN technique used to reduce temporal correlations in training data. The agent stores state transitions in a replay buffer, and then samples transitions from the replay buffer to create training data.
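A minimal sketch of this idea in Python follows, assuming a fixed-capacity buffer that discards the oldest transitions and samples uniformly at random. The class name `ReplayBuffer`, its method names, and the `done` flag stored with each transition are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the temporal order
        # in which the transitions were collected.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Hypothetical usage: store five consecutive transitions, keep the last three.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
```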


greedy policy


In reinforcement learning, a policy that always chooses the action with the highest expected return.


Markov decision process (MDP)


A graph representing the decision-making model where decisions (or actions) are taken to navigate a sequence of states under the assumption that the Markov property holds. In reinforcement learning, these transitions between states return a numerical reward.

Markov property


A property of certain environments, where state transitions are entirely determined by information implicit in the current state and the agent's action.




policy


In reinforcement learning, an agent's probabilistic mapping from states to actions.




Q-function


In reinforcement learning, the function that predicts the expected return from taking an action in a state and then following a given policy.

Q-function is also known as state-action value function.



Q-learning


In reinforcement learning, an algorithm that allows an agent to learn the optimal Q-function of a Markov decision process by applying the Bellman equation. The Markov decision process models an environment.


random policy


In reinforcement learning, a policy that chooses an action at random.

reinforcement learning (RL)


A family of algorithms that learn an optimal policy, whose goal is to maximize return when interacting with an environment. For example, the ultimate reward of most games is victory. Reinforcement learning systems can become expert at playing complex games by evaluating sequences of previous game moves that ultimately led to wins and sequences that ultimately led to losses.

Reinforcement Learning from Human Feedback (RLHF)


Using feedback from human raters to improve the quality of a model's responses. For example, an RLHF mechanism can ask users to rate the quality of a model's response with a 👍 or 👎 emoji. The system can then adjust its future responses based on that feedback.

replay buffer


In DQN-like algorithms, the memory used by the agent to store state transitions for use in experience replay.



return


In reinforcement learning, given a certain policy and a certain state, the return is the sum of all rewards that the agent expects to receive when following the policy from the state to the end of the episode. The agent accounts for the delayed nature of expected rewards by discounting rewards according to the state transitions required to obtain the reward.

Therefore, if the discount factor is \(\gamma\), and \(r_0, \ldots, r_{N}\) denote the rewards until the end of the episode, then the return calculation is as follows:

$$\text{Return} = r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots + \gamma^{N} r_{N}$$
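For a concrete feel of the discounting, here is a small Python sketch that computes the discounted sum for a list of observed rewards. The function name `discounted_return` and the sample rewards are assumptions chosen for the example.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step k by gamma ** k."""
    total = 0.0
    # Work backward so each step needs only one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Rewards 1.0, 0.0, 2.0 with gamma = 0.5: 1.0 + 0.5 * 0.0 + 0.25 * 2.0
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

A discount factor closer to 0 makes the agent favor immediate rewards, while a factor closer to 1 weights distant rewards almost as heavily as immediate ones.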



reward


In reinforcement learning, the numerical result of taking an action in a state, as defined by the environment.




state


In reinforcement learning, the parameter values that describe the current configuration of the environment, which the agent uses to choose an action.

state-action value function


Synonym for Q-function.


tabular Q-learning


In reinforcement learning, implementing Q-learning by using a table to store the value of the Q-function for every combination of state and action.
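The following Python sketch shows one way such a table might be trained. Everything specific here is an assumption for illustration: the five-state chain environment, the reward of 1.0 for reaching the last state, the random tie-breaking, and the hyperparameter values.

```python
import random

N_STATES = 5      # hypothetical chain of states 0..4; state 4 ends the episode
ACTIONS = [0, 1]  # 0 = move left, 1 = move right


def step(state, action):
    """Toy environment: reward 1.0 only on reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy_action(q_row, rng):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q_row)
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, max_steps=100, seed=0):
    rng = random.Random(seed)
    # One row per state, one column per action.
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy_action(q_table[state], rng)
            next_state, reward, done = step(state, action)
            # No future value is added once the episode has terminated.
            target = reward if done else reward + gamma * max(q_table[next_state])
            q_table[state][action] += alpha * (target - q_table[state][action])
            state = next_state
            if done:
                break
    return q_table


q_table = train()
# Greedy action per non-terminal state, read directly from the table.
greedy_actions = [row.index(max(row)) for row in q_table[:-1]]
```

A table like this is practical only when the numbers of states and actions are small enough to enumerate.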

target network


In Deep Q-learning, a neural network that is a stable approximation of the main neural network, where the main neural network implements either a Q-function or a policy. You train the main network on the Q-values predicted by the target network, which prevents the feedback loop that occurs when the main network trains on Q-values that it predicted itself. Avoiding this feedback loop increases training stability.
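The sketch below illustrates the mechanics in plain Python, under loudly stated assumptions: `LinearQ` is a hypothetical one-weight-per-action stand-in for a real neural network, the target network is refreshed by a hard copy every `SYNC_EVERY` steps, and all numeric values are arbitrary. Actual DQN implementations use a deep learning library and may update the target network differently.

```python
import copy


class LinearQ:
    """Hypothetical stand-in for a neural network: Q(s, a) = weights[a] * s."""

    def __init__(self, n_actions):
        self.weights = [0.0] * n_actions

    def predict(self, state):
        return [w * state for w in self.weights]


def td_target(target_net, reward, next_state, done, gamma=0.99):
    """Compute the training target from the frozen target network."""
    if done:
        return reward
    return reward + gamma * max(target_net.predict(next_state))


def train_step(main_net, target_net, transition, lr=0.1):
    state, action, reward, next_state, done = transition
    target = td_target(target_net, reward, next_state, done)
    error = target - main_net.predict(state)[action]
    # Gradient step on the squared error; only the main network changes.
    main_net.weights[action] += lr * error * state


main_net = LinearQ(n_actions=2)
target_net = copy.deepcopy(main_net)
SYNC_EVERY = 3  # assumed sync interval
for step_index in range(1, 7):
    train_step(main_net, target_net, (1.0, 0, 1.0, 1.0, False))
    if step_index % SYNC_EVERY == 0:
        # Hard update: copy the main network's weights into the target network.
        target_net.weights = list(main_net.weights)
```

Between syncs the targets stay fixed even though the main network's weights keep changing, which is the source of the added stability.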

termination condition


In reinforcement learning, the conditions that determine when an episode ends, such as when the agent reaches a certain state or exceeds a threshold number of state transitions. For example, in tic-tac-toe (also known as noughts and crosses), an episode terminates either when a player marks three consecutive spaces or when all spaces are marked.
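The tic-tac-toe example can be written as a short Python check. The board encoding (a list of nine cells holding `"X"`, `"O"`, or `None`) and the function name `is_terminal` are assumptions made for this sketch.

```python
# Index triples for the three rows, three columns, and two diagonals.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),
    (0, 3, 6), (1, 4, 7), (2, 5, 8),
    (0, 4, 8), (2, 4, 6),
]


def is_terminal(board):
    """Return True if a player has three in a line or every space is marked."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return True
    return all(cell is not None for cell in board)
```

A training loop would typically call a check like this after every action and end the episode when it returns `True`.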



trajectory


In reinforcement learning, a sequence of tuples that represent a sequence of state transitions of the agent, where each tuple corresponds to the state, action, reward, and next state for a given state transition.
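As an illustration, the tuples can be collected in Python while an agent steps through an environment. The `Transition` type, the `rollout` helper, and the toy counting environment are hypothetical names introduced only for this sketch.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    state: int
    action: int
    reward: float
    next_state: int


def rollout(policy, step_fn, start_state, max_steps):
    """Follow `policy` from `start_state` and record one tuple per transition."""
    trajectory = []
    state = start_state
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = step_fn(state, action)
        trajectory.append(Transition(state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


def toy_step(state, action):
    """Toy environment: the state counts upward and the episode ends at 3."""
    next_state = state + action
    done = next_state >= 3
    return next_state, (1.0 if done else 0.0), done


trajectory = rollout(lambda state: 1, toy_step, start_state=0, max_steps=10)
```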