This page contains Reinforcement Learning glossary terms; the full glossary covers all terms.

## A

## action

In **reinforcement learning**,
the mechanism by which the **agent**
transitions between **states** of the
**environment**. The agent chooses the action by using a
**policy**.

## agent

In **reinforcement learning**,
the entity that uses a
**policy** to maximize the expected **return** gained from
transitioning between **states** of the
**environment**.

## B

## Bellman equation

In reinforcement learning, the following identity satisfied by the optimal
**Q-function**:

\[Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s'|s,a} \max_{a'} Q(s', a')\]

**Reinforcement learning** algorithms apply this
identity to create **Q-learning** via the following update rule:

\[Q(s,a) \gets Q(s,a) + \alpha \left[r(s,a) + \gamma \displaystyle\max_{a'} Q(s',a') - Q(s,a) \right] \]

Beyond reinforcement learning, the Bellman equation has applications to dynamic programming. See the Wikipedia entry for Bellman Equation.
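The Q-learning update rule above can be sketched for a tabular Q-function stored as a plain Python dict (a minimal illustration; the function name and table layout are assumptions, not any particular library's API):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').

    Q is a dict keyed by (state, action) tuples; alpha is the learning rate
    and gamma the discount factor.
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

Repeatedly applying this update over observed transitions drives the table toward the fixed point of the Bellman equation.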

## C

## critic

Synonym for **Deep Q-Network**.

## D

## Deep Q-Network (DQN)

In **Q-learning**, a deep **neural network**
that predicts **Q-functions**.

**Critic** is a synonym for Deep Q-Network.

## DQN

Abbreviation for **Deep Q-Network**.

## E

## environment

In reinforcement learning, the world that contains the **agent**
and allows the agent to observe that world's **state**. For example,
the represented world can be a game like chess, or a physical world like a
maze. When the agent applies an **action** to the environment,
the environment transitions between states.

## episode

In reinforcement learning, each of the repeated attempts by the
**agent** to learn an **environment**.

## epsilon greedy policy

In reinforcement learning, a **policy** that either follows a
**random policy** with epsilon probability or a
**greedy policy** otherwise. For example, if epsilon is
0.9, then the policy follows a random policy 90% of the time and a greedy
policy 10% of the time.

Over successive episodes, the algorithm reduces epsilon’s value in order to shift from following a random policy to following a greedy policy. By shifting the policy, the agent first randomly explores the environment and then greedily exploits the results of random exploration.
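The selection step can be sketched in a few lines (a minimal illustration; the function name is an assumption, and `q_values` is a list of estimated returns indexed by action):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Pick a uniformly random action with probability epsilon,
    otherwise the greedy (highest-value) action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Decaying `epsilon` over episodes, as described above, gradually shifts this function from the random branch to the greedy branch.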

## experience replay

In reinforcement learning, a **DQN** technique used to
reduce temporal correlations in training data. The **agent**
stores state transitions in a **replay buffer**, and then
samples transitions from the replay buffer to create training data.
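A minimal sketch of the buffer-and-sample pattern (the class and method names are assumptions for illustration, not a specific library's API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        self.transitions = deque(maxlen=capacity)  # oldest entries are evicted

    def add(self, transition):
        self.transitions.append(transition)

    def sample(self, batch_size):
        # Uniform sampling over the whole buffer breaks up temporal
        # correlations between consecutive transitions.
        return random.sample(list(self.transitions), batch_size)
```

Training batches drawn this way mix transitions from many different episodes, which is the point of the technique.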

## G

## greedy policy

In reinforcement learning, a **policy** that always chooses the
action with the highest expected **return**.

## M

## Markov decision process (MDP)

A graph representing the decision-making model where decisions
(or **actions**) are taken to navigate a sequence of
**states** under the assumption that the
**Markov property** holds. In
**reinforcement learning**, these transitions
between states return a numerical **reward**.

## Markov property

A property of certain **environments**, where state
transitions are entirely determined by information implicit in the
current **state** and the agent’s **action**.

## P

## policy

In reinforcement learning, an **agent's** probabilistic mapping
from **states** to **actions**.
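A probabilistic policy can be pictured as a table of per-state action distributions (the states, actions, and probabilities below are hypothetical):

```python
import random

# Each state maps to a probability distribution over actions.
policy = {
    "start": {"left": 0.2, "right": 0.8},
    "middle": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state):
    """Draw an action from the policy's distribution for the given state."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return random.choices(actions, weights=weights)[0]
```

A deterministic policy is the special case where each state's distribution puts all its probability on one action.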

## Q

## Q-function

In **reinforcement learning**, the function that
predicts the expected **return** from taking an
**action** in a
**state** and then following a given **policy**.

Q-function is also known as **state-action value function**.

## Q-learning

In **reinforcement learning**, an algorithm that
allows an **agent**
to learn the optimal **Q-function** of a
**Markov decision process** by applying the
**Bellman equation**. The Markov decision process models
an **environment**.

## R

## random policy

In **reinforcement learning**, a
**policy** that chooses an
**action** at random.

## reinforcement learning (RL)

A family of algorithms that learn an optimal **policy**, whose goal
is to maximize **return** when interacting with
an **environment**.
For example, the ultimate reward of most games is victory.
Reinforcement learning systems can become expert at playing complex
games by evaluating sequences of previous game moves that ultimately
led to wins and sequences that ultimately led to losses.

## Reinforcement Learning from Human Feedback (RLHF)

Using feedback from human raters to improve the quality of a model's responses. For example, an RLHF mechanism can ask users to rate the quality of a model's response with a 👍 or 👎 emoji. The system can then adjust its future responses based on that feedback.

## replay buffer

In **DQN**-like algorithms, the memory used by the agent
to store state transitions for use in
**experience replay**.

## return

In reinforcement learning, given a certain policy and a certain state, the
return is the sum of all **rewards** that the **agent**
expects to receive when following the **policy** from the
**state** to the end of the **episode**. The agent
accounts for the delayed nature of expected rewards by discounting rewards
according to the state transitions required to obtain the reward.

Therefore, if the discount factor is \(\gamma\), and \(r_0, \ldots, r_{N}\) denote the rewards until the end of the episode, then the return calculation is as follows:

\[\text{Return} = r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots + \gamma^{N} r_{N}\]
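As a quick sketch, the discounted sum of rewards can be computed directly (the function name is an assumption for illustration):

```python
def discounted_return(rewards, gamma):
    """Sum rewards r_0..r_N, discounting reward r_i by gamma**i."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))
```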

## reward

In reinforcement learning, the numerical result of taking an
**action** in a **state**, as defined by
the **environment**.

## S

## state

In reinforcement learning, the parameter values that describe the current
configuration of the environment, which the **agent** uses to
choose an **action**.

## state-action value function

Synonym for **Q-function**.

## T

## tabular Q-learning

In **reinforcement learning**, implementing
**Q-learning** by using a table to store the
**Q-functions** for every combination of
**state** and **action**.
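A minimal end-to-end sketch on a hypothetical 3-state chain environment (the `step` function and all constants are assumptions for illustration). Because Q-learning is off-policy, even a purely random behavior policy suffices to fill in the table:

```python
import random

random.seed(0)  # reproducibility of this sketch

def step(s, a):
    """Toy deterministic chain: action 1 moves right, action 0 moves left.

    Reaching state 2 yields reward 1 and ends the episode.
    """
    s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == 2 else 0.0
    return s_next, reward, s_next == 2

# One table entry for every (state, action) combination.
Q = {(s, a): 0.0 for s in range(3) for a in range(2)}
alpha, gamma = 0.5, 0.9

for _ in range(500):                      # episodes
    s, done = 0, False
    while not done:
        a = random.randrange(2)           # random behavior policy (off-policy)
        s_next, r, done = step(s, a)
        target = r if done else r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
```

After training, the greedy action in every state is 1 (move right): `Q[(1, 1)]` approaches 1 and `Q[(0, 1)]` approaches \(\gamma \cdot 1 = 0.9\).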

## target network

In **Deep Q-learning**, a neural network that is a stable
approximation of the main neural network, where the main neural network
implements either a **Q-function** or a **policy**.
You can then train the main network on the Q-values predicted by the target
network, which prevents the feedback loop that occurs when the main network
trains on Q-values predicted by itself. Avoiding this feedback increases
training stability.
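With parameters represented as plain lists of floats (a simplification; real implementations copy neural-network weights), the two common synchronization schemes look roughly like this. The soft (Polyak) update is used by some algorithms in place of a periodic hard copy; both function names are assumptions:

```python
def hard_update(main):
    """Periodically replace the target parameters with a copy of the main ones."""
    return list(main)

def soft_update(target, main, tau=0.005):
    """Polyak averaging: target <- tau * main + (1 - tau) * target."""
    return [tau * m + (1 - tau) * t for m, t in zip(main, target)]
```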

## termination condition

In **reinforcement learning**, the conditions that
determine when an **episode** ends, such as when the agent reaches
a certain state or exceeds a threshold number of state transitions.
For example, in tic-tac-toe (also
known as noughts and crosses), an episode terminates either when a player marks
three consecutive spaces or when all spaces are marked.

## trajectory

In **reinforcement learning**, a sequence of
tuples that represent
a sequence of **state** transitions of the **agent**,
where each tuple corresponds to the state, **action**,
**reward**, and next state for a given state transition.
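Concretely, a trajectory is often represented as a list of such tuples (the states, actions, and rewards below are hypothetical):

```python
# A three-step trajectory: (state, action, reward, next_state) tuples,
# where each tuple's next_state matches the following tuple's state.
trajectory = [
    (0, "right", 0.0, 1),
    (1, "right", 0.0, 2),
    (2, "right", 1.0, 3),
]

rewards = [r for (_s, _a, r, _s_next) in trajectory]
```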