# Machine Learning Glossary: Reinforcement Learning

This page contains Reinforcement Learning glossary terms. For all glossary terms,
[click here](/machine-learning/glossary).

A
---

action
------

#rl

In [**reinforcement learning**](#reinforcement_learning),
the mechanism by which the [**agent**](#agent)
transitions between [**states**](#state) of the
[**environment**](#environment). The agent chooses the action by using a
[**policy**](#policy).

agent
-----

#rl

In [**reinforcement learning**](#reinforcement_learning),
the entity that uses a
[**policy**](#policy) to maximize the expected [**return**](#return) gained from
transitioning between [**states**](#state) of the
[**environment**](#environment).

More generally, an agent is software that autonomously plans and executes a
series of actions in pursuit of a goal, with the ability to adapt to changes
in its environment. For example, an [**LLM**](/machine-learning/glossary#LLM)-based agent might use an
LLM to generate a plan, rather than applying a reinforcement learning policy.

B
---

Bellman equation
----------------

#rl

In reinforcement learning, the following identity satisfied by the optimal
[**Q-function**](#q-function):

$$Q(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s'|s,a} \max_{a'} Q(s', a')$$

[**Reinforcement learning**](#reinforcement_learning) algorithms apply this
identity to create [**Q-learning**](#q-learning) using the following update
rule:

$$Q(s,a) \gets Q(s,a) + \alpha \left[ r(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Beyond reinforcement learning, the Bellman equation has applications to
dynamic programming. See the
[Wikipedia entry for Bellman equation](https://wikipedia.org/wiki/Bellman_equation).
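The update rule maps directly to code. The following is a minimal sketch of a single
tabular Q-learning update, assuming a small discrete environment; the table sizes,
learning rate `ALPHA`, and discount factor `GAMMA` are illustrative values, not part
of this definition.

```python
import numpy as np

# Illustrative sizes; a real environment defines these.
NUM_STATES, NUM_ACTIONS = 16, 4
ALPHA, GAMMA = 0.1, 0.99   # assumed learning rate and discount factor

# Q-table: one value per (state, action) pair.
q_table = np.zeros((NUM_STATES, NUM_ACTIONS))

def q_learning_update(state, action, reward, next_state):
    """Apply one Bellman-equation update to the Q-table."""
    # r(s,a) + gamma * max_a' Q(s', a')
    td_target = reward + GAMMA * np.max(q_table[next_state])
    q_table[state, action] += ALPHA * (td_target - q_table[state, action])
```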
C
---

critic
------

#rl

Synonym for [**Deep Q-Network**](#deep_q-network).

D
---

Deep Q-Network (DQN)
--------------------

#rl

In [**Q-learning**](#q-learning), a deep [**neural network**](/machine-learning/glossary#neural_network)
that predicts [**Q-functions**](#q-function).

**Critic** is a synonym for Deep Q-Network.

DQN
---

#rl

Abbreviation for [**Deep Q-Network**](#deep_q-network).

E
---

environment
-----------

#rl

In reinforcement learning, the world that contains the [**agent**](#agent)
and allows the agent to observe that world's [**state**](#state). For example,
the represented world can be a game like chess, or a physical world like a
maze. When the agent applies an [**action**](#action) to the environment,
the environment transitions between states.

episode
-------

#rl

In reinforcement learning, each of the repeated attempts by the
[**agent**](#agent) to learn an [**environment**](#environment).

epsilon greedy policy
---------------------

#rl

In reinforcement learning, a [**policy**](#policy) that follows a
[**random policy**](#random_policy) with epsilon probability and a
[**greedy policy**](#greedy_policy) otherwise. For example, if epsilon is
0.9, then the policy follows a random policy 90% of the time and a greedy
policy 10% of the time.

Over successive episodes, the algorithm reduces epsilon's value in order
to shift from following a random policy to following a greedy policy. By
shifting the policy, the agent first randomly explores the environment and
then greedily exploits the results of random exploration.

experience replay
-----------------

#rl

In reinforcement learning, a [**DQN**](#deep_q-network) technique used to
reduce temporal correlations in training data. The [**agent**](#agent)
stores state transitions in a [**replay buffer**](#replay_buffer), and then
samples transitions from the replay buffer to create training data.
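As an illustration of this technique, the sketch below shows one common way to
implement a replay buffer with uniform random sampling; the capacity and batch
size are assumed values.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores state transitions and samples them uniformly at random."""

    def __init__(self, capacity=10_000):  # assumed capacity
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions gathered by the agent.
        return random.sample(self.buffer, batch_size)
```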
G
---

greedy policy
-------------

#rl

In reinforcement learning, a [**policy**](#policy) that always chooses the
action with the highest expected [**return**](#return).

M
---

Markov decision process (MDP)
-----------------------------

#rl

A graph representing the decision-making model in which decisions
(or [**actions**](#action)) are taken to navigate a sequence of
[**states**](#state) under the assumption that the
[**Markov property**](#Markov_property) holds. In
[**reinforcement learning**](#reinforcement_learning), these transitions
between states return a numerical [**reward**](#reward).

Markov property
---------------

#rl

A property of certain [**environments**](#environment), where state
transitions are entirely determined by information implicit in the
current [**state**](#state) and the agent's [**action**](#action).

P
---

policy
------

#rl

In reinforcement learning, an [**agent's**](#agent) probabilistic mapping
from [**states**](#state) to [**actions**](#action).

Q
---

Q-function
----------

#rl

In [**reinforcement learning**](#reinforcement_learning), the function that
predicts the expected [**return**](#return) from taking an
[**action**](#action) in a
[**state**](#state) and then following a given [**policy**](#policy).

Q-function is also known as the **state-action value function**.

Q-learning
----------

#rl

In [**reinforcement learning**](#reinforcement_learning), an algorithm that
allows an [**agent**](#agent)
to learn the optimal [**Q-function**](#q-function) of a
[**Markov decision process**](#markov_decision_process) by applying the
[**Bellman equation**](#bellman_equation). The Markov decision process models
an [**environment**](#environment).

R
---

random policy
-------------

#rl

In [**reinforcement learning**](#reinforcement_learning), a
[**policy**](#policy) that chooses an
[**action**](#action) at random.

reinforcement learning (RL)
---------------------------

#rl

A family of algorithms that learn an optimal [**policy**](#policy), whose goal
is to maximize [**return**](#return) when interacting with
an [**environment**](#environment).
For example, the ultimate reward of most games is victory.
Reinforcement learning systems can become expert at playing complex
games by evaluating sequences of previous game moves that ultimately
led to wins and sequences that ultimately led to losses.

Reinforcement Learning from Human Feedback (RLHF)
-------------------------------------------------

#generativeAI
#rl

Using feedback from human raters to improve the quality of a model's responses.
For example, an RLHF mechanism can ask users to rate the quality of a model's
response with a 👍 or 👎 emoji. The system can then adjust its future responses
based on that feedback.

replay buffer
-------------

#rl

In [**DQN**](#deep_q-network)-like algorithms, the memory used by the agent
to store state transitions for use in
[**experience replay**](#experience_replay).

return
------

#rl

In reinforcement learning, given a certain policy and a certain state, the
return is the sum of all [**rewards**](#reward) that the [**agent**](#agent)
expects to receive when following the [**policy**](#policy) from the
[**state**](#state) to the end of the [**episode**](#episode). The agent
accounts for the delayed nature of expected rewards by discounting rewards
according to the state transitions required to obtain the reward.

Therefore, if the discount factor is \(\gamma\), and \(r_0, \ldots, r_{N-1}\)
denote the rewards until the end of the episode, then the return calculation
is as follows:

$$\text{Return} = r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots + \gamma^{N-1} r_{N-1}$$
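As a minimal sketch, the same calculation in code, assuming a recorded list of
rewards and an illustrative discount factor:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by gamma per step of delay."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example with assumed values: three rewards and gamma = 0.9.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```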
reward
------

#rl

In reinforcement learning, the numerical result of taking an
[**action**](#action) in a [**state**](#state), as defined by
the [**environment**](#environment).

S
---

state
-----

#rl

In reinforcement learning, the parameter values that describe the current
configuration of the environment, which the [**agent**](#agent) uses to
choose an [**action**](#action).

state-action value function
---------------------------

#rl

Synonym for [**Q-function**](#q-function).

T
---

tabular Q-learning
------------------

#rl

In [**reinforcement learning**](#reinforcement_learning), implementing
[**Q-learning**](#q-learning) by using a table to store the
[**Q-functions**](#q-function) for every combination of
[**state**](#state) and [**action**](#action).

target network
--------------

#rl

In [**Deep Q-learning**](#q-learning), a neural network that is a stable
approximation of the main neural network, where the main neural network
implements either a [**Q-function**](#q-function) or a [**policy**](#policy).
Training the main network on the Q-values predicted by the target network
prevents the feedback loop that occurs when the main network trains on
Q-values predicted by itself, which improves training stability.

termination condition
---------------------

#rl

In [**reinforcement learning**](#reinforcement_learning), the conditions that
determine when an [**episode**](#episode) ends, such as when the agent reaches
a certain state or exceeds a threshold number of state transitions.
For example, in [tic-tac-toe](https://wikipedia.org/wiki/Tic-tac-toe) (also
known as noughts and crosses), an episode terminates either when a player marks
three consecutive spaces or when all spaces are marked.

trajectory
----------

#rl

In [**reinforcement learning**](#reinforcement_learning), a sequence of
[tuples](https://wikipedia.org/wiki/Tuple) that represent
a sequence of [**state**](#state) transitions of the [**agent**](#agent),
where each tuple corresponds to the state, [**action**](#action),
[**reward**](#reward), and next state for a given state transition.
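As a purely illustrative sketch, a trajectory can be stored as a list of such
tuples; the field names, states, actions, and rewards below are made-up example
values, not part of this definition.

```python
from typing import Any, NamedTuple

class Transition(NamedTuple):
    """One step of a trajectory: (state, action, reward, next_state)."""
    state: Any
    action: Any
    reward: float
    next_state: Any

# A short trajectory with made-up states, actions, and rewards.
trajectory = [
    Transition(state=0, action="right", reward=0.0, next_state=1),
    Transition(state=1, action="right", reward=0.0, next_state=2),
    Transition(state=2, action="up", reward=1.0, next_state=3),
]
```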