Basic Concepts
| Term | Definition |
| --- | --- |
| State | The agent's state vector. |
| State Space | The set of all possible states. |
| State Transition | The change from one state to another. |
| Action | An action the agent can take. |
| Policy | A policy maps the environment's state s to an action a, i.e. π(a∣s) is the probability of choosing action a in state s. |
| Reward | A real number received after taking an action. |
| Trajectory | A state-action-reward chain. |
| Return | The sum of all rewards collected along a trajectory. |
| Discounted return | A return in which rewards farther in the future are scaled down by a discount factor (see the example after this table). |
| Episode | When interacting with the environment following a policy, the agent may stop at some terminal state. The resulting trajectory is called an episode (or a trial). |
| Episodic vs. Continuing Tasks | Tasks that terminate after a finite number of steps vs. tasks that go on indefinitely. |
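As a small, made-up illustration of the return and discounted return defined above (the trajectory, rewards, and γ = 0.9 below are assumptions, not taken from these notes):

```python
# Sketch: return vs. discounted return along one hypothetical trajectory.
rewards = [0, 0, 1, 1]   # rewards collected step by step (made up)
gamma = 0.9              # discount factor (made up)

# Return: the plain sum of all rewards along the trajectory.
ret = sum(rewards)

# Discounted return: the reward at step t is weighted by gamma**t,
# so rewards farther in the future count less.
discounted_ret = sum(gamma**t * r for t, r in enumerate(rewards))

print(ret)             # 2
print(discounted_ret)  # 0.81 + 0.729 = 1.539
```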
MDP (Markov Decision Process)
How do we compute the return?
Bootstrapping
v: the expected return starting from the current state (the state value)
r: the immediate reward
γ: the discount factor
P: the state transition matrix
Note that the state values are defined in terms of the state values themselves!
the Bellman equation (for this specific deterministic problem)
v = r + γPv
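A minimal numerical sketch of this equation, assuming a tiny deterministic 3-state chain with made-up rewards and γ = 0.9 (none of these numbers come from the notes above). It solves v = r + γPv both in closed form, v = (I − γP)⁻¹r, and by bootstrapping, i.e. repeatedly plugging the current estimate of v back into the right-hand side:

```python
import numpy as np

gamma = 0.9
# Hypothetical deterministic transitions: s0 -> s1 -> s2 -> s2 (s2 loops on itself).
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
# Hypothetical immediate reward obtained when leaving each state.
r = np.array([0.0, 1.0, 0.5])

# Closed form: v = (I - gamma * P)^(-1) r
v_closed = np.linalg.solve(np.eye(3) - gamma * P, r)

# Bootstrapping: start from zeros and iterate v <- r + gamma * P v until it converges.
v = np.zeros(3)
for _ in range(200):
    v = r + gamma * P @ v

print(v_closed)  # [4.95, 5.5, 5.0]
print(v)         # converges to the same state values
```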