Date: Dec 22, 2024 11:53 AM

Basic Concepts

State
The agent's state vector
State Space
The set of all possible states
State Transition
The move from one state to another after taking an action
Action
An action taken by the agent
Policy
A policy is a function mapping each environment state s to an action a; π(a∣s) gives the probability of choosing action a in state s
Reward
A real number received after taking an action
Trajectory
A trajectory is a state-action-reward chain
Return
The return of a trajectory is the sum of all rewards collected along it
Discounted return
Rewards further in the future are weighted by a discount factor γ
Episode
When interacting with the environment following a policy, the agent may stop at some terminal state. The resulting trajectory is called an episode (or a trial).
Episodic Tasks vs Continuing Tasks
Tasks that end in a terminal state (finite horizon) vs tasks that continue indefinitely (infinite horizon)
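The discounted return defined above can be computed by working backwards along a trajectory, so each earlier step adds one more factor of γ. A minimal sketch (the reward sequence and γ below are made-up values for illustration):

```python
# Discounted return of a trajectory: G = r_0 + γ·r_1 + γ²·r_2 + ...

def discounted_return(rewards, gamma):
    """Sum the rewards of a trajectory, decaying later rewards by gamma."""
    g = 0.0
    # Iterate backwards, using the recursion G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0]  # reward collected at each step
print(discounted_return(rewards, gamma=0.5))  # 0 + 0.5*0 + 0.25*1 = 0.25
```

The backward recursion G_t = r_t + γG_{t+1} is the same bootstrapping idea that reappears in the Bellman equation below.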

MDP (Markov Decision Process)


How do we compute the return?

Bootstrapping

v: the expected return starting from the current state (the state value)
r: the reward value
γ: the discount factor
P: the state transition matrix
The state values appear on both sides of the equation — each state value is defined in terms of other state values!

The Bellman equation (for this specific deterministic problem)

v = r + γPv
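Because v appears on both sides of v = r + γPv, it can be found by fixed-point (bootstrapping) iteration: start from any guess and repeatedly apply v ← r + γPv. A minimal sketch, using a made-up 2-state chain as the example problem:

```python
# Solve v = r + γ P v by repeatedly bootstrapping v from itself.

def bellman_solve(r, P, gamma, iters=1000):
    """Fixed-point iteration v <- r + gamma * P @ v (plain Python lists)."""
    n = len(r)
    v = [0.0] * n  # any initial guess converges for gamma < 1
    for _ in range(iters):
        v = [r[i] + gamma * sum(P[i][j] * v[j] for j in range(n))
             for i in range(n)]
    return v

# Hypothetical chain: state 0 always moves to state 1; state 1 stays put.
P = [[0.0, 1.0],
     [0.0, 1.0]]
r = [0.0, 1.0]  # reward received when leaving each state
v = bellman_solve(r, P, gamma=0.5)
# Analytically: v1 = 1 / (1 - 0.5) = 2, and v0 = 0 + 0.5 * v1 = 1
print(v)
```

For small problems one can instead solve the linear system (I − γP)v = r directly; the iteration above is the bootstrapping view, which scales to settings where P is unknown and must be sampled.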
 