Learn General World Models

1. Paper: Learning General World Models in a Handful of Reward-Free Deployments

Motivation: build generally capable agents with world models.

• Generalize to novel tasks: world-model (WM) training should not depend on rewards.

• Deploy with little retraining.

Methods outline

Instead of designing an intrinsic reward for the world model, this work proposes a better reward-free exploration policy, one that targets both information gain and diversity. The focus of our work is on how to train π_EXP offline such that it gathers heterogeneous and informative data, which facilitates zero-shot transfer to unknown tasks.

How is it trained? For zero-shot evaluation, we follow [97] and only train the reward head at test time, when labels are provided for our pre-collected data; the reward head is then used to train a behavior policy offline.
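As a rough illustration of the zero-shot step described above, the sketch below fits only a reward head on frozen features once reward labels become available. It assumes a linear head fit by ridge regression on hypothetical latent vectors; `fit_reward_head`, `predict_reward`, and the synthetic data are illustrative names and values, not taken from the paper, whose actual architecture may differ.

```python
import numpy as np

def fit_reward_head(latents, rewards, l2=1e-3):
    """Ridge-regress rewards on frozen latents (assumed linear head, bias folded in)."""
    X = np.hstack([latents, np.ones((latents.shape[0], 1))])  # append bias column
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ rewards)

def predict_reward(weights, latents):
    """Apply the fitted head to new latents."""
    X = np.hstack([latents, np.ones((latents.shape[0], 1))])
    return X @ weights

# Synthetic stand-in for latents of the pre-collected, reward-free data.
rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 8))
true_w = rng.normal(size=8)
rewards = latents @ true_w + 0.5  # labels revealed only at test time

# Only the reward head is trained; the latents (world model) stay fixed.
weights = fit_reward_head(latents, rewards)
max_err = np.abs(predict_reward(weights, latents) - rewards).max()
```

The predicted rewards could then label imagined rollouts for offline behavior-policy training, which is outside the scope of this sketch.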

How to design such an exploration policy?

Objective:

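$$\pi_{\mathrm{EXP}} = \arg\max_{\pi} I\big(d^{\pi}_{M_\psi};\, M_\psi\big) = H\big(d^{\pi}_{M_\psi}\big) - H\big(d^{\pi}_{M_\psi} \mid M_\psi\big)$$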

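Interpretation: when the MDP (the reward function) is unknown, the policy concentrates on the uncertain regions, i.e. it explores; once the reward function is known, the policy tends toward deep exploration, i.e. it walks through the most successful path.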
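To make the information-gain objective concrete, here is a minimal numerical sketch of H(d) − H(d | M) for a discrete outcome distribution. It assumes the posterior over world models M_ψ is approximated by a uniform ensemble of K members; the function names and the toy distributions are hypothetical and chosen only for illustration.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)

def information_gain(member_dists):
    """I(d; M) = H(E_M[d]) - E_M[H(d | M)] for a uniform ensemble.

    member_dists: array of shape (K, N); row k is the outcome distribution
    predicted by ensemble member k.
    """
    marginal = member_dists.mean(axis=0)                # d averaged over models
    conditional = entropy(member_dists, axis=-1).mean() # average per-model entropy
    return entropy(marginal) - conditional

# Members agree: outcomes are noisy, but nothing is learned about the model.
agree = np.array([[0.5, 0.5], [0.5, 0.5]])
# Members disagree: observing the outcome reveals which model is right.
disagree = np.array([[1.0, 0.0], [0.0, 1.0]])

ig_agree = information_gain(agree)
ig_disagree = information_gain(disagree)
```

In this toy case the agreeing ensemble yields zero information gain even though its outcomes are maximally noisy, while the disagreeing ensemble yields about log 2 nats, which is why the objective steers exploration toward model uncertainty rather than plain stochasticity.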

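Going further: a cascading objective. It is first shown that the optimum is attainable; then, relying on submodularity and the guarantee for greedy selection, the problem can be recast as a cascading objective: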

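$$
\begin{aligned}
\pi^{(i)} &= \arg\max_{\tilde{\pi}^{(i)} \in \Pi} I\Big(\prod_{j=1}^{i} \mathbb{P}_{\Phi\,\tilde{\pi}^{(j)}}[M_\psi];\; M_\psi \,\Big|\, \tilde{\pi}^{(j)} = \pi^{(j)}\ \forall j \le i-1\Big) \\
&= H\Big(\prod_{j=1}^{i} \mathbb{P}_{\Phi\,\tilde{\pi}^{(j)}}[M_\psi] \,\Big|\, \tilde{\pi}^{(j)} = \pi^{(j)}\ \forall j \le i-1\Big)
\end{aligned}
$$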

ᵢ