Policy Gradient

Posted by xuepro on June 10, 2018

The general case is that we have an expression of the form $E_{x \sim p(x;\theta)}[f(x)]$, i.e. the expectation of some scalar-valued score function $f(x)$ under some probability distribution $p(x;\theta)$ parameterized by some $\theta$. Hint hint, $f(x)$ will become our reward function (or advantage function more generally) and $p(x)$ will be our policy network, which is really a model for $p(a \mid I)$, giving a distribution over actions $a$ for any image $I$. We are then interested in how we should shift the distribution (through its parameters $\theta$) to increase the scores of its samples, as judged by $f$ (i.e. how do we change the network's parameters so that action samples get higher rewards). The answer is the score function identity: $\nabla_\theta E_{x \sim p(x;\theta)}[f(x)] = E_{x \sim p(x;\theta)}[f(x) \nabla_\theta \log p(x;\theta)]$.
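The gradient of such an expectation can itself be written as an expectation. Here is a short derivation, assuming a discrete sample $x$ (so the expectation is a sum) and writing $f(x)$ for the score function and $p(x;\theta)$ for the distribution; these symbols are assumed notation:

$$
\begin{aligned}
\nabla_\theta E_{x \sim p(x;\theta)}[f(x)]
&= \nabla_\theta \sum_x p(x;\theta)\, f(x) \\
&= \sum_x \nabla_\theta p(x;\theta)\, f(x) \\
&= \sum_x p(x;\theta)\, \frac{\nabla_\theta p(x;\theta)}{p(x;\theta)}\, f(x) \\
&= \sum_x p(x;\theta)\, \nabla_\theta \log p(x;\theta)\, f(x) \\
&= E_{x \sim p(x;\theta)}\big[f(x)\, \nabla_\theta \log p(x;\theta)\big]
\end{aligned}
$$

The only trick is multiplying and dividing by $p(x;\theta)$ in the third line, and then using $\nabla_\theta \log z = \nabla_\theta z / z$. The last line can be estimated by sampling $x$ from the distribution, which is what makes it usable in practice.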

Policy Gradient

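Parameterized stochastic policy: $\pi_\theta(a \mid s)$, a distribution over actions $a$ given the current input $s$ (for example an image), with parameters $\theta$.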
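To make "parameterized stochastic policy" concrete, here is a minimal sketch in plain Python. It assumes the simplest possible case: a softmax over one logit per action, with no dependence on the input. The names `softmax` and `sample_action` and the values in `theta` are made up for illustration; a real policy network would compute the logits from the state.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary real-valued logits into a probability vector."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(theta, rng):
    """Draw one action index from the softmax policy defined by theta."""
    probs = softmax(theta)
    return rng.choices(range(len(theta)), weights=probs, k=1)[0]

theta = [0.0, 1.0, -1.0]   # hypothetical parameters: one logit per action
rng = random.Random(0)     # fixed seed so runs are repeatable

probs = softmax(theta)     # action probabilities under the current theta
actions = [sample_action(theta, rng) for _ in range(5)]
print(probs, actions)
```

Changing `theta` changes the action probabilities, and that is the only handle we have for steering which actions get sampled.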

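Maximize the expected reward (or, more generally, advantage) $f$ of samples drawn from the policy: $\max_\theta E_{x \sim p(x;\theta)}[f(x)]$. Equivalently, minimize the negative of this expectation, which is the usual form of a loss.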
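As a rough illustration of optimizing this objective, the sketch below estimates the gradient of the expected score by averaging $f(x) \nabla_\theta \log p(x;\theta)$ over samples, checks the estimate against the exact gradient, and then takes a few gradient ascent steps. It assumes a three-outcome softmax distribution with fixed, made-up scores `f`; all function names, the step size, and the sample counts are illustrative choices, not anything prescribed by the method.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(theta, x):
    """Gradient of log p(x; theta) for a softmax: one_hot(x) - probs."""
    probs = softmax(theta)
    return [(1.0 if i == x else 0.0) - p for i, p in enumerate(probs)]

def expected_score(theta, f):
    """Exact E[f(x)] under the softmax distribution."""
    return sum(p * fx for p, fx in zip(softmax(theta), f))

def exact_gradient(theta, f):
    """Exact gradient of E[f(x)]: component i is p_i * (f_i - E[f])."""
    probs = softmax(theta)
    mean_f = expected_score(theta, f)
    return [p * (fx - mean_f) for p, fx in zip(probs, f)]

def score_function_gradient(theta, f, n_samples, rng):
    """Monte Carlo estimate: average of f(x) * grad log p(x; theta)."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        x = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        g = grad_log_prob(theta, x)
        for i in range(len(theta)):
            grad[i] += f[x] * g[i] / n_samples
    return grad

rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
f = [1.0, 0.0, -1.0]       # hypothetical scores, one per outcome

# The sampled estimate should be close to the exact gradient.
estimate = score_function_gradient(theta, f, 20000, rng)
exact = exact_gradient(theta, f)

# Gradient ascent on theta using only sampled gradients.
before = expected_score(theta, f)
for _ in range(50):
    g = score_function_gradient(theta, f, 200, rng)
    theta = [t + 0.5 * gi for t, gi in zip(theta, g)]
after = expected_score(theta, f)
print(estimate, exact, before, after)
```

Outcomes with above-average score get their logits pushed up and the others get pushed down, so probability mass drifts toward high-scoring samples.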

We can take