
Dynamic Programming


Posted by xuepro on June 2, 2018

Dynamic Programming

Policy Evaluation (Prediction)

For a given policy $\pi$, compute the state-value function $v_\pi$:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v_\pi(s') \right]$$

This is a system of $|\mathcal{S}|$ simultaneous linear equations. In matrix notation the solution is

$$v_\pi = (I - \gamma P^\pi)^{-1} R^\pi$$

with complexity $O(|\mathcal{S}|^3)$, so a direct solve is only practical for small problems.
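
As a concrete illustration, here is a minimal sketch of the direct solve in NumPy. The 2-state, 2-action MDP (the arrays `P`, `R`, `pi`) is a made-up toy example, not from the original post:

```python
import numpy as np

# Toy 2-state, 2-action MDP (made-up numbers, purely for illustration).
gamma = 0.9
P = np.array([[[0.8, 0.2],   # P[s, a, s'] = Pr(s' | s, a)
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
R = np.array([[1.0, 0.0],    # R[s, a] = expected immediate reward
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],   # pi[s, a] = probability of taking a in s
               [0.5, 0.5]])

# Policy-averaged transition matrix P^pi and reward vector R^pi.
P_pi = np.einsum('sa,sat->st', pi, P)
R_pi = np.einsum('sa,sa->s', pi, R)

# Solve the |S| x |S| linear system (I - gamma * P^pi) v = R^pi.
v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
print(v_pi)
```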

Iterative Policy Evaluation

Iterative application of the Bellman expectation backup: $v_1 \to v_2 \to \dots \to v_\pi$.

A full policy-evaluation backup:

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v_k(s') \right]$$

A sweep consists of applying a backup operation to each state. Using synchronous backups (see the sketch after this list):

  • At each iteration $k + 1$
  • For all states $s \in \mathcal{S}$
  • Update $v_{k+1}(s)$ from $v_k(s')$, where $s'$ is a successor state of $s$
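
A minimal sketch of iterative policy evaluation with synchronous backups, assuming the same toy representation as above (`P[s, a, s']` for transition probabilities, `R[s, a]` for expected rewards, `pi[s, a]` for the policy); these names are illustrative, not from the original post:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Evaluate pi by synchronous Bellman expectation backups.
    P[s, a, s'], R[s, a], pi[s, a] follow the toy layout sketched earlier."""
    v = np.zeros(P.shape[0])
    while True:
        # One full sweep: back up every state from the previous values v_k.
        q = R + gamma * P @ v                  # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * v[s']
        v_new = np.einsum('sa,sa->s', pi, q)   # v_{k+1}(s) = sum_a pi(a|s) * q[s, a]
        if np.max(np.abs(v_new - v)) < theta:  # stop when a sweep barely changes v
            return v_new
        v = v_new
```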

Policy Improvement

Policy Improvement Theorem

For some state $s$ we would like to know whether or not we should change the policy to deterministically choose an action $a \neq \pi(s)$. We know how good it is to follow the current policy from $s$, that is $v_\pi(s)$, but would it be better or worse to change to the new policy? One way to answer this question is to consider selecting $a$ in $s$ and thereafter following the existing policy $\pi$. The value of this way of behaving is

$$q_\pi(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right] = \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v_\pi(s') \right]$$

The key criterion is whether this is greater than or less than $v_\pi(s)$.

Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,

$$q_\pi(s, \pi'(s)) \geq v_\pi(s)$$

Then the policy $\pi'$ must be as good as, or better than, $\pi$; that is, $v_{\pi'}(s) \geq v_\pi(s)$ for all $s \in \mathcal{S}$.

Proof. Repeatedly expand $q_\pi(s, \pi'(s))$ with the Bellman expectation equation and apply the assumption:

$$v_\pi(s) \leq q_\pi(s, \pi'(s)) = \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s) \right] \leq \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \leq \dots \leq \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right] = v_{\pi'}(s)$$
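
The improvement step itself is just acting greedily with respect to $q_\pi$. A minimal sketch, again assuming the hypothetical `P[s, a, s']` and `R[s, a]` arrays used earlier:

```python
import numpy as np

def greedy_policy_improvement(P, R, v, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to
    q_pi(s, a) = R[s, a] + gamma * sum_s' P[s, a, s'] * v[s']."""
    q = R + gamma * P @ v
    pi_new = np.zeros_like(R)
    pi_new[np.arange(R.shape[0]), np.argmax(q, axis=1)] = 1.0  # all probability on argmax_a q
    return pi_new
```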

Policy Iteration

Policy iteration alternates between evaluating the current policy $\pi$ and improving it greedily to obtain $\pi'$. What if improvement stops, i.e.

$$q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)?$$

But this is the Bellman optimality equation; therefore $v_\pi(s) = v_*(s)$ for all $s \in \mathcal{S}$.

So $\pi$ is an optimal policy!
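
Putting the two steps together gives a self-contained sketch of policy iteration; the representation of `P` and `R` is the same illustrative one as above, not from the original post:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Alternate full policy evaluation and greedy improvement until the
    greedy policy stops changing. P[s, a, s'] and R[s, a] as sketched earlier."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from the uniform random policy
    while True:
        # Policy evaluation: synchronous expectation backups until convergence.
        v = np.zeros(n_states)
        while True:
            v_new = np.einsum('sa,sa->s', pi, R + gamma * P @ v)
            if np.max(np.abs(v_new - v)) < theta:
                v = v_new
                break
            v = v_new
        # Policy improvement: act greedily with respect to q_pi.
        pi_new = np.zeros_like(pi)
        pi_new[np.arange(n_states), np.argmax(R + gamma * P @ v, axis=1)] = 1.0
        if np.array_equal(pi_new, pi):
            return pi_new, v   # improvement stopped, so pi is optimal
        pi = pi_new
```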

Value Iteration

One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set.

In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps:

$$v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v_k(s') \right]$$
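
A minimal sketch of value iteration under the same assumed `P[s, a, s']` / `R[s, a]` representation; note that a max over actions replaces the expectation under a fixed policy:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """One-sweep truncated evaluation fused with improvement:
    v_{k+1}(s) = max_a [ R[s, a] + gamma * sum_s' P[s, a, s'] * v_k(s') ]."""
    v = np.zeros(P.shape[0])
    while True:
        v_new = np.max(R + gamma * P @ v, axis=1)  # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # Recover a deterministic greedy policy from the converged values.
    pi = np.zeros_like(R)
    pi[np.arange(R.shape[0]), np.argmax(R + gamma * P @ v, axis=1)] = 1.0
    return pi, v
```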