
Markov Decision Process (MDP)


Posted by xuepro on May 31, 2018

A stochastic process is an indexed collection of random variables $\{X_t\}_{t \in T}$, e.g., the time series of weekly demands for a product.

A stochastic process is said to be Markovian if and only if

$$\mathbb{P}[X_{t+1} = j \mid X_t = i, X_{t-1} = k_{t-1}, \dots, X_0 = k_0] = \mathbb{P}[X_{t+1} = j \mid X_t = i]$$

“The future is independent of the past given the present”

A Markov process (or Markov chain) is a memoryless stochastic process, i.e., a sequence of random states with the Markov property

A Markov process is a tuple $\langle S, P \rangle$:

  • S is a (finite) set of states
  • P is a state transition probability matrix, $P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$

  • a set of initial probabilities $\mu_i^0 = \mathbb{P}[X_0 = i]$ for all $i$
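
To make the definition concrete, here is a minimal Python sketch (the two-state chain and its numbers are made up for the example, and NumPy is assumed) that samples a trajectory from a Markov chain:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state Markov chain; the numbers are made up for illustration.
P = np.array([[0.9, 0.1],    # transition probabilities out of state 0
              [0.5, 0.5]])   # transition probabilities out of state 1
mu0 = np.array([1.0, 0.0])   # initial state probabilities

state = rng.choice(2, p=mu0)
trajectory = [state]
for _ in range(10):
    # Markov property: the next state depends only on the current state.
    state = rng.choice(2, p=P[state])
    trajectory.append(state)
print(trajectory)
```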

A Markov reward process is a Markov process with values. A Markov Reward Process (MRP) is a tuple $\langle S, P, R, \gamma \rangle$:

  • S is a (finite) set of states
  • P is a state transition probability matrix, $P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$

  • R is a reward function, $R_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$
  • a set of initial probabilities $\mu_i^0 = \mathbb{P}[X_0 = i]$ for all $i$

The return $G_t$ is the total discounted reward from time–step $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The state value function $V(s)$ of an MRP is the expected return starting from state $s$:

$$V(s) = \mathbb{E}[G_t \mid S_t = s]$$
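
As a quick numerical illustration of the return, a minimal sketch (the reward sequence and discount factor are made up for the example):

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1}.
# The rewards below are an arbitrary example sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0, 3.0]   # R_{t+1}, R_{t+2}, ...

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)
```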

Bellman Equation

The value function can be decomposed into the immediate reward plus the discounted value of the successor state, $G_t = R_{t+1} + \gamma G_{t+1}$. So we get the Bellman equation:

$$V(s) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s] = R_s + \gamma \sum_{s' \in S} P_{ss'} V(s')$$

The Bellman equation can be expressed concisely using matrices:

$$V = R + \gamma P V$$

where $V$ and $R$ are column vectors with one entry per state.

Solving the Bellman Equation

The Bellman equation is a linear equation, so it can be solved directly:

$$V = (I - \gamma P)^{-1} R$$

The computational complexity is $O(n^3)$ for $n$ states, so the direct solution is only possible for small MRPs (a small NumPy sketch follows the list below). There are many iterative methods for large MRPs, e.g.,

  • Dynamic programming
  • Monte–Carlo evaluation
  • Temporal–Difference learning
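
For a small MRP, the direct solution above is a one-line linear solve. A minimal NumPy sketch, with a made-up 3-state transition matrix and reward vector:

```python
import numpy as np

# Hypothetical 3-state MRP; the numbers are made up for illustration.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# V = R + gamma * P * V  =>  V = (I - gamma * P)^{-1} R
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```

Using np.linalg.solve avoids forming the inverse explicitly, which is the usual choice for this $O(n^3)$ step.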

Discrete–time Finite Markov Decision Processes (MDP)

A Markov decision process (MDP) is a Markov reward process with decisions. It models an environment in which all states are Markov and time is divided into stages.

A Markov Decision Process is a tuple $\langle S, A, P, R, \gamma \rangle$:

  • S is a (finite) set of states
  • A is a (finite) set of actions
  • P is a state transition probability matrix, $P_{ss'}^{a} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
  • R is a reward function, $R_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$
  • a set of initial probabilities $\mu_i^0 = \mathbb{P}[X_0 = i]$ for all $i$
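
One convenient way to hold such a finite MDP in code is a pair of NumPy arrays indexed by (state, action). A minimal sketch with made-up numbers (2 states, 2 actions); the later sketches reuse these made-up numbers:

```python
import numpy as np

gamma = 0.9

# P[s, a, s'] = probability of reaching s' when taking action a in state s.
# The numbers are made up for illustration; each P[s, a] row sums to 1.
P = np.array([[[0.8, 0.2],    # state 0, action 0
               [0.1, 0.9]],   # state 0, action 1
              [[0.5, 0.5],    # state 1, action 0
               [0.0, 1.0]]])  # state 1, action 1

# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[ 1.0, 0.0],
              [-1.0, 2.0]])
```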

The goal is specified by a scalar reward: goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (the reward).

Policies: A policy, at any given point in time, decides which action the agent selects. A policy fully defines the behavior of an agent. Policies can be:

  • Markovian or history–dependent
  • Deterministic or stochastic
  • Stationary or non–stationary

Stationary Stochastic Markovian Policies

A policy $\pi$ is a distribution over actions given the state:

$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$

  • MDP policies depend on the current state (not the history)
  • i.e., Policies are stationary (time–independent)
  • Given an MDP $M = \langle S, A, P, R, \gamma \rangle$ and a policy $\pi$
    • The state sequence $S_1, S_2, \dots$ is a Markov process $\langle S, P^\pi \rangle$
    • The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle S, P^\pi, R^\pi, \gamma \rangle$, where $P^\pi_{ss'} = \sum_{a \in A} \pi(a \mid s) P^a_{ss'}$ and $R^\pi_s = \sum_{a \in A} \pi(a \mid s) R^a_s$
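
Computing the induced MRP from an MDP and a stochastic policy is just a weighted average over actions. A minimal NumPy sketch, reusing the hypothetical 2-state, 2-action MDP above:

```python
import numpy as np

# Hypothetical MDP (same made-up numbers as the earlier sketch).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[1.0, 0.0], [-1.0, 2.0]])    # R[s, a]

# A stationary stochastic Markovian policy pi[s, a] (also made up).
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

# Induced MRP: P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},  R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum('sa,sax->sx', pi, P)
R_pi = np.einsum('sa,sa->s', pi, R)
print(P_pi)
print(R_pi)
```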

Value Functions

Given a policy $\pi$, it is possible to define the utility of each state; this is called policy evaluation.

  • The state–value function $V^\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$: $V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

For control purposes, rather than the value of each state, it is easier to consider the value of each action in each state.

The action–value function $Q^\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

Bellman Expectation Equation

As for an MRP, the state–value function can be decomposed into immediate reward plus the discounted value of the successor state:

$$V^\pi(s) = \sum_{a \in A} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a V^\pi(s') \right)$$

$$Q^\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s') Q^\pi(s', a')$$

The Bellman expectation equation can be expressed concisely using the induced MRP:

$$V^\pi = R^\pi + \gamma P^\pi V^\pi$$

with direct solution

$$V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$$
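
Putting the previous pieces together, exact policy evaluation for a small MDP is a single linear solve on the induced MRP. A minimal sketch with the same hypothetical MDP and policy:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP and policy (same made-up numbers as before).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[1.0, 0.0], [-1.0, 2.0]])    # R[s, a]
pi = np.array([[0.5, 0.5], [0.2, 0.8]])    # pi[s, a]
gamma = 0.9

# Induced MRP and direct solution V^pi = (I - gamma P^pi)^{-1} R^pi
P_pi = np.einsum('sa,sax->sx', pi, P)
R_pi = np.einsum('sa,sa->s', pi, R)
V_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)

# Q^pi(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') V^pi(s')
Q_pi = R + gamma * np.einsum('sax,x->sa', P, V_pi)
print(V_pi)
print(Q_pi)
```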

Bellman operators for $V^\pi$ and $Q^\pi$

The Bellman operator for $V^\pi$ is defined as $T^\pi: \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ (it maps value functions to value functions):

$$(T^\pi V)(s) = \sum_{a \in A} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a V(s') \right)$$

Using the Bellman operator, the Bellman expectation equation can be compactly written as:

$$T^\pi V^\pi = V^\pi$$

  • $V^\pi$ is a fixed point of the Bellman operator $T^\pi$
  • This is a linear equation in $V^\pi$ and $R^\pi$
  • If $0 < \gamma < 1$ then $T^\pi$ is a contraction w.r.t. the maximum norm

The Bellman operator for $Q^\pi$ is defined as $T^\pi: \mathbb{R}^{|S| \times |A|} \to \mathbb{R}^{|S| \times |A|}$ (it maps action–value functions to action–value functions):

$$(T^\pi Q)(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s') Q(s', a')$$

Using the Bellman operator, the Bellman expectation equation can be compactly written as:

$$T^\pi Q^\pi = Q^\pi$$
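
Because $T^\pi$ is a contraction for $\gamma < 1$, repeatedly applying it from any starting point converges to its fixed point $V^\pi$; this is iterative policy evaluation, the dynamic-programming method listed earlier. A minimal sketch with the same hypothetical MDP and policy:

```python
import numpy as np

# Hypothetical MDP and policy (same made-up numbers as the earlier sketches).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[1.0, 0.0], [-1.0, 2.0]])    # R[s, a]
pi = np.array([[0.5, 0.5], [0.2, 0.8]])    # pi[s, a]
gamma = 0.9

def bellman_operator(V):
    # (T^pi V)(s) = sum_a pi(a|s) * (R(s,a) + gamma * sum_s' P(s,a,s') V(s'))
    return np.einsum('sa,sa->s', pi, R + gamma * np.einsum('sax,x->sa', P, V))

V = np.zeros(2)
for _ in range(1000):
    V_new = bellman_operator(V)
    if np.max(np.abs(V_new - V)) < 1e-10:   # convergence in the maximum norm
        break
    V = V_new
print(V)
```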

Optimal Value Function and Optimal Policy

Optimal Value Function

The optimal state–value function $V^*(s)$ is the maximum value function over all policies:

$$V^*(s) = \max_\pi V^\pi(s)$$

The optimal action–value function $Q^*(s, a)$ is the maximum action–value function over all policies:

$$Q^*(s, a) = \max_\pi Q^\pi(s, a)$$

  • The optimal value function specifies the best possible performance in the MDP
  • An MDP is “solved” when we know the optimal value function

Optimal Policy

Value functions define a partial ordering over policies:

$$\pi \geq \pi' \iff V^\pi(s) \geq V^{\pi'}(s), \quad \forall s \in S$$

Theorem

For any Markov Decision Process

  • There exists an optimal policy $\pi^*$ that is better than or equal to all other policies, $\pi^* \geq \pi, \; \forall \pi$
  • All optimal policies achieve the optimal value function, $V^{\pi^*}(s) = V^*(s)$
  • All optimal policies achieve the optimal action–value function, $Q^{\pi^*}(s, a) = Q^*(s, a)$
  • There is always a deterministic optimal policy for any MDP

A deterministic optimal policy can be found by maximizing over $Q^*(s, a)$:

$$\pi^*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} Q^*(s, a) \\ 0 & \text{otherwise} \end{cases}$$

Bellman Optimality Equation

Bellman Optimality Equation for $V^*$:

$$V^*(s) = \max_a \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a V^*(s') \right)$$

Bellman Optimality Equation for $Q^*$:

$$Q^*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} Q^*(s', a')$$
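
The Bellman optimality equation is non-linear because of the max, so there is no closed-form solution in general, but fixed-point iteration on it gives value iteration. A minimal sketch on the same hypothetical MDP, including greedy extraction of a deterministic optimal policy from $Q^*$:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (same made-up numbers as before).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[1.0, 0.0], [-1.0, 2.0]])    # R[s, a]
gamma = 0.9

# Value iteration: V <- max_a (R(s,a) + gamma * sum_s' P(s,a,s') V(s'))
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * np.einsum('sax,x->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

pi_star = Q.argmax(axis=1)   # deterministic optimal policy: pi*(s) = argmax_a Q*(s,a)
print(V)
print(pi_star)
```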

Principle of optimality:

the tail of an optimal policy is optimal for the "tail" problem. (In other words, no matter what the initial state is and what decisions have already been made, the remaining part of an optimal policy, the tail policy, is still optimal for the remaining subproblem.)

Math notation, for reference:

The vertical bar in math notation sometimes needs to be written with \vert

The special symbol ' should be written with \prime

List of Greek letters and math symbols
