A stochastic process is an indexed collection of random variables, e.g., a time series of weekly demands for a product.
A stochastic process is said to be Markovian if and only if
$$\Pr[X_{t+1} = j \mid X_t = i, X_{t-1} = k_{t-1}, \dots, X_0 = k_0] = \Pr[X_{t+1} = j \mid X_t = i]$$
i.e., “The future is independent of the past given the present.”
A Markov process (or Markov chain) is a memoryless stochastic process, i.e., a sequence of random states with the Markov property
A Markov process is a tuple $\langle S, P, \mu \rangle$:
- $S$ is a (finite) set of states
- $P$ is a state transition probability matrix, $P_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s]$
- $\mu$ is a set of initial probabilities $\mu_i^0 = \Pr[S_0 = i]$ for all $i$
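To make the definition concrete, here is a minimal Python sketch (the 3-state chain, its state names, and all numbers are hypothetical, e.g. weekly-demand levels): sampling a trajectory uses only the current state, which is exactly the Markov property.

```python
import numpy as np

states = ["low", "medium", "high"]          # hypothetical demand levels
mu = np.array([0.5, 0.3, 0.2])              # initial probabilities mu_i = P(S_0 = i)
P = np.array([[0.6, 0.3, 0.1],              # P[i, j] = P(S_{t+1} = j | S_t = i)
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

rng = np.random.default_rng(0)
s = rng.choice(len(states), p=mu)           # draw the initial state from mu
trajectory = [states[s]]
for _ in range(10):                         # next state depends only on the current one
    s = rng.choice(len(states), p=P[s])
    trajectory.append(states[s])
print(trajectory)
```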
A Markov reward process is a Markov process with values. A Markov Reward Process is a tuple $\langle S, P, R, \gamma, \mu \rangle$:
- $S$ is a (finite) set of states
- $P$ is a state transition probability matrix, $P_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s]$
- $R$ is a reward function, $R_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
- $\gamma \in [0, 1]$ is a discount factor
- $\mu$ is a set of initial probabilities $\mu_i^0 = \Pr[S_0 = i]$ for all $i$
The return $G_t$ is the total discounted reward from time–step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
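As a quick numeric illustration (the reward sequence below is made up), the return just discounts each successive reward by one more factor of $\gamma$:

```python
# Hypothetical rewards R_{t+1}, R_{t+2}, ... received after time-step t.
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9

# G_t = sum_k gamma^k * R_{t+k+1}
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```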
The state value function $V(s)$ of an MRP is the expected return starting from state $s$: $V(s) = \mathbb{E}[G_t \mid S_t = s]$
Bellman Equation
The value function can be decomposed into the immediate reward plus the discounted value of the successor state, $V(s) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s]$. So we get the Bellman equation:
$$V(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} V(s')$$
The Bellman equation can be expressed concisely using matrices:
$$V = R + \gamma P V$$
Solving the Bellman Equation
The Bellman equation is a linear equation, so it can be solved directly: $V = (I - \gamma P)^{-1} R$ (see the sketch after the list below). The computational complexity is $O(n^3)$ for $n$ states, so a direct solution is only possible for small MRPs. There are many iterative methods for large MRPs, e.g.,
- Dynamic programming
- Monte–Carlo evaluation
- Temporal–Difference learning
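To make the direct solution concrete, here is a minimal Python sketch (the 3-state MRP and all its numbers are hypothetical): `np.linalg.solve` solves $(I - \gamma P)V = R$, which is the $O(n^3)$ step mentioned above.

```python
import numpy as np

# Hypothetical 3-state MRP.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
R = np.array([1.0, 0.0, -1.0])   # expected immediate reward per state
gamma = 0.9

# Bellman equation in matrix form: V = R + gamma * P V  =>  (I - gamma * P) V = R
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```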
Discrete–time Finite Markov Decision Processes (MDP)
A Markov decision process (MDP) is a Markov reward process with decisions. It models an environment in which all states are Markov and time is divided into stages.
A Markov Decision Process is a tuple $\langle S, A, P, R, \gamma, \mu \rangle$:
- $S$ is a (finite) set of states
- $A$ is a (finite) set of actions
- $P$ is a state transition probability matrix, $P_{ss'}^a = \Pr[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $R$ is a reward function, $R_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma \in [0, 1]$ is a discount factor
- $\mu$ is a set of initial probabilities $\mu_i^0 = \Pr[S_0 = i]$ for all $i$
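One common way to hold these components in code is as arrays indexed by state and action; the class name, array shapes, and numbers below are all hypothetical choices, not part of the lecture.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MDP:
    """Finite MDP <S, A, P, R, gamma, mu> stored as NumPy arrays."""
    P: np.ndarray      # shape (|A|, |S|, |S|): P[a, s, s'] = Pr(s' | s, a)
    R: np.ndarray      # shape (|S|, |A|):      R[s, a] = expected immediate reward
    gamma: float       # discount factor in [0, 1]
    mu: np.ndarray     # shape (|S|,):          initial state distribution

# Hypothetical 2-state, 2-action MDP.
mdp = MDP(
    P=np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.1, 0.9]]]),
    R=np.array([[1.0, 0.0],
                [0.0, 2.0]]),
    gamma=0.95,
    mu=np.array([1.0, 0.0]),
)
```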
The goal is expressed via a scalar reward: goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (the reward).
Policies: A policy, at any given point in time, decides which action the agent selects. A policy fully defines the behavior of an agent. Policies can be:
- Markovian or History–dependent
- Deterministic or Stochastic
- Stationary or Non–stationary
Stationary Stochastic Markovian Policies
A policy $\pi$ is a distribution over actions given the state: $\pi(a \mid s) = \Pr[A_t = a \mid S_t = s]$
- MDP policies depend on the current state (not the history)
- i.e., policies are stationary (time–independent)
- Given an MDP $\mathcal{M} = \langle S, A, P, R, \gamma, \mu \rangle$ and a policy $\pi$:
- the state sequence $s_1, s_2, \dots$ is a Markov process $\langle S, P^\pi, \mu \rangle$
- the state and reward sequence $s_1, r_2, s_2, \dots$ is a Markov reward process $\langle S, P^\pi, R^\pi, \gamma, \mu \rangle$, where $P^\pi_{ss'} = \sum_{a \in A} \pi(a \mid s) P^a_{ss'}$ and $R^\pi_s = \sum_{a \in A} \pi(a \mid s) R^a_s$
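A sketch of how a stationary stochastic Markovian policy induces this MRP (the helper name `induced_mrp`, the array layout, and the numbers are all assumptions): $P^\pi$ and $R^\pi$ are simply $\pi$-weighted averages over actions.

```python
import numpy as np

def induced_mrp(P, R, pi):
    """P[a, s, s'], R[s, a], pi[s, a]  ->  (P_pi[s, s'], R_pi[s])."""
    P_pi = np.einsum("sa,ast->st", pi, P)   # P_pi(s, s') = sum_a pi(a|s) P(s'|s, a)
    R_pi = (pi * R).sum(axis=1)             # R_pi(s)     = sum_a pi(a|s) R(s, a)
    return P_pi, R_pi

# Hypothetical 2-state, 2-action MDP and a uniform-random policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)                   # pi(a|s) = 0.5 for both actions

P_pi, R_pi = induced_mrp(P, R, pi)
print(P_pi, R_pi)
```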
Value Functions
Given a policy $\pi$, it is possible to define the utility of each state; this is called policy evaluation.
- The state–value function $V^\pi(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$: $V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
For control purposes, rather than the value of each state, it is easier to consider the value of each action in each state.
The action–value function $Q^\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
Bellman Expectation Equation
The value functions can again be decomposed into the immediate reward plus the discounted value of the successor state:
$$V^\pi(s) = \sum_{a \in A} \pi(a \mid s) \Big( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a V^\pi(s') \Big)$$
$$Q^\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s') Q^\pi(s', a')$$
The Bellman expectation equation can be expressed concisely using the induced MRP:
$$V^\pi = R^\pi + \gamma P^\pi V^\pi$$
with direct solution
$$V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$$
Bellman Operators for $V^\pi$
The Bellman operator for $V^\pi$ is defined as $T^\pi : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ (it maps value functions to value functions):
$$(T^\pi V)(s) = \sum_{a \in A} \pi(a \mid s) \Big( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a V(s') \Big)$$
Using the Bellman operator, the Bellman expectation equation can be compactly written as $T^\pi V^\pi = V^\pi$:
- $V^\pi$ is a fixed point of the Bellman operator $T^\pi$
- This is a linear equation in $V^\pi$ and $T^\pi$
- If $0 < \gamma < 1$, then $T^\pi$ is a contraction w.r.t. the maximum norm
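Because $T^\pi$ is a $\gamma$-contraction, repeatedly applying it from any starting point converges to $V^\pi$. A minimal sketch of this fixed-point iteration on a hypothetical induced MRP (all numbers made up):

```python
import numpy as np

# Hypothetical induced MRP (P_pi, R_pi) and discount factor.
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
R_pi = np.array([1.0, -0.5])
gamma = 0.9

# Iterative policy evaluation: V <- T^pi V = R^pi + gamma * P^pi V
V = np.zeros(2)                            # any starting point works (contraction)
for _ in range(1000):
    V_new = R_pi + gamma * P_pi @ V        # one application of T^pi
    done = np.max(np.abs(V_new - V)) < 1e-10   # distance in the maximum norm
    V = V_new
    if done:
        break
print(V)   # fixed point: V^pi = (I - gamma * P^pi)^{-1} R^pi
```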
The Bellman operator for $Q^\pi$ is defined as $T^\pi : \mathbb{R}^{|S| \times |A|} \to \mathbb{R}^{|S| \times |A|}$ (it maps action–value functions to action–value functions):
$$(T^\pi Q)(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s') Q(s', a')$$
Using the Bellman operator, the Bellman expectation equation can be compactly written as $T^\pi Q^\pi = Q^\pi$.
Optimal Value Function and Optimal Policy
Optimal Value Function
The optimal state–value function $V^*(s)$ is the maximum value function over all policies: $V^*(s) = \max_\pi V^\pi(s)$
The optimal action–value function $Q^*(s, a)$ is the maximum action–value function over all policies: $Q^*(s, a) = \max_\pi Q^\pi(s, a)$
- The optimal value function specifies the best possible performance in the MDP
- An MDP is “solved” when we know the optimal value function
Optimal Policy
Value functions define a partial ordering over policies:
$\pi \geq \pi'$ if $V^\pi(s) \geq V^{\pi'}(s), \forall s \in S$
Theorem
For any Markov Decision Process
- There exists an optimal policy $\pi^*$ that is better than or equal to all other policies, $\pi^* \geq \pi, \forall \pi$
- All optimal policies achieve the optimal value function, $V^{\pi^*}(s) = V^*(s)$
- All optimal policies achieve the optimal action–value function, $Q^{\pi^*}(s, a) = Q^*(s, a)$
- There is always a deterministic optimal policy for any MDP
A deterministic optimal policy can be found by maximizing over $Q^*(s, a)$:
$$\pi^*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} Q^*(s, a) \\ 0 & \text{otherwise} \end{cases}$$
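Given $Q^*$, extracting the deterministic optimal policy is just an argmax per state; a sketch with a hypothetical $Q^*$ table (names and numbers are assumptions):

```python
import numpy as np

# Hypothetical optimal action-value table Q_star[s, a].
Q_star = np.array([[1.2, 0.7],
                   [0.3, 2.1],
                   [0.9, 0.9]])

# Deterministic greedy policy: pi*(s) = argmax_a Q*(s, a)
pi_star = Q_star.argmax(axis=1)
print(pi_star)   # [0, 1, 0]  (ties broken arbitrarily)
```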
Bellman Optimality Equation
Bellman Optimality Equation for $V^*$:
$$V^*(s) = \max_a \Big\{ R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a V^*(s') \Big\}$$
Bellman Optimality Equation for $Q^*$:
$$Q^*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} Q^*(s', a')$$
Principle of optimality:
the tail of an optimal policy is optimal for the “tail” problem. (In other words, regardless of the initial state and the decisions already made, the optimal policy restricted to the remaining subproblem is still optimal for that subproblem.)
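Because of the max, the Bellman optimality equation is not linear and is typically solved iteratively. A minimal value-iteration sketch on a hypothetical MDP (array layout and numbers are assumptions): repeatedly apply the Bellman optimality backup until convergence, then extract the greedy policy.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, s, s'], R[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.95

V = np.zeros(2)
for _ in range(10_000):
    # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * V[s']
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)                 # Bellman optimality backup
    done = np.max(np.abs(V_new - V)) < 1e-10
    V = V_new
    if done:
        break

pi_star = Q.argmax(axis=1)                # greedy (deterministic) optimal policy
print(V, pi_star)
```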
Math notation reference:
- The vertical bar | sometimes needs to be written as \vert
- The prime symbol ′ should be written with \prime