Markov decision processes

Markov decision processes (MDPs), named after Andrey Markov, provide a mathematical framework for modeling decision making in situations where outcomes are partlyrandom and partly under the control of a decision maker. MDPs are useful for studying a wide range of optimization problems solved via dynamic programming andreinforcement learning. MDPs were known at least as early as the 1950s (cf. Bellman 1957). A core body of research on Markov decision processes resulted from Ronald A. Howard‘s book published in 1960, Dynamic Programming and Markov Processes. They are used in a wide area of disciplines, including robotics, automated control, economics, and manufacturing.

More precisely, a Markov Decision Process is a discrete time stochastic control process. At each time step, the process is in some state s, and the decision maker may choose any action a that is available in state s. The process responds at the next time step by randomly moving into a new state s', and giving the decision maker a corresponding reward R_a(s,s').

The probability that the process moves into its new state s' is influenced by the chosen action. Specifically, it is given by the state transition function P_a(s,s'). Thus, the next state s' depends on the current state s and the decision maker’s action a. But given s and a, it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP possess satisfies the Markov property.

Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state and all rewards are the same (e.g., zero), a Markov decision process reduces to a Markov chain.

Example of a simple MDP with three states and two actions.

A Markov decision process is a 5-tuple (S,A,P_\cdot(\cdot,\cdot),R_\cdot(\cdot,\cdot),\gamma), where

  • S is a finite set of states,
  • A is a finite set of actions (alternatively, A_s is the finite set of actions available from state s),
  • P_a(s,s') = \Pr(s_{t+1}=s' \mid s_t = s, a_t=a) is the probability that action a in state s at time t will lead to state s' at time t+1,
  • R_a(s,s') is the immediate reward (or expected immediate reward) received after transition to state s'from state s,
  • \gamma \in [0,1] is the discount factor, which represents the difference in importance between future rewards and present rewards.

The core problem of MDPs is to find a “policy” for the decision maker: a function \pi that specifies the action \pi(s) that the decision maker will choose when in state s. Note that once a Markov decision process is combined with a policy in this way, this fixes the action for each state and the resulting combination behaves like a Markov chain.

The goal is to choose a policy \pi that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon:

\sum^{\infty}_{t=0} {\gamma^t R_{a_t} (s_t, s_{t+1})}    (where we choose a_t = \pi(s_t))

where \ \gamma \ is the discount factor and satisfies 0 \le\ \gamma\ < 1. (For example,  \gamma = 1/(1+r) when the discount rate is r.)  \gamma is typically close to 1.

Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of s only, as assumed above.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s