# RL Characteristics

  • No supervisor, only reward signal
  • Feedback is delayed, not instantaneous
  • Time matters (sequential, non-IID data)
  • Agent's actions affect the subsequent data it receives

# History

  • The history is the sequence of observations, actions, and rewards up to time $t$:

$$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$$

where $A$ stands for action, $O$ stands for observation, and $R$ stands for reward.

The history determines what happens next, whether that is the agent picking its next action or the environment emitting the next observation and reward.

But the raw history is not very useful, because we do not want to trawl back through everything that has ever happened just to make a decision.
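
To make this concrete, here is a minimal sketch (my own, in Python, with a hypothetical Gym-like `env`/`agent` interface) of the interaction loop that builds up the history:

```python
# Sketch: the agent-environment loop accumulating the history H_t.
# `env` and `agent` are hypothetical objects; only the bookkeeping is the point.

def run_episode(env, agent, max_steps=100):
    history = []                        # H_t = O_1, R_1, A_1, ..., O_t, R_t
    observation, reward = env.reset(), 0.0
    for t in range(max_steps):
        history.append(("obs", observation))
        history.append(("reward", reward))
        action = agent.act(history)     # the agent may look at the whole history
        history.append(("action", action))
        observation, reward, done = env.step(action)
        if done:
            break
    return history
```

Note how `history` grows with every step, which is exactly why acting on the raw history quickly becomes impractical.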

# State

This is where state comes into play: it is a summary of the history that can be used to make decisions going forward. Formally, the state is a function of the history, $S_t = f(H_t)$ (a small sketch follows the bullets below).

  • This state is the agent's internal representation, not the environment's state since that is not accessible to the agent.
  • i.e. whatever info the agent uses to pick the next action
  • i.e. it is the info used by RL learning algorithms
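
As a sketch (my own illustration), here are two of the simplest choices of the state function $f$, which keep only the most recent observation(s):

```python
# Two simple (hypothetical) choices of the state function f in S_t = f(H_t),
# reusing the ("obs"/"reward"/"action", value) history format sketched above.

def last_observation_state(history):
    """Use only the most recent observation as the agent's state."""
    observations = [value for kind, value in history if kind == "obs"]
    return observations[-1]

def stacked_observations_state(history, k=4):
    """Stack the last k observations (e.g. Atari-style frame stacking)."""
    observations = [value for kind, value in history if kind == "obs"]
    return tuple(observations[-k:])
```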

# Information state

  • An information state (aka Markov state) contains all useful information from the history.

  • Formally, a state $S_t$ is Markov if and only if $P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]$: the future is independent of the past given the present.

  • You can throw away everything that came before; keeping just the present state is enough to characterize the future.

  • The environment's state is Markov, and the full history is (trivially) Markov. What we want is for the agent's state to be Markov as well.

# Fully observable environments

(Figure from David Silver's slides: fully observable environment.)

In a fully observable environment the agent directly observes the environment state, so agent state = environment state = observation. Formally, this is a Markov Decision Process (MDP).

# Partially observable environments

Some examples:

  • A robot with camera vision isn't told its absolute location
  • A trading agent only observes current prices

In this case, agent state != environment state.

Formally, this is a Partially Observable Markov Decision Process (POMDP).

Here the agent has to construct its own state representation, for example (see the sketch after the list below):

  • Use the complete history
  • Maintain a belief, i.e. a probability distribution over the environment's state
  • Use a recurrent neural network to summarize past observations
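
Here is a rough sketch of those three options (illustrative only: the belief update assumes a known discrete transition matrix and observation likelihoods, and the recurrent update uses arbitrary fixed weight matrices):

```python
import numpy as np

# Option 1: use the complete history as the state (grows without bound).
def history_state(history):
    return tuple(history)

# Option 2: belief state -- a probability distribution over hidden environment
# states, updated by Bayes' rule. `transition` is T[s, s'] for the action taken,
# `obs_likelihood` is P(observation | hidden state) as a vector over states.
def belief_update(belief, transition, obs_likelihood):
    predicted = belief @ transition          # predict step
    corrected = predicted * obs_likelihood   # weight by how well each state explains the observation
    return corrected / corrected.sum()       # renormalize to a distribution

# Option 3: recurrent state -- a fixed-size vector combining the previous agent
# state with the new observation (the idea behind using an RNN).
def recurrent_update(state, observation, W_s, W_o):
    return np.tanh(W_s @ state + W_o @ observation)
```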

# Major Components of RL Agent

An RL agent may include one or more of these components (not an exhaustive list):

# Policy

Agent's behavior function: a map from state to action (a small sketch follows the bullets below).

  • Deterministic policy: $a = \pi(s)$, a function that maps each state to a single action
  • Stochastic policy: $\pi(a \mid s) = P[A_t = a \mid S_t = s]$, which lets the agent make random decisions and explore
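
A small sketch of both kinds of policy, with toy states and actions of my own invention:

```python
import random

# Deterministic policy a = pi(s): here just a lookup table.
deterministic_pi = {"s1": "left", "s2": "right"}

def act_deterministic(state):
    return deterministic_pi[state]

# Stochastic policy pi(a | s) = P[A_t = a | S_t = s]: sampled at decision time,
# which is what allows random exploration.
stochastic_pi = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.1, "right": 0.9},
}

def act_stochastic(state):
    actions, probs = zip(*stochastic_pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```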

# Value function

  • The expected return if you start in a particular state: how much reward you would accumulate in the future if you were dropped into this state of the Markov process, i.e. $v_\pi(s) = E_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s]$ (see the sketch below)
  • Used to evaluate the goodness/badness of states
  • Used to select between actions
  • Used to predict future reward
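
As a sketch (my own, with an arbitrary discount factor), the discounted return of one episode and a Monte Carlo estimate of a state's value obtained by averaging returns over episodes that start there:

```python
def discounted_return(rewards, gamma=0.99):
    """G = R_1 + gamma * R_2 + gamma^2 * R_3 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def monte_carlo_value(reward_sequences, gamma=0.99):
    """Estimate v(s) from reward sequences of episodes that started in state s."""
    returns = [discounted_return(rewards, gamma) for rewards in reward_sequences]
    return sum(returns) / len(returns)
```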

# Model

The agent's representation of the environment: it predicts what the environment will do next (a small sketch of a learned tabular model follows).

  • Transitions: $P$ predicts the next state (i.e. the dynamics), e.g. $P_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
  • Rewards: $R$ predicts the next (immediate) reward, e.g. $R_s^a = E[R_{t+1} \mid S_t = s, A_t = a]$
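
A minimal sketch of a learned tabular model (my own illustration, not from the lecture): count transitions and average rewards from observed (s, a, r, s') tuples to get estimates of $P$ and $R$:

```python
from collections import defaultdict

def fit_tabular_model(transitions):
    """transitions: iterable of (s, a, r, s_next) tuples."""
    next_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s_next: count}
    reward_sums = defaultdict(float)                     # (s, a) -> summed reward
    visits = defaultdict(int)                            # (s, a) -> visit count

    for s, a, r, s_next in transitions:
        next_counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1

    # P_hat[(s, a)][s_next] estimates P[S_{t+1} = s_next | S_t = s, A_t = a]
    P_hat = {sa: {s2: c / visits[sa] for s2, c in nxt.items()}
             for sa, nxt in next_counts.items()}
    # R_hat[(s, a)] estimates E[R_{t+1} | S_t = s, A_t = a]
    R_hat = {sa: reward_sums[sa] / visits[sa] for sa in visits}
    return P_hat, R_hat
```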

# Learning and Planning

Two fundamental problems in sequential decision making (a toy sketch contrasting the two follows the list):

  • Reinforcement Learning
    • Environment initially unknown
    • Agent interacts with environment
    • Agent improves policy
  • Planning
    • Model of environment is known
    • Agent performs computation with model (without external interaction)
    • Agent improves its policy (aka search, deliberation)
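
A toy sketch (my own, with tabular dictionaries) contrasting the two: planning backs up values through a known model, while reinforcement learning updates from sampled transitions without one:

```python
# Planning: one value-iteration backup using a KNOWN model.
# P[(s, a)] is a dict {s_next: probability}, R[(s, a)] is the expected reward.
def planning_backup(V, P, R, gamma=0.9):
    new_V = {}
    for s in V:
        actions = {a for (s_, a) in P if s_ == s}
        new_V[s] = max(
            (R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
             for a in actions),
            default=0.0,  # terminal states with no actions keep value 0
        )
    return new_V

# Learning: a Q-learning update from one sampled transition -- no model needed.
# Q is a dict keyed by (state, action), e.g. a defaultdict(float).
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```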

# Prediction vs Control

  • Prediction: Evaluate the future given a policy
  • Control: Optimize the future by finding the best policy