Reinforcement learning and multi-armed bandit

Multi-Armed Bandit

The N-armed bandit is a classic RL problem: you choose repeatedly among N options. Start by choosing at random; as time passes, each choice returns feedback (a reward), and you use that new information to refine later choices. The core tension is exploration (trying options to learn more about them) versus exploitation (picking the option that looks best so far).
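One common way to balance the two is epsilon-greedy, sketched below. This is only an illustration under assumptions that are not in the notes above: rewards are 0 or 1 (Bernoulli), and the names `epsilon_greedy_bandit`, `true_means`, `steps` and `epsilon` are made up for the example.

```python
import random


def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a simulated Bernoulli bandit.

    true_means is a hypothetical list of per-arm success probabilities;
    the learner never sees it directly, only the sampled rewards.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how many times each arm was pulled
    estimates = [0.0] * n_arms   # running average reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Simulated feedback: 1 with probability true_means[arm], else 0.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0

        # Incremental mean update, no need to store past rewards.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return counts, estimates


if __name__ == "__main__":
    counts, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.8])
    print("pulls per arm:", counts)
    print("estimated means:", [round(e, 2) for e in estimates])
```

With a small `epsilon` most pulls go to the arm that currently looks best, while the occasional random pull keeps the other estimates from going stale.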

Contextual Bandit

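Why settle for a one-size-fits-all strategy? If the goal is to maximize a particular score or value, why not use different strategies for different users and iterate from there? A contextual bandit does this by conditioning each choice on context, such as features of the user, so the best option can differ from one user to the next.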
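The simplest version of this idea keeps a separate set of estimates per context. The sketch below assumes the context is a discrete label (a hypothetical user segment such as "new" or "returning") and that rewards are 0 or 1; the class name `ContextualEpsilonGreedy` and the helper `simulate` are illustrative, not from any library. Real systems usually share information across contexts with a model instead of a lookup table.

```python
import random
from collections import defaultdict


class ContextualEpsilonGreedy:
    """Epsilon-greedy with one table of estimates per discrete context."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # Each unseen context starts with zero counts and zero estimates.
        self.counts = defaultdict(lambda: [0] * n_arms)
        self.estimates = defaultdict(lambda: [0.0] * n_arms)

    def select(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)   # explore
        values = self.estimates[context]
        return max(range(self.n_arms), key=lambda a: values[a])  # exploit

    def update(self, context, arm, reward):
        self.counts[context][arm] += 1
        n = self.counts[context][arm]
        self.estimates[context][arm] += (reward - self.estimates[context][arm]) / n

    def best_arm(self, context):
        values = self.estimates[context]
        return max(range(self.n_arms), key=lambda a: values[a])


def simulate(reward_probs, steps=4000, seed=1):
    """reward_probs maps a context label to per-arm success probabilities
    (made-up numbers used only to generate feedback)."""
    rng = random.Random(seed)
    policy = ContextualEpsilonGreedy(n_arms=2)
    contexts = list(reward_probs)
    for _ in range(steps):
        context = rng.choice(contexts)
        arm = policy.select(context)
        reward = 1.0 if rng.random() < reward_probs[context][arm] else 0.0
        policy.update(context, arm, reward)
    return policy


if __name__ == "__main__":
    policy = simulate({"new": [0.8, 0.2], "returning": [0.2, 0.8]})
    for segment in ("new", "returning"):
        print(segment, "-> best arm", policy.best_arm(segment))
```

A single non-contextual bandit on the same data would see both arms paying off about half the time and could not tell them apart.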

Policy Gradient

A policy tells the agent which action to take in a given state. It is defined by some parameters (and is usually a neural net). Goal: maximize expected return.

Using gradient ascent, we update the parameters in the direction that increases the expected return.
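A minimal sketch of that update is the REINFORCE rule: after sampling an action, move the parameters by (learning rate) x (return) x (gradient of the log-probability of that action). To keep it short, the example swaps the neural net for a table of softmax preferences and assumes one-step episodes with 0/1 rewards; `reinforce`, `reward_probs`, `episodes` and `lr` are names invented for the illustration.

```python
import math
import random


def softmax(prefs):
    """Turn a list of preferences into action probabilities."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(reward_probs, episodes=5000, lr=0.1, seed=0):
    """Tabular softmax policy trained with REINFORCE on one-step episodes.

    reward_probs[state][action] is a hypothetical success probability
    used only to simulate the environment.
    """
    rng = random.Random(seed)
    n_states = len(reward_probs)
    n_actions = len(reward_probs[0])
    # Policy parameters: one preference per (state, action) pair.
    theta = [[0.0] * n_actions for _ in range(n_states)]

    for _ in range(episodes):
        state = rng.randrange(n_states)
        probs = softmax(theta[state])
        action = rng.choices(range(n_actions), weights=probs)[0]

        # One-step episode, so the return is just the sampled reward.
        ret = 1.0 if rng.random() < reward_probs[state][action] else 0.0

        # For a softmax policy, d log pi(action|state) / d theta[state][a]
        # equals 1[a == action] - pi(a|state).
        for a in range(n_actions):
            grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
            theta[state][a] += lr * ret * grad_log_pi  # gradient ascent step

    return theta


if __name__ == "__main__":
    theta = reinforce([[0.9, 0.1], [0.1, 0.9]])
    for state, prefs in enumerate(theta):
        print("state", state, "policy", [round(p, 2) for p in softmax(prefs)])
```

Actions followed by a high return get their probability pushed up; with a return of zero the parameters do not move. Practical implementations usually subtract a baseline from the return to reduce variance, which this sketch leaves out.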

Written on July 31, 2025