Reinforcement learning and multi-armed bandit
Multi-Armed Bandit
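The N-armed bandit is the classic RL setup: you repeatedly choose among N options (arms). Start by choosing at random. As time passes, each choice returns feedback (a reward), and you can iterate on that new information to refine your estimate of how good each arm is. The core tension is exploration vs. exploitation: try arms you know little about, or keep pulling the one that looks best so far.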
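One common way to balance exploration and exploitation is epsilon-greedy: with a small probability pick a random arm, otherwise pick the arm with the best estimate so far. The sketch below is only an illustration; the function name `epsilon_greedy_bandit`, the Gaussian reward model, and the arm means are all assumptions made up for the example.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a simulated bandit with Gaussian rewards (assumed model)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how many times each arm was pulled
    estimates = [0.0] * n_arms   # running average reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                             # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])    # exploit
        reward = rng.gauss(true_means[arm], 1.0)                    # noisy feedback
        counts[arm] += 1
        # Incremental mean: new_avg = old_avg + (reward - old_avg) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Hypothetical arm means; the last arm is the best one.
estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 1.5])
print("estimates:", [round(e, 2) for e in estimates])
print("pull counts:", counts)
```

With these settings most pulls should end up on the best arm, while the other arms still get sampled occasionally through the random exploration step.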
Contextual Bandit
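Why settle for a one-size-fits-all strategy? If the goal is to maximize a particular score or value, why not use different strategies for different users and iterate from there? A contextual bandit does exactly this: before each choice it observes a context (for example, some features of the user) and learns which action works best for that context.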
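A minimal sketch of that idea, assuming the context is just a discrete user segment: keep a separate set of value estimates per segment and run epsilon-greedy within each one. Real contextual bandits usually generalize across contexts with a model (LinUCB, for example); the table-per-context approach, the name `contextual_epsilon_greedy`, and the reward probabilities here are simplifying assumptions for illustration.

```python
import random

def contextual_epsilon_greedy(reward_prob, steps=20000, epsilon=0.1, seed=0):
    """reward_prob[context][action] is a hypothetical probability of getting reward 1."""
    rng = random.Random(seed)
    n_contexts = len(reward_prob)
    n_actions = len(reward_prob[0])
    counts = [[0] * n_actions for _ in range(n_contexts)]
    values = [[0.0] * n_actions for _ in range(n_contexts)]

    for _ in range(steps):
        ctx = rng.randrange(n_contexts)   # a user from a random segment arrives
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)                             # explore
        else:
            action = max(range(n_actions), key=lambda a: values[ctx][a])  # exploit
        reward = 1.0 if rng.random() < reward_prob[ctx][action] else 0.0
        counts[ctx][action] += 1
        # Update the running average for this (context, action) pair only.
        values[ctx][action] += (reward - values[ctx][action]) / counts[ctx][action]

    best = [max(range(n_actions), key=lambda a: values[c][a]) for c in range(n_contexts)]
    return values, best

# Two made-up user segments that prefer opposite actions.
reward_prob = [[0.9, 0.1],
               [0.1, 0.9]]
values, best_action = contextual_epsilon_greedy(reward_prob)
print("best action per segment:", best_action)
```

A single context-free bandit would have to pick one action for everyone here, and since the two segments prefer opposite actions, it could do no better than a coin flip on average.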
Policy Gradient
A policy tells the agent what action to take in a given state. It has some parameters (and is usually a neural net). Goal: maximize expected return.
Using gradient ascent, we update the parameters in the direction that increases the expected return.
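To make the update concrete, here is a small sketch of a REINFORCE-style policy gradient on a stateless bandit. As a simplifying assumption, the policy is a softmax over one logit per action rather than a neural net, and a running-average baseline is subtracted from the reward to reduce variance. For a softmax policy, the gradient of log pi(a) with respect to logit k is (1 if k == a else 0) - pi(k). The names `softmax` and `reinforce_bandit` and the arm means are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(true_means, steps=5000, lr=0.05, seed=0):
    """Gradient ascent on expected reward with a softmax policy (assumed setup)."""
    rng = random.Random(seed)
    n_actions = len(true_means)
    theta = [0.0] * n_actions                     # policy parameters (logits)
    baseline = 0.0                                # running average of past rewards

    for t in range(1, steps + 1):
        probs = softmax(theta)
        action = rng.choices(range(n_actions), weights=probs)[0]   # sample from the policy
        reward = rng.gauss(true_means[action], 1.0)
        advantage = reward - baseline
        # Ascent step: theta += lr * advantage * grad of log pi(action)
        for k in range(n_actions):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * advantage * grad_log_pi
        baseline += (reward - baseline) / t       # update the baseline after using it
    return softmax(theta)

final_probs = reinforce_bandit([0.2, 0.5, 1.5])
print("final policy:", [round(p, 3) for p in final_probs])
```

Actions that return more than the baseline get their probability pushed up, and actions that return less get pushed down, so the policy should drift toward the arm with the highest mean reward.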
Written on July 31, 2025