Model-based RL (1/?)
This post is the first of a short series on model-based RL. We will start down this road via the principled trail: describing and analysing $\texttt{UCB-VI}$, a theoretically grounded algorithm for solving unknown finite MDPs. More precisely, we will see how this model-based approach explicitly estimates the unknown MDP and uses optimism to drive strategic exploration, provably finding “good” policies in finite time. Concentration inequalities and regret bounds incoming, fun!
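To fix ideas before the formal treatment, here is a minimal sketch of what "estimate the MDP, then plan optimistically" can look like in the tabular, finite-horizon case. It is an illustration under assumptions, not the exact $\texttt{UCB-VI}$ procedure: the function name `optimistic_value_iteration`, the arguments `counts`, `reward_sums` and `bonus_scale`, the uniform fallback for unvisited pairs, and the simplified bonus `horizon / sqrt(n)` (constants and logarithmic factors dropped) are all ours. Rewards are assumed to lie in $[0, 1]$.

```python
import numpy as np

def optimistic_value_iteration(counts, reward_sums, horizon, bonus_scale=1.0):
    """Backward induction on an empirical MDP with an exploration bonus.

    counts[s, a, s2]  : number of observed transitions (s, a) -> s2
    reward_sums[s, a] : sum of rewards observed at (s, a), each in [0, 1]
    Returns q of shape (horizon, n_states, n_actions).
    """
    n_states, n_actions, _ = counts.shape
    n_sa = counts.sum(axis=2)              # visits to each (s, a)
    safe_n = np.maximum(n_sa, 1)           # avoid dividing by zero

    # Empirical model; unvisited pairs fall back to uniform transitions
    # (an arbitrary choice made for this sketch).
    p_hat = np.where(n_sa[..., None] > 0,
                     counts / safe_n[..., None],
                     1.0 / n_states)
    r_hat = reward_sums / safe_n

    # Assumed bonus shape: large for rarely visited pairs, shrinking with data.
    bonus = bonus_scale * horizon / np.sqrt(safe_n)

    q = np.zeros((horizon, n_states, n_actions))
    v_next = np.zeros(n_states)            # value after the last step is 0
    for h in reversed(range(horizon)):
        # Optimistic Bellman backup, clipped at the largest achievable return.
        q[h] = np.minimum(r_hat + bonus + p_hat @ v_next, horizon - h)
        v_next = q[h].max(axis=1)
    return q
```

Acting greedily with respect to `q` favours state-action pairs that either look rewarding or have barely been tried, which is the sense in which optimism drives exploration. With no data at all, every pair receives the maximal value `horizon - h`.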