Post-Training is a Contextual Bandit
For someone coming from a control background, applying RL to text generation is hardly intuitive. Is there an actual, non-degenerate dynamical system at play here? What is the concrete novelty behind the shiny LLM post-training algorithms? This post attempts to answer those questions with a semi-formal derivation of the celebrated GRPO algorithm through a contextual bandit lens.