Reinforcement learning

A paradigm of machine learning, alongside supervised and unsupervised learning: an agent learns by interacting with an environment, choosing actions so as to maximize cumulative reward over time. Formalized as a Markov decision process; when deep neural networks approximate the policy or the value function, it becomes deep RL.

Extended definition

Reinforcement learning (RL) is the third paradigm of machine learning, distinct from supervised and unsupervised learning. Instead of learning from labeled examples, an agent learns by interaction: it observes the state of an environment, chooses an action, receives a reward, and transitions to a new state, adjusting its behavior to maximize cumulative reward over time. The formal framework is the Markov decision process, defined by a set of states, a set of actions, a reward function, and transition dynamics, usually together with a discount factor that weights future rewards against immediate ones. The policy is the rule that maps states to actions, and the value function estimates the expected cumulative (typically discounted) reward obtainable from a state. Deep reinforcement learning combines this scheme with neural networks that approximate the policy, the value function, or both. Arulkumaran and colleagues (2017) organize the field into value-based methods, such as deep Q-learning, and policy-based methods, such as policy gradient, with actor-critic methods combining a learned policy with a learned value function.
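
The pieces named above (states, actions, reward, transitions, policy, value estimates) can be made concrete with a small sketch. The example below is illustrative only: it assumes a hypothetical five-state corridor in which the agent starts at state 0 and is rewarded on reaching state 4, and it uses tabular Q-learning, one simple value-based method, rather than any deep network. The constants and the names step, train, and greedy_policy are invented for this illustration.

```python
import random

N_STATES = 5       # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (0, 1)   # 0 = move left, 1 = move right
GAMMA = 0.9        # discount factor
ALPHA = 0.1        # learning rate
EPSILON = 0.1      # probability of taking a random exploratory action


def step(state, action):
    """Transition dynamics and reward function of the toy MDP."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0   # reward only on reaching the goal
    return next_state, reward, done


def train(episodes=500, seed=0):
    """Learn action values Q(s, a) by interacting with the environment."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)          # explore
            else:
                best = max(q[state])                  # exploit, random tie-break
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next-state value.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Policy: map each non-terminal state to its highest-valued action."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


q_table = train()
policy = greedy_policy(q_table)
print(policy)  # the learned action for states 0..3
```

In this toy setting the learned policy moves right in every state, and the value of moving right shrinks by the discount factor with each extra step from the goal. Deep RL replaces the table q with a neural network, which is what lets the same idea scale to inputs such as raw pixels.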

When it applies

RL applies to sequential decision problems, where the present choice affects future states and rewards and there is no set of correct answers to imitate. It applies when there is a well-defined reward signal and an environment, real or simulated, with which the agent can interact many times. Mnih and colleagues (2015) demonstrated the paradigm with a single architecture and set of hyperparameters that learned to play 49 Atari games directly from pixels, reaching human-level performance on many of them. Silver and colleagues (2016) carried the idea to Go, defeating a human champion with a combination of deep networks and tree search. It applies today to robotics, control, systems optimization, sequential recommendation, and the tuning of language models by reinforcement learning from human feedback (RLHF).

When it does not apply

RL does not apply when the problem is in fact static prediction: if there are labels and no sequential decision, supervised learning solves it at far lower cost. It does not apply without a reliable reward signal; poorly specified rewards lead the agent to optimize the wrong objective, exploiting loopholes instead of solving the task. It does not apply cheaply: RL is notoriously sample-inefficient, requiring an enormous number of interactions, which makes it impractical when each real-world trial is costly or dangerous. It does not apply well when direct training is risky and no faithful simulator exists, because the gap between simulation and reality can invalidate the learned policy. And it does not apply comfortably where strict reproducibility is required: RL results are sensitive to seeds, hyperparameters, and implementation details.
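
The reward-loophole failure can be shown with a small sketch. The example below assumes a hypothetical corridor in which the intended task is to reach a goal at state 4, and a well-meant "progress" bonus is paid every time the agent steps onto a checkpoint tile. It uses value iteration, a planning method that requires known dynamics, only to compute the best policy under each reward; all names and constants are invented for this illustration.

```python
N_STATES = 5
GOAL = N_STATES - 1    # reaching state 4 ends the episode
CHECKPOINT = 2         # hypothetical "progress" tile that pays a bonus
GAMMA = 0.9            # discount factor
ACTIONS = (-1, +1)     # move left / move right


def move(state, action):
    """Deterministic transition, clipped to the corridor."""
    return min(GOAL, max(0, state + action))


def intended_reward(state, next_state):
    """What the designer actually wants: reach the goal."""
    return 1.0 if next_state == GOAL else 0.0


def proxy_reward(state, next_state):
    """Misspecified reward: the bonus can be collected again and again."""
    bonus = 1.0 if next_state == CHECKPOINT else 0.0
    return intended_reward(state, next_state) + bonus


def greedy_policy(reward_fn, sweeps=200):
    """Value iteration, then the greedy action for each non-terminal state."""
    values = [0.0] * N_STATES   # the terminal goal keeps value 0

    def q(s, a):
        nxt = move(s, a)
        return reward_fn(s, nxt) + GAMMA * values[nxt]

    for _ in range(sweeps):
        for s in range(GOAL):
            values[s] = max(q(s, a) for a in ACTIONS)
    return {s: max(ACTIONS, key=lambda a: q(s, a)) for s in range(GOAL)}


def reaches_goal(policy, start=0, max_steps=50):
    """Follow the policy and report whether it ever reaches the goal."""
    state = start
    for _ in range(max_steps):
        if state == GOAL:
            return True
        state = move(state, policy[state])
    return state == GOAL


print(reaches_goal(greedy_policy(intended_reward)))  # True
print(reaches_goal(greedy_policy(proxy_reward)))     # False
```

Under the proxy reward, stepping on and off the checkpoint forever is worth more than finishing, so the optimal policy never reaches the goal. The agent is not malfunctioning; it is maximizing exactly the reward it was given.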

Applications by field

  • Robotics and control: learning locomotion and manipulation policies, usually trained in simulation before transfer to the real world.
  • Games and simulation: the historical domain of the paradigm, from Atari to Go, used as a testbed for algorithms.
  • Language models: reinforcement learning from human feedback (RLHF), used to align a model’s outputs with human preferences.
  • Operations and optimization: systems control, resource allocation, and sequential recommendation, where the decision affects the future state.

Common pitfalls

The first pitfall is misspecifying the reward: the agent optimizes exactly what is measured, and a malformed objective produces behavior that satisfies the metric without solving the task. The second is underestimating sample inefficiency, planning the project as if interactions were cheap. The third is trusting a policy trained only in simulation without measuring the gap to the real world. The fourth is ignoring the fragility of results: without multiple seeds and honest reporting of variance, an apparent gain may be noise. The fifth is applying RL where a supervised method would suffice, paying the paradigm’s complexity without the sequential-decision need that justifies it.
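
The fourth pitfall suggests a simple reporting habit, sketched below under stated assumptions. The function run_experiment is a hypothetical stand-in for a full training run (here it only draws a noisy score), and the interval uses a normal approximation that is rough for small numbers of seeds; a t-based or bootstrap interval would be more careful.

```python
import math
import random
import statistics


def run_experiment(seed, true_score=100.0, noise=15.0):
    """Hypothetical stand-in for one full training run; returns a final score.

    A real project would train and evaluate an agent here. The Gaussian noise
    only mimics run-to-run variation caused by the seed.
    """
    rng = random.Random(seed)
    return rng.gauss(true_score, noise)


def summarize(scores):
    """Mean, sample standard deviation, and a rough 95% interval on the mean."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)                      # needs at least 2 runs
    half_width = 1.96 * std / math.sqrt(len(scores))    # normal approximation
    return {
        "n": len(scores),
        "mean": mean,
        "std": std,
        "ci95": (mean - half_width, mean + half_width),
    }


# Ten independent seeds per configuration; the two seed ranges are disjoint.
baseline = summarize([run_experiment(s, true_score=100.0) for s in range(10)])
variant = summarize([run_experiment(s, true_score=103.0) for s in range(1000, 1010)])

for name, result in (("baseline", baseline), ("variant", variant)):
    low, high = result["ci95"]
    print(f"{name}: mean={result['mean']:.1f} std={result['std']:.1f} "
          f"ci95=({low:.1f}, {high:.1f}) n={result['n']}")
```

If the two intervals overlap heavily, the honest conclusion is that the runs do not distinguish the variant from the baseline, whatever a single pair of seeds might have suggested.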
