RLHF

Module: fundamentals

What it is

Reinforcement Learning from Human Feedback (RLHF) is a training technique where human preferences guide model improvement. Humans compare or rate model outputs, and these preference judgments train a reward model that scores new outputs; the language model is then fine-tuned with reinforcement learning to produce outputs the reward model scores highly. This helps models learn what humans consider helpful, harmless, and honest.
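
Below is a minimal sketch of the reward-modeling step only, under simplifying assumptions: responses are represented as fixed toy embeddings rather than real language-model states, and the RewardModel class, dimensions, and random preference pairs are illustrative stand-ins. The loss is the standard pairwise (Bradley-Terry) objective commonly used to train reward models from human comparisons.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    # Maps a response representation to a single scalar reward score.
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

# Toy preference data: embeddings of the response a human preferred ("chosen")
# and the one they rejected. In a real pipeline these would come from the
# language model's representations of actual prompt/response pairs.
chosen = torch.randn(32, 64)
rejected = torch.randn(32, 64)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    # Pairwise loss: push the reward of the preferred response above the rejected one.
    loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Once trained, the reward model's scalar scores serve as the reward signal for the reinforcement-learning stage that further fine-tunes the language model.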

Why it matters

RLHF is why modern AI assistants behave like helpful conversational partners rather than merely continuing text the way a base language model does. It's the primary technique for aligning models with human values and expectations. RLHF helps models refuse harmful requests, stay on topic, and produce genuinely useful responses rather than technically correct but unhelpful ones.