Training Data

Module: fundamentals

What it is

Training data is the information used to train an AI model. For LLMs, this typically includes web pages, books, articles, code, and other text, often amounting to trillions of words. The model learns patterns from this data, so the quality, diversity, and recency of the training data significantly affect the model's capabilities.

Why it matters

Training data determines what models know and how they behave. Biases in training data become biases in the model. A model has a knowledge cutoff because its training data ends at a particular date, so it knows nothing about events after that point. When a model performs poorly on certain topics, insufficient or low-quality training data is often the cause.
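
To make the ideas of a knowledge cutoff and data quality more concrete, here is a minimal sketch of how a corpus might be filtered before training. It is a toy illustration, not a description of how any real LLM pipeline works: the `Document` class, the `build_training_set` function, the `min_words` threshold, and the sample corpus are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    """A hypothetical record for one piece of text in a corpus."""
    text: str
    source: str
    published: date


def build_training_set(docs, cutoff, min_words=5):
    """Keep documents published on or before `cutoff`,
    dropping very short texts and near-verbatim duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        # Anything newer than the cutoff never reaches the model;
        # this is what creates a knowledge cutoff.
        if doc.published > cutoff:
            continue
        # Crude quality filter (assumed threshold): skip very short texts.
        if len(doc.text.split()) < min_words:
            continue
        # Crude deduplication: compare lowercased, whitespace-normalized text.
        key = " ".join(doc.text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept


# Made-up sample corpus for illustration only.
corpus = [
    Document("The quick brown fox jumps over the lazy dog.", "web", date(2021, 3, 1)),
    Document("the quick  brown fox jumps over the lazy dog.", "web", date(2021, 4, 1)),
    Document("Click here", "web", date(2020, 1, 1)),
    Document("A new framework was released this year with many features.", "news", date(2024, 6, 1)),
]

training_set = build_training_set(corpus, cutoff=date(2023, 1, 1))
for doc in training_set:
    print(doc.published, doc.source, doc.text)
```

In this sketch only the first document survives: the second is a duplicate after normalization, the third is too short, and the fourth was published after the cutoff. A model trained on the result could not know about the 2024 document, which mirrors the knowledge-cutoff effect described above.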