Training Data
Module: fundamentals
What it is
Training data is the information used to train an AI model. For LLMs, this typically includes web pages, books, articles, code, and other text, often amounting to trillions of words. The model learns statistical patterns from this data. The quality, diversity, and recency of training data significantly affect model capabilities.
Why it matters
Training data largely determines what models know and how they behave. Biases in the training data tend to become biases in the model. Knowledge cutoffs exist because training data has an end date: the model has seen nothing published after it. When a model performs poorly on certain topics, insufficient or low-quality training data is often the cause.
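The two ideas above, a knowledge cutoff and thin coverage of a topic, can be illustrated with a toy corpus. This is a minimal sketch, not how any real training pipeline works: the Document class, its topic label, the apply_cutoff and topic_counts helpers, and the sample documents are all hypothetical and exist only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    text: str
    topic: str        # hypothetical label, assumed to be known in advance
    published: date


def apply_cutoff(docs, cutoff):
    """Keep only documents published on or before the cutoff date."""
    return [d for d in docs if d.published <= cutoff]


def topic_counts(docs):
    """Count documents per topic to expose thinly covered areas."""
    return Counter(d.topic for d in docs)


# Toy corpus: invented entries, for illustration only.
corpus = [
    Document("Intro to sorting algorithms", "programming", date(2021, 3, 1)),
    Document("Binary search explained", "programming", date(2022, 7, 15)),
    Document("Notes on a rare dialect", "linguistics", date(2020, 1, 10)),
    Document("Release notes for a new framework", "programming", date(2024, 2, 5)),
]

# Anything published after the cutoff never reaches the model.
training_set = apply_cutoff(corpus, cutoff=date(2023, 1, 1))

print(len(training_set))           # 3 of the 4 documents survive the cutoff
print(topic_counts(training_set))  # programming: 2, linguistics: 1
```

In this sketch the 2024 release notes fall after the cutoff, so a model trained on training_set would know nothing about them. The single linguistics document shows how a topic can end up underrepresented compared with others.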