Tokenisation
Module: fundamentals
What it is
Tokenisation is the process of converting text into tokens, the discrete units an AI model actually processes. Different models use different tokenisation schemes: some split on word boundaries, while most modern models use subword units (for example byte-pair encoding). The tokeniser is typically trained on the model's training corpus so that frequent character sequences map to single tokens.
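As a rough illustration of how a subword tokeniser behaves in practice, the sketch below uses the open-source tiktoken library (one choice among many; any tokeniser library would do) to encode a sentence into token IDs and decode the individual pieces back. The encoding name "cl100k_base" is just an example scheme.

```python
# A minimal sketch using the tiktoken library (pip install tiktoken).
# "cl100k_base" is one example of a subword (BPE) encoding; other models
# ship with different tokenisers, so the exact splits will vary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenisation converts text into tokens."
token_ids = enc.encode(text)                    # text -> list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]   # decode each ID on its own to see the split

print(token_ids)                      # the integers the model actually consumes
print(pieces)                         # the subword pieces those integers correspond to
print(enc.decode(token_ids) == text)  # round-trips back to the original text
```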
Why it matters
Tokenisation affects how models handle different languages and unusual text. English typically tokenises efficiently, but other languages can require noticeably more tokens to express the same meaning. Code, URLs, and technical terms may also tokenise in unexpected ways. Because API usage is commonly billed per token and models have fixed context windows, token counts directly affect cost and whether content fits within context limits.
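To see why this matters for cost and context budgeting, here is a short comparison sketch (again assuming tiktoken and the example "cl100k_base" encoding) that contrasts character counts with token counts for plain English, code, and a URL. The exact numbers depend on the tokeniser; the point is only that character length and token length diverge.

```python
# Compare character counts with token counts for different kinds of input.
# Counts are tokeniser-specific; this only illustrates that they differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "english": "The cat sat on the mat.",
    "code": "def add(a, b):\n    return a + b",
    "url": "https://example.com/path?query=value&page=2",
}

for name, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{name}: {len(text)} characters -> {n_tokens} tokens")
```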