Tokenisation

Module: fundamentals

What it is

Tokenisation is the process of converting text into tokens that an AI model can process. Different models use different tokenisation schemes: some split on word boundaries, while most modern models use subword units. The tokeniser's vocabulary is typically learned from a training corpus before model training, so that the text the model will see is represented as compactly as possible.
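
To make the idea concrete, here is a minimal sketch of greedy longest-match subword tokenisation in Python. The vocabulary and example text are invented for illustration; real tokenisers (for example, byte-pair encoding) learn vocabularies of tens of thousands of entries from large corpora.

```python
# Toy vocabulary, invented for illustration only.
VOCAB = {"the", " ", "un", "usual", "token", "isation", "is", "ation"}

def tokenise(text: str) -> list[str]:
    """Split text into the longest matching vocabulary entries, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, then fall back to shorter ones.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            # No vocabulary entry matches: emit the single character as-is.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenise("the unusual tokenisation"))
# ['the', ' ', 'un', 'usual', ' ', 'token', 'isation']
```

Note how "unusual" and "tokenisation" are split into subword pieces rather than treated as whole words, which is how subword tokenisers handle words that are rare or absent from their vocabulary.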

Why it matters

Tokenisation affects how models handle different languages and unusual text. English typically tokenises efficiently, but other languages may require more tokens to express the same meaning. Code, URLs, and technical terms may also tokenise in unexpected ways. Because pricing and context limits are measured in tokens, these differences directly affect cost and whether content fits within the context window.
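
A rough way to see these differences is to count tokens for different kinds of input. The sketch below uses the tiktoken library and its "cl100k_base" encoding as an assumed example; exact counts will vary between tokenisers and model families.

```python
# pip install tiktoken
import tiktoken

# "cl100k_base" is one of tiktoken's built-in encodings, used here for illustration.
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English prose": "The quick brown fox jumps over the lazy dog.",
    "Japanese prose": "こんにちは世界、お元気ですか。",
    "Code snippet": "def add(a, b):\n    return a + b",
    "URL": "https://example.com/path?query=value&page=2",
}

for label, text in samples.items():
    tokens = enc.encode(text)
    # Characters per token is a quick gauge of how efficiently text tokenises.
    print(f"{label}: {len(tokens)} tokens for {len(text)} characters")
```

Running a comparison like this before sending large or unusual inputs helps estimate cost and check whether the content will fit within a model's context limit.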