What are tokens?
A relatively simple explanation of tokens and why they are important
Tokens are chunks of text that a language model uses to understand and generate language. A token can be as small as a character or as long as a whole word, depending on the language and context.
For example:
The word "apple" is one token.
The word "unbelievable" might be broken into smaller tokens like "un", "believ", and "able".
Punctuation and spaces also count as tokens.
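You can see these splits yourself in code. Here's a minimal sketch using OpenAI's tiktoken library (`pip install tiktoken`); the `cl100k_base` encoding is just one example, and the exact pieces a word breaks into will vary from tokenizer to tokenizer, so the "un / believ / able" split above is illustrative rather than exact:

```python
import tiktoken

# One common encoding; other models and sites use different tokenizers,
# so your counts and splits may differ.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["apple", "unbelievable"]:
    ids = enc.encode(word)                     # list of token IDs
    pieces = [enc.decode([i]) for i in ids]    # decode each token back to text
    print(f"{word!r} -> {len(ids)} token(s): {pieces}")
```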
The model processes and limits text based on tokens rather than words. So when you hear something like a "4,000-token limit," that's how much input and output combined the model can handle at once. Each model counts tokens slightly differently, but one token typically equals around four characters of English text.
Writing systems other than the Latin alphabet (such as Chinese characters, Hangul, or Cyrillic) may be tokenized differently, and are often counted at roughly one token per symbol. So if your character contains non-Latin writing, there is a chance it may end up more token-heavy.
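To make that concrete, here's a small comparison sketch, again assuming tiktoken's `cl100k_base` encoding (the sample sentences are hypothetical, and exact counts depend on the tokenizer; the point is the characters-per-token ratio, which tends to be much lower for non-Latin scripts):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "She speaks softly but carries a big stick.",
    "Korean": "그녀는 부드럽게 말하지만 큰 지팡이를 들고 다닌다.",
}

for name, text in samples.items():
    n_tokens = len(enc.encode(text))
    # English usually lands near 4 chars/token; non-Latin scripts
    # often come closer to 1 char/token, making them "token-heavy".
    print(f"{name}: {len(text)} chars, {n_tokens} tokens "
          f"(~{len(text) / n_tokens:.1f} chars/token)")
```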
On most bot sites, you want to keep the "permanent" fields of the bot under 2,000 tokens (around 8,000 characters). For Charsnap this includes the Description, Personality, System Prompts, and Always Active System Prompts fields combined. According to the dev, chat quality starts to drop noticeably once a bot nears 3,000 tokens on Charsnap. Keep in mind that Charsnap has a vast Lorebook system! If your bot is getting too long, use a Lorebook!
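If you'd rather check that budget locally, here's a hedged sketch: paste each permanent field's text into the dictionary below and sum the counts. The field names mirror Charsnap's, but the "..." placeholders are yours to fill in, and Charsnap's own tokenizer may produce slightly different numbers than tiktoken:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Replace the "..." placeholders with your actual field text.
fields = {
    "Description": "...",
    "Personality": "...",
    "System Prompts": "...",
    "Always Active System Prompts": "...",
}

total = sum(len(enc.encode(text)) for text in fields.values())
print(f"Permanent fields total: {total} tokens")
if total > 2000:
    print("Over the ~2,000-token guideline; consider moving detail to a Lorebook.")
```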
To count tokens in your text, you can use an external tokenizer, such as the one from OpenAI: https://platform.openai.com/tokenizer