What Tokens and Context Windows Mean for Enterprise AI Decisions

A clear explanation of tokens and context windows, and why these two technical limits shape cost, accuracy and feasibility in enterprise AI projects.

By Edison Ngu, Founder, Edison AI | 30 May 2026 | 5 min read

Quick answer

Tokens are the unit of currency in every large language model: they determine what the model can process, how long it takes, and how much it costs. A context window is the hard ceiling on how much text the model can consider in a single interaction. Together, these two concepts constrain the design of virtually every enterprise AI workflow. Understanding them shifts AI decision-making from vendor-led to evidence-led.

What this means

A token is not a word. It is a subword unit, typically three to four characters of English text, produced by the tokeniser that pre-processes text before it enters the model. The word "implementation" might become two tokens; "AI" is typically one. Code, numbers and non-English text often tokenise less efficiently, meaning they consume more tokens relative to their information content.
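For early scoping, before a provider's tokeniser is to hand, a character-based estimate can stand in for a real count. The sketch below assumes the four-characters-per-token rule of thumb for English prose; `estimate_tokens` and its `chars_per_token` parameter are hypothetical names used for illustration, and actual counts should come from the tokeniser of the model being evaluated.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (a rule of thumb, not a tokeniser)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


prose = "Our implementation plan covers three business units."

# Default heuristic for English prose.
print(estimate_tokens(prose))

# A more conservative ratio, assumed here for code or non-English text
# that tokenises less efficiently.
print(estimate_tokens(prose, chars_per_token=3.0))
```

Lowering `chars_per_token` is a simple way to build a safety margin into estimates for content that is not plain English.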

A context window is the maximum number of tokens a model can attend to in a single request, input plus output combined. Modern frontier models have context windows ranging from roughly 128,000 tokens (approximately 96,000 words) to over one million tokens in some configurations. Every token fed into the model (system instructions, retrieved documents, conversation history and the current query) counts against that limit.
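The accounting described above can be sketched as a simple check that runs before a request is sent. The token figures and the function names `output_headroom` and `fits` are illustrative assumptions, and some providers also cap output tokens separately from the overall window, which this sketch does not model.

```python
from typing import Dict


def output_headroom(input_parts: Dict[str, int], context_window: int) -> int:
    """Tokens left for the response once every input part is counted."""
    return context_window - sum(input_parts.values())


def fits(input_parts: Dict[str, int], max_output_tokens: int, context_window: int) -> bool:
    """True if the input plus the reserved output stays within the window."""
    return output_headroom(input_parts, context_window) >= max_output_tokens


# Hypothetical token counts for one request.
request = {
    "system_instructions": 1_500,
    "retrieved_documents": 20_000,
    "conversation_history": 6_000,
    "current_query": 500,
}

print(output_headroom(request, context_window=128_000))
print(fits(request, max_output_tokens=4_000, context_window=128_000))
print(fits(request, max_output_tokens=4_000, context_window=30_000))
```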

Why it matters for business

Token economics directly determine the cost and feasibility of enterprise AI tasks. Most API-based AI pricing charges separately for input tokens and output tokens, with output typically more expensive. A workflow that processes a 200-page policy document once per query consumes far more tokens — and incurs far higher cost — than one that uses retrieval to inject only the three most relevant paragraphs.
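To make the comparison concrete, here is a minimal cost sketch. The prices (3.00 per million input tokens, 15.00 per million output tokens) and the token counts are assumptions chosen for illustration, not any provider's actual rates; a 200-page document is taken to be about 133,000 tokens, based on roughly 500 words per page and 0.75 words per token.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request when input and output are priced per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices per million tokens.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

# Whole 200-page document in every query (assumed ~133,000 tokens).
full_document = request_cost(133_000, 500, INPUT_PRICE, OUTPUT_PRICE)

# Retrieval injects only a few relevant paragraphs (assumed ~1,000 input tokens in total).
retrieval = request_cost(1_000, 500, INPUT_PRICE, OUTPUT_PRICE)

print(f"full document: {full_document:.4f} per query")
print(f"retrieval:     {retrieval:.4f} per query")
print(f"ratio:         {full_document / retrieval:.1f}x")
```

Retrieval adds its own costs (embedding, indexing, search infrastructure) that this per-request figure leaves out, so the ratio is an upper bound on the saving rather than a forecast.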

Context window size also affects accuracy. Attention quality across a very long context is not uniform: research has consistently shown that models perform less reliably when the information critical to a query is buried in the middle of a very long context, rather than near the start or end. This effect — sometimes called the "lost in the middle" problem — means that a larger context window does not automatically mean better answers. How context is structured matters as much as how much of it you provide.
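One structuring heuristic sometimes used in response to this effect is to place the highest-ranked retrieved passages at the start and end of the context and let the weakest ones fall in the middle. The sketch below is an illustration of that idea, not a method described in this article; `reorder_for_long_context` is a hypothetical helper, and whether it helps depends on the model and task, so it should be tested on representative queries.

```python
from typing import List


def reorder_for_long_context(chunks_by_relevance: List[str]) -> List[str]:
    """Place the most relevant chunks at the edges and the least relevant in the middle.

    Expects chunks sorted from most to least relevant.
    """
    front: List[str] = []
    back: List[str] = []
    for rank, chunk in enumerate(chunks_by_relevance):
        # Alternate: even ranks fill from the start, odd ranks from the end.
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant
print(reorder_for_long_context(ranked))
```

With five ranked chunks, the top two end up first and last, and the lowest-ranked chunk sits in the centre.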

Gartner predicts that by 2027, inaccurate AI cost and budget calculations will drive approximately 60% of large enterprises to adopt FinOps practices specifically for AI. Token consumption modelling is one of the primary inputs to those calculations.

How it works technically

When a request is sent to an LLM:

  1. Input tokenisation — the entire input string (system prompt + retrieved documents + chat history + user query) is tokenised and counted against the context limit.
  2. Attention over context — the model's attention mechanism processes relationships between all tokens in the context window simultaneously.
  3. Output generation — the model generates response tokens one at a time, each new token being appended to the context for the next prediction step, until the model produces a stop token or reaches the output limit.
  4. Billing calculation — the API provider counts total input tokens and total output tokens and applies the relevant per-token pricing tier.
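The generation and billing steps above can be sketched with a toy loop. Nothing here calls a real model: `next_token` is a stand-in callable that replays a fixed script, and `STOP_TOKEN` is a hypothetical marker. The point is only to show that each new token is predicted from the prompt plus everything generated so far, and that input and output are counted separately.

```python
from typing import Callable, List

STOP_TOKEN = "<stop>"  # hypothetical stop marker for this toy example


def generate(prompt_tokens: List[str],
             next_token: Callable[[List[str]], str],
             max_output_tokens: int) -> List[str]:
    """Generate tokens one at a time until a stop token or the output limit."""
    output: List[str] = []
    while len(output) < max_output_tokens:
        # Each prediction sees the prompt plus all tokens generated so far.
        token = next_token(prompt_tokens + output)
        if token == STOP_TOKEN:
            break
        output.append(token)
    return output


# Toy stand-in for a model: replays a fixed script regardless of context.
script = iter(["Tokens", "drive", "cost", STOP_TOKEN])
prompt = ["What", "drives", "cost", "?"]
reply = generate(prompt, lambda context: next(script), max_output_tokens=10)

# Billing counts the two sides separately.
print(f"input tokens: {len(prompt)}, output tokens: {len(reply)}")
```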

A practical approximation for scoping work: one million tokens is roughly equivalent to 750,000 words, or about 1,500 pages of dense business text.
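The same approximation can be written as a pair of conversions for quick scoping. The constants are taken from the figures above (0.75 words per token, and 500 words per page implied by 750,000 words over 1,500 pages); the function names are hypothetical, and real documents will vary with layout and language.

```python
WORDS_PER_TOKEN = 0.75  # from the approximation above
WORDS_PER_PAGE = 500    # implied by 750,000 words over 1,500 pages


def tokens_to_pages(tokens: int) -> float:
    """Approximate pages of dense business text represented by a token count."""
    return tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE


def pages_to_tokens(pages: int) -> int:
    """Approximate token count for a number of pages of dense business text."""
    return round(pages * WORDS_PER_PAGE / WORDS_PER_TOKEN)


print(tokens_to_pages(1_000_000))  # pages in one million tokens
print(pages_to_tokens(200))        # tokens in a 200-page document
```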

Practical implementation considerations

Token and context window management becomes operationally significant at scale. Key decisions include:

  • Context budgeting: For retrieval-augmented workflows, define maximum tokens allocated to retrieved content, system instructions, and conversation history. This prevents runaway costs on large document queries.
  • Chunking strategy: When documents exceed what fits usefully in one context, how they are split and indexed for retrieval directly affects which content the model sees. Poor chunking means critical context gets excluded.
  • Model selection by task: Long-context models cost more per token but are appropriate for tasks that genuinely require processing entire contracts or reports. Shorter-context, cheaper models are adequate for classification or short-answer tasks.
  • Output token limits: Setting a max-output-token parameter prevents verbose responses inflating API costs unnecessarily.
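A context budget of the kind described in the list above can be expressed as a small table of per-request allocations, with conversation history trimmed to fit its share. This is a minimal sketch under assumed numbers; `BUDGET`, `trim_history` and the turn sizes are hypothetical, and a production system might summarise older turns rather than drop them.

```python
from typing import List

# Hypothetical per-request budget, in tokens.
BUDGET = {"system": 1_500, "retrieved": 6_000, "history": 1_000, "max_output": 800}


def trim_history(turn_token_counts: List[int], history_budget: int) -> List[int]:
    """Return indices of the most recent turns that fit within the budget."""
    kept: List[int] = []
    used = 0
    # Walk backwards so the newest turns are considered first.
    for index in range(len(turn_token_counts) - 1, -1, -1):
        cost = turn_token_counts[index]
        if used + cost > history_budget:
            break
        kept.append(index)
        used += cost
    return kept[::-1]  # restore chronological order


turns = [400, 900, 300, 250]  # tokens per conversation turn, oldest first
kept = trim_history(turns, BUDGET["history"])
print(kept)

# Worst-case tokens per request if every allocation is fully used.
worst_case_total = sum(BUDGET.values())
print(worst_case_total)
```

Multiplying the worst-case total by expected request volume gives a ceiling for cost projections, which is more useful for governance than an average taken from a demo.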

Edison AI's AI training programmes include hands-on context window budgeting exercises for teams building or evaluating enterprise AI systems, ensuring cost projections reflect real-world token volumes rather than vendor demo conditions.

Common mistakes

  • Budgeting by word count rather than token count: token estimates must account for non-English content, code and special characters, which tokenise less efficiently.
  • Assuming the context window equals the useful processing window: attention quality degrades at extreme context lengths for many tasks. Retrieving the right content is more reliable than pasting everything in.
  • Ignoring system prompt token consumption: detailed system instructions can consume thousands of tokens per request. At scale, this adds up.
  • Not benchmarking cost before committing to a model: running representative workloads against multiple model providers before production commitment avoids late-stage cost surprises.
  • Conflating input and output pricing: output tokens often cost several times more than input tokens, with the multiple varying by provider and model. Workflows generating lengthy outputs need separate cost modelling.
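Two of these mistakes, overlooking system prompt overhead and blending input with output pricing, show up clearly in a monthly projection. The volumes and prices below are assumptions for illustration only, and `monthly_projection` is a hypothetical helper; it ignores discounts such as prompt caching or batch pricing that some providers offer.

```python
from typing import Dict


def monthly_projection(requests: int, system_tokens: int, variable_input_tokens: int,
                       output_tokens: int, input_price_per_m: float,
                       output_price_per_m: float) -> Dict[str, float]:
    """Split projected monthly spend into input, output and system prompt share."""
    input_cost = requests * (system_tokens + variable_input_tokens) * input_price_per_m / 1_000_000
    output_cost = requests * output_tokens * output_price_per_m / 1_000_000
    system_cost = requests * system_tokens * input_price_per_m / 1_000_000
    total = input_cost + output_cost
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total": total,
        # Fraction of total spend attributable to the system prompt alone.
        "system_prompt_share": system_cost / total if total else 0.0,
    }


# Hypothetical workload: 500,000 requests per month.
projection = monthly_projection(
    requests=500_000,
    system_tokens=2_000,
    variable_input_tokens=1_000,
    output_tokens=400,
    input_price_per_m=3.00,
    output_price_per_m=15.00,
)
for name, value in projection.items():
    print(f"{name}: {value:,.2f}")
```

In this assumed workload the output side accounts for a large part of the bill despite far fewer tokens, and the fixed system prompt is a substantial share of total spend.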

What leaders should do next

  1. Ask your technical team to produce a token consumption estimate for each major AI workflow before committing to a model or pricing tier.
  2. Evaluate whether retrieval-augmented generation is appropriate for any use case currently using full-document context; it is usually more cost-effective.
  3. Include context window behaviour in your model evaluation criteria, not just benchmark scores.
  4. Establish a token budget governance process as part of your AI operating model, so cost does not escalate unchecked as usage grows.

Edison AI runs practical AI training that turns this understanding into day-to-day team capability.

Frequently asked questions

  • What is a token in AI?

    A token is a chunk of text — roughly four characters or three-quarters of a word — that a language model reads and generates. Models price and limit work by tokens, not words.

  • What is a context window?

    A context window is the maximum amount of text, measured in tokens, a model can consider at once. Everything beyond it is ignored or must be retrieved separately.

  • Why do context windows matter for cost?

    You pay per token for input and output. Larger contexts can improve grounding but raise cost and latency, and they do not guarantee better answers, so right-sizing context is an operational decision, not just a technical one.
