Pillar 1 · Module 2

How LLMs Actually Work

By Robert Triebwasser · Foundations · Level 2 · ~15 min read

You've been using ChatGPT or Claude long enough to notice the weird stuff. It forgets what you told it earlier in the conversation. It invents a citation that sounds exactly right but doesn't exist. There's a “temperature” slider in the API that nobody ever explained. Every article you read uses the word “inference” like you should already know what it means. This module is the four-word glossary that makes all of it click: context window, inference, temperature, hallucination.

Watch: How LLMs Actually Work

The four words that explain 90% of the weird things LLMs do

Quick Recap of Module 1

If you skipped Module 1, the one-sentence version is this: a language model is a program that predicts the next word in a sequence. That's the entire trick. Everything the thing does — answer email, summarize contracts, write marketing copy — is that one simple operation scaled to absurd levels across a dataset the size of the public internet.

Module 1 gave you the mental model. This one zooms in by exactly one level: what actually happens each time you hit Enter?

The answer involves four words you've been seeing in articles, settings panels, and Slack threads without anyone ever defining them for you. Learn these four and roughly 90% of the weird behavior you've observed from LLMs stops being mysterious. You'll know what's happening, why it's happening, and what to do about it.

Word 1 — Context Window

The context window is the model's working memory. It's the chunk of text the model can "see" when it's generating its next word. And the important thing to internalize is this: anything outside that window may as well not exist to the model.

Context windows are measured in tokens. A token is roughly a word-chunk — about four characters of English on average. A short, common word like "the" is usually a single token; a long or rare word like "hippopotamus" splits into several; "don't" is typically two (the apostrophe breaks it). Don't stress the exact math. What matters is that when a product says "200,000 token context window," it means the model can work with about 150,000 words of text at once. That's roughly a short novel, which sounds like a lot.
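
The four-characters-per-token rule of thumb is enough for back-of-envelope budgeting. Here's a minimal sketch of that estimate in Python; real tokenizers (BPE-based, like OpenAI's tiktoken) will give somewhat different counts, and the function names here are invented for illustration:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token rule of thumb.

    A real tokenizer would give different counts; this is only for
    back-of-envelope budgeting, never for billing-accurate math.
    """
    return max(1, round(len(text) / 4))


def fits_in_window(text: str, window_tokens: int = 200_000) -> bool:
    """Check whether a text roughly fits in a given context window."""
    return estimate_tokens(text) <= window_tokens


short_novel = "word " * 150_000          # ~150,000 words, ~750,000 characters
print(estimate_tokens(short_novel))      # a bit under 200,000 tokens
print(fits_in_window(short_novel))       # True, but only just
```

Run the same estimate on anything you're about to paste into a chat and you'll quickly develop an instinct for how fast the window fills.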

Here's what goes inside that window, all competing for the same space:

  • The system prompt the tool adds behind the scenes (telling the model who it is, how to behave, what it can and can't do)
  • Any documents, PDFs, or files you pasted in or uploaded
  • Your current message
  • Every previous message in the conversation — yours and the model's
  • The reply the model is in the middle of generating

All of that, at once, every single time the model produces a word. When the window fills up, something has to give — and typically it's the oldest material that rolls off the top. This is why, in a long chat, the model seems to "forget" what you told it an hour ago. It didn't forget. The earliest part of the conversation simply scrolled out of its field of view.
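The "oldest material rolls off" behavior can be sketched in a few lines. This is a simplified illustration, not any vendor's actual truncation logic (real systems vary, and the names here are invented); it keeps the system prompt and drops the oldest turns once a token budget is exceeded:

```python
def build_context(system_prompt, history, budget_tokens, est=lambda t: len(t) // 4):
    """Assemble a context window, dropping the OLDEST turns when over budget.

    `history` is a list of message strings, oldest first. The system prompt
    is always kept; old conversation turns roll off first, which is why a
    long chat "forgets" its beginning. Token counts use a crude chars/4
    estimate; a real system would use the model's actual tokenizer.
    """
    kept = []
    used = est(system_prompt)
    # Walk history newest-first, keeping turns while they still fit.
    for msg in reversed(history):
        cost = est(msg)
        if used + cost > budget_tokens:
            break  # everything older than this has scrolled out of view
        kept.append(msg)
        used += cost
    return [system_prompt] + list(reversed(kept))


history = [f"turn {i}: " + "x" * 400 for i in range(10)]  # 10 turns, ~100 tokens each
ctx = build_context("You are a helpful assistant.", history, budget_tokens=500)
print(len(ctx) - 1)  # only the most recent turns survived the cut
```

Notice that nothing in this sketch marks the dropped turns as "forgotten": from the model's point of view they simply were never there.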

Now here's the piece that almost nobody tells you. A bigger context window is not the same as a bigger brain. Anthropic, the company behind Claude, published a piece in 2025 on a phenomenon called context rot: as the context fills up, the model's ability to actually use what's in there degrades, even well below the stated limit. You'll feel it as the model getting vaguer, missing details you explicitly mentioned, or repeating itself. It's not a bug. It's the price of asking the model to pay attention to too many things at once.

Practical consequence: curate, don't dump. If you're working on a long task, don't paste your entire knowledge base into one chat. Summarize first. Extract the pieces that actually matter. Start a fresh chat for a new topic instead of letting one conversation balloon into an hour-long sprawl. If the model seems to be "getting dumber" mid-conversation, it probably is — the fix isn't prompting harder, it's starting over with a cleaner context.

Word 2 — Inference

Inference is the word researchers use for "the thing that happens between you hitting Enter and the reply showing up." It's the counterpart to training. Training is the months-long, multi-million-dollar process of teaching a model by showing it trillions of words. Inference is what happens when you use the finished model. Every time you ask ChatGPT a question, you're running inference.

Here's what inference actually looks like under the hood, stripped of jargon:

  1. The model reads your entire context window — every token of it — in one pass.
  2. It computes a probability distribution over every possible next token. For a model with a 50,000-token vocabulary, that's 50,000 little probabilities adding up to 1.0.
  3. It picks one token from that distribution.
  4. It sticks that token onto the end of the context.
  5. It goes back to step 1 and does the whole thing again for the next token. And the next. And the next.

One token at a time. That's it. That's why replies stream in front of you word by word instead of appearing all at once — you're literally watching inference happen in real time. Each new word is a fresh pass through the entire model with one more token of context than the pass before.
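The five-step loop above can be sketched with a toy "model" standing in for the neural network. Everything here is invented for illustration (a real LLM computes the distribution over ~50,000+ tokens with billions of parameters), but the loop around it is genuinely the same shape:

```python
import random

def toy_model(context):
    """Stand-in for the neural network: context in, next-token distribution out."""
    if context[-1] == "the":
        return {"mat": 0.4, "chair": 0.2, "rug": 0.15, "couch": 0.1, "moon": 0.15}
    return {"the": 0.5, "sat": 0.3, "on": 0.2}


def generate(context, n_tokens, seed=0):
    """Run 'inference': re-read the whole context, pick one token, append, repeat."""
    rng = random.Random(seed)
    context = list(context)
    for _ in range(n_tokens):
        dist = toy_model(context)                         # steps 1-2: full pass, distribution
        tokens, probs = zip(*dist.items())
        next_tok = rng.choices(tokens, weights=probs)[0]  # step 3: sample one token
        context.append(next_tok)                          # step 4: append to context
    return context                                        # step 5 was the loop itself


print(generate(["The", "cat", "sat", "on", "the"], n_tokens=3))
```

Each pass through the loop is one streamed word on your screen, and note that the loop re-reads the whole (growing) context every iteration rather than remembering anything between tokens.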

Two practical consequences fall out of this.

First, cost and speed are both functions of tokens. The tool you're using charges (directly or indirectly) for every input token it reads and every output token it produces. Short prompts and short outputs are fast and cheap. Long prompts and long outputs are slow and expensive. If your AI workflow feels sluggish, the first place to look is how much text is going in and out, not how "powerful" the model is.
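The cost arithmetic is simple enough to write down. The per-million-token prices below are hypothetical placeholders (check your provider's current pricing); the shape of the formula is the point — you pay for every token read and every token generated:

```python
def inference_cost(input_tokens, output_tokens,
                   price_in_per_m=3.00, price_out_per_m=15.00):
    """Estimate one call's cost in dollars.

    The default per-million-token prices are HYPOTHETICAL placeholders,
    not any real provider's rates. You pay for tokens read (input) plus
    tokens generated (output).
    """
    return (input_tokens / 1_000_000) * price_in_per_m + \
           (output_tokens / 1_000_000) * price_out_per_m


# A bloated prompt vs. a curated one, same 500-token answer:
print(inference_cost(100_000, 500))  # dumping a whole knowledge base in
print(inference_cost(2_000, 500))    # summarizing first
```

The curated prompt is more than twenty times cheaper here, and proportionally faster to read, for the same answer length.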

Second, and this one surprises people: the model itself keeps no state between your messages. Every turn of the conversation, it re-reads the entire chat history from scratch. (Some providers cache parts of the prompt to make that re-read faster and cheaper, but that's a speed optimization, not memory.) There's no running "memory" quietly accumulating in the background. The model isn't "learning you" as you go. It's doing the same cold read every single time — just with one more message on the end than it had last turn. That's why it sometimes contradicts something it said three messages ago: from its perspective, every message is fresh input to the same stateless machine.

Word 3 — Temperature

Remember step 2 of inference — the model computing a probability distribution over possible next tokens? Temperature is the knob that controls what happens at step 3, when it has to pick one.

Here's the clearest way to think about it. Imagine the model has just computed that the next word in "The cat sat on the ___" is 40% likely to be "mat," 20% likely to be "chair," 15% likely to be "rug," 10% likely to be "couch," and a long tail of everything else.

Low temperature (0 to 0.3): almost always pick the top option. The output becomes predictable, repeatable, and a little boring. Run the same prompt twice, you'll get very similar answers. This is what you want for code, data extraction, structured output, factual Q&A, or anywhere that consistency and repeatability matter more than variety.

Higher temperature (0.7 to 1.0 and beyond): sample more freely from the full distribution. "Chair" and "rug" start showing up more often. Sometimes "windowsill" or "keyboard" sneaks through. The output becomes varied, occasionally surprising, sometimes weird. This is what you want for brainstorming, creative writing, ideation, marketing copy, or anywhere you'd rather see fifteen different takes than the same safe answer fifteen times.
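The mechanics behind the dial can be sketched directly: divide the log-probabilities by the temperature, re-normalize, then sample. This is a simplified illustration using the article's "cat sat on the ___" distribution; real inference stacks do the same reshaping on raw logits, often combined with other sampling tricks (top-p, top-k) not shown here:

```python
import math
import random

def sample_with_temperature(dist, temperature, seed=None):
    """Re-weight a next-token distribution by temperature, then sample.

    Temperature near 0 concentrates mass on the top token (predictable);
    temperature above 1 flattens the distribution so the long tail of
    unlikely tokens shows up more often (varied, occasionally weird).
    """
    tokens = list(dist)
    if temperature <= 0.01:                  # near-zero: just take the top option
        return max(tokens, key=dist.get)
    scaled = [math.log(dist[t]) / temperature for t in tokens]
    m = max(scaled)
    exp = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    total = sum(exp)
    probs = [e / total for e in exp]
    return random.Random(seed).choices(tokens, weights=probs)[0]


dist = {"mat": 0.40, "chair": 0.20, "rug": 0.15, "couch": 0.10, "windowsill": 0.15}
print(sample_with_temperature(dist, 0.0))    # always "mat"

counts = {}
for i in range(1000):
    t = sample_with_temperature(dist, 1.2, seed=i)
    counts[t] = counts.get(t, 0) + 1
print(counts)  # at higher temperature, the long tail shows up regularly
```

Run the loop at temperature 0.3 instead of 1.2 and "mat" dominates almost completely; the distribution never changed, only how boldly the sampler explores it.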

Now the myth-bust, because this is the single most misunderstood concept in practical LLM use:

Temperature is a creativity dial, not an accuracy dial.

Low temperature does not make the model "more truthful." A low-temperature model will still hallucinate — it'll just hallucinate the same plausible-sounding wrong answer every time you ask. High temperature doesn't make the model "dumber." A high-temperature model can still get factual questions right; it'll just phrase them in more varied ways. Temperature has no opinion about truth. It only controls how willing the model is to pick something other than its top guess.

Practical reality: in consumer chat apps like ChatGPT, Claude.ai, or Gemini, you usually can't change the temperature, and you don't need to. The apps ship with a sensible default (typically somewhere around 0.7 to 1.0) tuned for general conversation. The setting starts mattering when you're using the API directly, building with an agent framework, or wiring a model into a product. When that happens: start at the default, only touch it if the outputs are too same-y (raise it) or too chaotic (lower it). Match the dial to the job.

Word 4 — Hallucination

Hallucination is the most important word in this module, and the one people most often get wrong. A hallucination is what happens when a language model generates text that sounds plausible, is delivered with complete confidence, and is factually wrong. The fake case law. The invented URL. The quote from a person who never said it. The 2021 statistic that's actually from 2018, off by a factor of two.

The critical thing to understand about hallucinations is this: they are not a bug. They're not something the labs are about to fix in the next model release. They're baked into how the technology works, and once you understand why, you'll stop being surprised by them and start designing your workflow around them. Let me explain the two reasons they happen.

Reason one: inference is sampling, not lookup. The model does not have a database of facts it consults. It doesn't "know" anything in the sense that you or I know things. What it has is a statistical shape of text, and when you ask it a question, it generates the text that's most likely to come after your question given everything it saw in training. For common facts — the capital of France, the formula for water, the lyrics of a famous song — plausible and true overlap almost perfectly, and the model gets it right. For rare or specific facts — an obscure citation, an exact revenue figure, a person who wasn't famous, anything that happened after the training cutoff — plausible and true come apart. The model doesn't know they've come apart. It generates something plausible anyway. That's a hallucination.

Reason two: fine-tuning trains models to always give an answer. Andrej Karpathy calls this part "LLM psychology." During the fine-tuning phase, when humans rate a model's outputs to shape its behavior, confident and helpful answers get rewarded. "I don't know" rarely does. Over millions of training examples, the model learns a deep lesson: when in doubt, improvise confidently. That's why you almost never see a real uncertainty signal from an LLM — the expression of uncertainty was trained out of it. When the model doesn't know, it doesn't hesitate. It invents.

Where this gets dangerous is anything with a proper noun the model saw rarely. Citations. Case law. Quotes. Phone numbers. URLs. Medical dosages. Legal statutes. Recent events. Specific dates. Exact figures. Names of people who aren't famous. These are the zones where "plausible" and "true" diverge most sharply, and where confident-sounding hallucination is most likely to burn you.

The business-practical rule: trust an LLM for structure, not for specifics. Ask it to draft the contract clause, then have a lawyer check the statute. Ask it to write the email, then fact-check the dollar amount. Ask it to summarize the report, then spot-check the summary against the section that actually matters. The first draft is almost free. The verification is where your expertise earns its keep.

Hallucination is not fixable by prompting harder. Not by asking the model to "only answer if you're sure." Not by saying "this is important, don't make anything up." The model has no internal sensor for whether it's making something up. It's generating plausible text. Plausible is the only thing it knows how to do.

Why These Four Words Matter Together

Step back. All four of these concepts come from the same root cause: LLMs are probability machines, not retrieval machines.

The context window is what they can see. Inference is how they compute. Temperature controls how they pick. Hallucination is what happens when the picker picks the plausible-sounding wrong answer. Every quirk you've observed in ChatGPT or Claude — the forgetting, the inventing, the sometimes-too-varied, sometimes-too-repetitive — traces back to one of these four.

Once you see the pattern, a lot of the weird stuff stops being weird. The model "forgot" what you told it earlier? Context window rolled over. It gave three different answers to the same question in three sessions? Temperature sampling. It confidently cited a paper that doesn't exist? Hallucination, working exactly as designed. Every reply streaming in one word at a time? Inference happening live, token by token.

This is the instruction manual nobody ships with these tools. The good news is: you only need to learn it once, and it works across every LLM on the market. Claude, ChatGPT, Gemini, Llama, Mistral — same four mechanics, same four failure modes, same four rules for using them well.

What This Means For Your Work

Here are the four operational rules that fall out of this module. Each one maps to one of the four words.

1. Curate your context — don't dump

A bigger context window isn't a bigger brain. If the model seems to be losing the thread, summarize and start fresh instead of pasting more. The best-performing prompts are focused, not comprehensive.

2. Expect hallucinations — build verification in

Never take a number, a date, a name, a quote, or a citation from an LLM without checking it. Make verification part of the workflow, not an afterthought you remember to do sometimes. The cost of a confident-sounding wrong answer is higher than the cost of spending 30 seconds fact-checking.
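One cheap way to make verification routine is to mechanically flag the risky specifics before a human reads the draft. Here's a minimal sketch of that idea; the patterns and names are invented for this example, and note that it cannot check truth — it only surfaces the spans a person should verify:

```python
import re

# Patterns for the kinds of specifics this module says to always verify.
# This does NOT check truth; it only flags spans a human should verify.
RISKY_PATTERNS = {
    "url":      re.compile(r"https?://\S+"),
    "year":     re.compile(r"\b(19|20)\d{2}\b"),
    "money":    re.compile(r"[$\u20ac\u00a3]\s?\d[\d,.]*"),
    "citation": re.compile(r"\b\w+ et al\.", re.IGNORECASE),
    "percent":  re.compile(r"\b\d+(\.\d+)?%"),
}


def flag_for_verification(text):
    """Return {kind: [matches]} for specifics an LLM may have hallucinated."""
    hits = {}
    for kind, pat in RISKY_PATTERNS.items():
        found = [m.group(0) for m in pat.finditer(text)]
        if found:
            hits[kind] = found
    return hits


draft = ("Per Smith et al., revenue grew 14% to $2.3M in 2021; "
         "see https://example.com/report for details.")
print(flag_for_verification(draft))
```

Even a crude filter like this turns "remember to fact-check" into a checklist the workflow produces for you.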

3. Leave temperature alone unless you know why you're changing it

Defaults are defaults for a reason. If you're writing code or extracting structured data and the outputs are inconsistent, lower it. If you're brainstorming and everything sounds the same, raise it. Otherwise: don't touch it.

4. Understand that "the model forgot" is not a bug

When a long chat goes sideways, the window rolled over. That's not the model being broken — that's the model operating exactly as designed. The fix is a new conversation with a cleaner context, not more arguing with the model about what you already told it.

Four words. Four rules. This is the day-to-day operator's manual for every LLM you'll touch. Learn it once. Apply it forever.

Sources

The six sources this module was built from. If you want to go deeper than the video on any one of the four words, this is where to start.

Next up
Module 3 — The AI Landscape in 2026

Want this drilled into your team, with real examples?

Book a free 15-minute discovery call and I'll show you how this module plugs into the broader AI training I run for organizations.

Book a Free Call →