Attention as a Call Stack:
A Mental Model for Prompting

Many people interact with language models as if the LLM understands what we want. Type in a prompt, hope for the best. If the result isn't right? Nudge it. "Try again." "Not quite." Still off? Start over.
This trial-and-error dance is familiar to anyone who's tried to get an LLM to do something specific. But the problem isn't just the prompt. It's the way we think about prompting altogether. You need a way to debug it. To trace it. To reason about it. To attribute it.
I want to propose a mental model: attention as the call stack.
For software engineers turned AI engineers, this model is intuitive.
A Developer's Mental Model: Attention as the Call Stack
When debugging software, we trace execution through call stacks. Each function builds on the previous, forming a clear path from input to output. With LLMs, that path is hidden. Prompts go in, text comes out. The reasoning process is opaque.
But attention gives us structure. In this model:
- The context window is the model's working memory
- Attention is the execution trace
- Prompt tokens are variable assignments or function calls
This turns prompting into cognitive programming. The context window is your call stack. Attention determines how each token influences what comes next.
From Text to Execution: The Transformer Pipeline
1. Tokenization and Embeddings
Before the model can do anything, text must become numbers.
Tokenization: "The queen sat" → [464, 12675, 7983]
Embedding lookup: Maps token IDs to dense vectors (e.g., 1536 dimensions)
Positional encoding: Adds position info so the model knows order
Think of these embeddings as coordinates in high-dimensional semantic space. "Queen" might lie near "king," "royalty," and "British" because of their usage patterns in training data.
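A minimal numpy sketch of this stage, assuming a toy whitespace vocabulary and sinusoidal positional encodings (real models use learned subword tokenizers and far larger embedding tables):

```python
import numpy as np

# Toy vocabulary and embedding table (hypothetical; real models use a
# learned BPE tokenizer and a table with ~50k rows of 1536+ dimensions).
vocab = {"The": 0, "queen": 1, "sat": 2, "on": 3, "the": 4, "throne": 5}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def tokenize(text):
    """Whitespace 'tokenizer' standing in for real subword tokenization."""
    return [vocab[word] for word in text.split()]

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors, as in Vaswani et al. (2017)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

token_ids = tokenize("The queen sat")                  # -> [0, 1, 2]
x = embedding_table[token_ids]                         # embedding lookup: (3, d_model)
x = x + positional_encoding(len(token_ids), d_model)   # inject order information
```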
2. Layer Normalization
Each embedding is normalized:
- Subtract the mean
- Divide by the standard deviation
- Scale and shift via learned parameters
This stabilizes training and inference by ensuring consistent value ranges.
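In code, the operation is just a few lines (a sketch; gamma and beta stand in for the learned scale and shift parameters):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta
```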
3. Self-Attention (The Execution Path)
The transformer is built from stacked layers of multi-head self-attention.
Each layer builds richer representations. Each head focuses on a different aspect: syntax, semantics, co-reference, etc.
Step 1: Create Q, K, V Projections
Each token embedding is passed through learned matrices to produce:
Q (Query): What am I looking for?
K (Key): What do I offer?
V (Value): What content do I contain?
These are computed for each attention head independently.
Step 2: Compute Attention Scores
For every token i, compare its query Qᵢ to every key Kⱼ:
score[i,j] = (Qᵢ ⋅ Kⱼ) / √d_k
The dot product measures how relevant token j is to token i; scaling by √d_k (the key dimension) keeps the scores in a stable range.
Step 3: Softmax Normalization
To turn raw scores into probabilities:
attention_weight[i, :] = softmax(score[i, :])
Each row forms a probability distribution over all tokens.
Step 4: Weighted Sum of Values
Use attention weights to blend value vectors:
outputᵢ = Σⱼ (attention_weight[i,j] × Vⱼ)
Each token embedding becomes context-aware—influenced by tokens it attends to.
This happens:
- Per token
- Per attention head
- Per layer
Resulting in deep, contextualized representations across the entire stack.
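Putting the four steps together, here is a minimal single-head sketch in numpy. W_q, W_k, W_v are random stand-ins for the learned projection matrices, and the causal mask that stops tokens from attending to later positions is omitted for brevity:

```python
import numpy as np

def softmax(scores, axis=-1):
    exp = np.exp(scores - scores.max(axis=axis, keepdims=True))
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    # Step 1: project each token embedding into query, key, value.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = K.shape[-1]
    # Step 2: compare every query against every key (scaled dot product).
    scores = Q @ K.T / np.sqrt(d_k)            # (seq, seq)
    # Step 3: each row becomes a probability distribution over tokens.
    weights = softmax(scores, axis=-1)
    # Step 4: blend value vectors by those weights.
    return weights @ V, weights

seq_len, d_model, d_head = 3, 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, attn = self_attention(x, W_q, W_k, W_v)   # attn[i, j]: how much token i attends to token j
```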
4. Multi-Head Attention
Multiple heads capture different patterns:
Head 1: Subject-verb agreement
Head 2: Coreference resolution
Head 3: Semantic similarity
Head 4: Syntax tracking
Their outputs are concatenated and passed to the next layer.
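Continuing the numpy sketch above, the head outputs are concatenated and mixed by a learned output projection (W_o below is a random placeholder):

```python
def multi_head_attention(x, head_weights, W_o):
    # Run each head independently, then concatenate and project back to d_model.
    head_outputs = [self_attention(x, W_q, W_k, W_v)[0]
                    for (W_q, W_k, W_v) in head_weights]
    return np.concatenate(head_outputs, axis=-1) @ W_o

n_heads = 2
head_weights = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
                for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
y = multi_head_attention(x, head_weights, W_o)   # (seq, d_model), ready for the next layer
```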
5. Feedforward + Normalization
After attention, each token embedding is passed through:
- A feedforward neural network (applied to each position independently)
- Residual connections and another normalization layer
This repeats across all transformer layers. The final token embedding captures nuanced context from the entire prompt.
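Still within the same sketch, one simplified (post-norm) transformer block ties these pieces together; W1, b1, W2, b2 stand in for the feedforward parameters:

```python
def feed_forward(x, W1, b1, W2, b2):
    # Position-wise MLP: expand, apply a nonlinearity, project back down.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2        # ReLU for simplicity

def transformer_block(x, head_weights, W_o, ffn_params, gamma, beta):
    # Attention sublayer, with a residual connection and normalization.
    x = layer_norm(x + multi_head_attention(x, head_weights, W_o), gamma, beta)
    # Feedforward sublayer, again with a residual connection and normalization.
    x = layer_norm(x + feed_forward(x, *ffn_params), gamma, beta)
    return x
```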
From Embedding to Prediction
After the final layer:
- The output for the last token is projected into vocabulary space (e.g., 50,000 tokens)
- Produces logits: raw scores for each possible next token
- Softmax converts logits into probabilities
Example:
throne: 64%
bed: 22%
lap: 12%
floor: 2%
Depending on decoding strategy, the model:
- Picks the highest (argmax), or
- Samples probabilistically (top-k, nucleus, temperature, etc.)
The selected token is appended to the prompt, and the cycle repeats.
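A sketch of this decoding step. The logits below are hypothetical, chosen so the softmax roughly reproduces the 64/22/12/2 split above:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Turn raw logits into a next-token choice."""
    if rng is None:
        rng = np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))                 # greedy / argmax decoding
    logits = logits / temperature                     # sharpen (<1) or flatten (>1)
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)  # drop everything outside top-k
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Hypothetical logits for [throne, bed, lap, floor].
logits = np.array([3.2, 2.1, 1.5, -0.2])              # softmax ≈ [0.64, 0.22, 0.12, 0.02]
next_id = sample_next_token(logits, temperature=0.8, top_k=3)
```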
Why the Call Stack Model Matters
Now that we know how it works, let's return to our metaphor:
Context window ≈ call stack
Tokens ≈ variable assignments or function calls
Attention ≈ execution trace
Earlier tokens initialize the stack. Later tokens override or build on it. Each new token is computed by tracing which earlier tokens had influence.
Instructions vs. Execution
Instruction: "You are a smart AI."
Execution Cue: "Let's break this down step-by-step. First, identify the knowns."
The former adds little to attention. The latter shapes it.
Prompting isn't about telling the model what to be. It's about steering how it thinks. Execution cues trigger meaningful attention pathways; instructions are often ignored unless reinforced.
Debugging Like an Engineer
Visualize Attention as Stack Traces
Heatmaps and attribution graphs help visualize:
- What tokens influenced what outputs
- How attention flows over long contexts
Just like in software:
High-attention tokens ≈ recently called functions
Vanishing influence ≈ dropped variables
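With open-weight models you can pull these traces out directly. A sketch using GPT-2 via the Hugging Face transformers library (assuming torch and transformers are installed; hosted API models typically don't expose attention weights):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The queen sat on the", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shaped (batch, heads, seq, seq).
last_layer = outputs.attentions[-1][0]                # (heads, seq, seq)
avg_attention = last_layer.mean(dim=0)                # average over heads
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, token in enumerate(tokens):
    j = int(avg_attention[i].argmax())
    print(f"{token!r} attends most to {tokens[j]!r} ({avg_attention[i, j].item():.2f})")
```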
Recency and Position Effects
Like memory scopes in code:
- Recent tokens have stronger influence (recency bias)
- Early tokens set interpretive frames (primacy effect)
- Mid tokens fade unless reinforced
This is why buried instructions get ignored. Repetition, strategic priming, and placement matter.
Prompt Engineering = Cognitive Programming
Ask yourself:
- What are my priming tokens?
- What state am I initializing early?
- What behavior am I reinforcing near the output?
- Are my examples being attended to or ignored?
Each edit to your prompt is like changing a variable or control flow. The structure matters as much as the content.
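As a concrete, hypothetical illustration of that checklist: prime the frame early, put examples where they will be attended to, and place the behavior you want reinforced closest to the output.

```python
# Hypothetical prompt assembly: the structure mirrors the checklist above.
ticket_text = "Login page times out intermittently since this morning's deploy."

prompt = "\n\n".join([
    # Early tokens initialize the stack: frame and constraints first.
    "Task: summarize a support ticket for an on-call engineer.",
    "Constraints: exactly 3 bullet points, include severity, no pleasantries.",
    # Middle tokens carry the example the model should attend to.
    "Example ticket: Checkout fails with 500 errors after deploy 1.4.2.",
    "Example summary:\n- Severity: high\n- Checkout returns 500s since 1.4.2\n- Suspect payment-service deploy",
    # Final tokens sit closest to generation: reinforce the desired behavior.
    "Now summarize the following ticket in the same format:",
    ticket_text,
])
```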
The Illusion of Understanding
When LLMs write fluidly, they feel intelligent. But they aren't collaborating. They're executing a process.
No cognition. No intention. Just prediction.
Why do they feel smart? Because the right prompt stumbles into the right activation pattern. Change one token and the illusion collapses.
Most prompting advice focuses on instructions. But instructions don't reliably activate behavior. Structure and placement do.
This is why LLMs output generic LinkedIn fluff when asked to write a "LinkedIn post." That's the average. Not a bug. It's the most likely pattern.
More Than a Stochastic Parrot
Two modern capabilities push LLMs beyond mere parroting:
1. Post-Training Alignment
Reinforcement Learning from Human Feedback (RLHF) fine-tunes models to follow intent, not just statistical likelihood. This makes them feel cooperative and helpful.
2. In-Context Learning
Larger models can generalize from a few examples within a prompt. They exhibit "emergent" behaviors like learning tasks without weight updates.
Together, these make priming powerful. You can guide behavior—if you know how attention flows.
Cracking the Black Box
Anthropic's Tracing Thoughts and Attribution Graphs show that:
- LLMs plan ahead, activating rhyme candidates or thematic arcs
- Internal activations track reasoning-like steps (e.g. Dallas → Texas)
The takeaway? There is structure inside the model. But to guide it, we must speak its language: placement, pattern, and weight—not vague instructions.
Final Takeaways
- LLMs don't follow instructions. They predict tokens based on context.
- Attention is execution. It routes information from prior tokens to shape the next.
- Recency and placement matter. Later tokens override. Early ones initialize.
- Prompting is programming. Examples are function outputs. Tokens are variables.
- The context window is your call stack. Learn to read it.
If you treat LLMs like APIs, you'll always be guessing.
But if you treat them like virtual machines with memory, attention, and execution flow?
You can start to debug.
References
- Vaswani, A., et al. (2017). "Attention Is All You Need." arxiv.org/abs/1706.03762
- Christiano, P., et al. (2017). "Deep Reinforcement Learning from Human Preferences." arxiv.org/abs/1706.03741
- Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." arxiv.org/abs/2203.02155
- Wei, J., et al. (2022). "Emergent Abilities of Large Language Models." arxiv.org/abs/2206.07682
- Brown, T.B., et al. (2020). "Language Models are Few-Shot Learners." arxiv.org/abs/2005.14165
- Anthropic (2024). "Tracing Thoughts: How Language Models Plan Ahead." anthropic.com/research/tracing-thoughts-language-model
- Wang, K., et al. (2022). "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small." arxiv.org/abs/2211.00593
- Elhage, N., et al. (2025). "Attribution graphs: Tracing language model computations beyond attention." transformer-circuits.pub/2025/attribution-graphs