Attention as a Call Stack:
A Mental Model for Prompting

Many people interact with language models as if the LLM understands what we want. Type in a prompt, hope for the best. If the result isn't right? Nudge it. "Try again." "Not quite." Still off? Start over.
This trial-and-error dance is familiar to anyone who's tried to get an LLM to do something specific. But the problem isn't just the prompt. It's the way we think about prompting altogether. You need a way to debug it. To trace it. To reason about it. To attribute it.
I want to propose a mental model: attention as the call stack.
For software engineers turned AI engineers, this model is intuitive.
A Developer's Mental Model: Attention as the Call Stack
When debugging software, we trace execution through call stacks. Each function builds on the previous, forming a clear path from input to output. With LLMs, that path is hidden. Prompts go in, text comes out. The reasoning process is opaque.
But attention gives us structure. In this model:
- The context window is the model's working memory
- Attention is the execution trace
- Prompt tokens are variable assignments or function calls
This turns prompting into cognitive programming. The context window is your call stack. Attention determines how each token influences what comes next.
From Text to Execution: The Transformer Pipeline
1. Tokenization and Embeddings
Before the model can do anything, text must become numbers.
Tokenization: "The queen sat" → [464, 12675, 7983]
Embedding lookup: Maps token IDs to dense vectors (e.g., 1536 dimensions)
Positional encoding: Adds position info so the model knows order
Think of these embeddings as coordinates in high-dimensional semantic space. "Queen" might lie near "king," "royalty," and "British" because of their usage patterns in training data.
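A minimal numpy sketch of this stage, assuming a toy whitespace vocabulary and sinusoidal positional encodings (real models use learned subword tokenizers and far larger embedding tables):

```python
import numpy as np

# Toy vocabulary and embedding table (hypothetical; real models use a
# learned BPE tokenizer and a table with ~50k rows of 1536+ dimensions).
vocab = {"The": 0, "queen": 1, "sat": 2, "on": 3, "the": 4, "throne": 5}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def tokenize(text):
    """Whitespace 'tokenizer' standing in for real subword tokenization."""
    return [vocab[word] for word in text.split()]

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors, as in Vaswani et al. (2017)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

token_ids = tokenize("The queen sat")                  # -> [0, 1, 2]
x = embedding_table[token_ids]                         # embedding lookup: (3, d_model)
x = x + positional_encoding(len(token_ids), d_model)   # inject order information
```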
2. Layer Normalization
Each embedding is normalized:
- Subtract the mean
- Divide by the standard deviation
- Scale and shift via learned parameters
This stabilizes training and inference by ensuring consistent value ranges.
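In code, the operation is just a few lines (a sketch; gamma and beta stand in for the learned scale and shift parameters):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta
```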
3. Self-Attention (The Execution Path)
The transformer is built from stacked layers of multi-head self-attention.
Each layer builds richer representations. Each head focuses on a different aspect: syntax, semantics, co-reference, etc.
Step 1: Create Q, K, V Projections
Each token embedding is passed through learned matrices to produce:
Q (Query): What am I looking for?
K (Key): What do I offer?
V (Value): What content do I contain?
These are computed for each attention head independently.
Step 2: Compute Attention Scores
For every token i, compare its query Qᵢ to every key Kⱼ:
score[i,j] = (Qᵢ ⋅ Kⱼ) / √d_k
The dot product measures how relevant token j is to token i; scaling by √d_k (the key dimension) keeps the scores in a stable range.
Step 3: Softmax Normalization
To turn raw scores into probabilities:
attention_weight[i, :] = softmax(score[i, :])
Each row forms a probability distribution over all tokens.
Step 4: Weighted Sum of Values
Use attention weights to blend value vectors:
outputᵢ = Σⱼ (attention_weight[i,j] × Vⱼ)
Each token embedding becomes context-aware—influenced by tokens it attends to.
This happens:
- Per token
- Per attention head
- Per layer
Resulting in deep, contextualized representations across the entire stack.
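Putting the four steps together, here is a minimal single-head sketch in numpy. W_q, W_k, W_v are random stand-ins for the learned projection matrices, and the causal mask that stops tokens from attending to later positions is omitted for brevity:

```python
import numpy as np

def softmax(scores, axis=-1):
    exp = np.exp(scores - scores.max(axis=axis, keepdims=True))
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    # Step 1: project each token embedding into query, key, value.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = K.shape[-1]
    # Step 2: compare every query against every key (scaled dot product).
    scores = Q @ K.T / np.sqrt(d_k)            # (seq, seq)
    # Step 3: each row becomes a probability distribution over tokens.
    weights = softmax(scores, axis=-1)
    # Step 4: blend value vectors by those weights.
    return weights @ V, weights

seq_len, d_model, d_head = 3, 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, attn = self_attention(x, W_q, W_k, W_v)   # attn[i, j]: how much token i attends to token j
```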
4. Multi-Head Attention
Multiple heads capture different patterns:
Head 1: Subject-verb agreement
Head 2: Coreference resolution
Head 3: Semantic similarity
Head 4: Syntax tracking
Their outputs are concatenated and passed to the next layer.
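Continuing the numpy sketch above, the head outputs are concatenated and mixed by a learned output projection (W_o below is a random placeholder):

```python
def multi_head_attention(x, head_weights, W_o):
    # Run each head independently, then concatenate and project back to d_model.
    head_outputs = [self_attention(x, W_q, W_k, W_v)[0]
                    for (W_q, W_k, W_v) in head_weights]
    return np.concatenate(head_outputs, axis=-1) @ W_o

n_heads = 2
head_weights = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
                for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
y = multi_head_attention(x, head_weights, W_o)   # (seq, d_model), ready for the next layer
```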
5. Feedforward + Normalization
After attention, each token embedding is passed through:
- A feedforward neural network (applied to each position independently)
- Residual connections and another normalization layer
This repeats across all transformer layers. The final token embedding captures nuanced context from the entire prompt.
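Still within the same sketch, one simplified (post-norm) transformer block ties these pieces together; W1, b1, W2, b2 stand in for the feedforward parameters:

```python
def feed_forward(x, W1, b1, W2, b2):
    # Position-wise MLP: expand, apply a nonlinearity, project back down.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2        # ReLU for simplicity

def transformer_block(x, head_weights, W_o, ffn_params, gamma, beta):
    # Attention sublayer, with a residual connection and normalization.
    x = layer_norm(x + multi_head_attention(x, head_weights, W_o), gamma, beta)
    # Feedforward sublayer, again with a residual connection and normalization.
    x = layer_norm(x + feed_forward(x, *ffn_params), gamma, beta)
    return x
```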
From Embedding to Prediction
After the final layer:
- The output for the last token is projected into vocabulary space (e.g., 50,000 tokens)
- Produces logits: raw scores for each possible next token
- Softmax converts logits into probabilities
Example:
throne: 64%
bed: 22%
lap: 12%
floor: 2%
Depending on decoding strategy, the model:
- Picks the highest (argmax), or
- Samples probabilistically (top-k, nucleus, temperature, etc.)
The selected token is appended to the prompt, and the cycle repeats.
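A sketch of this decoding step. The logits below are hypothetical, chosen so the softmax roughly reproduces the 64/22/12/2 split above:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Turn raw logits into a next-token choice."""
    if rng is None:
        rng = np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))                 # greedy / argmax decoding
    logits = logits / temperature                     # sharpen (<1) or flatten (>1)
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)  # drop everything outside top-k
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Hypothetical logits for [throne, bed, lap, floor].
logits = np.array([3.2, 2.1, 1.5, -0.2])              # softmax ≈ [0.64, 0.22, 0.12, 0.02]
next_id = sample_next_token(logits, temperature=0.8, top_k=3)
```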
Why the Call Stack Model Matters
Now that we know how it works, let's return to our metaphor:
Context window ≈ call stack
Tokens ≈ variable assignments or function calls
Attention ≈ execution trace
Earlier tokens initialize the stack. Later tokens override or build on it. Each new token is computed by tracing which earlier tokens had influence.
Instructions vs. Execution
Instruction: "You are a smart AI."
Execution Cue: "Let's break this down step-by-step. First, identify the knowns."
The former adds little to attention. The latter shapes it.
Prompting isn't about telling the model what to be. It's about steering how it thinks. Execution cues trigger meaningful attention pathways; instructions are often ignored unless reinforced.
Debugging Like an Engineer
Visualize Attention as Stack Traces
Heatmaps and attribution graphs help visualize:
- What tokens influenced what outputs
- How attention flows over long contexts
Just like in software:
High-attention tokens ≈ recently called functions
Vanishing influence ≈ dropped variables
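With open-weight models you can pull these traces out directly. A sketch using GPT-2 via the Hugging Face transformers library (assuming torch and transformers are installed; hosted API models typically don't expose attention weights):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The queen sat on the", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shaped (batch, heads, seq, seq).
last_layer = outputs.attentions[-1][0]                # (heads, seq, seq)
avg_attention = last_layer.mean(dim=0)                # average over heads
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, token in enumerate(tokens):
    j = int(avg_attention[i].argmax())
    print(f"{token!r} attends most to {tokens[j]!r} ({avg_attention[i, j].item():.2f})")
```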
Recency and Position Effects
Like memory scopes in code:
- Recent tokens have stronger influence (recency bias)
- Early tokens set interpretive frames (primacy effect)
- Mid tokens fade unless reinforced
This is why buried instructions get ignored. Repetition, strategic priming, and placement matter.
Prompt Engineering = Cognitive Programming
Ask yourself:
- What are my priming tokens?
- What state am I initializing early?
- What behavior am I reinforcing near the output?
- Are my examples being attended to or ignored?
Each edit to your prompt is like changing a variable or control flow. The structure matters as much as the content.
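As a concrete, hypothetical illustration of that checklist: prime the frame early, put examples where they will be attended to, and place the behavior you want reinforced closest to the output.

```python
# Hypothetical prompt assembly: the structure mirrors the checklist above.
ticket_text = "Login page times out intermittently since this morning's deploy."

prompt = "\n\n".join([
    # Early tokens initialize the stack: frame and constraints first.
    "Task: summarize a support ticket for an on-call engineer.",
    "Constraints: exactly 3 bullet points, include severity, no pleasantries.",
    # Middle tokens carry the example the model should attend to.
    "Example ticket: Checkout fails with 500 errors after deploy 1.4.2.",
    "Example summary:\n- Severity: high\n- Checkout returns 500s since 1.4.2\n- Suspect payment-service deploy",
    # Final tokens sit closest to generation: reinforce the desired behavior.
    "Now summarize the following ticket in the same format:",
    ticket_text,
])
```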
The Illusion of Understanding
When LLMs write fluidly, they feel intelligent. But they aren't collaborating. They're executing a process.
No cognition. No intention. Just prediction.
Why do they feel smart? Because the right prompt stumbles into the right activation pattern. Change one token and the illusion collapses.
Most prompting advice focuses on instructions. But instructions don't reliably activate behavior. Structure and placement do.
This is why LLMs output generic LinkedIn fluff when asked to write a "LinkedIn post." That's the average. Not a bug. It's the most likely pattern.
More Than a Stochastic Parrot
Two modern capabilities push LLMs beyond mere parroting:
1. Post-Training Alignment
Reinforcement Learning from Human Feedback (RLHF) fine-tunes models to follow intent, not just statistical likelihood. This makes them feel cooperative and helpful.
2. In-Context Learning
Larger models can generalize from a few examples within a prompt. They exhibit "emergent" behaviors like learning tasks without weight updates.
Together, these make priming powerful. You can guide behavior—if you know how attention flows.
Cracking the Black Box
Anthropic's Tracing Thoughts and Attribution Graphs show that:
- LLMs plan ahead, activating rhyme candidates or thematic arcs
- Internal activations track reasoning-like steps (e.g. Dallas → Texas)
The takeaway? There is structure inside the model. But to guide it, we must speak its language: placement, pattern, and weight—not vague instructions.
Final Takeaways
- LLMs don't follow instructions. They predict tokens based on context.
- Attention is execution. It routes information from prior tokens to shape the next.
- Recency and placement matter. Later tokens override. Early ones initialize.
- Prompting is programming. Examples are function outputs. Tokens are variables.
- The context window is your call stack. Learn to read it.
If you treat LLMs like APIs, you'll always be guessing.
But if you treat them like virtual machines with memory, attention, and execution flow?
You can start to debug.
References
- Vaswani, A., et al. (2017). "Attention Is All You Need." arxiv.org/abs/1706.03762
- Christiano, P., et al. (2017). "Deep Reinforcement Learning from Human Preferences." arxiv.org/abs/1706.03741
- Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." arxiv.org/abs/2203.02155
- Wei, J., et al. (2022). "Emergent Abilities of Large Language Models." arxiv.org/abs/2206.07682
- Brown, T.B., et al. (2020). "Language Models are Few-Shot Learners." arxiv.org/abs/2005.14165
- Anthropic (2024). "Tracing Thoughts: How Language Models Plan Ahead." anthropic.com/research/tracing-thoughts-language-model
- Wang, K., et al. (2022). "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small." arxiv.org/abs/2211.00593
- Elhage, N., et al. (2025). "Attribution graphs: Tracing language model computations beyond attention." transformer-circuits.pub/2025/attribution-graphs