The model is biased to produce response-shaped continuations when input is instruction-shaped
The next-token loop. The model is still picking tokens conditioned on the prefix.
Helpful, harmless, honest patterns from RLHF show up as default tendencies
The fact that “instruction” is a property of input tokens, not of a separate execution mode
The full mechanics are in Phase 4. For prompting, the working assumption is that the model in front of you wants to follow instructions that look like instructions.
Application input: system prompt + user-supplied or retrieved data.
Failure mode: that data contains instruction-shaped tokens.
Result: the model has no robust way to tell whose instruction
is authoritative; it may follow the injected one.
Variant
Who is the attacker?
Where it shows up
Direct injection
The user typing the prompt
Anywhere a user message is concatenated into a prompt
Indirect injection
A third party who controls retrieved content (webpage, PDF, email, support ticket)
RAG, email summarizers, scraping pipelines, browsing tools. The user is the victim, not the attacker.
Mitigation
What it does
What it does not do
Instruction-hierarchy training
Teaches the model to weight system instructions over user instructions
Eliminate the gap; raises the bar, does not turn it into a wall
Channel separation (system vs user roles in the API)
Reduces the surface area where injected instructions get authority
Prevent injection in user-supplied content that is concatenated into prompts
Output filtering / sandboxing
Limits the damage an injected instruction can do to downstream systems
Stop the model from following the injection in the first place
Jailbreak vs injection
Threat model
Jailbreak
Attacker is also the user; bypasses the model’s refusal training on the attacker’s own prompt
Injection
Attacker is not the user; hides instructions in user-supplied or retrieved content that the operator concatenates into a prompt
Design rule: treat any user-supplied or web-fetched content inside a prompt as untrusted. Do not give a model on top of that input access to anything you would not let the untrusted input control directly.
No. Every prompt is just input tokens conditioning the next-token loop. Structural choice swamps wording choice.
Cargo-culting role prompts
”Act as a senior X” often makes no measurable difference; the model performs the persona at the cost of doing the task.
Treating the system prompt as enforcement
It is guidance the model is biased toward following, not a sandbox.
Conflating prompting with fine-tuning
Different tools, different cost profiles. Prompting is per-call and free of training cost; fine-tuning persists across calls without paying token cost.
Prompt: the input tokens you control. Becomes the prefix that conditions the model’s next-token loop.
System prompt: a separate, conceptually higher-trust input that sets standing instructions for the conversation. Mechanically just more tokens at the start; “system” comes from an API contract plus a learned training-time bias.
Role prompt: an instruction that asks the model to adopt a persona (“you are a…”). May appear in the system or user channel.
Instruction-tuning: post-training (covered in Phase 4) that biases the model to produce response-shaped continuations when input is instruction-shaped.
Prompt injection: instruction-shaped content hidden in user-supplied or web-fetched data that the model may follow as if it were operator instructions.
Direct injection: the user is the attacker.
Indirect injection: the attacker is a third party whose content the application retrieves (RAG, email, scraping). The user is the victim.
Jailbreak: a different threat model where the attacker is the user and is trying to get the model to bypass its refusal training on the attacker’s own prompt.
Instruction-hierarchy training: post-training technique that teaches the model to weight some instruction sources (typically the system channel) over others. Reduces but does not eliminate prompt-injection vulnerability.
A prompt is just input tokens. The model follows instructions because it was trained to. That is also why it follows the wrong ones when they are hidden in its input.