Skip to content

How prompting works: mechanics, system prompts, and prompt injection

This is lesson 2 of Phase 5 (How we steer models at inference) in Track 5 (AI Foundations). Modern chat assistants follow instructions because they were post-trained to. The full mechanics of that post-training (supervised fine-tuning, reward modeling, RLHF) are covered in Phase 4. For this lesson the working assumption is that the assistant in front of you wants to follow instructions. Course materials are at cme295.stanford.edu.

This lesson covers the foundation that all prompting patterns sit on top of: what a prompt actually is at the token level (just input tokens conditioning the next-token prediction loop from the previous lesson), what a system prompt does (a high-priority hint backed by an API contract plus a training-time bias toward weighting system over user, not a hard execution boundary), and why every instruction-tuned model is structurally vulnerable to prompt injection (the same property that makes the model useful, instruction-following over input tokens, is what makes it follow injected instructions). The lesson distinguishes direct injection (the user is the attacker) from indirect injection (a third party hides instructions in retrieved content the application later concatenates into a prompt) and from jailbreaks (different threat model, same family of failures). It closes by naming where mitigations actually live (at the application layer, not in the prompt) and what each common mitigation does and does not do. Few-shot prompting and chain-of-thought get their own dedicated lessons next.

This is lesson 2 of Phase 5, How we steer models at inference. The previous lesson (Token by token: how a transformer generates text) showed how the generation loop works. This lesson covers the mechanics-and-trust-boundaries layer that all prompting patterns sit on top of. The next two Phase 5 lessons, How few-shot examples teach in context and How chain of thought makes models think out loud, cover the deep teaching of the two prompting patterns this lesson only names. Phase 6 then opens the reasoning-and-agents arc, starting with reasoning models (where chain-of-thought is trained into the model rather than prompted in).

Prerequisites: the text generation lesson is required, since prompting is ultimately just choosing the input tokens that condition the next-token prediction loop covered there. If you want to understand why the model follows instructions in the first place (the SFT and RLHF training that produces an instruction-tuned assistant), Phase 4 covers the post-training pipeline; it is not required reading before this one.

  • Explain what a prompt actually is at the token level (the input tokens you control, conditioning the model’s next-token prediction loop) and why that mental model is more useful than treating prompting as “magic words”
  • Describe what a system prompt does and what makes it different from a user message (an API-contract-plus-training-time-bias, not a hard execution boundary)
  • Distinguish direct prompt injection (the user is the attacker) from indirect prompt injection (a third party hides instructions in retrieved content the application later concatenates into a prompt)
  • Distinguish prompt injection from jailbreaks (different threat models, same family of failures)
  • Recognize where mitigations actually live (at the application layer, not in the prompt) and what each common mitigation does and does not do
  • Read time: about 14 minutes
  • Practice time: about 12 minutes (a trust-boundary spotting exercise on three small prompt-application designs, plus a redesign exercise)
  • Difficulty: standard