Clawdemy Lessons

Free AI literacy for everyday users. Bite-size narrated lessons that turn fear into fluency, one topic at a time.
https://clawdemy.org/en
Clawdemy | hello@clawdemy.org

Data, part 2, filtering, deduplication, mixing, synthetic
https://clawdemy.org/lessons/build-an-llm-from-scratch/data-filtering/lesson/
Lesson 12 of Track 15. The later stages of the funnel from lesson 11: heuristic and classifier filtering, exact / near-duplicate / substring deduplication, mixing weights (increasingly learned rather than hand-tuned), and the fast-growing category of synthetic data. Taught technical-not-legal throughout: legal and policy debates about training data are out of scope here.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Data, part 1, sources and datasets
https://clawdemy.org/lessons/build-an-llm-from-scratch/data-sources/lesson/
Lesson 11 of Track 15. Where the trillions of training tokens scaling laws demand actually come from. Six source categories (web crawls, wikis, books, code, math/academic, social/forum), the reference open datasets (The Pile, RedPajama, FineWeb, RefinedWeb), the 50-to-1000x raw-to-final funnel, and the sampling-weight intuitions that shape what the model becomes good at.
Taught technical-not-legal throughout: legal and policy debates around training data are out of scope here.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Evaluation, measuring a language model
https://clawdemy.org/lessons/build-an-llm-from-scratch/evaluation/lesson/
Lesson 10 of Track 15. Scaling laws predict loss; what you care about is capability. This lesson covers the four benchmark formats (multiple-choice, executable, instruction-following, open-ended), the four reasons evaluation is hard (construct validity, contamination, format sensitivity, open-ended scoring), the practical defenses against each, and the layered pragmatic stack modern LLM teams actually run. The discipline of treating any single number with suspicion is what bridges loss to capability honestly.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 12:00

How models run on hardware, GPUs and TPUs
https://clawdemy.org/lessons/build-an-llm-from-scratch/gpus-and-tpus/lesson/
Lesson 5 of Track 15, opening Phase 2. Phase 1 built the model; this phase makes it run fast. The lesson opens the chip itself: how a GPU executes math (SIMT, streaming multiprocessors, tensor cores), the memory hierarchy that decides whether the math is fed (HBM, SRAM, registers), how TPUs differ (systolic arrays), and why hardware shapes architecture choices in lesson 2's terms.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Inference, serving a trained model fast
https://clawdemy.org/lessons/build-an-llm-from-scratch/inference/lesson/
Lesson 8 of Track 15, closing Phase 2. Inference is a different cost problem than training: mostly memory bandwidth in decode, not compute. This lesson covers the prefill/decode split, the KV cache as the central object, and the techniques that turn memory-bound decode into something efficient: continuous batching, paged attention, speculative decoding, and quantization, plus a note on how parallelism shows up differently at inference than at training.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Writing fast kernels, Triton and XLA
https://clawdemy.org/lessons/build-an-llm-from-scratch/kernels-triton-xla/lesson/
Lesson 6 of Track 15. The code-level lever for raising arithmetic intensity from lesson 2. What a kernel is, why fusing operations is the single biggest performance lever (keep intermediates in SRAM/registers, round-trip HBM once), and the two practical paths: Triton (write block-level kernels in Python; the compiler handles warps/registers/tiling) and XLA (a graph compiler that fuses standard ops automatically). FlashAttention as the worked example: same math, ~2-4x faster, large memory savings.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Training across many devices, parallelism
https://clawdemy.org/lessons/build-an-llm-from-scratch/parallelism/lesson/
Lesson 7 of Track 15, collapsing CS336 Lectures 7 and 8. Lesson 2's 16N memory accounting already exceeds one GPU; frontier models are far larger.
This lesson covers the three classic parallelism schemes (data, tensor, pipeline), the modern sharded variant (FSDP/ZeRO), the within-node vs across-nodes placement rules, and how 3D parallelism combines all of them for frontier-scale training. The lesson-2 accounting becomes an actionable cluster configuration.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Post-training, SFT and RLHF
https://clawdemy.org/lessons/build-an-llm-from-scratch/post-training-sft-rlhf/lesson/
Lesson 13 of Track 15. How a pretrained base model becomes a usable assistant. Supervised fine-tuning on instruction-response data, then preference tuning on `(prompt, A, B, preferred)` data via RLHF (reward model + PPO) or its simpler successor DPO (closed-form-derived loss; no reward model, no RL step; modern default). Taught technical-primer throughout: what the methods do mechanically, with no contested-alignment-as-safety framing.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Reasoning and alignment, RL with verifiable rewards
https://clawdemy.org/lessons/build-an-llm-from-scratch/reasoning-rl/lesson/
Lesson 14 of Track 15, the track capstone. Builds on CS336 Lecture 16 (post-training RLVR); the RL-as-systems framing is the lesson's own synthesis. RL with verifiable rewards (RLVR) replaces RLHF's learned reward model with a verifiable check (math grader, code tests, puzzle validator); GRPO is the modern algorithm, in TRL alongside SFTTrainer/DPOTrainer. DeepSeek R1 and Open R1 are the landscape anchors. RL at LLM scale is mostly a systems problem (sample + verify + train workers), and the lesson closes the track with the synthesis: you can now build the whole pipeline, and the durable method outlasts the next frontier. Taught technical-primer; contested alignment debates out of scope.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Scaling laws, predicting what bigger gets you
https://clawdemy.org/lessons/build-an-llm-from-scratch/scaling-laws/lesson/
Lesson 9 of Track 15, opening Phase 3.
Scaling laws turn the budget question (bigger model or more data?) from folklore into arithmetic. This lesson collapses CS336 Lectures 9 and 11 per Phase 0: the power-law form, the Kaplan-to-Chinchilla shift (D ~ 20N tokens per parameter), how the laws turn a fixed compute budget into an optimal (N, D), and how inference cost pushes modern open models past Chinchilla-optimal in practice.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Recovering the third dimension, 3D vision
https://clawdemy.org/lessons/computer-vision/3d-vision/lesson/
The world is three-dimensional; photographs are two-dimensional. Every camera capture collapses one dimension (depth) that has to be recovered if a vision system wants to interact with the world physically. This lesson covers how vision recovers 3D structure from 2D images. We meet the depth cues (stereo disparity, monocular priors, motion), the 3D representations (depth maps, voxels, point clouds, meshes, implicit / SDFs, NeRF), the standard methods (monocular depth like MiDaS, multi-view stereo, Structure from Motion via COLMAP, NeRF, 3D Gaussian Splatting), and work one stereo-disparity-to-depth calculation by hand (`Z = (f · b) / d`).
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

The architectures that cracked vision, AlexNet to ResNet
https://clawdemy.org/lessons/computer-vision/cnn-architectures/lesson/
Lesson 5 introduced the conv layer. This lesson is the story of how it actually got stacked, between 2012 and 2015, into the architectures that cracked computer vision. We walk four landmarks (AlexNet, VGG, GoogLeNet, ResNet) with their key ideas, parameter counts, and ImageNet results, then explain ResNet's residual block (`y = F(x) + x`) and why identity shortcuts solved the optimization-difficulty problem that had capped depth. The folded subsection on training at scale covers data parallelism, model parallelism, and the engineering tricks (mixed precision, learning-rate warmup, the linear scaling rule) that let modern vision models train on hundreds to thousands of accelerators while the underlying gradient descent algorithm stays unchanged.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

How machines see local patterns, convolution
https://clawdemy.org/lessons/computer-vision/convolution-and-cnns/lesson/
Phase 2 opener. The general-purpose classifier from Phase 1 would technically work on images, but its first layer is wasteful (a single FC neuron on a 224x224x3 input holds 150,528 weights) and blind to the spatial structure of images. This lesson replaces that layer with the convolution: a small learned filter slides spatially across the input, computing a dot product with each local patch and producing a feature map of where its pattern occurred. We work one filter (a vertical-edge detector) by hand on a 5x5 image, name the three hyperparameters (depth K, stride S, padding P), state the exact output spatial-size formula `(W - F + 2P) / S + 1`, and count the parameter savings (AlexNet's first conv layer = 34,944 parameters, the same number for any input image size). The training loop on top is unchanged.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Beyond what is it, detection, segmentation, and seeing inside the net
https://clawdemy.org/lessons/computer-vision/detection-segmentation-visualizing/lesson/
Classification answers 'what is in this image?' Real-world vision often needs more. This lesson covers the three task families that go beyond classification. **Detection** produces lists of (class, bounding box) per image (R-CNN family vs YOLO; anchor boxes; IoU + mAP evaluation). **Segmentation** labels every pixel (semantic with FCN / U-Net vs instance with Mask R-CNN). **Visualization** lets us peek inside trained networks (saliency, occlusion, Grad-CAM, t-SNE, DeepDream) with an honest caveat that these are debugging tools, not full explanations. We work one IoU computation by hand in the body and another in practice, and the training loop on top is unchanged across all three task families.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Generating images by denoising, diffusion
https://clawdemy.org/lessons/computer-vision/diffusion-models/lesson/
VAEs were stable but blurry; GANs were sharp but unstable. Diffusion models take a third approach that has largely replaced both for high-quality image generation since around 2020. The trick is to gradually corrupt training images with noise (a fixed forward process), train a network to predict and reverse the noise (the learned reverse process), and then run the network in reverse: start from pure noise and iteratively denoise into an image. This lesson covers diffusion at vision-context intuition level, works one forward noising step by hand, names the trade-off (high quality + stable training, but slow iterative inference), and explains how text-to-image systems (Stable Diffusion, Imagen, DALL-E 2/3) add language conditioning on top with classifier-free guidance. The L11 VAE makes a comeback as latent diffusion's first-stage encoder.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Teaching machines to imagine, GANs and VAEs
https://clawdemy.org/lessons/computer-vision/gans-and-vaes/lesson/
Every architecture in this track so far has been discriminative (image in, label out). This lesson opens the generative side. We distinguish discriminative from generative modeling, walk the two pre-2020 generative-image-model families (VAEs and GANs) at intuition level, and work the reparameterization trick `z = μ + σ · ε` by hand. The VAE-vs-GAN trade-off (smooth-but-blurry vs sharp-but-hard-to-train) sets up why neither was a perfect solution and motivates diffusion (next lesson). Full mechanical derivations live in sister tracks (T19 for the VAE's ELBO, T24 for GAN training dynamics); this lesson stays at the vision-applied-use level.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Computer vision among people, the human-centered view
https://clawdemy.org/lessons/computer-vision/human-centered-ai/lesson/
Closing lesson of Track 16. T16 built classifiers, detectors, segmenters, generative models, 3D recovery, vision-language systems, and world models, and many of them are deployed in the real world.
The final question this track owes is what these systems get right and wrong in deployment, and how to reason about those strengths and failures as engineering concerns. We catalog the standard failure modes (distribution shift, adversarial examples, OOD inputs, shortcut learning, calibration / overconfidence), treat bias as a property of training data with concrete measurement (disaggregated reporting) and mitigation (data / model / evaluation engineering), and close with the trustworthiness gap between benchmark accuracy and real-world reliability. Policy debates around vision systems are real, important, and outside this lesson's scope; the right forum for those is different.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Telling pictures apart with one score, linear classifiers
https://clawdemy.org/lessons/computer-vision/linear-classifiers/lesson/
Lesson 1 named the strategy (learn from labeled examples); this lesson is the simplest machine that actually carries it out.
The linear classifier flattens an image into a column of numbers, multiplies it by a learned weight matrix, adds a learned bias, and reads off one score per class. We define the score function `s = W · x + b`, ground it in CIFAR-10's shapes (x is 3072 numbers, W is 10 by 3072, 10 scores out), compute a small prediction by hand, see what each row of W really is (a learned per-class template), look at the geometric (hyperplane) view, and meet the structural limit (one template per class) that motivates everything that follows.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

How a classifier learns, loss and optimization
https://clawdemy.org/lessons/computer-vision/loss-and-optimization/lesson/
Lesson 2 left us with a classifier (s = W · x + b) and no way to set its knobs. This lesson defines both halves of the answer. A loss function turns 'predictions match labels' into a single number to drive down (we define multiclass SVM and softmax / cross-entropy and work each on the same worked example); regularization adds a penalty on large weights for better generalization; and gradient descent is the loop that nudges W and b in the negative-gradient direction with step size set by the learning rate.
We name analytic vs numerical gradients and mini-batch / SGD as the practical realization. That four-step cycle (forward pass, loss, gradient, step) is how every classifier in this track, including the giants ahead, actually trains.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Learning features instead of coding them, neural networks and backprop
https://clawdemy.org/lessons/computer-vision/neural-networks-and-backprop/lesson/
The Phase 1 capstone. Lesson 2 capped us at one template per class; lesson 3 gave us a training loop. This lesson lifts that cap. Stacking two linear layers gains nothing on its own (the composition collapses to one linear layer), so we insert a non-linearity (ReLU) between them. The hidden layer now produces learned features of the image instead of operating on raw pixels, which broke the multi-modal limit and ended the hand-engineered-features era of computer vision. Computing the gradient through every weight in every layer is then made tractable by backpropagation, the chain rule applied recursively through the network's computational graph: one forward pass plus one backward pass yields gradients for every weight at once.
By the end, the full general-purpose image-classifier training loop is in place.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Learning from images without labels, self-supervised vision
https://clawdemy.org/lessons/computer-vision/self-supervised-vision/lesson/
Phase 3 opener. Every supervised model so far in this track has needed labeled images, and labels are expensive (ImageNet's million labels took years). Self-supervised learning lets a model learn useful visual features from unlabeled images alone, by constructing pretext tasks whose labels come from the data itself. We walk the pretext-task history (rotation, jigsaw, colorization), the contrastive-learning shift (SimCLR, MoCo, BYOL) with one cosine similarity by hand, and masked image modeling (MAE, DINO/DINOv2). The pre-train-then-fine-tune workflow that powers most modern vision-language and multimodal systems lives here, and it is the engine that makes vision feasible in label-scarce domains (medical imaging, satellite, scientific data) where unlabeled data is abundant.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Sequence tools for vision, recurrence and attention
https://clawdemy.org/lessons/computer-vision/sequence-tools-for-vision/lesson/
A single image is a static scene; many vision tasks involve sequences (captions are sequences of words, videos are sequences of frames, and the Vision Transformer treats an image itself as a sequence of patches). This lesson covers the two sequence-processing tools (recurrence and attention) at the level needed for vision applications. Recurrence (RNN, LSTM, GRU) processes a sequence one step at a time and carries a hidden state forward; attention compares every position to every other in parallel and returns a weighted average of values. We cover vision applications (CNN-RNN captioning, CNN-attention captioning, CNN-RNN video, Vision Transformer), work one attention computation by hand, and route to sister tracks (T12 L2 for recurrence; T5 multi-lesson + T14 for transformers) for the deep mechanics.
Combining Lec 7+8 into one lesson is a deliberate Phase 0 choice to avoid duplicating sister-track depth.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Teaching machines to understand videohttps://clawdemy.org/lessons/computer-vision/video-understanding/lesson/A photo is one moment; a video is a sequence of moments stretched across time. This lesson walks the standard ways of adding the time dimension to a vision system, from the surprisingly competitive single-frame baseline through late and early fusion, 3D convolutions (~3x param cost per filter; C3D and I3D), two-stream networks (RGB appearance + optical-flow motion; SlowFast for the modern descendant), CNN-plus-RNN (cross-link to L7), and video transformers (TimeSformer's divided space-time attention as a practical factorization). The training loop is unchanged across all of them. 
We work the 2D-vs-3D conv parameter-count ratio in the body and again in practice, and emphasize the practitioner discipline of always running the single-frame baseline as the floor any video model must beat.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Connecting pictures and words, vision and languagehttps://clawdemy.org/lessons/computer-vision/vision-and-language/lesson/Modern AI systems do not treat images and language as separate problems; they share a representation. This lesson covers CLIP's two-tower contrastive setup (image encoder + text encoder trained jointly on ~400M web image-text pairs), the downstream applications that fall out of the trained joint embedding space (zero-shot classification, image-text retrieval, captioning, VQA), modern general-purpose vision-language models (VLMs), and the economic frame that closes Phase 3 (image-text pairs are abundant on the web; CLIP-scale pre-training exploits that abundance). 
We work one image-text cosine similarity by hand and one zero-shot-classification reasoning exercise.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00false
Models that imagine the world, world modelinghttps://clawdemy.org/lessons/computer-vision/world-modeling/lesson/Every vision system so far in this track has been reactive (process current input, output an answer). World modeling extends vision to predictive: given the past, predict the future. Self-driving trajectory prediction, robotics planning, video generation, and model-based reinforcement learning are all variants. This lesson covers world modeling at vision-context level: the three-piece architecture (encoder + dynamics + optional decoder), the central pixel-space-vs-latent-space prediction trade-off (worked with a parameter-cost calculation), landmark architectures (World Models, Dreamer family, MuZero, JEPA, Sora-style video world models), and the cross-track ties to T18 (model-based RL depth) and T24 (production-scale video generation depth).Mon, 25 May 2026 00:00:00 GMTClawdemy14:00false
Agentshttps://clawdemy.org/lessons/llm-ops-and-production/agents/lesson/Lesson 10 of Track 21. What an LLM agent is (the lesson-4 tool-use loop with the model deciding when to stop), the three foundational patterns (function-calling agents, ReAct, plan-and-execute), the three tests for whether a task should be an agent (variable shape + bounded tools + acceptable cost), the five engineering failure modes (loops, wrong paths, compound cost, harder evaluation, brittle tool boundaries), and how lesson 7's LLMOps discipline scales to trajectory-level evaluation. Taught technical-primer: WHAT, WHEN, WHAT-GOES-WRONG, HOW; agent-autonomy and contested-alignment debates explicitly out of scope.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00false
Augmented language models, retrieval and toolshttps://clawdemy.org/lessons/llm-ops-and-production/augmented-llms/lesson/Lesson 4 of Track 21, opening Phase 2 (building production apps). The two patterns that take an LLM beyond what it was trained on: retrieval-augmented generation (RAG, with its seven moving parts and the trade-offs that decide whether it works), and tool use (the four-step loop where the model calls functions you define). Modern apps often implement RAG as a tool, letting the model decide when retrieval is needed. Every move lives against the three productive limits from lesson 2.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Industry perspective: where the field is goinghttps://clawdemy.org/lessons/llm-ops-and-production/industry-perspective/lesson/Lesson 11 of Track 21. The track capstone. Synthesizes the 10 lessons that came before (arc: demo to production-grade application) against the fireside-chat industry perspective from a Full Stack Deep Learning Bootcamp fireside chat with Peter Welinder (OpenAI). Three rules for reading a fireside (attribute, separate, generate questions). Five durable bets the field has converged on. Three concrete reader moves post-track. 
Treated as synthesis + careful read of a primary source, not as a forecast; speaker views are attributed as views, not absorbed as canon.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Launch an LLM app in one hourhttps://clawdemy.org/lessons/llm-ops-and-production/launch-an-llm-app/lesson/Lesson 1 of Track 21, the production-tier track that opens by shipping. The track inverts the usual order: build a working LLM application first, then learn what makes it actually good. This lesson covers the five components of a minimum-viable LLM app (hosted model, API key, prompt template, application code, UI + deployment), takes one in about thirty lines of Python (Streamlit + Anthropic Claude API or another provider's), and maps honestly to the gaps the rest of the track refines (retrieval, prompt engineering, UX, observability).Mon, 25 May 2026 00:00:00 GMTClawdemy11:00false
LLM foundations for productionhttps://clawdemy.org/lessons/llm-ops-and-production/llm-foundations/lesson/Lesson 2 of Track 21. The working picture a production builder needs after shipping the minimum app. A hosted LLM is a stateless next-token function bounded by three productive limits: context length (a hard input budget shared by system + retrieved + history + max_tokens output), cost per token (input vs output priced separately, output usually several times more, compounds at scale), and latency (TTFT + output_tokens / tokens_per_second; streaming masks it). The constraints under which every later design decision lives.Mon, 25 May 2026 00:00:00 GMTClawdemy12:00false
LLMOpshttps://clawdemy.org/lessons/llm-ops-and-production/llmops/lesson/Lesson 7 of Track 21, closing Phase 2. The operational layer that keeps an LLM application working over time: the LLM analogue of DevOps and MLOps. 
Five engineering pillars: observability (log enough to debug), evaluation in production (sample + score live; A/B test changes), prompt versioning (treat prompts as code), cost and latency monitoring (dashboards + alerts), and regression testing (suite run before every change; makes model upgrades safe). The smallest practical first stack is days, not months, and the tools matter less than the discipline.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Project walkthrough, a real LLM application end to endhttps://clawdemy.org/lessons/llm-ops-and-production/project-walkthrough/lesson/Lesson 5 of Track 21. The bootcamp's worked example, askFSDL (a Q&A app over the FSDL course materials), read for the production decisions it embeds at each pipeline stage: knowledge-source scoping, content-shaped chunking with metadata, source-carrying retrieval, a scope-honest citation-asking system prompt, streaming generation with citations, and logging that seeds LLMOps. The complexity is in the decisions, not the line count, real apps of this shape are a few hundred lines.Mon, 25 May 2026 00:00:00 GMTClawdemy11:00false
Prompt engineering, "Learn to Spell"https://clawdemy.org/lessons/llm-ops-and-production/prompt-engineering/lesson/Lesson 3 of Track 21, closing Phase 1. Prompt engineering is the single highest-leverage application skill, and the prompt is the spec for what the assistant is. This lesson covers the toolkit (clarity, format constraints, few-shot, chain-of-thought, system prompts, persona, delimiters, end-placement, negatives used sparingly), when a prompt fix beats a code fix (the largest, cheapest category of failures), the discipline that turns prompting into engineering (version + test on held-out examples), and where prompts run out (retrieval, tool use, fine-tuning, lessons 4 and 9).Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Training your own LLMhttps://clawdemy.org/lessons/llm-ops-and-production/training-your-own-llm/lesson/Lesson 9 of Track 21. The deep dive on the fine-tune point of the build-vs-buy spectrum from lesson 8. When training your own (smaller, specialized) model is the right move for a production application (the three-things-true-at-once test), the staged pipeline most teams should follow (open checkpoint → curated SFT data → LoRA training → optional DPO → eval → A/B test), the practical tools (TRL, Axolotl, managed compute), the economics that decide payback, and how fine-tuning fits the mix architecture. Taught technical-primer: mechanical when/how, with broader debates explicitly out of scope.Mon, 25 May 2026 00:00:00 GMTClawdemy12:00false
UX for language user interfaceshttps://clawdemy.org/lessons/llm-ops-and-production/ux-for-luis/lesson/Lesson 6 of Track 21. A language user interface is a new interaction surface, and the patterns that make one usable are different from the patterns of forms and buttons. The five core patterns (streaming, citations, regeneration, hedging, recoverable failure), the supporting details that lift quality, and a critique-this-UX checklist. Taught as interaction-design throughout: content-policy and moderation debates are out of scope here.Mon, 25 May 2026 00:00:00 GMTClawdemy12:00false
What's next, the LLM landscape in motionhttps://clawdemy.org/lessons/llm-ops-and-production/whats-next/lesson/Lesson 8 of Track 21, opening Phase 3. A survey of the six directions the LLM landscape is moving (longer context, multimodality, smaller specialized models, the build-vs-buy spectrum, agents, reasoning models), what each changes for a builder reading through lesson 2's productive limits, and how three of them set up the deeper Phase 3 lessons that follow. Survey-lean: lighter pedagogy, breadth-over-depth, points forward.Mon, 25 May 2026 00:00:00 GMTClawdemy10:00false
Function approximation and deep RLhttps://clawdemy.org/lessons/reinforcement-learning-foundations/function-approximation-and-deep-rl/lesson/Lesson 9 of Track 17. Tables don't scale; Atari, Go, and robotics state spaces are too big. Function approximation replaces the table with a parameterized function (linear features or a neural network), keeps the Bellman recursion intact, and lets one update generalize across all states via shared parameters. This lesson works a single semi-gradient step on a linear Q, explains why the deadly triad (TD + off-policy + function approximation) can diverge, and shows how DQN's experience replay and target network make value-based deep RL stable.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Markov Decision Processeshttps://clawdemy.org/lessons/reinforcement-learning-foundations/markov-decision-processes/lesson/Lesson 2 of Track 17. The first lesson sketched the agent-environment loop informally; this one nails it down. The Markov Decision Process is the universal contract of RL: a tuple (states, actions, transition probabilities, reward function, discount factor) plus the Markov property, on which every algorithm in the rest of the track operates. This lesson lays out the tuple, explains the Markov property as a property of the state representation (the Atari frame-stacking story), defines a trajectory and the discounted return, walks the return at three discount values on a small example, and draws the planning-versus-learning boundary between Phase 2 and Phase 3.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Monte Carlo predictionhttps://clawdemy.org/lessons/reinforcement-learning-foundations/monte-carlo-prediction/lesson/Lesson 6 of Track 17 and the opener of Phase 3 (model-free learning). Phase 2 assumed you know P and R; Phase 3 is the real-world case where you do not. Monte Carlo prediction is the simplest model-free way to evaluate a policy: play episodes, average the observed returns, let the law of large numbers do the rest. This lesson lays out first-visit and every-visit MC, runs a 3-state worked example through five episodes that shows both convergence and the variance failure mode, and frames MC as the unbiased extreme of a bias-variance spectrum TD learning sits at the other end of.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Policy gradient and the path to modern RLhttps://clawdemy.org/lessons/reinforcement-learning-foundations/policy-gradient-and-the-path-to-modern-rl/lesson/Lesson 10 of Track 17, the close. Lessons 4-9 learned a value function and read the policy off as greedy; this lesson flips the script: parameterize the policy directly, then take gradient steps that increase the probability of actions that lead to high return. The capstone writes the REINFORCE update, walks one policy-gradient step on a tiny softmax policy (the probability of a rewarded action climbs from 0.50 to about 0.55), places actor-critic as the variance fix that produces PPO and the modern workhorses, and closes the track with the bridge to RLHF for large language models.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00false
Policy iterationhttps://clawdemy.org/lessons/reinforcement-learning-foundations/policy-iteration/lesson/Lesson 4 of Track 17 and the opener of Phase 2. 
The Bellman equation said value is recursive; policy iteration is the first algorithm that actually computes the optimal policy from it. The algorithm alternates two simple steps, evaluate the current policy by solving its Bellman expectation equation, then improve the policy by acting greedily, and provably converges to pi^* in any finite MDP. This lesson lays out both steps, runs the algorithm end-to-end on a two-state MDP through two iterations, and introduces the generalized-policy-iteration lens that ties almost every later RL method together.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Q-learning: model-free controlhttps://clawdemy.org/lessons/reinforcement-learning-foundations/q-learning/lesson/Lesson 8 of Track 17 and the close of Phase 3. MC and TD prediction estimated V^pi from samples; Q-learning is the control counterpart that estimates Q^* and acts greedily. Its update is TD's bootstrap on Q with a max-over-actions in the target -- combining value iteration's Bellman optimality recursion with sample-based learning. 
This lesson works five Q-learning steps on a 2-state-2-action MDP (greedy policy already pi^* after 5 updates), contrasts on-policy SARSA with off-policy Q-learning, explains why exploration is required, and previews the DQN bridge with the deadly-triad caveat.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Temporal-difference learninghttps://clawdemy.org/lessons/reinforcement-learning-foundations/temporal-difference-learning/lesson/Lesson 7 of Track 17. Monte Carlo waited until an episode ended to compute a return; TD learning updates after every single step using a bootstrapped one-step target. This lesson writes the TD(0) update, walks four episodes of a deterministic chain through clean monotonic convergence (with value visibly propagating backward from the terminal one bootstrap per episode), compares MC and TD on the bias-variance axis, and places TD as the foundation under Q-learning, SARSA, DQN, and actor-critic.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Value functions and the Bellman equationshttps://clawdemy.org/lessons/reinforcement-learning-foundations/value-functions-and-the-bellman-equations/lesson/Lesson 3 of Track 17 and the close of Phase 1. With the MDP nailed down, you need a way to say how good things are. The state-value V and action-value Q answer that, the expected total reward from a state or a state-action pair under a policy. Their defining property is recursive: value here equals one step of reward plus the discounted value at the next state. That recursion, in two forms (expectation under a policy, and optimality over the best action), is the Bellman equation, the mathematical heart of reinforcement learning.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Value iterationhttps://clawdemy.org/lessons/reinforcement-learning-foundations/value-iteration/lesson/Lesson 5 of Track 17 and the close of Phase 2. 
Policy iteration did full evaluation between improvements; value iteration is the simpler sibling that interleaves them completely. The update is a direct sweep of the Bellman optimality equation. This lesson runs value iteration four steps on the same MDP as the previous lesson so the comparison is direct, shows the greedy policy stabilizes long before V converges (a standard early-stopping trick), and places value iteration as the extreme point of generalized policy iteration that pre-figures Q-learning and DQN.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

What reinforcement learning actually is
https://clawdemy.org/lessons/reinforcement-learning-foundations/what-reinforcement-learning-actually-is/lesson/
The opener of Track 17 (Reinforcement Learning Foundations). RL is a third paradigm beside supervised and unsupervised learning, the one where an agent learns from interaction with consequences. 
This lesson sets up the agent-environment-reward loop, explains what makes RL harder than supervised learning (no oracle, delayed reward, distribution shift from the policy), introduces the exploration-versus-exploitation dilemma that every method in the track is, underneath, an answer to, and tours where RL shows up, from board games to robotics to the RLHF behind modern chatbots.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 12:00

Attention alternatives and mixture of experts
https://clawdemy.org/lessons/build-an-llm-from-scratch/attention-alternatives-and-moe/lesson/
Lesson 4 of Track 15, closing Phase 1. The two variations that make modern LLMs efficient, one per sublayer. Standard attention is quadratic in length and its KV cache dominates inference; multi-query and grouped-query attention shrink that cache, and sliding-window attention bounds long-context cost. Mixture of experts replaces the single FFN with many experts plus a router, decoupling total parameters (capacity, memory) from active parameters (per-token compute). Both are resource-allocation moves in lesson 2's terms.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 13:00

Counting the cost, FLOPs, memory, and arithmetic intensity
https://clawdemy.org/lessons/build-an-llm-from-scratch/counting-the-cost/lesson/
Lesson 2 of Track 15. Efficiency is the track's through-line, and this lesson is the accounting that makes it concrete: estimate a model's compute before you spend it (matmul FLOPs, the 6ND training rule), its memory (parameters, gradients, optimizer states, activations, the 16N estimate), and its arithmetic intensity (compute-bound versus memory-bound), plus reading the tensor reshaping that dominates model code with einops.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 14:00

What "from scratch" means, and the tokenizer
https://clawdemy.org/lessons/build-an-llm-from-scratch/from-scratch-and-the-tokenizer/lesson/
Lesson 1 of Track 15, the deepest tier on the site. This track builds an LLM from scratch, the real thing, the way frontier labs do. 
This opener lays out what 'from scratch' actually entails end to end, why efficiency (FLOPs, memory, hardware) is the through-line, and then builds the model's first component: the tokenizer. It covers why subword beats character- and word-level tokens and how byte-level BPE works, the procedure you will implement by hand.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 13:00

The Transformer architecture and its hyperparameters
https://clawdemy.org/lessons/build-an-llm-from-scratch/the-architecture/lesson/
Lesson 3 of Track 15. The model itself. Modern LLMs share one skeleton, a decoder-only Transformer with a residual stream, and differ in a handful of converged choices (pre-norm, RMSNorm, gated SwiGLU activations, RoPE positions, no biases, weight tying). This lesson lays out that skeleton, those choices, and the hyperparameters that size a model (d_model, n_layers, n_heads, d_ff, vocab, context), tying the parameter count back to the cost accounting of lesson 2.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 14:00

Why seeing is hard for machines
https://clawdemy.org/lessons/computer-vision/why-seeing-is-hard/lesson/
The opener of Phase 1 (Foundations for vision) and the Track 16 entry point. A computer handed a photo receives only a grid of numbers, with no object or meaning inside. This lesson builds the central problem of computer vision: the semantic gap between pixels and meaning, why the same object produces wildly different numbers (viewpoint, scale, deformation, occlusion, illumination, clutter, intra-class variation), why hand-written rules collapse, and the data-driven shift (collect labeled images, train, evaluate on the unseen) that the rest of the track is built on.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

Backpropagation and the chain rule
https://clawdemy.org/lessons/neural-network-intuition/backpropagation-and-the-chain-rule/lesson/
Lesson 9 of Track 11 (Neural Network Intuition), and the most math-leaning lesson in the track. 
Lesson 8 kept saying backprop figures out how much each knob should change without computing it; this lesson names the how-much. It is the chain rule applied through the layers. It uses the chain rule rather than teaching it (Track 8 does that), shows why the cost is a deeply nested function, works the smallest chain by hand (dC/dw1 as a product of four simple factors = 3), reveals that the chain-rule product is exactly lesson 8's backward flow of desires, explains why running it backward reuses shared factors so one sweep yields the whole gradient, and locates the vanishing-gradient difficulty in the same product of rates.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

Gradient descent, step by step
https://clawdemy.org/lessons/neural-network-intuition/gradient-descent-step-by-step/lesson/
Lesson 7 of Track 11 (Neural Network Intuition), and the close of the learning arc. Three lessons built to this: learning is minimizing the cost, the negative gradient points downhill, and now we take the walk. 
This lesson gives the gradient descent update rule (new value = old value minus learning rate times slope), runs it by hand until the cost slides toward zero, shows how a badly chosen learning rate makes training diverge or crawl, frames training as a repeated loop, names stochastic gradient descent as the real-world shortcut, and flags the one thing it assumes but does not explain: how the gradient itself gets computed. That is backpropagation, the subject of Phase 3.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

Neurons as numbers, layers as structure
https://clawdemy.org/lessons/neural-network-intuition/neurons-and-layers/lesson/
Lesson 2 of Track 11 (Neural Network Intuition). The last lesson named the goal, a function from 784 numbers to 10, and left it sealed. This lesson opens it up. Inside is nothing exotic: layers of neurons, where a neuron is just a container holding one number between 0 and 1 (its activation). It traces a real pixel into the 784-neuron input layer, reads a guess off the 10-neuron output layer by finding the tallest activation, meets the two hidden layers in between, and explains why this one-directional design is called feedforward. 
The edges-to-loops story of what hidden layers do is offered as a hope to hold loosely, not a proven fact.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00

Seeing it whole, and where next
https://clawdemy.org/lessons/neural-network-intuition/seeing-it-whole-and-where-next/lesson/
Lesson 10 of Track 11 (Neural Network Intuition), the synthesis finale. Ten lessons ago a messy handwritten 3 was something you could read instantly but not explain; now you can explain it down to the arithmetic. This closing lesson adds no new machinery. It assembles the whole picture in one breath (function, layers, neurons, cost, landscape, gradient descent, backpropagation), walks one full training step end to end on that very 3, is honest about what the track did not cover (architectures, optimizers, regularization, fine-tuning, code), leaves you with one durable image (a row of dials and a landscape, a patient walk downhill), and routes you to three next tracks: build it yourself (T13), understand modern LLMs (T5), or use AI to build things (T20).
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

The cost landscape
https://clawdemy.org/lessons/neural-network-intuition/the-cost-landscape/lesson/
Lesson 6 of Track 11 (Neural Network Intuition). Lesson 5 turned learning into a clean goal, make the cost small, but left us standing in a 13,000-dimensional space with no idea which way to move. This lesson gives that space a shape. It pictures the cost as a landscape of hills and valleys (each knob setting a point, its cost the height), explains why high dimensions are fine even though they cannot be drawn, introduces the gradient as the direction of steepest uphill, and shows why stepping along the negative gradient lowers the cost fastest. It works the downhill step by hand in one and two dimensions, and ends on an honest caveat: downhill walking reaches a local minimum, not always the deepest valley.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

The whole network as one function
https://clawdemy.org/lessons/neural-network-intuition/the-whole-network-as-one-function/lesson/
Lesson 4 of Track 11 (Neural Network Intuition), and the close of the structure arc. The first three lessons named a goal and built the parts; this lesson steps back to see the whole machine, and it turns out to be exactly the function promised in lesson 1. Running it is the forward pass: lesson 3's neuron formula applied layer by layer. It evaluates a tiny network end to end by hand, introduces the f(x; w, b) framing that separates the per-use input from the fixed weights and biases, and shows that the same skeleton behaves completely differently depending only on its parameter values. The chapter's payoff: a network is a function, and all its capability lives in those numbers, which sets up the question Phase 2 answers, how the right numbers get found.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00

Weights, biases, and the squish
https://clawdemy.org/lessons/neural-network-intuition/weights-biases-and-the-squish/lesson/
Lesson 3 of Track 11 (Neural Network Intuition). Lesson 2 said hidden neurons get their number from the layer before but never said how. This lesson is the how: the single computation every neuron runs. Multiply each incoming activation by a weight, add them up, add a bias, and squash the result into range with an activation function (sigmoid or ReLU). It works one neuron by hand both ways, explains that weights set attention and biases set eagerness, and counts the knobs, showing the small 784-16-16-10 digit network already needs about 13,002 weights and biases while modern networks have billions. The punchline: a network's behavior lives entirely in those parameter values.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

What backpropagation is really doing
https://clawdemy.org/lessons/neural-network-intuition/what-backpropagation-is-really-doing/lesson/
Lesson 8 of Track 11 (Neural Network Intuition), and the opener of the backpropagation arc. Lesson 7 confessed a gap: gradient descent needs the gradient, and we never said how to get it. This lesson gives the intuition behind the answer, backpropagation, with no calculus. Brute force (nudge each knob, re-run the network) is hopeless at 13,000 knobs, so instead we ask what each output neuron wants, watch those wishes turn into adjustments to weights and biases plus requests of the previous layer, and see those requests roll backward layer by layer. A single forward pass plus a single backward sweep yields the whole gradient for about the cost of running the network once, and averaging the wishes over many examples is why learning needs lots of data.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

What learning really means
https://clawdemy.org/lessons/neural-network-intuition/what-learning-really-means/lesson/
Lesson 5 of Track 11 (Neural Network Intuition), and the opener of the learning arc. Phase 1 ended on a cliffhanger: a network only works once its roughly 13,000 weights and biases are set well, so how do we find good values? This lesson builds the measure that makes the search possible, the cost function: a single number for how wrong the network is right now. It writes the desired answer as a one-hot output, works the cost by hand on a confident-correct output (about 0.0129) and a total shrug (0.90), reframes cost as a function of the knobs C(w, b), and collapses learning into one idea: adjust the weights and biases to make that number small. The catch (13,000 dials, a bumpy surface, no brute force) sets up lessons 6 and 7.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00

Build and share a demo
https://clawdemy.org/lessons/practical-transformers/build-and-share-a-demo/lesson/
Lesson 9 of Track 14 and the start of Phase 3. Everything so far has lived in a notebook; this lesson ships. Wrap any model in a browser interface with a few lines of Gradio (gr.Interface plus launch), put your inference code in the function, match components to the model's inputs and outputs, share it with a temporary public link, and publish it permanently on Hugging Face Spaces, all without writing any frontend code.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00

Curating high-quality datasets
https://clawdemy.org/lessons/practical-transformers/curating-datasets/lesson/
Lesson 11 of Track 14. The last lesson ended on a line worth taking seriously: a model is only as good as its data. This lesson is about that data, why quality (not model size) is increasingly the lever that decides results, and how to curate and evaluate a training dataset with Argilla, the human-in-the-loop annotation and feedback platform that turns raw data into something worth training on, then exports it back to the Hub.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

Debug your training and get unstuck
https://clawdemy.org/lessons/practical-transformers/debug-and-get-unstuck/lesson/
Lesson 8 of Track 14 and the close of Phase 2. The most universally useful lesson in the track: how to read a Python traceback (bottom to top), debug a pipeline by forming and checking a hypothesis, recognize where training pipelines commonly break, build a minimal reproducible example, and ask the community for help in a way that actually gets answered. These skills outlast every specific API in the track.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

Fine-tune a pretrained model on your own data
https://clawdemy.org/lessons/practical-transformers/fine-tune-on-your-data/lesson/
Lesson 3 of Track 14, the hands-on heart of Phase 1. Take a pretrained model and continue training it on a task-specific dataset using the Trainer, then measure whether it actually improved. 
You will meet the data collator (dynamic padding), the expected head-swap warning, the TrainingArguments config object, the Trainer itself, and the evaluation discipline of compute_metrics, fine-tuning BERT on the MRPC dataset to about 86% accuracy in a few minutes.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

Fine-tuning LLMs, supervised and instruction tuning
https://clawdemy.org/lessons/practical-transformers/fine-tuning-llms/lesson/
Lesson 10 of Track 14, the first LLM-frontier lesson. The assistant-style models you use went through a different fine-tuning than the classifier of lesson 3. This lesson distinguishes task fine-tuning from supervised fine-tuning (SFT), shows when to reach for SFT versus prompting, explains the chat-formatted data and chat templates it needs, introduces the SFTTrainer from TRL, and covers how LoRA makes fine-tuning large models affordable. It stays strictly at a mechanical, how-it-works level.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

Reasoning models and the road ahead
https://clawdemy.org/lessons/practical-transformers/reasoning-models-and-the-road-ahead/lesson/
Lesson 12 of Track 14, the track capstone. You started not knowing what a transformer was; you can now run, fine-tune, share, curate for, and ship one. This final lesson looks at the current frontier, reasoning models: what they add over ordinary LLMs, how reinforcement learning trains a model to think before it answers, where the open Hugging Face ecosystem fits (Open R1), and the durable working method that outlasts any specific frontier. It stays at a mechanical, how-it-works level.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

Run a model in a few lines, pipelines and Auto classes
https://clawdemy.org/lessons/practical-transformers/run-a-model-in-a-few-lines/lesson/
Lesson 2 of Track 14, and the first one where you run code. It starts with the two-line pipeline() call that runs a whole task, then opens the box: the three steps a pipeline hides (a tokenizer, the model, postprocessing) reproduced by hand with the Auto classes. 
You will see input_ids and attention_mask, the difference between AutoModel and AutoModelForSequenceClassification, why models output logits instead of probabilities, and the single from_pretrained idiom the whole library runs on.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

Share your work on the Hub
https://clawdemy.org/lessons/practical-transformers/share-on-the-hub/lesson/
Lesson 4 of Track 14, the close of Phase 1. Push a model and tokenizer to the Hugging Face Hub so anyone can load them with from_pretrained, write a model card so the work is actually usable, and understand why sharing is the engine of the whole ecosystem. You will authenticate, compare the three upload routes (push_to_hub API, the huggingface_hub library, git/git-lfs), see what a model repo contains, and learn why the model card is the real deliverable.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00
You will authenticate, compare the three upload routes (push_to_hub API, the huggingface_hub library, git/git-lfs), see what a model repo contains, and learn why the model card is the real deliverable.The main NLP tasks, end to endhttps://clawdemy.org/lessons/practical-transformers/the-main-nlp-tasks/lesson/https://clawdemy.org/lessons/practical-transformers/the-main-nlp-tasks/lesson/Lesson 7 of Track 14, where everything comes together. The six common NLP tasks (sequence and token classification, question answering, masked and causal language modeling, summarization, translation) all follow one loop; what changes is the head, the label shape, and the metric. This lesson builds the real applied skill: looking at a problem, naming which task it is, and choosing the right `AutoModelFor<Task>` head, data shape, and metric, plus the two recurring wrinkles of token alignment and sequence-to-sequence.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 7 of Track 14, where everything comes together. The six common NLP tasks (sequence and token classification, question answering, masked and causal language modeling, summarization, translation) all follow one loop; what changes is the head, the label shape, and the metric. This lesson builds the real applied skill: looking at a problem, naming which task it is, and choosing the right `AutoModelFor<Task>` head, data shape, and metric, plus the two recurring wrinkles of token alignment and sequence-to-sequence.Tokenizers up closehttps://clawdemy.org/lessons/practical-transformers/tokenizers-up-close/lesson/https://clawdemy.org/lessons/practical-transformers/tokenizers-up-close/lesson/Lesson 6 of Track 14. Open the tokenizer you have called since lesson 2. 
This lesson walks the four-stage pipeline a fast tokenizer runs (normalization, pre-tokenization, the subword model, postprocessing), explains why fast tokenizers are fast and what offsets and word IDs buy you, names the three subword algorithms (BPE, WordPiece, Unigram) and who uses them, and trains a brand-new tokenizer on a corpus of Python code with train_new_from_iterator, cutting token counts by about a quarter.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

Wrangling data with the Datasets library
https://clawdemy.org/lessons/practical-transformers/wrangle-data-with-datasets/lesson/
Lesson 5 of Track 14 and the start of Phase 2. Real data is never as tidy as the GLUE dataset made it look, so this lesson turns to the datasets library: load data from the Hub or your own files, then clean and transform it at scale with map and filter, the batched=True superpower that makes it fast, the Arrow backend that handles data larger than RAM, and the train_test_split discipline that prepares data for training.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

Updating beliefs with evidence: Bayes' theorem
https://clawdemy.org/lessons/statistics-and-probability/bayes-theorem/lesson/
Lesson 7 of Track 9 and the close of Phase 2. Bayes' theorem converts the chance of A given B into the chance of B given A, and it is the mathematics of updating a belief when evidence arrives. This lesson builds Bayes from natural frequencies, re-derives lesson 1's base-rate result exactly (a 99%-accurate test that is still 50% right on a positive), shows how a second test updates again to 99%, and connects it to spam filters, base-rate neglect, and combining a prior with new data.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

When one event tells you about another: conditional probability and independence
https://clawdemy.org/lessons/statistics-and-probability/conditional-probability-and-independence/lesson/
Lesson 6 of Track 9. The multiplication rule needed independence, but the events that matter in AI are dependent.
This lesson defines conditional probability (the chance of A given B), reads it off a two-way table, generalizes the multiplication rule to dependent events, redefines independence in those terms, and hammers the subject's costliest confusion: the chance of A given B is not the chance of B given A. It sets up Bayes' theorem in the next lesson.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

How sure are we? confidence intervals
https://clawdemy.org/lessons/statistics-and-probability/confidence-intervals/lesson/
Lesson 12 of Track 9. A single measured number hides its uncertainty; a confidence interval shows it, turning '90% accurate' into '90%, give or take 4 points.' This lesson builds the interval as estimate plus or minus a margin of error (about two standard errors for 95%), shows how data and confidence trade off against width, and corrects the interpretation almost everyone gets wrong: a 95% interval is not a 95% probability that the truth is in this particular range.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

Testing a claim: hypothesis testing and p-values
https://clawdemy.org/lessons/statistics-and-probability/hypothesis-testing-and-p-values/lesson/
Lesson 13 of Track 9. Confidence intervals hinted a difference might be noise; hypothesis testing makes the call. This lesson sets up the null and alternative, explains the logic of assuming the null and measuring how surprising the data is, defines the p-value carefully, and dismantles the misreadings that make it the most abused number in science: it is not the probability the null is true, significant is not important, and failing to reject is not proof.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 13:00

Probability foundations
https://clawdemy.org/lessons/statistics-and-probability/probability-foundations/lesson/
Lesson 5 of Track 9 and the opener of Phase 2. A probability is a number from 0 to 1, and combining probabilities takes just three rules: the complement (and the at-least-one shortcut), the addition rule for OR (subtract the overlap), and the multiplication rule for independent ANDs.
This lesson works each on dice, coins, and cards, flags that multiplication needs independence, and connects the rules to pipeline reliability and how a language model scores a sentence.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

Random variables and expected value
https://clawdemy.org/lessons/statistics-and-probability/random-variables-and-expected-value/lesson/
Lesson 8 of Track 9 and the opener of Phase 3. A random variable is a number whose value comes from chance (a payoff, a count, a loss), and its expected value is the long-run average it settles toward. This lesson defines random variables and their distributions, computes expected value and variance by hand, and shows why expected value is the backbone of machine-learning objectives: the thing a loss function minimizes and a reward an agent maximizes.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

From sample to population: sampling and the central limit theorem
https://clawdemy.org/lessons/statistics-and-probability/sampling-and-the-central-limit-theorem/lesson/
Lesson 11 of Track 9 and the opener of Phase 4. Every number measured on a sample is an estimate that varies from sample to sample. This lesson separates a sample statistic from the population parameter it estimates, introduces the standard error (sigma over root n) and the square-root law behind 'more data helps,' and states the central limit theorem, the reason sample means are normal no matter the data's shape, which makes the rest of inference possible.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

Statistics in machine learning
https://clawdemy.org/lessons/statistics-and-probability/statistics-in-machine-learning/lesson/
Lesson 14 of Track 9, the capstone.
It walks every tool from the track into a real machine-learning workflow: describing data, reading model outputs as conditional probabilities, expected value as the training objective, and the heart of it, evaluation as inference (a test set is a sample, a metric is an estimate with a confidence interval, comparing models is a hypothesis test). It draws a clean boundary to the Classical ML track for the model-scoring toolkit and closes on the through-line: statistics is the discipline of not fooling yourself about uncertainty.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

Summarizing data: center and spread
https://clawdemy.org/lessons/statistics-and-probability/summarizing-data-center-and-spread/lesson/
Lesson 2 of Track 9. Before any model learns, someone summarizes the data, and the summary can mislead. This lesson covers the two questions every summary answers (where is the center, how spread out is it), the mean-versus-median tradeoff under skew, how to compute variance and standard deviation by hand, and why standardizing features by their mean and standard deviation is one of machine learning's most common first steps.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

Counts and trials: the binomial distribution
https://clawdemy.org/lessons/statistics-and-probability/the-binomial-distribution/lesson/
Lesson 10 of Track 9 and the close of Phase 3. When you count successes in a fixed number of independent yes-or-no trials, the binomial distribution gives the probabilities. This lesson lays out the four conditions, builds the exactly-k probability formula, works it on coins and a model's accuracy, gives the n-times-p expected-count shortcut, separates exactly-k from at-least-k, and connects it to accuracy as a binomial count.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

The bell curve: the normal distribution
https://clawdemy.org/lessons/statistics-and-probability/the-normal-distribution/lesson/
Lesson 9 of Track 9. The bell curve named in the histogram lesson gets made precise.
This lesson explains how a continuous distribution carries probability as area under a curve, defines the normal by its mean and standard deviation, gives the 68-95-99.7 rule, formalizes the z-score as the standardization met earlier, and connects the normal to AI: feature standardization, the default model of noise, and outlier detection.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

The shape of data: distributions and histograms
https://clawdemy.org/lessons/statistics-and-probability/the-shape-of-data-distributions-and-histograms/lesson/
Lesson 3 of Track 9. A center and spread summarize data, but a histogram shows its shape, and shape carries information no single number can. This lesson builds the histogram, names the shapes (symmetric, skewed, uniform, bimodal, bell), reconnects skew to the mean-versus-median gap, and shows why inspecting a feature's distribution before modeling catches outliers, hidden subpopulations, and class imbalance that summary numbers miss.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

When two things move together: correlation
https://clawdemy.org/lessons/statistics-and-probability/when-two-things-move-together-correlation/lesson/
Lesson 4 of Track 9 and the close of Phase 1. Correlation measures how tightly two quantities move together; this lesson reads the scatterplot, interprets the correlation coefficient between -1 and +1, warns that it sees only straight lines, and spends real time on the most misused idea in data analysis: correlation is not causation. It connects to machine learning (redundant features, spurious signals) and draws a clean line to where prediction proper lives, the Classical Machine Learning track.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

Why AI runs on statistics
https://clawdemy.org/lessons/statistics-and-probability/why-ai-runs-on-statistics/lesson/
The opener of Track 9 (Statistics & Probability for AI).
Every AI system speaks in probabilities, not certainties: a spam filter says 98% spam, a model reports 0.91 confidence, a recommender ranks by likelihood. This orientation lesson situates statistics and probability as the language AI uses to reason under uncertainty. It explains why uncertainty is unavoidable, splits the two directions of statistical reasoning (probability forward, statistics backward), maps where each idea in the track shows up inside real systems, and works the base-rate example to show why a 99%-accurate test can be right only half the time.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00

Deriving the 3D cross product from duality
https://clawdemy.org/lessons/visual-math-linear-algebra/3d-cross-product-via-duality/lesson/
Lesson 11 of Track 4 (Visual Math: Linear Algebra). The 3D cross product has a formula that looks like something you just have to memorize. You do not. This lesson derives it from scratch by combining the duality idea from the dot-product lesson with the determinant-as-volume idea, and the famous criss-cross formula, along with its three geometric properties, falls out on its own.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

Stepping up to 3D
https://clawdemy.org/lessons/visual-math-linear-algebra/3d-transformations/lesson/
Lesson 5 of Track 4 (Visual Math: Linear Algebra), and the close of Phase 1. Everything so far lived on a flat plane. This lesson steps into three dimensions and shows that almost nothing changes: a third basis vector, a third column, one more number per vector, and every rule you already know carries straight over. The same leap takes you to the hundreds of dimensions a real model uses.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00

Vectors that aren't arrows, abstract vector spaces
https://clawdemy.org/lessons/visual-math-linear-algebra/abstract-vector-spaces/lesson/
Lesson 15 of Track 4 (Visual Math: Linear Algebra), the capstone. The very first lesson said a vector is anything you can add and scale coherently, even if it is not an arrow or a list.
This final lesson cashes that promise: functions and polynomials are vectors too, the derivative is an honest matrix, and every tool you built across the track works on objects you cannot draw, including the high-dimensional spaces AI actually lives in.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

Coordinates as a choice, change of basis
https://clawdemy.org/lessons/visual-math-linear-algebra/change-of-basis/lesson/
Lesson 13 of Track 4 (Visual Math: Linear Algebra). A vector's coordinates are not a fact about the vector; they are a description relative to a basis you happened to choose. This lesson makes that operational: how to translate a vector's coordinates from one basis to another and back, and how the same transformation gets a different matrix in a different basis via the M-inverse-A-M sandwich.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00

Solving by area ratios, Cramer's rule
https://clawdemy.org/lessons/visual-math-linear-algebra/cramers-rule/lesson/
Lesson 12 of Track 4 (Visual Math: Linear Algebra). Several lessons ago we said the solution to a linear system is the inverse times the target, but never computed it. Cramer's rule is one way to get the answer directly from the matrix entries, and it falls out of one idea you already have: a linear transformation scales every area by its determinant.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

Cross products as signed area
https://clawdemy.org/lessons/visual-math-linear-algebra/cross-products/lesson/
Lesson 10 of Track 4 (Visual Math: Linear Algebra), opening Phase 3. The dot product measured how much two vectors line up; the cross product measures how much they spread apart, the area they span, with a sign that records which way they turn. In 2D it is one signed number, and it turns out to be exactly the determinant you already know.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 9:00

The determinant
https://clawdemy.org/lessons/visual-math-linear-algebra/determinant/lesson/
Lesson 6 of Track 4 (Visual Math: Linear Algebra), opening Phase 2. A linear transformation stretches and squashes space; the determinant is the single number that says by how much, and whether it flips space inside out. This lesson builds that number from the area of the unit square, derives the ad-bc formula, and shows why a zero determinant signals a collapse that cannot be undone.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00

Dot products and projection
https://clawdemy.org/lessons/visual-math-linear-algebra/dot-products/lesson/
Lesson 9 of Track 4 (Visual Math: Linear Algebra), closing Phase 2. The dot product turns two vectors into a single number, and it has two formulas that look unrelated yet always agree. This lesson computes it both ways, explains why they match (duality), and cashes the promise from the very first lesson about how AI compares vectors in attention, cosine similarity, and inside every neuron.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

The stubborn vectors, eigenvectors and eigenvalues
https://clawdemy.org/lessons/visual-math-linear-algebra/eigenvectors-and-eigenvalues/lesson/
Lesson 14 of Track 4 (Visual Math: Linear Algebra). When a transformation moves the plane, most vectors get knocked off their own line. A few stubborn ones stay on their line and only get scaled. Those are eigenvectors, the scaling factor is the eigenvalue, and in the eigenvector basis the transformation becomes a clean diagonal matrix, the simplest it can look.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 13:00

Undoing a transformation, and when you cannot
https://clawdemy.org/lessons/visual-math-linear-algebra/inverses-column-space-null-space/lesson/
Lesson 7 of Track 4 (Visual Math: Linear Algebra). Last lesson ended on a warning: when the determinant is zero, information is lost. This lesson makes that precise. It builds the inverse (the undo button), shows it exists only when the determinant is nonzero, and introduces the two ideas that explain exactly what a collapse destroys: column space (everything reachable) and null space (everything crushed to zero).
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00

Linear transformations as moves
https://clawdemy.org/lessons/visual-math-linear-algebra/linear-transformations/lesson/
Lesson 3 of Track 4 (Visual Math: Linear Algebra). A matrix looks like a grid of numbers with no obvious meaning. This lesson shows what it actually is: a record of where the two basis vectors land. That single idea turns matrix-vector multiplication from a rule you memorize into a picture you can sketch, and lets you read what any 2x2 matrix does to space straight off its columns.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00

Matrix multiplication as composition
https://clawdemy.org/lessons/visual-math-linear-algebra/matrix-multiplication/lesson/
Lesson 4 of Track 4 (Visual Math: Linear Algebra). Matrix multiplication has a reputation as an arbitrary rows-times-columns rule. It is not arbitrary: multiplying two matrices means doing one transformation, then another.
This lesson shows why the product is computed the way it is, why you read it right to left, why order matters (AB is not BA), and why grouping does not.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00

Matrices between dimensions
https://clawdemy.org/lessons/visual-math-linear-algebra/nonsquare-matrices/lesson/
Lesson 8 of Track 4 (Visual Math: Linear Algebra). Every matrix so far has been square, taking a space back to a space of the same size. Drop that assumption. A rectangular matrix moves between dimensions, embedding a small space into a bigger one or projecting a big space down into a smaller one, and the rules you already know (columns, rank, null space) still tell the whole story.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00

Spans and basis
https://clawdemy.org/lessons/visual-math-linear-algebra/spans-and-basis/lesson/
Lesson 2 of Track 4 (Visual Math: Linear Algebra). Give yourself a couple of vectors and the only two operations you know, adding and scaling, and ask which points you can reach.
The answer is the span, and it leads straight to a basis (the smallest set that reaches everything), to linear independence, and to what the dimension of a space really means.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 9:00

What vectors actually are
https://clawdemy.org/lessons/visual-math-linear-algebra/what-vectors-actually-are/lesson/
The opener of Track 4 (Visual Math: Linear Algebra). The word vector means an arrow in physics, a list of numbers in code, and an abstract object in a math textbook, and this lesson shows they are one object seen from three angles. It connects the arrow and the list through a coordinate system, pins down the two operations (addition and scaling) that actually define a vector, and shows why this single idea is the atom that everything later in AI math is built from.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00

What transformers do, and why they took over AI
https://clawdemy.org/lessons/practical-transformers/what-transformers-do/lesson/
Track 14 opens here.
The thing that wrote back to you in a chat box this week was almost certainly a transformer, a specific architecture from 2017. This lesson gives the working description (tokens in, tokens out, attention in the middle), explains why transformers replaced the older sequential models, sorts the three architectural shapes you will meet, walks a short timeline, separates the expensive pre-training step from cheap fine-tuning, names the limits honestly, and places the Hugging Face ecosystem the rest of the track is built on. No math required.
Sat, 23 May 2026 00:00:00 GMT | Clawdemy | 11:00

The handwritten-digit problem
https://clawdemy.org/lessons/neural-network-intuition/the-handwritten-digit-problem/lesson/
The opener of Track 11 (Neural Network Intuition). Recognizing a messy handwritten 3 is effortless for you and brutally hard to write as a computer program. This lesson shows why a digit is just a grid of brightness numbers to a computer, why rule-writing falls apart on real handwriting, why handwritten digits became the classic first problem in machine learning, and the paradigm shift that powers almost all of modern AI: stop writing rules, start showing labeled examples.
Fri, 22 May 2026 00:00:00 GMT | Clawdemy | 8:00

How chain of thought makes models think out loud
https://clawdemy.org/lessons/ai-foundations/chain-of-thought-prompting/lesson/
Phase 5 closer in our adaptation of Stanford CME 295 Lectures 3 and 6. Asking a model to produce reasoning steps before its answer reliably improves accuracy on multi-step problems. This lesson covers what chain-of-thought prompting is, the two flavors (zero-shot and few-shot), why it works, when it fails, and how it sets up Phase 6's reasoning models.
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 12:00

How agent loops work
https://clawdemy.org/lessons/ai-foundations/how-agent-loops-work/lesson/
Phase 6 closer in our adaptation of Stanford CME 295 Lecture 7. An agent is a tool-using LLM that loops.
This lesson covers the observe-plan-act pattern, how multiple tool calls compose into longer-horizon work, the multi-agent setting and the A2A protocol, and the safety threads (data exfiltration, prompt injection, tool misuse) that weave through everything.
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 13:00

How models call functions
https://clawdemy.org/lessons/ai-foundations/how-models-call-functions/lesson/
Phase 6 lesson on function calling and tool use in our adaptation of Stanford CME 295 Lecture 7. RAG fetched unstructured text. Function calling fetches structured data from APIs (or triggers structured actions). This lesson covers the three-stage mechanism, how function-calling models are trained, and what the LLM actually sees.
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 13:00

How models know word order
https://clawdemy.org/lessons/ai-foundations/how-models-know-word-order/lesson/
The Phase 1 closer to the 'how text gets read' arc. Self-attention processes all tokens in parallel and loses the implicit position signal that older recurrent models had for free.
The 2017 transformer paper added position information back as a vector (sinusoidal or learned) added to the input embedding. This lesson covers why position info has to exist at all and what the original two answers were, deliberately stopping before the modern attention-injected schemes (Phase 2 picks those up after attention is taught).
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 12:00

How reasoning models think differently
https://clawdemy.org/lessons/ai-foundations/how-reasoning-models-think/lesson/
Phase 6 opener in our adaptation of Stanford CME 295 Lecture 6. Reasoning models are trained to produce long internal reasoning chains as part of their policy, not just when prompted. This lesson covers what makes them different from standard LLMs, the compute-budget framing, the major reasoning benchmarks (AIME, GSM8K, HumanEval, SWE-bench, CodeForces), and how to read a Pass@K claim correctly.
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 13:00

How we evaluate models, LLM-as-a-Judge
https://clawdemy.org/lessons/ai-foundations/how-we-evaluate-models/lesson/
Phase 7 opener in our adaptation of Stanford CME 295 Lecture 8. Evaluating an LLM is itself an LLM-shaped problem. This lesson covers the LLM-as-a-Judge pattern (one LLM rates another), how it's set up in practice, and the three named biases (position, verbosity, self-enhancement) that production LaaJ systems must defend against.
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 12:00

How few-shot examples teach in context
https://clawdemy.org/lessons/ai-foundations/in-context-learning-and-few-shot/lesson/
Phase 5 lesson on in-context learning and few-shot prompting in our adaptation of Stanford CME 295 Lecture 3. The model's weights are frozen at inference; you can still shape its immediate behavior by putting examples in the prompt. This lesson covers what zero-shot, one-shot, and few-shot mean, why in-context learning works at all, when examples help, and when detailed instructions can do better.
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 12:00

New ways to generate, speculative decoding and diffusion LLMs
https://clawdemy.org/lessons/ai-foundations/new-ways-to-generate/lesson/
Phase 7 lesson on alternatives to standard autoregressive generation in our adaptation of Stanford CME 295 Lectures 3 and 9. Speculative decoding speeds up generation while preserving the target model's output distribution. Diffusion LLMs borrow from image generation: start from all-mask, denoise into text in parallel refinement passes.
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 12:00

How RLHF and DPO align models
https://clawdemy.org/lessons/ai-foundations/rlhf-and-dpo/lesson/
The Phase 4 closer in our adaptation of Stanford CME 295 Lecture 5. The reward model from the previous lesson can score completions but cannot update an LLM by itself. This lesson covers the two methods that close that gap: RLHF (using PPO and a KL penalty against the reference model) and DPO (the supervised shortcut that derives the same objective without a reward model).
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 14:00

Transformers beyond text, ViT and Mixture-of-Experts
https://clawdemy.org/lessons/ai-foundations/transformers-beyond-text/lesson/
Phase 7 lesson on transformer adaptations in our adaptation of Stanford CME 295 Lecture 9. The transformer block has been reused for non-text inputs (Vision Transformers) and rewired for sparse routing (Mixture-of-Experts). This lesson covers what each enables and why both matter for understanding modern AI.
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 11:00

Where to be careful, a safety lens on what you've learned
https://clawdemy.org/lessons/ai-foundations/where-to-be-careful/lesson/
Track 5 closer. A pull-together of every safety thread woven through Phases 4 through 7: alignment and reward hacking (Phase 4), prompt injection (Phase 5), data exfiltration and tool misuse (Phase 6), evaluation biases (Phase 7). The lesson names what was woven so a coherent safety frame remains.
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 13:00

Why benchmarks can mislead
https://clawdemy.org/lessons/ai-foundations/why-benchmarks-can-mislead/lesson/
Phase 7 lesson on benchmark literacy in our adaptation of Stanford CME 295 Lecture 8. Benchmark numbers are easy to compare and easy to misread. This lesson covers the major benchmark categories (knowledge, reasoning, coding, common sense), what each one actually measures, and the structural reasons benchmark scores can rise faster than real capability.
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 12:00

Why tool-using models fail
https://clawdemy.org/lessons/ai-foundations/why-tool-using-models-fail/lesson/
Phase 7 lesson on tool-use failure modes in our adaptation of Stanford CME 295 Lecture 8. Tool-use failures fall into three buckets: tool-prediction errors (the LLM picked wrong), tool-execution errors (the tool itself misbehaved), and synthesis errors (the LLM mishandled the structured response). This lesson walks all three with named sub-failures and the lecturer's debugging methodology.
Fri, 08 May 2026 00:00:00 GMT | Clawdemy | 13:00

How preferences become reward signals
https://clawdemy.org/lessons/ai-foundations/preferences-into-reward-signals/lesson/
The second lesson of Phase 4 in our adaptation of Stanford CME 295 Lecture 5. SFT teaches the model what to predict, not what not to predict. This lesson covers how that gap is filled: what a preference pair is, why pairwise comparison is the standard collection format, and how the resulting data is used to train a reward model. The reward model is stage one of RLHF and the bridge between human preferences and the RL update in the next lesson.
Thu, 07 May 2026 00:00:00 GMT | Clawdemy | 19:00

Pretraining: how a model learns language by predicting the next word
https://clawdemy.org/lessons/ai-foundations/how-models-are-pretrained/lesson/
Lesson 1 of Phase 3 (How models are trained at scale) in Track 5. Pretraining is the most expensive single thing in modern AI (millions of dollars per run, months of GPU time on large clusters), and for the decoder-only models that dominate generative AI today, also the simplest. Feed the model the open internet, ask it to predict the next token, repeat trillions of times.
The lesson traces the path from older one-model-per-task paradigms to the transfer-learning shape we have today, walks one training step concretely (the cat sat on the [_]: predicted distribution over vocabulary, training signal is whatever was actually next, cross-entropy loss is the negative log of the probability the model assigned to the right answer), names Common Crawl + code repositories + books as the dominant data sources, and grounds the scale (Llama 4 Scout ~40T tokens, frontier scale roughly doubled to tripled since Llama 3's 15T).
Wed, 06 May 2026 00:00:00 GMT | Clawdemy | 22:00

Why pretraining is a memory engineering problem (parallelism and Flash Attention)
https://clawdemy.org/lessons/ai-foundations/parallelism-and-flash-attention/lesson/
Lesson 3 of Phase 3 (How models learn from text: pretraining and scale) in Track 5. A Chinchilla-aligned pretraining run does not fit on one GPU, and attention turns out to be memory-bound rather than compute-bound.
This lesson covers the four engineering tricks that make Chinchilla-scale training tractable on real hardware: data parallelism, the ZeRO optimization, model parallelism, and Flash Attention. Three distribute memory across many GPUs; one rearranges the memory hierarchy inside a single GPU.
Wed, 06 May 2026 00:00:00 GMT | Clawdemy | 26:00

Why precision matters: quantization and mixed precision
https://clawdemy.org/lessons/ai-foundations/quantization-and-mixed-precision/lesson/
Lesson 4 of Phase 3 (How models learn from text: pretraining and scale) in Track 5. The Phase 3 closer. The fourth and last memory lever in the pretraining-engineering toolkit. Lower-precision floating-point representations cost less memory per parameter and run faster on hardware that supports them. Quantization converts a trained model from one precision to another; mixed precision training uses different precisions in different parts of one training step to keep the savings without losing the model in numerical noise.
Wed, 06 May 2026 00:00:00 GMT | Clawdemy | 18:00

Why scale matters: scaling laws and the Chinchilla rule
https://clawdemy.org/lessons/ai-foundations/why-scale-matters/lesson/
Lesson 2 of Phase 3 (How models learn from text: pretraining and scale) in Track 5. Pretraining works because of scale, the lecturer says. This lesson gives that claim its empirical foundation. Two papers: the Kaplan scaling laws (loss falls predictably with more compute, more data, more parameters), and the Chinchilla compute-optimal rule (with a fixed budget, train on roughly 20 tokens per parameter). Together they explain why GPT-3 was undertrained and what changed when frontier labs rebalanced toward more data.
Wed, 06 May 2026 00:00:00 GMT | Clawdemy | 20:00

How modern models inject position into attention (RoPE)
https://clawdemy.org/lessons/ai-foundations/position-embeddings-and-rope/lesson/
The Phase 2 lesson on the architectural shift from input-added position embeddings (sinusoidal, learned) to attention-injected ones.
Covers the 'closer-tokens-more-similar' intuition that motivates the shift, the two intermediate schemes (T5 relative bias, ALiBi), and the RoPE deep-dive. Phase 1 covered the original 2017 schemes; this lesson covers what modern LLMs do differently.
Sun, 03 May 2026 00:00:00 GMT | Clawdemy | 20:00

How instruction tuning makes a model helpful
https://clawdemy.org/lessons/ai-foundations/how-models-learn-to-be-helpful/lesson/
Lesson 1 of Phase 4 (How models learn to be helpful) in Track 5. A pretrained transformer is a great autocompleter, not an assistant. Supervised fine-tuning (SFT) is the bridge: same next-token-prediction objective as pretraining, but on a much smaller curated set of instruction-response examples (typically thousands to hundreds of thousands rather than trillions). The lesson covers what SFT changes (response shape: when the model sees an instruction, the most-likely continuation is now a response rather than more text in the same style) and what stays the same (the knowledge already in the weights), why a few high-quality examples can transform surface behavior, where parameter-efficient methods like LoRA fit, what kind of model you have after SFT (instruction-following but not yet preference-aligned), and the structural limitation (no negative signal: SFT can teach what to predict but not what NOT to predict) that makes the next lesson on preference data necessary.
Thu, 30 Apr 2026 00:00:00 GMT | Clawdemy | 18:00

Token by token: how a transformer generates text
https://clawdemy.org/lessons/ai-foundations/how-text-is-generated/lesson/
Lesson 1 of Phase 5 (How we steer models at inference) in Track 5. This one shows what a trained transformer actually does at runtime: predict the next token, sample from a distribution, append, then repeat.
The lesson walks the autoregressive prediction loop end-to-end (forward pass, logits, softmax, sample, append), compares the decoding strategies that shape the sample step (greedy, pure sampling, top-k, top-p, plus temperature as a separate dial), explains KV caching honestly (it removes the recompute cost that would have made naive generation grow quadratically with output length, so per-token cost grows linearly with cache length, not constant; the dominant constant per-token model cost is what makes streaming feel steady until contexts get long), and closes on speculative decoding as the production layer on top (TensorRT-LLM, vLLM, SGLang ship it natively in 2026).
Wed, 29 Apr 2026 00:00:00 GMT | Clawdemy | 22:00

Multi-head attention: many lenses on the same sentence
https://clawdemy.org/lessons/ai-foundations/multi-head-attention/lesson/
Lesson 2 of Phase 2 (How models think) in Track 5.
A single attention head produces only one set of attention weights per token, so it has to choose which structure to track in a sentence that has many running through it at once. Real transformers run 8 to 32 heads in parallel, each with its own Q, K, V projections, looking at the same sentence through a different lens. The lesson builds the one-head-isn't-enough intuition (back to the animal-street-it example), walks the split-run-concatenate pattern (h smaller heads each at d_k = d_model / h, concatenated and projected through W_O), traces the dimension flow on a 12-head 768-dim example, and closes with what real model specs mean by head counts and the 2026 production variants beyond vanilla MHA (MQA, GQA, MLA).
Wed, 29 Apr 2026 00:00:00 GMT | Clawdemy | 22:00

The transformer block: where everything comes together
https://clawdemy.org/lessons/ai-foundations/transformer-block/lesson/
Lesson 3 of Phase 2 (How models think) in Track 5. Tokens, embeddings, attention, multi-head attention. All the load-bearing pieces. This lesson assembles them into a real transformer block: the repeating unit stacked many times to build a real model.
Covers the four wrapping pieces (position encoding, feed-forward network, residual connections, layer normalization), why each one is structurally required (attention alone is order-blind, has no per-token nonlinearity, suffers rank collapse without an FFN, and cannot be stacked deep without residuals + normalization), the Pre-LN vs Post-LN ordering (Pre-LN is the modern default; the original 2017 paper used Post-LN), and what every component on the canonical 'Attention Is All You Need' architecture diagram represents.
Wed, 29 Apr 2026 00:00:00 GMT
Clawdemy
25:00
false
Embeddings: how words become vectors with meaning
https://clawdemy.org/lessons/ai-foundations/how-words-become-vectors/lesson/
Lesson 2 of Phase 1 (How models read text) in Track 5. Token IDs are just arbitrary numbers with no meaning attached. Embeddings are the dense vectors that fix that, by carrying meaning into the model as geometry: similar words are close together on a high-dimensional map, and certain consistent kinds of difference (gender, tense, country and capital) point along consistent directions. 
The lesson builds the words-on-a-map intuition, walks through the lookup-table mechanism (one row per token, the embedding matrix W_E), pays off the king-queen demonstration as actual vector arithmetic (Mikolov et al., Word2Vec 2013, predating transformers by four years), and lands on why every modern semantic-search and retrieval-augmented system in production runs on this idea.
Tue, 28 Apr 2026 00:00:00 GMT
Clawdemy
22:00
false
How AI reads: turning text into tokens
https://clawdemy.org/lessons/ai-foundations/how-ai-reads-tokens/lesson/
The opener of Phase 1 (How models read text) in Track 5. A transformer never sees raw text; it sees a sequence of integer IDs called tokens. This lesson walks why neither whole words nor individual characters work as units, what byte-pair encoding does (with one merge worked by hand), why a token is atomic to the model (the structural reason older models fail at letter-counting), and how special tokens like BOS, EOS, and chat-role markers create the prompt-injection surface that becomes load-bearing later.
Mon, 27 Apr 2026 00:00:00 GMT
Clawdemy
20:00
false
Inside the transformer: how attention decides which word goes with which
https://clawdemy.org/lessons/ai-foundations/how-attention-works/lesson/
The opener of Phase 2 (How models think) in Track 5. Self-attention is how a transformer figures out, for every word, which other words in the sentence it should be paying attention to. The lesson opens on the canonical *the animal didn't cross the street because it was too tired* example (your reading brain connects *it* to *animal*, not *street*, without effort), names what RNNs structurally couldn't do (long-range decay, no parallelism), builds the query-key-value (Q-K-V) library analogy, walks the three-step formula (similarity / scale by sqrt(d_k) / softmax-weighted sum) without burying you in linear algebra, distinguishes self-attention from cross-attention by which sequence supplies each vector, and works one full attention computation by hand on three tokens so the formula stops being a black box.
Sun, 26 Apr 2026 00:00:00 GMT
Clawdemy
25:00
false
AI won't replace you. But it will expose you.
https://clawdemy.org/lessons/getting-started/ai-wont-replace-you/lesson/
The mission lesson. AI amplifies capability, but only if there's something worth amplifying. That something is your human delta. Start here.
Wed, 22 Apr 2026 00:00:00 GMT
Clawdemy
12:00
false