BERT, part one: the bidirectional encoder

The previous lesson set up three branches: encoder-decoder (the T5 family), encoder-only (the BERT family), and decoder-only (most modern LLMs). BERT is the heart of the encoder-only branch. It is also one of the most influential models the field has built. The lecturer notes that its paper has been cited around 170,000 times.

BERT is a single idea, split across two lessons. This lesson is the architecture. It covers what BERT looks like as a stack of transformer blocks. It covers why the two-way design is the key choice. It covers what the structural tokens (CLS and SEP) do. And it covers how the input gets shaped before any encoder block sees it. The next lesson covers how this architecture is trained. That means the two pretraining goals (MLM and NSP), the train-then-fine-tune workflow, and the patterns that turn a pretrained encoder into a task classifier.

The architectural move

The move is small to describe and large in effect. Drop the decoder. Keep the encoder. The encoder’s self-attention is naturally two-way. There is no causal mask forcing it to look only at past tokens. Every token attends to every other token in the sequence, in both directions, in one pass. The output is a stack of context-aware vectors, one per input token. Each one has pulled in information from the whole surrounding context.

That two-way reading is what BERT’s name puts front and center. BERT stands for Bidirectional Encoder Representations from Transformers. The lecturer walks through each piece of the acronym on its own. The encoder part is the easiest: just drop the decoder. The bidirectional part is the more interesting claim. It depends on what the model is being asked to do.

A model with two-way self-attention cannot do next-token prediction the way a decoder-only model can. If the model can see the future tokens during training, predicting them is trivial. So BERT was trained on different goals, which the next lesson covers. This lesson stays inside the architecture. It is about what BERT is, before training even begins.

Bidirectional, made concrete

In a decoder-only model, the attention layer is masked. Token N can attend to itself and to tokens 1 through N-1, but not to N+1, N+2, and so on. The mask makes attention causal. Causal attention is what makes next-token prediction sensible. You cannot peek at the future when you are trying to predict it.

In BERT’s encoder, that mask is gone. Self-attention runs without restriction. The token at position 5 can attend to tokens 1, 2, 3, 4 (the past) and tokens 6, 7, 8 (the future). One forward pass through the encoder produces one vector per token. Each vector has pulled in information from every other token in the sequence.
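
The two masking patterns can be sketched as small boolean matrices. This is an illustration in plain Python, not BERT's actual implementation. The function names (causal_mask, full_mask, visible_positions) are made up for this example, and positions are 0-indexed here, unlike the 1-indexed prose above. Note that in the usual formulation a token also attends to itself.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (decoder-style)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """Encoder-style: every token may attend to every token."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """List the positions that token i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


n = 8
print(visible_positions(causal_mask(n), 4))  # [0, 1, 2, 3, 4]
print(visible_positions(full_mask(n), 4))    # [0, 1, 2, 3, 4, 5, 6, 7]
```

With the causal mask, the fifth token sees only itself and what came before. With the full mask, it sees the whole sequence in the same single pass.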

The lecturer flags this contrast directly. GPT-style decoder-only models are not truly two-way, since causal masking prevents it. BERT’s encoder representations, used for classification, truly are. That two-way reading is what makes the encoder a strong choice for certain tasks. These are tasks where you need a representation of the whole input (classification, embeddings, span detection) rather than a continuation of it.

A paper from the same period, ELMo (Embeddings from Language Models), also chased two-way representations. ELMo was based on two-way LSTMs and had similar insights. But it lost steam against BERT, because LSTMs are harder to scale than transformers. (Elmo and Bert are both Sesame Street characters, which is the joke.)

The structural tokens: CLS and SEP

BERT’s input is not just tokenized text. Two special tokens get added that determine what the model can do.

CLS stands for classification. It is added at the start of every input sequence, before any of the actual content tokens. Its job is to carry the whole-input representation. When you later fine-tune BERT for a classification task (covered in the next lesson), you attach a classification head on top of the CLS token’s output vector. That vector has pulled in context from every token in the input, so it works as a sentence-level (or document-level) summary.
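
As a rough illustration of what attaching a head to the CLS vector means, the sketch below applies one linear layer plus a softmax to the encoder output at position 0 and ignores every other position. The sizes, the weights, and the name classify_from_cls are invented for the example. A real head is learned during fine-tuning, which the next lesson covers.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_from_cls(encoder_outputs, weights, bias):
    """Apply a linear layer + softmax to the position-0 ([CLS]) vector only."""
    cls_vec = encoder_outputs[0]
    logits = [
        sum(w * x for w, x in zip(row, cls_vec)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy encoder output: 3 positions, hidden size 4 (values are arbitrary).
encoder_outputs = [
    [0.5, -1.0, 0.25, 2.0],   # position 0: the [CLS] vector
    [0.1, 0.2, 0.3, 0.4],
    [1.0, 1.0, 1.0, 1.0],
]
weights = [[1.0, 0.0, 0.0, 0.5], [-1.0, 0.0, 0.0, -0.5]]  # 2 classes x 4 dims
bias = [0.0, 0.0]

probs = classify_from_cls(encoder_outputs, weights, bias)
print(probs)
```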

SEP stands for separator. It closes each sentence, so it marks the boundary between sentences when BERT’s input has more than one. A single-sentence input still ends with one SEP. BERT was designed to take one or two sentences as input. The SEP token tells the model where one ends and the next begins. This two-sentence shape is what enables one of the next lesson’s pretraining goals.

A typical BERT input looks like:

[CLS] this teddy bear is so cute . [SEP]

Or, for two sentences:

[CLS] this teddy bear is so cute . [SEP] she carries it everywhere . [SEP]

The CLS sits at position 0; SEPs mark sentence boundaries; padding tokens (omitted above) fill out the rest of a fixed-length training batch.
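
The layout above can be sketched as a small formatting function. This is a minimal sketch under stated assumptions: the sentences are already split into tokens, the padding token is written [PAD], and the function also returns a 0/1 attention mask marking real tokens versus padding, which is a common convention in practice rather than something this lesson covers. The name format_input is hypothetical.

```python
def format_input(sentence_a, sentence_b=None, max_len=16):
    """Wrap one or two pre-tokenized sentences in [CLS]/[SEP] and pad to max_len."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"]
    if sentence_b is not None:
        tokens += sentence_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("input longer than max_len; a real pipeline would truncate")
    real = len(tokens)
    tokens += ["[PAD]"] * (max_len - real)
    # 1 marks a real token, 0 marks padding (assumed convention).
    attention_mask = [1] * real + [0] * (max_len - real)
    return tokens, attention_mask


a = "this teddy bear is so cute .".split()
b = "she carries it everywhere .".split()

single_tokens, single_mask = format_input(a, max_len=12)
pair_tokens, pair_mask = format_input(a, b, max_len=16)
print(single_tokens)
print(pair_tokens)
```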

The input embedding: token + position + segment

Every token in BERT’s input becomes a vector. BERT builds that vector by adding three embeddings together. This is a small shift from the original transformer, which added only two: token and position.

Token embedding. A learned vector per token in the vocabulary, looked up by token ID. BERT’s vocabulary is around 30,000 tokens, built by a tokenizer called WordPiece. That is small by current standards: many recent LLM tokenizers use vocabularies of 100,000 tokens or more.

Position embedding. A vector that tells the model where each token sits in the sequence. BERT used learned position embeddings, one per absolute slot, up to the model’s maximum length (512 tokens for the original BERT). The original transformer used fixed sinusoidal positions instead. Both approaches work about the same in practice.

Segment embedding. A learned vector that tells the model which sentence a token belongs to. There are exactly two values: Segment A for the first sentence, and Segment B for the second. Every token in a sentence shares the same segment embedding. This part is new in BERT. The original transformer did not need it, because its source and target sequences went into separate stacks (encoder and decoder) instead of sharing one input.

These three embeddings are added together, slot by slot, to form the input vector for each token. The encoder then takes that vector through its stack of two-way self-attention and feed-forward blocks.
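
The sum can be sketched with three tiny lookup tables. Everything concrete here is an assumption for illustration: the six-entry vocabulary, the hidden size of 4 (BERT-base uses 768), the random values standing in for learned weights, and the function name embed. The segment IDs follow the usual BERT convention, where [CLS] and the first [SEP] count as Segment A (0) and the second sentence with its closing [SEP] counts as Segment B (1).

```python
import random

random.seed(0)
HIDDEN = 4    # toy size; BERT-base uses 768
MAX_LEN = 8   # toy maximum sequence length
VOCAB = {"[CLS]": 0, "[SEP]": 1, "this": 2, "bear": 3, "she": 4, "smiles": 5}


def table(rows):
    """Random stand-in for a learned embedding table of shape (rows, HIDDEN)."""
    return [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(rows)]


token_emb = table(len(VOCAB))   # one row per vocabulary entry
position_emb = table(MAX_LEN)   # one row per absolute position
segment_emb = table(2)          # row 0 = Segment A, row 1 = Segment B


def embed(tokens, segment_ids):
    """Add token + position + segment embeddings component-wise."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        t, p, s = token_emb[VOCAB[tok]], position_emb[pos], segment_emb[seg]
        vectors.append([a + b + c for a, b, c in zip(t, p, s)])
    return vectors


tokens = ["[CLS]", "this", "bear", "[SEP]", "she", "smiles", "[SEP]"]
segment_ids = [0, 0, 0, 0, 1, 1, 1]
inputs = embed(tokens, segment_ids)
print(len(inputs), len(inputs[0]))
```

The two [SEP] tokens share a token embedding but end up with different input vectors, because their position and segment embeddings differ.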

WordPiece, briefly

BERT uses a tokenizer called WordPiece, an early subword method. The idea is close to byte-pair encoding from the Phase 1 lesson. You build a vocabulary from a training corpus by starting from single characters and merging pairs over and over, until the vocabulary hits a target size (around 30,000 for BERT). The difference is the merge criterion: byte-pair encoding merges the most frequent pair, while WordPiece picks the merge that most improves the likelihood of the training data. At run time, each word is split greedily into the longest matching pieces from that learned vocabulary.

Two practical points are worth knowing. First, WordPiece distinguishes a piece that starts a word from a piece that continues one. In BERT’s variant, continuation pieces carry a ## prefix, and word-initial pieces carry none. Second, BERT comes in cased and uncased variants. The cased variant keeps capitalization. The uncased variant lowercases the input first. Which one to use depends on whether casing carries meaning for your task. Named-entity recognition usually wants cased. Sentiment classification often does fine with uncased.
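
The run-time split can be sketched as a greedy longest-match-first loop. This is a simplified sketch, not BERT's tokenizer: the vocabulary is a hand-made toy set, words are separated on whitespace only (real BERT tokenizers also split off punctuation first), and the function names are hypothetical. A word with no matching pieces becomes a single [UNK] token.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry ##
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab, lowercase=True):
    """Lowercase (uncased variant), split on whitespace, then split each word."""
    if lowercase:
        text = text.lower()
    out = []
    for word in text.split():
        out.extend(wordpiece(word, vocab))
    return out


toy_vocab = {"this", "ted", "##dy", "bear", "is", "so", "cute", "play", "##ing"}
print(tokenize("This teddy bear is playing", toy_vocab))
# ['this', 'ted', '##dy', 'bear', 'is', 'play', '##ing']
```

Passing lowercase=False mimics a cased variant. With this all-lowercase toy vocabulary, "This" then maps to [UNK], which shows why the vocabulary and the casing choice have to match.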

The walked example: input through the encoder

To ground the architecture, walk one example through it end to end. Input: “this teddy bear is so cute.”

Step 1: Pre-process. In the uncased BERT variant, lowercase everything: “this teddy bear is so cute.” (The example is already lowercase, so nothing changes here. Cased variants skip this step.)

Step 2: Tokenize with WordPiece. The tokenizer breaks the input into the tokens it has in its vocabulary. “teddy” might tokenize as one token, or as two if WordPiece breaks it on a learned boundary; either way the result is a sequence of integer IDs.

Step 3: Add structural tokens. Prepend [CLS] and append [SEP]:

[CLS] this teddy bear is so cute . [SEP]

Step 4: Compute the three input embeddings. For each token: token embedding + position embedding + segment embedding (all Segment A here since there’s only one sentence). Sum them component-wise.

Step 5: Run through the encoder. Every token’s input vector flows through the stack of bidirectional self-attention plus feed-forward blocks. At the end, every input position has a corresponding output embedding that integrates context from the entire sequence.
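
To see what “integrates context from the entire sequence” means in practice, the sketch below runs a stripped-down single-head self-attention over three toy vectors. It is an illustration under heavy simplifications: queries, keys, and values are the inputs themselves (no learned projections), and the multiple heads, residual connections, layer normalization, and feed-forward layers of a real encoder block are left out. The function name attention and the toy vectors are made up for the example.

```python
import math


def attention(x, causal=False):
    """Single-head scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        # Causal: position i sees positions 0..i. Otherwise it sees all of them.
        limit = i + 1 if causal else len(x)
        scores = [
            sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d)
            for j in range(limit)
        ]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * x[j][k] for j, w in enumerate(weights)) for k in range(d)])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_changed = [[1.0, 0.0], [0.0, 1.0], [-2.0, 3.0]]  # only the last token differs

# Encoder-style: changing the last token changes the output at position 0.
print(attention(x)[0] != attention(x_changed)[0])                            # True
# Decoder-style: position 0 never sees the last token, so nothing changes.
print(attention(x, causal=True)[0] == attention(x_changed, causal=True)[0])  # True
```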

That is where this lesson stops. We have an architecture and a sequence of context-aware token representations coming out of it. Notably, we have not yet asked the model to do anything with those representations. The next lesson covers what we ask of them: which pretraining objectives turn the architecture into a model that has learned something, and which fine-tuning patterns plug a small task-specific head on top of the encoder to turn it into a classifier or span-detector for a specific job.

Why this matters when you use AI

Two consequences are worth holding onto when you read AI tooling docs or model cards.

The BERT family is still everywhere for classification tasks, and 2026’s production standard is ModernBERT. The original 2018 BERT is foundational and still available, but most new production stacks reach for ModernBERT (Answer.AI / LightOn, late 2024), an encoder-only model that keeps BERT’s shape while folding in the modern transformer tweaks: native FlashAttention, RoPE-based position encoding, an 8K context window, and a faster training recipe. When a 2026 stack uses an off-the-shelf encoder for sentiment, intent, named-entity recognition, or any other classification-flavored task, the model is very likely ModernBERT or another member of the BERT family. Knowing the architecture (encoder-only, bidirectional, a classification head on the CLS vector) lets you read the documentation and reason about what the model can and cannot do, regardless of which generation you load.

The architectural choice constrains the use case. An encoder-only model produces representations of input; it does not generate text. If a model card says “encoder-only” or “BERT-like,” you immediately know what shape of task it is good for and what shape it is not. The next lesson covers how that representation actually gets trained and steered.

Common pitfalls

A few mistakes are common enough to be worth naming, even before we get to training.

Conflating BERT’s architecture with GPT’s. Different shape (encoder-only vs decoder-only), different attention masking (bidirectional vs causal), different uses (classification and embedding vs generation). The transformer block itself is similar in both, but the design choices around it diverge.

Thinking BERT can do text generation. It cannot, at least not naturally. The encoder-only architecture has no decoder, no cross-attention, no autoregressive loop. You can pull tricks (use the model’s masked-prediction capability iteratively to “unmask” tokens), but generation is not what BERT is built for.

Thinking “bidirectional” means two passes. It does not. One forward pass through the encoder produces all the bidirectional context. The bidirectionality comes from the absence of a causal mask, not from running attention twice.

Forgetting that the segment embedding is BERT-specific. It is one of the small, easy-to-miss differences between BERT and the original 2017 transformer. It exists because BERT’s input shape can include two sentences and the model needs to know which is which.

What you should remember

BERT is the encoder-only branch’s defining model. Drop the decoder, keep the encoder. Self-attention without a causal mask is bidirectional: every token attends to every other token in one pass.
Two structural tokens shape the input. CLS at position 0 carries the sentence-level (or sentence-pair-level) representation used for classification heads. SEP marks sentence boundaries when there are multiple sentences.
Three additive embeddings. Token + position (learned, one per absolute position) + segment (new in BERT; Segment A vs Segment B for the two-sentence case). All added component-wise to produce the input vector.
WordPiece tokenizer, ~30k vocabulary. Cased and uncased variants of the model exist depending on whether casing matters for the task.
The architecture produces context-aware token representations. What you do with them (classification, span detection, similarity) is a separate question, covered in the next lesson.

What’s next

Now that you have seen BERT’s architecture, the next lesson covers how BERT is trained: the two pretraining objectives (MLM and NSP), why bidirectionality forced new objectives, and the fine-tuning patterns that make BERT useful for downstream tasks.

If you remember one thing

BERT drops the decoder and removes the causal mask.
That is what makes the encoder bidirectional.
CLS, SEP, and three additive embeddings shape the input. The next lesson trains it.