Practice: Why sequences need memory

Self-check

Seven short questions. Try to answer each in your head (or on paper) before opening the collapsible. Active retrieval is where the learning sticks; rereading is comfortable but does much less.

1. What makes data a “sequence,” and why does it need different handling?

Show answer

A sequence is ordered data where the order itself carries the meaning (sentences, audio, time series). “The clouds are in the sky” means something; the same words shuffled do not. The order is part of the message, so a network has to account for it, not just see a bag of inputs.

2. Name the three problems a plain feedforward network has with sequences.

Show answer

(1) Fixed input size: it has a fixed number of input slots, but sequences vary in length. (2) No order and no memory: it sees one snapshot and answers in one shot, forgetting any previous input. (3) No sharing across positions: something learned in slot two does nothing for slot nine, so it would have to relearn the same thing at every position.

3. What is the hidden state, in one sentence?

Show answer

It is the network’s memory: a fixed-size running summary of everything seen so far, updated at each step and fed back into the network for the next one.

4. At each step of a recurrent network, what two things combine to produce the new hidden state?

Show answer

The current piece of the sequence (for example, the next word) and the previous step’s hidden state (the memory so far). They are combined with the familiar weighted-sum-plus-squish machinery to produce the updated memory, which is passed forward.
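The step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained network: the sizes (2 input features, 3 memory units), the weight values, and the names `rnn_step`, `W_x`, `W_h`, and `b` are all hypothetical, and `tanh` is assumed as the "squish."

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrent step: weighted sum of the current input and the
    previous memory, then a tanh squish. Returns the new hidden state."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))            # current input
        total += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))  # memory so far
        h_new.append(math.tanh(total))
    return h_new

# Hypothetical toy weights, chosen by hand for illustration only.
W_x = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
W_h = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
b = [0.0, 0.1, -0.1]

h0 = [0.0, 0.0, 0.0]                        # empty memory before the first step
h1 = rnn_step([1.0, 0.0], h0, W_x, W_h, b)  # memory after the first piece
h2 = rnn_step([0.0, 1.0], h1, W_x, W_h, b)  # memory after the second piece
```

Feeding the same second input into an empty memory (`h0`) instead of `h1` gives a different result, which is the point: the new state depends on both the current piece and what came before.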

5. The same weights are reused at every step of an RNN. Which two of the three problems does that reuse fix?

Show answer

It fixes the variable-length problem (just keep looping the same small network for any length) and the no-sharing problem (whatever it learns applies at every position, because it is literally the same weights each step). The remaining problem, no order and no memory (the second on the list in question 2), is fixed by the hidden state carrying context forward.
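A rough sketch of that reuse, assuming a one-number memory and hand-picked weights (`run_rnn`, `w_x`, `w_h`, and `b` are illustrative names, not from the lesson): the same three numbers are applied at every step, so the loop runs for a sequence of any length and always returns a memory of the same size.

```python
import math

def run_rnn(sequence, w_x=0.8, w_h=0.5, b=0.0):
    """Read a sequence one piece at a time with ONE shared set of weights.
    The hidden state h is a single number here, to keep the sketch small."""
    h = 0.0  # empty memory
    for x in sequence:
        # Same w_x, w_h, b at every position: nothing is relearned per slot.
        h = math.tanh(w_x * x + w_h * h + b)
    return h

short_memory = run_rnn([1.0, 0.5])                       # 2 steps
long_memory = run_rnn([1.0, 0.5, -0.3, 0.9, 0.2, 0.7])   # 6 steps, same weights
```

Swapping the order of the inputs, for example `run_rnn([1.0, 0.0])` versus `run_rnn([0.0, 1.0])`, gives different final memories, so the loop also respects order rather than treating the inputs as a bag.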

6. Why does a simple recurrent network forget the distant past, and what helps?

Show answer

The early signal fades: as the hidden state is overwritten step after step, information from the start gets washed out, so long-range dependencies (like “France” → “French” many words later) slip away. Gated designs (LSTM, GRU) add little gates that decide what to keep, overwrite, or forget, so important early information can be held onto longer.
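The fading can be made visible with a toy experiment. The sketch below assumes a one-number memory, hand-picked weights, and a constant filler input standing in for the words in between. It measures how much the final memory still depends on the first input as more steps are added. All names and values are hypothetical.

```python
import math

W_X, W_H = 0.8, 0.5   # hypothetical fixed weights
FILLER = 0.3          # stands in for each later, unrelated input

def final_state(first_input, n_filler):
    """Read first_input, then n_filler filler inputs; return the final memory."""
    h = math.tanh(W_X * first_input)
    for _ in range(n_filler):
        h = math.tanh(W_X * FILLER + W_H * h)  # memory is overwritten each step
    return h

def influence(n_filler):
    """How far apart the final memories are when only the FIRST input differs."""
    return abs(final_state(1.0, n_filler) - final_state(0.0, n_filler))

for n in (0, 1, 5, 10, 20):
    print(n, influence(n))
```

With these particular weights the gap shrinks by at least half at every step (the memory weight is 0.5 and tanh never amplifies a difference), so after a couple of dozen steps the first input has almost no effect on the final memory.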

7. Fill in the blank. “A feedforward network sees a ______; a recurrent network reads a ______, carrying a memory forward.”

Show answer

A snapshot and a story. The feedforward network takes the whole input at once; the recurrent network reads one piece at a time and updates a running memory.

Try it yourself: trace the memory, and find where it strains

No tool and no arithmetic, just the lesson’s idea applied with a pen. About 10 minutes. The point is to feel what the hidden state has to carry, and how far back the key information sits.

Part A: short-range memory. Read this sentence one word at a time, and at each word, note in a phrase what a useful “memory so far” would contain:

“The chef tasted the soup and added more ____.”

What you should notice

By the blank, a useful memory holds something like “cooking, soup, seasoning context,” and the natural prediction is salt (or pepper, spice). The words that matter (chef, tasted, soup) are all close to the blank, so even a short memory carries them easily. This is the regime recurrent networks handle well.

Part B: long-range memory. Now do the same for this one:

“Marie was born and raised in Paris, and although her career later took her around the world, she still dreams in fluent ____.”

What you should notice

The answer is French, and it depends on “Paris” near the very start, with a long, distracting clause in between. For a network to get it, the memory of “Paris” has to survive many steps without being overwritten. This is exactly where a simple recurrent network strains: the early signal fades. A gated design (LSTM/GRU) is built to hold that distant fact longer, and attention (next lesson) drops the step-by-step march so any word can reach any other directly.
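For readers who want to see the same contrast in code after doing the exercise by hand, here is a minimal sketch under loud assumptions. The memory is a single number, the weights are hand-picked, and the "gate" is a fixed keep fraction of 0.95 on every distracting word. Real LSTM and GRU gates are learned and depend on the input, so this only illustrates the idea of holding on to an early fact.

```python
import math

W_X, W_H = 0.8, 0.5   # hypothetical fixed weights
FILLER = 0.3          # stands in for each distracting word
KEEP = 0.95           # hypothetical gate value: fraction of old memory kept

def plain_final(first_input, n_filler):
    """Simple recurrence: the memory is fully overwritten at every step."""
    h = math.tanh(W_X * first_input)
    for _ in range(n_filler):
        h = math.tanh(W_X * FILLER + W_H * h)
    return h

def gated_final(first_input, n_filler):
    """Gated recurrence: write the key word fully, then mostly keep it."""
    h = math.tanh(W_X * first_input)  # gate open: store the important word
    for _ in range(n_filler):
        candidate = math.tanh(W_X * FILLER + W_H * h)
        h = KEEP * h + (1.0 - KEEP) * candidate  # keep most, overwrite a little
    return h

def influence(final_fn, n_filler):
    """Gap between final memories when only the first input differs."""
    return abs(final_fn(1.0, n_filler) - final_fn(0.0, n_filler))

plain_gap = influence(plain_final, 20)
gated_gap = influence(gated_final, 20)
```

After 20 filler steps the plain version has almost entirely lost the first input, while the gated version still carries a clear trace of it. That is the toy analogue of keeping “Paris” alive until the blank.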

Part C: spot the distance. For each, say whether the key context is short-range (easy for a simple RNN) or long-range (where it strains): “I turned the key and the engine ____.” versus “The package I ordered three weeks ago, after a long delay and two emails to support, finally ____.”

What you should notice

The first is short-range (“engine” → started/roared, context right there). The second is long-range (“package … finally” → arrived, with the subject far from the verb). Naming the distance is naming exactly where recurrence is comfortable versus where it struggles.

Flashcards

Eleven cards.

Q. What is a sequence, and why does it need special handling?

Ordered data where the order carries the meaning (text, audio, time series). The order is part of the message, so the network must account for what came before, not just see a snapshot.

Q. Why does a fixed-input feedforward network fail at sequences?

Three reasons: fixed input size (sequences vary in length), no order or memory (it answers from one snapshot and forgets prior inputs), and no sharing across positions (it would relearn the same pattern at every slot).

Q. What is the hidden state in a recurrent network?

The network’s memory: a fixed-size running summary of everything seen so far, updated at each step and fed back into the network for the next.

Q. What does a recurrent network do at each step?

It combines the current input with the previous hidden state (using weighted-sum-plus-squish) to produce a new hidden state, which it passes forward. Optionally it reads an answer out of the current state.

Q. Why is weight-sharing across steps important in an RNN?

One small network’s weights are reused at every step, so it handles sequences of any length and applies what it learns at every position. It fixes the variable-length and no-sharing problems at once.

Q. Why does a simple RNN forget the distant past?

The early signal fades: the hidden state is overwritten step after step, so information from the start gets washed out. Short-range context survives; long-range context slips away.

Q. What do LSTMs and GRUs add, and are they a new idea?

They add gated memory: little gates that decide what to keep, overwrite, or forget, so important early information is held longer. They are not a new idea: it is the same recurrence loop with a smarter memory.

Q. Does the hidden state store the whole sequence?

No. It is a fixed-size running summary (a compression of what mattered), not a transcript. That is exactly why long-range detail can be lost.

Q. What is a “context window,” and how does it relate to this lesson?

How much of a sequence a model can keep in mind at once. It is a direct descendant of this lesson’s problem: holding information across a long sequence is the hard part.

Q. What replaced step-by-step recurrence for many tasks?

Attention (next lesson): instead of marching through a sequence carrying a memory, it lets every position look at every other position at once and weigh what matters.

Q. What is the one-sentence takeaway of this lesson?

A feedforward network sees a snapshot; a recurrent network reads a story, carrying a memory forward word by word.