Practice: Matrix multiplication as composition
Self-check
Six short questions. Answer each one in your head (or on paper) before opening the collapsible. Trying to retrieve the answer is where the learning sticks; rereading feels productive but does much less.
1. What does the product AB mean as a sequence of moves?
Show answer
First apply B, then apply A. The matrix on the right acts first. The product is a single matrix that records where the basis lands after both transformations, in that order.
2. Why do you read a matrix product right to left?
Show answer
Because of how the product hits a vector: (AB) · v = A · (B · v). The vector meets B first (it is closest), then the result meets A. It is the same reason the inner function runs first in f(g(x)).
3. How do you compute AB column by column?
Show answer
Apply A to each column of B. The first column of AB is A applied to the first column of B; the second column is A applied to the second column of B. The columns of B are where B sent the basis vectors, and running them through A gives where the combined move sends them.
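If you want to see the column-by-column recipe run, here is a minimal sketch in plain Python. The helper names (`column`, `apply`, `compose`) and the two example matrices are assumptions made for illustration; matrices are stored as lists of rows.

```python
def column(M, j):
    """Return column j of M as a list."""
    return [row[j] for row in M]

def apply(M, v):
    """Apply M to v: a weighted sum of M's columns, weighted by v's entries."""
    result = [0] * len(M)
    for j, weight in enumerate(v):
        for i, entry in enumerate(column(M, j)):
            result[i] += weight * entry
    return result

def compose(A, B):
    """Build AB one column at a time: column j of AB is A applied to column j of B."""
    cols = [apply(A, column(B, j)) for j in range(len(B[0]))]
    # cols holds columns; transpose so the result is stored as rows again.
    return [list(row) for row in zip(*cols)]

# Hypothetical example: A shears horizontally, B scales the two axes.
A = [[1, 1],
     [0, 1]]
B = [[2, 0],
     [0, 3]]

AB = compose(A, B)
print(AB)             # [[2, 3], [0, 3]]
print(column(AB, 1))  # [3, 3], which is A applied to B's second column [0, 3]
```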
4. Is matrix multiplication commutative? That is, does AB = BA?
Show answer
No, not in general. Order is the sequence of physical moves, and rotating-then-shearing does not land in the same place as shearing-then-rotating. Only in special cases do two transformations happen to commute.
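A quick numeric check makes the point concrete. This sketch assumes a 90-degree counterclockwise rotation `R` and a horizontal shear `S` as the two moves; both matrices and the `matmul` helper are illustrative choices.

```python
def matmul(A, B):
    """Product AB for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

R = [[0, -1],
     [1,  0]]   # rotate 90 degrees counterclockwise
S = [[1, 1],
     [0, 1]]    # horizontal shear

shear_then_rotate = matmul(R, S)   # S acts first, then R
rotate_then_shear = matmul(S, R)   # R acts first, then S

print(shear_then_rotate)  # [[0, -1], [1, 1]]
print(rotate_then_shear)  # [[1, -1], [1, 0]]
print(shear_then_rotate == rotate_then_shear)  # False

# One of the special cases: two rotations about the origin do commute.
R180 = matmul(R, R)
print(matmul(R, R180) == matmul(R180, R))  # True
```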
5. Is matrix multiplication associative? Does grouping matter?
Show answer
It is associative: (AB)C = A(BC) always. Grouping does not matter because the chain of moves is one fixed sequence; parentheses only change which adjacent pair you bundle first, never the order they happen in. That is why ABC needs no parentheses.
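The same claim can be checked numerically. In this sketch the three matrices (a stretch, a rotation, and a shear) are arbitrary illustrative picks, and `matmul` is a hypothetical helper; one example does not prove associativity, it only shows what the statement says.

```python
def matmul(A, B):
    """Product AB for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[2, 0], [0, 1]]    # horizontal stretch by 2
B = [[0, -1], [1, 0]]   # 90-degree counterclockwise rotation
C = [[1, 1], [0, 1]]    # horizontal shear

left = matmul(matmul(A, B), C)    # bundle A and B first
right = matmul(A, matmul(B, C))   # bundle B and C first

print(left)           # [[0, -2], [1, 1]]
print(left == right)  # True: regrouping keeps the sequence C, then B, then A

# Reordering is a different story: swapping B and C changes the sequence.
reordered = matmul(matmul(A, C), B)
print(reordered)          # [[2, -2], [1, 0]]
print(reordered == left)  # False
```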
6. Why does stacking only linear layers in a neural network gain nothing, and what fixes it?
Show answer
Composing linear transformations gives a single linear transformation, so a hundred linear layers collapse to one matrix and add no expressive power. Inserting a nonlinear function between layers breaks the collapse: the layers can no longer be multiplied into one matrix, so each added layer genuinely extends what the network can express. That is why depth helps at all.
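Here is a small sketch of both halves of that answer. The weight matrices are made up, the layers have no bias terms, and ReLU stands in as one common choice of nonlinearity; none of these specifics come from the lesson itself.

```python
def matmul(A, B):
    """Product AB for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def apply(M, v):
    """Matrix-vector product M · v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def relu(v):
    """Elementwise ReLU: negative entries become 0."""
    return [max(0, x) for x in v]

# Hypothetical 2x2 weights for two layers.
W1 = [[1, -1],
      [0,  1]]
W2 = [[ 1, 1],
      [-1, 2]]

x = [1, 2]
y = [2, 1]
x_plus_y = [a + b for a, b in zip(x, y)]

# Two linear layers equal one layer whose matrix is W2 W1.
collapsed = matmul(W2, W1)
print(apply(W2, apply(W1, x)) == apply(collapsed, x))  # True

# With ReLU between the layers, the network is no longer linear.
def net(v):
    return apply(W2, relu(apply(W1, v)))

sum_of_outputs = [a + b for a, b in zip(net(x), net(y))]
print(net(x_plus_y))    # [3, 6]
print(sum_of_outputs)   # [4, 5]
```

Any single matrix M satisfies M · (x + y) = M · x + M · y. The last two lines differ, so no single matrix reproduces the network with ReLU in it.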
Try it yourself, part 1: compose two transformations
Use these two transformations throughout. A is a horizontal stretch by 2; B is a 90-degree counterclockwise rotation.
A = [[2, 0], [0, 1]]
B = [[0, -1], [1, 0]]
About 8 minutes, pen and paper. Recall: to compute a product, apply the left matrix to each column of the right matrix.
Step 1. Compute AB (apply A to each column of B). The columns of B are [0, 1] and [-1, 0].
Step 2. Compute BA (apply B to each column of A). The columns of A are [2, 0] and [0, 1].
Step 3. Apply both products to v = [1, 2]. Do you get the same point?
Check your work
Step 1. A · [x, y] = x · [2, 0] + y · [0, 1].
A · [0, 1] = [0, 1]
A · [-1, 0] = [-2, 0]
So AB = [[0, -2], [1, 0]] (columns [0, 1] and [-2, 0]).
Step 2. B · [x, y] = x · [0, 1] + y · [-1, 0].
B · [2, 0] = [0, 2]
B · [0, 1] = [-1, 0]
So BA = [[0, -1], [2, 0]] (columns [0, 2] and [-1, 0]).
Step 3. AB · [1, 2] = 1 · [0, 1] + 2 · [-2, 0] = [0, 1] + [-4, 0] = [-4, 1]. BA · [1, 2] = 1 · [0, 2] + 2 · [-1, 0] = [0, 2] + [-2, 0] = [-2, 2]. Different points, because AB ≠ BA: AB rotates first and then stretches, while BA stretches first and then rotates.
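Once you have worked part 1 by hand, a few lines of plain Python can confirm the numbers. The `matmul` and `apply` helpers are illustrative names; A, B, and v are the ones from the exercise.

```python
def matmul(A, B):
    """Product AB for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def apply(M, v):
    """Matrix-vector product M · v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

A = [[2, 0], [0, 1]]    # horizontal stretch by 2
B = [[0, -1], [1, 0]]   # 90-degree counterclockwise rotation
v = [1, 2]

AB = matmul(A, B)
BA = matmul(B, A)

print(AB)            # [[0, -2], [1, 0]]
print(BA)            # [[0, -1], [2, 0]]
print(apply(AB, v))  # [-4, 1]
print(apply(BA, v))  # [-2, 2]
```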
Try it yourself, part 2: predict, then verify
Stay with AB = [[0, -2], [1, 0]] from part 1. About 6 minutes.
Step 1. Apply AB directly to v = [1, 2].
Step 2. Now do it the slow way: first apply B to [1, 2], then apply A to that result. Confirm you land on the same point, and write one sentence explaining why B acts first.
Check your work
Step 1. AB · [1, 2] = 1 · [0, 1] + 2 · [-2, 0] = [-4, 1] (same as part 1, step 3).
Step 2. First B · [1, 2] = 1 · [0, 1] + 2 · [-1, 0] = [0, 1] + [-2, 0] = [-2, 1]. Then A · [-2, 1] = -2 · [2, 0] + 1 · [0, 1] = [-4, 0] + [0, 1] = [-4, 1]. Same point.
B acts first because in AB · v = A · (B · v), the vector is closest to B, so it meets B before it meets A, exactly like the inner function in f(g(x)).
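The fast route and the slow route can also be compared in code. This is a sketch with a hypothetical `apply` helper; the matrices and the vector are the ones from the exercise, and AB is entered directly as computed in part 1.

```python
def apply(M, v):
    """Matrix-vector product M · v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

A = [[2, 0], [0, 1]]     # horizontal stretch by 2
B = [[0, -1], [1, 0]]    # 90-degree counterclockwise rotation
AB = [[0, -2], [1, 0]]   # the product from part 1
v = [1, 2]

direct = apply(AB, v)        # one combined move

after_B = apply(B, v)        # the vector meets B first...
nested = apply(A, after_B)   # ...then the result meets A

print(after_B)           # [-2, 1]
print(nested)            # [-4, 1]
print(direct == nested)  # True
```

The nesting `apply(A, apply(B, v))` reads like f(g(x)): the innermost call runs first.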
Flashcards
Ten cards. Click any card to reveal the answer. Use the Print flashcards button to lay out the full set as one card per page, ready to print or save as a PDF for offline review.
Q. What does the product AB mean as a sequence of moves?
First apply B, then apply A; the rightmost matrix acts first. The product is one matrix recording where the basis lands after both transformations, in that order.
Q. Why do you read a matrix product right to left?
Because (AB) · v = A · (B · v): the vector meets the nearest matrix (B) first, then the result meets A. Same reason the inner function runs first in f(g(x)).
Q. How do you compute AB column by column?
Apply A to each column of B. Each column of B is where B sent a basis vector; running it through A gives where the combined move sends it. The results are the columns of AB.
Q. Is matrix multiplication commutative?
No. AB ≠ BA in general, because order is the sequence of physical moves and rotate-then-shear differs from shear-then-rotate. Only special pairs of transformations commute.
Q. Is matrix multiplication associative?
Yes. (AB)C = A(BC) always, because the chain of moves is one fixed sequence; parentheses only change which adjacent pair you bundle first, not the order. So ABC needs no parentheses.
Q. What is the difference between not-commutative and not-associative?
Not commutative means order matters: AB ≠ BA in general. Not associative would mean grouping matters, but matrix multiplication is associative, so (AB)C = A(BC) always. You may reparenthesize the chain but not reorder it.
Q. What does (AB)·v equal, and why does that show the order?
(AB) · v = A · (B · v). The vector reaches B first because it is closest, then the output reaches A. The nesting makes the right-to-left order explicit.
Q. Why does stacking only linear layers in a network gain nothing?
Composing linear transformations yields a single linear transformation, so any stack of linear layers collapses to one matrix. A hundred linear layers can do no more than one well-chosen layer.
Q. What breaks the linear-layer collapse, and why does depth then help?
A nonlinear function placed between layers. It bends the grid so the layers can no longer be multiplied into a single matrix, letting each added layer extend what the network can express. That is why depth helps at all.
Q. What is the meaning behind the rote row-times-column rule?
Each column of AB is A applied to that column of B: where the basis lands after both moves. The row-times-column recipe gives the same numbers but hides this composition meaning.