Lesson: Matrix multiplication as composition
Last lesson landed on a single idea: a matrix is a record of where the basis vectors go, and that record is enough to move all of space. One matrix, one move. This lesson asks the obvious next question. What if you want two moves in a row? Rotate the plane, then shear it. Stretch it, then rotate it. What records that combined motion?
The answer is matrix multiplication, and the reason it is worth a lesson is that almost everyone meets the multiplication rule first, as a memorized recipe of rows times columns, and never learns what it means. It means composition: do one transformation, then another. Once you see that, the rule stops being arbitrary and the strange parts (why order matters, why you read right to left) become obvious instead of memorized.
Two moves make one move
Take any two linear transformations and do them in sequence: apply the first, then apply the second to the result. The combined effect is itself a linear transformation. The origin still does not move (neither step moved it), and grid lines stay straight, parallel, and evenly spaced (neither step bent them). So the combined motion is linear too, which means it has its own matrix, its own record of where i-hat and j-hat end up after both steps.
That combined matrix is what we call the product of the two original matrices. Matrix multiplication is not a new operation invented to torture students. It is the bookkeeping for “do this transformation, then that one.”
Right to left, like nested functions
Write the product of A and B as AB. The convention, which feels backward until it clicks, is that AB means first apply B, then apply A. The matrix on the right goes first.
This is not arbitrary either. It comes straight from how we already write functions. When you see a nested function, an outer function wrapped around an inner one, the inner function runs first, because the input reaches the inner function before it reaches the outer. Matrices work the same way. Applying the product to a vector shows it directly:
(AB) · v = A · (B · v)
The vector meets B first (it is closest), then the result meets A. The matrix nearest the vector acts first. Read right to left, always.
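To see the right-to-left rule numerically, here is a minimal Python sketch. The helper names `apply` and `matmul`, and the particular choice of A (a horizontal stretch by 2) and B (the ninety-degree rotation), are illustrative assumptions rather than part of the lesson's notation.

```python
# Illustrative helpers (assumed names): matrices are lists of rows, vectors are lists.

def apply(m, v):
    """Apply matrix m to vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def matmul(a, b):
    """Product ab via the standard row-times-column rule."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

A = [[2, 0], [0, 1]]    # horizontal stretch by 2 (example choice)
B = [[0, -1], [1, 0]]   # ninety-degree counterclockwise rotation
v = [3, 4]

nested = apply(A, apply(B, v))      # v meets B first, then A
combined = apply(matmul(A, B), v)   # the single matrix AB applied once

print(nested)    # [-8, 3]
print(combined)  # [-8, 3]
```

Both routes land on the same point, which is all that (AB) · v = A · (B · v) claims.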
How to actually compute the product
Here is the meaning-preserving way to multiply two matrices, and it is just last lesson’s idea applied twice.
The columns of a matrix are where it sends i-hat and j-hat. To find the product AB, ask where the combined transformation sends each basis vector. Since B acts first, B sends i-hat to its first column; then A acts on that. So:
- The first column of AB is A applied to the first column of B.
- The second column of AB is A applied to the second column of B.
That is the whole computation. Take each column of the right-hand matrix (the destinations after B), run it through A, and the results are the columns of the product. You can do the standard row-times-column arithmetic instead and get the same numbers, but this column-by-column version is the one that tells you what is happening: you are tracking where the basis lands after both moves.
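The column-by-column recipe can be sketched in a few lines of Python and compared against the row-times-column arithmetic. The function names and the sample matrices below are assumptions chosen for illustration.

```python
# Matrices are lists of rows. All names here are illustrative.

def apply(m, v):
    """Apply matrix m to vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def columns(m):
    """Return the columns of m as a list of vectors."""
    return [list(col) for col in zip(*m)]

def from_columns(cols):
    """Build a matrix (list of rows) from a list of column vectors."""
    return [list(row) for row in zip(*cols)]

def matmul_by_columns(a, b):
    """Each column of ab is a applied to the matching column of b."""
    return from_columns([apply(a, col) for col in columns(b)])

def matmul_row_times_column(a, b):
    """The standard recipe: entry (i, j) is row i of a dotted with column j of b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

A = [[1, 2], [3, 4]]   # arbitrary sample matrices
B = [[5, 6], [7, 8]]

print(matmul_by_columns(A, B))        # [[19, 22], [43, 50]]
print(matmul_row_times_column(A, B))  # [[19, 22], [43, 50]]
```

The two functions do the same arithmetic in a different order; the column version just keeps "where does each basis vector land" visible in the code.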
Worked example: rotate, then shear
Take the two transformations from last lesson. Let R be the ninety-degree counterclockwise rotation and S the shear:
$$R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}, \qquad S = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$$
We want “rotate first, then shear,” which is the product SR (shear on the left, rotation on the right, because the rightmost acts first). Apply S to each column of R.
The columns of R are the vectors [0, 1] and [-1, 0]. Recall how S acts on a general vector, and apply it to each column of R:
S · [x, y] = x · [1, 0] + y · [1, 1]
S · [0, 1] = 0 · [1, 0] + 1 · [1, 1] = [1, 1]
S · [-1, 0] = -1 · [1, 0] + 0 · [1, 1] = [-1, 0]
So the combined matrix is
$$SR = \begin{bmatrix} 1 & -1 \\ 1 & 0 \end{bmatrix}$$
Check it against doing the steps one at a time on the vector [3, 4]. Rotating first and then shearing the result lands on the same point as applying the combined matrix directly:
R · [3, 4] = [-4, 3] (rotate first)
S · [-4, 3] = -4 · [1, 0] + 3 · [1, 1] = [-1, 3] (then shear)
SR · [3, 4] = 3 · [1, 1] + 4 · [-1, 0] = [-1, 3] (combined, directly)
Same answer. The product matrix really does capture both moves at once.
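For comparison, the standard row-times-column arithmetic reaches the same product. This is only a cross-check of the column-by-column result, written out entry by entry:

```latex
SR =
\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}
=
\begin{bmatrix}
1 \cdot 0 + 1 \cdot 1 & 1 \cdot (-1) + 1 \cdot 0 \\
0 \cdot 0 + 1 \cdot 1 & 0 \cdot (-1) + 1 \cdot 0
\end{bmatrix}
=
\begin{bmatrix} 1 & -1 \\ 1 & 0 \end{bmatrix}
```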
Worked example: the other order
Now do the moves in the opposite sequence, shear first then rotate, which is the product RS. Apply R to each column of S. The columns of S are the vectors [1, 0] and [1, 1], and R acts on a general vector as below:
R · [x, y] = x · [0, 1] + y · [-1, 0]
R · [1, 0] = [0, 1]
R · [1, 1] = 1 · [0, 1] + 1 · [-1, 0] = [-1, 1]
So
$$RS = \begin{bmatrix} 0 & -1 \\ 1 & 1 \end{bmatrix}$$
Apply it to the vector [3, 4]:
RS · [3, 4] = 3 · [0, 1] + 4 · [-1, 1] = [-4, 7]
Compare the two results. Rotate-then-shear sent the vector [3, 4] to the point [-1, 3]. Shear-then-rotate sent it to [-4, 7]. Different points, different matrices.
$$SR = \begin{bmatrix} 1 & -1 \\ 1 & 0 \end{bmatrix}, \qquad RS = \begin{bmatrix} 0 & -1 \\ 1 & 1 \end{bmatrix}$$
Matrix multiplication is not commutative. The product AB and the product BA are usually different, and now you can see why: rotating a shape and then shearing it does not land in the same place as shearing it and then rotating it. The order of operations is the order of physical moves, and moves do not generally commute. This is the single most important way matrix multiplication differs from multiplying numbers.
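The two orders can be checked side by side in Python. This is a small sketch with assumed helper names; the uniform scaling `D` at the end is an added example of a special case that does commute, not something taken from the worked examples.

```python
# Matrices are lists of rows. Helper names are illustrative.

def apply(m, v):
    """Apply matrix m to vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def matmul(a, b):
    """Product ab: apply b first, then a."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

R = [[0, -1], [1, 0]]   # ninety-degree counterclockwise rotation
S = [[1, 1], [0, 1]]    # shear

SR = matmul(S, R)       # rotate first, then shear
RS = matmul(R, S)       # shear first, then rotate

print(SR)                  # [[1, -1], [1, 0]]
print(RS)                  # [[0, -1], [1, 1]]
print(apply(SR, [3, 4]))   # [-1, 3]
print(apply(RS, [3, 4]))   # [-4, 7]
print(SR == RS)            # False

# Assumed extra example: scaling everything by 2 commutes with the rotation.
D = [[2, 0], [0, 2]]
print(matmul(D, R) == matmul(R, D))  # True
```

Commuting pairs like `D` and `R` exist, but they are the exception; the safe default is to treat the order as fixed.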
Worked example: order of grouping does not matter
There is one thing matrix multiplication keeps from ordinary arithmetic: it is associative. For three transformations, grouping the first two or the last two gives the same result:
(AB)C = A(BC)
You may group the chain however you like, as long as you do not change the left-to-right order. Take A to be a horizontal stretch by 2, B the rotation R, and C the shear S:
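$$A = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}, \qquad C = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$$
Group it as (AB)C. First form AB by applying A to the columns of B, then apply that to the columns of C: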
AB = columns A·[0, 1] = [0, 1] and A·[-1, 0] = [-2, 0] -> [[0, -2], [1, 0]]
(AB)C = columns (AB)·[1, 0] = [0, 1] and (AB)·[1, 1] = [-2, 1] -> [[0, -2], [1, 1]]
Now group it as A(BC). First form BC by applying B to the columns of C, then apply A to that:
BC = columns [0, 1] and [-1, 1] -> [[0, -1], [1, 1]]
A(BC) = columns A·[0, 1] = [0, 1] and A·[-1, 1] = [-2, 1] -> [[0, -2], [1, 1]]
Same matrix both ways. The reason is not arithmetic luck: “apply C, then B, then A” is one fixed sequence of moves, and where you put the parentheses only changes which two adjacent moves you bundle up first, never the order they happen in. That is why we can write the chain ABC with no parentheses at all.
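A short Python sketch confirms both groupings for the matrices above. The second half tries some random integer matrices as a spot check that the agreement is not special to this example. The helper name `matmul`, the seed, and the value ranges are arbitrary assumptions.

```python
import random

def matmul(a, b):
    """Product ab: apply b first, then a. Matrices are lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

A = [[2, 0], [0, 1]]    # horizontal stretch by 2
B = [[0, -1], [1, 0]]   # rotation R
C = [[1, 1], [0, 1]]    # shear S

left = matmul(matmul(A, B), C)    # (AB)C
right = matmul(A, matmul(B, C))   # A(BC)

print(left)    # [[0, -2], [1, 1]]
print(right)   # [[0, -2], [1, 1]]

# Spot check on random 2x2 integer matrices (seed chosen arbitrarily).
random.seed(0)

def random_matrix():
    return [[random.randint(-5, 5) for _ in range(2)] for _ in range(2)]

all_equal = True
for _ in range(100):
    X, Y, Z = random_matrix(), random_matrix(), random_matrix()
    if matmul(matmul(X, Y), Z) != matmul(X, matmul(Y, Z)):
        all_equal = False

print(all_equal)  # True
```

A spot check is not a proof; the proof is the argument in the text, that both groupings describe the same sequence of moves.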
Why this matters when you use AI
This lesson hides the answer to a question that puzzles many people learning about neural networks: why do networks put a nonlinear step between layers?
A layer’s core is a matrix, a linear transformation. Stack two linear layers back to back and you are composing two linear transformations, which, as you just saw, is itself a single linear transformation with a single matrix. Stack a hundred of them and you still collapse down to one linear transformation. All that depth would buy you nothing; a hundred linear layers in a row can do no more than one carefully chosen layer.
The fix is to insert a nonlinear function between the layers, something that bends the grid rather than keeping its lines straight. That nonlinearity breaks the collapse: now the layers cannot be multiplied together into one matrix, and each added layer genuinely extends what the network can express. The reason depth helps at all is exactly the composition rule you just learned, plus the one ingredient that defeats it.
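The collapse, and how a nonlinearity breaks it, can be seen in a toy Python sketch. Everything specific here is an assumption made for illustration: the weights `W1` and `W2` are made up, bias terms are left out, and ReLU (replace negative entries with zero) stands in for "a nonlinear function" as one common choice.

```python
# Toy two-layer example. Matrices are lists of rows; all names are illustrative.

def apply(m, v):
    """Apply matrix m to vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def matmul(a, b):
    """Product ab: apply b first, then a."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def relu(v):
    """Assumed example nonlinearity: clamp negative entries to zero."""
    return [max(0, c) for c in v]

W1 = [[1, -1], [0, 1]]   # first layer (made-up weights)
W2 = [[2, 0], [1, 1]]    # second layer (made-up weights)
x = [1, 2]

# Two linear layers in a row equal one layer with matrix W2 W1.
stacked = apply(W2, apply(W1, x))
collapsed = apply(matmul(W2, W1), x)
print(stacked, collapsed)   # [-2, 1] [-2, 1]

def net(v):
    """Same two layers with a ReLU in between."""
    return apply(W2, relu(apply(W1, v)))

# Any linear map f satisfies f(-v) == -f(v). The ReLU network does not,
# so no single matrix can reproduce it.
neg_x = [-c for c in x]
print(net(x))                  # [0, 2]
print(net(neg_x))              # [2, 1]
print([-c for c in net(x)])    # [0, -2]
```

The first print shows depth buying nothing without a nonlinearity; the last two show that with ReLU inserted, the network is no longer a linear transformation at all.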
Common pitfalls
Reading the product left to right. The product AB applies B first. The rightmost matrix is closest to the vector and acts first, exactly like the inner function in a nested function. If you apply them left to right, you get the wrong order and usually the wrong answer.
Assuming you can swap the order. The product AB is generally not the product BA. Order is the sequence of physical moves, and rotating-then-shearing differs from shearing-then-rotating. Only in special cases do two transformations happen to commute.
Confusing not-commutative with not-associative. Order of operations matters (AB is not BA), but grouping does not ((AB)C equals A(BC)). These are different statements. You may not reorder the chain, but you may parenthesize it any way you like.
Falling back on the rote rule and losing the meaning. The row-times-column recipe gives correct numbers, but if it ever confuses you, return to the column-by-column picture: each column of the product is the left matrix applied to the corresponding column of the right matrix, which is just “where does the basis land after both moves.”
What you should remember
- Multiplying matrices means composing transformations: the product AB is “do B, then A.” Read right to left, like a nested function, inner first. The product is a single matrix recording where the basis lands after both moves.
- To compute AB, apply A to each column of B. Each column of B is where B sent a basis vector; running it through A gives where the combined transformation sends it. That is the whole rule, and it preserves the meaning.
- Order matters, grouping does not. The product AB is not BA in general (rotate-then-shear is not shear-then-rotate), but (AB)C equals A(BC) always, because the chain of moves is one fixed sequence no matter how you parenthesize it.
Matrix multiplication is composition wearing a number grid. Do one move, then the next, and the product records the whole journey. The next lesson takes everything so far and steps it up a dimension, into transformations of 3D space.