Lesson: Matrix multiplication as composition

Last lesson landed on a single idea: a matrix is a record of where the basis vectors go, and that record is enough to move all of space. One matrix, one move. This lesson asks the obvious next question. What if you want two moves in a row? Rotate the plane, then shear it. Stretch it, then rotate it. What records that combined motion?

The answer is matrix multiplication, and the reason it is worth a lesson is that almost everyone meets the multiplication rule first, as a memorized recipe of rows times columns, and never learns what it means. It means composition: do one transformation, then another. Once you see that, the rule stops being arbitrary and the strange parts (why order matters, why you read right to left) become obvious instead of memorized.

Take any two linear transformations and do them in sequence: apply the first, then apply the second to the result. The combined effect is itself a linear transformation. The origin still does not move (neither step moved it), and grid lines stay straight, parallel, and evenly spaced (neither step bent them). So the combined motion is linear too, which means it has its own matrix, its own record of where i-hat and j-hat end up after both steps.

That combined matrix is what we call the product of the two original matrices. Matrix multiplication is not a new operation invented to torture students. It is the bookkeeping for “do this transformation, then that one.”

Write the product of A and B as AB. The convention, which feels backward until it clicks, is that AB means first apply B, then apply A. The matrix on the right goes first.

This is not arbitrary either. It comes straight from how we already write functions. When you see a nested function such as f(g(x)), an outer function f wrapped around an inner function g, the inner function runs first, because the input reaches g before it reaches f. Matrices work the same way. Applying the product to a vector shows it directly:

(AB) · v = A · (B · v)

The vector meets B first (it is closest), then the result meets A. The matrix nearest the vector acts first. Read right to left, always.

[Figure: Matrix multiplication is composition. Three panels, read right to left: the original v = [1, 1]; then B · v = [-1, 1] after B, a 90-degree counterclockwise rotation; then A · (B · v) = [-2, 1] after A, a horizontal stretch by 2. Footer: AB · v = A · (B · v), apply B first, then A.]
Read the product AB times v from right to left. The vector v meets B first; the result then meets A. That is matrix multiplication as composition. The order matters: A and B are functions and they get to act in turn.
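The right-to-left reading can be checked in a few lines of plain Python. This is only a sketch: the helper name `apply` is made up for illustration, matrices are assumed to be stored as lists of rows, and A, B, and v are the stretch, rotation, and vector from the figure.

```python
def apply(M, v):
    """Apply a 2x2 matrix M (stored as a list of rows) to a vector v = [x, y]."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

A = [[2, 0], [0, 1]]    # horizontal stretch by 2
B = [[0, -1], [1, 0]]   # 90-degree counterclockwise rotation
v = [1, 1]

Bv = apply(B, v)        # B is nearest the vector, so it acts first
ABv = apply(A, Bv)      # A then acts on the result

print(Bv)    # [-1, 1]
print(ABv)   # [-2, 1]
```

The nesting `apply(A, apply(B, v))` is the same shape as A · (B · v): the inner call runs first.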

Here is the meaning-preserving way to multiply two matrices, and it is just last lesson’s idea applied twice.

The columns of a matrix are where it sends i-hat and j-hat. To find the product AB, ask where the combined transformation sends each basis vector. Since B acts first, B sends i-hat to its first column; then A acts on that. So:

  • The first column of AB is A applied to the first column of B.
  • The second column of AB is A applied to the second column of B.

That is the whole computation. Take each column of the right-hand matrix (the destinations after B), run it through A, and the results are the columns of the product. You can do the standard row-times-column arithmetic instead and get the same numbers, but this column-by-column version is the one that tells you what is happening: you are tracking where the basis lands after both moves.

[Figure: Matrix multiplication column by column. AB = [ A · col1(B) | A · col2(B) ], that is, column k of AB is A times column k of B. With A = [[2, 0], [0, 1]], the columns of B, [1, 1] and [0, 1], become the columns of AB, [2, 1] and [0, 1]: A stretches the first column horizontally by 2 and leaves the second unchanged.]
The product AB is not magic. Column by column: take each column of B, treat it as a vector, apply A, and the result is that column of AB. The first column of AB is what A does to the first column of B; the second is what A does to the second.
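Here is one way the column-by-column rule might look in code, next to the familiar row-times-column recipe, so you can see they agree. All the helper names (`apply`, `columns`, `from_columns`, `matmul_by_columns`, `matmul_row_col`) are hypothetical, matrices are assumed to be lists of rows, and the numbers are the ones from the figure.

```python
def apply(M, v):
    """Apply a 2x2 matrix M (list of rows) to a vector v."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def columns(M):
    """Return the two columns of M as vectors."""
    return [M[0][0], M[1][0]], [M[0][1], M[1][1]]

def from_columns(c1, c2):
    """Build a matrix (list of rows) from its two columns."""
    return [[c1[0], c2[0]], [c1[1], c2[1]]]

def matmul_by_columns(A, B):
    """Column k of AB is A applied to column k of B."""
    b1, b2 = columns(B)
    return from_columns(apply(A, b1), apply(A, b2))

def matmul_row_col(A, B):
    """The rote rule: entry (i, j) is row i of A times column j of B."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[2, 0], [0, 1]]   # horizontal stretch by 2
B = [[1, 0], [1, 1]]   # columns are [1, 1] and [0, 1]

AB = matmul_by_columns(A, B)
print(columns(AB))                  # ([2, 1], [0, 1])
print(AB == matmul_row_col(A, B))   # True
```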

Take the two transformations from last lesson. Let R be the ninety-degree counterclockwise rotation and S the shear:

R = [ 0  -1 ]    S = [ 1  1 ]
    [ 1   0 ]        [ 0  1 ]

We want “rotate first, then shear,” which is the product SR (shear on the left, rotation on the right, because the rightmost acts first). Apply S to each column of R.

The columns of R are the vectors [0, 1] and [-1, 0]. Recall how S acts on a general vector, and apply it to each column of R:

S · [x, y] = x · [1, 0] + y · [1, 1]
S · [0, 1] = 0 · [1, 0] + 1 · [1, 1] = [1, 1]
S · [-1, 0] = -1 · [1, 0] + 0 · [1, 1] = [-1, 0]

So the combined matrix is

SR = [ 1  -1 ]
     [ 1   0 ]

Check it against doing the steps one at a time on the vector [3, 4]. Rotating first and then shearing the result lands on the same point as applying the combined matrix directly:

R · [3, 4] = [-4, 3] (rotate first)
S · [-4, 3] = -4 · [1, 0] + 3 · [1, 1] = [-1, 3] (then shear)
SR · [3, 4] = 3 · [1, 1] + 4 · [-1, 0] = [-1, 3] (combined, directly)

Same answer. The product matrix really does capture both moves at once.
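If you would like to replay that check yourself, the sketch below builds SR column by column and compares the two routes on [3, 4]. The helper `apply` is a made-up name, and storing matrices as lists of rows is an assumption of this sketch.

```python
def apply(M, v):
    """Apply a 2x2 matrix M (list of rows) to a vector v."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

R = [[0, -1], [1, 0]]   # 90-degree counterclockwise rotation
S = [[1, 1], [0, 1]]    # shear

# Columns of SR: S applied to each column of R.
col1 = apply(S, [0, 1])    # first column of R
col2 = apply(S, [-1, 0])   # second column of R
SR = [[col1[0], col2[0]],
      [col1[1], col2[1]]]
print(SR)   # [[1, -1], [1, 0]]

v = [3, 4]
step_by_step = apply(S, apply(R, v))   # rotate first, then shear
direct = apply(SR, v)                  # combined matrix in one go
print(step_by_step, direct)            # [-1, 3] [-1, 3]
```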

Now do the moves in the opposite sequence, shear first then rotate, which is the product RS. Apply R to each column of S. The columns of S are the vectors [1, 0] and [1, 1], and R acts on a general vector as below:

R · [x, y] = x · [0, 1] + y · [-1, 0]
R · [1, 0] = [0, 1]
R · [1, 1] = 1 · [0, 1] + 1 · [-1, 0] = [-1, 1]

So

RS = [ 0  -1 ]
     [ 1   1 ]

Apply it to the vector with components 3, 4:

RS · [3, 4] = 3 · [0, 1] + 4 · [-1, 1] = [-4, 7]

Compare the two results. Rotate-then-shear sent the vector [3, 4] to the point [-1, 3]. Shear-then-rotate sent it to [-4, 7]. Different points, different matrices.

SR = [ 1  -1 ]    RS = [ 0  -1 ]
     [ 1   0 ]         [ 1   1 ]

Matrix multiplication is not commutative. The product AB and the product BA are usually different, and now you can see why: rotating a shape and then shearing it does not land in the same place as shearing it and then rotating it. The order of operations is the order of physical moves, and moves do not generally commute. This is the single most important way matrix multiplication differs from multiplying numbers.

[Figure: Matrix multiplication is not commutative. Two side-by-side panels show the unit square under a 90-degree counterclockwise rotation and a horizontal shear by 0.5, applied in both orders. Left, SR (rotate, then shear): SR(î) = [0.5, 1], SR(ĵ) = [-1, 0], giving a parallelogram with corners at the origin, [0.5, 1], [-0.5, 1], and [-1, 0]. Right, RS (shear, then rotate): RS(î) = [0, 1], RS(ĵ) = [-1, 0.5], giving a parallelogram with corners at the origin, [0, 1], [-1, 1.5], and [-1, 0.5].]
Apply two transformations in one order and you get one parallelogram; apply them in the other order and you get a different one. SR is not the same as RS. That is matrix multiplication being non-commutative: AB and BA almost never agree.
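A short sketch makes the comparison concrete. It uses the rote row-times-column rule through a hypothetical helper `matmul`, with matrices assumed to be lists of rows. R and S are the rotation and shear from the worked example. The two diagonal stretches P and Q are not from the lesson; they are added here only to illustrate one special case where order happens not to matter.

```python
def matmul(A, B):
    """Row-times-column product of two 2x2 matrices (lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

R = [[0, -1], [1, 0]]   # 90-degree counterclockwise rotation
S = [[1, 1], [0, 1]]    # shear

SR = matmul(S, R)       # rotate first, then shear
RS = matmul(R, S)       # shear first, then rotate
print(SR)               # [[1, -1], [1, 0]]
print(RS)               # [[0, -1], [1, 1]]
print(SR == RS)         # False

# A special case that does commute: two stretches along the axes.
P = [[2, 0], [0, 1]]    # stretch x by 2
Q = [[1, 0], [0, 3]]    # stretch y by 3
print(matmul(P, Q) == matmul(Q, P))   # True
```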

Worked example: order of grouping does not matter

There is one thing matrix multiplication keeps from ordinary arithmetic: it is associative. For three transformations, grouping the first two or the last two gives the same result:

(AB)C = A(BC)

You may group the chain however you like, as long as you do not change the left-to-right order. Take A to be a horizontal stretch by 2, B the rotation R, and C the shear S:

A = [ 2  0 ]    B = [ 0  -1 ]    C = [ 1  1 ]
    [ 0  1 ]        [ 1   0 ]        [ 0  1 ]

Group it as (AB)C. First form AB by applying A to the columns of B, then apply that to the columns of C:

AB = columns A·[0, 1] = [0, 1] and A·[-1, 0] = [-2, 0] -> [[0, -2], [1, 0]]
(AB)C = columns (AB)·[1, 0] = [0, 1] and (AB)·[1, 1] = [-2, 1] -> [[0, -2], [1, 1]]

Now group it as A(BC). First form BC by applying B to the columns of C, then apply A to that:

BC = columns B·[1, 0] = [0, 1] and B·[1, 1] = [-1, 1] -> [[0, -1], [1, 1]]
A(BC) = columns A·[0, 1] = [0, 1] and A·[-1, 1] = [-2, 1] -> [[0, -2], [1, 1]]

Same matrix both ways. The reason is not arithmetic luck: “apply C, then B, then A” is one fixed sequence of moves, and where you put the parentheses only changes which two adjacent moves you bundle up first, never the order they happen in. That is why we can write the chain ABC with no parentheses at all.
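The two groupings can be checked mechanically as well. In this sketch `matmul` is a hypothetical helper implementing the row-times-column rule on matrices stored as lists of rows, and A, B, and C are the stretch, rotation, and shear from this example.

```python
def matmul(X, Y):
    """Row-times-column product of two 2x2 matrices (lists of rows)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[2, 0], [0, 1]]    # horizontal stretch by 2
B = [[0, -1], [1, 0]]   # 90-degree counterclockwise rotation
C = [[1, 1], [0, 1]]    # shear

left_grouping = matmul(matmul(A, B), C)    # (AB)C
right_grouping = matmul(A, matmul(B, C))   # A(BC)

print(left_grouping)                     # [[0, -2], [1, 1]]
print(right_grouping)                    # [[0, -2], [1, 1]]
print(left_grouping == right_grouping)   # True
```

Only the parentheses moved between the two lines; the left-to-right order A, B, C stayed put.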

This lesson hides the answer to a question that puzzles many people learning about neural networks: why do networks put a nonlinear step between layers?

A layer’s core is a matrix, a linear transformation. Stack two linear layers back to back and you are composing two linear transformations, which, as you just saw, is itself a single linear transformation with a single matrix. Stack a hundred of them and you still collapse down to one linear transformation. All that depth would buy you nothing; a hundred linear layers in a row can do no more than one carefully chosen layer.

The fix is to insert a nonlinear function between the layers, something that bends the grid rather than keeping its lines straight. That nonlinearity breaks the collapse: now the layers cannot be multiplied together into one matrix, and each added layer genuinely extends what the network can express. The reason depth helps at all is exactly the composition rule you just learned, plus the one ingredient that defeats it.
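To make the collapse and its fix tangible, here is a toy sketch. It is not how any real network library is organized. The names `apply`, `matmul`, `relu`, and `net` are hypothetical, the "weights" W1 and W2 simply reuse this lesson's rotation and shear, and ReLU (clip negatives to zero) is assumed as one common choice of nonlinear step. The lesson itself does not name a particular function.

```python
def apply(M, v):
    """Apply a 2x2 matrix M (list of rows) to a vector v."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def matmul(A, B):
    """Row-times-column product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def relu(v):
    """Assumed nonlinearity: replace negative components with zero."""
    return [max(0, c) for c in v]

W1 = [[0, -1], [1, 0]]   # first "layer": the rotation R
W2 = [[1, 1], [0, 1]]    # second "layer": the shear S

# Two linear layers collapse into the single matrix W2 W1.
x = [3, 4]
stacked = apply(W2, apply(W1, x))
collapsed = apply(matmul(W2, W1), x)
print(stacked == collapsed)   # True

# With a nonlinear step in between, the result is no longer linear:
# a linear map would satisfy net(u + v) == net(u) + net(v).
def net(x):
    return apply(W2, relu(apply(W1, x)))

u, v = [1, 0], [-1, 0]
sum_of_outputs = [a + b for a, b in zip(net(u), net(v))]
output_of_sum = net([u[0] + v[0], u[1] + v[1]])
print(sum_of_outputs, output_of_sum)   # [1, 1] [0, 0]
```

Because the second network fails the additivity test, no single matrix can stand in for it, which is the sense in which the nonlinearity "breaks the collapse."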

Reading the product left to right. The product AB applies B first. The rightmost matrix is closest to the vector and acts first, exactly like the inner function in a nested function. If you apply them left to right, you get the wrong order and usually the wrong answer.

Assuming you can swap the order. The product AB is generally not the product BA. Order is the sequence of physical moves, and rotating-then-shearing differs from shearing-then-rotating. Only in special cases do two transformations happen to commute.

Confusing not-commutative with not-associative. Order of operations matters (AB is not BA), but grouping does not ((AB)C equals A(BC)). These are different statements. You may not reorder the chain, but you may parenthesize it any way you like.

Falling back on the rote rule and losing the meaning. The row-times-column recipe gives correct numbers, but if it ever confuses you, return to the column-by-column picture: each column of the product is the left matrix applied to the corresponding column of the right matrix, which is just “where does the basis land after both moves.”

  • Multiplying matrices means composing transformations: the product AB is “do B, then A.” Read right to left, like a nested function, inner first. The product is a single matrix recording where the basis lands after both moves.
  • To compute AB, apply A to each column of B. Each column of B is where B sent a basis vector; running it through A gives where the combined transformation sends it. That is the whole rule, and it preserves the meaning.
  • Order matters, grouping does not. The product AB is not BA in general (rotate-then-shear is not shear-then-rotate), but (AB)C equals A(BC) always, because the chain of moves is one fixed sequence no matter how you parenthesize it.

Matrix multiplication is composition wearing a number grid. Do one move, then the next, and the product records the whole journey. The next lesson takes everything so far and steps it up a dimension, into transformations of 3D space.