Skip to content

Summary: Matrix multiplication as composition

Matrix multiplication has a reputation as an arbitrary rule to memorize, rows times columns, with no meaning attached. The lesson replaces that with one idea: matrix multiplication is composition, so AB means “do B, then A,” read right to left. Once you see the product as one move recording two moves in sequence, the strange parts (why order matters, why grouping does not, why you read backward) stop being rules and become obvious. This is the scan-it-in-five-minutes version.

  • Doing two linear transformations in a row is itself a linear transformation (the origin still does not move; lines stay straight), so it has its own matrix. That combined matrix is the product of the two originals. Matrix multiplication is the bookkeeping for “do this transformation, then that one.”
  • The order convention reads right to left: AB means first apply B, then apply A, because (AB) · v = A · (B · v) sends the vector through B first (it is closest), then A. It is the same order as the inner function in f(g(x)).
  • The meaning-preserving way to compute the product is column by column: each column of AB is A applied to the corresponding column of B. The columns of B are where it sends the basis vectors; running them through A gives where the combined move sends them. The row-times-column recipe gives the same numbers but hides this.
  • Matrix multiplication is not commutative: AB and BA are usually different matrices, because the order of operations is the order of physical moves and moves do not generally commute. Worked anchor: with the rotation R and shear S, rotate-then-shear sends [3,4] to [-1,3] while shear-then-rotate sends it to [-4,7]. This is the single biggest way it differs from multiplying numbers.
  • Matrix multiplication is associative: (AB)C = A(BC) always. The chain of moves is one fixed sequence, so where you put the parentheses only changes which adjacent pair you bundle first, never the order they happen in. That is why ABC needs no parentheses.
  • This is why neural networks put a nonlinear step between layers. A layer’s core is a matrix, and composing two linear layers gives one linear transformation, so stacking linear layers alone collapses to a single layer and buys no extra power. The nonlinearity between layers bends the grid and breaks that collapse, which is exactly why depth helps; the composition rule here is the reason the nonlinearity is needed at all.

Before this lesson, matrix multiplication was probably a procedure you could execute but not explain, and the right-to-left convention felt like a trap. Now the product has a physical meaning: it is two moves fused into one, and the order on the page is the order in time. When you next see a chain of matrices in a graphics pipeline, a robotics transform stack, or a network architecture, you can read it as a sequence of moves applied right to left, and you know why the order cannot be casually swapped. The next lesson keeps every rule identical and steps the whole picture up into three dimensions, where a transformation becomes a 3-by-3 matrix.