A single attention mechanism is limiting. It can only capture one type of relationship between tokens at a time.

So we split attention into multiple heads, typically 8 to 16.

Each head gets its own smaller set of Q/K/V projections, working in a lower-dimensional slice of the embedding. All heads run in parallel.
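
To make "smaller" concrete, here is a tiny sketch of how the dimensions split. The numbers (a 512-dim embedding, 8 heads) are illustrative assumptions, not values from the text.

    d_model = 512                  # assumed embedding size (illustrative)
    num_heads = 8                  # assumed head count (illustrative)
    d_head = d_model // num_heads  # each head attends over a 64-dim subspace

    # Each head's Q/K/V projections map d_model -> d_head, so the 8 heads
    # together still span the full 512 dimensions (8 * 64 = 512).
    print(d_head)  # 64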

Why multiple heads matter:

  • Head 1 might focus on grammar (subject-verb agreement)
  • Head 2 might track long-range dependencies
  • Head 3 might capture semantic meaning
  • Head 4 might notice syntax patterns

Same input, different learned patterns. A richer representation of the sequence overall.

After all heads compute their attention outputs, we concatenate the results and project them back to the original embedding size with a final linear layer, as in the sketch below.
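
Here is a minimal PyTorch sketch of the whole flow: split into heads, attend in parallel, concatenate, then apply the output projection. The sizes (512-dim embeddings, 8 heads) and the class and attribute names are illustrative assumptions, not any specific model's implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiHeadAttention(nn.Module):
        # Minimal multi-head self-attention sketch; sizes and names are illustrative.

        def __init__(self, d_model=512, num_heads=8):
            super().__init__()
            assert d_model % num_heads == 0
            self.num_heads = num_heads
            self.d_head = d_model // num_heads
            # One combined projection per Q/K/V; reshaping below gives each head its own slice.
            self.w_q = nn.Linear(d_model, d_model)
            self.w_k = nn.Linear(d_model, d_model)
            self.w_v = nn.Linear(d_model, d_model)
            self.w_o = nn.Linear(d_model, d_model)  # output projection applied after concatenation

        def forward(self, x):
            batch, seq_len, d_model = x.shape

            def split_heads(t):
                # (batch, seq, d_model) -> (batch, heads, seq, d_head)
                return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

            q = split_heads(self.w_q(x))
            k = split_heads(self.w_k(x))
            v = split_heads(self.w_v(x))

            # Scaled dot-product attention, computed for all heads in parallel.
            scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
            weights = F.softmax(scores, dim=-1)
            heads = weights @ v                  # (batch, heads, seq, d_head)

            # Concatenate the heads back into (batch, seq, d_model), then project.
            concat = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
            return self.w_o(concat)

    x = torch.randn(2, 10, 512)      # 2 sequences of 10 tokens, 512-dim embeddings
    out = MultiHeadAttention()(x)
    print(out.shape)                 # torch.Size([2, 10, 512])

Note the input and output shapes match: the concatenation plus final projection is what lets the block slot back into the residual stream at the original embedding size.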

More heads = more ways to understand relationships. But also more compute.

Modern models balance this: enough heads to capture complexity, but not so many that each head's subspace becomes too small to be useful or the extra compute stops paying off.