Before transformers can process text, they convert it to numbers.

Two parts: token embedding and positional encoding.

Token embedding maps tokens, roughly words or word pieces, to vectors. “king” becomes a list of numbers that sits near “queen” in vector space, so distance in that space starts to track similarity in meaning.
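
A minimal sketch in PyTorch, with a made-up toy vocabulary and an 8-dimensional table (real models use subword tokenizers and much wider vectors):

```python
import torch
import torch.nn as nn

# Toy vocabulary, purely for illustration; real tokenizers have
# tens of thousands of entries.
vocab = {"<pad>": 0, "king": 1, "queen": 2, "dog": 3, "bites": 4, "man": 5}

d_model = 8  # embedding width; real models use 512, 768, or more
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

# Look up the vector for each token id in a sentence.
token_ids = torch.tensor([[vocab["dog"], vocab["bites"], vocab["man"]]])
token_vectors = embed(token_ids)
print(token_vectors.shape)  # (1, 3, 8): one vector per token
```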

Positional encoding adds location information: first token, second token, third. Attention on its own treats the input as an unordered set, so without it the model can’t tell “dog bites man” from “man bites dog”.
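
One common scheme is the sinusoidal encoding from the original Transformer paper; many models learn position vectors instead. A rough sketch, with the sequence length and width chosen arbitrarily:

```python
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encodings:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq_len, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)                # (d_model/2,)
    angles = positions / torch.pow(torch.tensor(10000.0), dims / d_model)  # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_positions(seq_len=3, d_model=8)
print(pe.shape)  # (3, 8): a distinct vector for each position
```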

Combined, they form the model’s input: what each token means and where it sits.
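
In the original Transformer, “combined” just means added element-wise (other schemes exist). A sketch that puts the two pieces together, reusing the toy sizes from above:

```python
import torch
import torch.nn as nn

seq_len, d_model, vocab_size = 3, 8, 6
embed = nn.Embedding(vocab_size, d_model)

# Sinusoidal positions, as in the sketch above: one vector per position.
positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
dims = torch.arange(0, d_model, 2, dtype=torch.float32)
angles = positions / torch.pow(torch.tensor(10000.0), dims / d_model)
pos_enc = torch.zeros(seq_len, d_model)
pos_enc[:, 0::2] = torch.sin(angles)
pos_enc[:, 1::2] = torch.cos(angles)

# "dog bites man" as ids from the toy vocabulary in the first sketch.
token_ids = torch.tensor([[3, 4, 5]])   # (batch=1, seq_len=3)
x = embed(token_ids) + pos_enc          # (1, 3, 8): meaning + position
print(x.shape)
```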

These embeddings flow into transformer blocks, where the real processing happens.

Good embeddings carry the whole setup. If similar words don’t end up near each other, downstream layers have little structure to work with and transfer learning breaks. If positional encoding fails, word order collapses.
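
A quick way to see what “near each other” means: cosine similarity between embedding vectors. The snippet below only shows the mechanics on untrained vectors; with a trained model you’d expect related words (“king”/“queen”) to score noticeably higher than unrelated ones (“king”/“dog”).

```python
import torch
import torch.nn.functional as F

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    # Cosine similarity between two embedding vectors, in [-1, 1].
    return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

# Untrained table with the toy vocabulary ids from the first sketch.
embed = torch.nn.Embedding(6, 8)
king, queen, dog = embed.weight[1], embed.weight[2], embed.weight[3]
print(cosine(king, queen), cosine(king, dog))
```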