- Tokenize
Assigning token IDs to words and subwords.
A well-known tokenizer: tiktoken (see the sketch below).
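A minimal sketch of tokenization with tiktoken; the choice of the `"gpt2"` encoding is just for illustration, any registered encoding works the same way:

```python
import tiktoken

# Load a pretrained BPE tokenizer (encoding name chosen for illustration)
enc = tiktoken.get_encoding("gpt2")

text = "The cat sat"
token_ids = enc.encode(text)   # a list of integer token IDs, one per word/subword piece
print(token_ids)
print(enc.decode(token_ids))   # round-trips back to "The cat sat"
```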
- Word embedding
First, initialize a [vocab_size, embedding_dim] matrix with random values. embedding_dim is chosen by the user and depends on the size of the model.
So the result would look like:
$$
\text{Embedding Matrix} \in \mathbb{R}^{1000 \times 768} =
\begin{bmatrix}
0.12 & -0.45 & 0.33 & \cdots & 0.01 \\
-0.67 & 0.22 & 0.91 & \cdots & -0.08 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
-0.23 & 0.85 & 0.04 & \cdots & 0.76
\end{bmatrix}
$$
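In code, that random initialization and the row lookup could look like this minimal NumPy sketch; the normal-distribution scale and the random seed are illustrative choices, not prescribed values:

```python
import numpy as np

vocab_size, embedding_dim = 1000, 768

# Random initialization: every token ID gets one row of size embedding_dim
# (the 0.02 scale and the seed are illustrative)
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(scale=0.02, size=(vocab_size, embedding_dim))

# Looking up an embedding is just row indexing
cat_id = 389                             # token ID for "cat" in this example
cat_vector = embedding_matrix[cat_id]    # shape: (768,)
print(embedding_matrix.shape, cat_vector.shape)
```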
Each row of this matrix is a token represented as a vector in 768-dimensional space:

* embedding_matrix[0] → embedding vector for token ID 0
* embedding_matrix[1] → embedding vector for token ID 1
* ...
* embedding_matrix[389] → embedding vector for token ID 389 ("cat")

Put the embedding layer at the very start of the transformer, so that it can tune and learn the vectors:

1. **Input token IDs** → embeddings $x \in \mathbb{R}^{d}$
2. These embeddings go through the transformer (QKV, attention, etc.)
3. The model outputs a **prediction distribution** over the vocabulary
4. Compare that prediction to the **true next token** using **cross-entropy loss**
5. Backpropagation flows **all the way back**, updating:
   * The output layer
   * Transformer blocks
   * **And the embeddings**

---

Let's say we are training on this sequence:

```plaintext
"The cat sat"
```

And the model is predicting `"sat"` given `"The cat"`.

* Suppose `"cat"` is token ID `389`
* The embedding for `"cat"` is `x = embedding_matrix[389]`

When the model **fails** to predict `"sat"` correctly, the **loss is high**, and gradients flow **back to `x`**, which updates `embedding_matrix[389]`.

So the embedding for `"cat"` is:

* Penalized (changed) if it is not helping the model predict correctly
* Reinforced (stabilized) if it helps the model get the right prediction

> Eventually, semantically similar words end up close together in vector space!
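To make steps 1-5 concrete, here is a hedged PyTorch sketch of one training step on the "The cat sat" example. The token IDs for "The" and "sat" are made up for illustration (only 389 for "cat" comes from the example above), and a single linear layer stands in for the full transformer stack:

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 1000, 768

# Embedding layer sits at the very front; its weight is the embedding matrix
embedding = nn.Embedding(vocab_size, embedding_dim)
# Stand-in for the transformer blocks + output head (illustrative only)
output_head = nn.Linear(embedding_dim, vocab_size)

# "The cat" -> predict "sat"; "The"=13 and "sat"=512 are made-up IDs
context = torch.tensor([[13, 389]])     # "The", "cat"
target = torch.tensor([512])            # "sat"

x = embedding(context)                  # (1, 2, 768) embeddings
logits = output_head(x[:, -1, :])       # predict next token from the last position
loss = nn.functional.cross_entropy(logits, target)

loss.backward()

# Only the rows used in the forward pass receive gradient
grad = embedding.weight.grad
print(grad[389].abs().sum() > 0)        # True: the "cat" row will be updated
print(grad[700].abs().sum() > 0)        # False: unused rows stay untouched
```

An optimizer step would then move `embedding.weight[389]` in the direction that lowers the loss, which is exactly how the "cat" vector gets penalized or reinforced over training.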