Transformer-Based LLM Attention

It starts with the question:

If attention works so well at highlighting important inputs, why do we need all the recurrent machinery in the first place?

The answer: a model where everything attends to everything else.

Overall Structure

Multi-head attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

Motivation

Instead of performing a single attention function, we run multiple attention heads in parallel. This allows the model to capture different aspects of the relationships between tokens (e.g., syntax vs. semantics).

Formula Overview

Given:

  • Input: queries Q, keys K, values V
  • Weight matrices $W^Q_h$, $W^K_h$, $W^V_h$ for each head
  • Output projection: $W^O$

For each head $h$, the projected queries, keys, and values are

$$Q^{(h)} = Q W^Q_h, \qquad K^{(h)} = K W^K_h, \qquad V^{(h)} = V W^V_h$$

Perform scaled dot-product attention:

$$\text{head}_h = \text{Attention}\big(Q^{(h)}, K^{(h)}, V^{(h)}\big) = \text{softmax}\!\left(\frac{Q^{(h)} {K^{(h)}}^\top}{\sqrt{d_k}}\right) V^{(h)}$$

Dividing by $\sqrt{d_k}$ keeps the dot products on a consistent scale, so the softmax doesn't saturate and we don't have to re-tune hyperparameters whenever we change the dimension.

Then concatenate all heads and project:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H)\, W^O$$
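
A minimal NumPy sketch of this per-head computation (a single unbatched sequence, no masking; the function name is my own):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)
    # numerically stabilized softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                   # (seq_len, d_v)
```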

Step-by-Step Breakdown
  1. Linear projections of Q, K, and V using learned weights $W^Q_h$, $W^K_h$, $W^V_h$

  2. Scaled dot-product attention is computed separately for each head

  3. Concatenate all heads’ outputs

  4. Final linear projection using $W^O$ (see the sketch below)
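
Putting the four steps together, a minimal sketch that reuses `scaled_dot_product_attention` from above (the per-head weights are passed as plain lists; the names are illustrative, not from any particular library):

```python
def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of per-head (d_model, d_k or d_v) matrices; W_o: (H*d_v, d_model)."""
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(W_q, W_k, W_v):
        # Steps 1-2: project Q, K, V for this head, then attend
        Q_h, K_h, V_h = Q @ Wq_h, K @ Wk_h, V @ Wv_h
        heads.append(scaled_dot_product_attention(Q_h, K_h, V_h))
    # Steps 3-4: concatenate the heads and apply the final output projection
    return np.concatenate(heads, axis=-1) @ W_o          # (seq_len, d_model)
```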

Matrix Dimensions
| Component | Shape |
| --- | --- |
| Q, K, V | (seq_len, d_model) |
| W^Q_h, W^K_h, W^V_h | (d_model, d_k) or (d_model, d_v) |
| Q^{(h)}, K^{(h)} | (seq_len, d_k) |
| Output per head | (seq_len, d_v) |
| Final output | (seq_len, H × d_v) → (seq_len, d_model) after W^O |
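
A quick self-attention shape check against this table, reusing the sketches above (random values; $d_k = d_v = d_{model}/H$ as in the original paper):

```python
import numpy as np

seq_len, d_model, H = 10, 512, 8
d_k = d_v = d_model // H                                  # 64

rng = np.random.default_rng(0)
X = rng.standard_normal((seq_len, d_model))               # self-attention: Q = K = V = X
W_q = [rng.standard_normal((d_model, d_k)) for _ in range(H)]
W_k = [rng.standard_normal((d_model, d_k)) for _ in range(H)]
W_v = [rng.standard_normal((d_model, d_v)) for _ in range(H)]
W_o = rng.standard_normal((H * d_v, d_model))

out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)
assert out.shape == (seq_len, d_model)                    # (10, 512)
```
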
Why Multiple Heads?
  • Attend to different things (e.g., subject-verb agreement, word order, phrase-level context)

  • Learn different subspace representations

  • Aggregate richer and more diverse information

  • Shown empirically to improve performance over single-head attention

Each attention head performs:

Q, K, V → Linear projections → Scaled Dot-Product Attention → head_h

Then:

Concat(head_1, …, head_H) → W^O → Final Output

Layer Normalization

  1. Compute the mean and standard deviation of the activations at each position: $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$, $\;\sigma = \sqrt{\frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2}$
  2. Subtract the mean and divide by the standard deviation: $\hat{x} = \frac{x - \mu}{\sigma}$

The normalized values are then multiplied by a learned scale $\gamma$ and shifted by a learned offset $\beta$ ($y = \gamma \odot \hat{x} + \beta$), which restores the flexibility to represent any scale and shift.
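
A minimal sketch, assuming `x` is a `(seq_len, d_model)` NumPy array and `gamma`, `beta` are learned vectors of length `d_model` (the small `eps` is only for numerical stability):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each position's activation vector, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)     # step 1: per-position mean
    std = x.std(axis=-1, keepdims=True)       # step 1: per-position std
    x_hat = (x - mean) / (std + eps)          # step 2: zero mean, unit variance
    return gamma * x_hat + beta               # learned scale (gamma) and shift (beta)
```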

Why the Transformer architecture rather than an RNN

  1. Everything is feedforward: training is parallel across time steps, so the GPU can be fully utilized (see the toy contrast below).
  2. Every time step can attend to any other: there is no limit on the range of interaction, and no fixed-size memory bottleneck or vanishing gradients through time.
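
A toy contrast of the two computation patterns (`W_h`, `U_h` are hypothetical recurrent weights; the attention line omits the projections and softmax and just shows the all-pairs scoring):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 10, 64
X = rng.standard_normal((seq_len, d_model))

# RNN-style: a Python-level loop; step t needs the hidden state from step t-1,
# so time steps cannot run in parallel and gradients must flow through every step.
W_h = rng.standard_normal((d_model, d_model))
U_h = rng.standard_normal((d_model, d_model))
h = np.zeros(d_model)
for t in range(seq_len):
    h = np.tanh(X[t] @ W_h + h @ U_h)

# Self-attention-style: a single matrix product scores every pair of positions at once,
# so the whole sequence is processed in parallel and distant positions interact directly.
scores = X @ X.T / np.sqrt(d_model)           # (seq_len, seq_len)
```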