Attention itself has no notion of word order in a sequence, so positional information must be added explicitly.
The positional encoding should:
- Uniquely identify each position that is likely to be observed
- Have bounded scale [-1, 1] → keeps the encoding's magnitude comparable to the token embeddings regardless of position
- Be easy to reason about computationally in terms of time deltas (easily tell whether two encodings are for positions some fixed distance apart) → we care about relative position
- Generalize well to longer sequences
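One scheme satisfying these properties is the sinusoidal encoding from the original Transformer paper, where each position is mapped to sines and cosines of geometrically spaced frequencies. A minimal NumPy sketch (function name and dimensions are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(max_len)[:, None]               # (max_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = sinusoidal_positional_encoding(64, 16)

# Bounded scale: every value lies in [-1, 1].
assert pe.min() >= -1.0 and pe.max() <= 1.0
# Unique: each of the 64 positions gets a distinct encoding vector.
assert len({tuple(row.round(6)) for row in pe}) == 64
```

Because sin/cos satisfy angle-addition identities, the encoding at position pos + k is a fixed linear transform of the encoding at pos, which is what makes relative offsets easy to reason about; and since the formula is defined for any pos, it extends to sequences longer than those seen in training.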