Attention itself has no notion of word order in a sequence, so positional information must be injected.

The positional encoding should (see the sketch after this list):

  • Uniquely identify each position that is likely to be observed
  • Stay on a bounded scale, e.g. [-1, 1], so values do not grow with position and remain comparable in magnitude to the token embeddings
  • Make time deltas easy to reason about computationally (easy to tell whether two encodings are for positions some fixed distance apart), since we often care about relative position
  • Generalize well to longer sequences
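
The sinusoidal positional encoding from "Attention Is All You Need" is one scheme that meets these criteria. Below is a minimal sketch (assuming NumPy and an even d_model): values stay in [-1, 1], each position gets a distinct pattern, encodings a fixed offset apart are related by a linear transform, and the formula extends to positions not seen during training.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # (1, d_model/2)
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # per-dimension frequencies
    angles = positions * angle_rates                         # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# Example: values are bounded in [-1, 1] regardless of sequence length.
pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
assert pe.min() >= -1.0 and pe.max() <= 1.0
```

Because each dimension is a sine/cosine pair at a fixed frequency, the encoding of position pos + k can be written as a rotation (a linear function) of the encoding of pos, which is what makes fixed time deltas easy to identify.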