Main Idea

  • Modern LLMs are bottlenecked by communication (memory bandwidth) on current hardware, not by compute
  • Compressing the key-value (KV) cache with low-rank matrices reduces this memory traffic
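The low-rank idea in the second bullet can be sketched numerically. This is a minimal illustration, not any particular model's method: it factors a per-head key cache with a truncated SVD, stores only the low-rank factors, and reconstructs keys on demand. All sizes (`seq_len`, `d_head`, `rank`) are made-up example values.

```python
import numpy as np

# Hypothetical sizes: sequence length, head dimension, and chosen low rank.
seq_len, d_head, rank = 512, 64, 16

rng = np.random.default_rng(0)
K = rng.standard_normal((seq_len, d_head))  # stand-in per-head key cache

# Truncated SVD gives the best rank-r approximation: K ~= latent @ proj.
U, S, Vt = np.linalg.svd(K, full_matrices=False)
latent = U[:, :rank] * S[:rank]  # (seq_len, rank): stored per token
proj = Vt[:rank]                 # (rank, d_head): one shared up-projection

K_approx = latent @ proj         # reconstructed keys when attention needs them

# Memory stored: seq_len*rank + rank*d_head floats vs. seq_len*d_head.
fraction = (latent.size + proj.size) / K.size
print(f"stored fraction of original cache: {fraction:.2f}")  # -> 0.28
```

With these example sizes the cache shrinks to roughly 28% of its original footprint, at the cost of an approximation error that grows as the rank is reduced.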