Main Idea: Modern LLMs face communication bottlenecks on current hardware, not computational limits. Compressing the key-value cache using low-rank mat