At Augment, we focus on giving developers full codebase context for effective AI assistance. Context is crucial for coding: any change is shaped by the conventions, dependencies, and surrounding structure of the codebase. We have optimized our LLM inference stack to balance context processing against decoding speed, so we can deliver low-latency responses without sacrificing the quality that comes from rich context. Along the way we adopted techniques such as CUDA Graphs, FP8, FlashAttention-3, and efficient communication for multi-GPU execution. Our goal is to keep evolving and to provide the best AI-driven development experience for our users.
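To make the CUDA Graphs idea concrete, here is a minimal PyTorch sketch (not Augment's actual serving code): it captures one fixed-shape decode step into a graph and replays it per token. The tiny linear layer is a hypothetical stand-in for a real model, and the shapes and loop counts are arbitrary.

```python
import torch

# Hypothetical stand-in for one decode step of an LLM: any fixed-shape
# sequence of CUDA kernels can be captured the same way.
model = torch.nn.Linear(4096, 4096).cuda().half()
static_input = torch.zeros(1, 4096, device="cuda", dtype=torch.half)

# Warm up on a side stream so capture sees fully initialized kernels.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the decode step into a CUDA graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# Decode loop: copy each new token's activations into the static input
# buffer, then replay the captured graph instead of relaunching kernels.
for _ in range(16):
    static_input.copy_(torch.randn(1, 4096, device="cuda", dtype=torch.half))
    graph.replay()
```

Replaying a graph launches every captured kernel with a single CPU call, which removes the per-token kernel-launch overhead that tends to dominate small-batch decoding.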
https://www.augmentcode.com/blog/rethinking-llm-inference-why-developer-ai-needs-a-different-approach