We are speeding up system reliability investigations with a new AI-assisted root cause analysis system that combines heuristic-based retrieval with large language model (LLM)-based ranking. In testing, the system pinpointed root causes with 42% accuracy for investigations related to our web monorepo. Investigating issues in a monolithic repository is challenging because many code changes are candidates, so narrowing the search quickly matters. The heuristic retriever first cuts the search space down significantly, and the LLM ranker then orders the remaining code changes by how likely each one is to be the root cause. Fine-tuning a Llama 2 (7B) model on historical data was crucial to achieving this accuracy. AI brings substantial benefits, but explainability and precision are essential to avoid misleading results. We envision expanding these capabilities to execute workflows autonomously and to prevent incidents proactively.
https://engineering.fb.com/2024/06/24/data-infrastructure/leveraging-ai-for-efficient-incident-response/
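
The retrieve-then-rank flow summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the system's actual implementation: the `CodeChange` type, the directory-prefix heuristic in `heuristic_retrieve`, the `limit` and `top_k` defaults, and the `keyword_overlap_score` function, which is only a stand-in for where a fine-tuned LLM would score each candidate.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class CodeChange:
    """Hypothetical record for one code change in the repository."""
    change_id: str
    touched_paths: Tuple[str, ...]
    summary: str


def heuristic_retrieve(
    changes: Sequence[CodeChange],
    affected_dirs: Sequence[str],
    limit: int = 200,
) -> List[CodeChange]:
    """Stage 1 (assumed heuristic): keep changes that touch an affected directory."""
    prefixes = [d.rstrip("/") + "/" for d in affected_dirs]

    def overlap(change: CodeChange) -> int:
        # Count touched files that live under any affected directory.
        return sum(
            1 for path in change.touched_paths
            if any(path.startswith(p) for p in prefixes)
        )

    scored = [(overlap(c), c) for c in changes]
    # Stable sort: changes with equal overlap keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored if score > 0][:limit]


def rank_candidates(
    candidates: Sequence[CodeChange],
    incident_text: str,
    score_fn: Callable[[str, CodeChange], float],
    top_k: int = 5,
) -> List[CodeChange]:
    """Stage 2: order candidates by a relevance score and keep the top_k."""
    ranked = sorted(
        candidates, key=lambda c: score_fn(incident_text, c), reverse=True
    )
    return ranked[:top_k]


def keyword_overlap_score(incident_text: str, change: CodeChange) -> float:
    """Placeholder scorer, NOT an LLM: share of incident words in the summary."""
    incident_words = set(incident_text.lower().split())
    summary_words = set(change.summary.lower().split())
    if not incident_words:
        return 0.0
    return len(incident_words & summary_words) / len(incident_words)


changes = [
    CodeChange("D1", ("www/checkout/cart.php",), "refactor cart total calculation"),
    CodeChange("D2", ("www/search/index.php",), "tune search ranking weights"),
    CodeChange("D3", ("www/checkout/payment.php",), "change payment timeout handling"),
]
candidates = heuristic_retrieve(changes, affected_dirs=["www/checkout"])
top = rank_candidates(
    candidates, "checkout payment timeout errors spike", keyword_overlap_score, top_k=2
)
print([c.change_id for c in top])  # prints ['D3', 'D1']
```

Passing the scorer in as `score_fn` keeps the two stages independent, so the placeholder could be swapped for a call to a model without changing the retrieval step.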