Generative AI–Driven Observability for Automated Root Cause Analysis in Modern IT Systems: Architecture and Vision
Abstract
Contemporary IT environments are increasingly complex, driven by distributed microservices, ephemeral infrastructure, and exponential growth in telemetry volume. Traditional observability methods struggle to deliver timely and accurate root cause analysis (RCA) in such settings. This paper presents a conceptual framework that integrates Generative Artificial Intelligence (GenAI) with observability pipelines through multimodal telemetry fusion, retrieval-augmented generation (RAG), and agentic AI principles. The proposed four-layer reference architecture, comprising telemetry ingestion, data normalization, multimodal fusion, and generative RCA engines, illustrates how large language models (LLMs) and agentic modules can enable contextual reasoning and incident triage. An illustrative proof-of-concept simulation demonstrates feasibility; however, the primary contribution of this work lies in its architecture and research vision rather than in definitive empirical validation. Within this simulation, benchmark comparisons against rule-based, machine learning (ML), and commercial AIOps solutions show improved RCA accuracy (89.7%), reduced MTTR (26.4 minutes), and fewer false positives, suggesting potential performance advantages. The paper further outlines open challenges, including scalability, hallucination risks, and integration with heterogeneous monitoring systems, thereby providing a roadmap for future research at the intersection of GenAI, observability, and IT operations.
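The abstract names the four layers but does not specify their interfaces, so the following Python sketch is only an illustration of how data might flow through them, not the authors' implementation. Everything in it is an assumption: the record schema (`source`, `service`, `ts`, `text`), the keyword-count heuristic standing in for anomaly scoring, token-overlap retrieval standing in for embedding-based RAG, and the optional `generate` callable standing in for a real LLM client. All function names are hypothetical, and the agentic modules are not modeled.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical keyword list used as a crude stand-in for anomaly scoring.
ANOMALY_TERMS = {"error", "timeout", "exception", "5xx", "latency", "oom"}


@dataclass(frozen=True)
class Event:
    source: str   # modality: "log", "metric", or "trace"
    service: str  # emitting service
    ts: float     # timestamp (assumed seconds)
    text: str     # log message, metric summary, or span summary


def tokens(s):
    """Lower-cased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9_]+", s.lower()))


def ingest(raw):
    """Layer 1 (ingestion): wrap raw records (assumed dict schema) as Events."""
    return [Event(r["source"], r["service"], float(r["ts"]), str(r["text"])) for r in raw]


def normalize(events):
    """Layer 2 (normalization): canonicalize names and case, then order by time."""
    cleaned = [Event(e.source.lower(), e.service.strip().lower(), e.ts, e.text.lower())
               for e in events]
    return sorted(cleaned, key=lambda e: e.ts)


def fuse(events):
    """Layer 3 (multimodal fusion): merge logs, metrics, and traces per service."""
    ctx = defaultdict(list)
    for e in events:
        ctx[e.service].append(f"[{e.source}] {e.text}")
    return dict(ctx)


def retrieve(query, knowledge_base, k=1):
    """RAG retrieval step: rank runbook entries by token overlap (embedding stand-in)."""
    q = tokens(query)
    return sorted(knowledge_base, key=lambda doc: len(q & tokens(doc)), reverse=True)[:k]


def rca(fused, knowledge_base, generate=None):
    """Layer 4 (generative RCA): pick a suspect, retrieve context, build an LLM prompt.

    `generate` is a hypothetical LLM callable (prompt -> str). Without it, a
    templated hypothesis is returned so the sketch runs offline.
    """
    def score(lines):
        return sum(len(ANOMALY_TERMS & tokens(line)) for line in lines)

    suspect = max(fused, key=lambda svc: score(fused[svc]))
    evidence = "\n".join(fused[suspect])
    runbook = retrieve(evidence, knowledge_base)[0]
    prompt = (f"Telemetry for suspected service '{suspect}':\n{evidence}\n"
              f"Relevant runbook:\n{runbook}\nExplain the most likely root cause.")
    if generate is not None:
        return suspect, generate(prompt)
    return suspect, f"Likely root cause in {suspect}; see runbook: {runbook}"


# Illustrative telemetry and runbook entries (invented for this sketch).
raw = [
    {"source": "metric", "service": "Checkout ", "ts": 3, "text": "p99 latency 2400ms"},
    {"source": "log", "service": "payments", "ts": 1, "text": "DB connection timeout, error"},
    {"source": "trace", "service": "payments", "ts": 2, "text": "span db.query exception"},
    {"source": "log", "service": "checkout", "ts": 4, "text": "upstream payments 5xx"},
]
kb = [
    "Checkout latency: scale checkout pods.",
    "Payments DB connection timeout: check connection pool exhaustion.",
]
suspect, hypothesis = rca(fuse(normalize(ingest(raw))), kb)
print(suspect)  # payments
print(hypothesis)
```

In this toy run the checkout latency is a downstream symptom, and the sketch attributes it to `payments` because that service carries more anomaly signals across its fused log and trace evidence. Replacing `retrieve` with vector search and passing a real model client as `generate` would be the natural next step toward the RAG-based engine the abstract describes.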