Autonomous AI Agents for Site Reliability Engineering
DOI:
https://doi.org/10.63282/3050-9262.IJAIDSML-V7I1P162
Keywords:
Site Reliability Engineering, AIOps, Autonomous Agents, Cloud Reliability, Root Cause Analysis, Self-Healing Systems, Distributed Systems
Abstract
Modern cloud-native infrastructures are highly distributed systems composed of microservices, container orchestration platforms, and dynamically scaling environments. Managing reliability in such systems is difficult for Site Reliability Engineering (SRE) teams because of the massive volume of telemetry data and the complexity of incident management. This paper proposes a multi-agent artificial intelligence framework for autonomous SRE operations that detects anomalies, diagnoses root causes, executes remediation actions, and continuously improves operational performance. The architecture integrates machine-learning-based anomaly detection, graph-based root cause analysis, large language model (LLM) reasoning for interpreting operational knowledge, and reinforcement learning policies for automated remediation. The framework processes telemetry signals (logs, metrics, and traces) to build a dynamic system dependency graph, which intelligent agents use to identify failure propagation paths. Experimental results show an 83% reduction in mean time to detect (MTTD) and an 80% reduction in mean time to resolve (MTTR) compared with traditional SRE workflows. The proposed framework is a step toward autonomous, self-healing cloud infrastructure that lets organizations maintain reliability in increasingly complex distributed systems.
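The abstract does not specify how the dependency graph is traversed, so the following is only a minimal sketch of one plausible form of graph-based root cause analysis, not the authors' method. It assumes the graph is a plain mapping from each service to the services it calls, and that an upstream anomaly detector has already produced a set of anomalous services. The function names `root_cause_candidates` and `propagation_paths`, and the service names in the example, are hypothetical. The heuristic used here is a simple one: an anomalous service whose own dependencies are all healthy is treated as a likely origin of the failure.

```python
from typing import Dict, List, Set


def root_cause_candidates(deps: Dict[str, List[str]], anomalous: Set[str]) -> List[str]:
    """Return anomalous services that have no anomalous dependency.

    Assumption: deps maps a service to the services it calls, so a failure
    propagates from a dependency back to its callers.
    """
    return sorted(
        svc for svc in anomalous
        if not any(d in anomalous for d in deps.get(svc, []))
    )


def propagation_paths(
    deps: Dict[str, List[str]], anomalous: Set[str], start: str
) -> List[List[str]]:
    """Follow anomalous dependencies from `start` and return each full path."""
    paths: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        # Only follow edges to anomalous services; skip nodes already on the
        # path so that cyclic dependencies cannot cause infinite recursion.
        nxt = [d for d in deps.get(node, []) if d in anomalous and d not in path]
        if not nxt:
            paths.append(path)
            return
        for d in nxt:
            walk(d, path + [d])

    walk(start, [start])
    return paths


# Hypothetical service graph and detector output, for illustration only.
deps = {
    "frontend": ["checkout", "search"],
    "checkout": ["payments", "db"],
    "payments": ["db"],
    "search": [],
    "db": [],
}
anomalous = {"frontend", "checkout", "db"}

candidates = root_cause_candidates(deps, anomalous)
paths = propagation_paths(deps, anomalous, "frontend")
print(candidates)  # ['db']
print(paths)       # [['frontend', 'checkout', 'db']]
```

In this toy graph the alert surfaces at `frontend`, but the only anomalous service with healthy dependencies is `db`, so it is the single candidate. A production system would presumably weight edges by observed latency or error correlation rather than rely on a binary anomaly flag.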










