Autonomous AI Agents for Site Reliability Engineering

Authors

  • Selvakumar Kalyanasundaram, Independent Researcher, Texas, USA

DOI:

https://doi.org/10.63282/3050-9262.IJAIDSML-V7I1P162

Keywords:

Site Reliability Engineering, AIOps, Autonomous Agents, Cloud Reliability, Root Cause Analysis, Self-Healing Systems, Distributed Systems

Abstract

Modern cloud-native infrastructures operate across highly distributed systems composed of microservices, container orchestration platforms, and dynamic scaling environments. Managing reliability in such systems presents significant challenges for Site Reliability Engineering (SRE) teams because of the massive volume of telemetry data and the complexity of incident management. This paper proposes a multi-agent artificial intelligence framework for autonomous SRE operations that detects anomalies, diagnoses root causes, executes remediation actions, and continuously improves operational performance. The architecture integrates machine-learning-based anomaly detection, graph-based root cause analysis, large language model (LLM) reasoning for interpreting operational knowledge, and reinforcement learning policies for automated remediation. The framework processes telemetry signals (logs, metrics, and traces) to build a dynamic system dependency graph, which intelligent agents use to identify failure propagation paths. Experimental results show an 83% reduction in mean time to detect (MTTD) and an 80% reduction in mean time to resolve (MTTR) compared with traditional SRE workflows. The proposed framework represents a step toward autonomous, self-healing cloud infrastructure, enabling organizations to maintain reliability in increasingly complex distributed systems.
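
The idea of combining anomaly detection with a dependency graph to trace failure propagation can be illustrated with a minimal sketch. This is not the paper's implementation: a simple z-score test stands in for the machine learning detector, and the root-cause heuristic (an anomalous service whose own dependencies are all healthy) is an assumption chosen for brevity. The service names, latency values, and function names are hypothetical.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it lies more than `threshold` sample standard
    deviations from the mean of `history` (a stand-in for an ML detector)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

def root_cause_candidates(depends_on, anomalous):
    """Return anomalous services whose direct dependencies are all healthy.
    These sit at the deepest point of the failure propagation path."""
    return sorted(
        svc for svc in anomalous
        if not any(dep in anomalous for dep in depends_on.get(svc, []))
    )

# Hypothetical dependency graph: each service maps to the services it calls.
depends_on = {
    "frontend": ["checkout"],
    "checkout": ["payments", "inventory"],
    "payments": ["db"],
    "inventory": [],
    "db": [],
}

# Hypothetical latency telemetry in milliseconds.
baseline = [100, 102, 98, 101, 99]
latest_latency = {
    "frontend": 140,
    "checkout": 150,
    "payments": 160,
    "inventory": 101,
    "db": 100,
}

anomalous = {
    svc for svc, value in latest_latency.items()
    if is_anomalous(baseline, value)
}
candidates = root_cause_candidates(depends_on, anomalous)
print(candidates)
```

In this toy topology the frontend, checkout, and payments services all show elevated latency, but only payments has no anomalous dependency beneath it, so it is reported as the candidate root cause. A production system would need to handle cycles, missing telemetry, and multiple simultaneous faults, which this sketch does not attempt.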

Published

2026-03-24

Section

Articles

How to Cite

1. Kalyanasundaram S. Autonomous AI Agents for Site Reliability Engineering. IJAIDSML [Internet]. 2026 Mar. 24 [cited 2026 Apr. 24];7(1):390-5. Available from: https://ijaidsml.org/index.php/ijaidsml/article/view/518