Cognitive Load Reduction in On-Call Rotations via Predictive Alert Severity Scoring Using Machine Learning in Financial Cloud Operations
DOI:
https://doi.org/10.63282/3050-9262.IJAIDSML-V6I1P131
Keywords:
Alert Fatigue, Machine Learning, Gradient Boosting, PagerDuty, Dynatrace, On-Call Engineering, Cognitive Load, Site Reliability Engineering, AIOps, Severity Scoring, Financial Services, SOC 2, Engineer Wellbeing
Abstract
Alert fatigue in cloud-native site reliability engineering represents a systemic threat to both operational reliability and engineer wellbeing. In high-availability financial services environments, on-call engineers receive hundreds of automated alerts per month from integrated observability platforms, of which a substantial fraction are non-actionable noise. This paper presents an ML-driven alert severity scoring framework deployed in production across six credit union banking applications, designed to reduce the cognitive load of on-call engineers by intelligently triaging PagerDuty alert streams before human engagement. The framework trains a gradient-boosted classifier on over 50,000 historical alert events sourced from Dynatrace Davis AI and PagerDuty, engineering 17 features capturing service dependency depth, deployment recency, historical false positive rates, time-of-day patterns, and cross-source metric correlations. Evaluated through a 30-day shadow validation protocol against actual engineer triage decisions, the classifier achieved 89.3% severity classification accuracy with a 2.1% P1 miss rate. Live deployment produced a 34% reduction in actionable alert volume, a 41% reduction in after-hours pages, and a 28% improvement in P1 mean time to acknowledge. A structured engineer wellbeing measurement program documented improvement in post-shift satisfaction scores from 5.8 to 7.4 out of 10 following system activation. This paper presents the feature engineering methodology, the shadow validation protocol, the deployment architecture within a SOC 2 regulated banking environment, and the case for treating engineer wellbeing as a first-class reliability metric.
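The two shadow-validation figures reported above (severity classification accuracy and P1 miss rate) can be illustrated with a minimal sketch. It is not the authors' implementation. The record layout, the `ShadowRecord` and `shadow_metrics` names, and the severity labels are hypothetical. The sketch also assumes that "P1 miss rate" means the share of alerts the engineer triaged as P1 that the classifier scored lower, a definition the abstract does not state.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ShadowRecord:
    """One alert seen in shadow mode (hypothetical layout)."""
    alert_id: str
    predicted: str  # severity the classifier assigned, e.g. "P1".."P4"
    triaged: str    # severity the on-call engineer actually assigned


def shadow_metrics(records: Iterable[ShadowRecord]) -> Dict[str, float]:
    """Compare shadow-mode predictions with engineer triage decisions.

    Assumed definitions (not confirmed by the source):
      accuracy     = exact severity matches / all alerts
      p1_miss_rate = engineer-P1 alerts predicted as non-P1 / engineer-P1 alerts
    """
    rows = list(records)
    if not rows:
        raise ValueError("no shadow records to evaluate")

    correct = sum(r.predicted == r.triaged for r in rows)
    true_p1 = [r for r in rows if r.triaged == "P1"]
    missed_p1 = sum(r.predicted != "P1" for r in true_p1)

    return {
        "accuracy": correct / len(rows),
        # With no engineer-labelled P1 alerts the rate is undefined; report 0.0.
        "p1_miss_rate": missed_p1 / len(true_p1) if true_p1 else 0.0,
    }


if __name__ == "__main__":
    sample = [
        ShadowRecord("a1", predicted="P1", triaged="P1"),
        ShadowRecord("a2", predicted="P2", triaged="P1"),  # a missed P1
        ShadowRecord("a3", predicted="P3", triaged="P3"),
        ShadowRecord("a4", predicted="P2", triaged="P2"),
    ]
    print(shadow_metrics(sample))
```

Keeping the P1 miss rate separate from overall accuracy reflects the asymmetry in cost: an under-scored P1 delays response to a critical incident, whereas an over-scored low-severity alert only adds an unnecessary page.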