From Alert Floods to Action: Correlated Telemetry for High-Volume Banking Systems
DOI:
https://doi.org/10.63282/3050-9262.IJAIDSML-V6I1P127
Keywords:
AIOps, Observability, Alert Correlation, Incident Response, SRE, DevOps, Regulated Financial Platforms, Telemetry Analytics
Abstract
High-volume banking systems (e.g., payments, digital channels, core banking) emit several types of observability data (metrics, logs, traces, topology events). Given the volume and variety of these signals, simple per-service thresholds frequently trigger waves of alerts. These alert floods raise the cost of on-call work, scatter evidence across multiple tools, and slow diagnosis and mitigation, which is particularly risky for critical transaction flows that span authorization, fraud scoring, ledger posting, and customer notifications. This paper proposes a correlated-telemetry approach with three steps: (1) normalize multiple signal types using OpenTelemetry-style semantic conventions; (2) correlate alerts across a service-dependency graph and over sliding time windows; and (3) produce a compact, auditable incident record with ranked candidate root causes and runbook suggestions. Because real banking telemetry is typically private and regulated, we evaluate the approach with a synthetic replay based on publicly inspectable statistical priors (rarity, class imbalance, burstiness) taken from Kaggle and UCI financial risk datasets. We map these priors into incident frequency, weak-signal emission, duplication, and noise parameters to enable reproducible experiments without exposing proprietary telemetry. Across 30 randomized replays, the proposed method yields consistent improvements: a 61.6% (±1.2) decrease in non-actionable pages, a 23.3 (±0.3) minute increase in median pre-escalation lead time, a 35.5% (±0.5) decrease in mean time to restore, a 39.1% (±2.7) decrease in high-severity incidents, and a 7.8 (±0.3) basis point increase in impact-weighted availability.
The paper also includes an ablation study that separates the individual contributions of weak-signal inclusion and evidence-based triage, and it describes a shadow-mode validation protocol that banks and organizations in other regulated industries (healthcare, telecom, energy, public sector) can use to validate the approach on de-identified, aggregated telemetry.
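The correlation and ranking steps described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `Alert` type, the `DEPENDS_ON` graph, the service names, the 300-second window, and the ranking heuristic (prefer services with no alerting dependency, then the earliest first alert) are all assumptions chosen for illustration. Normalization to OpenTelemetry-style attributes and runbook suggestion are omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str      # emitting service (hypothetical names below)
    timestamp: float  # seconds since replay start
    signal: str       # e.g. "latency_p99", "error_rate"


# Hypothetical service-dependency graph: caller -> direct callees.
DEPENDS_ON = {
    "authorization": ["fraud-scoring", "ledger"],
    "fraud-scoring": ["feature-store"],
    "ledger": [],
    "feature-store": [],
    "notifications": [],
}


def _neighbors(graph):
    """Undirected adjacency derived from the caller -> callee graph."""
    adj = defaultdict(set)
    for caller, callees in graph.items():
        for callee in callees:
            adj[caller].add(callee)
            adj[callee].add(caller)
    return adj


def build_incident(group, graph):
    """Summarize one time-ordered group of alerts as an incident record."""
    first_seen = {}
    for a in group:
        first_seen.setdefault(a.service, a.timestamp)
    alerting = set(first_seen)

    def rank_key(svc):
        # Assumed heuristic: a service whose own dependencies are quiet
        # is a likelier root cause; ties break on earliest first alert.
        has_alerting_dep = any(d in alerting for d in graph.get(svc, []))
        return (has_alerting_dep, first_seen[svc])

    return {
        "start": group[0].timestamp,
        "alert_count": len(group),
        "services": sorted(alerting),
        "candidate_root_causes": sorted(alerting, key=rank_key),
    }


def correlate(alerts, graph, window_s=300.0):
    """Merge alerts that are close in time and on the same or adjacent services."""
    adj = _neighbors(graph)
    ordered = sorted(alerts, key=lambda a: a.timestamp)
    parent = list(range(len(ordered)))  # union-find over alert indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, a in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            b = ordered[j]
            if b.timestamp - a.timestamp > window_s:
                break  # later alerts fall outside the sliding window
            if a.service == b.service or b.service in adj[a.service]:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i, a in enumerate(ordered):
        groups[find(i)].append(a)
    return [build_incident(g, graph) for g in groups.values()]
```

Under these assumptions, a burst of alerts on `feature-store`, `fraud-scoring`, and `authorization` within a few minutes collapses into one incident whose first-ranked candidate is `feature-store`, while an unrelated `notifications` alert in the same window stays a separate incident because it has no edge to the others in the graph.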










