Post-Mortem Intelligence: Using Large Language Models to Build Proactive Reliability Knowledge Graphs from Incident Documentation
DOI: https://doi.org/10.63282/3050-9262.IJAIDSML-V6I3P122
Keywords: Large Language Models, Knowledge Graphs, Amazon Neptune, Post-Mortem Analysis, Site Reliability Engineering, Incident Management, Semantic Search, Claude API, AIOps, Root Cause Analysis, Financial Services, SOC 2
Abstract
Production incident post-mortem documentation represents one of the most underutilized assets in site reliability engineering. Organizations that conduct thorough post-mortem analyses accumulate structured knowledge about failure patterns, root causes, contributing factors, and remediation actions, yet this knowledge is stored as unstructured text that is rarely accessed after the specific incident it describes. This paper presents a post-mortem intelligence system that applies Large Language Models to extract structured knowledge from incident documentation and organize it into a searchable reliability knowledge graph, enabling on-call engineers to retrieve historically analogous incidents and their resolution pathways in real time during active investigations. The system was implemented and validated against a corpus of over 300 post-mortem documents accumulated across six credit union banking applications. LLM-based entity extraction using the Claude API identifies services, failure modes, contributing factors, timeline events, and remediation actions from unstructured post-mortem text. Extracted entities and relationships are stored in Amazon Neptune as a directed graph, with semantic similarity search enabling retrieval of analogous historical incidents based on current alert context. Analysis of the accumulated knowledge graph identified 14 systemic anti-patterns recurring across the incident corpus, and proactive infrastructure improvements derived from these patterns are estimated to prevent approximately 6 high-severity incidents annually. This paper presents the extraction methodology, graph schema design, retrieval architecture, and the organizational implications of treating post-mortem documentation as a continuously compounding engineering asset.
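The pipeline summarized above (extract entities from post-mortem text, store them as a directed graph, retrieve analogous incidents from alert context) can be sketched in Python. This is an illustrative sketch only, not the paper's implementation: the Claude extraction step and the Neptune backend are replaced by an in-memory graph, the embedding-based semantic similarity is replaced by Jaccard token overlap, and all class, field, and edge-label names (`Incident`, `ReliabilityGraph`, `AFFECTED`, `EXHIBITED`, `RESOLVED_BY`) are assumptions loosely modeled on the entity types named in the abstract.

```python
from dataclasses import dataclass

# Hypothetical structured record an LLM extraction pass might emit
# for one post-mortem document (field names are assumptions).
@dataclass
class Incident:
    incident_id: str
    summary: str
    services: list
    failure_mode: str
    remediations: list

class ReliabilityGraph:
    """In-memory stand-in for the Neptune-backed knowledge graph."""

    def __init__(self):
        self.incidents = {}
        self.edges = []  # (source, relation, target) triples

    def add_incident(self, inc: Incident):
        # Materialize the incident's relationships as directed edges.
        self.incidents[inc.incident_id] = inc
        for svc in inc.services:
            self.edges.append((inc.incident_id, "AFFECTED", svc))
        self.edges.append((inc.incident_id, "EXHIBITED", inc.failure_mode))
        for action in inc.remediations:
            self.edges.append((inc.incident_id, "RESOLVED_BY", action))

    @staticmethod
    def _similarity(a: str, b: str) -> float:
        # Jaccard token overlap: a crude stand-in for the embedding
        # similarity the real retrieval architecture would use.
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def analogous(self, alert_context: str, k: int = 3):
        # Rank historical incidents by similarity to the live alert.
        ranked = sorted(
            self.incidents.values(),
            key=lambda inc: self._similarity(alert_context, inc.summary),
            reverse=True,
        )
        return ranked[:k]

# Example: two extracted incidents, one retrieval from alert context.
g = ReliabilityGraph()
g.add_incident(Incident("INC-101",
                        "payment service connection pool exhaustion under load",
                        ["payments-api"], "connection-pool-exhaustion",
                        ["raise pool size"]))
g.add_incident(Incident("INC-202",
                        "batch job overran nightly window",
                        ["batch-runner"], "schedule-overrun", ["split job"]))
best = g.analogous("alert: connection pool exhaustion on payment service")[0]
```

In the deployed system the same shape would be expressed as Neptune vertices and edges (e.g. via Gremlin traversals) with the ranking performed over stored embeddings rather than token overlap.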
References
[1] B. Beyer, C. Jones, J. Petoff, and N. R. Murphy, Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016.
[2] T. Blazytko et al., "AURORA: Statistical analysis of predicates for root cause analysis," in Proc. USENIX Security Symposium, 2020, pp. 1–18.
[3] M. Attariyan, M. Chow, and J. Flinn, "X-ray: Automating root-cause diagnosis of performance anomalies in production software," in Proc. USENIX OSDI, 2012, pp. 307–320.
[4] T. Ahmed et al., "Extracting key information from unstructured cloud incident reports using LLMs," arXiv:2603.16818, 2024.
[5] Y. Liu et al., "Log2Graph: From logs to knowledge — LLM powered dynamic knowledge graphs," The SAI Organization, 2024.
[6] Anthropic, "Claude API Documentation," Anthropic, 2024. [Online]. Available: https://docs.anthropic.com/claude/reference/
[7] Amazon Web Services, "Amazon Neptune User Guide," AWS Documentation, 2024. [Online]. Available: https://docs.aws.amazon.com/neptune/latest/userguide/
[8] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. NAACL, 2019, pp. 4171–4186.
[9] T. Brown et al., "Language models are few-shot learners," in Proc. NeurIPS, 2020, pp. 1877–1901.
[10] S. Minaee et al., "Large language models: A survey," arXiv:2402.06196, 2024.
[11] Y. Cheng et al., "A survey of AIOps in the era of large language models," arXiv, 2024.
[12] Dynatrace, "Davis AI: Causal AI for IT operations," Dynatrace Documentation, 2024. [Online]. Available: https://www.dynatrace.com/support/help/how-dynatrace-works/davis-ai
[13] PagerDuty, "Incident management documentation," PagerDuty, 2024. [Online]. Available: https://support.pagerduty.com/docs/
[14] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets, 3rd ed. Cambridge University Press, 2020.