Adaptive Data Quality Management for Multi-Cloud Healthcare Warehouses: FHIR-Aware Semantics and Unsupervised Thresholding

Sai Kiran Yadav Battula

doi:10.63282/3050-9262.IJAIDSML-V6I4P130

Authors

Sai Kiran Yadav Battula, Independent Researcher, Pittsburgh, Pennsylvania, United States.

Keywords:

Multi-Cloud Healthcare, Data Quality Management, FHIR, Semantic Validation, Schema Drift, Unsupervised Anomaly Detection, Isolation Forest, Autoencoder, Adaptive Thresholding, Synthetic Health Data

Abstract

The rapid proliferation of multi-cloud architectures in healthcare promises elastic scalability and regional redundancy, but it also introduces acute challenges in data consistency, latency, and governance. Traditional, centrally orchestrated, rule-based Data Quality Management (DQM) tools are ill-equipped to handle the volume, heterogeneity, and semantic complexity of distributed electronic health records (EHRs) and claims data. As schemas drift and new data sources are onboarded, static checks generate escalating false positives, incur avoidable data movement costs, and contribute to “data swamps” that compromise clinical decision-making. This paper presents an Adaptive Data Quality Management framework for multi-cloud healthcare warehouses that combines unsupervised anomaly detection with FHIR-aware semantic validation. The framework deploys lightweight quality components alongside analytic workloads to profile and score data streams, while a cloud-agnostic control layer dynamically adjusts quality thresholds using rolling statistics over anomaly scores. A FHIR-based semantic distance metric decomposes deviations into structural, vocabulary, and cardinality components, enabling graded policies rather than binary pass/fail checks.
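The two mechanisms named above, thresholds adjusted from rolling statistics over anomaly scores and a semantic distance split into structural, vocabulary, and cardinality components, can be illustrated with a minimal sketch. Nothing in it comes from the paper's implementation: the class and function names, the mean-plus-k-standard-deviations rule, the window and warm-up sizes, the component weights, and the policy cut-offs are all hypothetical choices made for illustration. Anomaly scores are assumed to arrive from some upstream detector, and each distance component is assumed to be pre-normalized to the range 0 to 1.

```python
from collections import deque
from statistics import mean, pstdev


class RollingThreshold:
    """Hypothetical adaptive rule: flag a score above mean + k * std of recent scores."""

    def __init__(self, window=50, k=3.0, warmup=10):
        self.scores = deque(maxlen=window)  # oldest scores fall out automatically
        self.k = k
        self.warmup = warmup  # minimum history before any score is flagged

    def update(self, score):
        # Judge the new score against the window before adding it,
        # so an outlier cannot raise its own threshold.
        flagged = False
        if len(self.scores) >= self.warmup:
            limit = mean(self.scores) + self.k * pstdev(self.scores)
            flagged = score > limit
        # Flagged scores still enter the window in this simple sketch.
        self.scores.append(score)
        return flagged


def semantic_distance(structural, vocabulary, cardinality, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of three deviations, each assumed to lie in [0, 1].

    The weights are illustrative assumptions, not values from the paper.
    """
    w_struct, w_vocab, w_card = weights
    return w_struct * structural + w_vocab * vocabulary + w_card * cardinality


def graded_policy(distance, warn_at=0.2, quarantine_at=0.6):
    """Map a distance to a graded action instead of a binary pass/fail."""
    if distance >= quarantine_at:
        return "quarantine"
    if distance >= warn_at:
        return "warn"
    return "accept"


if __name__ == "__main__":
    detector = RollingThreshold()
    for s in [0.10, 0.12] * 10 + [0.90, 0.11]:
        print(s, detector.update(s))
    print(graded_policy(semantic_distance(1.0, 0.0, 0.0)))
```

Because the threshold follows the recent score distribution, a gradual shift in baseline scores moves the limit with it, whereas a fixed cut-off would keep alerting. That is the general intuition behind adaptive thresholding, not a claim about how the paper tunes it.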

Using a synthetic but structurally realistic workload of approximately 500,000 patients generated by a Synthea-style engine and partitioned across AWS, Azure, and GCP, we evaluate the framework under controlled “chaos engineering” scenarios including schema drift, value-set drift, and volume anomalies. Compared with a centralized, rule-based DQM baseline, the adaptive framework reduces false-positive quality alerts by roughly 40% while increasing precision from about 0.62 to 0.74 at comparable recall. These results demonstrate that combining FHIR-aware semantics with unsupervised, adaptively thresholded quality scoring can substantially reduce noise in quality monitoring while preserving anomaly detection performance in multi-cloud healthcare analytics and decision-support systems.
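The two headline figures can be checked against each other with a short calculation. The sketch below assumes that "comparable recall" means both systems catch the same number of true anomalies, which is a simplification introduced here rather than something the abstract states. Under that assumption, precision alone determines the number of false positives per true positive, and the function names are hypothetical.

```python
def false_positives_per_true_positive(precision):
    # precision = TP / (TP + FP)  =>  FP / TP = 1 / precision - 1
    return 1.0 / precision - 1.0


def implied_fp_reduction(precision_before, precision_after):
    """Fractional drop in false positives, assuming the true-positive count is unchanged."""
    before = false_positives_per_true_positive(precision_before)
    after = false_positives_per_true_positive(precision_after)
    return 1.0 - after / before


reduction = implied_fp_reduction(0.62, 0.74)
print(f"Implied false-positive reduction: {reduction:.1%}")  # prints 42.7%
```

A rise in precision from 0.62 to 0.74 at equal recall therefore implies about 43% fewer false positives, which is in line with the reported reduction of roughly 40%.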
