AI-Enhanced ETL Framework for Improving Data Quality in Clinical Decision Support Systems
DOI:
https://doi.org/10.63282/3050-9262.IJAIDSML-V5I2P113
Keywords:
Clinical Decision Support Systems, Data Quality, Artificial Intelligence, ETL, HL7-FHIR, Machine Learning, Health Informatics, Ontology
Abstract
Clinical Decision Support Systems (CDSS) are integral to modern healthcare because they give clinicians informational support for diagnosis, treatment planning, and patient management. Their credibility, however, depends on the quality of the data drawn from Electronic Health Records (EHRs) and other heterogeneous sources. Incomplete, fragmented, and semantically inconsistent data remain major obstacles to effective decision-making. This paper proposes an AI-powered Extract, Transform, Load (ETL) framework that combines machine learning, natural language processing, and ontology-based reasoning to improve healthcare data quality automatically. The framework uses autoencoders for anomaly detection, BioBERT for entity detection, and the HL7-FHIR and SNOMED-CT ontologies for semantic harmonisation. A reinforcement-learning feedback loop further refines the transformations over time. Experimental analysis on the publicly available MIMIC-III and PhysioNet data indicates 26% higher data completeness, 17% higher consistency, and 10% higher CDSS diagnostic stability compared with traditional ETL pipelines. These results suggest that AI-based ETL can significantly improve data quality, interoperability, and the generation of clinical insights, providing a foundation for more credible and scalable CDSS designs.
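The abstract reports a gain in data completeness but does not define the metric or say whether the figure is an absolute or a relative change. As a minimal sketch, assuming completeness means the fraction of non-missing cells across a set of required fields, the comparison between a raw and a cleaned extract could be computed as below. The field names, missing-value markers, and sample records are hypothetical and are not taken from the paper or from MIMIC-III.

```python
import math
from typing import Any, Mapping, Sequence

# Assumed markers for "no value"; real EHR extracts may use others.
MISSING_STRINGS = {"", "NA", "N/A", "NULL"}


def is_missing(value: Any) -> bool:
    """Return True if a cell should be counted as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip().upper() in MISSING_STRINGS:
        return True
    return False


def completeness(records: Sequence[Mapping[str, Any]], fields: Sequence[str]) -> float:
    """Fraction of (record, field) cells that hold a usable value."""
    total = len(records) * len(fields)
    if total == 0:
        return 0.0
    filled = sum(
        1 for rec in records for name in fields if not is_missing(rec.get(name))
    )
    return filled / total


# Hypothetical required fields and toy records, for illustration only.
FIELDS = ["patient_id", "heart_rate", "diagnosis", "unit"]

raw = [
    {"patient_id": "p1", "heart_rate": 72, "diagnosis": "I10", "unit": "bpm"},
    {"patient_id": "p2", "heart_rate": None, "diagnosis": "", "unit": "bpm"},
    {"patient_id": "p3", "heart_rate": 88, "diagnosis": "NA", "unit": None},
]

# The same records after a hypothetical cleaning step filled two cells.
cleaned = [
    {"patient_id": "p1", "heart_rate": 72, "diagnosis": "I10", "unit": "bpm"},
    {"patient_id": "p2", "heart_rate": None, "diagnosis": "E11", "unit": "bpm"},
    {"patient_id": "p3", "heart_rate": 88, "diagnosis": "NA", "unit": "bpm"},
]

raw_score = completeness(raw, FIELDS)
clean_score = completeness(cleaned, FIELDS)
absolute_gain = clean_score - raw_score            # change in the fraction itself
relative_gain = absolute_gain / raw_score          # change relative to the baseline

print(f"raw={raw_score:.3f} cleaned={clean_score:.3f}")
print(f"absolute gain={absolute_gain:.3f} relative gain={relative_gain:.3f}")
```

Reporting both the absolute and the relative gain avoids ambiguity, since the same percentage can describe very different improvements depending on the baseline.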