Data Quality Assessment and Cleaning Framework for Healthcare Databases Using Python

Authors

  • Chitiz Tayal, Senior Director, Data and AI

DOI:

https://doi.org/10.63282/3050-9262.IJAIDSML-V3I4P112

Keywords:

Healthcare Data Quality, Data Cleaning Framework, Data Quality Assessment, Python Programming, Data Preprocessing, Data Consistency, Data Completeness, Data Validity, Electronic Health Records (EHR), Healthcare Analytics, Duplicate Detection, Data Profiling, Rule-Based Validation, Automated Data Cleaning, Data Integrity, Data Quality Dimensions, Pandas, NumPy, Seaborn, Real-World Healthcare Data

Abstract

Data-driven healthcare systems depend on high-quality datasets to support clinical research and decision-making. However, healthcare databases often contain data quality issues, including inconsistencies, duplicates, and logical errors, that prevent analysts from conducting valid analyses. This paper develops a Data Quality Assessment and Cleaning Framework for Healthcare Databases (DQACFHD) that integrates data quality management theory with practical procedures for implementing it. We retrieved a public healthcare dataset from Kaggle and used it to implement the DQACFHD tool. The methodology comprised data profiling, rule-based validation, duplicate identification, outlier removal, and logical inconsistency checks, organized around four quality dimensions: completeness, uniqueness, validity, and accuracy. We used Python libraries such as Pandas, NumPy, and Matplotlib to automate the data cleaning process and a machine learning model, and to visualize data quality after processing. By applying different strategies in the automated cleaning process, the DQACFHD tool removed all duplicates and inconsistencies in the data, as evidenced by a 100% score on each quality dimension. The post-processing analysis further showed a realistic data distribution across demographic and financial dimensions. These results show that Python-based automation makes it practical to achieve high data quality, efficiency, and reproducibility in healthcare settings, and that the tool may improve data across different health data applications. It can also support future work on data governance and analytical frameworks.
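The pipeline summarized in the abstract can be illustrated with a minimal Pandas sketch. This is not the paper's code: the column names, the sample records, the validity rules (age between 0 and 120, gender in F/M, non-negative billing), the cell-level definition of completeness, and the 1.5 × IQR outlier fence are all assumptions chosen for illustration. The helper names `quality_scores` and `clean` are hypothetical.

```python
import pandas as pd

# Hypothetical records; every column name and value is an illustrative assumption.
COLUMNS = ["patient_id", "age", "gender", "admission_date",
           "discharge_date", "billing_amount"]
ROWS = [
    (1, 34, "F", "2021-01-05", "2021-01-09", 1200.0),
    (2, 51, "M", "2021-02-10", "2021-02-14", 860.5),
    (2, 51, "M", "2021-02-10", "2021-02-14", 860.5),            # exact duplicate
    (3, 210, "F", "2021-03-01", "2021-03-04", 430.0),           # impossible age
    (4, 47, "F", "2021-04-12", "2021-04-02", 950.0),            # discharged before admitted
    (5, float("nan"), "M", "2021-05-20", "2021-05-22", 700.0),  # missing age
    (6, 29, "M", "2021-06-01", "2021-06-03", 990.0),
    (7, 63, "F", "2021-07-11", "2021-07-15", 1100.0),
    (8, 45, "M", "2021-08-02", "2021-08-06", 250000.0),         # extreme bill
]
raw = pd.DataFrame(ROWS, columns=COLUMNS)


def rule_mask(df: pd.DataFrame) -> pd.Series:
    """Rule-based validation (assumed rules); a missing age counts as invalid."""
    return (
        df["age"].between(0, 120)
        & df["gender"].isin(["F", "M"])
        & (df["billing_amount"] >= 0)
    )


def date_mask(df: pd.DataFrame) -> pd.Series:
    """Logical consistency check: discharge must not precede admission."""
    admitted = pd.to_datetime(df["admission_date"], errors="coerce")
    discharged = pd.to_datetime(df["discharge_date"], errors="coerce")
    return discharged >= admitted


def quality_scores(df: pd.DataFrame) -> dict:
    """Percentage score for each of the four quality dimensions."""
    return {
        "completeness": 100 * df.notna().to_numpy().mean(),  # share of non-missing cells
        "uniqueness": 100 * (~df.duplicated()).mean(),       # share of non-duplicate rows
        "validity": 100 * rule_mask(df).mean(),              # share of rows passing the rules
        "accuracy": 100 * date_mask(df).mean(),              # share of rows with consistent dates
    }


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates, incomplete rows, rule violations, bad dates, then outliers."""
    out = df.drop_duplicates().dropna()
    out = out[rule_mask(out)]
    out = out[date_mask(out)]
    # Outlier removal with a 1.5 * IQR fence on billing (an assumed choice).
    q1, q3 = out["billing_amount"].quantile([0.25, 0.75])
    fence = 1.5 * (q3 - q1)
    out = out[out["billing_amount"].between(q1 - fence, q3 + fence)]
    return out.reset_index(drop=True)


cleaned = clean(raw)
print("before:", quality_scores(raw))
print("after: ", quality_scores(cleaned))
```

Comparing the scores before and after cleaning mirrors the kind of per-dimension reporting the abstract describes. Note that this sketch reaches full scores by discarding every offending row; a real pipeline might impute or correct values instead, and the abstract does not say which strategy the tool applies to each issue.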

Published

2022-12-30

Section

Articles

How to Cite

1.
Tayal C. Data Quality Assessment and Cleaning Framework for Healthcare Databases Using Python. IJAIDSML [Internet]. 2022 Dec. 30 [cited 2026 Apr. 24];3(4):107-12. Available from: https://ijaidsml.org/index.php/ijaidsml/article/view/308