Autonomous ETL Pipelines: Using Generative AI to Design, Validate, and Deploy Dataflows
DOI:
https://doi.org/10.63282/3050-9262.IJAIDSML-V6I4P128
Keywords:
ETL, Generative AI, Large Language Models, Data Engineering, Autonomous Systems, Data Pipelines, AI Copilots, Infrastructure as Code, Self-Healing Systems, Data Validation
Abstract
The Extract, Transform, and Load (ETL) process, a cornerstone of data engineering, has traditionally been laborious, time-consuming, and error-prone, making it a substantial bottleneck in today's data-driven enterprises. This study investigates the paradigm shift toward autonomous ETL pipelines powered by Generative AI and Large Language Models (LLMs). It presents a comprehensive architectural framework in which prompt-driven AI agents interpret natural-language requirements to generate pipeline logic, construct sophisticated schema mappings, and produce infrastructure-as-code (IaC) templates for deployment. The study examines the integration of AI copilots into modern platforms such as Microsoft Fabric and Databricks, which enable not only initial code generation but also continuous validation of transformation logic, dynamic optimization, and self-healing of failed jobs via reinforcement learning feedback loops. The result is a scalable, adaptive dataflow system that requires minimal human intervention, dramatically increasing development speed, operational consistency, cost-efficiency, and overall observability across complex, hybrid data ecosystems. The paper concludes with a discussion of the remaining obstacles, including hallucination, security, and cost, as well as future research directions toward fully autonomous, trustworthy data management.
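As an illustration of the validation step described above, the following minimal Python sketch checks a generated schema mapping against source and target schemas and, on failure, returns the errors to the generator for another attempt. It is a sketch under stated assumptions, not the framework's implementation: the names `validate_mapping`, `generate_until_valid`, and `stub_generator` are hypothetical, the generator is a stand-in for an LLM call, and the bounded retry loop is a simple feedback mechanism rather than the reinforcement learning loop the study discusses.

```python
from typing import Callable, Dict, List

# Hypothetical representations, chosen for illustration only:
# a schema maps column name -> type name; a mapping maps target column -> source column.
Schema = Dict[str, str]
Mapping = Dict[str, str]


def validate_mapping(mapping: Mapping, source: Schema, target: Schema) -> List[str]:
    """Return a list of problems; an empty list means the mapping passes."""
    errors: List[str] = []
    for tgt_col, tgt_type in target.items():
        src_col = mapping.get(tgt_col)
        if src_col is None:
            errors.append(f"target column '{tgt_col}' is unmapped")
        elif src_col not in source:
            errors.append(f"'{tgt_col}' maps to unknown source column '{src_col}'")
        elif source[src_col] != tgt_type:
            errors.append(f"type mismatch for '{tgt_col}': {source[src_col]} -> {tgt_type}")
    for tgt_col in mapping:
        if tgt_col not in target:
            errors.append(f"mapping refers to unknown target column '{tgt_col}'")
    return errors


def generate_until_valid(
    propose: Callable[[List[str]], Mapping],
    source: Schema,
    target: Schema,
    max_attempts: int = 3,
) -> Mapping:
    """Ask the generator for a mapping, feeding validation errors back until it passes."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        mapping = propose(feedback)
        feedback = validate_mapping(mapping, source, target)
        if not feedback:
            return mapping
    raise RuntimeError(f"no valid mapping after {max_attempts} attempts: {feedback}")


def stub_generator(feedback: List[str]) -> Mapping:
    # Stand-in for an LLM call (assumption): the first proposal contains a wrong
    # column name, and the second corrects it once feedback is supplied.
    if not feedback:
        return {"customer_id": "id", "signup_date": "created"}
    return {"customer_id": "id", "signup_date": "created_at"}


source = {"id": "int", "created_at": "date"}
target = {"customer_id": "int", "signup_date": "date"}
mapping = generate_until_valid(stub_generator, source, target)
```

Keeping the validator independent of the generator means a mapping is accepted only after a deterministic check, which is one way to contain the hallucination risk noted above; the attempt limit keeps the cost of repeated generation bounded.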










