A Generative AI Framework for Data Pipeline Optimization and Analytical Performance Enhancement
DOI:
https://doi.org/10.63282/3050-9262.IJAIDSML-V6I4P116
Keywords:
Generative AI, Data Pipeline, Optimization, Analytical Performance, Data Engineering, Automation, Model-driven Systems, Cloud Analytics
Abstract
The rapid growth of data-intensive workloads in enterprise and cloud environments has increased the demand for intelligent, adaptive methods of optimizing data pipelines. The dynamic nature of modern analytical ecosystems makes traditional rule-based and fixed optimization strategies inadequate, leading to bottlenecks, latency, and poor resource utilization. This paper proposes a Generative AI-based framework that autonomously analyzes, designs, and optimizes data pipelines to improve overall analytical performance. The architecture uses large language models (LLMs) and reinforcement learning to identify optimal pipeline designs, dynamically adapt resource allocation, and tune data flow in real time. It also incorporates an automated self-optimization feedback loop that learns continuously from workload trends, metadata, and past performance indicators. Single-run benchmarks of analytics workloads on experimental platforms show a 42.7% reduction in pipeline latency, a 58.4% improvement in throughput, and a 31.2% reduction in compute cost compared with baseline automation methods. In addition, the framework attained 92.3% accuracy in pipeline recommendation, which underscores its flexibility across different data processing settings. The proposed design provides a basis for self-engineered data architectures that support scalable, cost-effective, and self-optimizing analytic engines and data operations in the next generation of data-driven businesses.










