From SQL to Spark: My Journey into Big Data and Scalable Systems How I Debug Complex Issues in Large Codebases

Authors

  • Bhavitha Guntupalli, ETL/Data Warehouse Developer at Blue Cross Blue Shield of Illinois, USA.

DOI:

https://doi.org/10.63282/3050-9262.IJAIDSML-V6I1P119

Keywords:

Big Data, SQL, Apache Spark, Scalable Systems, Debugging, Distributed Computing, Data Engineering, Codebase Complexity, Performance Optimization

Abstract

Moving from SQL to Apache Spark is like stepping out of a well-lit workshop into the vast, dynamic environment of modern big data systems. I began in the familiar realm of relational databases, where structured data and carefully written queries shaped my work. But as data volumes grew and batch jobs started taking hours or even days, I realized that traditional SQL systems were insufficient. That was when I turned to Spark. The transition was not haphazard: I had to rethink data pipelines, fault tolerance, and distributed computing. Debugging large-scale Spark systems posed challenges I had not faced before, including tracing lineage through Directed Acyclic Graphs (DAGs), managing memory spills, and optimizing cluster performance. I spent days reviewing Spark UI logs, learning how small coding mistakes can seriously degrade performance. These experiences sharpened my intuition and made it clear that scalable systems demand scalable thinking: design for failure, break jobs up for parallel execution, and check your output often. The most important lesson was a change in perspective, from seeing data operations as individual SQL queries to seeing them as parts of a dynamic, robust architecture. My path with Spark has strengthened both my technical knowledge and my engineering ability to handle uncertainty, scalability, and change through simplicity. This shift goes beyond replacing one tool with another: it reflects how our thinking must evolve in a period when data grows at unprecedented speed.

Published

2025-02-05

Section

Articles

How to Cite

1. Guntupalli B. From SQL to Spark: My Journey into Big Data and Scalable Systems How I Debug Complex Issues in Large Codebases. IJAIDSML [Internet]. 2025 Feb. 5 [cited 2025 Oct. 25];6(1):174-85. Available from: https://ijaidsml.org/index.php/ijaidsml/article/view/206