Optimizing Large-Scale ML Training using Cloud-based Distributed Computing

Authors

  • Yasodhara Varma, Vice President, JPMorgan Chase & Co., USA
  • Manivannan Kothandaraman, Vice President, Senior Lead Software Engineer, JPMorgan Chase & Co., USA

DOI:

https://doi.org/10.63282/3050-9262.IJAIDSML-V3I3P105

Keywords:

Cloud computing, distributed ML training, AWS EMR, GCP, Azure, Dask, Ray, Apache Spark, cost optimization, autoscaling, GPU acceleration, fraud detection, real-time ML, Kubernetes, SageMaker, Vertex AI, Azure ML, scalable ML workflows, cloud cost management

Abstract

The development of large-scale machine learning (ML) models depends on strong and efficient cloud-based infrastructure. This work investigates best practices for building cloud-native ML training environments on AWS, GCP, and Azure. We highlight the role of distributed computing frameworks such as Apache Spark, Dask, and Ray in handling large volumes of data and accelerating training. Beyond performance, cost control is a major concern: we examine how spot instances, autoscaling, and storage optimization balance computing capacity against budget constraints. We present a case study of financial fraud detection using Dask on AWS EMR, showing how distributed ML workloads can be tuned to reduce both training time and cost while preserving model accuracy. The article offers practical guidance for cloud architects, data scientists, and ML engineers building scalable, cost-effective ML training pipelines in the cloud.
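To make the case-study pattern concrete, below is a minimal sketch (not the authors' code) of distributed fraud-detection training with Dask, of the kind the abstract describes. The scheduler address, S3 path, feature and label column names, and the choice of dask-ml's logistic regression are all illustrative assumptions; on EMR the scheduler would typically run on the primary node.

```python
# Minimal sketch of Dask-based distributed training for fraud detection.
# All addresses, paths, and column names below are hypothetical placeholders.
import dask.dataframe as dd
from dask.distributed import Client
from dask_ml.linear_model import LogisticRegression
from dask_ml.model_selection import train_test_split

# Connect to a running Dask scheduler (placeholder address; on AWS EMR this
# would point at the scheduler process on the cluster's primary node).
client = Client("tcp://scheduler-address:8786")

# Lazily read partitioned transaction data from S3; partitions are processed
# in parallel across workers rather than loaded onto a single machine.
df = dd.read_parquet("s3://my-bucket/transactions/")  # hypothetical path

# Illustrative feature and label columns; lengths=True materializes chunk
# sizes, which dask-ml estimators require.
X = df[["amount", "merchant_risk", "account_age_days"]].to_dask_array(lengths=True)
y = df["is_fraud"].to_dask_array(lengths=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# dask-ml's LogisticRegression fits across the distributed partitions.
model = LogisticRegression()
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

The same EMR cluster can be provisioned with spot instance fleets and managed autoscaling, which is how the cost levers discussed in the paper (spot pricing, autoscaling, storage tiering) attach to a workload like this one.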

Published

2022-10-30

Issue

Vol. 3 No. 3 (2022)

Section

Articles

How to Cite

Varma Y, Kothandaraman M. Optimizing Large-Scale ML Training using Cloud-based Distributed Computing. IJAIDSML [Internet]. 2022 Oct. 30 [cited 2025 Sep. 15];3(3):45-54. Available from: https://ijaidsml.org/index.php/ijaidsml/article/view/84