Distributed Machine Learning for Big Data Analytics: Challenges, Architectures, and Optimizations
DOI:
https://doi.org/10.63282/3050-9262.IJAIDSML-V4I3P103
Keywords:
Distributed Machine Learning, Big Data Analytics, Federated Learning, Edge Computing, Scalability, Optimization Techniques, Cloud Computing
Abstract
The rapid growth of big data has driven the development of Distributed Machine Learning (DML) strategies that process large volumes of information across many computing nodes. Traditional machine learning models scale poorly, are computationally expensive, and struggle with real-time processing; distributed architectures therefore offer a solution for large-scale data analytics. This paper presents and discusses DML architectures, including centralised, decentralised, federated, and edge-based computing paradigms. The main challenges in DML are network latency, growth in system scale, security concerns, and model convergence. Data privacy is preserved through homomorphic encryption for computation on encrypted data, differential privacy to prevent leakage of sensitive data, and secure aggregation for federated learning. Further priorities include allocating resources effectively and minimizing network traffic. The optimization techniques covered include gradient compression, adaptive learning, and resource management using reinforcement learning. A case study on real-time driver performance analysis for IoT-based fleet management under high traffic shows how DML can be applied in a large-scale working system. In that system, the three proactive structures kept predictive accuracy high, and scaling through Kubernetes further reduced operational cost. Finally, proposed directions for future work include blockchain-adapted federated learning, neuromorphic computing, and AI automation in distributed environments. Addressing these challenges remains essential to realizing the full potential of DML, and therefore of big data analytics, across industries.
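As an illustration of the gradient compression mentioned above, the following is a minimal sketch of top-k sparsification, one common form of gradient compression; the abstract does not say which scheme the paper uses, so this choice is an assumption. The function names `top_k_compress` and `aggregate` and the gradient values are hypothetical. Each worker sends only its k largest-magnitude gradient entries, and a coordinator averages the sparse updates. Practical systems usually also accumulate the dropped entries locally (error feedback), which is omitted here for brevity.

```python
from typing import Dict, List


def top_k_compress(gradient: List[float], k: int) -> Dict[int, float]:
    """Keep the k largest-magnitude entries as a sparse {index: value} map."""
    if k <= 0:
        return {}
    # Rank indices by absolute value, largest first; ties keep the lower index
    # because Python's sort is stable.
    ranked = sorted(range(len(gradient)), key=lambda i: -abs(gradient[i]))
    return {i: gradient[i] for i in ranked[:k]}


def aggregate(sparse_updates: List[Dict[int, float]], size: int) -> List[float]:
    """Average sparse updates from several workers into one dense gradient."""
    total = [0.0] * size
    for update in sparse_updates:
        for index, value in update.items():
            total[index] += value
    return [v / len(sparse_updates) for v in total]


# Hypothetical gradients from two workers (illustrative values only).
worker_a = [0.5, -2.0, 0.1, 4.0]
worker_b = [1.0, 0.25, -3.0, 2.0]

# Each worker transmits 2 of its 4 entries, halving the values sent.
compressed = [top_k_compress(g, k=2) for g in (worker_a, worker_b)]
averaged = aggregate(compressed, size=4)
print(compressed)
print(averaged)
```

With k set to half the gradient length, each worker transmits half as many values, at the cost of ignoring small entries in that round; the right k in practice depends on the model and the network budget.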