AI-Assisted Forecasting and Capacity Planning in Multi-Tenant Environments
DOI: https://doi.org/10.63282/3050-9262.IJAIDSML-V7I1P120
Keywords: Artificial Intelligence, Capacity Planning, Infrastructure as Code, Site Reliability Engineering, Machine Learning, Reliability Engineering
Abstract
Capacity planning and infrastructure automation remain among the hardest problems in modern distributed systems, with a direct impact on overall cost. They are nevertheless often underestimated, in part because hardware keeps improving and each generation of CPU cores delivers more performance. As infrastructure grows in size and complexity, simple historical statistics and operational models built on past data fail to provide reliable demand forecasts and cost projections. Building on prior research into operational ownership, infrastructure reliability, and deployment correctness, this paper investigates the application of Artificial Intelligence (AI) to capacity planning and Infrastructure as Code (IaC). We demonstrate how learning-based optimization, predictive analysis, and automation can improve planning accuracy, reduce operational toil, and help achieve reliability and efficiency objectives. The paper positions AI not as a replacement for capacity-management systems but as a force multiplier that raises their accuracy, keeping forecasts closely aligned with observed demand.
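To make the forecasting idea concrete, the sketch below projects per-tenant demand forward and estimates when capacity would be exhausted. It deliberately uses a closed-form linear trend rather than the learning-based models discussed in the paper; all workload numbers and function names are hypothetical.

```python
# Minimal capacity-forecasting sketch: fit a linear trend to historical
# peak demand and project when it crosses the provisioned capacity.
# Illustrative only; real planners would use richer ML models.

def fit_linear_trend(demand):
    """Least-squares fit of demand[i] ~ a + b*i (closed form)."""
    n = len(demand)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(demand) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, demand))
    var = sum((x - mean_x) ** 2 for x in xs)
    b = cov / var          # slope: demand growth per week
    a = mean_y - b * mean_x  # intercept
    return a, b

def weeks_until_exhaustion(demand, capacity):
    """Project the trend forward; return weeks until demand exceeds capacity,
    or None if demand is flat or shrinking."""
    a, b = fit_linear_trend(demand)
    if b <= 0:
        return None
    week = len(demand)
    while a + b * week <= capacity:
        week += 1
    return week - len(demand)

# Hypothetical weekly peak CPU-core demand for one tenant
history = [100, 110, 118, 131, 140, 152]
print(weeks_until_exhaustion(history, capacity=220))  # -> 6
```

The same "fit, project, compare against headroom" loop is what a learning-based planner automates at scale, swapping the linear fit for models that capture seasonality and tenant interference.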