The Collapse of Predictability in Large-Pool Redundant Storage

Mallikarjun Vppalapati; Phani Kumar Talasila

doi:10.63282/3050-9262.IJAIDSML-V5I2P125

Authors

Mallikarjun Vppalapati, Sr Cloud Systems Engineer at INFOR (US), LLC, USA.
Phani Kumar Talasila, Storage Engineer III at romedica health systems, USA.

DOI:

https://doi.org/10.63282/3050-9262.IJAIDSML-V5I2P125

Keywords:

Large-Scale Storage, Redundancy, Failure Domains, Distributed Systems, Predictability Collapse, Hyperscale Infrastructure, Reliability Engineering, Storage Resilience

Abstract

Hyperscale data centers have grown so quickly that storage architecture has shifted from isolated, well-defined systems to extremely large, interconnected pools composed of thousands of drives, controllers, and failure domains. Redundancy mechanisms such as replication and erasure coding have been the main means of ensuring reliability, but their behavior at such scale is no longer well understood. This paper examines the unpredictability of large-pool redundant storage systems that have grown to the point where the traditional assumptions of independent failures and linear scaling no longer hold. As storage infrastructures expand, correlated failures, repair traffic amplification, rebuild contention, and thermal or power-domain coupling increasingly shape the overall dynamics, so that deterministic methods for capacity planning and availability forecasting are no longer viable. The central argument is that once scale and redundancy density exceed a certain threshold, the system's reliability profile can no longer be obtained by summing the contributions of individual components; it becomes a property of the whole system. Consequently, increasing redundancy does not necessarily increase resilience and may even heighten systemic risk. To support this claim, the authors combine analytical reliability modeling with large-scale simulation and patterns from real-world operational telemetry to measure failure propagation, rebuild times, and resource contention under various redundancy configurations. Effects that contribute to failure clustering and repair amplification are studied using mean-time-to-data-loss predictions, while background recovery traffic is shown to cause fluctuations in performance stability; these effects reinforce one another under stress.
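
The claim that added redundancy stops buying resilience once failures are correlated can be illustrated with a toy calculation. The sketch below is not the authors' model; it is a minimal illustration under stated assumptions. The names `p` (per-drive failure probability within one repair window), `q` (probability of a shared-domain event, such as a power or thermal fault, that removes every copy in a group), and both function names are hypothetical, and groups are assumed to be independent of one another.

```python
import math


def group_loss_probability(p, replicas, q=0.0):
    """Probability that one redundancy group loses every copy in a repair window.

    p: assumed independent per-drive failure probability in the window
    q: assumed probability of a shared-domain event that removes all copies
    """
    # Either the shared-domain event occurs, or it does not and all
    # replicas fail independently.
    return q + (1.0 - q) * p ** replicas


def pool_loss_probability(p, replicas, groups, q=0.0):
    """Probability that at least one of `groups` independent groups loses data."""
    per_group = group_loss_probability(p, replicas, q)
    # Equivalent to 1 - (1 - per_group) ** groups, written with log1p/expm1
    # so that very small per-group probabilities are not rounded away.
    return -math.expm1(groups * math.log1p(-per_group))


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for replicas in (3, 5):
        for q in (0.0, 1e-6):
            risk = pool_loss_probability(p=1e-3, replicas=replicas, groups=100_000, q=q)
            print(f"replicas={replicas} q={q:g} pool_loss={risk:.3e}")
```

In this toy model, going from three to five copies reduces the loss probability sharply when `q` is zero, but makes almost no difference once `q` dominates the per-group term, and the pool-level risk grows with the number of groups in either case. It does not capture rebuild contention or repair traffic amplification, which the abstract identifies as further sources of coupling.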

