Unobservable Performance: Storage Failures That Leave No Metrics Behind
DOI:
https://doi.org/10.63282/3050-9262.IJAIDSML-V4I4P120
Keywords:
Unobservable Failures, Storage Systems, Performance Degradation, Observability Gaps, Silent Failures, System Monitoring, Reliability Engineering, Anomaly Detection
Abstract
Modern computer systems increasingly rely on intricate, layered storage stacks, and the performance of these layers can fail in ways that are hard to identify from error reports or measurable faults. A class of silent, or unobservable, storage performance failures can degrade application throughput, latency, and reliability without triggering conventional alerts, counters, or health metrics. These failures originate in subtle interactions among firmware, controllers, caching layers, and the operating system, which leave system administrators with little or no visibility into the real cause of performance anomalies. Traditional monitoring frameworks rely largely on explicit error reporting, averaged latency metrics, or device-level statistics, which usually cannot capture transient stalls, internal throttling, or firmware-induced behaviours that occur below the device interface and are not directly observable. This paper addresses the detection and characterization of such invisible performance failures with a method that integrates workload-aware probing, cross-layer correlation, and anomaly-driven inference to expose hidden storage inefficiencies. Instead of relying only on device-reported metrics, the method uses end-to-end application behaviour, temporal patterns, and indirect performance signatures to infer storage pathologies. A real-world case study describes a production system whose performance was severely degraded while its health indicators reported a normal state. The diagnostic procedure of the proposed technique later confirmed that the cause was a silent internal storage bottleneck.
These findings show that system performance can deteriorate significantly without being noticed, and that because the loss goes undetected for long periods, it leads to misdiagnosis, poor scaling decisions, and reduced system reliability.
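As a minimal illustration of why averaged latency metrics can miss transient stalls, the following Python sketch compares the mean of a window of application-observed latencies with its 99th percentile. It is not the paper's detection method. The function names, the sample window, and both alert thresholds are hypothetical values chosen for the example.

```python
import statistics


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (integer pct in 1..100) by nearest rank."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms):
    """Summarize one window of end-to-end request latencies (milliseconds)."""
    return {
        "mean": statistics.fmean(latencies_ms),
        "p99": percentile_nearest_rank(latencies_ms, 99),
    }


def mean_hides_tail(latencies_ms, mean_alert_ms, tail_alert_ms):
    """True when the mean looks healthy but the tail breaches its threshold.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    summary = summarize(latencies_ms)
    return summary["mean"] < mean_alert_ms and summary["p99"] >= tail_alert_ms


# Hypothetical window: 980 fast requests plus a brief stall of 20 slow ones.
window = [1.0] * 980 + [200.0] * 20

if __name__ == "__main__":
    print(summarize(window))
    print(mean_hides_tail(window, mean_alert_ms=10.0, tail_alert_ms=50.0))
```

In this synthetic window the mean stays below the assumed 10 ms alert level while the 99th percentile sits at the stall latency, so an alert keyed only to the average would stay silent.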