Hybrid Accelerator Selection for Generative AI Workloads: A Cost-Effective Approach Based on Model Type and Pipeline Stage

Authors

  • Rajalakshmi Srinivasaraghavan, Independent Researcher, USA

DOI:

https://doi.org/10.63282/3050-9262.IJAIDSML-V7I2P101

Keywords:

AI, Performance, CPU, Machine Learning, Deep Learning, LLM

Abstract

With the increasing adoption of machine learning, AI, and large language models in production environments, the demand for computational acceleration continues to intensify. However, defaulting to expensive off-chip accelerators, such as GPUs and TPUs, for all workloads leads to unnecessary cost inefficiency. This paper presents a framework for hybrid accelerator selection that considers model characteristics, pipeline stages, and workload requirements. We demonstrate that strategic use of on-chip acceleration (CPU-based SIMD) for smaller models and latency-insensitive workloads, combined with selective deployment of off-chip accelerators for compute-intensive tasks, can significantly reduce infrastructure costs while meeting performance requirements.
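The selection strategy described in the abstract can be sketched as a simple decision rule. The following is an illustrative sketch only, not the paper's actual framework: the `Workload` fields, thresholds, and function name are placeholder assumptions chosen to make the idea concrete.

```python
# Toy heuristic for hybrid accelerator selection, illustrating the idea of
# routing small or latency-insensitive workloads to on-chip CPU SIMD and
# compute-intensive workloads to off-chip accelerators.
# All thresholds below are illustrative placeholders, not values from the paper.
from dataclasses import dataclass


@dataclass
class Workload:
    model_params_millions: float  # model size in millions of parameters
    stage: str                    # pipeline stage, e.g. "preprocessing", "inference", "training"
    latency_sensitive: bool       # whether the workload has tight latency requirements


def select_accelerator(w: Workload) -> str:
    """Return a suggested accelerator class for a given workload."""
    # Training and very large models generally justify off-chip accelerators.
    if w.stage == "training" or w.model_params_millions > 1000:
        return "off-chip (GPU/TPU)"
    # Smaller models and latency-insensitive stages can run on CPU SIMD
    # extensions (e.g. AVX-512, AMX, SVE, MMA) at lower infrastructure cost.
    if not w.latency_sensitive or w.model_params_millions < 100:
        return "on-chip (CPU SIMD)"
    # Mid-size, latency-sensitive inference falls back to off-chip hardware.
    return "off-chip (GPU/TPU)"
```

In practice such a rule would also weigh batch size, throughput targets, and per-hour accelerator cost, but the core trade-off is the one shown: reserve off-chip hardware for the workloads that actually need it.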

References

[1] Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Advances in Neural Information Processing Systems, 33, 9459-9474.

[2] Intel Corporation. (2023). "Intel Advanced Vector Extensions 512 (Intel AVX-512) Overview." Intel Developer Zone.

[3] Intel Corporation. (2022). "Intel Advanced Matrix Extensions (Intel AMX) Overview and Programming Guide." Intel Architecture Instruction Set Extensions Programming Reference.

[4] ARM Limited. (2023). "ARM Scalable Vector Extension: A Vector Length Agnostic Architecture." ARM Architecture Reference Manual Supplement.

[5] NVIDIA Corporation. (2023). "NVIDIA H100 Tensor Core GPU Architecture." NVIDIA Technical Whitepaper.

[6] Google Cloud. (2023). "Cloud TPU System Architecture." Google Cloud Documentation.

[7] Intel Corporation. (2023). "Intel oneAPI Deep Neural Network Library (oneDNN) Performance Benchmarks." Intel oneAPI Documentation.

[8] IBM Corporation. "IBM Spyre Accelerator for Power." https://www.ibm.com/docs/en/ibm-spyre-for-power?topic=spyre-accelerator-power

[9] IBM Corporation. "Introduction to MMA (Matrix-Math Accelerator) Component of Power10 Systems." https://www.ibm.com/support/pages/introduction-mma-matrix-math-accelerator-component-power10-systems

Published

2026-04-02

Issue

Vol. 7 No. 2 (2026)

Section

Articles

How to Cite

Srinivasaraghavan R. Hybrid Accelerator Selection for Generative AI Workloads: A Cost-Effective Approach Based on Model Type and Pipeline Stage. IJAIDSML [Internet]. 2026 Apr. 2 [cited 2026 Apr. 2];7(2):1-3. Available from: https://ijaidsml.org/index.php/ijaidsml/article/view/508