High-Throughput Data Structures for GPU-Accelerated Computing

Authors

  • Karthik Reddy, Cloud Solutions Architect, Oracle, UAE

DOI:

https://doi.org/10.63282/3050-9262.IJAIDSML-V3I1P102

Keywords:

GPU Computing, High-Throughput Data Structures, Parallel Computing, Memory Access Patterns, Sparse Matrix, Tree Structures, Hash Tables, CUDA, OpenCL

Abstract

The increasing demand for high-performance computing in fields such as scientific simulation, machine learning, and data analysis has driven the adoption of Graphics Processing Units (GPUs) as accelerators. GPUs offer massive parallelism, but harnessing that power effectively requires careful attention to data structures. Traditional CPU-centric data structures often become bottlenecks in GPU environments because of unfavorable memory access patterns and synchronization overhead. This paper surveys high-throughput data structures designed specifically for GPU-accelerated computing. We discuss the key considerations in GPU data structure design, including memory layout, access patterns, concurrency management, and data transfer strategies. We then examine specific data structures optimized for GPU execution, such as array-of-structures versus structure-of-arrays layouts, sparse matrix formats, tree-based structures (e.g., B-trees), and hash tables, analyzing their performance characteristics, trade-offs, and suitability for different application domains. Finally, we present case studies demonstrating the effectiveness of these data structures in real-world GPU-accelerated applications and discuss future research directions in this critical area.
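
To make the memory-layout consideration above concrete, the minimal CUDA C++ sketch below contrasts an array-of-structures (AoS) layout with a structure-of-arrays (SoA) layout for a toy particle update. The struct and kernel names (ParticleAoS, ParticlesSoA, scale_aos, scale_soa) and the launch parameters are illustrative assumptions, not code from the paper; the sketch only shows why adjacent threads reading adjacent floats (SoA) coalesce memory accesses more readily than threads striding across multi-field structs (AoS).

// Illustrative sketch (not the authors' code): AoS vs. SoA layouts on the GPU.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 1 << 20;

// AoS: one struct per particle; a warp's threads stride across 16-byte records.
struct ParticleAoS { float x, y, z, mass; };

// SoA: one array per field; a warp's threads read consecutive floats,
// which the hardware can coalesce into fewer memory transactions.
struct ParticlesSoA { float *x, *y, *z, *mass; };

__global__ void scale_aos(ParticleAoS *p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                       // each thread touches one struct
        p[i].x *= s; p[i].y *= s; p[i].z *= s;
    }
}

__global__ void scale_soa(ParticlesSoA p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                       // adjacent threads hit adjacent floats
        p.x[i] *= s; p.y[i] *= s; p.z[i] *= s;
    }
}

int main() {
    ParticleAoS *aos;
    ParticlesSoA soa;
    cudaMalloc((void**)&aos,      N * sizeof(ParticleAoS));
    cudaMalloc((void**)&soa.x,    N * sizeof(float));
    cudaMalloc((void**)&soa.y,    N * sizeof(float));
    cudaMalloc((void**)&soa.z,    N * sizeof(float));
    cudaMalloc((void**)&soa.mass, N * sizeof(float));

    dim3 block(256), grid((N + 255) / 256);
    scale_aos<<<grid, block>>>(aos, 1.5f, N);   // strided accesses
    scale_soa<<<grid, block>>>(soa, 1.5f, N);   // coalesced accesses
    cudaDeviceSynchronize();

    cudaFree(aos);
    cudaFree(soa.x); cudaFree(soa.y); cudaFree(soa.z); cudaFree(soa.mass);
    printf("done\n");
    return 0;
}

In practice the better layout depends on access patterns: SoA typically wins when kernels touch only a subset of fields, while AoS can be acceptable when every field of a record is read together.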

Published

2022-02-15

Issue

Vol. 3 No. 1 (2022)

Section

Articles

How to Cite

Reddy K. High-Throughput Data Structures for GPU-Accelerated Computing. IJAIDSML [Internet]. 2022 Feb. 15 [cited 2025 Sep. 15];3(1):10-7. Available from: https://ijaidsml.org/index.php/ijaidsml/article/view/32