Architectures for Low-Latency AI Inference on Streaming Enterprise Data

Authors

  • Stewyn Chaudhary, Independent Researcher, USA

DOI:

https://doi.org/10.63282/3050-9262.IJAIDSML-V7I2P116

Keywords:

Low-Latency Inference, Streaming Data, Model Optimization, Post-Training Quantization, Knowledge Distillation, Structured Pruning, Enterprise AI, Apache Flink, Apache Kafka, Real-Time Machine Learning, Edge Inference, Failure Modes, Operational Risk

Abstract

Enterprise data streams now demand AI inference systems that deliver low-latency predictions without sacrificing model accuracy. This paper presents a literature-based comparative study of four architectural patterns for machine-learning inference on high-throughput enterprise streams: in-stream inference (ISI), microservice-based pipelines (MIP), co-located model serving (CMS), and edge-offloaded inference (EOI). Each pattern is evaluated on end-to-end latency, throughput, resource utilization, and operational complexity across three representative workloads: financial fraud detection, IoT telemetry anomaly detection, and real-time recommendation. Particular attention is given to three model-optimization techniques (post-training quantization, structured pruning, and knowledge distillation) as the primary instruments for latency reduction. Synthesizing published benchmarks, the analysis shows that in-stream inference combined with distilled, INT8-quantized models can achieve materially lower end-to-end latency than REST-based microservice baselines, with typical accuracy losses of 1–2 percentage points. A comparative analysis of failure modes across patterns is presented, together with a pattern-selection decision matrix and design guidelines for practitioners building production AI inference systems on Apache Kafka, Apache Flink, and equivalent cloud-native streaming platforms.
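
To make the in-stream inference (ISI) pattern described in the abstract concrete, the following minimal Python sketch embeds a hypothetical distilled, INT8-quantized ONNX model directly in a Kafka consume-score-produce loop, so each prediction avoids the network hop of a REST-based microservice call. The topic names, message layout, and model file (fraud_distilled_int8.onnx) are illustrative assumptions for this sketch, not artifacts of the paper.

import json

import numpy as np
import onnxruntime as ort                       # in-process model execution
from kafka import KafkaConsumer, KafkaProducer  # kafka-python client

# Hypothetical model: a distilled student network, post-training quantized
# to INT8 and exported to ONNX (the file name is an assumption).
session = ort.InferenceSession("fraud_distilled_int8.onnx")
input_name = session.get_inputs()[0].name

consumer = KafkaConsumer(
    "transactions",                             # assumed input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for record in consumer:
    # Assumed message shape: {"txn_id": ..., "features": [f1, f2, ...]}.
    features = np.asarray([record.value["features"]], dtype=np.float32)
    # Scoring runs in the same process as stream consumption, so end-to-end
    # latency is dominated by model execution rather than RPC overhead.
    score = float(session.run(None, {input_name: features})[0].ravel()[0])
    producer.send("fraud-scores",
                  {"txn_id": record.value["txn_id"], "score": score})

In a production Apache Flink deployment, the same model would typically be loaded once per task (for example, in a RichMapFunction's open() method or its PyFlink equivalent) so that the engine's checkpointing and backpressure mechanisms manage the inference operator alongside the rest of the pipeline.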

References

[1] M. Armbrust et al., “Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark,” in Proc. ACM SIGMOD Int. Conf. Manage. Data (SIGMOD), Houston, TX, USA, 2018, pp. 601–613.

[2] T. Akidau et al., “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing,” Proc. VLDB Endow., vol. 8, no. 12, pp. 1792–1803, Aug. 2015.

[3] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, “A Survey of FPGA-Based Neural Network Inference Accelerators,” ACM Trans. Reconfigurable Technol. Syst., vol. 12, no. 1, pp. 1–26, Mar. 2019.

[4] N. Narkhede, G. Shapira, and T. Palino, Kafka: The Definitive Guide, 2nd ed. Sebastopol, CA, USA: O’Reilly Media, 2021.

[5] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, “Apache Flink: Stream and Batch Processing in a Single Engine,” Bull. IEEE Comput. Soc. Tech. Comm. Data Eng., vol. 38, no. 4, pp. 28–38, Dec. 2015.

[6] T. Akidau et al., “MillWheel: Fault-Tolerant Stream Processing at Internet Scale,” Proc. VLDB Endow., vol. 6, no. 11, pp. 1033–1044, Aug. 2013.

[7] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, “Clipper: A Low-Latency Online Prediction Serving System,” in Proc. 14th USENIX Symp. Networked Syst. Des. Implement. (NSDI), Boston, MA, USA, 2017, pp. 613–627.

[8] NVIDIA Corporation, “Triton Inference Server,” NVIDIA Developer Documentation. [Online]. Available: https://developer.nvidia.com/triton-inference-server

[9] D. Baylor et al., “TFX: A TensorFlow-Based Production-Scale Machine Learning Platform,” in Proc. 23rd ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining (KDD), Halifax, NS, Canada, 2017, pp. 1387–1395.

[10] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A Survey of Quantization Methods for Efficient Neural Network Inference,” in Low-Power Computer Vision, Boca Raton, FL, USA: Chapman & Hall/CRC, 2022, pp. 291–326.

[11] B. Jacob et al., “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Salt Lake City, UT, USA, 2018, pp. 2704–2713.

[12] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning Filters for Efficient ConvNets,” in Proc. Int. Conf. Learn. Represent. (ICLR), Toulon, France, Apr. 2017.

[13] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” arXiv:1503.02531, Mar. 2015.

[14] W. Park, D. Kim, Y. Lu, and M. Cho, “Relational Knowledge Distillation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Long Beach, CA, USA, 2019, pp. 3967–3976.

[15] L. Hou, Z. Huang, L. Shang, X. Jiang, X. Chen, and Q. Liu, “DynaBERT: Dynamic BERT with Adaptive Width and Depth,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, 2020, pp. 9782–9793.

[16] C. Olston et al., “TensorFlow-Serving: Flexible, High-Performance ML Serving,” arXiv:1712.06139, Dec. 2017.

[17] E. Jonas, Q. Pu, S. Venkataraman, I. Stoica, and B. Recht, “Occupy the Cloud: Distributed Computing for the 99%,” in Proc. ACM Symp. Cloud Comput. (SoCC), Santa Clara, CA, USA, 2017, pp. 445–451.

[18] A. Agache et al., “Firecracker: Lightweight Virtualization for Serverless Applications,” in Proc. 17th USENIX Symp. Networked Syst. Des. Implement. (NSDI), Santa Clara, CA, USA, 2020, pp. 419–434.

[19] Google LLC, Protocol Buffers Developer Guide. [Online]. Available: https://protobuf.dev

[20] Z. Bai, Z. Zhang, Y. Zhu, and X. Jin, “PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications,” in Proc. 14th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), Nov. 2020, pp. 499–514.

[21] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Salt Lake City, UT, USA, 2018, pp. 4510–4520.

[22] X. Wei, R. Gong, Y. Li, X. Liu, and F. Yu, “QDrop: Randomly Dropping Quantization for Extremely Low-Bit Post-Training Quantization,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2022.

[23] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned,” in Proc. 57th Annu. Meet. Assoc. Comput. Linguist. (ACL), Florence, Italy, 2019, pp. 5797–5808.

[24] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter,” arXiv:1910.01108, Oct. 2019.

[25] K. Lottick, S. Susai, S. A. Friedler, and J. P. Wilson, “Energy Usage Reports: Environmental Awareness as Part of Algorithmic Accountability,” in NeurIPS Workshop on Tackling Climate Change with ML, Vancouver, BC, Canada, Dec. 2019.

[26] T. Akidau, S. Chernyak, and R. Lax, Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing. Sebastopol, CA, USA: O’Reilly Media, 2018.

Published

2026-04-20

Issue

Vol. 7 No. 2 (2026)

Section

Articles

How to Cite

Chaudhary S. Architectures for Low-Latency AI Inference on Streaming Enterprise Data. IJAIDSML [Internet]. 2026 Apr. 20 [cited 2026 May 3];7(2):106-11. Available from: https://ijaidsml.org/index.php/ijaidsml/article/view/559