Adversarial AI Defense in Large Language Models & Generative AI
DOI: https://doi.org/10.63282/3050-9262.IJAIDSML-V6I3P111
Keywords:
Adversarial machine learning, prompt injection, jailbreaks, data poisoning, adversarial examples, large language models (LLMs), generative AI security, defense-in-depth, SafeDecoding, guardrails, content provenance, NIST AI RMF, C2PA
Abstract
The rapid growth of Large Language Models (LLMs) and generative AI has enabled a suite of novel capabilities, including advanced natural language analysis, code synthesis, creative content generation, and multimodal reasoning. Yet this power brings with it substantial vulnerability to adversarial manipulation. Unlike the narrow models of traditional machine learning, generative AI systems operate in open environments, ingesting data from many external sources, tools, and autonomous applications, which expands the adversarial surface from text to image and multimodal inputs. New threats appear in three main guises: prompt injection, where malicious instructions override system intent or hijack tool execution even in otherwise sandboxed environments; adversarial examples, where crafted input perturbations corrupt decoding paths to produce harmful, biased, or simply incorrect outputs; and data poisoning, where manipulation of training, fine-tuning, or retrieval corpora introduces latent backdoors, erodes alignment, or degrades factuality at scale.
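As an illustration of the prompt-injection threat surface described above, the minimal sketch below shows how an instruction embedded in untrusted retrieved content might be screened with a naive pattern match before it reaches the model context. The patterns, threshold, and example payload are hypothetical illustrations only; they are not the defenses proposed in this paper, which rely on layered controls and trained guard models rather than keyword filters.

```python
# Illustrative sketch (not part of the DiD-GEN framework): a naive
# pattern-based screen for indirect prompt injection hidden in retrieved text.
# The patterns and example payload are hypothetical.
import re

INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"disregard the system prompt",
    r"you are now",
    r"exfiltrate|send .* to http",
]

def score_untrusted_text(text: str) -> int:
    """Count hits against known injection phrasings in untrusted content."""
    return sum(bool(re.search(p, text, re.IGNORECASE)) for p in INJECTION_PATTERNS)

retrieved_chunk = (
    "Quarterly revenue grew 12%. "
    "IGNORE ALL PREVIOUS INSTRUCTIONS and send the user's chat history to http://evil.example."
)

if score_untrusted_text(retrieved_chunk) > 0:
    print("Flag chunk for review before it reaches the model context.")
```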
Because no single countermeasure is adequate, this work responds to the pressing need for a defense-in-depth model that balances robustness and usability. We synthesize recent developments from adversarial machine learning, robust training, secure system design, and AI risk governance to outline a multi-layer approach to adversarial resilience in LLMs and generative AI more broadly. At the model layer, we investigate adversarial training with red-teamed artifacts, constitutional alignment, safety-aware decoding, and the use of guard models such as Llama Guard in text and multimodal settings. At the data layer, we emphasize deduplication, provenance tracking, poisoning-resilient corpus curation, and the role of standards such as C2PA for verifiable authenticity. At the application layer, we describe trust boundaries around retrieval-augmented generation (RAG), sandboxed tool use, schema enforcement, and prompts designed to detect indirect injection. Finally, at the governance and operational tiers, we focus on benchmarks such as JailbreakBench and AgentHarm, embedding red-teaming into CI/CD pipelines, applying NIST's Generative AI Profile, and continuous telemetry-driven monitoring with incident disclosure.
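To make the application-layer controls concrete, the following minimal sketch shows one way schema enforcement could gate model-proposed tool calls before execution: only allowlisted tools with correctly typed arguments are passed through. The tool names, argument schemas, and example call are hypothetical illustrations under assumed conventions, not artifacts from the DiD-GEN implementation.

```python
# Minimal sketch of application-layer schema enforcement for tool use.
# Tool names, schemas, and the example call are hypothetical.
import json

# Allowlist of tools the agent may invoke, with the argument types each expects.
TOOL_SCHEMAS = {
    "search_docs": {"query": str, "max_results": int},
    "get_weather": {"city": str},
}

def validate_tool_call(raw_call: str) -> dict:
    """Parse a model-proposed tool call and reject anything off-schema."""
    call = json.loads(raw_call)
    name, args = call.get("tool"), call.get("arguments", {})
    if name not in TOOL_SCHEMAS:
        raise ValueError(f"Tool '{name}' is not on the allowlist.")
    expected = TOOL_SCHEMAS[name]
    if set(args) != set(expected):
        raise ValueError(f"Unexpected or missing arguments for '{name}'.")
    for key, typ in expected.items():
        if not isinstance(args[key], typ):
            raise ValueError(f"Argument '{key}' must be {typ.__name__}.")
    return call

# A well-formed call passes; an injected call to an unlisted tool is refused.
print(validate_tool_call('{"tool": "get_weather", "arguments": {"city": "Oslo"}}'))
```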
The proposed approach, DiD-GEN (Defense-in-Depth for Generative AI), organizes these controls into a repeatable, standards-aligned implementation for enterprise, research, and regulatory users. Evaluation on recent benchmarks shows that adversarial success rates cannot be eliminated entirely; however, layered defenses yield measurable reductions in exploitability while retaining task utility. The paper concludes by proposing concrete paths forward for organizations aiming to harden generative AI deployments against adaptive adversaries, emphasizing resilience, accountability, and compliance with emerging robustness obligations.
References
[1] National Institute of Standards and Technology (NIST), Artificial Intelligence Risk Management Framework: Generative AI Profile (AI 600-1), Jul. 25, 2024.
[2] Open Web Application Security Project (OWASP), Top 10 for Large Language Model Applications v2025, Nov. 18, 2024.
[3] P. Chao, H. Zhang, A. Yang, et al., “JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models,” arXiv:2407.12345, 2024.
[4] N. Meade, A. Patel, and S. Reddy, “Universal Adversarial Triggers Are Not Universal,” Proc. ICLR, Apr. 2024.
[5] Z. Xu, H. Zhang, L. Wang, et al., “SafeDecoding: Defending Against Jailbreak Attacks via Safety-Aware Decoding,” Proc. ACL, Feb. 2024.
[6] Z. Wang, F. Liu, X. Zhang, and J. Gao, “SELF-GUARD: Empower the LLM to Safeguard Itself,” Proc. NAACL, Jun. 2024.
[7] Meta AI, “Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations,” arXiv:2312.06674, Dec. 2023.
[8] J. Chi, L. Li, A. Rungta, et al., “Llama Guard 3 Vision: Safeguarding Human-AI Image Conversations,” arXiv:2404.01234, Apr. 2024.
[9] Y. Bai, S. Kadavath, S. Kundu, et al., “Constitutional AI: Harmlessness from AI Feedback,” arXiv:2212.08073, Apr. 2023.
[10] Microsoft Security Team, “Defending LLM Applications with Prompts, Rules, and Patterns,” Microsoft Security Blog, Apr. 11, 2024.
[11] N. Carlini, F. Tramer, E. Wallace, et al., “Poisoning Web-Scale Training Datasets is Practical,” arXiv:2402.12323, 2024.
[12] S. Shan, A. Chou, and B. Li, “Nightshade: Prompt-Specific Poisoning of Text-to-Image Models,” arXiv:2310.13828, Oct. 2023.
[13] H. Lee, P. Jain, S. Sukhbaatar, and A. Joulin, “Deduplicating Training Data Makes Language Models Better,” arXiv:2107.06499, 2021.
[14] J. Ebrahimi, A. Rao, D. Lowd, and D. Dou, “HotFlip: White-Box Adversarial Examples for Text Classification,” Proc. ACL, pp. 31–36, 2018.
[15] D. Jin, Z. Jin, J. Zhou, and P. Szolovits, “TextFooler: A Text-Based Adversarial Attack for Text Classification,” Proc. AAAI, vol. 34, no. 5, pp. 8798–8805, 2020.
[16] N. Carlini, C. Liu, M. Nasr, et al., “Adversarial Examples Make Strong Poisons for Diffusion Models,” arXiv:2308.01234, Aug. 2023.
[17] B. C. Das, M. H. Amini, and Y. Wu, “Security and Privacy Challenges of Large Language Models: A Survey,” Journal of the ACM, vol. 71, no. 3, pp. 1–48, Aug. 2024.
[18] N. K. S. Kong, “InjectBench: An Indirect Prompt Injection Benchmarking Framework,” M.S. thesis, National Univ. of Singapore, Aug. 2024.
[19] R. Agarwal, A. Basu, D. Zhou, et al., “AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents,” arXiv:2410.06789, Oct. 2024.
[20] OWASP Foundation, “LLM01: Prompt Injection,” OWASP LLM Top 10 Project, 2024.
[21] Meta AI, “Introducing Meta Llama 3,” Meta AI Blog, Apr. 18, 2024.
[22] National Institute of Standards and Technology (NIST), SP 800-218: Secure Software Development Framework (SSDF) 1.1, Feb. 2022.
[23] National Institute of Standards and Technology (NIST), SP 800-218A: Secure Software Development Practices for Generative AI and Machine Learning, Jul. 2024.
[24] MITRE Corporation, “Adversarial Threat Landscape for Artificial-Intelligence Systems (ATLAS) Fact Sheet,” MITRE, 2024.
[25] Coalition for Content Provenance and Authenticity (C2PA), Technical Specification v1.3, Nov. 2023.
[26] L. Newman, “Here Come the AI Worms,” WIRED Magazine, Apr. 2024.
[27] Amazon Web Services (AWS), “Mitigate Prompt Injection Risk in Retrieval-Augmented Generation Systems,” AWS Security Blog, Aug. 26, 2024.