Security and Adversarial Attacks on Natural Language Processing Models

Shiv Shankar Dwivedi

doi:10.22178/acta.27.2.05

Keywords:

Natural Language Processing, Adversarial Attacks, Model Security, Robustness, Jailbreaking, Defense Mechanisms, Machine Learning Security

Abstract

Natural Language Processing (NLP) models have achieved remarkable performance across diverse applications, from sentiment analysis to machine translation. However, these systems are vulnerable to adversarial attacks that manipulate their behavior through carefully crafted input perturbations. This study surveys the landscape of adversarial attacks on NLP models, analyzing attack methodologies, defense mechanisms, and security implications. Through systematic analysis of current research, we identify three primary attack vectors: character-level manipulations, word-level substitutions, and semantic-level transformations. The survey also highlights the fragility of advanced deep neural networks in NLP and the challenges involved in defending them. Our analysis reveals that adversarial jailbreaks, which coax large language models (LLMs) into overriding their safety guardrails, pose significant risks to model deployment. We propose a taxonomy of defense strategies that includes adversarial training, perturbation control, and certification-based approaches. The findings indicate that while robust defense mechanisms exist, the evolving nature of adversarial attacks necessitates continuous research into more adaptive and comprehensive security measures for NLP systems.