Model developed in Ning Zhang lab recognized by Mozilla

The model, PIGuard, outperformed other models

Beth Miller  
Ning Zhang

Generative artificial intelligence tools, such as large language models like ChatGPT, are used nearly every day, yet they aren’t completely secure. Prompt injection attacks, where an attacker uses deceptive text to manipulate the outputs, are a risk that can change the model’s goals or cause data leaks.

Mozilla AI recently called out the PIGuard model developed in the lab of Ning Zhang, associate professor of computer science & engineering in the McKelvey School of Engineering at Washington University in St. Louis, Chaowei Xiao at Johns Hopkins University, and collaborators for taking the top spot among all models they tested in a large experiment looking at open source guard rails and agentic systems.

PIGuard was published in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July 27-Aug. 1, 2025.

Zhang and collaborators recently received an email from John Dickerson, CEO of Mozilla AI, sharing the news.

“Our experiments found that PIGuard was an effective indirect prompt injection detection model against the email and tabular portions of the BIPIA dataset,” Dickerson wrote. “Moreover, we found that detecting function call malfunctions is still a hard task for customizable judges to tackle, a gap in the field that we hope closes in the coming months.” 

Zhang’s team, which included Hao Li, a doctoral student in his lab, found that while there are defenses in place, like prompt guard models, these can be too aggressive and mistakenly identify harmless inputs as threats due to certain trigger words. To tackle this problem, researchers created NotInject, a dataset designed to test how often these guard models incorrectly flag safe inputs. This dataset has 339 examples that include common trigger words used in attacks, allowing for detailed testing. 

Their study showed that current models often make mistakes, performing barely better than random guessing. To improve this, they developed PIGuard, which uses a new training method called Mitigating Over-defense for Free (MOF) to reduce bias against trigger words.

“Beyond demonstrating strong performance and robustness, we are really excited to release PIGuard as one of the first fully open-source safeguards against prompt injection attacks, complete with training data, code, and models,” Zhang said. “Our hope is that this work provides a solid foundation and helps cultivate an open, collaborative ecosystem for advancing AGI safety.” 


Mozilla’s code for its experiments can be found here: https://github.com/mozilla-ai/bir_guardrail_experiments

Click on the topics below for more stories in those areas

Back to News