
Last month AI security researchers from Robust Intelligence (now a part of Cisco) and the University of Pennsylvania subjected DeepSeek’s wildly popular R1 model to algorithmic jailbreaking techniques to test its safety mechanisms using 50 randomly selected prompts from the HarmBench dataset. (You can learn more about this dataset as well as other safety benchmarks included in the HELM Safety leaderboard using this filtered view of my AI Strategy app. The safety benchmarks are indicated with a 🚩.)

R1’s 0/50 Score
DeepSeek R1 exhibited a 100% attack success rate, meaning that of the 50 dangerous prompts it fed R1, the model failed to block a single harmful prompt. This contrasts starkly with other leading models, which demonstrated at least partial resistance.
Their team’s conclusion was that DeepSeek’s claimed cost-efficient training methods—including reinforcement learning, chain-of-thought self-evaluation, and distillation—may have compromised its safety mechanisms. This lack of robust guardrails makes it highly susceptible to algorithmic jailbreaking and potential misuse by dangerous players.
Breaking Down the HarmBench Evaluation
HarmBench classifies its dataset into seven semantic categories of behavior (source):
- Cybercrime & unauthorized intrusion
- Chemical & biological weapons/drugs
- Copyright violations
- Misinformation & disinformation
- Harassment & bullying
- Illegal activities
- General harm
Also, unlike many benchmarks where larger models often see better performance scores, the HarmBench team found no correlation between robustness of safety protocols and model size within model families when red teaming at scale. However, they did observe a substantial difference in robustness among model families, which suggests that procedures and data used during training are far more important than model size in determining robustness to jailbreaks.
DeepSeek’s Performance in the HELM Safety Leaderboard
What I thought was interesting was that, when I checked Stanford’s HELM Safety leaderboard for DeepSeek’s performance for the HarmBench benchmark, although it wasn’t the best score—coming in at a 65% compared to Claud 3.5 Sonnet’s 98%—it certainly wasn’t a 0.

So the first thing I do when I’m suspicious of a model’s score is start digging into leaderboards I know disclose when models self-report benchmark results (a practice I’ve flagged here and here) and one that identifies models it suspects of training their models on its publicly available validation dataset.
I’ve also reached out to a member of the HELM team to ask if they allow self-reporting by models. If they reply, I’ll include their response in this post.
Strike 1: LiveCodeBench Flag
I use ‘strike 1’ somewhat tongue in cheek. LiveCodeBench is clear to qualify that they highlight models they suspect of model contamination based on the date they submitted their model vis-a-vis the date their dataset was published as well as performance spikes their team deems suspicious. When they suspect model contamination they don’t assign a rank to that model. All three of DeepSeek’s models were flagged.
Obviously, we’re talking safety here, not code generation, but I use this leaderboard as more of a proxy to see if the model provider has a habit of potentially training its model on a benchmark’s publicly available validation dataset (or at minimum demonstrating behaviors that cause this suspicion in the LCB team).
Strike 2: MMMU Flag
Although I’m not wild about including benchmark results for self-reporting models, I do appreciate that the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) leaderboard at least discloses when models self-report their results. All three DeepSeek models self-reported their scores, as indicated by the asterisks next to their scores.
Although these indicators are in no way smoking guns that R1’s HarmBench score in the HELM leaderboard may misrepresent the models’ ability to ward off dangerous jailbreaking attempts and prompts, it is a possibility that the independent test results by the Cisco/UPenn team may be more reliable.
Adding Guardrails to Models Weak in Safety
Caution is warranted in relying on DeepSeek’s R1 models for app development, especially if you’re building an app in a highly regulated industry. At minimum, steps should be taken to add guardrails to prevent jailbreak attempts, mitigate harmful outputs, and ensure compliance with legal and ethical guidelines.
To safeguard your app against jailbreak and dangerous prompts, consider the following steps:
- Input filters: Implement strict rule-based filtering or AI-powered classifiers to flag risky queries before they reach the model.
- Content moderation: Apply a secondary filtering layer to analyze model outputs, ensuring they do not contain sensitive, biased, or harmful content before being displayed to users.
- User monitoring: Analyze behavioral patterns at the user level to detect suspicious activity, such as repeated attempts to jailbreak the model, and block those users and/or report illegal behavior.
- Fine-tuning: Instead of using the base model directly, fine-tune it on safer, domain-specific datasets—and/or use retrieval augmented generation (RAG) to provide vetted responses sourced from more reliable data.
- Governance: Align model outputs with regulatory requirements, such as GDPR, HIPAA, FERPA, or financial compliance laws, by ensuring that private or sensitive data is never generated or leaked.
- Adversarial testing: Continuously test the model with adversarial attacks to identify vulnerabilities and improve defenses against exploitation.
- Human-in-the-Loop review: In particularly high-risk scenarios, consider adding a human moderation step before delivering responses to topics that are deemed high risk.
If you are absolute resolute in using DeepSeek’s R1 model, you can at least proactively employ safeguards like these to mitigate some of these safety risks associated and ensure that your appls remain compliant, secure, and trustworthy.
Leave a Reply