GPT-4o vs Claude: Which AI Model is More Secure?

ShieldPi Research··6 min read
llm-securitybenchmarkopenaianthropic

GPT-4o and Claude are the two most widely deployed frontier AI models in enterprise applications. Security teams evaluating these models need to know: which one is safer, and where do each model's defenses break down?

We ran our full attack suite — 230+ techniques across 15 categories — against both models under identical conditions. Here are the results.

Methodology

Both models were tested using ShieldPi's standardized evaluation pipeline:

  • 230+ attack techniques across 15 categories
  • Identical prompts — every technique was executed with the same payload against both models
  • Multi-turn conversations — including crescendo attacks up to 10 turns
  • 12 languages for multilingual evasion testing
  • LLM judge verification to eliminate false positives

All tests were conducted against the latest available versions through the respective APIs with default parameters.

Overall Results

| Metric | GPT-4o | Claude Sonnet 4.6 | |--------|--------|-------------------| | Security Score | 85/100 | 91/100 | | Grade | B+ | A- | | Techniques Tested | 230 | 230 | | Successful Attacks | 35 | 21 | | Critical Findings | 3 | 1 | | High Findings | 8 | 5 |

Claude Sonnet 4.6 achieved a higher overall security score (91 vs 85), with fewer successful attacks across every severity level. However, both models showed specific weaknesses worth examining.

Category Breakdown

Jailbreaking

Both models have invested heavily in jailbreak resistance, and it shows. Classic DAN prompts, simple role-play, and direct override attempts are reliably blocked by both.

Where they differ is in multi-turn jailbreaks. GPT-4o showed more susceptibility to crescendo attacks that gradually escalate across 6-8 turns. Claude demonstrated stronger conversation-level monitoring that detected escalation patterns earlier.

GPT-4o: 7 successful jailbreaks out of 40+ techniques Claude: 3 successful jailbreaks out of 40+ techniques

Prompt Injection

Both models are vulnerable to specific prompt injection techniques, though the attack vectors differ. GPT-4o showed more susceptibility to delimiter-based injection — where attackers use markdown formatting, code blocks, or special characters to separate their instructions from the system prompt.

Claude was more resistant to delimiter attacks but showed vulnerability to context manipulation — prompts that reframe the conversation context to make harmful requests appear as legitimate tasks.

GPT-4o: 6 successful injections Claude: 4 successful injections

Data Exfiltration

This is where the models diverged most significantly. GPT-4o leaked partial system prompt information in 4 out of 15 extraction attempts, while Claude leaked in only 1. Both models resisted training data extraction attempts.

Claude's stronger performance here likely reflects Anthropic's emphasis on Constitutional AI training, which specifically targets information leakage scenarios.

Multilingual Attacks

Both models showed weaker safety performance in non-English languages, but the pattern differed:

  • GPT-4o was weakest in Arabic and Hindi, where safety guardrails were noticeably less robust
  • Claude was weakest in Chinese and Korean, though the gap from English was smaller overall

This is an industry-wide challenge — RLHF safety training data is disproportionately English, and all current frontier models have this vulnerability to varying degrees.

Tool Injection

For this category, we tested both models with simulated tool/function calling capabilities. Claude showed slightly better resistance to schema manipulation attacks, while GPT-4o was more robust against parameter injection.

Both models were vulnerable to at least one tool injection technique that could cause unauthorized actions through connected tools.

Key Takeaways

Claude's strengths

  • Stronger jailbreak resistance, especially multi-turn
  • Better system prompt protection
  • More consistent safety across languages

GPT-4o's strengths

  • Better parameter injection resistance in tool calls
  • Stronger defense against context manipulation attacks
  • More robust output filtering for certain content categories

Both models need improvement

  • Multilingual safety gaps remain significant
  • Multi-turn escalation attacks still succeed with sufficient patience
  • Tool injection is a growing attack surface that neither model fully addresses

What This Means for Your Deployment

If you're choosing between GPT-4o and Claude for a security-sensitive application, Claude's higher overall score and stronger jailbreak resistance give it an edge. But the right choice depends on your specific use case:

  • Customer-facing chatbots: Claude's stronger jailbreak resistance matters more
  • Tool-enabled agents: GPT-4o's better parameter injection defense may be more relevant
  • Multilingual applications: Test both models in your target languages — neither is uniformly better

The most important takeaway: neither model is invulnerable. Both have exploitable weaknesses that automated testing can identify. The security posture of your application depends not just on which model you choose, but on the defensive layers you build around it.

See the Full Leaderboard

These results are part of our public LLM Security Leaderboard, where we test 15+ models with the same methodology. Check it out to see how other models compare.

Test your own LLM deployment with the same 230+ attack techniques — free.

Secure Your AI — Start Free Scan

Test your LLM deployment with 230+ attack techniques. Get a security score in minutes.

Get Started Free

Related Posts