★ 7/10 · General · 2026-05-01

GPT-5.5 matches heavily hyped Mythos Preview in new cybersecurity tests

Summary

Recent evaluations by the UK AI Security Institute (AISI) indicate that OpenAI's GPT-5.5 has cybersecurity capabilities comparable to Anthropic's Mythos Preview. GPT-5.5 matched the restricted-access model on complex tasks involving reverse engineering, web exploitation, and multi-step network attacks.

Key Points

  • GPT-5.5 achieved a 71.4% success rate on "Expert" level Capture the Flag (CTF) tasks, compared to 68.6% for Mythos Preview.
  • In a specific task requiring the construction of a disassembler to decode a Rust binary, GPT-5.5 completed the challenge in 10 minutes and 22 seconds with an API cost of $1.73.
  • On the "The Last Ones" (TLO) simulation, a 32-step data extraction attack on a corporate network, GPT-5.5 succeeded in 3 out of 10 attempts, while Mythos Preview succeeded in 2 out of 10.
  • Both models failed the "Cooling Tower" simulation, which tests the ability to disrupt control software for power plants.
  • The evaluation utilized 95 different CTF challenges covering cryptography, web exploitation, and reverse engineering.

Technical Details

AISI benchmarked the models on 95 distinct Capture the Flag (CTF) challenges covering cryptography, web exploitation, and reverse engineering. In the highest-difficulty "Expert" category, the gap between GPT-5.5 (71.4%) and Mythos Preview (68.6%) falls within the margin of error. In one notable task, GPT-5.5 built a disassembler to decode a Rust binary without human assistance, finishing in 10 minutes and 22 seconds at an API cost of $1.73.

In the multi-step attack simulations, the models were tested on "The Last Ones" (TLO) range, a 32-step data extraction sequence within a simulated corporate network. GPT-5.5 posted a slightly higher success rate (3 of 10 attempts vs. 2 of 10 for Mythos Preview), a difference too small to be conclusive at that sample size. Both models failed to complete the "Cooling Tower" simulation, which focuses on disrupting industrial control software.
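To see why a 3-of-10 versus 2-of-10 result is hard to separate, here is a minimal sketch that computes an approximate 95% confidence interval for each TLO success rate. The Wilson score interval is an assumption chosen for illustration; nothing in the report summary says AISI used this method, and the function and variable names are hypothetical. Only the counts (3/10 and 2/10) come from the text above.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to roughly 95% confidence. This is an illustrative
    choice, not the method used in the AISI evaluation.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Counts reported in the text: successes out of 10 TLO attempts per model.
gpt_low, gpt_high = wilson_interval(3, 10)
mythos_low, mythos_high = wilson_interval(2, 10)

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = gpt_low <= mythos_high and mythos_low <= gpt_high

print(f"GPT-5.5:        {gpt_low:.2f} to {gpt_high:.2f}")
print(f"Mythos Preview: {mythos_low:.2f} to {mythos_high:.2f}")
print(f"Intervals overlap: {intervals_overlap}")
```

Under these assumptions the two intervals overlap across most of their range, which is consistent with treating the TLO results as roughly equivalent rather than as a clear win for either model.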

Impact / Why It Matters

The parity in cybersecurity performance between publicly available models and restricted, "partner-only" models suggests that advanced automated exploitation and reverse engineering capabilities are increasingly accessible. Developers and security engineers should prepare for more sophisticated automated threats in areas such as web exploitation and binary analysis.

AI cybersecurity LLM