First LLM Benchmark Provides Vendors and SOC Teams Needed Guidance to Select the Best LLM -- Security Today

Jul 31, 2025

Simbian recently introduced the first benchmark to comprehensively measure LLM performance in security operations centers (SOCs). It tests LLMs against a diverse range of real alerts and fundamental SOC tools across all phases of alert investigation, from alert ingestion to disposition and reporting.

Existing benchmarks compare LLMs over broad criteria such as language understanding, math, and reasoning. Some benchmarks exist for broad security tasks or very basic SOC tasks like alert summarization.

For the first time, the industry has a comprehensive benchmarking framework specifically designed for AI SOC. What makes this different from existing cybersecurity benchmarks like CyberSecEval or CTIBench? The key lies in its realism and depth.

Forget generic scenarios. Simbian’s benchmark is built on the autonomous investigation of 100 full kill-chain scenarios that realistically mirror what human SOC analysts face every day. To achieve this, the company created diverse attack scenarios based on real-world activity, each with known ground truth of malicious activity, so that AI agents can investigate and be assessed against a clear baseline.

The scenarios are based on the historical behavior of well-known APT groups (such as APT32, APT38, and APT43) and cybercriminal organizations (Cobalt Group, Lapsus$). They cover a wide range of MITRE ATT&CK™ Tactics and Techniques, with a focus on prevalent threats like ransomware and phishing.

To run these tests, Simbian used its own AI SOC agent. The evaluation is evidence-grounded and data-driven, and it always starts from a triggered alert, mirroring the way SOC analysts work. The AI agent then had to determine whether the alert was a true or false positive, find evidence of malicious activity (think “CTF flags” in a red teaming exercise), correctly interpret that evidence by answering security-contextual questions, provide a high-level overview of the attacker’s activity, and respond to the threat appropriately, all autonomously. This evidence-based approach is critical for managing hallucinations: it ensures the LLM isn’t just guessing but validating its reasoning and actions.
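The article does not publish Simbian's scoring formula, so the following is only a minimal sketch of how an investigation could be graded against a scenario's ground truth. The class names, field names, exact-match answer check, and equal weighting of the three parts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GroundTruth:
    """Known facts about one scenario (all field names are hypothetical)."""
    is_true_positive: bool
    flags: set[str]            # evidence items known to exist in the scenario
    answers: dict[str, str]    # question id -> expected answer


@dataclass
class AgentReport:
    """What the agent submitted after investigating the alert."""
    verdict_true_positive: bool
    flags_found: set[str] = field(default_factory=set)
    answers: dict[str, str] = field(default_factory=dict)


def score_investigation(truth: GroundTruth, report: AgentReport) -> float:
    """Return a score in [0, 1]; the three parts are weighted equally (assumption)."""
    verdict = 1.0 if report.verdict_true_positive == truth.is_true_positive else 0.0

    # Only flags present in the ground truth count, so invented evidence earns nothing.
    if truth.flags:
        flag_recall = len(report.flags_found & truth.flags) / len(truth.flags)
    else:
        flag_recall = 1.0

    # Case-insensitive exact match is a simplification of real answer grading.
    correct = sum(
        1
        for qid, expected in truth.answers.items()
        if report.answers.get(qid, "").strip().lower() == expected.strip().lower()
    )
    answer_accuracy = correct / len(truth.answers) if truth.answers else 1.0

    return (verdict + flag_recall + answer_accuracy) / 3


truth = GroundTruth(True, {"flag-a", "flag-b"}, {"q1": "powershell"})
report = AgentReport(True, {"flag-a", "bogus"}, {"q1": "PowerShell"})
print(round(score_investigation(truth, report), 3))  # → 0.833
```

In this sketch the "bogus" flag earns no credit because it is absent from the ground truth, which is the sense in which an evidence-grounded evaluation penalizes guessing.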

Surprising Findings and Key Takeaways

The study benchmarked some of the best-known, highest-performing models available as of May 2025, including models from Anthropic, OpenAI, Google, and DeepSeek.

The Results:

All high-end models completed more than half of the investigation tasks, with scores ranging from 61% to 67%. For reference, during the first AI SOC Championship the best human analysts powered by AI SOC scored between 73% and 85%, while Simbian’s AI Agent at extra effort settings reached 72%. This suggests that LLMs are capable of much more than summarizing and retrieving data; their capabilities extend to robust alert triage and tool use via API interactions.

Another key finding: AI SOC applications heavily lean on the software engineering capabilities of LLMs. This highlights the importance of thorough prompt engineering and agentic flow engineering, which involves feedback loops and continuous monitoring. In initial runs, some models struggled, requiring improved prompts and fallback mechanisms for coding agents involved in analyzing retrieved data.
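The article does not describe how these feedback loops and fallback mechanisms were built. As one hedged illustration of the general pattern, the sketch below retries a task with the validation error fed back into the prompt, then falls back to the next model if retries run out. The function name, the `validate` callback, and the prompt wording are hypothetical, and each model is represented by a plain callable rather than any real LLM API.

```python
from typing import Callable, Optional


def run_with_fallback(
    task: str,
    models: list[Callable[[str], str]],
    validate: Callable[[str], Optional[str]],
    max_attempts: int = 2,
) -> str:
    """Try each model in order, feeding validation errors back into the prompt."""
    for model in models:
        prompt = task
        for _ in range(max_attempts):
            output = model(prompt)
            error = validate(output)  # None means the output passed the check
            if error is None:
                return output
            # Feedback loop: tell the model what went wrong and retry.
            prompt = f"{task}\n\nPrevious attempt failed: {error}\nPlease fix it."
    # Every model exhausted its attempts without producing valid output.
    raise RuntimeError("all models failed validation")
```

A caller would pass the primary model first and a backup second, with `validate` performing whatever check fits the task, for example confirming that generated analysis code runs without raising.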

It should be noted that Sonnet 3.5 sometimes outperformed newer versions like Sonnet 3.7 and 4.0. This could be due to “catastrophic forgetting,” where further domain specialization for software engineering or instruction following might detrimentally affect cybersecurity knowledge and investigation planning. This underscores the critical need for benchmarking to evaluate the fine-tuning of LLMs for specific domains.

The study also found that “thinking models” (those that use post-training techniques and often involve internal self-talk) didn’t show a considerable advantage in AI SOC applications, with all tested models demonstrating comparable performance. This mirrors findings from studies on software bug fixing and red team CTF applications, which suggest that once LLMs hit a certain capability ceiling, additional inference yields only marginal improvements, often at a higher cost. This points to the necessity of human-validated LLM applications in AI SOC and the continued development of fine-grained, specialized benchmarks for improving cybersecurity domain-focused reasoning.

Full details and results of the benchmark are available from Simbian. The company plans to use the benchmark to evaluate new foundation models on a regular basis and to share its findings with the public.
