First LLM Benchmark Provides Vendors and SOC Teams Needed Guidance to Select the Best LLM

Recently, Simbian introduced the first benchmark to comprehensively measure LLM performance in SOCs, evaluating LLMs against a diverse range of real alerts and fundamental SOC tools across all phases of alert investigation, from alert ingestion to disposition and reporting.

Existing benchmarks compare LLMs on general criteria such as language understanding, math, and reasoning. Some benchmarks exist for broad security tasks or for very basic SOC tasks like alert summarization.

For the first time ever, the industry has a comprehensive benchmarking framework specifically designed for AI SOC. What makes this different from existing cybersecurity benchmarks like CyberSecEval or CTIBench? The key lies in its realism and depth.

Forget generic scenarios. Simbian’s benchmark is built on the autonomous investigation of 100 full kill-chain scenarios that realistically mirror what human SOC analysts face every day. To achieve this, the company created diverse, real-world-based attack scenarios with known ground truth of malicious activity, allowing AI agents to investigate and be assessed against a clear baseline.

These scenarios are even based on historical behavior of well-known APT groups (like APT32, APT38, APT43) and cybercriminal organizations (Cobalt Group, Lapsus$), covering a wide range of MITRE ATT&CK™ Tactics and Techniques, with a focus on prevalent threats like ransomware and phishing.
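The benchmark’s internal scenario format isn’t published; as a rough illustration only, a ground-truth scenario of this kind could be represented with a minimal Python sketch like the one below. The field names and example values are assumptions, not Simbian’s schema.

```python
from dataclasses import dataclass, field

@dataclass
class KillChainScenario:
    """Hypothetical representation of one benchmark scenario with known ground truth."""
    name: str                                                    # short scenario label
    verdict: str                                                 # ground truth: "true_positive" or "false_positive"
    attack_techniques: list[str] = field(default_factory=list)   # MITRE ATT&CK technique IDs
    evidence_flags: list[str] = field(default_factory=list)      # artifacts the agent must find ("CTF flags")

# Illustrative example values only, not drawn from the published benchmark
scenario = KillChainScenario(
    name="phishing-to-ransomware",
    verdict="true_positive",
    attack_techniques=["T1566.001", "T1486"],
    evidence_flags=["macro-enabled attachment", "mass file encryption on host"],
)
```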

To facilitate this rigorous LLM testing, Simbian leveraged its AI SOC agent. The evaluation process is evidence-grounded and data-driven, always kicking off with a triggered alert, realistically representing the way SOC analysts operate. The AI agent then had to determine whether the alert was a True or False Positive, find evidence of malicious activity (think “CTF flags” in a red teaming exercise), correctly interpret that evidence by answering security-contextual questions, provide a high-level overview of the attacker’s activity, and respond to the threat appropriately – all autonomously. This evidence-based approach is critical for managing hallucinations: it ensures the LLM isn’t just guessing but is validating its reasoning and actions.
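Building on the scenario sketch above, the snippet below gives a hedged illustration of how such an investigation might be scored against ground truth; the agent-result keys and the equal weighting of tasks are assumptions for illustration, not the benchmark’s actual rubric.

```python
def score_investigation(scenario: KillChainScenario, agent_result: dict) -> float:
    """Illustrative scoring sketch: compare an agent's output with scenario ground truth.
    The agent_result keys ("verdict", "findings") are assumptions, not the benchmark schema."""
    points, total = 0, 0

    # Task 1: did the agent classify the alert correctly (True vs. False Positive)?
    total += 1
    if agent_result["verdict"] == scenario.verdict:
        points += 1

    # Remaining tasks: how many of the planted evidence "flags" did the agent surface?
    for flag in scenario.evidence_flags:
        total += 1
        if any(flag in finding for finding in agent_result["findings"]):
            points += 1

    return points / total  # fraction of investigation tasks completed
```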

Surprising Findings and Key Takeaways

Benchmarking some of the most well-known and high-performing models available as of May 2025, the study included models from Anthropic, OpenAI, Google, and DeepSeek.

The Results:

All high-end models were able to complete over half of the investigation tasks, with performance ranging from 61% to 67%. For reference, during the first AI SOC Championship the best human analysts, powered by AI SOC, scored between 73% and 85%, while Simbian’s AI Agent at extra-effort settings reached 72%. This suggests that LLMs are capable of much more than just summarizing and retrieving data; their capabilities extend to robust alert triage and tool use via API interactions.

Another key finding: AI SOC applications heavily lean on the software engineering capabilities of LLMs. This highlights the importance of thorough prompt engineering and agentic flow engineering, which involves feedback loops and continuous monitoring. In initial runs, some models struggled, requiring improved prompts and fallback mechanisms for coding agents involved in analyzing retrieved data.
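As a rough sketch of such a fallback mechanism (not Simbian’s implementation), the loop below asks a model to generate analysis code for retrieved data, executes it, and feeds any error back into the next prompt. Both `llm` and `run_sandboxed` are hypothetical callables supplied by the caller, not a specific vendor API.

```python
def analyze_with_fallback(llm, run_sandboxed, data: str, max_attempts: int = 3):
    """Hedged sketch of a coding-agent feedback loop with a fallback.

    llm:           callable taking a prompt string and returning generated Python code
    run_sandboxed: callable executing code in isolation, returning (ok, output)
    """
    prompt = (
        "Write Python that summarizes indicators of compromise "
        f"in the following retrieved data:\n{data}"
    )
    for _ in range(max_attempts):
        code = llm(prompt)                # model writes analysis code
        ok, output = run_sandboxed(code)  # execute in an isolated sandbox
        if ok:
            return output                 # success: use the analysis result
        # Feedback loop: surface the error so the next attempt can repair the code
        prompt += f"\n\nThe previous attempt failed with:\n{output}\nFix the code and try again."
    return None  # give up and fall back to a non-coding path (e.g., direct summarization)
```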

It should be noted that Sonnet 3.5 sometimes outperformed newer versions like Sonnet 3.7 and Sonnet 4. This could be due to “catastrophic forgetting,” where further domain specialization for software engineering or instruction following might detrimentally affect cybersecurity knowledge and investigation planning. This underscores the critical need for benchmarking when fine-tuning LLMs for specific domains.

The study also found that “thinking models” (those that use post-training reasoning techniques and often engage in internal self-talk) didn’t show a considerable advantage in AI SOC applications, with all tested models demonstrating comparable performance. This resembles findings from studies on software bug fixing and red-team CTF applications, which suggest that once LLMs hit a certain capability ceiling, additional inference yields only marginal improvements, often at higher cost. This points to the necessity of human-validated LLM applications in AI SOC and the continued development of fine-grained, specialized benchmarks for improving cybersecurity domain-focused reasoning.

For the full details and results of the benchmarking, click here. Simbian looks forward to leveraging this benchmark to evaluate new foundation models on a regular basis, with plans to share findings with the public.
