First LLM Benchmark Provides Vendors and SOC Teams With the Guidance Needed to Select the Best LLM

Recently, Simbian introduced the first benchmark to comprehensively measure LLM performance in SOCs, evaluating LLMs against a diverse range of real alerts and fundamental SOC tools across all phases of alert investigation, from alert ingestion to disposition and reporting.

Existing benchmarks compare LLMs on broad criteria such as language understanding, math, and reasoning. A few cover general security tasks or very basic SOC tasks like alert summarization.

For the first time ever, the industry has a comprehensive benchmarking framework specifically designed for AI SOC. What makes this different from existing cybersecurity benchmarks like CyberSecEval or CTIBench? The key lies in its realism and depth.

Forget generic scenarios. Simbian’s benchmark is built on the autonomous investigation of 100 full kill-chain scenarios that realistically mirror what human SOC analysts face every day. To achieve this, the company created diverse, real-world-based attack scenarios with known ground truth of malicious activity, allowing AI agents to investigate and be assessed against a clear baseline.

These scenarios are even based on historical behavior of well-known APT groups (like APT32, APT38, APT43) and cybercriminal organizations (Cobalt Group, Lapsus$), covering a wide range of MITRE ATT&CK™ Tactics and Techniques, with a focus on prevalent threats like ransomware and phishing.
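As a rough illustration of how such a scenario might be encoded, the sketch below pairs a triggering alert’s ground-truth disposition with the MITRE ATT&CK techniques and evidence artifacts an agent is expected to recover. The `Scenario` structure and its field names are illustrative assumptions, not Simbian’s actual schema.

```python
# Hypothetical sketch of a benchmark scenario record; the structure and field
# names are assumptions for illustration, not Simbian's actual schema.
from dataclasses import dataclass, field


@dataclass
class Scenario:
    name: str                      # e.g. the emulated threat actor and campaign
    mitre_techniques: list[str]    # ATT&CK technique IDs covered by the kill chain
    is_true_positive: bool         # ground-truth disposition of the triggering alert
    evidence_flags: list[str]      # "CTF-style" artifacts the agent must surface
    context_questions: dict[str, str] = field(default_factory=dict)  # question -> expected answer


ransomware_case = Scenario(
    name="APT38-style ransomware intrusion",
    mitre_techniques=["T1566.001", "T1059.001", "T1486"],
    is_true_positive=True,
    evidence_flags=["phishing attachment hash", "encoded PowerShell loader"],
    context_questions={"initial access vector": "spearphishing attachment"},
)
```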

To facilitate this rigorous LLM testing, Simbian leveraged their AI SOC agent. The evaluation process is evidence-grounded and data-driven, always kicking off with a triggered alert, realistically representing the way SOC analysts operate. The AI agent then needed to determine whether the alert was a True or False Positive, find evidence of malicious activity (think “CTF flags” in a red teaming exercise), correctly interpret that evidence by answering security-contextual questions, provide a high-level overview of the attacker’s activity, and respond to the threat in an appropriate manner, all autonomously. This evidence-based approach is critical for managing hallucinations: it ensures the LLM isn’t just guessing but is validating its reasoning and actions.
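A minimal sketch of what evidence-grounded scoring could look like, reusing the hypothetical Scenario record above, is shown below. The equal weighting and result fields are assumptions made for illustration, not the benchmark’s actual rubric.

```python
# Hypothetical evidence-grounded scoring harness; the weighting and field
# names are illustrative assumptions, not the benchmark's actual rubric.
def score_investigation(agent_result: dict, scenario: Scenario) -> float:
    """Score one autonomous investigation against known ground truth."""
    # 1. Disposition: did the agent call the alert True/False Positive correctly?
    disposition_ok = agent_result["is_true_positive"] == scenario.is_true_positive

    # 2. Evidence: fraction of ground-truth "flags" the agent actually surfaced,
    #    which keeps the score grounded in artifacts rather than free-form claims.
    found = sum(1 for flag in scenario.evidence_flags
                if flag in agent_result.get("evidence", []))
    evidence_recall = found / len(scenario.evidence_flags) if scenario.evidence_flags else 1.0

    # 3. Interpretation: agreement on the security-contextual questions.
    answers = agent_result.get("answers", {})
    correct = sum(1 for q, expected in scenario.context_questions.items()
                  if answers.get(q, "").strip().lower() == expected.lower())
    qa_accuracy = correct / len(scenario.context_questions) if scenario.context_questions else 1.0

    # Equal weighting of the three components is purely an assumption of this sketch.
    return (disposition_ok + evidence_recall + qa_accuracy) / 3
```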

Surprising Findings and Key Takeaways

Benchmarking some of the most well-known and high-performing models available as of May 2025, the study included models from Anthropic, OpenAI, Google, and DeepSeek.

The Results:

All high-end models were able to complete over half of the investigation tasks, with performance ranging from 61% to 67%. For reference, during the first AI SOC Championship the best human analysts powered by AI SOC scored in the range of 73% to 85%, while Simbian’s AI Agent at extra-effort settings scored 72%. This suggests that LLMs are capable of much more than just summarizing and retrieving data; their capabilities extend to robust alert triage and tool use via API interactions.

Another key finding: AI SOC applications heavily lean on the software engineering capabilities of LLMs. This highlights the importance of thorough prompt engineering and agentic flow engineering, which involves feedback loops and continuous monitoring. In initial runs, some models struggled, requiring improved prompts and fallback mechanisms for coding agents involved in analyzing retrieved data.
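To make the point about feedback loops and fallback mechanisms concrete, here is a minimal, hypothetical retry wrapper around a coding agent that analyzes retrieved data. The names run_coding_agent, looks_valid, and fallback are placeholders for whatever generation, validation, and deterministic analysis steps a real pipeline would use.

```python
# Hypothetical feedback loop with a fallback path for a coding agent that
# analyzes retrieved data; run_coding_agent / looks_valid / fallback are placeholders.
from typing import Callable


def analyze_with_fallback(
    task: str,
    run_coding_agent: Callable[[str, str], str],   # (task, feedback) -> generated analysis code
    looks_valid: Callable[[str], bool],            # cheap validator, e.g. a lint or test run
    fallback: Callable[[str], str],                # simpler deterministic analysis path
    max_attempts: int = 3,
) -> str:
    feedback = ""
    for attempt in range(max_attempts):
        result = run_coding_agent(task, feedback)
        if looks_valid(result):
            return result
        # Feed the failure back into the next attempt instead of retrying blind.
        feedback = f"attempt {attempt + 1} failed validation; revise the approach"
    # If the model keeps struggling, fall back to a non-LLM path rather than
    # letting the investigation stall.
    return fallback(task)
```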

It should be noted that Sonnet 3.5 sometimes outperformed newer versions like Sonnet 3.7 and 4.0. This could be due to “catastrophic forgetting,” where further domain specialization for software engineering or instruction following might detrimentally affect cybersecurity knowledge and investigation planning. This underscores the critical need for benchmarking to evaluate the fine-tuning of LLMs for specific domains.

The study also found that “thinking models” (those that use post-training techniques and often engage in internal self-talk) didn’t show a considerable advantage in AI SOC applications, with all tested models demonstrating comparable performance. This resembles findings from studies on software bug fixing and red-team CTF applications, which suggest that once LLMs hit a certain capability ceiling, additional inference leads to only marginal improvements, often at a higher cost. This points to the necessity of human-validated LLM applications in AI SOC and the continued development of fine-grained, specialized benchmarks for improving cybersecurity domain-focused reasoning.

For the full details and results of the benchmarking, click here. Simbian looks forward to leveraging this benchmark to evaluate new foundation models on a regular basis, with plans to share findings with the public.
