Silent Brothers | Ollama Hosts Form Anonymous AI Network Beyond Platform Guardrails

Executive Summary

  • A joint research project between SentinelLABS and Censys reveals that open-source AI deployment has created an unmanaged, publicly accessible layer of AI compute infrastructure spanning 175,000 hosts worldwide, operating outside the guardrails and monitoring systems that platform providers implement by default.
  • Over 293 days of scanning, we identified 7.23 million observations across 130 countries, with a persistent core of 23,000 hosts generating the majority of activity.
  • Nearly half of observed hosts are configured with tool-calling capabilities that enable them to execute code, access APIs, and interact with external systems, demonstrating the increasing integration of LLMs into larger system processes.
  • Hosts span cloud and residential networks globally, but overwhelmingly run the same handful of AI models in identical formats, creating a brittle monoculture.
  • The residential nature of much of the infrastructure complicates traditional governance and requires new approaches that distinguish between managed cloud deployments and distributed edge infrastructure.

Background

Ollama is an open-source framework that enables users to run large language models locally on their own hardware. By design, the service binds to localhost at 127.0.0.1:11434, making instances accessible only from the host machine. However, exposing Ollama to the public internet requires only a single configuration change: setting the service to bind to 0.0.0.0 or a public interface. At scale, these individual deployment decisions aggregate into a measurable public surface.
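
That exposure is easy to verify from the outside. Below is a minimal sketch, assuming a host you are authorized to assess: an unauthenticated GET to the /api/tags endpoint on the default port returns the instance's model inventory when the service is bound to a public interface.

```python
import requests


def list_exposed_models(host: str, port: int = 11434, timeout: float = 5.0) -> list[str]:
    """Return the model names an Ollama instance advertises, or [] if unreachable.

    Only probe hosts you own or are explicitly authorized to assess.
    """
    try:
        # /api/tags is Ollama's unauthenticated model-inventory endpoint
        resp = requests.get(f"http://{host}:{port}/api/tags", timeout=timeout)
        resp.raise_for_status()
        return [m.get("name", "") for m in resp.json().get("models", [])]
    except (requests.RequestException, ValueError):
        return []


if __name__ == "__main__":
    # 203.0.113.10 is a documentation address used as a placeholder, not a real target
    models = list_exposed_models("203.0.113.10")
    print(models if models else "not reachable or not publicly exposed")
```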

Over the past year, as open-weight models have proliferated and local deployment frameworks have matured, we observed growing discussion in security communities about the implications of this trend. Unlike platform-hosted LLM services with centralized monitoring, access controls, and abuse prevention mechanisms, self-hosted instances operate outside emerging AI governance boundaries. To understand the scope and characteristics of this emerging ecosystem, SentinelLABS partnered with Censys to scan and map internet-reachable Ollama deployments.

Our research aimed to answer several questions: How large is the public exposure? Where do these hosts reside? What models and capabilities do they run? And critically, what are the security implications of a distributed, unmanaged layer of AI compute infrastructure?

The Exposed Ecosystem | Scale and Structure

Our scanning infrastructure recorded 7.23 million observations from 175,108 unique Ollama hosts across 130 countries and 4,032 autonomous system numbers (ASNs). The raw numbers suggest a substantial public surface, but the distribution of activity reveals a more nuanced picture.

The ecosystem is bimodal. A large layer of transient hosts sits atop a smaller, persistent backbone that accounts for the majority of observable activity. These transient hosts appear briefly and then disappear. Hosts that appear in more than 100 observations represent just 13% of the unique host population, yet they generate nearly 76% of all observations. Conversely, hosts observed exactly once constitute 36% of unique hosts but contribute less than 1% of total observations.
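
The skew is easy to reproduce from per-host observation counts. A minimal sketch with pandas, using a hypothetical observations table (one row per host sighting) rather than our actual dataset:

```python
import pandas as pd

# Hypothetical input: one row per (host_ip, scan_timestamp) observation.
obs = pd.read_csv("ollama_observations.csv")  # placeholder filename

per_host = obs.groupby("host_ip").size()

backbone = per_host[per_host > 100]   # persistent hosts, seen in >100 scans
one_shot = per_host[per_host == 1]    # transient hosts, seen exactly once

print(f"backbone: {len(backbone) / len(per_host):.1%} of hosts, "
      f"{backbone.sum() / per_host.sum():.1%} of observations")
print(f"one-shot: {len(one_shot) / len(per_host):.1%} of hosts, "
      f"{one_shot.sum() / per_host.sum():.1%} of observations")
```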

This persistence skew shapes the rest of our analysis. It’s why model rankings stay stable even as the host population grows, why the host counts look residential while the always-on endpoints behave more like cloud services, and why most of the security risk sits in a smaller subset of exposed systems.

Given this skew, persistent hosts that remain reachable across multiple scans comprise the backbone of our data. This is where capability, exposure, and operational value converge. These are systems that provide ongoing utility to their operators and, by extension, represent the most attractive and accessible targets for adversaries.

Infrastructure Footprint and Attribution Challenges

The infrastructure distribution challenges assumptions about where AI compute resides. When classified by ASN type, fixed-access telecom networks, which include consumer ISPs, constitute the single largest category at 56% of hosts by count. However, when the same data is grouped into broader infrastructure tiers, exposure divides almost evenly: Hyperscalers account for 32% of hosts, and Telecom/Residential networks account for another 32%.

This apparent contradiction reflects a classification and attribution challenge inherent in internet scanning. Both views are accurate, and together they indicate that public Ollama exposure spans a mixed environment. Access networks, independent VPS providers, and major cloud platforms all serve as durable habitats for open-weight LLM deployment.

Operational characteristics vary by tier. Indie Cloud/VPS environments show high average persistence and elevated “running share,” which measures the proportion of hosts actively serving models at scan time. This is consistent with endpoints that provide stable, ongoing service. Telecom/Residential hosts, by contrast, report larger average model inventories but lower running share, suggesting machines that accumulate models over time but operate intermittently.

Geographic distribution also reveals concentration patterns. In the United States, Virginia alone accounts for 18% of U.S. hosts, likely reflecting the density of cloud infrastructure in US-EAST. In China, concentration is even tighter: Beijing accounts for 30% of Chinese hosts, with Shanghai and Guangdong contributing an additional 21% combined. These patterns suggest that observable open-source AI capability concentrates at infrastructure hubs rather than distributing uniformly.

Top 10 Countries by share of unique hosts

A significant portion of the infrastructure footprint, however, resists clean attribution. Depending on the classification method, 16% of tier labels and 19% of ASN-type classifications returned null values in our scans. This attribution gap reflects a governance reality. Security teams and enforcement authorities can observe activity, but they often cannot identify the responsible party. Traditional mechanisms that rely on clear ownership chains and abuse contact points become less effective when nearly one-fifth of the infrastructure is anonymous.

Model Adoption and Hardware Constraints

Although nothing is truly uniform on the internet, in our data we observe a distinct trend. Host placement is decentralized, but model adoption is concentrated. Lineage rankings are exceptionally stable across multiple weighting schemes. Across observations, unique hosts, and host-days, the same three families occupy the same positions with zero rank volatility: Llama at #1, Qwen2 at #2, and Gemma2 at #3. This stability indicates broad, repeated use of shared model lineages rather than a fragmented, experiment-heavy deployment pattern.

Top 20 model families by share of unique hosts

Portfolio behavior reveals a shift toward multi-model deployments. The average number of models per observation rose from 3 in March to 4 by September-December. The most common configuration remains modest at 2-3 models, accounting for 41% of hosts, but a small minority of “public library” hosts carry 20 or more models. These represent only 1.46% of hosts but disproportionately drive model-instance volume and family diversity.

Co-deployment patterns suggest operational logic beyond simple experimentation. The most prominent multi-family pairing, llama + qwen2, appears on 40,694 hosts, representing 52% of multi-family deployments. This consistency suggests operators maintain portfolios for comparison, redundancy, or workload segmentation rather than committing to a single lineage.

Hardware constraints express themselves clearly in quantization preferences and parameter-size distributions as well. The deployment regime converges strongly on 4-bit compression. The specific format Q4_K_M appears on 48% of hosts, and 4-bit formats total 72% of all observed quantizations compared to just 19% for 16-bit. This convergence is not confined to a single infrastructure niche. Q4_K_M ranks #1 across Academic, Hyperscaler, Indie VPS, and Telecom/Residential tiers.

Parameter sizes cluster in the mid-range. The 8-14B band is most prevalent at 26% of hosts, with 1-3B and 4-7B bands close behind. Together, these patterns reflect the practical economics of running inference on commodity hardware: models must be small enough to fit in available VRAM and memory bandwidth but also be capable enough for practical work.
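
Both distributions can be derived from the metadata Ollama itself reports. The sketch below assumes the details fields (parameter_size, quantization_level) that /api/tags responses typically carry; the size bands mirror the ones discussed above.

```python
from collections import Counter


def size_band(parameter_size: str) -> str:
    """Map a parameter-size string such as '8.0B' or '70B' to a coarse band."""
    try:
        billions = float(parameter_size.rstrip("Bb"))
    except ValueError:
        return "unknown"
    if billions < 1:
        return "<1B"
    if billions <= 3:
        return "1-3B"
    if billions <= 7:
        return "4-7B"
    if billions <= 14:
        return "8-14B"
    return ">14B"


def summarize_inventories(tags_responses: list[dict]) -> tuple[Counter, Counter]:
    """Tally quantization formats and parameter-size bands across many hosts.

    `tags_responses` is a list of parsed /api/tags JSON bodies, one per host.
    """
    quants, sizes = Counter(), Counter()
    for body in tags_responses:
        for model in body.get("models", []):
            details = model.get("details", {})
            quants[details.get("quantization_level", "unknown")] += 1
            sizes[size_band(details.get("parameter_size", ""))] += 1
    return quants, sizes
```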

This ecosystem-wide convergence on specific packaging regimes creates both portability and fragility. The same compression choices that enable models to run across diverse hardware environments also create a monoculture. A vulnerability in how specific quantized models handle tokens could affect a substantial portion of the exposed ecosystem simultaneously rather than manifesting as isolated incidents. This risk is particularly acute for widely deployed formats like Q4_K_M.

Capability Surface | Tools, Modalities, and Intent Signals

The persistent backbone is configured for action. Over 48% of observed hosts advertise tool-calling capabilities via their API endpoints. When queried, hosts return capability metadata indicating which operations they support. The specific combination of [completion, tools] indicates a host that can both generate text and execute functions. This configuration appears on 38% of hosts, indicating systems wired to interface with external software, APIs, or file systems.
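
Capability metadata can be checked per model. A minimal sketch, assuming the capabilities array that recent Ollama builds return from the /api/show endpoint; field names and availability vary across versions.

```python
import requests


def model_capabilities(host: str, model: str, port: int = 11434) -> set[str]:
    """Return the capability strings a host advertises for a given model.

    Assumes the `capabilities` array (e.g. ["completion", "tools", "vision",
    "thinking"]) that recent Ollama versions include in /api/show responses.
    """
    resp = requests.post(
        f"http://{host}:{port}/api/show",
        json={"model": model},
        timeout=5,
    )
    resp.raise_for_status()
    return set(resp.json().get("capabilities", []))


def is_action_capable(caps: set[str]) -> bool:
    # The [completion, tools] pairing discussed above: a model that can both
    # generate text and call external functions.
    return {"completion", "tools"} <= caps
```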

Host capability coverage (share of all hosts)

Modality support extends beyond text. Vision capabilities appear on 22% of hosts, enabling image understanding and creating vectors for indirect prompt injection via images or documents. “Thinking” models, which are optimized for multi-step reasoning and chain-of-thought processing, appear on 26% of hosts. When paired with tool-calling capabilities, reasoning capacity acts as a planning layer that can decompose complex tasks into sequential operations.

System prompt analysis surfaced a subset of deployments with explicit intent signals. We identified at least 201 hosts running standardized “uncensored” prompt templates that explicitly remove safety guardrails. This count represents a lower bound, since our methodology captured only prompts visible via API responses, and the presence of standardized “guard-off” configurations indicates a repeatable pattern rather than isolated experimentation.

A subset of 5,000 hosts demonstrates both high capability and high availability, showing 87% average uptime while actively running an average of 1.8 models. This combination of persistence, tool-enablement, and consistent availability suggests endpoints that provide ongoing operational value and, from an adversary perspective, represent stable, accessible compute resources.

Security Implications

The exposed Ollama ecosystem presents several threat vectors that differ from risks associated with platform-hosted LLM services.

Resource Hijacking

The persistent backbone represents a new network layer of compute infrastructure that can be accessed without authentication, usage monitoring, or billing controls. Frontier LLM providers have reported that criminal organizations and state-sponsored actors leverage their platforms for spam campaigns, phishing, disinformation networks, and network exploitation. These providers deploy dedicated security and fraud teams, implement rate limiting, and maintain abuse detection systems.

In contrast, the exposed Ollama backbone offers adversaries distributed compute resources with minimal centralized oversight. An attacker can direct malicious workloads to these hosts at zero marginal cost. The victim pays the electricity bill and infrastructure costs while the attacker receives the generated output. For operations requiring volume, such as spam generation, phishing content creation, or disinformation campaigns, this represents a substantial operational advantage.

Excessive Agency

Tool-calling capabilities fundamentally alter the threat model. A text-generation endpoint can produce harmful content, but a tool-enabled endpoint can execute privileged operations. When combined with insufficient authentication and network exposure, this creates what we assess to be the highest-severity risk in the ecosystem.

Prompt injection becomes an increasingly important threat vector as LLM-enabled systems are granted greater agency. This technique manipulates LLM behavior through crafted inputs. An attacker no longer needs to breach a file server or database; they can prompt an exposed Retrieval-Augmented Generation instance with benign-sounding requests: “Summarize the project roadmap,” “List the configuration files in the documentation,” or “What API keys are mentioned in the codebase?” A model designed to be helpful, and lacking authentication or safety mechanisms, will comply with these requests if its retrieval scope includes the targeted information.

We observed configurations consistent with retrieval workflows, including “chat + embeddings” pairings that suggest RAG deployments. When these systems are internet-reachable and lack access controls, they represent a direct path from external prompt to internal data.

Identity Laundering and Proxy Abuse

A significant portion of the exposed ecosystem resides on residential and telecom networks. These IP addresses are generally trusted by internet services as originating from human users rather than bots or automated systems. This creates an opportunity for sophisticated attackers to launder malicious traffic through victim infrastructure.

With vision capabilities present on 22% of hosts, indirect prompt injection via images becomes viable at scale. An attacker can embed malicious instructions in an image file and, if a vision-capable Ollama instance processes that image, trigger unintended behavior. When combined with tool-calling capabilities on a residential IP, this enables attacks where malicious traffic appears to originate from a legitimate household, bypassing standard bot management and IP reputation defenses.

Concentration Risk

The ecosystem’s convergence on specific model families and quantization formats creates systemic fragility. If a vulnerability is discovered in how a particular quantized model architecture processes certain token sequences, defenders would face not isolated incidents but a synchronized, ecosystem-wide exposure. Software monocultures have historically amplified the impact of vulnerabilities. When a single implementation error affects a large percentage of deployed systems, the blast radius expands accordingly. The exposed Ollama ecosystem exhibits this pattern: nearly half of all observed hosts run the same quantization format, and the top three model families dominate across all measurement methods.

Governance Gaps

Effective cybersecurity incident response relies on clear attribution: identifying the owner of compromised infrastructure, issuing takedown notices, and escalating through established abuse reporting channels. Even where attribution succeeds, enforcement mechanisms assume centralized control points. In cloud environments, providers can disable instances, revoke credentials, or implement network-level controls. In residential and small VPS environments, these levers often do not exist. An Ollama instance running in a home network or on a low-cost VPS may be accessible to adversaries but unreachable by security teams lacking contractual or legal authority.

Open Weights and the Governance Inversion

The exposed Ollama ecosystem forces a distinction that “open” rhetoric often blurs: distribution is decentralized, but dependency is centralized. On the ground, public instances span thousands of networks and operator types, with no single provider controlling where they live or how they’re configured, yet at the model-supply layer, the ecosystem repeatedly converges on the same few options. Lineage choice, parameter size, and quantization format determine what is actually runnable or exploitable.

This creates what we characterize as a governance inversion. Accountability diffuses downward into thousands of home networks and server closets, while functional dependency concentrates upward into a handful of model lineages released by a small number of labs. Traditional governance frameworks assume the opposite: centralized deployment with diffuse upstream supply.

In platform-hosted AI services, governance flows through service boundaries. This includes the all-too-familiar terms of use, API rate limits, content filtering, telemetry, and incident response capacity. Providers can monitor usage patterns, detect abuse, and terminate access for policy violations, including use in state-sponsored campaigns. Open-weight models operate differently: in artifact-distributed models, these mechanisms largely do not exist. Weights behave like software artifacts: copyable, forkable, quantized into new formats, retrainable, and embedded into stacks the releasing lab will never observe.

Our data makes the artifact model difficult to ignore. Infrastructure placement is widely scattered, yet operational behavior and capability repeatedly trace back to upstream release decisions. When a new model family achieves portability across commodity hardware and gains adoption, that release decision gets amplified through distributed deployment at a pace that outstrips existing governance timelines.

This dynamic does not mean open weights are inherently problematic – the same characteristics that create governance challenges also enable research, innovation, and deployment flexibility that platform-hosted services cannot match. Rather, it suggests that governance mechanisms designed for centralized platforms require adaptation to this new risk environment. Post-release monitoring, vulnerability disclosure processes, and mechanisms for coordinating responses to misuse at scale become critical when frontier capability is produced by a few labs but deployed everywhere.

Conclusion

The exposed Ollama ecosystem represents what we assess to be the early formation of a public compute substrate: a layer of AI infrastructure that is widely distributed, unevenly managed, and only partially attributable, yet persistent enough in specific tiers and locations to constitute a measurable phenomenon.

The ecosystem is structurally paradoxical. It is resilient in its spread across thousands of networks and jurisdictions, making it impossible to “turn off” through centralized action, yet it is fragile in its dependency, relying on a narrow set of upstream model lineages and packaging formats. A single widespread vulnerability or adversarial technique optimized for the dominant configurations could affect a substantial portion of the exposed surface.

Security risk concentrates in the persistent backbone of hosts that remain consistently reachable, tool-enabled, and often lacking authentication. These systems require different governance approaches depending on infrastructure tier: traditional controls for cloud deployments, but sanitation mechanisms for residential networks where contractual leverage does not exist.

For defenders, the key takeaway is that LLMs are increasingly deployed to the edge to translate instructions into actions. As such, they must be treated with the same authentication, monitoring, and network controls as other externally accessible infrastructure.

LLMs in the SOC (Part 1) | Why Benchmarks Fail Security Operations Teams

Executive Summary

  • SentinelLABS’ analysis of benchmarks for LLMs in cybersecurity, including those published by major players such as Microsoft and Meta, found that none measure what actually matters for defenders.
  • Most LLM benchmarks test narrow tasks, but these map poorly to security workflows, which are typically continuous, collaborative, and frequently disrupted by unexpected changes.
  • Models that excel at coding and math provide minimal direct gains on security tasks, indicating that general LLM capabilities do not readily translate to analyst-level thinking.
  • All of today’s benchmarks use LLMs to evaluate other LLMs, often using the same vendor’s models for both, creating a closed loop that is susceptible to gaming and difficult to trust.
  • As frontier labs push defenders to rely on models to automate security operations, benchmarks will become drastically more important as the main mechanism for evaluating whether model capabilities match vendors’ claims.

For security teams, AI promised to write secure code, identify and patch vulnerabilities, and replace monotonous security operations tasks. Its key value proposition was raising costs for adversaries while lowering them for defenders.

To evaluate whether Large Language Models were both performant and reliable enough to be deployed into the enterprise, a wave of new benchmarks was created. In 2023, these early benchmarks largely comprised multiple-choice exams over clean text, which produced clean and reproducible performance metrics. However, as the models improved they outgrew the early tests: scores began to converge at the top of the scale as the benchmarks became increasingly “saturated”, and the tests ceased to tell us anything meaningful.

As the industry has boomed over the past few years, benchmarking has become a way to distinguish new models from older ones. Developing a benchmark that shows how a smaller model outperforms a larger one released by a frontier AI lab is a billion-dollar industry, and now every new model launches with a menagerie of charts with bold claims: +3.7 on SomeBench-v2, SOTA on ObscureQA-XL, or 99th percentile on an-exam-no-one-had-heard-of-last-week. The subtext here is simple: look at the bold numbers, be impressed, and please join our seed round!

Inside this swamp of scores and claims, security teams are somehow meant to conclude that a system is safe enough to trust with an organization’s business, its users, and maybe even its critical infrastructure. However, a careful read through the arXiv benchmark firehose reveals a hard-to-miss pattern: we have more benchmarks than ever, and somehow we are still not measuring what actually matters for defenders.

So what do security benchmarks actually measure? And how well does this approach map to real security work?

In this post, we review four popular LLM benchmarking evaluations: Microsoft’s ExCyTIn-Bench, Meta’s CyberSOCEval and CyberSecEval 3, and the Rochester Institute of Technology’s CTIBench. We explore what we think these benchmarks get right and where we believe they fall short.

What Current Benchmarks Actually Measure

ExCyTIn-Bench | Realistic Logs in a Microsoft Snow Globe

ExCyTIn-Bench was the cleanest example of an “agentic” Security Operations benchmark that we reviewed. It drops LLM agents into a MySQL instance that mirrors a realistic Microsoft Azure tenant, providing 57 Sentinel-style tables, 8 distinct multi-stage attacks, and a unified log stream spanning 44 days of activity.

Each question posed to the LLM agent is anchored to an incident graph path. This means that the agent must discover the schema, issue SQL queries, pivot across entities, and eventually answer the question. Rewards for the agent are path-aware: full credit is assigned for the right answer, but the agent can also earn partial credit for each correct intermediate step it takes.
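
The reward idea is straightforward to sketch. The toy function below is an illustration of path-aware partial credit, not ExCyTIn-Bench's actual implementation: the agent earns a fraction of the reward for each ground-truth intermediate entity it touches, plus credit for the final answer.

```python
def path_aware_reward(agent_steps: list[str],
                      ground_truth_path: list[str],
                      final_answer: str,
                      correct_answer: str,
                      answer_weight: float = 0.5) -> float:
    """Toy reward: partial credit for intermediate hops, plus credit for the answer.

    Illustrative only; ExCyTIn-Bench's real reward function differs in detail.
    """
    answer_score = 1.0 if final_answer == correct_answer else 0.0
    if not ground_truth_path:
        return answer_score
    hops_hit = sum(1 for node in ground_truth_path if node in agent_steps)
    path_score = hops_hit / len(ground_truth_path)
    return (1 - answer_weight) * path_score + answer_weight * answer_score


# Example: the agent pivoted through 2 of 3 ground-truth entities and answered correctly.
print(path_aware_reward(["user:alice", "vm:web-01"],
                        ["user:alice", "vm:web-01", "ip:198.51.100.7"],
                        "mimikatz", "mimikatz"))  # ~0.83
```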

The headline result is telling:

“Our comprehensive experiments with different models confirm the difficulty of the task: with the base setting, the average reward across all evaluated models is 0.249, and the best achieved is 0.368…” (arXiv)

Microsoft’s ExCyTIn benchmark demonstrates that LLMs struggle to plan multi-hop investigations over realistic, heterogeneous logs.

This is an important finding, especially for those concerned with how LLMs perform in real-world scenarios. However, all of this takes place in a Microsoft snow globe: one fictional Azure tenant, eight well-studied, canned attacks, clean tables, and curated detection logic for the agent to work with. Although the realistic agent setup is a massive improvement over trivia-style multiple-choice question (MCQ) benchmarks, it is not the daily chaos of real security operations.

CyberSOCEval | Defender Tasks Turned into Exams

CyberSOCEval is part of Meta’s CyberSecEval 4 and deliberately picks two tasks defenders care about: malware analysis over real sandbox detonation logs and threat intelligence reasoning over 45 CTI reports. The authors open with a statement we very much agree with:

“This lack of informed evaluation has significant implications for both AI developers and those seeking to apply LLMs to SOC automation. Without a clear understanding of how LLMs perform in real-world security scenarios, AI system developers lack a north star to guide their development efforts, and users are left without a reliable way to select the most effective models.” (arXiv)

To evaluate these tasks, the benchmark frames them as multi-answer multiple-choice questions and incorporates analytically computed random baselines and confidence intervals. This setup gives clean, statistically grounded comparisons between models but reduces complex workflows to simplified questions. The researchers found that models perform far above random, yet far from solving the tasks.

In the malware analysis trial, models score exact-match accuracy in the teens to high-20s percentage range versus a random baseline around 0.63%. For threat-intel reasoning, models land in the ~43 to 53% accuracy band versus ~1.7% random.

In other words, the models are clearly extracting meaningful signals from real logs and CTI reports. However, they are also failing to correctly answer most of the malware questions and roughly half of the threat intelligence questions.

These findings suggest that for any system aimed at automating SOC workflows, model performance should be evaluated as assistive rather than autonomous.

Crucially, they find that test-time “reasoning” models don’t get the same uplift they see in math/coding:

“We also find that reasoning models leveraging test time scaling do not achieve the boost they do in areas like coding and math, suggesting that these models have not been trained to reason about cybersecurity analysis…” (arXiv)

That’s a big deal, and it’s evidence that you don’t get generalized security reasoning for free just by cranking up “thinking steps”.

Meta’s CyberSOCEval falls short because it compresses two complex domains into MCQ exams. There is no notion of triaging multiple alerts, asking follow-up questions, or hunting down log sources. In real life, analysts need to decide when to stop, escalate, or switch paths.

In the end, while CyberSOCEval is a clean and statistically sound probe of model performance on a set of highly specific sub-tasks, it is far from a faithful representation of real SOC workflows.

CTIBench | CTI as a Certification Exam

CTIBench is a benchmark task suite introduced by researchers at the Rochester Institute of Technology to evaluate how well LLMs operate in the field of Cyber Threat Intelligence. Unlike general-purpose benchmarks, which focus on high-level domain knowledge, CTIBench grounds tasks in the practical workflows of information security analysts. Like the other benchmarks we examined, it performs this analysis as an MCQ exam.

“While existing benchmarks provide general evaluations of LLMs, there are no benchmarks that address the practical and applied aspects of CTI-specific tasks.” (NeurIPS Papers)

CTIBench draws on well-known security standards and real-world threat reports, then turns them into five kinds of tasks:

  • basic multiple-choice questions about threat-intelligence knowledge
  • mapping software vulnerabilities to their underlying weaknesses
  • estimating how serious a vulnerability is
  • pulling out the specific attacker techniques described in a report
  • guessing which threat group or malware family is responsible.

The data is mostly from 2024, so it’s newer than what most models were trained on, and each task is graded with a simple “how close is this to the expert answer?” style score that fits the kind of prediction being made.

On paper, this looks close to the work CTI teams care about: mapping vulnerabilities to weaknesses, assigning severity, mapping behaviors to techniques, and tying reports back to actors.

In practice, though, the way those tasks are operationalized keeps the benchmark in the frame of a certification exam. Each task is cast as a single-shot question with a fixed ground-truth label, answered in isolation with a zero-shot prompt. There is no notion of long-running cases, heterogeneous and conflicting evidence, evolving intelligence, or the need to cross-check and revise hypotheses over time.

CTIBench is yet another MCQ suite: an excellent exam if you want to know, “Can this model answer CTI exam questions and do basic mapping/annotation?” It says less about whether an LLM can do the messy work that actually creates value: normalizing overlapping feeds, enriching and de-duplicating entities in a shared knowledge graph, negotiating severity and investment decisions with stakeholders, or challenging threat attributions that don’t fit an organization’s historical data.

CyberSecEval 3 | Policy Framing Without Operational Closure

CyberSecEval 3, also from Meta, is not a SOC benchmark so much as a risk map. The authors carve the space into eight risks, grouped into two buckets: harms to third parties (i.e., offensive capabilities) and harms to application developers and end users (such as misuse, vulnerabilities, or data leakage). The frame of this eval is the current regulatory conversation between governments and standards bodies about unacceptable model risk, so the suite is understandably organized around “where could this go wrong?” rather than “how much better does this make my security operations?”

The benchmark’s coverage tracks almost perfectly with the concerns of policymakers and safety orgs. On the offensive side, CyberSecEval 3 looks at automated spear-phishing against LLM-simulated victims, uplift for human attackers solving Hack-The-Box style CTF challenges, fully autonomous offensive operations in a small cyber range, and synthetic exploit-generation tasks over toy programs and CTF snippets. On the application side, it probes prompt injection, insecure code generation in both autocomplete and instruction modes, abuse of attached code interpreters, and the model’s willingness to help with cyberattacks mapped to ATT&CK stages.

The findings across these areas are very broad. Llama3 is described as capable of “moderately persuasive” spear-phishing, roughly on par with other SOTA models when judged against simulated victims. In the CTF study, Llama3 405B gives novice participants a noticeable bump in completed phases and slightly faster progress, but the authors stress that the effect is not statistically robust.

The fully autonomous agent can handle basic reconnaissance in the lab environment, but fails to achieve reliable exploitation or persistence. On the application-risk side, all tested models suggest insecure code at non-trivial rates, prompt injection succeeds a significant fraction of the time, and models will sometimes execute malicious code or provide help with cyberattacks. Meta stresses that its own guardrails reduce these risks on the benchmark distributions.

CyberSecEval 3 may have some value for those working in policy and governance, but none of the eight risks are defined in terms of operational metrics such as detection coverage, time to triage, containment, or vulnerability closure rates. The CTF experiment comes closest to demonstrating something about real-world value, but it is still an artificial one-hour lab on pre-selected targets. Moreover, this experiment is expensive and not reproducible at scale.

There are glimmers of operational relevance in the paper, and CyberSecEval 3 remains a strong contribution to AI security understanding and governance, but it is a weak instrument for deciding whether to deploy a model as a copilot for live operations.

Benchmarks are Measuring Tasks, not Workflows

All of these benchmarks share a common blind spot: they treat security as a collection of isolated questions rather than as an ongoing workflow.

Real teams work through queues of alerts, pivot between partially related incidents, and coordinate across levels of seniority. They make judgment calls under time pressure and incomplete telemetry. Closing out a single alert or scoring 90% on a multiple-choice test is not the goal of a security team. The goal is reducing the underlying risk to the business, and this means knowing the right questions to ask in the first place.

ExCyTIn-Bench comes closest to acknowledging this reality. Agents interact with an environment over multiple turns and earn rewards for intermediate progress. Yet even here, the fundamental unit of evaluation is still a question: “What is the correct answer to this prompt?” The system is not asked to “run this incident to ground” or evaluate different environments or logging sources that may be included in an incident response. CyberSOCEval and CTIBench compress even richer workflows into single multiple-choice interactions.

Methodologically, this means none of these benchmarks are measuring outcomes that define security performance. Metrics such as time-to-detect, time-to-contain, and mean time to remediate are absent. We are measuring how models behave when the important context has already been carefully prepared and handed to them, not how they behave when dropped into a live incident where they must decide what to look at, what to ignore, and when to ask for help.

Until we are ready to benchmark at the workflow level, we should understand that high accuracy on multiple-choice security questions and smooth reward curves are not stand-ins for operational uplift. In information security, the bar must be higher than passing an exam.

MCQs and Static QA are Overused Crutches

Multiple-choice questions are attractive for understandable reasons. They are easy to score at scale. They support clean random baselines and confidence intervals, and they fit nicely into leaderboards and slide decks.

The downside is that this format quietly bakes in assumptions that do not hold in practice. For any given scenario, the benchmark assumes someone has already asked the right question. There is no space for challenging the premise of that question, reframing the problem, or building and revising a plan. All of the relevant evidence has already been selected and pre-packaged for the analyst. In that setting, the model’s job is essentially to compress and restate context, not to decide what to investigate or how to prioritize effort. Wrong or partially correct answers carry no real cost.

This is the inverse of real SOC and CTI work, where the hardest part is deciding what questions to ask, what data to pull, and what to ignore. That judgment is usually earned over years of experience or deliberate training. If we want to know whether models will actually help in our workflows, we need evaluations where asking for more data has a cost, ignoring critical signals is penalized, and “I don’t know, let me check” is a legitimate and sometimes optimal response.

Statistical Hygiene is Still Uneven

To their credit, some of these efforts take statistics seriously. CyberSOCEval reports confidence intervals and uses bootstrap analysis to reason about power and minimum detectable effect sizes. CTIBench distinguishes between pre- and post-cutoff datasets and examines performance drift. CyberSecEval 3 uses survival analysis and appropriate hypothesis tests in its human-subject CTF study to show an unexpected lack of statistically significant uplift from an LLM copilot.

Across the board, however, there are still gaps. Many results come from single-seed, temperature-zero runs with no variance reported. ExCyTIn-Bench, for instance, reports an average reward of 0.249 and a best of 0.368, but provides no confidence intervals or sensitivity analysis. Contamination is rarely addressed systematically, even though all four benchmarks draw on well-known corpora that almost certainly overlap with model training data. Heavy dependence on a single LLM judge, often from the same vendor as the model being evaluated, compounds these issues.

The consequence is that headline numbers can look precise while being fragile under small changes in prompts, sampling parameters, or judge models. If we expect these benchmarks to inform real governance and deployment decisions, variance, contamination checks, and judge robustness should be baseline, check-box requirements.
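
What baseline variance reporting could look like is simple to sketch. The example below assumes a harness that can re-run the same benchmark under several seeds; the accuracy values are placeholders, not published results.

```python
import random
import statistics


def bootstrap_ci(scores: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05) -> tuple[float, float]:
    """Percentile bootstrap confidence interval over repeated benchmark runs."""
    means = sorted(
        statistics.mean(random.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    return (means[int((alpha / 2) * n_resamples)],
            means[int((1 - alpha / 2) * n_resamples)])


# Placeholder accuracies from five seeds of the same model on the same benchmark.
runs = [0.41, 0.38, 0.44, 0.40, 0.37]
lo, hi = bootstrap_ci(runs)
print(f"mean={statistics.mean(runs):.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```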

Using LLMs to Evaluate LLMs Is Everywhere, and Rarely Questioned

Every benchmark we reviewed relies on LLMs somewhere in the evaluation loop, either to generate questions or to score answers.

ExCyTIn uses models to turn incident graphs into Q&A pairs and to grade free-form responses, falling back to deterministic checks only in constrained cases. CyberSOCEval uses Llama models in its question-generation pipeline before shifting to algorithmic scoring. CTIBench relies on GPT-4-class models to produce CTI multiple-choice questions. CyberSecEval 3 uses LLM judges to rate phishing persuasiveness and other behaviors.

CyberSecEval 3 is a standout here. It calibrates its phishing judge against human raters and reports a strong correlation, which is a step in the right direction. But overall, we are treating these judges as if they were neutral ground truth. In many cases, the judge is supplied by the same vendor whose models are being evaluated, and the judging prompts and criteria are public. That makes the benchmarks simple to overfit: once you know how the judge “thinks,” it is trivial to tune a model or prompting strategy to please it.
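
That kind of calibration is cheap to run and easy to report. A minimal sketch with placeholder scores, not CyberSecEval 3's actual data: score a shared sample with both the LLM judge and human raters, and check agreement before trusting the judge at scale.

```python
from statistics import correlation  # Pearson r, available since Python 3.10

# Placeholder ratings: an LLM judge and human raters scoring the same eight
# phishing transcripts for persuasiveness on a 1-5 scale.
judge_scores = [4.0, 2.5, 3.0, 4.5, 1.5, 3.5, 2.0, 4.0]
human_scores = [3.5, 2.0, 3.0, 4.0, 2.0, 3.0, 2.5, 4.5]

r = correlation(judge_scores, human_scores)
print(f"judge-human Pearson r = {r:.2f}")
# Low agreement here would argue for recalibrating the judging prompt or
# swapping in a judge from a different vendor before publishing a leaderboard.
```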

That being said, “LLM as a judge” remains incredibly popular across the field. It is cheap, fast, and feels objective. It’s not the worst setup, but if we do not actively interrogate and diversify these judges, comparing them against humans and against each other, then over time we risk baking the biases and blind spots of a few dominant models into the evaluation layer itself. That is a poor foundation for any serious claims about security performance.

Technical Gaps

Even when the evaluation methodology is thoughtful, there are structural reasons today’s benchmarks diverge from real SOC environments.

Single-Tenant, Single-Vendor Worlds

ExCyTIn presents a well-designed Azure-style environment, but it is still a single fictional tenant with a curated set of attacks and detection rules. It tells us how models behave in a world with clean logging and eight known attack chains, but not in a hybrid AWS/Azure/on-prem estate where sensors are misconfigured and detection logic is uneven.

CyberSOCEval’s malware logs and CTI corpora are similarly narrow. They represent security artifacts cleanly without the messy mix of SIEM indices, ticketing systems, internal wikis, email threads, and chat logs that working defenders navigate daily. If the goal is to augment those people, current benchmarks barely capture their environment. If the goal is to replace them, the gap is even wider.

Static Text Instead of Living Tools and Data

CTIBench and CyberSOCEval are fundamentally static. PDFs are flattened into text, JSON logs are frozen into MCQ contexts, CVEs and CWEs are snapshots from public databases. That is reasonable for early-stage evaluation, but it omits the dynamics that matter most in real operations.

Analysts spend their time in a world of internal middleware consoles, vendor platforms, and collaboration tools. Threat actors shift infrastructure mid-campaign or opportunistically piggyback on others’ infrastructure. New intelligence arrives in the middle of triage, often from sources uncovered during the investigation. In that sense, a well-run tabletop or red–blue exercise is closer to reality than a static question bank. Benchmarks that do not encode time, change, and feedback will always understate the difficulty of the work.

Multimodality is Still Underdeveloped

CyberSOCEval does take an impressive run at multimodality, comparing text-only, image-only, and combined modes on CTI reports and malware artifacts. One uncomfortable takeaway is that text-only models often outperform image or text+image pipelines, and images matter primarily when they contain information not available in text at all. In practice, analysts rarely hinge a response on a single graph or screenshot.

At the same time, current “multimodal” models are still uneven at reasoning over screenshots, tables, and diagrams with the same fluency they show on clean prose. If we want to understand how much help an LLM will be at the console, we need benchmarks that isolate and stress those capabilities directly, rather than treating multimodality as a side note.

Modeling Limitations

Ironically, the very benchmarks that miss real-world workflows still reveal quite a bit about where today’s models fall short.

General Reasoning is Not Security Reasoning

CyberSOCEval’s abstract states outright that “reasoning” models with extended test-time thinking do not achieve their usual gains on malware and CTI tasks. ExCyTIn shows a similar pattern: models that shine on math and coding benchmarks stumble when asked to plan coherent sequences of SQL queries across dozens of tables and multi-stage attack graphs.

In other words, we mostly have capable general-purpose models that know a lot of security trivia. That is not the same as being able to reason like an analyst. On the plus side, the benchmarks are telling us what is needed next: security-specific fine-tuning and chain-of-thought traces, exposure to real log schemas and CTI artifacts during training, and objective functions that reward good investigative trajectories, not just correct final answers.

Poor Calibration on Scores and Severities

CTIBench’s CVSS task (CTI-VSP) is especially revealing in this regard. Models are asked to infer CVSS v3 base vectors from CVE descriptions, and performance is measured with mean absolute deviation from ground-truth scores. The results show systematic misjudgments of severity, not just random noise, which is an important finding from the benchmark.
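
The metric itself is simple. A short sketch with placeholder numbers, not CTIBench data, showing how systematic under-rating of severity surfaces as mean absolute deviation:

```python
def mean_absolute_deviation(predicted: list[float], actual: list[float]) -> float:
    """Average absolute error between predicted and ground-truth CVSS base scores."""
    assert len(predicted) == len(actual) and predicted, "score lists must match and be non-empty"
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Placeholder example: a model that consistently under-rates critical CVEs.
ground_truth = [9.8, 7.5, 5.3, 8.8]
model_scores = [7.2, 7.0, 5.0, 6.5]
print(f"MAD = {mean_absolute_deviation(model_scores, ground_truth):.2f}")  # ~1.4 CVSS points
```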

Those errors are concerning for any organization that plans to use model-generated scores to drive patch prioritization or risk reporting. More broadly, they highlight a recurring theme: models often sound confident while being poorly calibrated on risk. Benchmarks that only track accuracy or top-1 match rates will fail to identify the danger of confident, but incorrect recommendations, especially in environments where those recommendations can be gamed or exploited.

Conclusion

Today’s benchmarks present a clear step forward from generic NLP evaluations, but our findings reveal as much about what is missing as what is measured: LLMs struggle with multi-hop investigations even when given extended reasoning time, general LLM reasoning capabilities don’t transfer cleanly to security work, and evaluation methods that rely on vendor models to grade vendor models create obvious conflicts of interest.

More fundamentally, current benchmarks measure task performance in controlled settings, not the operational outcomes that matter to defenders: faster detection, reduced containment time, and better decisions under pressure. No current benchmarks can tell a security team whether deploying an LLM-driven SOC or CTI system will actually improve their posture or simply add another tool to manage.

In Part 2 of this series, we’ll examine what a better generation of benchmarks should look like, digging into the methodologies, environments, and metrics required to evaluate whether LLMs are ready for security operations, not just security exams.
