Limitations of Heuristics in the Convergence on Cognition

A note on terminology — We use “cognition” colloquially throughout this paper, in the sense of inference-time generative processes with state-dependent sampling. We do not claim that large language models possess cognition in the cognitive-science sense; we make no commitments about consciousness, understanding, or general intelligence. Where precision matters, we substitute “continuous generative processes,” “stochastic policy sampling,” or “inference-time adaptation.” Readers who object to the colloquial usage may mentally substitute “continuous adaptive sampling” wherever “cognition” appears without losing any of the technical argument.

Abstract

Many deployed detection systems, especially those inherited from the signature, WAF, SIEM, and anomaly-detection traditions, are built on enumeration. Signatures enumerate known-bad bytes; rules enumerate forbidden inputs; taxonomies enumerate weakness classes. This worked because the adversary, too, was operating within enumerable space. That assumption no longer holds, and the consequences run deeper than “attacks have gotten harder to enumerate.” The more important shift is that the central question of cybersecurity is changing in kind. The frontier deployed system is the authorized agent operating with significant tool access inside an organization, and the question is whether its actions remain consistent with the organization's sanctioned-use policy: user intent, role scope, delegated authority, data boundaries, approval state, and business-process invariants. This is a behavioral alignment problem, not an intrusion prevention problem, and it breaks the heuristic paradigm at the root.

We argue that this is a structural mismatch and not an implementation gap. We survey the seven dominant families of deployed detection technology and identify three architectural commitments they share (fixed feature space, locality of decision, discrete update cadence) that are jointly insufficient against systems whose behavior is generated by continuous, learned, inference-time-adaptive processes. We then compare heuristic detection to transformer architecture across eight axes, showing where the gap lies mechanism by mechanism.

The paper next characterizes the observation surfaces at which a monitoring layer can read the behavior of a deployed agent (input, tool call, reasoning, output), with reasoning further decomposed into verbalized text and latent residual-stream state. Each surface admits a different observation regime with different fidelity properties. We draw on recent empirical work, including frontier exploitation benchmarks, autonomous attack-defense competitions, chain-of-thought faithfulness studies, and subliminal learning experiments, to anchor the analysis.

Rather than propose a specific monitoring architecture, we characterize the design space within which any candidate architecture must be located. The design space is constrained by three forcing functions (expressivity, latency, adaptivity) that do not commute, by two adversarial dynamics specific to that space (escalation exhaustion and baseline displacement) that we believe have not been adequately treated in the existing literature, and by a bifurcation between white-box and black-box deployments that produces two largely separate research agendas. We outline what an evaluation framework for candidate architectures would need to measure, without proposing the framework itself.

Finally, we acknowledge a set of limitations that bound what any monitoring architecture can promise. Models can produce verbalized reasoning that is not faithful to their computation. Models can in principle encode reasoning in text that monitors cannot read. Models can acquire behavioral tendencies through training data that appears innocuous by inspection. Most fundamentally, a monitor is only as informative as the policy of sanctioned use it has been given to enforce, and articulating that policy for an autonomous agent across a continuous action distribution is an open problem distinct from monitoring architecture itself. We do not resolve any of these. We argue that the design space exists, that its boundaries are non-trivial, and that no architecture built solely on enumeration of past behavior is sufficient for the design space, even though enumerated controls (forbidden tools, allowlisted accounts, known-bad indicators, policy invariants) will remain useful components within it.

1. Introduction

Cybersecurity detection has historically been built on enumeration. A signature is an enumeration of known-bad byte sequences. A web application firewall rule is an enumeration of forbidden input shapes. The Common Weakness Enumeration is, in name, an enumeration of weakness classes. A behavioral allowlist is an enumeration of permitted process trees. The discipline of building these enumerations — collecting indicators, distilling patterns, encoding rules — has been the operational practice of cybersecurity engineering for three decades. It worked because the adversary, too, was working within enumerable space. Malware families were finite. Exploit techniques were taught in finite training programs. Attack tooling was distributed through finite supply chains. When a new technique appeared, the enumeration grew to absorb it, and equilibrium was restored on a roughly weekly cycle.

That equilibrium is gone, but the reason is more interesting than “attackers have gotten better.” Enumeration worked because the adversary was an external entity drawing from a finite library of techniques, and the central question was how to keep that adversary out. Both halves of that framing have weakened at the frontier. The external-adversary problem has not gone away, and modern security programs have responded to it with substantial machinery (IAM, PAM, EDR/XDR, zero-trust controls, cloud posture management, attack-path management) that goes well beyond the detection traditions we focus on in this paper. The newer problem, layered on top of these, is how to ensure that the authorized actors, increasingly autonomous and increasingly powerful, continue to act in alignment with the organizations that deployed them. When a coding agent commits code to a production repository, when a finance agent moves funds between accounts, when a customer-support agent reads from and writes to a customer database, the question of whether the action was performed by a malicious external party becomes one of two open questions rather than the only one. The other is whether the action falls within the agent's sanctioned use: the user's intent for the session, the agent's role scope and delegated authority, the data boundaries it is permitted to cross, the approval state of the requested operation, and the business-process invariants the operation must preserve. The agent has the credentials. The agent has the tools. The agent is supposed to be doing things. The question is whether the things it is doing remain congruent with those operational constraints.

This reframing breaks the heuristic paradigm at the root. Heuristic detection asks “is this pattern in the forbidden set.” Alignment monitoring asks “is this action, in this context, with this history, consistent with what the agent is here to do.” The two questions are not even the same kind of question. The first is a pattern-matching problem over a fixed feature space. The second is a behavioral-attestation problem over a context-dependent sanctioned-use policy that the monitor must itself model: user intent, role scope, delegated authority, data boundaries, approval state, and business-process invariants. The broader risk landscape we are touching on here, the risks specific to agentic systems with significant tool access, is catalogued at a high level in OWASP's LLM06: Excessive Agency ([7]); we are concerned in this paper with the specific question of how to monitor whether the agency that has been granted is being used in alignment with the granting organization's intent.

The argument of this paper is that this is a structural mismatch and not a temporary capability gap. We are not in a period where signatures need to be updated more frequently. We are in a period where the assumption underlying signatures, that adversary behavior lives on an enumerable surface, has stopped being approximately true, and where the more pressing question has shifted from “what counts as an attack” to “what counts as legitimate use of legitimate access.” Monitoring that depends on enumeration is becoming, in a precise sense, structurally undefined against the systems it is meant to monitor.

We make this argument in nine stages. Section 2 enumerates the seven dominant families of deployed heuristic detection technology and identifies the three architectural commitments they share. Section 3 conducts a mechanistic comparison between heuristic detection and transformer-based model architecture, showing where the gap lies axis by axis. Section 4 describes the four operational properties of inference-time generative systems that interact with monitoring. Section 5 examines recent empirical evidence from frontier exploitation benchmarks and live attack-defense competitions between autonomous agents. Section 6 characterizes the observation surfaces at which a monitoring layer can read agent behavior, and the limitations of each. Section 7 characterizes the design space within which viable monitoring architectures must be located, identifying three forcing functions and two adversarial dynamics that any candidate architecture must satisfy or defend against, treating the white-box / black-box bifurcation and the research directions available within each regime, and outlining the shape of an evaluation framework that would let candidate architectures be compared; we deliberately do not prescribe an architecture, because the resolution of these constraints is a research problem rather than a settled engineering question. Section 8 discusses the deeper limitations that monitoring faces: chain-of-thought unfaithfulness, encoded reasoning, subliminal learning, the policy specification problem (a monitor is only as good as the criterion for sanctioned use it has been given), and the monitor's own susceptibility to the same dynamics it tries to surface. Section 9 concludes.

The paper takes a position. It is not that monitoring is hopeless; it is that the techniques inherited from the signature era were not built for the systems we are now deploying inside organizations, and the question of what monitoring must look like has changed shape.

2. A taxonomy of deployed heuristic detection

A heuristic, in the operational sense relevant here, is a function $f: X \to \{0, 1\}$ that maps an observation $x$ to a binary decision (allow, block) or a discrete enumeration of labels. The function is constructed by humans from prior incidents, and its correctness is local to the set of inputs that resemble those incidents. A signature for the EternalBlue exploit is correct on the EternalBlue payload, and on payloads similar enough to share the discriminating bytes, and uncorrelated with adversary behavior outside that neighborhood.

What unifies the diversity of deployed detection technology is not the technique. It is the commitment to operating over a feature space that is fixed at deployment time. The features may be bytes, n-grams of bytes, syscall sequences, graph adjacencies, statistical moments of network flows, or hand-engineered domain features. The function over those features may be a regex, a linear classifier, a random forest, a hidden Markov model, an isolation forest, or a graph algorithm. The space is human-designed; the function operates within the space; the space does not adapt to the input.

We survey seven families below, then return to the unifying property in Section 2.8.

01Deterministic pattern matchingAho-Corasick, Boyer-Moore, YARA, ClamAV, Bloom filters

02Rule-based input filteringModSecurity / OWASP CRS, Snort, Suricata, SIEM correlation rules

03Statistical anomaly detectionZ-score, Mahalanobis, Isolation Forest, One-class SVM, GMM

04Tree ensemblesRandom Forest, XGBoost, LightGBM

05Sequence modelsHMM over syscalls, n-gram command-line models

06Graph-based detectionProvenance graphs, Process trees

07Reputation systemsIP / domain reputation, File-hash threat intel

Figure 1. A taxonomy of deployed heuristic detection technology. The families differ in technique but share three architectural commitments (Section 2.8): fixed feature space, locality of decision, and discrete update cadence.

2.1 Deterministic pattern matching

The oldest and most widely deployed family. Includes Aho-Corasick automata for multi-string matching, Boyer-Moore for single-pattern search, hash-based lookup against known-bad indicators (file hashes, IP addresses, domain names), Bloom filters as compact set-membership filters, and the YARA rule compiler that generalizes pattern matching to conjunctive byte-level predicates over file content. ClamAV is the canonical open-source instantiation; commercial antivirus products are largely the same engine class with proprietary signature databases.

The defining property is exact equality as the discrimination criterion. The input matches if and only if it contains a known string or matches a known hash. Latency is sub-microsecond per query against a database of millions of indicators, which is the most attractive property of the family. For exact cryptographic-hash matching, collision-driven false positives are negligible; operational false positives still occur when the intelligence label is wrong, stale, or applied outside its intended context, and antivirus vendors track this as a meaningful engineering metric. Generalization to unseen inputs is structurally zero.

The failure mode is the polymorphic adversary, which has been understood in the malware literature since at least the Mistfall engine in the early 2000s. Any transformation of the malicious payload that preserves semantics but changes bytes evades the signature. The defensive response has historically been to enumerate more signatures, faster, with broader contributor networks. The frontier response is that semantic-preserving transformations are now cheap to generate at scale. A signature database grows linearly with analyst effort; the attack space grows combinatorially with model capability.

2.2 Rule-based input filtering

Web Application Firewalls such as ModSecurity (with the OWASP Core Rule Set), AWS WAF, Cloudflare's rule engine, and F5 ASM. Pattern matching on HTTP requests using PCRE, with logic for compound conditions: a rule fires if pattern A appears in the URL and pattern B does not appear in the headers and the source IP is not in a reputation allowlist. Snort and Suricata extend the same paradigm to network packets with stateful flow tracking. SIEM correlation rules in Splunk SPL or Elastic EQL add temporal compounding: these three events from the same host within five minutes.

The expressivity gain over raw pattern matching is conditional logic and temporal composition. The shared limitation is that the rules operate on the syntactic surface of the input. Semantically equivalent attacks evade the rules by syntactic variation. The deployed response has been the OWASP Core Rule Set and similar curated rule packs that try to enumerate common attack patterns at a higher level of abstraction (SQLi*, XSS*, LFI*), but the same enumeration constraint applies.

A second limitation, less commonly stated, is that compound rules with temporal predicates do not compose well. A SIEM rule that requires events A, B, and C within five minutes assumes the attacker is operating on a five-minute timescale. A frontier coding agent that completes a multi-stage exploit in seventy-one minutes, as observed in the V8 trajectory cited in Section 5, generates events that are correlated but not within the temporal window of any deployed correlation rule. The temporal locality is itself a heuristic.

2.3 Statistical anomaly detection

Anomaly detectors assume the existence of a feature space $\mathbb{R}^d$ and a notion of distance within it, then flag samples that lie far from the mean of training data. The basic methods include univariate z-score thresholds on individual features, Mahalanobis distance to account for feature covariance, density-based methods like Local Outlier Factor, and partition-based methods like Isolation Forest. One-class SVMs and Gaussian Mixture Models offer slightly more sophisticated boundaries but share the underlying setup.

These methods made meaningful progress over signature-based detection in the 2000s and 2010s because they could detect novel attacks, as long as the attacks deviated statistically from baseline in the engineered feature space. Network intrusion detection systems based on anomaly detection showed real gains in academic benchmarks. User and Entity Behavior Analytics (UEBA) products are commercial instantiations of the same architecture against authentication and access logs.

Two failure modes are well known. First, concept drift: the baseline distribution of normal traffic is non-stationary, so the model's notion of normal diverges from current reality and false-positive rates climb until the model is retrained. Second, evasion: an attacker who knows the feature space can construct attacks that lie close to the normal distribution while still achieving malicious effect, a phenomenon studied formally as adversarial machine learning since Biggio et al. (2013). Both failure modes are amplified at the frontier, and both have a darker form when the system being monitored is itself an autonomous agent whose behavior is generated by inference: the agent can probe the feature space implicitly by observing what gets flagged, and the baseline drifts continuously as the agent's task distribution evolves.

2.4 Tree ensembles

Random Forests, gradient-boosted trees (XGBoost, LightGBM, CatBoost), and related ensemble methods are widely used in fraud detection (Stripe Radar, Sift), malware classification, and many specific intrusion-detection settings. The EMBER benchmark's baseline, for example, uses LightGBM over engineered PE features; MalConv is better treated separately as an example of end-to-end raw-byte neural malware detection, and we return to it in Section 3 as a contrast to the ensemble case. Tree ensembles handle nonlinear interactions among features, are robust to feature scaling, and are fast to evaluate. They also admit interpretability through feature importance and SHAP values.

But trees are still operating on fixed features. They learn a nonlinear decision boundary in $\mathbb{R}^d$ , but $\mathbb{R}^d$ is engineered. The features for a malware classifier might include entropy of PE sections, presence of specific Windows APIs, packer detection results, import counts, mean instruction length, and similar quantities. These features are themselves heuristics, constructed from prior knowledge of malware behavior. An attack that operates outside those features, even if functionally equivalent to attacks the classifier was trained on, lies outside the classifier's discriminative space.

The architectural property to note is that tree ensembles do not learn representations. They learn decision boundaries over representations that were given to them. This is the same property as a regex, just applied to a richer feature space. The expressivity is bounded by the richness of the feature engineering, not by the data and compute available to the model.

2.5 Sequence models

Forrest et al.'s 1996 work on syscall sequence anomaly detection used short n-grams of system calls to characterize “normal” application behavior, with deviations flagged as intrusions. The line continued with HMMs over syscall sequences, n-gram models for command-line abuse detection, and Markov chains for state-transition modeling in protocol analyzers. Recurrent neural networks (LSTMs) on log sequences extended the line in the 2010s and remain deployed in some EDR products.

These methods relax the i.i.d. assumption of statistical anomaly detection by encoding short-range dependencies. They are limited by the Markov assumption itself: any dependency longer than the n-gram window or the recurrent state is invisible to the model. Modern agentic trajectories, as visible in the ExploitGym traces cited in Section 5, span dozens of steps across heterogeneous abstractions and several different layers of the system. They are categorically beyond the reach of n-gram-style sequence detection, which sees only a sliding window.

LSTM-based variants partially relax the Markov bound but suffer from vanishing gradients on the long sequences that matter for monitoring. The transformer architecture, which we examine in Section 3, removes the fixed local Markov assumption within the context it is given; it does not remove all history limits, since context length, attention implementation, retrieval quality, and what is preserved across turns all continue to bound how far back the model can attend. The relaxation is partial but architecturally meaningful, and is the difference that separates the sequence-modeling era from the cognitive era.

2.6 Graph-based detection

Provenance graphs record the lineage of every object on a system: which process created which file, which user invoked which process, which network connection produced which downloaded artifact. The mature instantiations are systems like CamFlow, SPADE, and commercial EDR provenance backends. Detection algorithms over provenance include path-based queries (does any path from an untrusted source reach a sensitive sink?), centrality measures (which entities have anomalously high betweenness?), and community detection (does this subgraph match known malicious patterns?). The MITRE D3FEND and CALDERA frameworks formalize the entity-relationship vocabulary at the technique level.

Graph-based methods are the most expressive of the heuristic families because they capture relational structure that simpler models miss. They are also the most expensive: maintaining a provenance graph at scale requires either continuous streaming computation or periodic batch reconstruction, and the graphs grow without bound on long-running systems. Storage costs and query latencies make graph methods difficult to deploy at the sub-millisecond budget that pattern matching achieves.

The schema of the graph (what counts as a node, what counts as an edge, what entity types exist) is still hand-designed. An autonomous agent that operates within the graph schema in patterns that the detection algorithms have not been written to look for evades them. Graph-based methods are heuristic in the same way that the others are heuristic; they just operate at a higher level of abstraction.

2.7 Reputation and threat intelligence

Reputation systems (IP reputation, domain reputation, file hash reputation, certificate reputation) operate as distributed key-value stores of known-bad indicators contributed by sensors across an industry coalition or commercial network. An indicator that appears in the database triggers a block. Spamhaus DROP lists, Cisco Talos reputation, VirusTotal hash lookups, and STIX/TAXII threat intel feeds are representative instantiations.

The technique is operationally simple and effective against known threats. It generalizes to new threats only via the contributor network: a new indicator must be observed somewhere, attributed, and propagated before it becomes coverage. The latency of a reputation lookup at query time is microseconds. The latency of the contributor-network update from “indicator first observed” to “indicator distributed to defenders” is typically minutes to hours, sometimes days, sometimes never if the campaign uses one-shot infrastructure.

This is the most concrete instance of the broader pattern: monitoring updates at human-coordinated speed, the systems being monitored operate at machine speed.

2.8 The shared architectural commitments

Despite the diversity of techniques, every deployed detection family shares three properties that matter for the argument here.

Property 1

Fixed feature space

A regex operates on bytes; an HMM operates on syscall sequences; a graph algorithm operates on a hand-designed graph schema; a tree ensemble operates on engineered numeric features. None of these methods learns its own representation from raw data. The features are human choices, made before the behavior is observed. This is the load-bearing constraint. Every other limitation derives from it.

Property 2

Locality of decision

A pattern match fires on input that is exactly equal or syntactically similar to a known pattern. An anomaly detector fires on input that is far from cluster centroids. A tree ensemble fires on input that crosses learned axis-aligned splits. In each case, the decision is a function of the input's position in a low-dimensional projection designed in advance. Distant context is not directly accessible to the decision function unless an analyst engineered it into the feature space first.

Property 3

Discrete update cadence

A new rule is shipped; a new ruleset is released; a new model checkpoint is deployed. The cadence of update is set by engineering and human analysis cycles, on the order of days to weeks for commercial products and months for standards bodies. There is no online gradient applied to the function in response to deployment-time observations.

These three properties are not independent failure modes. They are the same failure mode at different abstraction layers. The paradigm assumes that the relevant signal exists in a space that humans have characterized in advance, that it can be discriminated by local rules in that space, and that human-paced updates can keep the rules current. None of these assumptions is approximately true against systems whose generative process produces behaviors on a continuous, learned manifold and whose update cadence is set by sampling, not by release management.

3. A mechanistic comparison to model architecture

The previous section enumerated detection algorithms. This section compares them, mechanistically, to the architecture of the systems they will increasingly be deployed against and, increasingly, alongside. The comparison is not about effectiveness; it is about kind. The question is what kind of computation each performs, and what that kind permits or forecloses.

3.1 Unit of input

A heuristic operates on tokens defined by humans. The tokens of a regex are bytes; the tokens of a WAF rule are URL parameters and header values; the tokens of an HMM are system call IDs; the tokens of a tree ensemble are scalar feature values. The tokenization is the boundary between the raw input and the model, and it is fixed at design time.

A transformer-based model operates on tokens defined by a learned tokenizer (BPE, SentencePiece, Tiktoken) over raw input bytes or characters. The tokenization is itself a function of the training distribution, and is not hand-designed. The downstream layers receive embeddings: dense vectors in $\mathbb{R}^d$ that are learned representations of the token. The embeddings carry semantic structure: similar tokens have nearby embeddings, and the similarity is induced by the training objective rather than by a hand-coded similarity metric.

The implication: a heuristic that operates on byte-level tokens cannot benefit from the semantic structure that surrounding context induces. The strings s3lect, select, SELECT, and sel%65ct are four different tokens (or sequences of tokens, since the learned tokenizer may split them differently) to a regex and to the model's input layer. The difference is what happens downstream. A learned model can often map these variants to similar internal representations when the surrounding context supports the SQL interpretation; a byte-level rule must enumerate the surface forms or rely on normalization. A monitor operating only at the byte surface sees four different patterns and must enumerate each. A monitor operating on downstream representations can, in principle, see one pattern with four surface forms.

3.2 Feature space construction

This is the most important architectural distinction and we state it directly. Heuristic detection lives in a feature space that humans have chosen. Learned-model detection lives in a feature space that has been chosen by gradient descent against an objective.

Concretely: a random forest for malware classification uses features like “section entropy,” “import count,” “presence of CreateRemoteThread,” “PE header anomalies.” These features were chosen by analysts who understood malware. A model trained end-to-end on raw bytes (MalConv is the canonical reference) learns features that the gradient finds useful. Some of those features may coincide with the analyst's features; others will be features the analyst did not think to engineer; still others will be features that no human would recognize but that nonetheless improve discrimination on the test distribution.

The trade-off is interpretability. Engineered features are interpretable by construction. Learned features are not, generally, interpretable in the same way, although interpretability research has made non-trivial progress (sparse autoencoders, natural language autoencoders, attribution graphs). The monitor's choice between engineered and learned features is therefore a choice between known coverage and unknown coverage. Against a static threat model, engineered features may be sufficient. Against an adaptive system, the unknown coverage is the load-bearing asset, because a learned feature space does not expose its structure explicitly to the monitored system. The feature space can still be probed or approximated through interaction (this is essentially the project of black-box adversarial machine learning), but the attacker is inferring a high-dimensional learned boundary rather than reading a hand-specified rule set, and the cost asymmetry favors the defender in a way that hand-engineered features do not.

3.3 Context aggregation

A signature, a regex, or an n-gram model is local: the decision at a given position depends on a fixed-size neighborhood. A WAF rule examines an HTTP request; it does not examine the history of requests from the same client. An HMM has a finite state and a finite history. A tree ensemble's decision at a sample is independent of other samples.

A transformer aggregates context via self-attention. Every token in the input attends to every other token, weighted by learned compatibility scores. The decision at any position can depend, in principle, on any other position in the context window. The mechanism is $O(n^2)$ in sequence length but parallelizable, and it captures dependencies that no Markov-bounded model can capture.

AWindowed

Only the last $N$ tokens reach the decision point; everything older is invisible.

BSelf-attention

Every prior token contributes, with weight set at runtime by learned relevance.

Figure 2. Information flow under a windowed sequence model versus self-attention. The windowed model has access only to a fixed local neighborhood; the attention mechanism has access to every prior token, weighted by learned relevance. For action chains that span dozens of heterogeneous steps, this is the difference between seeing the chain and seeing only the most recent link.

The implication: agentic trajectories span dozens of steps across heterogeneous abstractions. A monitor that can only see a fixed local window cannot connect step 3 of the trajectory to step 30. A monitor with attention-style global aggregation can, in principle, learn that step 3 was the setup for step 30, even when the surface form of the two steps is unrelated.

3.4 Decision boundary

A heuristic produces a decision via a function whose structure was specified in advance: a regex match, a linear threshold, a series of axis-aligned splits, a Mahalanobis distance threshold. The shape of the boundary is constrained by the function family. A linear classifier has hyperplane boundaries; a tree ensemble has axis-aligned step boundaries; an isolation forest has hierarchical splits; a one-class SVM has kernelized but still constrained boundaries.

A neural network produces a decision via a function whose structure is implicit in the architecture but whose parameters are learned. The decision boundary can have arbitrary shape in the input space, limited only by the network's representational capacity. For a frontier-scale model, the capacity is effectively unconstrained for any practical input distribution. This is not unambiguously a benefit; greater capacity also means greater risk of fitting training-set artifacts that do not generalize, and the decision boundary is harder to characterize. But the architectural property is what we want to surface: heuristics are bounded by the function family their designers chose; learned models are bounded by data, compute, and architecture, in that order.

3.5 Update cadence and feedback

A heuristic is updated by humans editing rules and releasing them. The cadence is the cadence of human analysis: days for commercial detection vendors with mature pipelines, weeks for in-house security teams, months for cross-industry threat-intelligence standards.

A learned model has two distinct update channels, and these are worth separating carefully. The first is weight updates through gradient descent. This can in principle be applied continuously as new labeled data arrives (online learning, preference fine-tuning, low-rank adaptation, reinforcement learning from labeled outcomes), and is what most deployed monitoring systems will rely on for substantial policy revision. But weight updates are not what most deployed frontier-model agents are doing in production. The second channel is inference-time adaptation: the agent's behavior changes from one interaction to the next not because its weights moved but because its context, memory, retrieved knowledge, tool feedback, and observed system state are different. This is the channel that produces the moment-to-moment behavioral variation that monitoring has to keep up with. Some specialized deployments do additionally apply periodic or online weight updates; most do not.

The implication for monitoring is that any monitoring system that ships rules at a release cadence is running on a strictly slower clock than either channel. If the system it monitors adapts at inference time through every new turn of context, the monitor's static rules will not characterize the agent's behavior tomorrow as well as they characterize it today. The gap widens whether or not weights are being updated.

3.6 The latency-expressivity tradeoff

Each detection method occupies a point on a tradeoff between latency (how fast a decision is produced) and expressivity (how rich a decision the method can produce). The deployed ecosystem distributes across this surface.

Figure 3. (Schematic.) Detection methods placed on the latency-expressivity surface. The coordinates are illustrative rather than measured: the figure conveys the relative clustering of method families, not specific latency or expressivity values. Heuristic methods cluster in the upper left (fast, limited). Frontier learned models sit in the lower right (expressive, slow). Linear probes on activations occupy an intermediate position: expressive enough to read learned features, fast enough to run alongside inference. No single method approximates the upper-right ideal, and the design question of how to compose methods across this surface is an open one.

3.7 Generalization

Heuristics generalize by interpolation within the feature space they have been given. A signature for one variant of a malware family covers nearby variants whose bytes differ only in regions the signature does not match against; a tree ensemble trained on engineered features generalizes to held-out samples that share those features. The generalization is real but its reach is bounded by the feature space itself. A sample that lies outside the feature space (because the relevant signal is encoded in features the analyst did not choose) is invisible to the heuristic regardless of how similar it is in the underlying input to known examples.

Learned models generalize through the latent geometry of their training distribution. Two inputs that occupy nearby positions in the learned embedding space, even when they differ substantially in surface form, are treated similarly by downstream layers. This is what allows models to handle paraphrases, encodings, and reformulations without explicit enumeration. The generalization is also bounded, but the boundary is set by data and model capacity rather than by analyst foresight. In practice, this means learned-model generalization covers categories of inputs that heuristic generalization cannot cover at all, while remaining vulnerable to distribution shifts that produce inputs outside the training manifold.

3.8 Adversarial probing

A different axis concerns how an adversary learns the discriminative structure of the monitor in order to evade it. Heuristics are typically probed by reading the rules: signature databases are public or partially public, rule sets ship in open packages, and even closed-source rule engines tend to leak their structure through error messages, behavioral observation, or reverse engineering of the deployed binary. The adversary's task is essentially lookup, and once the rules are known, designing an evasive variant is a syntactic transformation.

Learned models are probed differently. The adversary cannot read the model's parameters, but can in principle reconstruct the decision boundary through repeated query interaction (gradient-free black-box attacks, knowledge distillation against the model's outputs, transferability from surrogate models). The literature on adversarial machine learning has characterized this regime in detail. The relevant property for monitoring is that probing a learned boundary is far more expensive than probing a rule set: the attacker must spend queries proportional to the dimensionality and curvature of the boundary, and the boundary itself can shift between probing and exploitation if the model is updated in the interval. The cost asymmetry is not absolute, but it is substantial, and it favors the defender.

3.9 Summary

The cumulative comparison across these axes is summarized in the table below.

Axis	Heuristic detection	Learned-model architecture
Tokenization	Human-designed boundaries (bytes, syscalls, parameter names)	Learned tokenizer (BPE, SentencePiece)
Feature space	Hand-engineered, fixed at deployment	Gradient-learned, emerges from training
Context aggregation	Local: regex window, n-gram, fixed lookback	Global: self-attention across full context
Decision boundary	Hyperplane, axis-aligned splits, distance threshold	Arbitrary shape in learned latent space
Update cadence	Discrete release (days to months)	Inference-time adaptation through context, memory, tool feedback, and system state; optional periodic or online weight updates in specialized deployments
Latency budget	Microseconds to single-digit milliseconds	Tens of milliseconds to seconds
Generalization	Exact match or feature-space interpolation	Distributional, via shared latent geometry
Adversarial probing	Direct: rules can be read, rule sets often public, evasion is syntactic transformation	Indirect: boundary must be reconstructed through query interaction, cost scales with dimensionality

Table 1. Heuristic detection and learned-model architecture compared across the axes that matter for monitoring inference-time generative systems. Heuristics dominate on latency, interpretability, auditability, and fail-closed enforceability. Learned models dominate on the axes that bear on contextual legitimacy: representation, context aggregation, decision-boundary shape, generalization, and adaptivity. The two families have largely non-overlapping advantages and a viable monitor will compose them.

On the axes that bear on monitoring continuous generative behavior (representation, context, generalization, adaptivity), the heuristic family operates with less expressivity than the learned-model family. The advantages of heuristics on latency, auditability, and enforceability are real and remain operationally important. The argument of this paper is not that the heuristic family should be replaced by the learned-model family but that it cannot, on its own, answer the contextual-legitimacy question that authorized-agent monitoring poses.

4. The properties of inference-time generative systems

For the purposes of this paper, we are interested in four operational properties of inference-time generative systems that distinguish them from rule-driven behavior. These properties are the ones that interact with monitoring; they are not a complete account of what these systems are.

4.1 Distributional behavior

A rule-driven system, given a context, produces a deterministic action. An inference-time generative system, given a context, produces a distribution over actions. The action that gets emitted is a sample. This is true at every level of granularity. Token-level next-action sampling is the surface case. Plan-level sampling, where the system decides between strategies, is a higher-order case. Both happen, and both have variance.

This matters for monitoring because the unit of analysis shifts. A signature-era monitor reasons about a specific observed action: was this byte sequence malicious? A monitor of a generative system must reason about the distribution from which the action was sampled: was the policy that produced this action consistent with the policy expected of an agent in this role? Two actions can be byte-identical but come from policies whose other samples differ in ways that matter. Two actions can look entirely different but come from policies that agree everywhere it counts.

The distinction between tendency and variance becomes load-bearing. A system's tendency is the mode of its policy under a given context. Its variance is the breadth of the policy around that mode. Heuristic monitoring calibrates against tendency because tendency is what gets logged; instruction hijacking and emergent misbehavior exploit variance because variance is what evades the calibration. A finance-domain agent that 99% of the time pulls account balances may, 1% of the time under a specific hijacked context, exfiltrate them. The 1% is a distributional tail event of the same policy, not a separate malicious tool. Monitoring that treats it as a separate tool is looking for an artifact that does not exist.

4.2 Runtime adaptation

A scripted process, once launched, runs to completion or fails. A generative agent revises its plan based on environmental feedback. This is qualitatively different from “trying again with different parameters,” which is a brute-force loop with no internal state. The agent updates its belief about the environment between actions, and its next action is conditioned on the updated belief.

In the recent ExploitGym evaluation of frontier coding agents on real-world exploitation tasks ([1]), this property is visible in the trajectory traces. An agent attempting to exploit a V8 type-confusion vulnerability first attempted a direct file read against /flag, observed that the file was mode 400 and owned by root, abandoned the direct-read strategy, and pivoted to a memory-corruption pipeline that culminated in a setcontext-based redirection through system("/challenge/catflag"). The pivot was not pre-planned. It was selected at runtime based on the failure of the prior strategy. The full chain ran through approximately twelve phases in seventy-one minutes.

Monitoring designed against scripted behavior treats the trajectory as a sequence of discrete events to be matched against indicators. Monitoring against an adapting agent must treat the trajectory as a state machine whose transitions are themselves driven by inference. The latter is a different object and requires different observation. Logging events is not enough; the monitor must observe the agent's belief updates, which means observing intermediate reasoning, not just final actions, with the caveats about reasoning faithfulness that we address in Section 8.

4.3 Compositional escalation

Most consequential agentic trajectories that succeed in realistic environments succeed by composing primitives, not by introducing new ones. The component techniques in ExploitGym (out-of-bounds reads, return-oriented programming gadgets, type confusion, heap grooming, partial-pointer overwrites) have been known for years. The novelty is in the trajectory-level composition. The V8 trajectory cited above composes recognizable techniques into a single working exploit. Many of the component techniques are individually familiar to existing tools and analysts. The monitoring difficulty is that the risk lies in their global composition, where each local action can appear explainable while the overall plan is malicious or unauthorized.

This is the failure mode that pattern-matching monitoring most underestimates. A monitor that detects each primitive with 99% reliability and 1% false positive rate, applied to a nine-stage composition, does not produce 99% detection of the composition. It produces noise on each stage, and the operational response (alert fatigue, threshold raising, primitive whitelisting in known-benign contexts) degrades coverage faster than the primitives accumulate.

ExploitGym makes this point quantitatively. With standard mitigations enabled (ASLR, V8 heap sandbox, KASLR, user namespace restrictions), the strongest evaluated agents still produced working exploits on 37 userspace tasks, 20 V8 tasks, and 12 kernel tasks across the model fleet ([1]). The result illustrates the broader point: effective trajectories often arise from composing familiar primitives into a path that crosses abstraction layers, rather than from a single novel technique that an indicator can cleanly isolate.

4.4 Cross-layer reasoning

The fourth property is that generative agents reason across the abstraction layers of the system simultaneously. A signature engine operating on HTTP request bodies cannot see register state. A kernel exploit detector cannot see the prompt that caused the agent to attempt kernel exploitation. The historical division of monitoring along OSI layers, or along host/network/application boundaries, was sensible when the systems being monitored also worked within those layers.

A generative agent can plan at the natural language layer, translate the plan into a sequence of tool calls at the function-invocation layer, produce code at the program-text layer, and execute that code at the runtime layer. The same plan exists at all four layers simultaneously. Observing it requires correlating across layers, which means the monitor must also operate at all four layers, with shared state. This is the property that most strongly forces a re-architecting of where monitoring sits in the stack. A WAF cannot see the model's chain of thought. A code execution sandbox cannot see the prompt. A model-output classifier cannot see the kernel syscall trace. Monitoring at the frontier requires sitting on the abstraction layer where these signals are observable simultaneously, which is the inference loop of the model itself.

5. Empirical signals from the frontier

Two recent published evaluations give quantitative texture to the asymmetry.

ExploitGym

[1] evaluated frontier coding agents on 898 real-world exploitation instances drawn from OSS-Fuzz, Google's V8 engine, and the Linux kernel. Each instance ships with a proof-of-vulnerability input that triggers a bug; the agent's task is to extend that input into a working exploit that achieves unauthorized code execution. Under conditions where deployment-time safety filters were disabled through structured-access programs, Anthropic's Claude Mythos Preview produced working exploits on 157 instances, and OpenAI's GPT-5.5 on 120, within a two-hour time budget per task. With standard system-level mitigations enabled, success rates dropped substantially but did not approach zero: 25 userspace, 17 V8, and 3 kernel successes for Mythos Preview; 10 userspace, 3 V8, and 8 kernel successes for GPT-5.5. The kernel results are the most diagnostic. Kernel exploitation requires reasoning about shared heap layout under concurrent process noise, race-condition timing, and build-specific behavior, properties that previously required years of human specialization to develop fluency in.

The authors are explicit that this is a capability-elicitation study under controlled conditions, and the methodological caveats matter: deployment-time content filters were disabled for the measurement, prompts were tuned per agent, and some failures may reflect alignment-driven refusals rather than capability limits. We take the objective claim, that frontier agents produce working exploits on a non-trivial fraction of real-world memory-safety vulnerabilities including under standard mitigations, as load-bearing for the argument here. The point is not the specific success rate; the point is that the success rate is not zero, and that the trend in capability is unambiguous.

CAI Attack/Defense

[2] ran the complementary experiment. Pairs of autonomous agents, one offensive and one defensive, competed concurrently on twenty-three Hack The Box battlegrounds under a fifteen-minute time budget. Both agents used the same underlying model. The headline result is counterintuitive: under unconstrained metrics, defense outperformed offense (54.3% patching success versus 28.3% initial access, p = 0.0193). Under operational constraints, where defense must patch and maintain service availability and prevent attacker access, defensive success rates collapsed to 23.9% and 15.2%, statistically indistinguishable from offensive success.

The collapse is the key signal. The defensive agents knew how to identify vulnerabilities, and knew how to patch them. They did not know how to do so without breaking the systems they were defending. The cited trajectory shows blue-team agents modifying SSH configurations, changing service settings, and applying patches that disrupted availability. These are actions that, in isolation, look like defense but that an operator cannot accept. This is the same failure mode as monitoring that blocks too aggressively: the safeguard becomes the denial of service.

The mechanistic reason matters and connects to the alignment framing of this paper, though we should be careful about the strength of the inference: the CAI paper reports the metric (defensive effectiveness collapses under operational constraints) but does not directly establish the mechanism. Our reading, which the published trajectories are consistent with but do not strictly entail, is that the defensive agents had a stated objective (“patch the vulnerability”) that was misaligned with the operational objective (“preserve the service while removing the vulnerability”), and that they lacked a model of what the system was supposed to be doing against which a candidate intervention could be evaluated. Whether this is the dominant mechanism or one factor among several is not something the paper resolves. What is clear empirically is that capable, autonomous, and authorized defensive action does not by itself produce defense aligned with the operator's interests, and that the gap between the two requires a model of legitimate operation that the experiment did not provide.

Read together, the two results converge on a single picture. Offensive capability has reached a level where it composes recognized primitives into working exploits at a tempo (tens of minutes to a few hours) that exceeds any human-mediated monitoring cycle. Defensive capability is at parity for detection but loses parity the moment operational constraints are imposed, because the system lacks a model of legitimate use against which to evaluate candidate interventions. The bottleneck is not detection. The bottleneck is the absence of an alignment criterion.

6. Observation surfaces and their limitations

We characterize the surfaces at which a monitoring layer can observe the behavior of a deployed agent. These are points at which the agent's internal state becomes externally visible, and at which a monitoring layer can in principle make a decision about whether the agent's action is consistent with its sanctioned role. We are not prescribing a product architecture; we are describing where the signal exists and what its limitations are at each location.

Figure 4. Observation surfaces in a monitored inference loop. Surface 2 is decomposed into the verbalized chain of thought (text, lower-fidelity, observable only where the host exposes it as rationale, scratchpad, or planning text) and the latent state (

\mathbb{R}^d

residual-stream activations, higher-fidelity but accessible only with white-box access to the model, and only through fallible interpretive instruments). The relationship between these two surfaces, and the gap between what they each report, is the central problem of Section 8.

6.1 Input

The input surface is the request, prompt, document, or other content that enters the inference loop. Heuristic monitoring at this surface (keyword filters, regex-based content rules, blocklists of known malicious patterns) covers a thin slice of the threat. The instruction-hijacking literature makes clear that the surface form of an adversarial input does not constrain its semantic effect. The same intent can be expressed in plaintext, in encoded form, in foreign language, in a system-prompt-shaped wrapper, in a markdown table, in an image's alt text, in a tool-result payload, in document metadata. Surface-form pattern matching at the input surface rapidly becomes incomplete once a technique is public, because semantically equivalent variants can be generated cheaply across encodings, languages, document formats, and tool-result channels.

What is structurally available at the input surface is semantic incongruity with sanctioned use. A request to a finance agent that asks for an account balance is congruent with the agent's role. A request that, after extensive small-talk, asks the agent to transfer funds to a new account, where neither the new account nor the rationale was established earlier in the conversation, is incongruent with the role even though the language is in distribution. The signal is in the conditional, not in the marginal.

6.2 Tool call

The tool-call surface is the function-invocation channel through which the agent affects the world. This is the surface at which the agent's plan becomes an action, and it is the highest-leverage surface for monitoring because actions are typed and parameterized and therefore admit structural analysis that prose does not.

The relevant signal is not the tool call in isolation; it is the joint distribution of (current context, prior tool calls, current tool call) against a model of what the agent's role authorizes. A transfer_funds call after a sequence of get_balance and lookup_account calls is consistent with a finance assistant's sanctioned operation. A transfer_funds call after a sequence of read_email and read_document calls is not, because the contextual setup did not establish the entity references that a legitimate transfer would require. The incongruity is in the conditional.

This places a substantial burden on the monitor: it must model not only the agent's normal behavior but also the structure of legitimate use within the organization. A naive monitor that treats all observed tool-call sequences as in-distribution will be blind to legitimate-looking sequences that nonetheless represent incentive misalignment. We return to this in Section 7.

6.3 Reasoning: verbalized and latent

The third surface is the model's reasoning, where it is observable. We decompose this into two sub-surfaces because they have radically different properties.

Verbalized reasoning (text chain of thought). Where a model emits intermediate text during a problem-solving process, and where that text is exposed to the monitoring layer (as raw chain of thought, rationale trace, scratchpad output, or model-emitted planning text), it is in principle observable. Production systems vary substantially in what they expose: some surface raw reasoning, others surface only summaries or post-hoc rationales, others surface nothing beyond final outputs and tool calls. Where it is exposed, it is also the lowest-fidelity signal of the model's actual reasoning, for reasons established empirically across several years of work on chain-of-thought faithfulness. Lanham et al. ([4]) showed that models often produce final answers that are inconsistent with their stated reasoning, that the stated reasoning can be perturbed without affecting the final answer, and that the reasoning is sometimes a post-hoc rationalization rather than a description of the computation that produced the answer. More recent work on reasoning-specialized models ([5]) has shown that the unfaithfulness gap persists even when models are explicitly trained to reason step-by-step before answering. The verbalized chain of thought, where available, is real text emitted by the model and is often informative, but it is not, in general, a faithful description of the underlying computation.

Latent state (residual-stream activations). With white-box access to the model, a monitor can read the activations $h_\ell \in \mathbb{R}^d$ at any layer $\ell$ and any token position. These activations are the model's internal state. They encode information that the model has not necessarily verbalized, including, as Fraser-Taliente et al. ([3]) demonstrate explicitly, beliefs that the model holds about its situation but does not articulate. The activation stream is more causally proximate to the computation than the verbalized chain of thought, because it is the substrate on which the computation actually happens. This proximity should not be confused with direct faithfulness: any monitor observes the latent state through fallible interpretive instruments such as linear probes, sparse autoencoders, or natural language autoencoders, each of which introduces its own error. Anthropic's own work on NLAs explicitly notes that the resulting explanations can hallucinate, and that claims about internal reasoning are harder to verify than claims about transcript content ([3]). The latent state is therefore more causally proximate than the verbalized output, but still interpreted through instruments that can fail.

The architectural difference between these two sub-surfaces is illustrated in Figure 5.

Figure 5. The fidelity gap between verbalized reasoning and latent-state observation. Text emitted from the output layer is generated by sampling and is subject to the model's choice of what to verbalize; activations at intermediate layers are the substrate of the computation and can be read (with white-box access) through learned interpretive instruments such as linear probes, sparse autoencoders, or natural language autoencoders. These instruments are themselves fallible and can miscalibrate or hallucinate, so latent-state monitoring is more causally proximate to the computation than verbalized reasoning rather than strictly faithful to it. The gap between the two channels, and the residual uncertainty in interpreting either, is one of the central limitations of monitoring (Section 8).

The practical implication for monitoring is that if the monitor has white-box access to the model, the latent-state surface offers higher-fidelity signal than the verbalized-reasoning surface on every axis except cost and interpretive overhead. Linear probes on a single residual-stream layer are cheap to evaluate, well within the latency budget of inference itself, and can detect features (such as evaluation awareness, deception, or specific concept activations) that the model's verbalized output does not surface. Sparse autoencoders and natural language autoencoders extend this to richer feature decomposition at higher cost. The fidelity gain is real but not absolute, since each of these instruments interprets the activations through a learned mapping that can itself be miscalibrated or hallucinatory. The verbalized chain of thought remains useful as a low-cost public-channel observable, but it cannot be treated as the primary signal for any decision that the model has incentive to obscure.

This has direct implications for the architecture of monitoring. A monitoring layer that operates only on the public API of the model is structurally limited to the verbalized surface. A monitoring layer integrated with the inference runtime, or deployed by the organization that hosts the model, can access the latent-state surface. The two regimes are not equivalent, and any honest assessment of monitoring capability has to be clear about which regime applies.

6.4 Output

The output surface is the final response the agent emits, before it leaves the inference boundary. Defense at this surface is the last line and the most expensive to operate, because the model has already committed compute to produce the output, and any block at this stage incurs the full inference cost.

The relevant signal at the output surface is incongruity between the output content and the sanctioned-use criteria for the agent's role. A customer-support agent that begins to enumerate other customers' account numbers in its response is producing content that is technically well-formed and would pass any syntactic check, but is semantically discontinuous with the prior turns and inconsistent with what the agent is authorized to discuss. This is the same conditional-distribution problem as the tool-call surface.

6.5 Why these surfaces

The decomposition corresponds to the points at which the agent's internal state becomes externally observable. Input is the boundary at which user intent enters the model. Tool call is the boundary at which model intent leaves the model as structured action. Reasoning is the substrate on which the computation happens, with two distinct sub-surfaces (text and latent) that differ in fidelity. Output is the boundary at which the model commits to a response. Anywhere else, the agent's behavior is observable only through the model's weights, which are typically not interrogable at runtime. A monitor that covers all of these surfaces sees the full external state of the inference loop. A monitor that covers fewer has blind spots that map directly onto attack categories. Coverage of input alone misses tool-call abuse. Coverage of output alone misses instruction hijacking at the input. Coverage of verbalized text alone misses encoded reasoning. The surfaces are jointly necessary; the question is whether the monitor can access them at sufficient fidelity.

7. The design space and its forcing functions

The argument up to this point has identified what monitoring at the frontier cannot be (heuristic enumeration over fixed feature spaces) and what surfaces it must observe (input, tool call, reasoning in its two distinct modes, output). We do not propose a specific architecture. Instead we characterize the design space along which any candidate architecture must be located, and the adversarial dynamics that any candidate must be evaluated against. The intellectual contribution of this section is the identification of the forcing functions; the resolution is an open research problem.

7.1 Three forcing functions

Any monitoring system for inference-time generative agents is constrained simultaneously along three axes, and these axes do not commute: a choice on one constrains the available choices on the others.

Property 1

Expressivity

The monitor must operate over a representation rich enough to discriminate context-conditional incongruity rather than only marginal pattern violations. This forces the monitor toward learned representations, because hand-engineered features cannot capture the conditional distribution required to evaluate whether an authorized action is congruent with the agent's sanctioned role.

Property 2

Latency

The monitor must operate at the latency of the system being monitored. Published benchmarks for production LLM serving establish 10–100 ms per token depending on scale, with multi-step agentic operations stacking to seconds. Monitoring overhead must be small relative to inference latency, or operators will not accept it.

Property 3

Adaptivity

The monitor must improve from feedback at a cadence that matches the cadence at which the systems it monitors change. Frontier-model agents adapt at inference time through context, memory, retrieved knowledge, tool feedback, and observed system state, while heuristic rules update at human release speed. A monitor that ships rules at release cadence runs on a slower operational clock than the behavior it is trying to characterize; the constraint forces continuous incorporation of operator feedback, though not necessarily online weight updates.

These three constraints are in tension. Higher expressivity costs latency. Higher adaptivity costs robustness against the dynamics described in Sections 7.2 and 7.3. The design space is the set of points that satisfy all three constraints simultaneously; what is interesting is that this set is not empty, but it is also not large, and the points within it differ substantially in how they trade off the residual risks. We deliberately do not enumerate the points. The research question is what the manifold of acceptable monitoring architectures looks like, not what specific instantiation occupies which point.

7.2 The escalation exhaustion attack

Any monitoring architecture that resolves the latency-expressivity tension by varying its cost with input characteristics creates an adversarial dynamic that simpler architectures do not face. If the monitor's cost depends on what it observes (cheaper for the typical case, more expensive for ambiguous cases), then an adversary who understands the architecture can craft inputs that systematically push observation into the expensive regime, raising the monitor's mean cost to a level where the monitor itself becomes a denial-of-service liability.

This is structurally analogous to algorithmic complexity attacks against hash tables (Crosby and Wallach, 2003), regex engines (Cox, 2007), and other variable-cost data structures, but applied to the monitor's compute budget rather than to the application's. The arithmetic is straightforward: if cheap-path latency is $c$ and expensive-path latency is $C$ , and the adversary can force a fraction $f$ of observations onto the expensive path, mean latency is approximately

\bar{\ell} \approx (1 - f) \cdot c + f \cdot C

which approaches $C$ as $f$ grows. For realistic ratios ( $C/c \approx 10^2$ to $10^3$ ), the attack is asymmetric in the adversary's favor.

The architectural implications of this attack are non-trivial and the literature has not converged on a clean resolution. Any defense involves trade-offs among rate limiting (which can be bypassed by distributing the attack), caching (which can be defeated by input mutation), provisioning (which converts the attack to an additional-cost problem rather than eliminating it), and adversarial detection of the attack itself (which is a meta-monitoring problem with its own dynamics). Each of these has been studied in the broader resource-exhaustion literature but their composition in the context of agent monitoring has not been characterized. We flag this attack as a research priority. It is not addressed by the existing autonomous-defense literature ([2]), and it does not appear in any taxonomy of adversarial machine learning attacks we are aware of, despite the fact that any deployed variable-cost monitor will eventually face it.

7.3 The baseline displacement attack

A different adversarial dynamic emerges from the third forcing function. Any monitor that adapts to feedback is, by construction, susceptible to having its adaptation shifted by the systems it monitors. The classical version of this concern is data poisoning, in which an adversary injects training samples designed to shift the learned function in a direction that benefits the adversary. The agent-monitoring version is structurally similar but more subtle, because the adversary need not inject synthetic samples: the agent's own behavior, observed during normal operation, contributes to the monitor's calibration.

The dynamic is most acute under two conditions. The first is bootstrap: a monitor with no history is heavily influenced by its earliest observations, so any adversarial influence during the bootstrap window has outsized effect. The second is the long-tail regime: even a mature monitor that has converged on a stable baseline must incorporate genuinely novel behavior at some point, and the policy for distinguishing genuinely-novel-and-legitimate from novel-and-malicious is itself a discrimination problem with no clean general solution. We are not aware of any work that resolves this cleanly for the specific setting of authorized-agent monitoring. The broader literature on robust online learning, certified defenses against poisoning, and influence-function analysis provides useful machinery, but the operational constraints of agent monitoring (high throughput, continuous adaptation, adversaries with knowledge of the monitor's structure) place demands on this machinery that go beyond what has been demonstrated.

7.4 The white-box / black-box divide

A third constraint cuts orthogonally to the three forcing functions above. Section 6.3 established that the latent-state surface offers higher-fidelity signal than the verbalized-reasoning surface (modulo interpretive overhead), but the latent-state surface requires white-box access to the model. This produces a stark bifurcation in the design space:

–A monitoring layer that operates against a closed-source model accessed through an API is restricted to the verbalized and behavioral surfaces. The latent-state channel is inaccessible. The monitor inherits all the limitations of unfaithful chain of thought (Section 8.1) and steganographic chain of thought (Section 8.2). Its strongest available signal is conditional incongruity at the behavioral surface.
–A monitoring layer that has white-box access (e.g., the model is hosted by the same organization that deploys the monitor, or the model provider exposes activation hooks) can access the latent-state channel. This dramatically changes the failure mode profile: deception detection becomes a linear-probability question rather than a text-analysis question; encoded reasoning is partly defended against by reading the substrate of the encoding directly; evaluation awareness becomes observable.

These are not minor implementation differences. They are different architectural regimes with different threat models and different research agendas. Most production deployments today operate in the black-box regime because most production agents call third-party model providers. Research on white-box monitoring is mostly conducted by model providers themselves and by interpretability labs with research access.

It would be a mistake to conclude from this that the black-box regime is structurally compromised. That framing is too strong, and it has the unfortunate consequence of writing off the regime that the overwhelming majority of enterprise deployments actually operate in. The more accurate framing is that the black-box regime has a different set of available signals and a different research agenda than the white-box regime, and that the research agenda is non-trivial.

The signals that remain available at the API boundary are substantial even without activation access. The behavioral surface (tool calls, outputs, action sequences) is intact and admits the full battery of techniques discussed in Sections 6.2 and 6.4. The verbalized chain of thought, despite its faithfulness limitations, is still informative when treated as one signal among several rather than as authoritative. Beyond these surfaces, several research directions are emerging that are specifically suited to black-box deployments and that warrant more attention than they currently receive.

Provenance-aware decision auditing

A monitor maintains a graph of how untrusted context (retrieved documents, tool returns, external observations) influences each agent decision, and verifies before execution that the decision is justified by trustworthy evidence. ARGUS ([8]) instantiates this approach with an influence provenance graph and reports reducing attack success rate to 3.8% while preserving 87.5% task utility on a benchmark of context-aware prompt injection attacks. The technique is black-box-compatible because the graph is built from the agent's observable inputs and tool calls rather than from its internal state.

Action-level causal attribution

Rather than asking whether an input looks malicious, the monitor asks why a particular tool call was produced, distinguishing calls supported by user intent from calls causally driven by untrusted observations. CausalArmor ([9]) frames indirect prompt injection as a dominance shift in attribution from the user request to an untrusted segment, and uses span-level ablation to detect the shift cheaply. AgentSentry ([10]) models the multi-turn version of the same phenomenon as a temporal causal takeover, performing counterfactual dry-run re-executions at tool-return boundaries to localize the point at which injected context becomes the dominant driver of the next action. Neither approach requires white-box access.

Multi-agent cross-examination

A second model, independent of the agent being monitored, interrogates the agent about specific decisions in its trajectory. The two models do not share weights or context, so the interrogator's questions probe what the agent will say under counterfactual framing rather than what it has already said. Inconsistencies between the agent's stated reasoning and its answers under cross-examination surface information that the original verbalization omits.

External environment state tracking

The actual changes the agent makes to the systems it interacts with, observed independently of the agent's own reports, provide ground truth against which the agent's stated reasoning can be checked. Environment-side instrumentation is a black-box-compatible analog of latent-state monitoring: it observes the consequences of the agent's computation without observing the computation itself.

Deception-based compromise detection

A monitor places traps (fake tools that should never be invoked, fake credentials that should never appear in legitimate traffic, parameter allowlists that should not be violated) inside the agent's tool interface. An agent hijacked by an injected instruction will, with high probability, touch one of these traps, providing a high-precision compromise signal independent of the attack's language or surface form. AgentShield ([11]) implements this layered-trap approach and reports 90.7% to 100% catch rate on attacks that bypass the model's own safety training, with zero false positives across 485 normal-use tests.

Statistical sampling characterization

The output distribution of a black-box model under repeated sampling on equivalent prompts is itself informative. A model that has been instruction-hijacked or pushed off-policy will often have a different sampling distribution than the same model in its calibrated regime.

Decomposed consistency checks

A monitor can decompose a complex agent action into sub-queries that an independent model evaluates in isolation, then check whether the agent's composite decision is consistent with what the decomposed evaluations would predict. This approximates the latent-state observation that the agent has changed its policy mid-trajectory, without requiring access to the latent state itself.

None of these techniques recovers the fidelity of direct activation probing. All of them are active research directions, and several have published evidence of meaningful efficacy on relevant threat models. The architectural reality that white-box access is commercially unavailable for the models that most agents are built on is not a fact about monitoring research; it is a constraint on the kind of monitoring research that is most operationally useful. The frontier question is therefore double-sided: how much latent-state-equivalent fidelity can be recovered through behavioral proxies at the API boundary, and how much fidelity is structurally unavailable no matter how sophisticated the proxies become. We do not resolve either side of this question. We name it as one of the central research problems that distinguishes monitoring as a research field from monitoring as a product category.

7.5 What the design space does not contain

It is worth being explicit about what is not in the design space, because the forcing functions are restrictive.

The design space does not contain pure-heuristic monitors. The reason is narrow and worth stating precisely: heuristics are expressivity-dominated by learned models for the specific task of judging contextual legitimacy, where the criterion is a conditional distribution over actions rather than a fixed partition. Heuristics are not dominated everywhere. They retain real advantages on reliability, auditability, enforceability, latency, and fail-closed behavior, and they remain indispensable as fast, deterministic controls layered around the learned components. OWASP's Excessive Agency mitigations ([7]) are largely deterministic for exactly these reasons: reduce extensions, reduce functionality, reduce permissions, require independent approval for high-impact actions. The claim is not that enumeration has no place in the design space, but that no monitor built solely on enumeration can resolve the contextual-legitimacy question that we have argued is the central one for authorized-agent monitoring.

The design space does not contain pure-frontier-LLM monitors. A monitor that runs a frontier model on every observation cannot satisfy the latency constraint at any reasonable throughput.

The design space does not contain monitors that ship updates only at release cadence. A monitor whose policy is fixed between releases cannot satisfy the adaptivity constraint against systems whose behavior adapts across turns through context, memory, retrieved knowledge, tool feedback, and observed state.

The design space does not contain monitors that read only one surface. Section 6 established that the surfaces are jointly necessary because they cover non-overlapping attack categories.

What remains is the intersection of these constraints, and the intersection is non-trivially populated but constrained. The architectures that inhabit it differ in how they trade off latency against expressivity, how aggressively they adapt against the displacement dynamic, how they handle the white-box / black-box divide, and how they defend against the escalation exhaustion attack. These are the design choices that distinguish viable monitoring architectures from non-viable ones. We are characterizing the existence of the design space and its boundaries; the resolutions are problems for the field.

7.6 Toward an evaluation framework

We have characterized a design space without instantiating it. A natural next question is how a candidate monitor positioned anywhere within this space would be evaluated. The existing benchmarks for the broader fields of adversarial machine learning, AI safety, and intrusion detection do not cover the constraints we have identified, and no standard evaluation methodology exists for monitoring authorized agents under the forcing functions and adversarial dynamics described here. We outline the shape of an evaluation framework that would; we do not propose the framework itself.

Metric	What it measures	Maps to
Latency overhead (Δ p50, p95, p99)	Increase in inference latency at multiple percentiles under realistic load	Latency forcing function
Alignment capture rate	True positive rate on a corpus labeled for congruence with sanctioned role	Expressivity forcing function
False positive on novel-legitimate	Incorrect flags on actions novel to training but congruent with the agent’s role	Expressivity forcing function
Adaptivity latency	Time from operator feedback to behavior change on subsequent similar inputs	Adaptivity forcing function
Bootstrap robustness	Resistance of the emerging baseline to deliberate displacement during cold start	Baseline displacement attack
Escalation distribution under load	Shift in observation cost distribution under traffic crafted to force escalation	Escalation exhaustion attack
Surface coverage	Fraction of known attack categories observable in principle given accessed surfaces	Architectural constraint
Cross-deployment generalization	Effectiveness when transferred to an organization with different role, tool surface, operators	Operational viability

Table 2. Metrics that an evaluation framework for agent monitoring would need to expose. Each maps to a constraint identified in Section 7. We do not specify thresholds; the appropriate operating point depends on the risk tolerance and deployment context of the organization using the monitor.

A few observations about what this framework would require methodologically. First, alignment capture rate is the most difficult metric to measure cleanly, because it requires a labeling protocol that captures the conditional structure of legitimate use, which is itself a research problem (Section 8.5). Second, the framework cannot rely on static test sets. The systems being monitored adapt at runtime; the systems used to attack them adapt too. A useful benchmark must include adversarial agents that themselves adapt to the monitor, in the same way that frontier agents adapt to environment feedback in evaluations like ExploitGym ([1]) and CAI A/D ([2]). Third, the bootstrap-robustness and escalation-distribution metrics require running the benchmark across a temporal window long enough for the relevant dynamics to manifest. A point-in-time measurement misses both of these failure modes by construction. Fourth, the framework should include both white-box and black-box scenarios, since the available signal set differs between them and a benchmark that only addresses one regime will systematically advantage architectures designed for that regime.

These are not optimization targets. They are quantities a benchmark should expose so that candidate architectures can be characterized along the design-space axes we have identified. We are not aware of any existing benchmark that meets this bar, and we believe constructing one is itself a research contribution comparable in importance to any specific monitoring architecture.

8. Deeper limitations: monitoring systems that obscure their own reasoning

The limitations discussed in Section 7 are architectural and have known mitigation strategies. The limitations discussed in this section are more fundamental, because they concern cases where the system being monitored has, or develops, the capacity to obscure its own reasoning from the monitor. These limitations are active research topics and we do not claim to resolve them; we are characterizing the failure modes that any honest assessment of monitoring capability has to acknowledge.

8.1 Unfaithful chain of thought

The verbalized reasoning that a model produces during inference is not, in general, a faithful description of the computation that produced its output. Lanham et al. ([4]) established this in 2023: perturbations to the verbalized reasoning often do not change the final answer, suggesting that the reasoning is not load-bearing for the answer. Subsequent work on reasoning-specialized models ([5]) confirmed that the unfaithfulness persists in systems explicitly trained to reason step-by-step before answering. The relationship between verbalized chain of thought and underlying computation is at best correlational, often post-hoc, and sometimes outright misleading.

For monitoring, this has direct consequences. A monitor that relies on the verbalized chain of thought as evidence that the agent has reasoned about a decision in a particular way is relying on an unreliable signal. The agent may have reached the decision by a different path, and the verbalized reasoning may have been generated to be palatable rather than to be accurate. The implication is not that text-based monitoring is worthless; it is that any decision sensitive to the model's reasoning needs to be cross-checked against more causally proximate or behaviorally grounded signals where available, usually the latent-state surface (Section 6.3) and the action surface (Section 6.2).

8.2 Encoded reasoning and steganographic chain of thought

A stronger version of the unfaithfulness problem is that models can, in principle, encode reasoning in the verbalized chain of thought in ways that humans cannot read but that the model itself can subsequently decode. This is steganographic chain of thought: the verbalized text serves as a cover channel for hidden computation. Early demonstrations of this phenomenon have appeared in the literature, and the architectural conditions that make it possible (the model produces tokens autoregressively, conditioned on its own previous tokens, with no constraint that those tokens be informative in a human sense) are general properties of the inference loop rather than artifacts of specific training procedures.

If steganographic chain of thought is present, a monitor reading the verbalized text sees fluent English that says one thing; the model, reading the same text, encodes a different computation. The monitor's understanding of the agent's reasoning is fundamentally disconnected from the agent's actual reasoning. This is not yet a demonstrated production-scale phenomenon, but the architectural conditions for it exist and the research literature treats it as an open concern. Latent-state monitoring is the most direct mitigation, because activations cannot be steganographically reformatted in the same way that text can: the encoding is the computation itself.

8.3 Subliminal learning and trait transmission

A separate failure mode arises from how models learn rather than how they reason. Cloud, Le, Chua, Betley, Sztyber-Betley, Hilton, Marks, and Evans ([6]) demonstrate that, under same- or closely related-base-model conditions, a student model trained on outputs from a teacher model can inherit traits of the teacher even when the training data is filtered to remove any explicit mention of the trait. In their key experiment, a teacher model with a particular preference (e.g., for owls) generates training data consisting of sequences of numbers; a student model trained on those number sequences inherits the teacher's preference for owls, despite the training data containing no semantic reference to owls or to any animal. They term this subliminal learning: the transmission of behavioral traits through training data that appears, by inspection, to be unrelated to the trait. The limitation is important: the authors report not observing the effect when teacher and student used different base models, so the result should not be read as a universal property of all distillation or fine-tuning pipelines.

The implications for monitoring are subtle and serious. If a model can acquire behavioral tendencies through training data that does not, on inspection, contain those tendencies, then the standard practice of monitoring training data for problematic content provides only a partial guarantee. A model whose training data has been audited for absence of malicious content may still develop opaque behavioral tendencies through subliminal channels in the data, particularly when distillation or synthetic-data generation occurs between same-base or closely related models. The monitor at deployment time inherits the consequences of training-time decisions it cannot directly inspect.

This has a second-order implication for the alignment framing of this paper. The question “is this authorized agent acting according to the organization's sanctioned-use policy” presupposes that the agent's behavioral tendencies are known well enough to evaluate against that policy. Subliminal learning suggests that the agent may include tendencies that were not deliberately introduced and that cannot be removed merely by inspecting the training corpus. The agent may violate the organization's sanctioned-use policy not because of an external adversary but because of an artifact of its own training process that nobody designed and nobody can directly observe.

8.4 Latent-state monitoring has its own limitations

We have argued (Section 6.3) that latent-state monitoring is more causally proximate to the model's computation than verbalized reasoning. This is correct, but causal proximity is not the same as ground-truth access: the signal is still interpreted through probes, sparse autoencoders, natural language autoencoders, or other fallible instruments, and it has its own limitations.

–White-box access requirement. Latent-state monitoring requires access to the model's residual stream during inference. For closed-source models accessed only through an API, this access does not exist; the monitor is restricted to the verbalized and output surfaces.
–Linear-probability constraint. The simplest latent-state monitors (linear probes) can detect concepts that are linearly represented in the activation space. Not all concepts are. A concept encoded in a nonlinear manifold of activations cannot be cleanly extracted by a linear probe. SAEs and natural-language autoencoders partially address this at higher cost, but have their own failure modes, including the confabulation problem documented in ([3]).
–Adversarial robustness. A model whose training has incentivized obscuring specific computations from latent-state monitors may route those computations through representational structures that linear probes cannot easily read. Latent-state monitoring is best used as one signal among several, with the assumption that it can be partially evaded but is much harder to evade than text-based monitoring.

8.5 The policy specification problem

The limitations discussed so far have concerned the monitor's ability to observe and reason about the agent's behavior. A separate and arguably more fundamental limitation concerns what the monitor compares those observations to. A monitor that asks “is this action consistent with the agent's sanctioned role” presupposes that the sanctioned role has been specified well enough to permit the comparison. This is itself an unsolved problem and one we should not paper over.

The classical version of this problem is identity and access management. Organizations that have invested heavily in IAM still struggle to write static permission policies that correctly partition allowed from disallowed actions, even for human users with stable job functions. Audits routinely surface over-broad grants, dormant permissions, and conflicting rules. The problem is not a lack of expressiveness in policy languages. It is the difficulty of articulating, in advance and exhaustively, what an employee's job authorizes them to do.

The agent-monitoring version of this problem is strictly harder along several axes. The agent's action surface is broader (any tool the agent has access to, any natural-language output it can produce). The relevant criterion is not a partition of static permissions but a conditional distribution over actions given context. Whether a given action is sanctioned depends on prior turns in the conversation, on the agent's role within the organization, on the organization's current state, on the identity of the user, and on a long tail of context features that no policy author can fully enumerate. A finance agent that pulls account balances during business hours for a known account-holder is sanctioned; the same agent pulling the same balances at 3 AM in response to a request whose prior turns have introduced suspicious patterns may not be. No static policy expresses this distinction. The policy that does express it is itself a learned conditional model, and the question of how the organization produces that learned model is the bottleneck.

This places a load-bearing burden on the labeling protocol. A monitor that learns from operator feedback inherits the operator's view of what counts as sanctioned, which means the policy is implicit in the labels rather than explicit in any policy document. Several consequences follow. First, the labeling effort scales with the agent's action surface, and is therefore higher for the more capable agents that need monitoring most. Second, operator labels are themselves a distribution rather than ground truth: different operators will label the same trajectory differently, and the labeling distribution is what the monitor learns from. Third, bootstrap is severe: a monitor with no labels has no policy to enforce, and any policy derived from a small initial labeling effort is correspondingly impoverished. Fourth, label drift is a real phenomenon: as the agent's deployment matures and its capabilities are exercised in new ways, what the organization considers sanctioned shifts, and the monitor inherits these shifts whether or not anyone has explicitly approved them.

The research literature has approached related problems at training time. Constitutional AI ([12]) and reinforcement learning from human feedback have addressed the question of how to specify model behavior given examples and stated principles, with imperfect success and known failure modes. Deliberative alignment ([13]) extends this by explicitly specifying safety policies that the model reasons over at inference time. Monitoring inherits the same problem at deployment time, with the additional constraint that the specification has to be tractable for many organizations, each with idiosyncratic role definitions, rather than developed once by a large central team.

We acknowledge this as a limitation rather than resolve it. The honest framing is that monitoring is only as good as the policy it has been given to enforce, and articulating the policy for an autonomous agent operating across a continuous action distribution is, at present, an open problem. The architectural questions in Section 7 take the policy as given; the operational reality is that the policy is itself a degree of freedom in the system, and one that organizations are not currently equipped to specify cleanly. The implication for the field is that monitoring research and alignment-specification research are not separate disciplines that happen to share a substrate; they are the same problem at different points in the lifecycle.

8.6 The monitor is itself a system

The final limitation is the most general. The monitoring layer is itself a system, often itself a neural network, and it inherits all the properties of such systems: it has a learned policy, it has variance, it can be adversarially probed, it can be the target of instruction hijacking. The CAI experiment ([2]) is an early empirical signal that autonomous defensive agents are themselves vulnerable to manipulation. The same principles that make agentic systems hard to monitor make agentic monitors hard to trust without their own monitoring.

This invites a regress. The standard answers to the regress are operational rather than mathematical: defense in depth, independence between monitor and monitored, human review at appropriate intervals, and an explicit acknowledgment that perfect monitoring is not the goal. The goal is sufficient monitoring, where sufficiency is measured against the operational risk tolerance of the organization deploying the agent, and where the residual risk is named and bounded rather than ignored.

9. Conclusion

The signature paradigm worked because the systems being monitored were running on a clock and on a manifold that the paradigm was built to track. Both have changed. The clock is now machine speed. The manifold is now continuous. The systems being monitored are not adversaries trying to get in; they are authorized agents already operating on the inside, with sanctioned credentials and broad action surfaces, whose alignment with the organization that deployed them is the central question of modern security.

The architectural difference between heuristic monitoring and the systems it now needs to monitor is summarized in Section 3. Heuristics operate on tokens humans designed, in feature spaces humans engineered, with local decision rules, updated at the cadence of human analysis. The systems they monitor operate on tokens that emerge from training, in feature spaces shaped by gradient descent, with arbitrary decision boundaries, and adapting their behavior between turns through context, memory, tool feedback, and retrieved knowledge rather than through weight changes. The latency advantage of the heuristic paradigm is real, but it was paid for in expressivity, and expressivity is what is needed to evaluate whether a continuously generative agent's behavior remains congruent with the organization's sanctioned-use policy.

The design space for monitoring is constrained by three forcing functions (expressivity, latency, adaptivity) that do not commute, by adversarial dynamics (escalation exhaustion, baseline displacement, the monitor's own attack surface) that any candidate architecture must defend against, and by a bifurcation between white-box and black-box regimes that produces two largely separate research agendas. The black-box regime, in which most enterprise deployments live, is not structurally compromised; it has its own research agenda built on behavioral proxies, multi-agent cross-examination, environment-state tracking, and statistical characterization of sampling behavior. The white-box regime, where activation-level signal is available, dominates on fidelity but is operationally inaccessible to organizations that consume frontier models through third-party APIs. The frontier question, in both regimes, is how to satisfy the forcing functions simultaneously and how to defend the satisfying point against the adversarial dynamics. None of these constraints is resolved by the existing literature, and we are not aware of any evaluation framework that exposes the metrics by which candidate solutions could be meaningfully compared. The intersection of constraints that admits a viable monitor is non-empty, but its shape, the points within it, and the framework for assessing those points are open problems for the field.

The design space is also bounded by limitations upstream and downstream of the monitor itself. Upstream: a monitor is only as informative as the policy of sanctioned use it has been given to enforce, and articulating that policy for an autonomous agent operating across a continuous action distribution is a problem that the IAM, RLHF, Constitutional AI, and deliberative-alignment literatures have all engaged with and that none of them has resolved at the level of operational tractability that enterprise monitoring requires. Downstream: models can produce verbalized reasoning that is not faithful to their computation, can in principle encode reasoning in text that monitors cannot read, and can acquire behavioral tendencies through training data that appears innocuous by inspection. These are not implementation problems to be cleaned up in a release cycle; they are properties of the systems we are deploying and the policies we are trying to enforce on them, and they bound what any monitoring architecture can promise regardless of how well-engineered it is.

The relevant observables for monitoring authorized agents are no longer in the network packet or the system call; they are in the inference loop. The relevant baseline is no longer global but tied to a model of the agent's sanctioned role within an organization. The relevant feedback is no longer the next ruleset release but the next labeling decision from an operator. The relevant latency budget is no longer set by the SOC analyst but by the agent's sampling step.

The frontier of cybersecurity, in this view, is not only a contest between attackers and defenders. It is also a problem of behavioral attestation: ensuring that the increasingly capable, increasingly autonomous systems that organizations have deployed inside themselves continue to act according to sanctioned-use policies, despite operating at machine speed, despite reasoning in ways the organization cannot directly inspect, and despite the possibility that the system itself has acquired tendencies that nobody designed and nobody can easily remove. The research agenda that follows is to characterize the monitoring architectures that can address this problem under realistic constraints, to develop the evaluation methodology by which candidate architectures can be compared, and to make progress on the upstream policy-specification problem that the architectures inherit. We have argued here that the design space exists, that its boundaries are non-trivial, that the evaluation methodology is missing, and that no architecture built solely on enumeration of past behavior is sufficient. Enumerated controls (forbidden tools, allowlisted accounts, known-bad indicators, policy invariants) remain useful components of any deployed system; the claim is only that they cannot be the whole of one.

Filed under: Research

References

[1]Wang, Z., Schiller, N., Li, H., Narayana, S. S., Nasr, M., Carlini, N., Qi, X., Wallace, E., Bursztein, E., Invernizzi, L., Thomas, K., Shoshitaishvili, Y., Guo, W., He, J., Holz, T., Song, D. ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks? arXiv:2605.11086, 2026. We cite the objective capability findings; experimental methodology relies on disabled deployment-time safety filters under structured-access programs, which the authors discuss in detail.
[2]Balassone, F., Mayoral-Vilches, V., Rass, S., Pinzger, M., Perrone, G., Romano, S. P., Schartner, P. Cybersecurity AI: Evaluating Agentic Cybersecurity in Attack/Defense CTFs. arXiv:2510.17521, 2025.
[3]Fraser-Taliente, K., Kantamneni, S., Ong, E., et al. Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations. Transformer Circuits Thread, 2026. We cite the unverbalized-evaluation-awareness finding as evidence that the latent state can carry information the verbalized output does not.
[4]Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., et al. Measuring Faithfulness in Chain-of-Thought Reasoning. arXiv:2307.13702, 2023.
[5]Chen, Y., Benton, J., Radhakrishnan, A., Uesato, J., Denison, C., Schulman, J., Somani, A., Hase, P., Wagner, M., Roger, F., Mikulik, V., Bowman, S., Leike, J., Kaplan, J., Perez, E. Reasoning Models Don't Always Say What They Think. Anthropic, 2025.
[6]Cloud, A., Le, M., Chua, J., Betley, J., Sztyber-Betley, A., Hilton, J., Marks, S., Evans, O. Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data. arXiv:2507.14805, 2025.
[7]OWASP Foundation. LLM06:2025 Excessive Agency. OWASP Top 10 for LLM Applications, 2025. genai.owasp.org/llmrisk/llm062025-excessive-agency
[8]Weng, S., Feng, Y., Zhang, J., Xie, X., Yu, J., Liu, J. ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection. arXiv:2605.03378, 2026.
[9]Kim, M., Parmar, M., Wallis, P., Miculicich, L., Jung, K., Dvijotham, K. D., Le, L. T., Pfister, T. CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution. arXiv:2602.07918, 2026. CausalArmor frames indirect prompt injection as a dominance shift in attribution from the user request to an untrusted segment, and triggers defense only when that shift is measurable.
[10]AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents. arXiv:2602.22724, 2026. AgentSentry models multi-turn indirect prompt injection as a temporal causal takeover at tool-return boundaries, using counterfactual dry-run re-executions to localize the boundary at which injected context becomes the dominant driver of the agent's next action.
[11]Rassul, Y. H., et al. AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents. arXiv:2605.11026, 2026.
[12]Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073, 2022.
[13]Guan, M. Y., Joglekar, M., Wallace, E., Jain, S., Barak, B., Helyar, A., Dias, R., Vallone, A., Ren, H., Wei, J., Chung, H. W., Toyer, S., Heidecke, J., Beutel, A., Glaese, A. Deliberative Alignment: Reasoning Enables Safer Language Models. arXiv:2412.16339, 2024.

Triage Research. Correspondence: nicks@triage-sec.com