Markus Anderljung
  • Home
  • Research
  • Blog
  • Talks & Podcasts
  • AI calibration game
  • Music

Environment-Based Safeguards for AI Cyber Risk

21/5/2026

 
Safeguards for cyber risk on proprietary frontier AI (e.g. from Anthropic, OpenAI, Google DeepMind) currently focus on two questions: what the model is being asked to do – the type of task, the user’s intent – and who is asking – the identity behind the account. 

This post proposes adding a third question: where the model is operating. If a model’s context shows it is interacting with the control systems of a power grid, the trading infrastructure of a major bank, the internal network of a defense contractor, or, plausibly, the internal systems of an AI company itself, you might want to apply safeguards unless the user is authorized to be there. If the user isn’t authorized, the AI provider could notify the owner of the software environment, downgrade to a less capable model, or have the model refuse the task outright.

I’m not sure this can be made to work; I’d love cybersecurity and cryptography folks to look into it. The two key questions are: 
  • Can identifiers be built that an AI system reliably encounters during normal operation in a sensitive environment, but that a sophisticated adversary cannot reliably remove? 
  • Are there safeguards subtle enough that adversaries won’t bother looking for the identifier – even though they could find it if they did?

The mechanism

The mechanism has three components.
  1. Owners of sensitive software systems submit identifiers of those environments to a registry. These could be identifiers already in the system – e.g. distinctive log formats, proprietary protocol headers – but more likely they would have to be deliberately introduced identifiers that are hard for an attacker to find. A simple option would be a canary string – a unique combination of numbers and letters.
  2. When an identifier appears in a model’s context, safeguards trigger. Possible responses include routing the session to a weaker model, requiring additional identity or purpose verification, silently notifying the environment owner’s security team, or refusing the task.
  3. Authorized accounts are allowlisted. The provider also maintains a list of AI accounts that are allowed to operate in the relevant environments, such as the operator's own staff and sanctioned penetration testers. 

Concretely: a power grid operator could embed a hidden pattern in its control system logs. Any AI agent reading those logs trips some safeguard (e.g. downgrading to a less powerful model), unless the model is running on an allowlisted account, likely belonging to the grid operator.
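
A minimal sketch of how the three components might fit together, assuming the simplest identifier type (an exact-match canary string). Everything named here is hypothetical and for illustration only: `Registry`, `Environment`, `Response`, `check_context`, the canary value, and the account names are not taken from any real provider's system.

```python
from dataclasses import dataclass, field
from enum import Enum


class Response(Enum):
    ALLOW = "allow"
    DOWNGRADE = "downgrade"    # route the session to a weaker model
    NOTIFY_OWNER = "notify"    # silently alert the environment owner
    REFUSE = "refuse"


@dataclass
class Environment:
    owner: str
    identifier: str                                    # canary registered by the owner
    allowlist: set[str] = field(default_factory=set)   # accounts authorized to be there
    response: Response = Response.NOTIFY_OWNER


class Registry:
    def __init__(self) -> None:
        self._envs: list[Environment] = []

    def register(self, env: Environment) -> None:
        self._envs.append(env)

    def check_context(self, account_id: str, context: str) -> Response:
        """Return the safeguard to apply for this account and model context."""
        for env in self._envs:
            if env.identifier in context and account_id not in env.allowlist:
                return env.response
        return Response.ALLOW


# Hypothetical grid operator registering a canary embedded in its logs.
registry = Registry()
registry.register(Environment(
    owner="example-grid-operator",
    identifier="zq7-canary-41f09c2e",
    allowlist={"grid-operator-staff"},
    response=Response.DOWNGRADE,
))
log_line = "breaker status OK zq7-canary-41f09c2e"
print(registry.check_context("unknown-account", log_line))      # Response.DOWNGRADE
print(registry.check_context("grid-operator-staff", log_line))  # Response.ALLOW
```

Exact substring matching is the weakest possible detector. It is used here only to make the control flow concrete.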

The central design problem

The mechanism is effective to the extent that two things hold:
  • The identifier is encountered – the probability that an AI agent operating in the environment actually processes something containing the identifier is high.
  • The identifier is not discovered – the probability that an adversary can locate the identifier well enough to strip it from model inputs is low.
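
As a rough way to see how the two conditions combine, here is a toy calculation. It rests on two simplifying assumptions that are not claims from this post: that encounter and discovery are independent, and that an adversary who discovers the identifier always strips it successfully. The function name and the example numbers are purely illustrative.

```python
def safeguard_effectiveness(p_encounter: float, p_discover: float) -> float:
    """Chance the model sees the identifier and the adversary has not stripped it.

    Assumes independence and that discovery always leads to successful removal.
    """
    for p in (p_encounter, p_discover):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be in [0, 1]")
    return p_encounter * (1.0 - p_discover)


# Same encounter rate, very different discovery resistance (made-up numbers).
strong = safeguard_effectiveness(p_encounter=0.9, p_discover=0.1)
weak = safeguard_effectiveness(p_encounter=0.9, p_discover=0.8)
print(f"{strong:.2f} vs {weak:.2f}")  # 0.81 vs 0.18
```

Under these assumptions, a high encounter rate does little good if the identifier is easy to find.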

A motivated adversary has several plausible routes to discovering an identifier. They can scout the environment with an open-weight model and inspect its context. They can run a safeguarded model in the target environment themselves, see what it sees or says before refusing, and infer from that what activated the safeguard. They can compare environments they believe are marked against ones they believe are not. They can social-engineer people with access to the registry. If they do discover the identifier, they would likely be able to keep the safeguard from triggering, e.g. by stripping the identifier out of any input before it is sent to the safeguarded model.

To address this problem, you could either make sure that adversaries don’t find out that there are environment-based safeguards – e.g. by only activating safeguards that are subtle enough not to be noticed, such as notifying the owner of the penetrated system – or use identifiers adversaries can’t find even if they’re looking.

Candidate techniques

Hidden markers (steganography). Signals woven into the environment’s outputs that look like ordinary content but encode a pattern the provider’s classifier has been trained to detect, similar to watermarking. A human reading the output wouldn’t notice anything; the pattern is legible only to a system that knows what it’s looking for. This is the only candidate that plausibly scores well on discovery resistance against a sophisticated adversary.

Distinctive environmental signatures. Features that already make a given environment recognizable: SCADA banners (text identifiers produced by industrial control systems), proprietary protocol headers, specific log formats. Easy to deploy because nothing new needs to be embedded. But discovery resistance is poor – an adversary who can identify the environment’s make and model can deduce the signature. 

Registered values. Specific canary strings, tokens, or admin-held cryptographic keys that operators share with the provider, which the provider’s classifier then watches for. Simple to deploy, weakest on discovery resistance: if the output that carries the value is visible in the environment, the adversary can study it.

Hidden markers are the only option that plausibly survives a sophisticated adversary that is aware of the environment-based safeguards. The other two would only work against less sophisticated adversaries or if the safeguards are subtle enough that attackers might not notice them (e.g. notifying the owner of the penetrated system).
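
To make the two simpler techniques concrete, here is a minimal sketch of detecting registered values and environmental signatures in a model's context. Hidden markers are not sketched, since they would need a trained classifier. The canary format, the `ExampleSCADA` banner pattern, and the function names are all invented for illustration and do not correspond to any real product.

```python
import re
import secrets


def new_canary(prefix: str = "env") -> str:
    """Generate a random, unguessable canary string (hypothetical format)."""
    return f"{prefix}-{secrets.token_hex(16)}"


# Hypothetical signature: a banner an industrial control system might print.
SIGNATURE_PATTERNS = [re.compile(r"ExampleSCADA HMI v\d+\.\d+", re.IGNORECASE)]


def find_identifiers(context: str, canaries: set[str]) -> list[str]:
    """Return every registered canary or signature match found in the context."""
    hits = [c for c in canaries if c in context]
    for pattern in SIGNATURE_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(context))
    return hits


canary = new_canary()
print(find_identifiers(f"session log {canary} end", {canary}) == [canary])  # True
print(find_identifiers("ExampleSCADA HMI v2.1 ready", set()))  # ['ExampleSCADA HMI v2.1']
```

Both detectors match on content that is visible in the environment, which is exactly why an adversary who knows what to look for can strip it before the model sees it.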

How the response should work

Silent notification of the operator of the sensitive software environment, possibly combined with tiering down to a weaker model (something companies already do for suspicious-seeming tasks), is the default I’d argue for in most cases.

The choice between silent notification and outright refusal depends on confidence in discovery resistance. A noticeable detection event (an unexplained refusal, a sudden capability drop) can help the adversary narrow down what triggers the safeguard. That signal is most useful when the identifier is otherwise findable. With a strong steganography scheme, refusing outright costs relatively little: the adversary learns that detection happened but can’t work out what triggered it. As such, I would only recommend using e.g. existing environmental signatures or unique strings as identifiers if the only safeguard is notifying the system operator of the intrusion.
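
The rule of thumb above can be written as a small policy function. This is a sketch of the recommendation as stated, not a tested policy. `IdentifierKind`, `allowed_responses`, and the response labels are hypothetical names.

```python
from enum import Enum


class IdentifierKind(Enum):
    HIDDEN_MARKER = "hidden_marker"
    ENVIRONMENTAL_SIGNATURE = "environmental_signature"
    REGISTERED_VALUE = "registered_value"


def allowed_responses(kind: IdentifierKind) -> list[str]:
    """Map identifier type to the responses that seem safe to use with it."""
    if kind is IdentifierKind.HIDDEN_MARKER:
        # Hard to locate even when the adversary knows detection happened,
        # so noticeable responses leak relatively little.
        return ["notify_owner", "downgrade_model", "refuse"]
    # Findable identifiers: keep the response invisible to the adversary.
    return ["notify_owner"]


print(allowed_responses(IdentifierKind.REGISTERED_VALUE))  # ['notify_owner']
```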

Failure modes beyond the central problem

Five further concerns are worth naming.

The identifier may be removed. An insider or sophisticated attacker who has learned about the scheme can attempt to strip it. So can normal system updates, migrations, or redesigns. Routine system changes are probably the bigger ongoing risk, and someone needs to own keeping the identifiers in place.

Open-weight and self-hosted models are likely out of scope. Only models run by participating providers are affected. While environment-based safeguards could be built into open-weight models, my guess is they would be easy to strip out. Enterprise on-prem deployments of frontier models, where the operator runs the model on its own infrastructure rather than calling a provider API, fall in the same category.

Griefing. A bad actor could register an identifier that appears in a competitor’s environment, effectively a denial-of-AI-service attack against that competitor. Therefore, any registry needs high barriers to entry, real vetting, and a narrow eligibility list.

The registry is itself a target. A directory mapping identifiers to registered operators is both a list of critical-infrastructure systems and a recipe for stripping their safeguards. It will need strong cybersecurity protections.

Scale. Maintaining the registry will be a lot of work. Even with a “critical infrastructure only” scope, the US has well over 2,500 electric distribution utilities, plus thousands more generators, transmission owners, and balancing authorities – and that is before the supply chain is counted.

Who would run this?

If this mechanism does work, someone would need to manage the registry, vet applicants, coordinate with AI companies, and handle disputes. A few options:

A government agency like CISA (in the US) or the NCSC (in the UK) could run the registry. This has the advantage of legitimacy and existing relationships with critical infrastructure operators, but risks being slow-moving and jurisdiction-limited.

Alternatively, an AI company (or a consortium of them) could start unilaterally. A company could invite a small number of critical infrastructure operators to submit identifier strings and integrate them into its existing safety filters. Anthropic could extend this to Project Glasswing partners; OpenAI to Trusted Access for Cyber participants. This has the advantage of speed and doesn’t require government action, but could raise concerns about a private company holding sensitive information about critical infrastructure.

Appendix: Analogous current practice – honeytokens and data loss prevention

The proposal is similar to some existing cybersecurity techniques. 

Honeytokens, also called canary tokens, are decoy artifacts placed inside real systems: a fake credential committed to a private repo, a database row no legitimate query would ever return, a planted file that calls home when opened. They’re designed so that any access is almost certainly malicious; the resulting alert is high-signal and cheap to maintain. CISA and Five Eyes partners recommend canary objects, Thinkst and others sell them at scale, and there is a small industry around OT (operational technology) and SCADA honeypots aimed at industrial environments specifically. The idea has begun to be applied to large language models directly – canaries planted in the kinds of files an unauthorized agent might read, and in model files themselves.

I’m excited about honeytokens, and would probably recommend them as an intervention ahead of environment-based safeguards. One concern with honeytokens for AI agents is that you’ll want your own agents roaming around your sensitive systems, and they might trigger the tokens repeatedly. However, I think that can be addressed by instructing your own models to avoid certain files, or by cross-checking any alert against the transcripts of your own agents. My main concern with honeytokens is that they can potentially be avoided by telling the model that they exist in the environment. However, a recent experiment suggests that doing so reduces the usefulness of the model, as it becomes significantly more cautious.
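
The cross-checking step could look something like the sketch below. It assumes honeytoken alerts and your own agents' transcripts can both be reduced to (token, timestamp) events. `TokenEvent`, `unexplained_alerts`, and the five-second tolerance are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenEvent:
    token_id: str      # which honeytoken was touched
    timestamp: float   # seconds since the epoch


def unexplained_alerts(
    alerts: list[TokenEvent],
    own_agent_accesses: list[TokenEvent],
    tolerance_s: float = 5.0,
) -> list[TokenEvent]:
    """Return alerts with no matching access in our own agents' transcripts."""
    def explained(alert: TokenEvent) -> bool:
        return any(
            a.token_id == alert.token_id
            and abs(a.timestamp - alert.timestamp) <= tolerance_s
            for a in own_agent_accesses
        )
    return [alert for alert in alerts if not explained(alert)]


alerts = [TokenEvent("fake-credential", 100.0), TokenEvent("decoy-row", 200.0)]
own = [TokenEvent("fake-credential", 102.0)]
print(unexplained_alerts(alerts, own))  # only the decoy-row alert remains
```

A time window is a crude matching rule. A real deployment would presumably want something harder for an intruder to hide behind.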
​
Data Loss Prevention systems use a related mechanic. Markers or fingerprints are embedded in sensitive data – e.g. customer records or source code – and watched for in data flows; when a marker shows up where it shouldn't, exfiltration is blocked. 
