|
Safeguards against cyber risk on proprietary frontier AI (e.g. from Anthropic, OpenAI, Google DeepMind) currently focus on two questions: what the model is being asked to do – the type of task, the user’s intent – and who is asking – the identity behind the account. This post proposes adding a third question: where the model is operating.

If a model’s context shows it is interacting with the control systems of a power grid, the trading infrastructure of a major bank, the internal network of a defense contractor, or, plausibly, the internal systems of an AI company itself, the provider might want to apply safeguards unless the user is authorized to be there. If the user isn’t authorized, the AI provider could notify the owner of the software environment, downgrade to a less capable model, or have the model refuse the task outright.

I’m not sure this can be made to work; I’d love cybersecurity and cryptography folks to look into it. The two key questions are:
The mechanism
The mechanism has three components.
The central design problem
The mechanism is effective to the extent that two things hold:
To address this problem, you could either make sure that adversaries don’t find out that there are environment-based safeguards – e.g. by only activating safeguards that are subtle enough not to be noticed, such as notifying the owner of the penetrated system – or use identifiers adversaries can’t find even if they’re looking.

Candidate techniques
Hidden markers (steganography). Signals woven into the environment’s outputs that look like ordinary content but encode a pattern the provider’s classifier has been trained to detect, similar to watermarking. A human reading the output wouldn’t notice anything; the pattern is legible only to a system that knows what it’s looking for. This is the only candidate that plausibly scores well on discovery resistance against a sophisticated adversary.

Distinctive environmental signatures. Features that already make a given environment recognizable: SCADA banners (text identifiers produced by industrial control systems), proprietary protocol headers, specific log formats. Easy to deploy because nothing new needs to be embedded. But discovery resistance is poor – an adversary who can identify the environment’s make and model can deduce the signature.

Registered values. Specific canary strings, tokens, or admin-held cryptographic keys that operators share with the provider, which the provider’s classifier then watches for. Simple to deploy, weakest on discovery resistance: if the output that carries the value is visible in the environment, the adversary can study it.

Hidden markers are the only option that plausibly survives a sophisticated adversary that is aware of the environment-based safeguards. The other two would only work against less sophisticated adversaries or if the safeguards are subtle enough that attackers might not notice them (e.g. notifying the owner of the penetrated system).
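To make the registered-values option concrete, here is a minimal sketch of what a provider-side check might look like. Everything in it is an assumption for illustration: the `RegisteredValue` record, the idea of a per-operator list of authorized accounts, and plain substring matching are hypothetical choices, not a description of any provider’s actual filter.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class RegisteredValue:
    """Hypothetical registry entry shared by an operator with the provider."""
    operator: str                     # who to notify if the value is seen
    value: str                        # canary string embedded in the environment
    authorized_accounts: FrozenSet[str]


def find_registered_environment(
    context: str, registry: List[RegisteredValue]
) -> Optional[RegisteredValue]:
    """Return the first registry entry whose value appears in the model context."""
    for entry in registry:
        if entry.value in context:
            return entry
    return None


def should_apply_safeguards(
    context: str, account_id: str, registry: List[RegisteredValue]
) -> bool:
    """True when the context matches a registered environment and the
    account is not on that operator's authorized list (an assumed policy)."""
    entry = find_registered_environment(context, registry)
    return entry is not None and account_id not in entry.authorized_accounts
```

A check like this also shows why registered values are weak on discovery resistance: the matched string has to appear verbatim in output the adversary can read too.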
How the response should work
Silent notification of the operator of the sensitive software environment, possibly combined with tiering down to a weaker model (something companies already do for suspicious-seeming tasks), is the default I’d argue for in most cases.

The choice between silent notification and outright refusal depends on confidence in discovery resistance. A noticeable detection event (an unexplained refusal, a sudden capability drop) can help the adversary narrow down what triggers the safeguard. That signal is most useful to the adversary when the identifier is otherwise findable. With a strong steganography scheme, refusing outright costs relatively little: the adversary learns that detection happened but can’t work out what triggered it. As such, I would only recommend using existing environmental signatures or unique strings as identifiers if the only safeguard is notifying the system operator of the intrusion.

Failure modes beyond the central problem
Five further concerns are worth naming.

The identifier may be removed. An insider or sophisticated attacker who has learned about the scheme can attempt to strip it. So can normal system updates, migrations, or redesigns. Routine system changes are probably the bigger ongoing risk, and someone needs to own keeping the identifiers in place.

Open-weight and self-hosted models are likely out of scope. Only models run by participating providers are affected. While open-weight models could ship with environment-based safeguards, my guess is that these would be easy to strip out. Enterprise on-prem deployments of frontier models, where the operator runs the model on its own infrastructure rather than calling a provider API, fall in the same category.

Griefing. A bad actor could register an identifier that appears in a competitor’s environment, effectively a denial-of-AI-service attack against that competitor. Any registry therefore needs high barriers to entry, real vetting, and a narrow eligibility list.
The registry is itself a target. A directory mapping identifiers to registered operators is both a list of critical-infrastructure systems and a recipe for stripping their safeguards. It will need strong cybersecurity protections.

Scale. Maintaining the registry will be a lot of work. Even with a “critical infrastructure only” scope, the US has well over 2,500 electric distribution utilities, plus thousands more generators, transmission owners, and balancing authorities – and that is before the supply chain is counted.

Who would run this?
If this mechanism does work, someone would need to manage the registry, vet applicants, coordinate with AI companies, and handle disputes. A few options:

A government agency like CISA (in the US) or the NCSC (in the UK) could run the registry. This has the advantage of legitimacy and existing relationships with critical infrastructure operators, but risks being slow-moving and jurisdiction-limited.

Alternatively, an AI company (or a consortium of them) could start unilaterally. A company could invite a small number of critical infrastructure operators to submit identifier strings and integrate them into its existing safety filters. Anthropic could extend this to Project Glasswing partners; OpenAI to Trusted Access for Cyber participants. This has the advantage of speed and doesn’t require government action, but could raise concerns about a private company holding sensitive information about critical infrastructure.

Appendix: Analogous current practice – honeytokens and data loss prevention
The proposal is similar to some existing cybersecurity techniques. Honeytokens, also called canary tokens, are decoy artifacts placed inside real systems: a fake credential committed to a private repo, a database row no legitimate query would ever return, a planted file that calls home when opened. They’re designed so that any access is almost certainly malicious; the resulting alert is high-signal and cheap to maintain.
CISA and Five Eyes partners recommend canary objects, Thinkst and others sell them at scale, and there is a small industry around OT (operational technology) and SCADA honeypots aimed at industrial environments specifically. The idea has begun to be applied to large language models directly – canaries planted in the kinds of files an unauthorized agent might read, and in model files themselves.

I’m excited about honeytokens, and would probably recommend them as an intervention ahead of environment-based safeguards. One concern with honeytokens for AI agents is that you’ll want to have your own agents roaming around your sensitive systems, and they might trigger the tokens repeatedly. However, I think that can be addressed by instructing your own models to avoid certain files, or by cross-checking any alert against the transcripts of your own agents. My main concern with honeytokens is that an attacker can potentially avoid them by telling the model that they exist in the environment. However, a recent experiment suggests that doing so reduces the usefulness of the model, as it becomes significantly more cautious.

Data Loss Prevention systems use a related mechanic. Markers or fingerprints are embedded in sensitive data – e.g. customer records or source code – and watched for in data flows; when a marker shows up where it shouldn’t, exfiltration is blocked.
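The idea of cross-checking honeytoken alerts against the transcripts of your own agents could be sketched roughly as follows. This is only an illustration under stated assumptions: the `TokenAlert` and `AgentAccess` records, the idea that transcripts can be reduced to timestamped token accesses, and the five-second matching tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TokenAlert:
    """Hypothetical alert raised when a honeytoken is touched."""
    token_id: str
    timestamp: float   # seconds since epoch


@dataclass(frozen=True)
class AgentAccess:
    """Hypothetical record extracted from one of your own agents' transcripts."""
    agent_id: str
    token_id: str
    timestamp: float   # seconds since epoch


def unexplained_alerts(
    alerts: List[TokenAlert],
    own_accesses: List[AgentAccess],
    tolerance_s: float = 5.0,   # assumed clock-skew allowance
) -> List[TokenAlert]:
    """Return alerts that no access by one of your own agents accounts for."""
    unexplained = []
    for alert in alerts:
        explained = any(
            access.token_id == alert.token_id
            and abs(access.timestamp - alert.timestamp) <= tolerance_s
            for access in own_accesses
        )
        if not explained:
            unexplained.append(alert)
    return unexplained
```

Only the alerts returned here would need human attention; anything matched to your own agents can be logged and dropped.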
|