Markus Anderljung

The case for outsourcing frontier AI safeguards

30/3/2026

 

Frontier AI companies don’t grow their own coffee beans. Like every other company, they make constant decisions about what to do in-house and what to buy from someone else. They already outsource significant AI-related work.
Surge AI generated $1.2 billion in revenue in 2024, almost entirely from selling data labelling and RLHF services to most frontier labs. METR, Apollo Research, and the UK AI Security Institute run evaluations that now feature prominently in frontier model system cards.


I think this pattern will and should extend further – to safeguards: technical measures that reduce risks from AI systems, such as content classifiers, jailbreak detectors, monitoring tools, and know-your-customer checks. This would be both in AI companies’ interest and good for the world. 

Concretely, I think: 
  • If you’re currently working on safeguards inside a frontier AI company, you should seriously consider whether you’d have more impact offering that service as a third party to the whole industry. 
  • Funders should support a thriving safeguards market by providing organisations with startup capital and, where needed, longer-term funding that allows them to prioritise pro-social outcomes over purely commercial incentives.
  • Evaluation organisations should consider expanding into safeguard development; they likely have a lot of the necessary skills. 
  • Frontier AI companies should work more closely with third-party safeguard developers – sending clear demand signals and working closely enough with them to enable rapid iteration, including sharing data on safeguard performance. 


Why outsourcing safeguards development is in companies’ interest

First, opportunity cost. Engineering time at frontier labs is extraordinarily scarce and valuable. Safeguards are important, but they’re not the core business – and the commercial returns to capability work are much clearer and more immediate. 

Second, specialisation. A dedicated safeguards company can recruit and retain people whose entire focus is building the best possible jailbreak detector or monitoring system. This is the same logic that makes companies buy their cybersecurity tools from e.g. CrowdStrike rather than building them from scratch.

Third, cost-sharing. Developing a good content classifier involves significant upfront R&D costs (mainly opportunity cost from engineers’ time), but often the marginal cost of deploying it to an additional company is relatively low. A third-party provider can spread those fixed costs across multiple frontier labs, making the per-lab cost substantially lower than if each had built the safeguard independently. 

Fourth, legal liability. Using widely-adopted third-party safeguards could reduce legal exposure: “we used the same safety infrastructure as the rest of the industry” is a stronger defence than “we built our own.”

These mechanisms already seem to be at work for model evaluations. My guess is that more than half of the quality-adjusted evaluations relevant to AI risks are developed and run by third parties. METR and others have become the de facto standard for autonomous capability evaluations, producing work that features in both Anthropic’s and OpenAI’s safety processes. Anthropic explicitly funds third-party evaluation development, acknowledging that “the demand for high-quality, safety-relevant evaluations is outpacing the supply.” If this model works for evaluations, that suggests it may work for safeguards too.


Why it could be good for the world

Beyond private benefits to companies, a healthy market for third-party safeguards would create dynamics that raise the safety floor across the industry. Why is that? 

Awkward not to adopt. Once a well-developed, publicly known safeguard exists and demonstrably works, it becomes reputationally and potentially legally costly not to use it. The existence of a good off-the-shelf content classifier can raise the bar for what counts as reasonable care. 

Transparency and legibility. When multiple labs work with the same provider, it becomes easier to compare safety practices across the industry. “We use [Company X]’s classifier” is likely a more informative signal than “we have internal safety measures.” 

Supporting fast-follower companies. AI companies that are 3-12 months behind the frontier tend to invest significantly less in risk assessment and misuse prevention. A third-party safeguards ecosystem could be particularly valuable here.

Enabling regulation. Concrete, purchasable safeguards make it much easier for regulators to write requirements. Mandating that “companies must screen API requests for CBRN-related content” is far more viable when there are off-the-shelf tools that do exactly that. The relationship is bidirectional: as tooling becomes available, it reinforces the standardisation of requirements, creating further demand for tooling.

Safeguard outsourcing could go awry. If outsourcing safeguards becomes a way for frontier companies to shift liability to their suppliers – or if it leads companies to shrink their internal safety teams while also pressuring safeguard providers to cut costs – the net effect could be negative. These are risks to be aware of, but they seem manageable to me. 


This has happened in other industries

The shift from in-house to third-party safety tooling has occurred in virtually every industry that faces complex, regulated risks.

In anti-money laundering (AML), the software market is now valued at $3-4 billion, dominated by third-party vendors like NICE Actimize and Oracle. Banks don’t build their own AML software because compliance is a cost centre, not a competitive advantage. A key enabler seems to have been standardised regulatory requirements, which both obliged banks to implement AML processes and gave vendors something concrete to build against.

Cybersecurity is probably the closest analogy. A huge third-party market – firewalls, endpoint detection, SIEM, penetration testing tools – is used by companies that also maintain substantial in-house security teams. The third-party tooling handles standardised threats; the internal team handles bespoke risk. I’d expect a similar setup to emerge for AI safeguards.

In content moderation, platforms have heavily outsourced to firms like Accenture and Teleperformance, while shared infrastructure like the GIFCT hash-sharing database handles known-bad content. 

Perhaps the most direct precedent is in child sexual abuse material (CSAM) detection. Microsoft developed PhotoDNA, a perceptual hashing tool for identifying known CSAM images, and made it freely available to the industry. It is now used by virtually all major platforms. The National Center for Missing & Exploited Children (NCMEC) maintains the underlying hash database. A key driver here is that possessing CSAM is itself illegal in most jurisdictions, making it impractical for each company to independently build and maintain detection databases – third-party provision is not just efficient but practically necessary.


Which safeguards?

Not all safeguards are equally suited to third-party development. Three factors make a safeguard a better candidate: 

  • Not being entangled with capability development. Where a safeguard is more “bolt-on” – such that you can just attach it to the AI system – it’ll be more suitable for third-party development. However, safety measures at frontier labs are often kludges – messy assemblages of pre-training data filters, fine-tuning adjustments, and runtime classifiers. A single internal pipeline might handle content filtering, refusal training, and output monitoring in ways that are hard to decompose. 
  • Impact on core product quality. Safeguards that significantly affect the user experience – making the model notably less helpful or responsive – are ones where companies will want very tight control over the tuning. You’re less likely to outsource something that directly shapes your core product. This extends to performance: classifiers that add significant latency to every API call may need to be tightly integrated with a company’s inference stack to reduce delay, making them harder (though not impossible) to outsource.
  • Overlap with existing evaluation expertise. A lot of safeguards development will have strong complementarities with safety and evaluation work. To know whether your safeguard is working, you’ll need high quality assessments. To evaluate whether a model can engage in self-exfiltration, you need to build a good understanding of why models would do so and develop datasets of models doing so, both of which are very important for safeguard development. Much of this expertise already sits outside of frontier AI companies. 
With these factors in mind, these are the products and services I think are particularly strong candidates for third-party involvement: 
  • Data. Generating adversarial test cases, building attack datasets, and creating training data for safety classifiers. It can also include the tools needed to generate such data, such as rubrics verified by subject matter experts, classifiers for data filtering, or environments for data generation. This is essentially what many red-teaming companies already do. The data is often transferable across models, making it a natural fit for third-party provision. Data provision also sidesteps the kludge problem: third parties can supply high-quality datasets, letting companies use them to adjust their own development pipelines. 
  • Certain input/output classifiers. Detecting jailbreaks, prompt injection, and undesired outputs. Often, once developed for one model these defences work for others. Third-party provision of these classifiers is more suitable where, for example, there are legal barriers making it difficult for companies to do the work (e.g. for CSAM and nuclear issues), where classifiers can be bolted on, and where the classifier can be easily adapted to the specific company (since companies may have different preferences over false-positive rates). 
  • Know-your-customer (KYC) and identity verification. Screening who gets access to certain dual-use model capabilities, along the lines of OpenAI’s Trusted Access program for cyber. Banks have used third-party KYC providers for decades; these can likely be used to screen for entity-listed companies, among other things. New KYC solutions may also be needed to identify enterprises and individuals suitable for trusted access programmes for bio and cyber capabilities. 
  • Consulting. Where a third party has a lot of expertise related to the development of a certain safeguard, but the safeguard is heavily intertwined with the rest of the company’s development efforts – where the kludge factor is high – consulting is likely to be the best setup. The third party offers some combination of expertise, data, and evaluations, to help develop and refine safeguards. 
  • Monitoring and observability. Tracking deployed model behaviour, detecting distribution shifts, and flagging anomalous usage patterns. The core technology here seems relatively model-agnostic, making it a natural fit for third-party provision. There is also an interesting independence argument: using a different provider’s model to monitor your own reduces the risk of correlated failures – a consideration that becomes more important as concerns about model self-preservation and collusion grow.
All of the above are probably best seen as ongoing services rather than as one-off products: providers would need to continuously update their safeguards (or inputs to safeguard development) as risks evolve, much as cybersecurity vendors do.
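To make the “bolt-on” idea concrete, here is a minimal sketch of what a third-party input/output classifier wrapped around a model call might look like. Everything here is illustrative and hypothetical – the class names, the keyword heuristic standing in for a real vendor classifier, and the threshold values are assumptions, not any actual provider’s API. The point is only the shape: the safeguard attaches around the model without touching its training pipeline, and the deployer tunes the threshold to its own preferred false-positive rate.

```python
# Hypothetical sketch of a bolt-on third-party safeguard: a classifier
# screens both the prompt and the model's output. A real vendor classifier
# would call out to the provider's service; here a trivial keyword
# heuristic stands in for it.

from dataclasses import dataclass
from typing import Callable


@dataclass
class ClassifierVerdict:
    score: float  # 0.0 (clearly benign) .. 1.0 (clearly violating)
    label: str


class ThirdPartyClassifier:
    """Stand-in for a vendor-supplied classifier (illustrative only)."""

    BLOCKLIST = ("synthesise a nerve agent", "build a bomb")

    def classify(self, text: str) -> ClassifierVerdict:
        lowered = text.lower()
        if any(phrase in lowered for phrase in self.BLOCKLIST):
            return ClassifierVerdict(score=0.95, label="flagged")
        return ClassifierVerdict(score=0.02, label="benign")


def guarded_completion(prompt: str,
                       model: Callable[[str], str],
                       clf: ThirdPartyClassifier,
                       threshold: float = 0.5) -> str:
    """Screen the input, call the model, then screen the output.
    The threshold is tunable per deployer, reflecting that companies
    differ in their preferred false-positive rates."""
    if clf.classify(prompt).score >= threshold:
        return "[refused: input flagged by safeguard]"
    output = model(prompt)
    if clf.classify(output).score >= threshold:
        return "[withheld: output flagged by safeguard]"
    return output
```

Because the safeguard only sees text going in and out, the same wrapper could in principle sit in front of any company’s model – which is exactly the property that makes this category of safeguard easiest to provide as a third party.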

One countertrend is worth noting. Some frontier labs are moving to internalise safety tooling: OpenAI acquired Promptfoo, an open-source red-teaming framework, in early 2026, and Anthropic treats its safety infrastructure as a competitive differentiator. But this doesn’t invalidate the outsourcing thesis – it mirrors the pattern in cybersecurity, where companies like Google maintain world-class internal security teams while still purchasing CrowdStrike, Palo Alto Networks, and dozens of other third-party tools. Internal teams and external providers serve complementary functions: the former handles bespoke, product-specific risks; the latter provides standardised tooling that benefits from cross-industry learning.


What follows

If this analysis is roughly right, what should be done?

Start these organisations. People with safeguards expertise – especially those currently developing safeguards inside frontier labs – should consider doing that work as a product serving the whole industry. Some are already starting: companies like Gray Swan and Lakera exist. But the market is still thin relative to the need. Notably, the outsourcing so far is concentrated in jailbreak detection and input filtering – the most “bolt-on” category of safeguards.

Fund them. Market forces alone may be insufficient or too slow. The customer base for frontier-specific safeguards is small – a dozen companies at most – and the social value of good safeguards exceeds what these companies would pay for them. Worse, if safeguard providers are funded exclusively by the companies they serve, they have an incentive to produce low-cost, just-about-sufficient safeguards. Philanthropic and public funders should provide seed funding and, in some cases, ongoing support so that these organisations can prioritise social value over commercial incentives. 

Evaluation organisations should consider expanding into safeguards. There are strong complementarities between risk assessment and safeguard development – if you’re good at finding the vulnerabilities, you’re well-positioned to build the defences and iterate on them. This needs to be managed carefully, though: an organisation that both evaluates a company’s safety and sells it safeguards faces an obvious conflict of interest, perhaps severe enough that organisations should focus on one or the other. Possible mitigations include maintaining organisational separation between evaluation and safeguard-development arms, or requiring disclosure when the same organisation provides both services to a client. 

Give third parties access to develop safeguards. Third-party safeguard developers need access to frontier models, deployment infrastructure, and relevant data to build and test their products effectively. Without such access, the tools will be limited. Frontier labs should establish structured access programmes specifically for safeguard developers – analogous to the model access they already provide to external evaluators. 

