Markus Anderljung
  • Home
  • Research
  • Blog
  • Talks & Podcasts
  • AI calibration game

The case for outsourcing frontier AI safeguards

30/3/2026

 

Frontier AI companies don’t grow their own coffee beans. Like every other company, they make constant decisions about what to do in-house and what to buy from someone else. They already outsource significant AI-related work.
Surge AI generated $1.2 billion in revenue in 2024, almost entirely from selling data labelling and RLHF services to most frontier labs. METR, Apollo Research, and the UK AI Security Institute run evaluations that now feature prominently in frontier model system cards.


I think this pattern will and should extend further – to safeguards: technical measures that reduce risks from AI systems, such as content classifiers, jailbreak detectors, monitoring tools, and know-your-customer checks. This would be both in AI companies’ interest and good for the world. 

Concretely, I think: 
  • If you’re currently working on safeguards inside a frontier AI company, you should seriously consider whether you’d have more impact offering that service as a third party to the whole industry. 
  • Funders should support a thriving safeguards market by providing organisations with startup capital and, where needed, longer-term funding that allows them to prioritise pro-social outcomes over purely commercial incentives.
  • Evaluation organisations should consider expanding into safeguard development; they likely have a lot of the necessary skills. 
  • Frontier AI companies should work more closely with third-party safeguard developers – sending clear demand signals and working closely enough with them to enable rapid iteration, including sharing data on safeguard performance. 


Why outsourcing safeguards development is in companies’ interest

First, opportunity cost. Engineering time at frontier labs is extraordinarily scarce and valuable. Safeguards are important, but they’re not the core business – and the commercial returns to capability work are much clearer and more immediate. 

Second, specialisation. A dedicated safeguards company can recruit and retain people whose entire focus is building the best possible jailbreak detector or monitoring system. This is the same logic that makes companies buy their cybersecurity tools from e.g. CrowdStrike rather than building them from scratch.

Third, cost-sharing. Developing a good content classifier involves significant upfront R&D costs (mainly opportunity cost from engineers’ time), but often the marginal cost of deploying it to an additional company is relatively low. A third-party provider can spread those fixed costs across multiple frontier labs, making the per-lab cost substantially lower than if each had built the safeguard independently. 

These mechanisms already seem to be at work for model evaluations. My guess is that more than half of the quality-adjusted evaluations relevant to AI risks are developed and run by third parties. METR and others have become the de facto standard for autonomous capability evaluations, producing work that features in both Anthropic’s and OpenAI’s safety processes. Anthropic explicitly funds third-party evaluation development, acknowledging that “the demand for high-quality, safety-relevant evaluations is outpacing the supply.” If this model works for evaluations, that suggests it may work for safeguards too.

Fourth, legal liability. Using widely-adopted third-party safeguards could reduce legal exposure: “we used the same safety infrastructure as the rest of the industry” is a stronger defence than “we built our own.” 


Why it could be good for the world

Beyond private benefits to companies, a healthy market for third-party safeguards would create dynamics that raise the safety floor across the industry. Why is that? 

Awkward not to adopt. Once a well-developed, publicly known safeguard exists and demonstrably works, it becomes reputationally and potentially legally costly not to use it. The existence of a good off-the-shelf content classifier can raise the bar for what counts as reasonable care. 

Transparency and legibility. When multiple labs work with the same provider, it becomes easier to compare safety practices across the industry. “We use [Company X]’s classifier” is likely a more informative signal than “we have internal safety measures.” 

Supporting fast-follower companies. AI companies that are 3-12 months behind the frontier tend to invest significantly less in risk assessment and misuse prevention. A third-party safeguards ecosystem could be particularly valuable here.

Finally, concrete, purchasable safeguards make it much easier for regulators to write requirements. Mandating that “companies must screen API requests for CBRN-related content” is far more viable when there are off-the-shelf tools that do exactly that. The relationship is bidirectional: as tooling becomes available, it reinforces the standardisation of requirements, creating further demand for tooling.

Safeguard outsourcing could go awry. If outsourcing safeguards becomes a way for frontier companies to shift liability to their suppliers – or if it leads companies to shrink their internal safety teams while also pressuring safeguard providers to cut costs – the net effect could be negative. These are risks to be aware of, but they seem manageable to me. 


This has happened in other industries

The shift from in-house to third-party safety tooling has occurred in virtually every industry that faces complex, regulated risks.

In anti-money laundering, the software market is now valued at $3-4 billion, dominated by third-party vendors like NICE Actimize and Oracle. Banks don’t build their own anti-money laundering software because compliance is a cost centre, not a competitive advantage. A key enabler here seems to be standardised regulatory requirements that gave vendors something concrete to build against and required banks to implement AML processes.

Cybersecurity is probably the closest analogy. A huge third-party market – firewalls, endpoint detection, SIEM, penetration testing tools – is used by companies that also maintain substantial in-house security teams. The third-party tooling handles standardised threats; the internal team handles bespoke risk. I’d expect a similar setup to emerge for AI safeguards.

In content moderation, platforms have heavily outsourced to firms like Accenture and Teleperformance, while shared infrastructure like the GIFCT hash-sharing database handles known-bad content. 

Perhaps the most direct precedent is in child sexual abuse material (CSAM) detection. Microsoft developed PhotoDNA, a perceptual hashing tool for identifying known CSAM images, and made it freely available to the industry. It is now used by virtually all major platforms. The National Center for Missing & Exploited Children (NCMEC) maintains the underlying hash database. A key driver here is that possessing CSAM is itself illegal in most jurisdictions, making it impractical for each company to independently build and maintain detection databases – third-party provision is not just efficient but practically necessary.


Which safeguards?

Not all safeguards are equally suited to third-party development. Three factors make a safeguard more suitable for third-party development: 

  • Not being entangled with capability development. Where a safeguard is more “bolt-on” – such that you can just attach it to the AI system – it’ll be more suitable for third-party development. However, safety measures at frontier labs are often kludges – messy assemblages of pre-training data filters, fine-tuning adjustments, and runtime classifiers. A single internal pipeline might handle content filtering, refusal training, and output monitoring in ways that are hard to decompose. 
  • Impact on core product quality. Safeguards that significantly affect the user experience – making the model notably less helpful or responsive – are ones where companies will want very tight control over the tuning. You’re less likely to outsource something that directly shapes your core product. This extends to performance: classifiers that add significant latency to every API call may need to be tightly integrated with a company’s inference stack to reduce delay, making them harder (though not impossible) to outsource.
  • Overlap with existing evaluation expertise. A lot of safeguards development will have strong complementarities with safety and evaluation work. To know whether your safeguard is working, you’ll need high quality assessments. To evaluate whether a model can engage in self-exfiltration, you need to build a good understanding of why models would do so and develop datasets of models doing so, both of which are very important for safeguard development. Much of this expertise already sits outside of frontier AI companies. 
With these factors in mind, these are the products and services I think are particularly strong candidates for third-party involvement: 
  • Data. Generating adversarial test cases, building attack datasets, and creating training data for safety classifiers. It can also include the tools needed to generate such data, such as rubrics verified by subject matter experts, classifiers for data filtering, or environments for data generation. This is essentially what many red-teaming companies already do. The data is often transferable across models, making it a natural fit for third-party provision. Data provision also sidesteps the kludge problem: third parties can supply high-quality datasets, letting companies use them to adjust their own development pipelines. 
  • Certain input/output classifiers. Detecting jailbreaks, prompt injection, and undesired outputs. Often, once developed for one model these defences work for others. Third-party provision of these classifiers is more suitable where, for example, there are legal barriers making it difficult for companies to do the work (e.g. for CSAM and nuclear issues), where classifiers can be bolted on, and where the classifier can be easily adapted to the specific company (since companies may have different preferences over false-positive rates). 
  • Know-your-customer (KYC) and identity verification. Screening who gets access to certain dual-use model capabilities, along the lines of OpenAI’s Trusted Access program for cyber. Banks have used third-party KYC providers for decades; these can likely be used to screen for entity-listed companies, among other things. New KYC solutions may also be needed to identify enterprises and individuals suitable for trusted access programmes for bio and cyber capabilities. 
  • Consulting. Where a third party has a lot of expertise related to the development of a certain safeguard, but the safeguard is heavily intertwined with the rest of the company’s development efforts – where the kludge factor is high – consulting is likely to be the best setup. The third party offers some combination of expertise, data, and evaluations, to help develop and refine safeguards. 
  • Monitoring and observability. Tracking deployed model behaviour, detecting distribution shifts, and flagging anomalous usage patterns. The core technology here seems relatively model-agnostic, making it a natural fit for third-party provision. There is also an interesting independence argument: using a different provider’s model to monitor your own reduces the risk of correlated failures – a consideration that becomes more important as concerns about model self-preservation and collusion grow.
All of the above are probably best seen as ongoing services rather than as one-off products: providers would need to continuously update their safeguards (or inputs to safeguard development) as risks evolve, much as cybersecurity vendors do.

One countertrend is worth noting. Some frontier labs are moving to internalise safety tooling: OpenAI acquired Promptfoo, an open-source red-teaming framework, in early 2026, and Anthropic treats its safety infrastructure as a competitive differentiator. But this doesn’t invalidate the outsourcing thesis – it mirrors the pattern in cybersecurity, where companies like Google maintain world-class internal security teams while still purchasing CrowdStrike, Palo Alto Networks, and dozens of other third-party tools. Internal teams and external providers serve complementary functions: the former handles bespoke, product-specific risks; the latter provides standardised tooling that benefits from cross-industry learning.


What follows

If this analysis is roughly right, what should be done?

Start these organisations. People with safeguards expertise – especially those currently developing safeguards inside frontier labs – should consider doing that work as a product serving the whole industry. Some are already starting: companies like Gray Swan and Lakera exist. But the market is still thin relative to the need. Notably, the outsourcing so far is concentrated in jailbreak detection and input filtering – the most “bolt-on” category of safeguards.

Fund them. Market forces alone may be insufficient or too slow. The customer base for frontier-specific safeguards is small – tops a dozen companies – and the social value of good safeguards exceeds what these companies would pay for them. Worse, if safeguard providers are funded exclusively by the companies they serve, they have an incentive to produce low-cost, just-about-sufficient safeguards. Philanthropic and public funders should provide seed funding and, in some cases, ongoing support to ensure these organisations can prioritise social value over commercial incentives. 

Evaluation organisations should consider expanding into safeguards. There are strong complementarities between risk assessment and safeguard development – if you’re good at finding the vulnerabilities, you’re well-positioned to build the defences and iterate on them. However, this needs to be managed carefully. An organisation that both evaluates a company’s safety and sells it safeguards faces an obvious conflict of interest. It might be severe enough that companies should focus on one or the other. However, possible approaches include maintaining organisational separation between evaluation and safeguard-development arms, or requiring disclosure when the same organisation provides both services to a client. 

Give third parties access to develop safeguards. Third-party safeguard developers need access to frontier models, deployment infrastructure, and relevant data to build and test their products effectively. Without such access, the tools will be limited. Frontier labs should establish structured access programmes specifically for safeguard developers – analogous to the model access they already provide to external evaluators. 

Writing advice for AI governance researchers

20/1/2026

 
This is some advice I often find myself giving people about writing, especially to folks doing AI governance research. 

Write to inform decisions
  • Back-chain from decisions. Make sure you have concrete high-stakes decisions in mind that you’re trying to have your research and writing inform. Imagine yourself in the shoes of someone making that decision. What would you want to know? What would you be confused about? 
  • Remind yourself of why you’re writing. Keep coming back to two questions: what decision am I trying to inform, and who am I writing for? Write down answers to these at the top of the document you're working on. 
  • Keep high standards. The AI governance field is small. There's a huge number of questions. It's surprisingly achievable to actually write the best thing on many topics. Set that as your goal. If you write something on a question, you might have written one of ~3 thorough treatments on the topic. It’s worth making sure you get it right: people might take it seriously!

Get it down
  • Find ways to get started. The hardest part of writing is often getting into flow. Experiment with different methods: talking to people or chatbots, writing a strawman, generating a first pass with AI, talking out loud, writing first thing in the morning, reading something on the topic you disagree with. Different things work for different people – and for different moods.
  • Separate the builder and critic. It's hard to get writing done if you're constantly asking yourself whether a sentence is quite right. If you’re constantly critiquing your writing. One thing I've found helpful is trying to separate my "builder" and my "critic". Give yourself the space to just get your thoughts down on the page. Edit later.
  • Just answer the question. Try to answer the key question of a research project earlier than you'd like to. By doing that, you'll get a sense of where you've got gaps. It also lets you get feedback earlier. For example, try to condense your writing into key 5-10 points every ~40 hours of work. 
  • Talk about your research. Writing is synthesis. Talking forces you to synthesise – and can give you new perspectives, or gets you unstuck. 
  • Zoom in and out. The reason it's hard to write a particular section or sentence is often that you need to change something at a higher level of abstraction. Maybe the flow of the whole piece is off, or you're trying to argue for the wrong thing. Sometimes it's the other way around: you don't know what structure the whole piece should have, but you know some specific things you want to say. If so, start there.

Make it good
  • Avoid satisficing. Your writing should be: clear, concise, skimmable, insightful. Doing all of these at once is hard, but it’s often worth the effort. It's often worth putting far more effort into improving a piece than feels necessary. 
  • Iterate. Most people need many passes at a piece of writing to make it good. Expect to need multiple intense drafting rounds. 
  • All words should do work. Ask yourself what work some piece of text (section, paragraph, sentence, clause, word) or a concept is doing. If it's not doing anything, remove it.
  • Summary first. Don't lay out a long argument and only summarise at the end. Summarise your key claim(s) three times before laying them out in full: in the title, in the abstract, and in the executive summary or introduction. 
  • Reduce abstraction and hedging. We're often very concerned about making sure what we write is true. One way to make sure what you say is true is to be abstract and vague. But that often just reduces the information you communicate. Try being more concrete and specific instead. 
  • Use concrete words and short sentences. Avoid using long words, lots of adjectives, or abstract nouns. Make sure readers don’t lose themselves in your sentences. 
  • Show, don't tell. When writing a summary or introduction, make the key claims, along with the supporting evidence; don’t just say what you’re gonna say. 
  • Clarity is a huge part of the job. There are lots of important ideas in this field waiting for someone to write them up clearly. Clarity helps your thinking, and it's the best way to help your ideas spread.
  • Proportional effort. Your effort-per-word should be roughly proportional to how likely readers are to engage with it. Your title, abstract, figures (including figure descriptions!), and intro/executive summary are almost always what people will engage with the most. 
  • Listen to your gut. When I'm editing, I often get this feeling of "ugh, this isn't quite right." Listen to that voice. That feeling is usually right, even if you can't yet articulate what's wrong.
  • Make it skimmable. Add bolded text and headers to help people parse things quickly. Have your headers give the key points. Use topic sentences. 

Respond to feedback
  • Internalise, don't just address. When responding to comments, Internalise what the comment has to say and address it only insofar as doing so will improve your piece. By falling into the mode of simply "dealing with" the comment – without thinking about how your edit affects the piece as a whole – you’ll often end up with a Frankenstein piece.
  • If they misread it, that's on you. If someone misreads something, it was probably for a reason. Treat it as a signal that your writing is unclear – even if you think they "should" have understood.
    ​​
Hone the craft
  • Just write. The best way to learn to do a thing is often to just do the thing. So keep writing. But write intentionally: get feedback from others, and reflect on what's working.
  • Read good writing. Find writing you think is good and engage with it. Get a sense of why it's good. Immerse yourself in it. 
  • Read about good writing. There are some good books about writing. People often recommend: Style: Lessons in Clarity and Grace and The Sense of Style.

Some advice for aspiring AI governance researchers

31/7/2025

 
Here's some advice I often find myself giving folks getting into the AI governance space: 

Take trends seriously. A lot of the impact GovAI and myself have had often relies on making some pretty basic inferences from important trends. E.g. the insight behind the frontier AI regulation work was simple: AI systems seem to keep improving. If they do, they’ll pose national security risks. If that happens, that may well warrant some government intervention. In short, run to where the ball is going.

Make up your own mind. One mistake I sometimes see people making is focusing too much on not being wrong. To not get epistemic egg on their face. That can result in you not take risks, not trying to actually figure things out. That can stunt your own development. But it can also stunt the development on the field. Your job as a researcher is to contribute to our collective understanding of these really tricky issues, not just your own.

Get a handle on the technical side of things. Read Epoch’s work. Read the model cards from the latest models. Think about what inference scaling might mean. Sometimes people who come from non-technical backgrounds feel like they can’t or won’t understand these things. If you feel that way, I get it, but I’d suggest you just give it a go. The bar you need to meet is not being able to train an AI model, it’s understanding how they work, how they’re developed, and being able to understand the policy implications of that.

Ground yourself. Make sure that you have concrete decisions in mind when designing your research. Talk to decision-makers. Oftentimes, their constraints are not what you’d expect them to be. Further, lots of policy is more about nitty-gritty detail than about the high-level considerations. 

Swim in the waters. The best way to learn about something is to really immerse yourself in it, get obsessed by it. If you’re excited about the stuff you’re working on, lean into that. If you want to do with your Saturday afternoon is reading new papers, go for it. 

Back yourself. This is a young field. It is entirely possible to become a world-leading expert on an important topic in AI governance 1 to 2 years from now. That sounds kind of crazy – especially to someone like me who grew up in Sweden where tall poppy syndrome is rife (or “jantelagen” in Swedish) – but I think it’s true. 

Come pick some fruit. People often talk about looking for low-hanging fruit. My experience recently in the AI governance space is just low hanging fruit smacking me in the face. Of slipping on apples just strewn on the ground, rotting. We need some help here. We need some fruit pickers. We need some pie bakers. Come help out!

​A Collection of AI Governance Research Ideas (2024)

4/11/2024

 
Collated and Edited by:
Moritz von Knebel and Markus Anderljung
More and more people are interested in conducting research on open questions in AI governance. At the same time, many AI governance researchers find themselves with more research ideas than they have time to explore. We hope to address both these needs with this informal collection of 78 AI governance research ideas.
​​
About the collection
There are other related documents out there, e.g. this 2018 research agenda, this list of research ideas from 2021, the AI subsection of this 2021 research agenda, this list of potential topics for academic theses and a more recent collection of ideas from 2024. This list differs in being (i) more recent and (ii) being focused on collating research ideas, rather than questions. It’s a collection of research questions along with hypotheses of how the question could be tackled.

Read More

Frontier AI Regulation in the UK

25/9/2024

 
 
By: Markus Anderljung
The Labour government has committed to introduce legislative requirements on “the developers of the most powerful AI systems,” such as OpenAI, Google DeepMind, Anthropic, xAI, and Meta[1]. These systems are often referred to as “frontier AI”: the most capital-intensive, capable, and general AI models, which currently cost 10-100 million dollars to train.

Frontier AI systems are rapidly improving. Their continued development will have wide-ranging societal effects, creating new economic growth opportunities but also serious new risks. With these new legislative requirements, the government will aim to prevent the deployment of systems that pose unacceptable risks to public safety.

Read More

How Technical Safety Standards Could Promote TAI Safety

9/8/2022

 
​Cullen O’Keefe, Markus Anderljung[1] [2]

Summary
Standard-setting is often an important component of technology safety regulation. However, we suspect that existing standard-setting infrastructure won’t by default adequately address transformative AI (TAI) safety issues. We are therefore concerned that, on our default trajectory, good TAI safety best practices will be overlooked by policymakers due to the lack or insignificance of efforts which identify, refine, recommend, and legitimate TAI safety best practices in time for their incorporation into regulation.

Given this, we suspect the TAI safety and governance communities should invest in capacity to influence technical standard setting for advanced AI systems. There is some urgency to these investments, as they move on institutional timescales. Concrete suggestions include deepening engagement with relevant standard setting organizations (SSOs) and AI regulation, translating emerging TAI safety best practices into technical safety standards, and investigating what an ideal SSO for TAI safety would look like.

Read More
  • Home
  • Research
  • Blog
  • Talks & Podcasts
  • AI calibration game