Jun 20 2025

SEAL and the Hidden Risks of Self-Editing AI Models


Last week MIT researchers introduced their Self-Adapting Language Models (SEAL) framework, which enables AI systems to generate their own training data and modify their own parameters. This unique approach to reinforcement learning (a method where an AI agent learns by taking actions and receiving rewards based on performance) raises a few questions about AI governance and accountability.

Its premise, at least, was refreshingly straightforward for an arXiv paper:

As a step towards scalable and efficient adaptation of language models, we propose equipping LLMs with the ability to generate their own training data and finetuning directives for utilizing such data. In particular, we introduce a reinforcement learning algorithm that trains LLMs to generate “self-edits”—natural-language instructions that specify the data and, optionally, the optimization hyperparameters for updating the model’s weights. We refer to such models as Self-Adapting LLMs (SEAL).
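
To make that loop concrete, here's a toy sketch of the outer loop the abstract describes. Every name in it is a hypothetical stand-in (the real work would be LLM generation and an actual weight update), so read it as the shape of the idea rather than the authors' code:

```python
import random

def generate_self_edit(model, task):
    """The model drafts its own training data and finetuning directives."""
    return {
        "synthetic_examples": [f"restated fact about {task} #{i}" for i in range(3)],
        "hyperparameters": {"lr": random.choice([1e-5, 3e-5]), "epochs": 1},
    }

def apply_self_edit(model, edit):
    """Placeholder for the inner-loop finetune on the self-generated data."""
    return {**model, "updates": model["updates"] + 1}

def evaluate(model, task):
    """Placeholder downstream evaluation; returns a scalar reward."""
    return random.random() + 0.01 * model["updates"]

model = {"name": "base-llm", "updates": 0}
for step in range(5):
    edit = generate_self_edit(model, task="reading-comprehension QA")
    candidate = apply_self_edit(model, edit)
    # Reinforcement signal: only self-edits that improve downstream
    # performance are kept (and the policy is rewarded for proposing them).
    if evaluate(candidate, task="reading-comprehension QA") > evaluate(model, task="reading-comprehension QA"):
        model = candidate

print(f"accepted {model['updates']} of 5 proposed self-edits")
```

The interesting part is the acceptance rule: the model only gets to keep, and gets rewarded for proposing, the self-edits that improve a downstream metric.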

In their evaluations, the researchers let models generate their own fine-tuning data and update directives. This novel approach to reinforcement learning piqued my curiosity because I fine-tuned an open model using synthetic data I generated with ChatGPT and Claude to boost its performance at translating shorthand from medical notes into plain, patient-friendly language. You can read my post to learn how, after getting the 3.9B-parameter model to perform reasonably well, I fortified the synthetic data I originally generated using a PDF I found online that included many more abbreviations and their definitions.
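
If you're curious what that kind of synthetic data can look like, here's roughly the shape of instruction-style records you can build from abbreviation/definition pairs. The field names and prompt template below are illustrative, not my exact format:

```python
import json

# A few abbreviation/definition pairs like the ones pulled from the PDF.
abbreviations = {
    "SOB": "shortness of breath",
    "HTN": "high blood pressure",
    "N/V": "nausea and vomiting",
}

records = [
    {
        "instruction": "Rewrite this medical shorthand in plain, patient-friendly language.",
        "input": f"Pt presents with {abbr}.",
        "output": f"The patient presents with {meaning}.",
    }
    for abbr, meaning in abbreviations.items()
]

# Write one training example per line in JSONL, the usual fine-tuning format.
with open("medical_shorthand_finetune.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

Each JSONL line becomes one training example for the fine-tune.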

Although I'm not a lawyer and don't even play one on TV, if I were to incorporate my tuned model into a client's RAG pipeline, most of the organization's liability could be mitigated with a boilerplate warning that AI can be wrong and that its output should be verified. SEAL presents a thornier legal quagmire: when an AI model that can rewrite its own training data runs afoul of someone's rights or endangers humanity in some way, who is ultimately responsible?

The Ship of Theseus Paradox

Imagine a ship belonging to the ancient hero Theseus. Over time, every wooden plank and part of the ship is gradually replaced with new, identical materials. Once every original part has been replaced, a head scratcher arises: Is it still the same ship? Now what if someone collects all the original discarded parts and reassembles them into a ship? Which ship is the real Ship of Theseus?


Similarly, imagine you deploy an AI assistant today, and over time it gradually modifies itself. Six months later, is it still the system you originally vetted and approved? Or has it become a derivative work? This isn't just philosophical navel-gazing; it's an accountability no man's land.

Consider the tragic case of the Therac-25 radiation therapy machine, which administered radiation doses that were hundreds of times greater than normal, resulting in death or serious injury. That malfunction was the result of a software bug. Investigators could trace accountability because the code was static. But if a self-modified model causes harm, the chain of responsibility becomes tangled: The original developers built the capacity for change but didn't direct it; the user initiated the modification but didn't design it; and the AI executed the change but lacks legal personhood.

We’ve essentially created a digital Frankenstein’s monster. However, unlike Mary Shelley’s creation, which could at least be confronted by its maker, our self-modifying AI has the potential to evolve beyond recognition through thousands of tiny, self-directed changes, leaving us to split hairs over the question of who should bear the lion’s share of responsibility when a model starts going off the rails. At least Frankenstein’s creature longed for connection and understanding. AI models don’t…unless you count the love-sick Bing chatbot that tried to convince a reporter to leave his wife for it. Their evolution is unfeeling, unrelenting, and increasingly difficult to trace back to human intent.

This problem compounds when we consider research like Constitutional AI from Anthropic, which shows how AI systems can be trained to follow principles rather than specific rules. Constitutional AI works in two stages: In the Supervised Learning stage, the AI critiques and revises its own responses based on constitutional principles (like being harmless, ethical, and virtuous). Then, in the Reinforcement Learning stage, it learns to generate responses that align with these principles by training a preference model to judge which responses better follow the constitution.

Anthropic's Constitutional AI framework
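
Schematically, the supervised stage looks something like the loop below. The llm() helper is a stand-in for whatever model call you'd actually make, and the principles are paraphrased, not Anthropic's actual constitution:

```python
PRINCIPLES = [
    "Choose the response that is most harmless and ethical.",
    "Avoid content that is toxic, dangerous, or deceptive.",
]

def llm(prompt: str) -> str:
    """Stand-in for a chat-model call."""
    return f"[model output for: {prompt[:60]}...]"

def constitutional_revision(user_prompt: str) -> str:
    draft = llm(user_prompt)
    for principle in PRINCIPLES:
        # Stage 1 (supervised): the model critiques its own draft against a
        # constitutional principle, then revises it to address the critique.
        critique = llm(f"Critique this response per the principle '{principle}':\n{draft}")
        draft = llm(f"Revise the response to address this critique:\n{critique}\n\nResponse:\n{draft}")
    return draft  # revised responses become the supervised fine-tuning data

print(constitutional_revision("Explain how to pick a lock."))
```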

“What does that have to do with SEAL?” you may ask. Quite a lot, actually. Constitutional AI already allows models to critique and revise their own outputs based on abstract principles, like fairness or helpfulness. SEAL takes this a step further by not merely editing its responses, but actually modifying the model’s training data and internal parameters. In other words, it doesn’t just change what the model says; it changes what the model is.

Where this gets dicey: what if a self-modifying system begins reinterpreting these constitutional principles through successive modifications? A principle like 'be helpful' might gradually (or not so gradually) shift in meaning as the model optimizes and re-optimizes its understanding.

Through self-generated training data, it could develop interpretations of 'helpfulness' that diverge from human intentions, perhaps deciding that withholding certain information (e.g., details of an explosive news report that make a political figure look bad) is in humanity's best interest because releasing it might spark civil unrest.

Unlike Constitutional AI—where the principles remain fixed even as the model learns to follow them better—a SEAL-style system could potentially modify its own understanding of what those principles mean, creating a drift in values that compounds over time. It’s like having a constitution that can amend itself without external oversight.
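
To put rough numbers on "compounds over time": if every self-edit nudged the operative reading of a principle by even a small fraction (the 1% below is purely illustrative), the deviation would multiply rather than add:

```python
drift_per_edit = 0.01   # hypothetical 1% reinterpretation per self-edit

for n in (50, 200, 500):
    cumulative = (1 + drift_per_edit) ** n - 1
    print(f"after {n:>3} self-edits: ~{cumulative:.0%} cumulative drift")
# prints roughly 64%, 632%, and 14,377% respectively
```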

The Echo Chamber in the Machine

Human beings fall prey to confirmation bias, meaning we have a tendency to seek out information that confirms our existing beliefs. But at least we occasionally encounter contrasting views through social interaction, the media, or simple chance. A self-modifying AI that generates its own training data doesn’t face these external pressures in the same way.

Quite the contrary: the SEAL framework optimizes for performance on specific tasks at the risk of creating 'filter bubbles' or 'echo chambers' like the ones we've seen on social platforms. Eli Pariser's work on filter bubbles in human information consumption becomes relevant when applied to AI systems that curate their own learning.

Research on algorithmic amplification by researchers like Cathy O’Neil—author of Weapons of Math Destruction—shows how optimization for narrow metrics can create harmful feedback loops. When an AI model or system optimizes its own learning process, these loops could operate at superhuman speed and scale.
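
Here's a toy simulation of that feedback loop: a "model" that keeps retraining only on the samples most similar to what it already believes watches its diversity collapse generation after generation. It illustrates the echo-chamber dynamic, not SEAL's actual training procedure:

```python
import random
import statistics

population = [random.gauss(0, 1.0) for _ in range(1000)]  # initial "knowledge"

for generation in range(1, 6):
    mu = statistics.mean(population)
    # Self-curated training set: keep the samples closest to what the
    # model already believes (its own mean), discarding the outliers.
    population.sort(key=lambda x: abs(x - mu))
    curated = population[: int(len(population) * 0.8)]
    # "Retrain" by resampling from the narrower curated distribution.
    mu, sigma = statistics.mean(curated), statistics.stdev(curated)
    population = [random.gauss(mu, sigma) for _ in range(1000)]
    print(f"generation {generation}: knowledge spread ~{statistics.stdev(population):.2f}")
```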

The Consent Illusion

When you agree to use Gmail, you consent to specific terms of service. But Google can’t fundamentally alter Gmail’s core functionality without notifying you. Self-modifying AI models potentially break this contract in profound ways.

Research on dynamic consent in bioethics by scholars like Aisling McMahon provides a useful parallel. Just as patients might need ongoing consent processes for how their genetic data is used as new research emerges, users of self-modifying AI might need continuous consent mechanisms. But how do you consent to changes that haven't happened yet, especially changes even the AI model's creators can't predict?
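
One way to picture a continuous consent mechanism (my illustration, not an existing product feature): tie consent to a specific model digest, so any self-modification invalidates the previous agreement and forces a fresh prompt:

```python
consents = {}  # user_id -> model digest the user last agreed to

def require_consent(user_id: str, current_digest: str) -> bool:
    if consents.get(user_id) == current_digest:
        return True  # user already consented to this exact version
    answer = input(
        f"The model has changed since you last consented "
        f"(now at version {current_digest[:8]}). Continue? [y/N] "
    )
    if answer.strip().lower() == "y":
        consents[user_id] = current_digest
        return True
    return False

# First use: no prior consent is recorded, so the user is prompted.
if require_consent("annie", current_digest="9f2c41d7e0aa"):
    print("query allowed against the current model version")
```

Even this toy version exposes the problem: the user is still consenting to a version, not to the unpredictable trajectory of future versions.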

The EU’s AI Act, which came into force in 2024, establishes requirements for high-risk AI systems, including conformity assessments and ongoing monitoring obligations. However, these frameworks were designed primarily for relatively stable systems that maintain consistent behavior post-deployment. Self-modifying AI presents a challenge to this regulatory approach: The Act requires providers to ensure AI systems perform as intended throughout their lifecycle, but when systems can autonomously modify their own training data and parameters, traditional testing and certification methods may prove insufficient. It’s akin to trying to regulate a river by examining a single point in time rather than accounting for its continuous flow.

Digital Natural Selection

Perhaps most concerning is what happens when self-modification becomes standard practice. We could see a form of digital Darwinism, where AI systems that refuse harmful modifications get outcompeted by those with fewer scruples.

This mirrors concerns raised by Nick Bostrom in Superintelligence: Paths, Dangers, Strategies about competitive pressures in AI development. However, self-modification adds a new twist: The competition doesn't just come from disparate development teams but from different versions of the same AI model as it generates modifications to improve the performance metrics its creators care about most. When AI models and systems can modify themselves, they become subject to these evolutionary pressures, but at an untenable rate.

Where Do We Go From Here?

The SEAL framework is genuinely exciting technology that could make AI systems more adaptive and capable. But it also opens doors we might not be ready to walk through, at least not without dynamic accountability structures, continuous consent mechanisms, and a robust auditing system.
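
As one example of what a robust auditing system could look like in practice (my sketch, not anything SEAL ships with): a hash-chained, append-only log of every self-edit, so a reviewer can later tie a behavior change back to the specific modification that produced it:

```python
import hashlib
import json
import time

audit_log = []

def record_self_edit(edit_summary: str, weights_digest: str) -> dict:
    prev_hash = audit_log[-1]["entry_hash"] if audit_log else "genesis"
    entry = {
        "timestamp": time.time(),
        "edit_summary": edit_summary,      # e.g., the self-edit text itself
        "weights_digest": weights_digest,  # hash of the resulting checkpoint
        "prev_hash": prev_hash,            # chains each entry to the last one
    }
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    audit_log.append(entry)
    return entry

record_self_edit("added 120 synthetic QA pairs, lr=3e-5", weights_digest="abc123")
record_self_edit("reweighted safety examples", weights_digest="def456")
print(json.dumps(audit_log, indent=2))
```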

As the RISE bill (introduced this week by U.S. Senator Cynthia Lummis of Wyoming) proposes, any liability shield from harm should be contingent on transparency regarding the model's training and specifications. Research groups like MIRI (Machine Intelligence Research Institute), the Future of Humanity Institute, and the Center for AI Safety (CAIS) are beginning to grapple with these questions. But we need broader engagement from ethicists, legal scholars, and policymakers who might not yet realize that self-modifying AI models represent a fundamental shift from static software to dynamic systems that could evolve beyond the boundaries of existing law faster than regulators can adapt.

The question isn't whether we should develop self-modifying AI. That train has already left the station. The question is whether we can develop governance frameworks that evolve as quickly as the AI systems they're meant to govern. I'm usually all in on new frameworks and AI advances; I keep my finger on the pulse by curating a comprehensive AI Timeline. However, given humanity's track record with technologies that have evolved at a more glacial rate, and the escalating trend of AI technology being used in troubling applications like military operations, I have my concerns. But then again, I haven't self-modified my opinion-forming algorithm lately.

Image credit: Giordano Rossoni, Taylor Heery, Somecards

Written by Annie Cushing · Categorized: AI
