This article does not aim to predict the future of generative AI. Rather, it aims to identify points of contact between regulation and technology. The first part of this analysis examines three characteristics that shape the copyright issues raised by generative models: scale, synthesis, and severability. These characteristics set the stage for the second part of the article: how miscategorizing a technology can lead to regulatory missteps. There is an underlying tension at play as well: large companies are funding what appears to be an arms race to develop ever larger, more expensive generative AI models, and as a result, questions about data sourcing arise. Where generative models get their data is thus a point of contact with the law, and more specifically with copyright law.
When looking at the copyright debate surrounding generative AI, one factor to consider is what regulators contemplated when carving out exceptions that permit reproducing copyrighted materials for creative, research, or educational purposes. Whereas individual artists or authors may have studied and replicated part of another’s corpus in order to develop their own skills, those acts were individual and not performed at scale.
In contrast, technologies such as the large language models (LLMs) and generative adversarial networks (GANs) that underlie ChatGPT and Midjourney ‘deconstruct’ copyrighted material at scale. Deconstruction here means using AI to detect minute features in training data and extract them to build a model that can then generate similar, but not identical, data. Feed these systems enough Orwell or Picasso, and they can replicate the style of the original artist. This seems similar to someone studying literature or art and then writing or painting. However, the differentiating factor is the scale at which generative AI systems perform these actions. Rather than studying hundreds or even thousands of novels and paintings, generative AI training datasets can consist of nearly one trillion words, the equivalent of roughly 10 million novels at 250 pages each.
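The arithmetic behind that scale claim can be checked with a quick back-of-envelope calculation. The words-per-page figure below is an assumption of mine (a common publishing rule of thumb), not a number from this article:

```python
# Back-of-envelope check of the scale comparison above.
# Assumed: ~400 words per page (publishing rule of thumb), 250 pages per novel.
words_per_page = 400
pages_per_novel = 250
words_per_novel = words_per_page * pages_per_novel   # 100,000 words per novel

training_corpus_words = 1_000_000_000_000            # ~1 trillion words
novel_equivalents = training_corpus_words // words_per_novel

print(novel_equivalents)  # 10000000, i.e. 10 million novels
```

Under these assumptions, a one-trillion-word corpus does indeed work out to about 10 million novel-length works.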
Like software in general, AI models are replicable and can be deployed to produce millions of stories or images while serving millions of users. The caveat is that generative models are not fully automated, even systems like AutoGPT: they still require users to input prompts or outline a series of goals. As a result, massive volumes of output still require human input and, thus, some degree of originality to produce artistic works. The key point is that just as generative systems train at scale, they also deploy at scale and produce large amounts of data.
As such, another question this analysis raises is how regulators grapple with ‘detector’ and ‘generator’ AI models that work together. Effectively, what do regulators do with an AI system that can generate the data on which it then trains itself? This can happen through synthetic data: data that is similar, but not identical, to copyrighted data, produced by a model that was itself trained on copyrighted data. To illustrate, consider using AI to generate synthetic X-rays based on real patient data, and then training another AI on those synthetic X-rays to perform preliminary medical diagnoses on actual patients. The issue here is also one of scale. AI systems could eventually produce massive amounts of training data that is one degree removed from someone else’s copyrighted data. Assuming no exceptions to copyright apply, replicating copyrighted data in a training set would infringe copyright; it is questionable whether the resulting model infringes, and even less clear whether the synthetic data that the model generates does.
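The two-stage pipeline described above can be sketched in miniature. This is a purely illustrative toy, not a real medical-imaging system: the ‘generator’ is a simple statistical model fitted to stand-in data, and the ‘detector’ is a trivial threshold classifier trained only on the synthetic samples:

```python
import random
import statistics

# Stage 1 ('generator'): fit a simple model to stand-in 'real' data,
# then sample similar-but-not-identical synthetic records from it.
random.seed(0)
real_data = [random.gauss(100, 15) for _ in range(1000)]  # stand-in for real patient measurements

mu = statistics.mean(real_data)
sigma = statistics.stdev(real_data)
synthetic_data = [random.gauss(mu, sigma) for _ in range(1000)]  # one degree removed from the source

# Stage 2 ('detector'): train a second model only on the synthetic data.
# Here, a trivial classifier flags readings more than two standard
# deviations above the synthetic mean.
threshold = statistics.mean(synthetic_data) + 2 * statistics.stdev(synthetic_data)

def flag_anomaly(reading: float) -> bool:
    """Preliminary 'diagnosis': is this reading unusually high?"""
    return reading > threshold
```

The point of the sketch is structural: the detector never touches the original data, only material one step removed from it, which is precisely what makes the infringement analysis murky.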
Another factor differentiating generative AI is severability. If copyright laws begin to treat generative AI models as potential infringers, how will organizations ‘sever’ the infringing parts of a model from the rest? The training data has already been processed, so removing infringing material from a model’s training set would only prevent future violations, not change the existing model itself. If organizations are required to retrain models, costs would soar: estimates peg the training cost of OpenAI’s older GPT-3 at $4.6 million. Furthermore, the entities that bear these costs would typically be those developing AI in public view, not teams operating in the shadows. Still, whether a model is developed in compliance with copyright law or not, once it is released, it is difficult to contain.
The copyright question is one of many in the generative AI space, but it illustrates a broader point: many legal systems, both common law and civil, may not have the conceptual frameworks in place to categorize generative AI systems. As a result of these gaps, regulators may end up trying to fit a square peg into a round hole. Researchers Natali Helberger and Nicholas Diakopoulos noted that when it comes to generative AI systems, “it is the user who determines whether the system falls into the low or high-risk category,” rather than the system itself being categorized as such. Hacker et al. made a similar point, noting that with the exception of non-discrimination provisions, regulations for large generative AI systems “should generally focus on deployed applications, not the pre-trained model.”
Fundamentally, by looking at the nature of the technology in question, along with its implementation, use, and effects, one may come away with a different view than if one simply filed the technology under a wide umbrella. While generative systems and decision-making systems both fall broadly within the umbrella of ‘AI’, a deeper analysis of how these systems are built and used leads to different conclusions about the roles they play in society. Although the industry’s rush to produce bigger models may result in harm, regulators’ rush to regulate may result in gaps and loopholes. Categories matter, and misconceptualizing them may lead to a misapplication of legal frameworks.
Hassan Patel is a lawyer by training with a background in science, business, technology, and government. He is the Director of Global AI Compliance Engineering at Fairly AI, an AI governance platform. He also sits on the Expert Drafting Team to develop the 2nd Edition of CAN/DGSI 101 Ethical design and use of automated decision systems for the Canadian Digital Governance Council.