Introduction: the role of data protection in the AI race
While regulatory approaches to addressing AI risks (e.g., horizontal vs sector-specific, comprehensive vs multi-layered, risk-based vs rights-based) are being discussed globally in various fora, the relevance of data protection law cannot be overlooked.
When developing and deploying AI systems that are built to extensively and continuously ingest and regurgitate all kinds of information, compliance with data protection law should be a priority. Core principles such as lawfulness, transparency and fairness, purpose limitation, data minimisation, storage limitation, accuracy, security, accountability and more broadly data protection by design should be at the forefront of the assessment.
But will these principles really impact the AI race in practice? Let’s focus upon lawfulness and its influence in the ChatGPT saga.
Key issue: which legal basis for training LLMs?
Whether they are called large language models (LLMs) or foundation models (note that there is no consensus on this term), these fancy AI models are usually trained on vast quantities of data, including personal, non-personal and mixed datasets, most of them scraped from publicly available sources.
Under the EU General Data Protection Regulation (GDPR), one of the primary requirements for a controller is to establish a valid legal basis to ground the collection and use of personal data for a specific purpose, including language model training.
Importantly, and contrary to US privacy laws, such as the California Consumer Privacy Act as amended by the California Privacy Rights Act, EU data protection law does not exclude from its remit personal data that is publicly available. Under US law there is no expectation of privacy with respect to publicly available information, which explains, to some degree, “the trans-Atlantic divide on privacy and data protection”.
ChatGPT in the spotlight
Developed by OpenAI, a for-profit US company, ChatGPT is a large generative AI model built upon the GPT (Generative Pre-trained Transformer) language model and trained to follow/process an instruction and provide, by understanding or interpreting the user’s natural language input, a human-like response. Put bluntly and as explained here, it is “a computational system designed to generate sequences of words, code or other data, starting from a source input, called the prompt.”
How is ChatGPT trained? In the first stage (pre-deployment), the model is trained on massive quantities of data collected from countless sources. Pretraining or initial learning data may come from bulk databases of third-party providers or may be extracted from vast amounts of unstructured and unlabeled data stored on the web (through website scraping and data mining tools), including “personal information that is available on the public internet”. Next, in the post-deployment stage, ChatGPT is retrained to improve its performance, and users’ inputs are used in this process.
Given the above, it was only a matter of time before the first GDPR enforcement actions.
Let’s dive into the first enforcement order: the Italian SA’s order
The Italian Data Protection Supervisory Authority’s order is the first of its kind targeting LLMs. It is worth mentioning that the Italian SA has shown, for some time now, interest in the impact of AI systems on data subjects’ rights and interests, and that the order has triggered interest in other fora. The EDPB has launched a task force on the subject and other SAs have opened inquiries, such as the French SA, the Spanish SA and German state SAs. Outside the EU, Canada’s federal privacy commissioner has launched probes following complaints. The UK ICO and the US FTC chair have also commented on the topic.
On 30 March 2023, the Italian SA ordered, with immediate effect, the provisional restriction of the processing of Italians’ data. At the same time, the Authority opened an investigation. Among the reasons given to justify the order, one finds a violation of GDPR Article 6 as well as a violation of Article 8. The SA notes:
“the absence of an appropriate legal basis in relation to the collection of personal data and their processing for the purpose of training the algorithms underlying the operation of ChatGPT”.
On 11 April, the SA followed up with more clarifications and gave OpenAI a detailed list of the measures to be adopted. In particular, regarding the legal basis challenge, the SA requested:
“changing the legal basis of the processing of users’ personal data for the purpose of algorithmic training, by removing any reference to contract and relying on consent or legitimate interest as legal bases by having regard to the assessment the Company is required to make from an accountability perspective;”.
The SA thus excluded the performance of a contract legal basis for the AI training phase and gave OpenAI two options: rely either upon consent or upon the legitimate interest of the controller (if the legitimate interest ground is a valid option, it is unclear why there would be a violation of Article 8).
The SA did not distinguish between the pre-deployment training and post-deployment retraining. While the performance of a contract legal basis does not make sense for pre-deployment training, it is less obvious, at least at first glance, why post-deployment training to the benefit of the AI-as-a-Service provider, i.e., service improvement (to be distinguished from the fine-tuning performed by the AI-as-a-Service customer), could not be grounded on the performance of a contract legal basis. The contractual necessity test under Article 6(1)(b), though, is an objective test. Far from a formalistic approach, it considers, among other aspects, existing power imbalances between the parties to the contract, which implies a narrow interpretation of its remit. The test thus demands that “there must be no realistic, less intrusive alternatives, taking into account the reasonable expectations of the data subject”. A question to ask in this context would be whether the processing of the prompts for service improvement would objectively be an integral part of using the service as per EDPB’s guidelines 2/2019. The answer is likely to be negative and should not be impacted by the compatibility test of Article 6(4).
For its part, OpenAI’s privacy policy now lists legitimate interest among its legal bases for processing, including for model training:
“Our legitimate interests in protecting our Services from abuse, fraud, or security risks, or in developing, improving, or promoting our Services, including when we train our models. This may include the processing of Account Information, Content, Social Information, and Technical Information. See here for instructions on how you can opt out of our use of your information to train our models”.
Interestingly, OpenAI announced a new subscription tier for ChatGPT enterprise customers, ChatGPT Business, “for professionals who need more control over their data as well as enterprises seeking to manage their end users”. This modality will follow the OpenAI API’s data usage policies, implying that customer data would not be used to retrain OpenAI models by default. The legal basis in this particular case would thus be consent.
Has OpenAI managed to cure its position?
ChatGPT is now accessible from Italy with a light implementation of age declaration!
On 28 April, the Italian SA declared:
“The Italian SA welcomes the measures OpenAI implemented and calls upon the company to comply with the additional requests laid down in its 11 April order”.
Despite the SA’s optimism, some challenges remain.
The extent to which the GDPR restricts training in these contexts should depend upon the upshot of the balancing of the legitimate interest of OpenAI as data controller (i.e., to conduct its own business) and third parties with the rights and freedoms of data subjects (users and non-users) as explained by Pablo Trigo Kramcsák here.
First, Google Spain teaches us that the legitimate interest legal basis works better when data subjects can exercise corrective rights broadly defined, e.g., the right to be delisted.
The Italian SA was clear that the right to object should benefit both users and non-users, as it had asked OpenAI to:
“mak[e] available, on the Company’s website, at least to data subjects who are connected from Italy, whether or not they are also users of the service, a tool by which they can exercise their right to object to the Company’s processing of their personal data for the purpose of training algorithms and providing the service insofar as such data have been obtained from third parties;”.
In its privacy notice, OpenAI refers to an article in its “help center” describing the possibility to opt out from the reuse of the prompts inputted into the model (conversations), which are linked to a ChatGPT user account. OpenAI also refers to a form for requesting personal data removal, which is said to operationalise the right to object. OpenAI states that it “may use additional sources to verify information, balancing privacy and free expression in accordance with applicable law.” More generally, OpenAI states that individuals “also may have the right to access, correct, restrict, delete, or transfer their personal information.”
How are the rights to object, as well as the rights to correction or erasure, really implemented? For this purpose, individuals must provide “evidence” of data processing, including “any relevant prompts that resulted in the model mentioning the data subject”. But OpenAI says that it works to remove personal information from its training datasets upon request only “where feasible”. In fact, once data is processed by these large models, it can be difficult to isolate specific data points. Besides, Lilian Edwards has referred to the multiple problems involved in exercising the rights of erasure and rectification with respect to LLMs, and Margaret Mitchell anticipates that “OpenAI is going to find it near-impossible to identify individuals’ data and remove it from its models”.
Second, what is the situation of special categories of personal data, subject to the strict rules of Article 9? An LLM could be pretrained and retrained on this type of data. For example, users of these systems may, through their prompts/inputs, reveal (consciously or unconsciously) sensitive data (concerning them or third parties). The complexity of this matter increases when considering that the CJEU has adopted a broad and contextual approach to this concept, extending the protection to potentially sensitive data. In principle, data controllers can only process special categories of data if they can meet one Article 9(2) condition, as well as one Article 6 condition. As mentioned by Irene Kamara and Paul de Hert, “the legitimate interest of the controller is not as such a ground constituting an exception to the general prohibition of processing of sensitive personal data in art. 9(1)”. Sebastião Barros Vale recalls that the Article 9(2) exception for special category data made manifestly public has been narrowly interpreted by the EDPB and discusses Article 10(5) of the AI Act proposal (which allows providers of high-risk AI systems, under certain conditions, to process Article 9(1) data when strictly necessary for negative bias control). When pondering the implications of the proposed rule, we must consider not only its scope (it would not be applicable to providers of non-high-risk AI), but also its interaction with Articles 9(2) and 6 GDPR (Marvin van Bekkum and Frederik Zuiderveen Borgesius point out that Art. 10(5) would constitute a new “exception to the ban on using special category data”), along with the interpretation of the notion of “substantial public interest”, mentioned in Recital 44 of the proposal, which has been further specified in the compromise text of the European Parliament (related to negative bias detection and correction). In any case, fairness considerations, as explained below, should remain relevant.
Third, the data processed by ChatGPT (related to both users and non-users) includes data of vulnerable groups, such as children. The question here is what it means to implement data protection by design and by default for children. Age verification is less straightforward than people usually think, as explained by the French SA.
Finally, as hinted above, there is more to the legitimate interest balancing than Chapter III individual rights. Interestingly, the French SA has questioned the possibility of using the legitimate interest basis for personal data processing that is intrusive and massive, especially when it involves obtaining from the Internet information concerning millions of people. It found that:
“individuals who have published photographs of themselves on websites, or consent to such publication with another data controller, do not expect that they will be reused for the purposes pursued by the company, i.e., the creation of facial recognition software (which combines the image of an individual with a profile containing all the photographs in which they appear, the information those photographs contain as well as the websites on which they are located) and the marketing of this software to law enforcement authorities.” In these words, the French SA expresses a concern about the large-scale exploitation of training data for illegitimate purposes such as surveillance, i.e., a concern about model usage fairness. Of note, a model usage fairness argument should not be too quickly defeated by a negative bias control argument.
Such concerns are echoed in OpenAI’s own GPT-4 system card:
“GPT-4 can generate potentially harmful content, such as advice on planning attacks or hate speech. It can represent various societal biases and worldviews that may not be representative of the users intent, or of widely shared values. It can also generate code that is compromised or vulnerable. The additional capabilities of GPT-4 also lead to new risk surfaces.”
One difference with the Clearview case, however, is that forms of model usage are not as clearly anticipated.
Although the retraining phase could aim to mitigate AI biases and to prevent the production of discriminatory outputs (due to misrepresentation) or to minimize misinformation, the differential treatment of enterprise customers and non-enterprise customers casts some doubt upon the legitimacy of OpenAI’s interests. What is more, such differential treatment is hard to justify from a data protection by design and by default perspective and calls into question the effective protection of personal data in zero-price digital markets. This also raises the question of model improvement fairness, i.e., the large-scale exploitation of end user prompts and feedback without fairly compensating those users.
In any case, justifying the retraining does not necessarily address pre-training fairness considerations such as extraction fairness, i.e., concerns related to the large-scale exploitation of training data without informing, obtaining prior authorisation from, or compensating its creators.
The ChatGPT saga may not have shown the full potential of data protection law yet. It raises the question whether training activities associated with LLM-as-a-service should be grounded upon a variety of legal bases, including both legitimate interest and consent, rather than on legitimate interest only. It is unclear which interests are relevant for the balancing exercise and how to balance them appropriately, although the strongest fairness argument seems to relate to model usage. It is also unclear how the effectiveness of data subject rights will be assessed and what impact the requirement of data protection by design and by default should have on the assessment. In passing, not much has been said on data minimisation.
New questions also arise about the data processing operations that take place when fine-tuning or customizing models for specific tasks through the API, using custom training data (for instance, is there a sole data controller?). A variety of business models is emerging, one of them based on an open-source approach that would allow customer-developers to train and deploy AI models on their own datasets while ensuring greater control over how these data are used.