Training large generative AI models based on publicly available personal data: a GDPR conundrum that the AI Act could solve


Large and widely used generative artificial intelligence (AI) models like open-source Stable Diffusion and OpenAI’s proprietary GPT-4 rely on the processing of information that is publicly available on the internet for their development, training, validation, and continuous improvement. Major concerns related to potential infringements of intellectual property rights or instances of defamation seem to arise mostly as a result of such systems’ outputs. But data protection issues may arise from the moment models are initially developed and trained, i.e., before their roll-out to customers or the public at large.

As a recent landmark enforcement order by the Italian data protection supervisory authority (Garante) against OpenAI showed, information used in such contexts contains personal data – including sensitive data – of persons in the European Union (EU), and thus its processing by the models’ developers falls under the General Data Protection Regulation (GDPR). Among the Garante’s criticisms, the first and fundamental one is that OpenAI did not secure a legal basis under Article 6 GDPR to process personal data for initially training the algorithms underlying the operation of the popular ChatGPT service. As OpenAI did not collect EU individuals’ consent to process their personal data for such a purpose, it is safe to conclude that the Garante did not consider that OpenAI’s legitimate interests for data collection outweighed EU data subjects’ rights and freedoms, and thus the company’s personal data processing was unlawful.

It is worth highlighting the similarities between this decision and rulings from the Garante and the French supervisory authority (CNIL) in cases involving ClearviewAI’s web scraping of facial images. Evidently, the nature of the personal data processed by ClearviewAI is different from that fed into Large Language Models (LLMs) such as GPT-4, and ClearviewAI’s goal of uniquely identifying individuals as a service for law enforcement bodies is unlike Stable Diffusion’s mission of enabling internet users to obtain ‘photo-realistic images’ from text inputs. But some of the arguments that regulators used in the ClearviewAI rulings could also apply to the data processing underlying those and other large AI models. Such arguments include the fact that personal data being publicly accessible does not grant the data controller a “general authorization to re-use and further process” it under the legitimate interests legal basis, given its “strong intrusiveness” and lack of foreseeability and transparency for data subjects. These actions, together with the narrow interpretation of ‘manifestly making data public’ adopted by the European Data Protection Board (EDPB), may significantly hinder AI providers’ ability to scrape data from the web to test their AI models against bias and discrimination.

That is not to say that Article 6(1)(f) GDPR and the controller’s legitimate interests may never justify the processing of personal data for training and developing large AI models. In fact, researcher Pablo Trigo Kramcsák argues that legitimate interests may match the purpose of building AI training datasets better than individuals’ consent, given the quality, quantity, and relevance demands of data curation for AI training. A seminal 2020 study by the European Parliamentary Research Service was also sympathetic to the idea of relying on that GDPR provision’s flexibility for AI training purposes, as long as strong security measures (such as pseudonymization) are implemented. Furthermore, albeit with a different lens, the Court of Justice of the European Union (CJEU) has admitted the use of legitimate interests by Google when indexing personal data at a large scale in the operation of its search engine.

All in all, developers of large generative AI models need further clarity about whether and to what extent they can rely on the legitimate interests lawful ground or other alternatives to consent when training their models. There are valid concerns related to the prevention of inaccuracies and biases of AI systems’ output data, which arguably may only be achieved by using sufficiently representative training and validation datasets. And whereas reliance on the Article 6(4) GDPR compatibility test seems to be an avenue for AI developers to repurpose previously collected personal data to attain said valuable goal, the same does not apply to developers who are building large AI models from scratch based on publicly available data.

The European legislator seems to have that need in mind. Article 10(5) of the recently proposed AI Act suggests that developers (‘providers’) of a closed list of ‘high-risk’ AI systems should be able to process so-called special categories of data covered by Article 9 GDPR (such as health and sexual orientation data) to the extent that they are strictly necessary for bias monitoring, detection, and correction. Such processing should be accompanied by appropriate measures to safeguard the fundamental rights of data subjects, such as technical limitations on data reuse and the deployment of privacy-preserving techniques.

Nonetheless, the approach in the initial Commission proposal would not increase foreseeability for AI developers and data subjects with regard to the applicable Article 6(1) GDPR lawful ground when processing personal data for the training of AI models. This is particularly so when the personal data at stake does not fall under one of the heavily protected and broadly interpreted categories under Article 9(1) GDPR, or when AI systems under development would not be considered ‘high-risk’ under the AI Act – notably, where they are general-purpose generative AI systems like ChatGPT.

As the European Parliament is currently finalizing its negotiating position (Report) on the AI Act, lawmakers from both the Parliament and the EU Council sitting at trilogues should consider laying down clear rules and boundaries for the collection of personal data from publicly available sources for training, validation, and testing of (at least) high-risk AI systems and general-purpose AI systems. As we have seen, obtaining consent from data subjects for those purposes is often unfeasible, and the legitimate interests lawful ground often plays a limited role when sensitive data or opaque data scraping from the web are at stake. A well-balanced legal obligation or public interest task vested in AI developers to collect personal data for the training of trustworthy AI systems through the AI Act, in line with the requirements of Article 6(3) GDPR, could do the trick.

Even if this blog focuses on the processing of personal data by AI providers at the development, initial training and validation stages, it is important to note that such providers may be tempted to – and often do – collect personal data in a real-world setting, i.e., after the deployment of the AI system, to ensure its continuous improvement and fine-tuning. A recent decision from the EDPB in a case related to WhatsApp highlights the limitations that specific Article 6 GDPR lawful grounds may have when legitimizing personal data processing in that context. Therefore, the EU legislator has a similar opportunity to clarify personal data collection at AI systems’ post-marketing stage by detailing Article 61 AI Act and extending its scope to the users of such AI systems. Public and private bodies who integrate large AI models into their own products – as Microsoft has done with GPT-4 – are better positioned to assess and address the real-world impacts of such models; extending Article 61 to them would empower them to do just that, under a clear legal mandate with strong data protection safeguards.

The Garante’s order against OpenAI is only one among multiple enforcement actions European data protection supervisory authorities have initiated or concluded in recent years against controllers who develop or deploy AI systems or automated tools for the processing of personal data. As data watchdogs in countries like France and the Netherlands develop their AI audit capabilities, until (and also after) the AI Act becomes effective they will remain prominent players in ensuring that one of the most important parts of the EU’s rulebook for AI – the GDPR – is respected across the continent. Therefore, as we enter the age of generative AI at full speed, both EU regulators and the EU legislator should strive to provide developers of large AI systems, and individuals whose data is used to train such systems, with much needed legal certainty and protections.

Sebastião Barros Vale
EU Policy Fellow at The Future of Privacy Forum (FPF)