1. Introduction
How do large language models (LLMs) such as OpenAI’s ChatGPT and Google’s Bard manage to seem intelligent and give human-like answers? The answer is data. LLMs are trained on massive datasets of text and code, which allows them to learn the patterns and rules of language. But this comes at a cost: LLMs are incredibly data-hungry, and some of this data might be your own sensitive, personal data.
In this blog post, we will delve into the technical aspects of LLMs’ insatiable appetite for data and explain why such extensive datasets are necessary for their successful training. Next, we will explore the disconcerting relationship between LLMs and data protection in the EU.
2. Hungry for Data
The effectiveness of LLMs lies in their ability to learn patterns and language structures from textual data. However, to achieve this, LLMs must undergo training. Multiple training mechanisms are available, each with its own purpose and approach, but they all involve exposing the LLM to massive datasets.
The first reason these massive datasets are needed is that, to function appropriately, LLMs rely on an enormous number of parameters: the weights and biases of the model’s neurons, whose values are adjusted during training to minimise prediction errors. Useful values for so many parameters can only be learned from massive datasets, or inherited by reusing parameters from an already-trained model. Either way, the model requires extensive exposure to varied linguistic nuances and contexts to fine-tune such a significant number of parameters effectively.
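To make the idea of ‘adjusting weights and biases to minimise prediction errors’ concrete, here is a minimal sketch in Python with invented toy data and a single artificial neuron; real LLMs apply the same principle across billions of parameters and vastly larger datasets.

```python
# Toy illustration: one "neuron" with a weight and a bias, trained by
# gradient descent to minimise its prediction error on a tiny dataset.
# The data and numbers below are invented purely for illustration.

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]  # targets follow y = 2x + 1

weight, bias = 0.0, 0.0      # the model's parameters, initially uninformed
learning_rate = 0.01

for step in range(2000):
    # Accumulate gradients of the mean squared prediction error.
    grad_w, grad_b = 0.0, 0.0
    for x, y in data:
        prediction = weight * x + bias
        error = prediction - y
        grad_w += 2 * error * x / len(data)
        grad_b += 2 * error / len(data)
    # Nudge the parameters in the direction that reduces the error.
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b

print(f"learned weight={weight:.2f}, bias={bias:.2f}")  # approaches 2 and 1
```

With only two parameters, a handful of examples is enough; with billions of parameters, the appetite for data grows accordingly.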
Second, LLMs need to understand linguistic context. Languages are full of ambiguity, nuance, and polysemy, where words carry multiple meanings depending on the context. For example, the word ‘bank’ can refer to a financial institution or the side of a river. LLMs must therefore be trained on diverse, real-world examples to learn how to disambiguate words and select the most appropriate meaning based on the surrounding text. A limited dataset would hinder this capability, resulting in less accurate text generation.
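Modern LLMs resolve such ambiguity through learned contextual representations rather than hand-written rules, but a deliberately simplified sketch (with an invented two-sense inventory for ‘bank’) illustrates what ‘selecting a meaning from the surrounding text’ amounts to.

```python
# Deliberately simplified word-sense disambiguation: pick the sense of
# "bank" whose typical vocabulary overlaps most with the surrounding words.
# LLMs learn far richer contextual representations, but the principle of
# leaning on context is the same.

SENSES = {
    "financial institution": {"money", "loan", "account", "deposit", "cash"},
    "river bank":            {"river", "water", "fishing", "shore", "grass"},
}

def disambiguate(sentence: str) -> str:
    context = set(sentence.lower().split())
    # Score each sense by how many of its typical words appear in the context.
    scores = {sense: len(words & context) for sense, words in SENSES.items()}
    return max(scores, key=scores.get)

print(disambiguate("She opened an account at the bank to deposit money"))
print(disambiguate("They sat fishing on the bank of the river"))
```

An LLM has no such hand-curated word lists; it must infer these associations from the statistics of its training data, which is precisely why broad, diverse corpora matter.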
Third, LLMs must be able to handle rare language phenomena. Languages are rich and diverse, encompassing numerous dialects, slang, various writing styles, and niche terminologies. For example, take the relatively new slang, ‘no cap.’ If a text says, ‘no cap,’ an LLM must grasp whether that means someone is not wearing a hat or speaking the truth. To do so, LLMs need to be exposed to a broad spectrum of linguistic idiosyncrasies to handle user queries effectively.
Finally, LLMs are expected to generalise to new tasks. Whilst LLMs are primarily trained on specific tasks, such as language translation, they should also be capable of generalising to uncharted domains. This includes composing poetry, generating diverse creative content, and providing informative responses to questions. For instance, with creative prompt engineering, we can make an LLM simulate a computer program/app with interactive menus and customised instructions. To achieve this, LLMs should be trained on a large and mixed dataset of text and code. Narrow training data may limit the model’s applicability to new functions.
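As a hedged illustration of this kind of prompt engineering, the snippet below builds an entirely invented prompt that asks a chat model to behave like a small menu-driven app; the exact API for sending it varies by provider and is deliberately left out.

```python
# A hypothetical prompt that asks a chat-based LLM to simulate a small
# menu-driven application. Only the prompt and message structure are shown;
# how it is sent to a model depends on the provider's API.

APP_PROMPT = """You are 'PocketLibrary', a text-based library app.
Always respond with a numbered menu:
  1. Search the catalogue
  2. List my borrowed books
  3. Return a book
Wait for the user to pick a number, then carry out that option in character.
Never break character or mention that you are a language model."""

messages = [
    {"role": "system", "content": APP_PROMPT},
    {"role": "user", "content": "2"},  # the user picks "List my borrowed books"
]
print(messages)  # this message list would then be sent to the model
```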
3. Concerns with Data Protection
So, the more extensive and diverse the dataset used during training, the better an LLM can generalise its language understanding and generate coherent text. However, this raises many questions about data protection, which we will touch upon in the following paragraphs.
First, what kind of data is collected and used to train LLMs, and how long is it stored? Under the General Data Protection Regulation (GDPR), data subjects have the right to know what data is collected and why. This means that organisations using LLMs must provide clear information about their data collection and retention policies, including the data on which the LLM is trained. The GDPR also grants data subjects the right to request the deletion of their data, known as the “right to be forgotten” under Article 17, so companies using LLMs must have processes in place to delete user data upon request. This raises a troubling technical question: how can specific user data be removed from a model that has already been trained on that data?
Furthermore, the GDPR mandates informed and explicit user consent for data processing. However, the consent process in LLM interactions can be unclear, as users may not always be fully aware of how their data is used. For example, are users aware that their interactions with ChatGPT are being stored and analysed? Are they informed about how this data contributes to the training and functioning of that LLM? A recent example comes from the healthcare sector: a Biomedical Engineering paper explains that LLMs are useful to ophthalmologists for taking notes during eye surgery. However, LLMs often mix surgery notes with patients’ medical history and symptoms, which constitute sensitive personal data. The processing of such data, even for health purposes, is subject to strict regulation under Article 9 of the GDPR.
In this connection, we can also think of risks related to cybersecurity. LLM services store vast amounts of data, making them attractive targets for cyberattacks and prone to the exposure of sensitive information. Recently, such an exposure occurred with ChatGPT. OpenAI announced that a bug in an open-source library it uses (the ‘Redis’ client library, which ChatGPT relies on to cache user interactions per session) had caused a data leak. The bug gave some ChatGPT users access to the titles of other users’ conversation history, including snippets of their chats, and also exposed payment-related information for some premium users.
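According to OpenAI’s disclosure, the fault lay in how the open-source Redis client library handled certain requests; the sketch below does not reproduce that bug, but uses a plain Python dictionary as a stand-in for a session cache (with invented names throughout) to illustrate why per-session caching is so sensitive to key mix-ups.

```python
# Simplified illustration of why session caches are sensitive: if responses
# are cached per session and the session key is ever reused or mixed up,
# one user can be served another user's data. This is NOT the actual
# ChatGPT/Redis bug, just a sketch of the general failure mode.

cache: dict[str, list[str]] = {}   # stands in for a Redis-style session cache

def store_history(session_id: str, titles: list[str]) -> None:
    cache[session_id] = titles

def fetch_history(session_id: str) -> list[str]:
    return cache.get(session_id, [])

store_history("session-alice", ["Tax questions", "Trip to Lisbon"])
store_history("session-bob", ["Draft cover letter"])

# Correct behaviour: each user only sees their own conversation titles.
print(fetch_history("session-bob"))       # ['Draft cover letter']

# A buggy lookup that resolves to the wrong session key would leak Alice's
# conversation titles to Bob: the kind of exposure described above.
wrong_key = "session-alice"               # e.g. a stale or mixed-up reference
print(fetch_history(wrong_key))           # ['Tax questions', 'Trip to Lisbon']
```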
Another point of concern is international data transfers. Data transfers, particularly those involving AI models like LLMs, often entail intricate legal considerations to ensure data protection. This is especially true when data is transferred outside the EU, as many organisations developing these models are based abroad. To comply with the GDPR, these organisations must adhere to its international data transfer rules, for instance by implementing standard contractual clauses or verifying that the recipient country offers a sufficient level of data protection. Although an ‘adequacy decision’ by the European Commission allows data to flow to certain third countries, procedural constraints remain for transfers to countries without one, such as ensuring that appropriate safeguards are in place. Given the borderless nature of LLMs, data protection becomes a complex issue when content generated by these models is transferred across jurisdictions.
Finally, concerns arise from the ability of LLMs to analyse user data and provide personalised responses, which creates a potential for profiling and automated decision-making. Ethical issues emerge when an LLM makes decisions or predictions about a user based on their data. The Cambridge Analytica case, although unrelated to LLMs, is an example of such profiling: Facebook users’ data was fed to machine-learning algorithms to predict their personal preferences, and those predictions were then used to deliver personalised advertisements. Such practices run up against Article 22 of the GDPR, which protects users against decisions based solely on automated processing, including profiling, and gives them the right to contest such decisions.
4. Final Remarks: High Hopes for the EU AI Act
The proposed EU AI Act is currently in the ‘trilogue’ stage of the EU legislative process, being negotiated between the Parliament, the Council, and the Commission. The Act sets out new rules and responsibilities for providers and users of AI systems. The level of legal involvement depends on the risk posed by an AI system, which can be classified as unacceptable, high, limited, or low/minimal. LLMs would generally fall into the ‘limited’ risk category since they interact with humans and are therefore subject to transparency obligations.
Regarding the Parliament’s position on the AI Act, several suggestions have been put forth for generative AI models. Firstly, service providers must disclose that content has been generated by AI rather than humans. Secondly, companies must ensure that models are trained in a way that minimises the production of illegal content. Lastly, information about the type of training used for LLMs should be made public.
Given the evolving nature of AI and its rapid technological advancement, the AI Act could address risk management processes for AI systems more specifically. The Act’s assessment and mitigation measures should adapt to the potential future effects of AI technology through a forward-looking approach. Under the GDPR, processing operations that are likely to result in a high risk to data subjects, as is often the case with AI technology, already require a data protection impact assessment, whose minimum content is set out in Article 35(7). Stakeholders play a crucial role in ensuring compliance with these regulations. In particular, controllers deploying AI technology may seek guidance from data protection officers when conducting such impact assessments (GDPR, Article 35(2)).
The AI Act is expected to become a pivotal regulation as it establishes harmonised rules governing the use of AI systems. As AI advances and integrates into our daily lives, concerns regarding data privacy related to LLMs are growing and cannot be overlooked. Compliance with regulations is essential.