AI models may retain and expose sensitive personal data despite industry safeguards, study finds

by Ruth Ntumba

Imperial researchers find that standard industry safeguards fail to prevent AI models from retaining sensitive information.

New research from Imperial College London suggests that common data-processing safeguards used in training AI models do not reliably prevent the memorisation of sensitive information, even when exact duplicates have been removed from training data. 

The AI industry has long acknowledged that language models can memorise fragments of their training data. Previous investigations have shown how some models can reproduce copyrighted material almost verbatim, prompting the widespread adoption of a technique known as deduplication. This approach removes repeated sequences from training data on the assumption that memorisation primarily results from repeated, identical examples. 
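
To make the idea of deduplication concrete, here is a minimal, hypothetical Python sketch of an exact-match filter over fixed-length word windows. It illustrates the general technique described above rather than the pipeline of any particular lab: real systems operate at terabyte scale with specialised data structures, and often excise only the repeated span rather than dropping whole documents.

```python
# Minimal sketch of exact-match deduplication (illustrative only).
# A document is treated as a duplicate if any fixed-length window of its
# words has already been seen verbatim in an earlier document.

import hashlib

def exact_dedup(documents, window=50):
    """Drop any document containing a word window already seen verbatim."""
    seen = set()
    kept = []
    for doc in documents:
        words = doc.split()
        # Hash every window of `window` consecutive words in the document.
        hashes = {
            hashlib.sha1(" ".join(words[i:i + window]).encode()).hexdigest()
            for i in range(max(1, len(words) - window + 1))
        }
        if hashes & seen:   # an exact repeat was found
            continue        # treat the whole document as a duplicate
        seen |= hashes
        kept.append(doc)
    return kept
```

Because even a single changed word produces entirely different hashes, a lightly reworded passage sails straight through a filter like this – which is precisely the gap the Imperial team's findings expose.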

However, researchers from Imperial, led by Professor Yves-Alexandre de Montjoye from the Department of Computing, found that this understanding of memorisation is incomplete. Their research shows that models do not need to encounter identical copies of information in order to retain it. Instead, they can piece together and memorise fragments from similar but non-identical sequences — a phenomenon the researchers call “mosaic memory.” 

"The assumption has been that removing exact copies from training data prevents memorisation," says Matthieu Meeus, co-lead author of the study and a PhD student in the Department of Computing. "Our research shows that models can assemble information from fragments that look different but share a common core." 

The researchers found that text which has been partially modified or reworded can contribute nearly as much to memorisation as exact copies. When 10 per cent of a sequence is modified, these fuzzy duplicates retain 60–80 per cent of the memorisation impact of an exact copy. Even sequences with half their content changed still contribute meaningfully. 


The findings are published on 29 January 2026 in the peer-reviewed journal Nature Communications. 

In practical terms, this matters when multiple people interact with an AI about related topics. Consider a team working on a confidential project: one person asks the chatbot to improve an email summary, another to rewrite a few slides, a third to help draft talking points. Each prompt is different. Each would pass current deduplication filters yet could be memorised by the model. 

The same logic applies to individuals who return to an AI assistant repeatedly – updating tax documents month by month, refining the same project plan, or asking variations of the same question over time. In real training datasets, the researchers found that sequences with 1,000 exact duplicates also had over 20,000 fuzzy duplicates that escape standard detection.  

"Current deduplication techniques were designed for a simpler understanding of how memorisation works," says, Igor Shilov, co-lead author of the study, also a PhD student in the Department of Computing. "They weren't built for this level of system complexity." 

These findings come at a time when generative AI tools are becoming deeply embedded in workplace practice. A recent McKinsey survey found that 71 per cent of organisations now regularly use generative AI in at least one business function. Employees are drafting emails, polishing presentations, and brainstorming strategies with tools like ChatGPT and Claude – often without explicit guidance from their employers about what information is safe to share, a pattern of unsanctioned use known as “Shadow AI.” 

The findings have implications for how organisations think about AI adoption. On-premise approaches based on open-source models, or enterprise contracts that promise data won't be used for training, sidestep the issue entirely. For consumer-facing tools, however, the research suggests that current safeguards may not be sufficient to prevent sensitive information from being retained across similar-but-distinct interactions. 

"Employees will use AI to boost their productivity whether a formal policy exists or not" says Professor de Montjoye. "CEOs need to have a comprehensive AI strategy that empowers employees to harness massive productivity gains without putting a company’s 'crown jewels'—its proprietary data–at risk." 

The researchers argue that more sophisticated approaches to data preprocessing – ones that account for similarity rather than just exact matches – may be needed. But those come with trade-offs: more aggressive filtering could remove valuable training data and affect model performance. 
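
For illustration, the following hypothetical Python sketch shows one simple similarity-aware alternative: comparing documents by the Jaccard overlap of their word 3-grams and filtering anything too close to text already kept. It is not the method proposed in the paper, and the example sentences are invented; production systems would typically approximate the same idea with MinHash or locality-sensitive hashing to scale. The threshold makes the trade-off explicit: lower values catch more fuzzy duplicates but discard more legitimate training text.

```python
# Minimal sketch of similarity-aware (near-duplicate) filtering using the
# Jaccard overlap of word 3-grams. Illustrative only; not the paper's method.

def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_dedup(documents, threshold=0.5):
    """Keep a document only if it is not too similar to anything kept so far."""
    kept, kept_shingles = [], []
    for doc in documents:
        s = shingles(doc)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # near-duplicate: filtered out, at the cost of losing data
        kept.append(doc)
        kept_shingles.append(s)
    return kept

# Two invented rewordings of the same confidential note: both would pass an
# exact-match filter, but the second is caught here as a fuzzy duplicate.
original = "The Q3 merger plan targets Acme Corp at twelve dollars per share"
reworded = "The Q3 merger plan targets Acme Corp at twelve dollars for each share"
print(near_dedup([original, reworded]))  # only the first sentence survives
```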

For now, the practical advice for users is straightforward: treat AI assistants like any other external service and think carefully before sharing information you wouldn't want retained by a model that talks to the world. 

