"people usually want to keep control over their data"
A new artificial intelligence model trained on millions of NHS records could help doctors forecast illnesses and hospitalisation rates.
Foresight was created using anonymised health data from 57 million people.
But experts warn that the sheer scale and richness of this data mean there are serious concerns around privacy and the potential for patients to be re-identified.
The model was first developed in 2023 using OpenAI’s GPT-3 and real data from 1.5 million patients in two London hospitals.
Its latest version, built by researchers at University College London (UCL), is powered by Meta’s Llama 2 and trained on 10 billion health events recorded by the NHS in England between 2018 and 2023.
Chris Tomlinson, who leads the project at UCL, said the model could help with disease prediction and prevention:
“The real potential of Foresight is to predict disease complications before they happen, giving us a valuable window to intervene early, and enabling a shift towards more preventative healthcare at scale.”
Despite the promise, the team has not yet released data on how well the model performs. Foresight remains in testing.
Michael Chapman, of NHS Digital, who oversees the data used to train the model, said:
“The data that goes into the model is de-identified, so the direct identifiers are removed.”
But he admitted: “It’s then very hard with rich health data to give 100 per cent certainty that somebody couldn’t be spotted in that dataset.”
Luc Rocher at the University of Oxford said: “Building powerful generative AI models that protect patient privacy is an open, unsolved scientific problem.
“The very richness of data that makes it valuable for AI also makes it incredibly hard to anonymise. These models should remain under strict NHS control where they can be safely used.”
To reduce risks, the model runs in a secure NHS data environment. Only approved researchers can access it.
Amazon Web Services and Databricks supplied the infrastructure, but have no access to the data, according to Tomlinson.
However, the researchers have not tested whether the AI has memorised sensitive details from its training data.
When asked whether such testing had taken place, Tomlinson said the team was "looking at doing so in the future".
The use of such a vast dataset without full public engagement raises ethical concerns, says Caroline Green at the University of Oxford.
She added: “Even if it is being anonymised, it’s something that people feel very strongly about from an ethical point of view, because people usually want to keep control over their data and they want to know where it’s going.”
There is little opportunity for people to opt out.
The NHS says people who have opted out of sharing data from their GP practice are not included. But other opt-out mechanisms do not apply to Foresight, because the data is "de-identified".
An NHS England spokesperson said: “As the data used to train the model is anonymised, it is not using personal data and GDPR would therefore not apply.”
But legal experts point out that “de-identified” data is not the same as fully anonymous data.
The UK’s data regulator says the term lacks a clear legal definition and can cause confusion.
Complicating matters further, Foresight is currently restricted to research related to Covid-19, which means it operates under emergency data-sharing rules introduced during the pandemic.
Sam Smith, of medConfidential, said:
“This Covid-only AI almost certainly has patient data embedded in it, which cannot be let out of the lab.”
“Patients should have control over how their data is used.”
As legal and ethical questions mount, some researchers warn that public trust could be damaged if transparency is not prioritised.
Green said: “There is a bit of a problem when it comes to AI development, where the ethics and people are a second thought, rather than the starting point.
“But what we need is the humans and the ethics need to be the starting point, and then comes the technology.”