Researchers who found that GPT-4, the latest iteration of OpenAI’s large language model (LLM), is capable of generating false but convincing datasets have described the results as alarming.

In a paper published on 9 November in JAMA Ophthalmology, the researchers reported that, when prompted to produce data supporting a particular conclusion, the AI can take a set of parameters and generate semi-random datasets that fulfil the end goal.

Dr. Andrea Taloni, co-author of the paper alongside Prof. Vincenzo Scorcia and Dr. Giuseppe Giannaccare, told Medical Device Network that the starting point for the paper was text-based plagiarism.

“We saw many authors describing attempts to create entire manuscripts based just on generative AI,” Taloni said. “The result was not always perfect, but it was really impressive. Our AI could generate a vast amount of text [and] medical knowledge synthesized inside the timeframe of a few minutes. So we thought, why not create a data set from scratch with fake assumptions and data?

“The result was quite surprising to us and, well, scary.”

The paper showcased attempts to make GPT-4 produce data that supported an unscientific conclusion – in this case, that penetrating keratoplasty had worse patient outcomes than deep anterior lamellar keratoplasty for sufferers of keratoconus, a condition that causes the cornea to thin, which can impair vision. Once the desired values were given, the LLM dutifully compiled a dataset that to an untrained eye would appear perfectly plausible.

Taloni explained that, while the data would fall apart under statistical scrutiny, it didn’t even push the limits of what ChatGPT can do. “We made a simple prompt […] The reality is that if someone was to create a fake data set, it is unlikely that they would use just one prompt. [If] they find an issue with the data set, they could fix it with consecutive prompts and that is a real problem.

“There is this sort of tug of war between those who will inevitably try to generate fake data and all of our defensive mechanisms, including statistical tests and possibly software trained by AI.”

The issue will only worsen as the technology becomes more widely adopted. A recent GlobalData survey found that while only 16.1% of respondents from its Hospital Management industry website reported that they were actively using the technology, a further 26.8% said that they either had plans to use it or were exploring its potential use.

Nature worked with two researchers, Jack Wilkinson and Zewen Lu, to examine the dataset using techniques that would commonly be used to screen for authenticity. They found several errors, including mismatches between the names and sexes of ‘patients’ and the lack of a link between pre- and post-operative vision capacity.
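
The two kinds of check described above can be illustrated with a minimal sketch. It is not the procedure Wilkinson and Lu used, which the article does not detail. The records, the field names (`first_name`, `sex`, `pre_op`, `post_op`) and the small name lookup are all invented for illustration; a real screen would need a much larger reference list and proper statistical testing.

```python
from math import sqrt

# Hypothetical lookup of first names to the sex they usually indicate.
# Assumption for illustration; a real screen needs a far larger reference list.
TYPICAL_SEX = {"maria": "F", "anna": "F", "luca": "M", "marco": "M"}


def name_sex_mismatches(records):
    """Return records whose recorded sex disagrees with the name lookup."""
    flagged = []
    for rec in records:
        expected = TYPICAL_SEX.get(rec["first_name"].lower())
        # Names missing from the lookup are skipped rather than flagged.
        if expected is not None and expected != rec["sex"]:
            flagged.append(rec)
    return flagged


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented records, not taken from the paper. The pre_op and post_op values
# stand in for any paired vision measurement.
records = [
    {"first_name": "Maria", "sex": "F", "pre_op": 0.9, "post_op": 0.3},
    {"first_name": "Luca", "sex": "F", "pre_op": 0.4, "post_op": 0.5},
    {"first_name": "Anna", "sex": "F", "pre_op": 0.7, "post_op": 0.2},
    {"first_name": "Marco", "sex": "M", "pre_op": 0.5, "post_op": 0.6},
]

mismatches = name_sex_mismatches(records)
r = pearson(
    [rec["pre_op"] for rec in records],
    [rec["post_op"] for rec in records],
)

print("Name/sex mismatches:", [rec["first_name"] for rec in mismatches])
print("Pre/post-operative correlation:", round(r, 3))
```

In genuine clinical data, a patient's vision before and after surgery would normally be expected to show some relationship, so a correlation near zero or with an implausible sign is a reason to look more closely rather than proof of fabrication.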

In light of this, Wilkinson, senior lecturer in Biostatistics at the University of Manchester, explained in an interview with Medical Device Network that he was less concerned by AI’s potential to increase fraud.

“I started asking people to generate datasets using GPT and having a look at them to see if they could pass my checks,” he said. “So far, every one I’ve looked at has been pretty poor. To be honest [they] would fall down under even modest scrutiny.” 

He acknowledged fears like those raised by Dr. Taloni about future improvements in AI-generated datasets but ultimately noted that most data fraud is currently done by “low-skill fabricators,” and that “if those people don’t have that knowledge, they don’t know how to prompt ChatGPT to have it either.”

The problem for Wilkinson is how widespread falsification already is, even without generative AI. 

Data fraud 

Data fraud and other forms of scientific falsification are worryingly common. The watchdog Retraction Watch estimates that at least 100,000 scientific papers should be retracted each year and that around four out of five of those are due to fraud. There have been some particularly high-profile cases this year, including one that led to the resignation of Stanford’s president over accusations of data manipulation in papers with which he had been involved.

When asked how prevalent data fraud currently is in the clinical trials space – in which Wilkinson is primarily focused – he told Medical Device Network that it is very hard to know.

“One estimate we’ve got was from some work by a guy called John Carlisle,” Wilkinson explained. “He did an exercise where he requested the datasets for all of the clinical trials that were submitted to the journal where he’s an editor and performed forensic analysis of those datasets.

“When he was able to access inline data, he estimated that around one in four were in his words critically flawed by false data, right? We all use euphemisms. So that’s one estimate. The problem is that most journals don’t perform that kind of forensic investigation, so it’s unclear how many just slip through the net and get published.”

Wilkinson also cautioned against focusing too heavily on prevalence.

“There probably wouldn’t need to be too many for them to have quite a big effect,” he said. “So the big concern we have for clinical trials is in systematic reviews. Any of the problematic trials we do have will get hoovered up and put in the systematic review.

“There are a couple of problems with this. The first one is that systematic reviews consider the methodological quality of the studies, but not the authenticity. Many fake studies describe perfectly good methods, so they’re not picked up on by this check. 

“The other is that systematic reviews are really influential. They’re considered to be very high standard of evidence, they influence clinical guidelines, they’re used by clinicians and patients to decide what treatments to use. Even if the prevalence doesn’t turn out to be that high, although anecdotally there do appear to be hundreds of fake trials, systematic reviews are acting like a pipeline for this fake data to influence patient care.”