Study calls for improved methods of testing AI symptom checkers

The current method of benchmarking online symptom checkers is misleading, according to new research.

The study, led by Dr Austen El-Osta from the Self-Care Academic Research Unit (SCARU), used clinical vignettes to benchmark an online symptom checker against the opinion of medical experts. It found that this method led to significant variability in responses.

Online symptom checkers (OSCs) are tools that ask users to input data about their own symptoms along with information such as gender and age. By using artificial intelligence (AI) or algorithms, an OSC will suggest a range of conditions that fit the information provided. OSCs may also provide triage information, for example whether the individual should self-care or seek professional medical attention.

The role of clinical vignettes

As part of the study, more than 130 case vignettes covering 18 medical areas were created by the Royal College of General Practitioners (RCGP). The RCGP also provided a ‘diagnostic solution’ and an appropriate triage recommendation for each vignette.

Vignettes are short medical stories outlining a patient’s symptoms and other relevant information, and are traditionally used to test medical students. They have also been the principal way AI symptom checkers have been tested for accuracy and safety since their creation 7 years ago.

The research team initially gave the RCGP vignettes to an independent panel of experienced Imperial College clinicians, who read them and gave their opinion on the correct prioritisation for the patient. The three triage options were: (1) the patient should care for themselves, (2) the patient should see a GP, or (3) the patient should go to A&E. The clinicians were also asked to suggest the three most likely conditions for the patient, given their story.

The study found that while the Imperial doctors agreed most of the time on the self-care conditions, there was some variability on whether the person described in the other vignettes should see a GP or go to hospital. Overall, they agreed with the RCGP clinicians in about three-quarters of cases and disagreed in roughly 1 in 4 (26%).

When it came to naming the most probable condition, Imperial’s independent panel of doctors agreed with the RCGP’s doctors 72% of the time.

Benchmarking online symptom checkers

The team asked a mix of medical academics and laypeople with no experience to act as the patients described in the vignettes and submit their responses to the online symptom checker ‘Healthily’.

Healthily’s results matched the RCGP recommendations 62% of the time and suggested the correct condition in 61% of the cases overall.

The team concluded that Healthily was generally working at a safe level of probable risk. It recommended ‘unsafe’ triage in 28.6% of cases and ‘very unsafe’ triage 3.7% of the time (for example, telling someone they could self-care when they should go to hospital immediately).

Dr El-Osta explained: “Artificial intelligence-enabled OSCs are being developed to be safer and more accurate. In a real-life situation, a doctor or an online symptom checker can ask a real patient relevant questions and the patient will be able to answer truthfully.

“However, it is not possible to have the same situation when using a vignette because the patient case described is necessarily limited in the number of words, and may not have the answer to a question a symptom checker could ask. This means that inputters using vignettes only often have no choice but to make some assumptions and put in responses that legitimately change the range of possible triage and condition options. In this case, the OSC may not come out with an answer that matches the vignette or GP opinion, but it may be appropriate for what was inputted.”

Real-life patients

Imperial SCARU concluded that online symptom checkers could only be truly verified if their performance was cross-checked against scenarios using real patients and interactions with GPs “as opposed to using artificial case vignettes.”

Dr El-Osta said: “This piece of research started as an accuracy report and became something more far-reaching. We need to rethink the standard of testing for AI symptom checkers in light of this study. Research in this space is important because the routine use of safe and accurate online symptom checkers has the potential to ‘democratize self-care’ for all, and empower individuals to seek the right level of support when needed.

“The current use of vignettes isn’t serving the industry or the consumer. We are keen to continue this work to find an appropriate gold standard method of testing that can take account of all the variabilities we uncovered in this study.”

The full study is available on BMJ Open: What is the suitability of clinical vignettes in benchmarking the performance of online symptom checkers? An audit study