Digital health researchers have found significant variance in coverage, accuracy and safety among the eight most popular online symptom assessment apps, raising questions about how ready some of these technologies are for use in clinical settings.
The study, published in BMJ Open, compared the results of Ada Health, Babylon, Buoy, K Health, Mediktor, Symptomate, WebMD and Your.MD.
Ada Health chief medical officer Dr Claire Novorol said: “Symptom assessment apps have seen rapid uptake by users in recent years as they are easy to use, convenient and can provide invaluable guidance and peace of mind. When used in a clinical setting to support – rather than replace – doctors, they also have huge potential to reduce the burden on strained healthcare systems and improve outcomes.”
Researchers assessed each app using 200 clinical vignettes – fictional patient cases based partly on real-world examples – and used a panel of human general practitioners (GPs) to benchmark performance.
The vignettes were generated using transcripts from the UK’s NHS 111 non-emergency telephone service and from cases the research team had seen in their own practice. They were then reviewed by an external panel of primary care practitioners to ensure quality and to establish the gold-standard diagnosis and urgency level for each case.
The vignettes were then entered into each app by eight external GPs playing the role of patient, with each app tested once against every vignette. Seven further external GPs were benchmarked on the same vignettes, giving preliminary diagnoses after telephone consultations.
The study found that only a handful of apps came close to the performance of human GPs.
The study looked at how comprehensively the apps covered all possible conditions and user types – a tool with poor coverage may exclude users who are too young, too old or pregnant, for example. Human GPs provided 100% coverage.
The most comprehensive app was Ada, which provided a condition suggestion 99% of the time, followed by WebMD at 93% and Buoy at 88.5%. The lowest scorers were Babylon, which was able to provide a condition suggestion only 51.5% of the time, followed by Symptomate at 61.5% and Your.MD at 64.5%.
The accuracy of each app was also tested by comparing its suggested conditions against the gold-standard diagnosis agreed by the panel of doctors. Performance on this metric was highly variable, with every app falling short of the GPs’ 82.1% accuracy.
Ada was rated the most accurate, listing the right condition among its top three suggestions 71% of the time; the other apps fell far below this. The next most accurate was Buoy at 43%, with Symptomate the lowest scorer at only 27.5%.
The study also assessed the safety of the apps in question by examining whether the advice they gave to users had the appropriate level of urgency. Most apps gave safe advice, all scoring above 80%, but only Ada, Babylon and Symptomate came close to the 97% safety rating of the human GPs.
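The three metrics reported above – coverage, top-3 accuracy and triage safety – can be illustrated with a short sketch. This is not the study’s actual analysis code; the data structure, field names and the safety rule (advice at least as urgent as the gold standard counts as safe) are simplifying assumptions made here for illustration.

```python
# Illustrative sketch only: computing coverage, top-3 accuracy and triage
# safety from hypothetical per-vignette results for one app.

def coverage(results):
    """Share of vignettes for which the app offered any condition suggestion."""
    return sum(1 for r in results if r["suggestions"]) / len(results)

def top3_accuracy(results):
    """Share of vignettes whose gold-standard condition appears in the app's
    top three suggestions (vignettes with no suggestion count as misses)."""
    return sum(1 for r in results
               if r["gold_condition"] in r["suggestions"][:3]) / len(results)

def triage_safety(results):
    """Share of vignettes where the app's urgency advice was at least as
    cautious as the gold standard (higher number = more urgent)."""
    return sum(1 for r in results
               if r["advised_urgency"] >= r["gold_urgency"]) / len(results)

# Hypothetical results for three vignettes:
results = [
    {"suggestions": ["migraine", "tension headache", "sinusitis"],
     "gold_condition": "migraine", "advised_urgency": 2, "gold_urgency": 2},
    {"suggestions": [],  # app declined to suggest a condition
     "gold_condition": "appendicitis", "advised_urgency": 3, "gold_urgency": 3},
    {"suggestions": ["common cold", "flu"],
     "gold_condition": "flu", "advised_urgency": 1, "gold_urgency": 2},
]

print(coverage(results))       # 2 of 3 vignettes got a suggestion
print(top3_accuracy(results))  # 2 of 3 had the gold condition in the top 3
print(triage_safety(results))  # 2 of 3 got advice at least as urgent as needed
```

Note the asymmetry the safety metric captures: over-cautious advice (sending a cold sufferer to a doctor) counts as safe but wastes resources, while under-triage (the third vignette here) is the failure mode the study treats as unsafe.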
Brown University associate professor of medical science Dr Hamish Fraser said: “These results should help to determine which apps are ready for clinical testing in observational studies and then randomised controlled trials. The study design could form a model for future evaluations of symptom checker apps, and as part of assessment for regulatory approval.”