Background & Objectives
The potential applications of artificial intelligence (AI), including the intersection of machine learning, deep learning and natural language processing, form a rapidly growing field in medicine and healthcare. The impact of AI, specifically large language models (LLMs) and generative AI, on patient care is immeasurable at this time, but may be vast. ChatGPT-4, an LLM-based AI solution, is capable of text generation, language translation, text summarization, question answering, chatbot interaction and automated content generation. Patient education from rheumatologists plays a significant role in the self-management of rheumatic diseases. But when patients have questions, can AI generate accurate, comprehensive answers?
Ye et al. conducted a single-center, cross-sectional survey of rheumatology patients and physicians in Edmonton, Canada, to explore that question. The researchers assessed the quality of AI responses to patient-generated rheumatology questions by having participants rate a series of responses, some of which were generated by an LLM chatbot and some of which were written by physicians.
Methods
Patient questions and physician-generated answers were extracted from the Alberta Rheumatology website, which houses resources for rheumatology patients. In this study, Ye et al. tried to match the length of the AI- and physician-generated responses. Participants completed a one-time questionnaire evaluating typed responses to these real rheumatology patient questions, rating each response's comprehensiveness and readability. Physician participants also evaluated the accuracy of the responses, using a scale of 1–10 (1 being poor, 10 being excellent).
To minimize potential bias from pre-existing attitudes toward AI and chatbots, participants (patients and physicians) were blinded not only to the source of each answer but also, initially, to the study objective, and recruitment materials did not mention the use of AI. Only after evaluating each set of questions and answers were participants told that one answer was generated by AI.
Results
Patients' ratings showed no significant difference between AI- and physician-generated responses in comprehensiveness or readability. However, rheumatologists rated AI responses significantly lower than physician responses on comprehensiveness, readability and accuracy. After learning that one answer for each question was AI-generated, physicians correctly identified the AI-generated answers at a higher rate than patients did.
Conclusion
Rheumatology patients rated AI-generated responses to patient questions similarly to physician-generated responses in terms of comprehensiveness, readability and overall preference. However, rheumatologists rated the AI responses significantly lower than the physician responses, suggesting that LLM chatbot responses are of lower overall quality than physician responses, a difference of which patients may not be aware.