Two artificial intelligence models underperform on examinations in a veterinary curriculum

Michelle C. Coleman, DVM, PhD, DACVIM (mccole@uga.edu) and James N. Moore, DVM, PhD

DOI: https://doi.org/10.2460/javma.23.12.0666

Volume/Issue: Volume 262: Issue 5

Received: 05 Dec 2023
Accepted: 08 Jan 2024
Online Publication Date: 21 Feb 2024

Open access

Abstract

OBJECTIVE

Advancements in artificial intelligence (AI) and large language models have rapidly generated new possibilities for education and knowledge dissemination in various domains. Currently, our understanding of what these models, such as ChatGPT, know about the medical and veterinary sciences is at a nascent stage. Educators urgently need to better understand these models to unleash student potential, promote responsible use, and align AI models with educational goals and learning objectives. The objectives of this study were to evaluate the knowledge level and consistency of responses of 2 platforms of ChatGPT, namely GPT-3.5 and GPT-4.0.

SAMPLE

A total of 495 multiple-choice and true/false questions from 15 courses used in the assessment of third-year veterinary students at a single veterinary institution were included in this study.

METHODS

The questions were manually entered 3 times into each platform, and answers were recorded. These answers were then compared against those provided by the faculty members coordinating the courses.
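The scoring procedure described above can be illustrated with a minimal Python sketch. It is not the authors' analysis: the abstract does not state how the 3 entries per question were aggregated, so averaging accuracy across entries and counting a question as "consistent" only when all entries gave the same answer are assumptions made here. The function name `score_runs` and the toy questions and answers are hypothetical and do not come from the study.

```python
def score_runs(runs, answer_key):
    """Score repeated entries of the same question set against a faculty key.

    runs: list of dicts mapping question id -> recorded answer (one dict per entry).
    answer_key: dict mapping question id -> faculty-provided answer.
    Returns (mean accuracy across entries, fraction of questions answered
    identically in every entry). Both aggregation rules are assumptions.
    """
    n_questions = len(answer_key)
    # Accuracy of each entry: share of questions matching the faculty key.
    per_run_accuracy = [
        sum(run[q] == answer_key[q] for q in answer_key) / n_questions
        for run in runs
    ]
    mean_accuracy = sum(per_run_accuracy) / len(runs)
    # A question is consistent if all entries produced the same answer,
    # whether or not that answer was correct.
    n_consistent = sum(len({run[q] for run in runs}) == 1 for q in answer_key)
    return mean_accuracy, n_consistent / n_questions


# Hypothetical toy data: 4 questions, each entered 3 times.
answer_key = {"q1": "A", "q2": "True", "q3": "C", "q4": "B"}
runs = [
    {"q1": "A", "q2": "True", "q3": "C", "q4": "D"},
    {"q1": "A", "q2": "False", "q3": "C", "q4": "D"},
    {"q1": "A", "q2": "True", "q3": "C", "q4": "D"},
]
mean_accuracy, consistency = score_runs(runs, answer_key)
print(f"mean accuracy: {mean_accuracy:.2f}, consistency: {consistency:.2f}")
```

Keeping accuracy and consistency separate reflects that a platform can be consistently wrong (as with `q4` in the toy data), which is a distinct concern from how often it is correct.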

RESULTS

GPT-3.5 achieved an overall performance score of 55%, whereas GPT-4.0 had a significantly (P < .05) greater performance score of 77%. Importantly, the performance scores of both platforms were significantly (P < .05) below that of the veterinary students (86%).

CLINICAL RELEVANCE

Findings of this study suggested that veterinary educators and veterinary students retrieving information from these AI-based platforms should do so with caution.