

Benchmarking GPT-5 Performance and Repeatability on the Japanese National Examination for Radiological Technologists over the Past Decade (2016–2025)

https://repo.qst.go.jp/records/2001857
Item type: Journal Article
Release date: 2025-12-23
Title: Benchmarking GPT-5 Performance and Repeatability on the Japanese National Examination for Radiological Technologists over the Past Decade (2016–2025) (en)
Language: eng
Resource type: journal article (http://purl.org/coar/resource_type/c_6501)
Authors: Umehara Kensuke, Ota Junko, Tatsuya Nishii, Kishimoto Riwa, Takayuki Ishida
Abstract

Purpose: To evaluate GPT-5 against GPT-4o on the Japanese National Examination for Radiological Technologists (2016–2025), assessing accuracy, repeatability, and factors influencing performance differences.

Materials and methods: We analyzed 1992 multiple-choice questions involving text and images, spanning the medical and engineering domains. Both models answered all questions in Japanese under identical conditions across three independent runs. Majority-vote accuracy (correct if ≥ 2 of 3 runs were correct) and first-attempt accuracy were compared using McNemar's test. Repeatability was quantified with Fleiss' κ. Univariable and multivariable analyses were conducted to identify question-level factors associated with GPT-5 improvements.

Results: Across all 10 examination years, GPT-5 achieved a majority-vote accuracy of 92.8 % (95 % CI: 91.5–93.8), consistently outperforming GPT-4o at 72.4 % (95 % CI: 70.4–74.4; P < .001). Repeatability was higher for GPT-5 (κ = 0.925, 95 % CI: 0.915–0.935) than for GPT-4o (κ = 0.904, 95 % CI: 0.894–0.914), with correct answers in all three runs for 88.2 % vs. 68.9 % of items. GPT-5 performed better than GPT-4o in text-based (96.5 % vs. 78.1 %) and image-based questions (72.6 % vs. 41.9 %). Significant improvements were observed for MRI, CT, and radiography images; however, performance improvements were smaller for clinically oriented ultrasound and nuclear medicine images. The greatest advantages were observed in calculation questions (97.3 % vs. 39.3 %) and engineering-related domains, consistent with external benchmarks highlighting GPT-5's improved reasoning.

Conclusion: GPT-5 demonstrated significantly higher accuracy and repeatability than GPT-4o over a decade of examinations, with improvements in quantitative reasoning, engineering content, and diagram interpretation. Although improvements extended to medical images, performance in clinical image interpretation remains limited.
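The evaluation protocol described in the abstract (a question counts as correct if at least 2 of 3 runs are correct, and paired model accuracies are compared with McNemar's test) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names and toy data are invented, and the exact binomial form of McNemar's test is one common choice, not necessarily the variant used in the paper.

```python
# Sketch of majority-vote scoring and an exact two-sided McNemar test.
from math import comb

def majority_vote_correct(runs):
    """A question counts as correct if >= 2 of 3 runs were correct."""
    return sum(runs) >= 2

def majority_vote_accuracy(all_runs):
    """all_runs: per-question lists of three booleans (True = run correct)."""
    return sum(majority_vote_correct(r) for r in all_runs) / len(all_runs)

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar test on the discordant pairs:
    b = questions only model A answered correctly,
    c = questions only model B answered correctly."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    p_one_sided = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_one_sided)

# Toy data: three runs per question for one model (True = correct answer).
model_a_runs = [[True, True, False], [True, True, True], [False, False, True]]
print(majority_vote_accuracy(model_a_runs))  # 2 of 3 questions pass by majority
```

In a real analysis, `b` and `c` would be counted from the per-question majority-vote outcomes of the two models on the same 1992 questions; the fabricated three-question list above only demonstrates the scoring rule.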
Bibliographic information: European Journal of Radiology Artificial Intelligence, Vol. 5, p. 100064, issued 2025-12
Publisher: Elsevier
ISSN: 3050-5771
DOI: 10.1016/j.ejrai.2025.100064
Powered by WEKO3