Large Language Models Encode Clinical Knowledge
Karan Singhal et al., 2022.
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high…To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online…
Selecting the right benchmark to evaluate models is important. As judged by a panel of clinicians, Med-PaLM is nearly as good at representing the current clinical consensus as clinicians are: 92.6% for Med-PaLM vs. 92.9% for clinicians. But there are other measures where it falls well behind clinicians.
Would you trust a system like this to give you medical advice? Explainability of AI systems is going to become increasingly important. I’d want any system I consulted to be able to answer:
- How confident are you in this prediction?
- What sources have you used?
- What’s the best way for me to get a second opinion?
- How similar is this to other examples that have been ratified by human doctors?
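On the first of those questions, a minimal sketch of one way a system could surface a rough confidence signal: turn the model's raw scores for each candidate answer into probabilities with a softmax and report the top option's probability. This is purely illustrative, not the paper's method; `answer_confidence` and the toy logits are assumptions standing in for real LLM output.

```python
import math

def softmax(logits):
    # Convert raw scores to a probability distribution (numerically stable).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def answer_confidence(option_logits):
    """Return the top answer option and its probability as a crude confidence.

    `option_logits` is a hypothetical mapping from answer option to the
    model's raw score for that option -- a stand-in for real model output.
    """
    options = list(option_logits)
    probs = softmax([option_logits[o] for o in options])
    best = max(range(len(options)), key=lambda i: probs[i])
    return options[best], probs[best]

# Toy example: scores for a multiple-choice medical exam question.
ans, conf = answer_confidence({"A": 2.0, "B": 0.5, "C": -1.0})
print(ans, round(conf, 2))  # → A 0.79
```

A single probability like this is only a starting point: softmax scores from LLMs are often poorly calibrated, which is part of why the other questions (sources, second opinions, similarity to doctor-ratified cases) matter too.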