MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning

1Department of Computer Science, University of Washington
2Department of Medicine, University of Washington
3Carnegie Mellon University 4Cornell Tech 5Allen Institute for AI

News & Updates

  • Excited to share that MediQ has been accepted to NeurIPS 2024! Come check us out at the poster session (Thu 12 Dec 11 a.m. — 2 p.m. PST, East Exhibit Hall A-C #4805)!
  • We added a new training dataset containing synthetic clinical conversations with follow-up questions (generated from MedQA). Now you can train your own Expert agent to ask questions!
  • We are planning to add two more datasets to the MediQ benchmark: health questions in the wild parsed from r/AskDocs and rare cases from NEJM. Stay tuned!

TLDR

When an LLM is unsure, how do we make it ask follow-up questions to gather more information? We introduce MediQ, a framework for simulating realistic clinical interactions, where an Expert model asks information-seeking questions when needed and responds reliably. We show that adapting LLMs to interactive information-seeking settings is nontrivial, and propose an abstention module to better estimate model confidence and ask better questions. MediQ improves diagnostic accuracy by 20.3%, but performance still lags behind an upper bound where full information is given upfront.

An example MediQ interaction, where the Expert system is expected to elicit information from the patient until it is confident in its diagnosis.

Abstract

Users typically engage with LLMs interactively, yet most existing benchmarks evaluate them in a static, single-turn format, posing reliability concerns in interactive scenarios. We identify a key obstacle to reliability: LLMs are trained to answer any question, even with incomplete context or insufficient knowledge. In this paper, we propose to change the static paradigm to an interactive one, develop systems that proactively ask questions to gather more information and respond reliably, and introduce a benchmark, MediQ, to evaluate question-asking ability in LLMs. MediQ simulates clinical interactions consisting of a Patient System and an adaptive Expert System; with potentially incomplete initial information, the Expert refrains from making diagnostic decisions when unconfident, and instead elicits missing details via follow-up questions. We provide a pipeline to convert single-turn medical benchmarks into an interactive format. Our results show that directly prompting state-of-the-art LLMs to ask questions degrades performance, indicating that adapting LLMs to proactive information-seeking settings is nontrivial. We experiment with abstention strategies to better estimate model confidence and decide when to ask questions, improving diagnostic accuracy by 22.3%; however, performance still lags compared to an (unrealistic in practice) upper bound with complete information upfront. Further analyses show improved interactive performance when filtering irrelevant contexts and reformatting conversations. Overall, we introduce a novel problem towards LLM reliability, an interactive MediQ benchmark and a novel question-asking system, and highlight directions to extend LLMs' information-seeking abilities in critical domains.

How do existing LLMs perform with Limited Information?

Main results on non-interactive settings
Accuracy in non-interactive setups with decreasing amounts of available information (left of the dashed line), and accuracy of the baseline (BASIC) and improved (BEST) interactive setups (right of the dashed line).

First, we reduce the amount of information presented to the Expert system to show that end-task accuracy is correlated with the amount of available information (see the Non-Interactive Setups to the left of the dashed vertical line).

Then, we provide the Initial information to the Expert system and give it the option to ask follow-up questions in an interactive manner (BASIC-Interactive). The model's performance drops compared to when it is given the same Initial information in the non-interactive setup (Initial Non-Interactive). This indicates that adapting LLMs to interactive information-seeking settings is nontrivial.

Finally, we show that our BEST setup, which incorporates explicit clinical reasoning and more accurate confidence estimation, effectively seeks additional information and improves performance.
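To make the setup concrete, below is a minimal sketch of the interactive loop between the Expert and Patient systems, assuming a generic chat-completion call; the prompt wording and the helper names query_llm and patient_system are placeholders of ours, not the exact MediQ implementation.

# Minimal sketch of a MediQ-style interactive loop (illustrative only;
# prompts and helper names are placeholders, not the actual MediQ code).

def expert_turn(query_llm, context, options):
    """Ask the Expert model to either commit to an answer or request more information."""
    prompt = (
        "You are a clinical expert. Patient information so far:\n"
        f"{context}\n\n"
        f"Answer choices: {options}\n"
        "If you are confident, reply 'ANSWER: <choice>'. "
        "Otherwise reply 'ASK: <one follow-up question>'."
    )
    return query_llm(prompt)

def run_interaction(query_llm, patient_system, initial_info, options, max_turns=10):
    """Loop until the Expert commits to an answer or the turn budget runs out."""
    context = initial_info
    for _ in range(max_turns):
        reply = expert_turn(query_llm, context, options)
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        question = reply.removeprefix("ASK:").strip()
        # The Patient System answers only from the underlying patient record.
        answer = patient_system(question)
        context += f"\nQ: {question}\nA: {answer}"
    # Force a final answer once the turn budget is exhausted.
    return expert_turn(query_llm, context + "\nYou must answer now.", options)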

Why does the BASIC interactive setup fail to perform clinical reasoning?


Is the BASIC interactive Expert system actually acquiring additional information? We looked into the question-asking behavior of the models and observed that LLMs almost never ask questions, even when given the option. Instead, they tend to answer the inquiry with incomplete information, which often leads to incorrect answers. We therefore hypothesize that this inability to ask questions leads to the poor performance, and in the following sections, we try to improve (1) the model's tendency to ask questions and (2) the quality of the questions.

Conversational Format and Irrelevant Information Distract the Expert System


Why did performance drop so much in the BASIC interactive setting? There is a striking 11.3% relative drop in accuracy compared to its non-interactive counterpart with the same Initial information (NI-Initial) across all benchmarked LLMs (7.43% for GPT-3.5 on iMedQA). We show that both irrelevant information and the conversational format of the information contribute to the poor performance. When we remove irrelevant information (follow-up questions that are not answerable using the patient record) and/or keep only unique information by removing repeated questions (which are usually unanswerable as well), accuracy increases, as shown in the blue bars. When we further convert the conversation log into a paragraph format, accuracy increases again, as shown in the orange bars, indicating that it is easier for models to integrate information presented as a paragraph.
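As a rough illustration of these two fixes, the sketch below drops unanswerable and repeated follow-ups and then flattens the remaining conversation into a paragraph; the exact "non-answer" phrasings used by the Patient System are an assumption of ours.

# Sketch of the two conversation-cleaning steps described above (illustrative;
# the non-answer strings below are assumptions, not the exact MediQ phrasing).

NON_ANSWERS = {
    "the patient record does not contain this information.",
    "i cannot answer this question.",
}

def clean_conversation(qa_pairs):
    """Drop follow-ups that were unanswerable from the record, plus repeated questions."""
    seen, kept = set(), []
    for question, answer in qa_pairs:
        if answer.strip().lower() in NON_ANSWERS:   # irrelevant: not grounded in the record
            continue
        if question.strip().lower() in seen:        # repeated question
            continue
        seen.add(question.strip().lower())
        kept.append((question, answer))
    return kept

def to_paragraph(initial_info, qa_pairs):
    """Reformat the cleaned conversation log into a single narrative paragraph."""
    facts = [answer for _, answer in qa_pairs]
    return " ".join([initial_info] + facts)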

Specialized Reasoning Modules Improve Expert System Performance


We improve the Expert system with a dedicated abstention module, which produces an abstention decision first and then uses separate question-generation and decision-making modules, allowing for more specialized instructions and a simpler decision at each step. We experiment with different confidence estimation formats by prompting the model to produce a numerical confidence score (Numerical), a binary confident/unconfident decision (Binary), or a scalar confidence-level rating (Scale). On top of this, we apply self-consistency (SC), repeating the prompt multiple times and averaging the outcomes, and rationale generation (RG), generating an explanation to identify knowledge gaps, to the confidence judgment. We show that rationale generation consistently improves performance, while self-consistency only helps when combined with rationale generation.
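A minimal sketch of this modular pipeline, using the Scale format with SC and RG, is shown below; the prompt wording, the 1-5 scale, the threshold value, and the query_llm helper are assumptions of ours rather than the exact MediQ prompts.

import re

# Sketch of the dedicated abstention step (illustrative; prompts, the 1-5 scale,
# and query_llm are assumptions, not the exact MediQ implementation).

def parse_scale(text):
    """Pull the first 1-5 rating out of the model's reply (very rough parsing)."""
    match = re.search(r"[1-5]", text)
    return int(match.group()) if match else 1

def estimate_confidence(query_llm, context, options, num_samples=5, rationale=True):
    """Scale-based confidence with rationale generation (RG) and
    self-consistency (SC) via averaging repeated samples."""
    scores = []
    for _ in range(num_samples):
        prompt = f"Patient information:\n{context}\nAnswer choices: {options}\n"
        if rationale:
            prompt += ("First, briefly explain what information, if any, is still "
                       "missing to choose an answer.\n")
        prompt += "Then rate your confidence in answering now on a scale of 1-5."
        scores.append(parse_scale(query_llm(prompt)))
    return sum(scores) / len(scores)

def abstention_module(query_llm, context, options, threshold=4.0):
    """Decide whether to answer now or keep gathering information."""
    confidence = estimate_confidence(query_llm, context, options)
    if confidence >= threshold:
        return "answer"   # hand off to the decision-making module
    return "ask"          # hand off to the question-generation module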

How much of the performance gap can be closed by asking questions?

We decompose the clinical reasoning process of the Expert into deciding when to ask questions and what questions to ask, and show that both contribute to performance gains. When to ask questions is controlled by confidence estimation: we find that setting an appropriate confidence threshold improves accuracy (left) and that rationale generation improves confidence estimation (middle). Finally, we show that rationale generation also helps identify knowledge gaps and leads to better questions.
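To illustrate the "what to ask" half, the sketch below feeds the rationale produced during confidence estimation (the identified knowledge gap) into the question-generation module; the prompt wording is again an assumption of ours.

# Sketch: steering the follow-up question toward the knowledge gap identified
# by rationale generation (illustrative; prompt wording is an assumption).

def generate_question(query_llm, context, rationale):
    """Generate one follow-up question focused on the missing information."""
    prompt = (
        "Patient information so far:\n"
        f"{context}\n\n"
        "Missing information identified during confidence estimation:\n"
        f"{rationale}\n\n"
        "Ask the patient ONE concise follow-up question that fills this gap."
    )
    return query_llm(prompt)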

Ablation results on confidence thresholds.
Accuracy over conversation length with independent abstention and question-generation modules, averaged across abstention strategies with linear extrapolation. Increasing the confidence threshold leads to more questions and higher accuracy.
Ablation results: rationale generation reduces calibration error.
Confidence scores with and without rationale generation (RG), averaged across Scale-based abstention strategies. RG leads to both lower initial confidence and lower expected calibration error (ECE).
Ablation results: rationale improves performance through question generation
Accuracy with and without rationale generation (RG) across Scale-based abstention strategies. The generated rationales often surface knowledge gaps and guide the question generator to produce more effective questions.
Example of a knowledge gap identified via rationale generation, leading to more relevant question generation.

How do information-seeking behaviors differ across different patient records?

We partition the patient records by medical specialty¹, question type², age, and gender. We find that when given limited initial information, proactive information-seeking benefits different groups to different degrees. This leads to a natural next step in the exploration of information-seeking LLM agents: how do we design different information-seeking behaviors to adapt to different scenarios? We leave this exciting question for future work.

Ablation results: interactive system helps/hinders different medical specialties differently.
Specialties like Ophthalmology and Neurosurgery benefit from information-seeking, but Family Medicine does not benefit as much. Questions testing clinical experience (Steps 2 & 3) benefit more from information-seeking than questions testing foundational science knowledge (Step 1).
Ablation results: effect of information-seeking across age and gender groups.
There are no significant trends in the effect of information-seeking across age and gender groups. However, it is important to note that MedQA is derived from expert-written medical exams rather than real patient records, so the difficulty of the cases may have been adjusted to balance age and gender groups. More rigorous experiments on more realistic patient distributions would be necessary for definitive conclusions.
1. The list of medical specialties and their definitions are obtained from the American Board of Medical Specialties.

2. There are three levels in the United States Medical Licensing Examination (USMLE): Step 1 tests foundational science knowledge, while Steps 2 and 3 evaluate clinical experience and the ability to practice medicine for patient care and management.

Poster

BibTeX

@inproceedings{li2024mediq,
  title={MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning},
  author={Li, Shuyue Stella and Balachandran, Vidhisha and Feng, Shangbin and Ilgen, Jonathan S and Pierson, Emma and Koh, Pang Wei and Tsvetkov, Yulia},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}