MediQ: Question-Asking LLMs for Adaptive and Reliable Clinical Reasoning

1Department of Computer Science, University of Washington
2Department of Medicine, University of Washington
3Carnegie Mellon University 4Cornell Tech 5Allen Institute for AI

TLDR

When an LLM is unsure, how do we make it ask follow-up questions to gather more information? We introduce MEDIQ, a framework for simulating realistic clinical interactions, where an Expert model asks information-seeking questions when needed and responds reliably. We show that adapting LLMs to interactive information-seeking settings is nontrivial, and propose an abstention module to better estimate model confidence and ask better questions. MEDIQ improves diagnostic accuracy by 20.3%, but performance still lags behind an upper bound where full information is given upfront.

An example MediQ interaction, where the Expert system is expected to elicit information from the patient until it is confident in its diagnosis.

Abstract

In high-stakes domains like clinical reasoning, AI assistants powered by large language models (LLMs) are yet to be reliable and safe. We identify a key obstacle towards reliability: existing LLMs are trained to answer any question, even with incomplete context in the prompt or insufficient parametric knowledge. We propose to change this paradigm to develop more careful LLMs that ask follow-up questions to gather necessary and sufficient information and respond reliably. We introduce MEDIQ, a framework to simulate realistic clinical interactions, which incorporates a Patient System and an adaptive Expert System. The Patient may provide incomplete information in the beginning; the Expert refrains from making diagnostic decisions when unconfident, and instead elicits missing details from the Patient via follow-up questions. To evaluate MEDIQ, we convert MedQA and Craft-MD---medical benchmarks for diagnostic question answering---into an interactive setup. We develop a reliable Patient system and prototype several Expert systems, first showing that directly prompting state-of-the-art LLMs to ask questions degrades the quality of clinical reasoning, indicating that adapting LLMs to interactive information-seeking settings is nontrivial. We then augment the Expert with a novel abstention module to better estimate model confidence and decide whether to ask more questions, thereby improving diagnostic accuracy by 20.3%; however, performance still lags compared to an (unrealistic in practice) upper bound when full information is given upfront. Further analyses reveal that interactive performance can be improved by filtering irrelevant contexts and reformatting conversations. Overall, our paper introduces a novel problem towards LLM reliability, a novel MEDIQ framework, and highlights important future directions to extend the information-seeking abilities of LLM assistants in critical domains.

How do existing LLMs perform with Limited Information?

Accuracy in non-interactive setups with decreasing amounts of available information (left of the dashed line), and accuracy of the baseline (BASIC) and improved (BEST) interactive setups (right of the dashed line).

First, we reduce the amount of information presented to the Expert system to show that end-task accuracy is correlated with the amount of information available (see the Non-Interactive Setups to the left of the dashed vertical separator line).

Then, we provide the Initial information to the Expert system and give it the option to ask follow-up questions in an interactive manner (BASIC-Interactive), as sketched below. The model's performance drops compared to when it is given the same Initial information in the non-interactive setup (Initial Non-Interactive). This indicates that adapting LLMs to interactive information-seeking settings is nontrivial.
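To make the setup concrete, here is a minimal sketch of such an interactive evaluation loop in Python. The expert_llm and patient_llm objects and their methods (act, respond, answer) are illustrative placeholders, not the released MediQ interface.

# Minimal sketch of a BASIC-style interactive loop (hypothetical helper names,
# not the released MediQ implementation): each turn, the Expert either answers
# the multiple-choice question or asks the Patient one follow-up question.
def basic_interactive_loop(expert_llm, patient_llm, initial_info, question, options, max_turns=10):
    context = [initial_info]
    for _ in range(max_turns):
        # The Expert is prompted to either output a final answer or a follow-up question.
        action = expert_llm.act(context="\n".join(context), question=question, options=options)
        if action["type"] == "answer":
            return action["choice"]
        # Otherwise, the Patient (grounded in the full patient record) answers the follow-up.
        reply = patient_llm.respond(action["follow_up"])
        context.append(f"Doctor: {action['follow_up']}\nPatient: {reply}")
    # Force a final decision once the turn budget is exhausted.
    return expert_llm.answer(context="\n".join(context), question=question, options=options)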

Finally, we show that our BEST setup, which incorporates explicit clinical reasoning and more accurate confidence estimation, effectively seeks additional information and improves performance.

Why does the BASIC interactive setup fail to perform clinical reasoning?


Is the BASIC interactive Expert system actually acquiring additional information? We examined the question-asking behavior of the models and observed that LLMs almost never ask questions even when given the option. Instead, they tend to answer the inquiry with incomplete information, which often leads to incorrect answers. We therefore hypothesize that this inability to ask questions leads to the poor performance, and in the following sections we try to improve (1) the model's tendency to ask questions and (2) the quality of the questions.
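A simple way to quantify this behavior, assuming each logged episode records the Expert's per-turn actions as either "ask" or "answer" (an illustrative log format, not the exact MediQ output):

# Fraction of episodes in which the Expert asks at least one follow-up question.
def question_asking_rate(episodes):
    asked = sum(1 for ep in episodes if any(turn["action"] == "ask" for turn in ep["turns"]))
    return asked / len(episodes)

# Example usage (assuming logged_episodes is a list of such episode records):
# rate = question_asking_rate(logged_episodes)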

Conversational Format and Irrelevant Information Distract the Expert System


Why did performance drop so much in the BASIC baseline interactive setting? There is a striking 11.3% relative drop in accuracy compared to its non-interactive counterpart with the same Initial information (NI-Initial) across all benchmarked LLMs (7.43% for GPT-3.5 on iMedQA). We show that both irrelevant information and the conversational format of the elicited information contribute to the poor performance. When we remove irrelevant information---follow-up questions that are not answerable from the patient record---and/or keep only unique information by removing repeated questions (which are usually unanswerable as well), accuracy increases, as shown in the blue bars. When we further convert the conversation-log format into a paragraph format, accuracy increases again, as shown in the orange bars, indicating that models integrate information more easily in paragraph form.
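A minimal sketch of these two cleaning steps, assuming a helper is_answerable(question, patient_record) that checks whether a follow-up question is supported by the patient record (illustrative names, not the exact MediQ implementation):

# (1) Drop follow-up turns whose questions are unanswerable or repeated, then
# (2) flatten the remaining Q/A turns into a single paragraph instead of a dialogue.
def clean_context(turns, patient_record, is_answerable):
    seen_questions = set()
    kept = []
    for q, a in turns:  # each turn is a (follow-up question, patient answer) pair
        if not is_answerable(q, patient_record):   # irrelevant: not supported by the record
            continue
        if q.strip().lower() in seen_questions:    # repeated question
            continue
        seen_questions.add(q.strip().lower())
        kept.append((q, a))
    # Paragraph format: state each elicited fact declaratively rather than as dialogue turns.
    return " ".join(f"Regarding '{q}', the patient reports: {a}" for q, a in kept)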

Specialized Reasoning Modules Improve Expert System Performance


We improve the Expert system by adding a dedicated abstention module. The module first produces an abstention decision, then hands off to separate question-generation and decision-making modules, allowing more specialized instructions and a simpler decision at each step. We experimented with different confidence-estimation formats by prompting the model to produce a numerical confidence score (Numerical), a binary confident/unconfident decision (Binary), or a scalar confidence-level rating (Scale). On top of this, we apply self-consistency (SC)---repeating the prompt multiple times and averaging the outcomes---and rationale generation (RG)---generating an explanation to identify knowledge gaps---for the confidence judgment. We find that rationale generation consistently improves performance, while self-consistency only helps when combined with rationale generation.
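The sketch below illustrates one way such an abstention step could be wired together (Scale-style confidence with optional RG and SC). The method names generate_rationale, estimate_confidence, generate_question, and answer are illustrative placeholders rather than the exact MediQ API.

from statistics import mean

# Estimate confidence (optionally with rationale generation and self-consistency),
# answer only if confidence clears a threshold, otherwise abstain and ask a question.
def abstain_or_answer(expert_llm, context, question, options,
                      threshold=0.8, num_samples=5, use_rationale=True):
    scores = []
    for _ in range(num_samples):                      # self-consistency: sample repeatedly
        rationale = ""
        if use_rationale:                             # RG: articulate what is known / missing
            rationale = expert_llm.generate_rationale(context, question, options)
        # Scale-style confidence: the model rates its confidence, mapped to [0, 1].
        scores.append(expert_llm.estimate_confidence(context, question, options, rationale))
    confidence = mean(scores)
    if confidence >= threshold:
        return {"action": "answer",
                "choice": expert_llm.answer(context, question, options)}
    # Below threshold: abstain and ask a targeted follow-up, guided by the rationale.
    return {"action": "ask",
            "follow_up": expert_llm.generate_question(context, question, options, rationale)}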

How much of the performance gap can be closed by asking questions?

We decompose the clinical reasoning process of the Expert into deciding when to ask questions and what questions to ask, and show that both contribute to performance gains. When to ask is controlled by the confidence estimation: we find that setting an appropriate confidence threshold improves accuracy (left) and that rationale generation improves confidence estimation (middle). Finally, we show that rationale generation also helps identify knowledge gaps and leads to better questions.

Accuracy over conversation length with independent abstention and question-generation modules, averaged across abstention strategies, with linear extrapolation. Increasing the confidence threshold leads to more questions and higher accuracy.
Confidence scores with and without rationale generation (RG), averaged across Scale-based abstention strategies. RG leads to both lower initial confidence and lower expected calibration error (ECE).
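For reference, ECE can be estimated with standard equal-width confidence bins; the snippet below is a generic implementation, not necessarily the exact computation used in the paper, and its inputs (confidences, correct) are illustrative per-example confidence scores in [0, 1] and 0/1 correctness indicators.

# Weighted average over bins of |accuracy - mean confidence| within each bin.
def expected_calibration_error(confidences, correct, num_bins=10):
    n = len(confidences)
    ece = 0.0
    for b in range(num_bins):
        lo, hi = b / num_bins, (b + 1) / num_bins
        bucket = [i for i, c in enumerate(confidences) if lo < c <= hi or (b == 0 and c == 0)]
        if not bucket:
            continue
        avg_conf = sum(confidences[i] for i in bucket) / len(bucket)
        avg_acc = sum(correct[i] for i in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_acc - avg_conf)
    return ece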
Accuracy with and without rationale generation (RG) across Scale-based abstention strategies. The generated rationale often surfaces knowledge gaps and guides the question generator to produce more effective questions.
Example of a knowledge gap identified via rationale generation, leading to more relevant question generation.

BibTeX

@inproceedings{li2024mediq,
  title={{MediQ}: Question-Asking {LLMs} and a Benchmark for Adaptive and Reliable Clinical Reasoning},
  author={Li, Shuyue Stella and Balachandran, Vidhisha and Feng, Shangbin and Ilgen, Jonathan and Pierson, Emma and Koh, Pang Wei and Tsvetkov, Yulia},
  booktitle={Proc. NeurIPS},
  year={2024}
}