Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos

University of Washington
Equal Contribution

Quilt-LLaVA is capable of describing the prominent medical regions within a histopathology patch. Additionally, it can be utilized to reason towards a diagnosis based on the current observations. Note: The image includes eosinophils and lymphocytes, and is sampled from a WSI showing rare benign dermatitis.

Key Takeaways

  • Visual Instruction Tuning in histopathology has not been successful so far due to two reasons: 1) Image-language datasets lacking spatial groundings for words, making it hard for models to have spatial awareness, and 2) Existing datasets are based on PubMed, which provides isolated image-text pairs, with which it is very challenging to encode holistic understanding.
  • We utilize educational YouTube content to establish a visual instruction tuning dataset. First, we extract the narrator's mouse cursor to ground their words in images to encode spatial awareness, and second, we utilize broader video content when generating our instruction tuning dataset, thereby providing a holistic understanding of Whole Slide Images.
  • With our instruction tuning data, we jointly train a vision and text encoder to have a visual chatbot, which outperforms SOTA on both in-house and public histopathology benchmarks.


The gigapixel scale of whole slide images (WSIs) poses a challenge for histopathology multi-modal chatbots, requiring a global WSI analysis for diagnosis, compounding evidence from different WSI patches. Current visual instruction datasets, generated through large language models, focus on creating question/answer pairs for individual image patches, which may lack diagnostic capacity on their own in histopathology, further complicated by the absence of spatial grounding in histopathology image captions.

To bridge this gap, we introduce Quilt-Instruct, a large-scale dataset of 107,131 histopathology-specific instruction question/answer pairs, that is collected by leveraging educational histopathology videos from YouTube, which provides spatial localization of captions by automatically extracting narrators' cursor movements. In addition, we provide contextual reasoning by extracting diagnosis and supporting facts from the entire video content to guide the extrapolative reasoning of GPT-4. Using Quilt-Instruct, we train Quilt-LLaVA, which can reason beyond the given single image patch, enabling diagnostic reasoning and the capability of spatial awareness. To evaluate Quilt-LLaVA, we propose a comprehensive evaluation dataset created from 985 images and 1283 human-generated question-answers directly extracted from videos. We also thoroughly evaluate Quilt-LLaVA using public histopathology datasets, where Quilt-LLaVA significantly outperforms SOTA by over 10% on relative GPT-4 score and 4% and 9% on open and closed set VQA.

We have created a grounded image-text dataset from educational histopathology videos on YouTube. The bottom row displays an illustrative example. First, we detect frames that have a stable background. Then we extract the narrators' mouse cursors. Then, we perform spatio-temporal clustering on the mouse pointer locations to obtain dense visual groundings for the narrators' speech. Using this method, we create grounded image-text dataset, from which we generate Quilt-Instruct to train our visual Language Learning Model, Quilt-LLaVA.

Generating Quilt-Instruct

Similar to LLaVA-Med, we use conversation and detailed description based prompts (We call these independent prompts, where each image/caption pair is considered independent to one another), where we utilize GPT-4 to generate Q&A pairs from image captions. However unlike LLaVA-Med, which lacks spatial groundings, we extract mouse pointers to ground narrator's speech into spatial regions of images, leading to better spatial awareness.

Some examples of detailed description and conversation based Q&A samples from the given description.

Traditional image-caption datasets often consist of pairs that lack contextual connection, limiting the Q/A pairs generated by GPT-4 to the context of a single image. This is particularly a limitation for histopathology images, which require holistic analysis beyond a single image patch. To overcome this, we propose reasoning-based prompting techniques: Complex Reasoning and Iterative Abductive Reasoning, where we distill the global facts and the diagnosis from the broader video content, and leverage these when prompting GPT-4, enabling it to extrapolate in a contextually anchored manner, thereby reducing the risk of hallucinations.

In complex reasoning, given a caption, along with a diagnosis and contributory facts, we prompt GPT-4 in a diagnostic reasoning task designed to extrapolate beyond the immediate context of the given caption. More broadly, we instruct GPT-4 to utilize its inherent medical knowledge to interpret the contents of a single image caption, while subconsciously incorporating the diagnosis and supporting facts extracted from the broader video.

In iterative abductive reasoning, we simulate a conversation between two GPT-4 agents, mimicking a scenario where a professional doctor uses our model to ask longer medically intricate questions about an image. The first agent, termed Human-GPT, is provided with a single image caption and is tasked with abductively reasoning about the possible diagnoses and the facts used to arrive at these conclusions. The second agent, referred to as the AI Assistant GPT, is privy to the diagnosis and facts, simulating someone who has viewed the WSI of this particular patient. The AI Assistant evaluates the accuracy of the abduction derived by Human-GPT and provides comments or hints at potentially overlooked details using its inherent medical knowledge while utilizing diagnosis and facts. The conversation continues back and forth until a conclusion is made or the conversation reaches to the upper limit.

Some examples of reasoning-based Q&A samples generated from the given caption, along with the diagnosis and the facts leading up to that diagnosis, extracted from the broader video content.

Model Architecture and Training

We adopted the approach proposed in LLaVA-Med, where we initialize Quilt-LLaVA with the general-domain LLaVA and trained for two stages: Histopathology Domain Alignment using Quilt dataset and instruction-tuning on Quilt-Instruct dataset.

Human Generated VQA Dataset for Evaluation

To evaluate Quilt-LLaVA, alongside public VQA pathology datasets, we also generated Quilt-VQA by extracting Q&A dataset from naturally occurring questions/answers given in the videos. With the help of GPT4 and some handcrafted algorithms, we collect a rich evaluation dataset of 1283 Q&A pairs. Top two rows show image-dependent Q&A pairs and bottom two rows show general-knowledge Q&A pairs. The original question posed by the narrator of the video is highlighted in yellow.

Furthermore, we experimented with the visual prompting methodology outlined in Visual Prompting using Red Circle for evaluating the performance of our model. This involves utilizing the subset of QUILT-VQA with bounding boxes to create ellipses that encapsulate the concepts highlighted by these boxes.


We beat SOTA on both open ended (free form text generation) and close set (multiple choice based answers) tasks on both public and in-house datasets.

Qualitative Comparisons Against SOTA Visual LLMs.


If you find our work useful, please cite our paper:

      title={Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos}, 
      author={Mehmet Saygin Seyfioglu and Wisdom O. Ikezogwo and Fatemeh Ghezloo and Ranjay Krishna and Linda Shapiro},