Vision Towards Multimodal Foundation Models with Medical Imaging

How can you incorporate language-based information to improve AI-based algorithms for medical imaging? In this presentation, Akshay Chaudhari, Assistant Professor in the Integrative Biomedical Imaging Informatics at Stanford (IBIIS) section of the Department of Radiology, discusses how large language models (LLMs) can be used to improve imaging outcomes.

In this talk, he shares how contrastive language-image pretraining (CLIP) can turn medical images, LLM-generated image descriptions, and clinical notes into algorithms that classify images, draft findings, and predict disease. While most radiology AI models work well for only one task, these multimodal foundation models excel at zero-shot learning: the ability to respond accurately to tasks they have never been trained on.
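
To illustrate how CLIP-style zero-shot classification works in practice, the sketch below scores an image against candidate text prompts using a generic, publicly available CLIP checkpoint from the Hugging Face transformers library. The checkpoint name, file path, and prompts are illustrative placeholders; the medical foundation models discussed in the talk are trained on domain-specific data rather than this general-purpose model.

```python
# Minimal sketch of CLIP-style zero-shot classification: embed an image and a
# set of candidate text prompts, then pick the prompt with the highest
# image-text similarity. A generic public CLIP checkpoint is used as a
# stand-in; "chest_xray.png" and the prompts are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")  # placeholder path
prompts = [
    "a chest X-ray showing pneumonia",
    "a chest X-ray with no acute findings",
]

# Encode the image and all candidate prompts in one batch.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-to-text similarity scores; softmax turns them
# into a probability over the candidate labels with no task-specific training.
probs = outputs.logits_per_image.softmax(dim=1)
for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{p:.3f}  {prompt}")
```

Because the labels are expressed as free text, adding a new finding only requires changing the prompt list, which is what makes zero-shot evaluation possible.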

Dr. Chaudhari discusses the latest research on radiology foundation models from the Stanford Machine Intelligence in Medical Imaging research group, including their finding that multimodal models trained with CLIP draft radiology reports better than GPT-4. He describes the group's approach to training a foundation model with 8 billion parameters on internal chest X-ray data. Using 15,000 abdominal CT scans and corresponding radiology reports as training data, they can now predict cardiometabolic disease and musculoskeletal disorders better than popular models such as RadFM and BiomedCLIP. Dr. Chaudhari also touches on synthesizing images with vision-language models to improve the generalizability of AI, and the importance of having access to platforms that manage large datasets, ensure image quality, and curate complex data.
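
The image-report pairing behind these models is typically learned with a symmetric contrastive objective. The sketch below shows that general CLIP-style training recipe in plain PyTorch; the encoder outputs are random placeholders, and this is an illustration of the technique, not the Stanford group's actual training code.

```python
# Minimal sketch of the CLIP-style symmetric contrastive (InfoNCE) objective
# used to align images with their report text. The feature tensors stand in
# for outputs of hypothetical image and text encoders.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over a batch of matched image/report pairs."""
    # L2-normalize so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with report j.
    logits = image_features @ text_features.t() / temperature

    # Matching image/report pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random features standing in for encoder outputs.
batch, dim = 8, 512
img_feats = torch.randn(batch, dim)  # e.g., image_encoder(ct_volumes)
txt_feats = torch.randn(batch, dim)  # e.g., text_encoder(report_tokens)
print(clip_contrastive_loss(img_feats, txt_feats))
```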