Umar Marikkar
Academic and research departments
Surrey Institute for People-Centred Artificial Intelligence (PAI)
My research project
Foundation models for understanding medical data

Foundation models such as BERT ("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding") and GPT ("Improving Language Understanding by Generative Pre-Training") have transformed natural language processing (NLP). Foundation models for vision, however, only began to emerge roughly two and a half years later, at the beginning of 2021, with the introduction of group masked model learning (GMML) in "SiT: Self-supervised Vision Transformer". The use of these foundation models in healthcare remains underexplored.
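The core idea behind GMML is to mask contiguous groups of image patches (rather than isolated patches) and train the model to reconstruct them. A minimal sketch of such a group mask, using hypothetical grid and ratio parameters rather than the authors' exact algorithm, might look like this:

```python
import numpy as np

def group_mask(grid=14, ratio=0.5, group=3, rng=None):
    """Mask contiguous groups of image patches (GMML-style sketch).

    Returns a boolean (grid, grid) array where True marks a masked
    patch. Blocks of size group x group are placed at random until at
    least `ratio` of all patches are covered.
    """
    rng = np.random.default_rng(rng)
    mask = np.zeros((grid, grid), dtype=bool)
    target = int(ratio * grid * grid)
    while mask.sum() < target:
        # pick a random top-left corner and mask a group x group block
        r = rng.integers(0, grid - group + 1)
        c = rng.integers(0, grid - group + 1)
        mask[r:r + group, c:c + group] = True
    return mask

m = group_mask(grid=14, ratio=0.5, rng=0)
print(m.shape, round(m.mean(), 2))
```

Masking connected regions forces the model to use surrounding context to recover whole local structures, which is the property that makes this objective a useful self-supervised pre-training signal.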
The research project proposes to study the role of these foundation models in multimodal healthcare analysis and to develop machine learning models for classifying cardiopulmonary conditions (e.g. pneumonia) from multimodal data: vital signs, chest X-rays and metadata, including Electronic Health Records. The project will use the recently published MIMIC-CXR database of 377,110 images from 65,379 patients presenting to the Emergency Department in Boston between 2011 and 2016. Each imaging study contains one or more images, typically frontal and lateral views (over 65%). Recent work by the team (CLMIU: Commonsense Learning in Multimodal Image Understanding) established that using foundation models for vision and NLP in vision-language pre-training is more beneficial, and alleviates the need for an object detector, which was previously considered a critical pre-processing step for visual input. The PhD research will build on foundation models to develop advanced multimodal healthcare analysis algorithms suitable for several downstream applications, including classification, detection, segmentation and grounding of disease in imaging data, as well as unsupervised discovery of patterns. Given the clinical setting, particular emphasis will be placed on the explainability of the decisions made by the algorithms.
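One simple way to combine the modalities described above is late fusion: each modality is embedded separately (e.g. the chest X-ray by a pretrained vision foundation model, the vitals/EHR features by a small tabular encoder) and the embeddings are projected and combined for classification. The sketch below uses made-up feature dimensions and random placeholder weights; it illustrates the fusion pattern, not the project's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: an image embedding (e.g. from a pretrained
# vision transformer) and a small vector of vital-sign / EHR features.
img_emb = rng.normal(size=(4, 128))   # batch of 4 chest X-ray embeddings
vitals = rng.normal(size=(4, 8))      # e.g. heart rate, SpO2, temperature

def late_fusion_logits(img, tab, w_img, w_tab, b):
    """Late fusion: project each modality into the class space and sum.
    The weights here are random placeholders; in practice they are learned."""
    return img @ w_img + tab @ w_tab + b

w_img = rng.normal(size=(128, 2)) * 0.1   # 2 classes: pneumonia / normal
w_tab = rng.normal(size=(8, 2)) * 0.1
b = np.zeros(2)

logits = late_fusion_logits(img_emb, vitals, w_img, w_tab, b)
# softmax over classes to get per-patient probabilities
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.shape)  # (4, 2)
```

A per-modality projection also makes explainability easier than early concatenation: the contribution of each modality to the final logit can be inspected separately.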
Supervisors
"Foundation models like "BERT: Pre-training of deep Bidirectional Transformers for Language Understanding" & "Generative Pretraining Transformer: Improving language understanding by generative pre-training (GPT-N)" have transformed natural language processing (NLP). However, the foundation models for vision started to emerge 2.5 years later at the beginning of 2021 with the introduction of group masked model learning (GMML) in "SiT: Self-supervised vision Transformer". The use of these foundation models in healthcare is underexplored.
The research project proposes to study the role of these foundation models in multimodal healthcare analysis and develop Machine Learning models for classifying cardiopulmonary conditions (e.g. Pneumonia) using multimodal data: vital signs, chest X-rays and meta-data including Electronic Health Records. The project will use recently published large database (MIMIC-CXR) of 377,110 images for 65,379 patients presenting to the Emergency department in Boston between 2011-2016. Each imaging study contains one or more images, typically frontal and lateral views (over 65 %). A recent work of the current team (CLMIU: Commonsense Learning in Multimodal Image Understanding) has established that using foundation models for vision & NLP for vision-language pre-training is more beneficial and
has already alleviated the need of object detector which is considered as a critical pre-processing step for visual input. The PhD research will build advanced multimodal healthcare analysis algorithms suitable for several downstream applications by building upon foundation models. Some of the downstream healthcare application can include, classification, detection, segmentation, grounding of disease in imaging data and unsupervised discovery of patterns. Due to healthcare, a particular emphasis will be given to the explainability of the decisions made by the algorithms.
Publications
Immunofluorescence (IF) images reveal detailed information about structures and functions at the subcellular level. However, unlike RGB images, IF datasets pose challenges for deep learning models due to inconsistencies in channel count and configuration, stemming from varying staining protocols across laboratories and studies. Although existing approaches build channel-adaptive models for training, they do not evaluate across IF datasets with unseen channel configurations. To address this, we first introduce a biologically informed view of cellular image channels by grouping them into either context or concept, where we treat the context channels as a reference for the concept channels in the image. We leverage this view to propose Channel Conditioned Cell Representations (C3R), a framework that learns representations that transfer well to both in-distribution (ID) and out-of-distribution (OOD) datasets, which contain the same and different channel configurations, respectively. C3R is a twofold framework comprising a channel-adaptive encoder architecture and a masked knowledge distillation training strategy, both built around the context-concept principle. We find that C3R outperforms existing benchmarks on both ID and OOD tasks, while yielding state-of-the-art results on frozen-encoder evaluation on the CHAMMI benchmark. Our method opens a new pathway for cross-dataset generalization between IF datasets, with no need for retraining on unseen channel configurations.
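The context-concept view can be illustrated with a small data-handling sketch: each concept channel is paired with the shared context channels, so the encoder always sees a concept stain relative to the same reference. The channel-role labels below are assumed dataset metadata, and this mirrors the grouping idea rather than the paper's actual implementation.

```python
import numpy as np

def split_context_concept(image, roles):
    """Group IF channels into 'context' (reference stains, e.g. nucleus)
    and 'concept' channels, pairing every concept channel with the full
    context stack. `roles` is assumed per-channel metadata; this is a
    sketch of the context/concept grouping, not the C3R codebase."""
    ctx_idx = [i for i, r in enumerate(roles) if r == "context"]
    ctx = image[ctx_idx]                       # (n_ctx, H, W) reference stack
    pairs = [np.concatenate([ctx, image[i:i + 1]], axis=0)
             for i, r in enumerate(roles) if r == "concept"]
    return pairs  # one (n_ctx + 1, H, W) stack per concept channel

img = np.random.default_rng(0).random((5, 32, 32))  # 5-channel IF image
roles = ["context", "context", "concept", "concept", "concept"]
pairs = split_context_concept(img, roles)
print(len(pairs), pairs[0].shape)  # 3 (3, 32, 32)
```

Because each pair has a fixed shape regardless of how many concept channels a dataset provides, a model trained this way can, in principle, accept unseen channel configurations without retraining.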