Srinivasa Rao Nandam
Academic and research departments
Centre for Vision, Speech and Signal Processing (CVSSP), Surrey Institute for People-Centred Artificial Intelligence (PAI)
About
My research project
Foundation models for multimodal understanding
Foundation models for natural language processing (NLP) have already seen huge success following the seminal works of 2018, BERT (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding) and GPT (Improving Language Understanding by Generative Pre-Training), which are recognised as the early foundation models for NLP.
However, foundation models for computer vision only started to emerge three years later, at the beginning of 2021, with the seminal work of SiT (SiT: Self-supervised vIsion Transformer (under review)), which proposed the idea of group masked model learning (GMML).
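The core intuition behind GMML is that masking connected groups of image patches, rather than isolated patches, forces a vision transformer to recover semantic content from wider context instead of interpolating from immediate neighbours. The sketch below illustrates that idea of group-wise masking on a patch grid; the block sizes, sampling scheme, and function names are illustrative assumptions, not SiT's actual recipe.

```python
import random

def group_mask(grid_h, grid_w, mask_ratio=0.5, max_block=4, seed=None):
    """Select patch indices to mask in connected rectangular groups.

    A rough sketch of group masked model learning (GMML)-style masking:
    whole blocks of neighbouring patches are hidden together, so the
    masked regions cannot be trivially in-painted from adjacent patches.
    All hyperparameters here are assumptions for illustration only.
    """
    rng = random.Random(seed)
    total = grid_h * grid_w
    target = int(total * mask_ratio)
    masked = set()
    while len(masked) < target:
        # Sample a random small block of patches and mask it wholesale.
        bh = rng.randint(1, max_block)
        bw = rng.randint(1, max_block)
        top = rng.randint(0, grid_h - bh)
        left = rng.randint(0, grid_w - bw)
        for r in range(top, top + bh):
            for c in range(left, left + bw):
                masked.add(r * grid_w + c)
    return masked

# Example: a 14x14 patch grid (e.g. a 224x224 image with 16x16 patches).
mask = group_mask(14, 14, mask_ratio=0.5, seed=0)
```

The masked indices would then be replaced (e.g. with a learnable mask token) before the transformer is trained to reconstruct the hidden regions.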
Equipped with both NLP and vision foundation models, the aim of the PhD will be to study the role of these foundation models (e.g., NLP, vision, audio) in multimodal analysis and understanding. For example, the initial work of the current research team (CLMIU: Commonsense Learning in Multimodal Image Understanding (under review)) has already established that using vision foundation models for multimodal image understanding is more beneficial, and has alleviated the need for an object detector, which is otherwise considered a critical pre-processing step for visual input.
The PhD research will build more advanced multimodal and cross-modal algorithms suitable for several downstream applications by building upon foundation models. The explainability of the decisions and tasks performed by multimodal algorithms will also be a particular focus of the PhD study.
Supervisors