Jiantao Wu
Academic and research departments
Surrey Institute for People-Centred Artificial Intelligence (PAI)
My research project
Foundation Models for Visual Understanding
Much of the success of AI is attributed to supervised pretraining (SP) of deep neural networks, which are then adapted to specific downstream tasks. However, labelled data comes with several challenges, e.g., labelling cost, noisy labels, incomplete or inadequate labels (covering only the dominant concept(s) in an image), and inherent human labelling bias. An alternative to supervised learning is self-supervised (or unsupervised) learning, which learns without labels. Two key principles of self-supervised learning in (computer) vision-based AI systems are: (a) generate augmented views of the same input and enforce some notion of consistency between the views, and (b) mask part of the input and recover it from the remaining unmasked content.
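As a rough illustration of these two principles, the sketch below shows a view-consistency objective and a masked-recovery objective in PyTorch. The encoder and decoder modules, masking ratio and loss choices are illustrative placeholders, not the exact models or settings used in this project.

```python
# Minimal sketch of the two self-supervised principles described above.
# `encoder` and `decoder` are placeholder nn.Module objects.
import torch
import torch.nn.functional as F

def view_consistency_loss(encoder, view_a, view_b):
    """Principle (a): embed two augmented views of the same image and
    enforce agreement between them (here: negative cosine similarity)."""
    z_a = F.normalize(encoder(view_a), dim=-1)
    z_b = F.normalize(encoder(view_b), dim=-1)
    return -(z_a * z_b).sum(dim=-1).mean()

def masked_recovery_loss(encoder, decoder, patches, mask_ratio=0.6):
    """Principle (b): corrupt a random subset of patch tokens and
    reconstruct them from the visible context (pixel-level L1 loss)."""
    B, N, D = patches.shape                       # batch, patches, patch dim
    mask = torch.rand(B, N, device=patches.device) < mask_ratio
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = decoder(encoder(corrupted))           # predict all patch tokens
    return F.l1_loss(recon[mask], patches[mask])  # score only masked ones
```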
SSL is widely predicted to be the future of AI research; as AI pioneers put it, "the revolution will not be supervised". However, until very recently self-supervised pretraining lagged behind supervised pretraining for computer vision applications, hindering the realisation of the SSL vision. This changed with Group Masked Model Learning (GMML), proposed in our seminal work SiT: Self-supervised vIsion Transformers in 2021. GMML marked a milestone in AI development as the first method to outperform supervised pretraining, learning semantic concepts without using any labels. The core idea of GMML has been adopted by tech giants such as Microsoft and Facebook, and by many others, for application areas including computer vision, audio, medical image analysis, anomaly detection, multimodal analysis and more.
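The sketch below conveys the spirit of group-wise masking: rather than dropping isolated patches, contiguous blocks of the patch grid are masked so that whole semantic regions must be recovered from context. The grid size, block size and masking ratio are illustrative assumptions, not the published GMML configuration.

```python
# Rough sketch of group-wise masking in the spirit of GMML.
import torch

def group_mask(grid_h=14, grid_w=14, block=4, target_ratio=0.5):
    """Return a boolean (grid_h * grid_w,) mask built from random
    contiguous blocks until roughly `target_ratio` of patches is covered."""
    mask = torch.zeros(grid_h, grid_w, dtype=torch.bool)
    while mask.float().mean() < target_ratio:
        top = torch.randint(0, grid_h - block + 1, (1,)).item()
        left = torch.randint(0, grid_w - block + 1, (1,)).item()
        mask[top:top + block, left:left + block] = True  # mask a whole block
    return mask.flatten()  # one flag per patch token
```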
Supervisors
Publications
Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In the realm of computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. Nonetheless, the escalating cost of finetuning these large models has posed a challenge due to the explosion of model size. This study endeavours to evaluate the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks, obviating the need for finetuning, with the intention of emulating human-like capabilities in generalisation and recognition of unseen objects. To this end, we propose an evaluation protocol for zero-shot segmentation based on a prompting patch. Given a point on the target object as a prompt, the algorithm calculates the similarity map between the selected patch and all other patches, after which simple thresholding is applied to segment the target. A second evaluation measures intra-object and inter-object similarity to gauge the discriminatory ability of SSP ViTs. Insights from prompt-based zero-shot segmentation and the discriminatory abilities of SSP ViTs led to the design of a simple SSP approach, termed MMC. This approach combines Masked image modelling for encouraging similarity of local features, Momentum-based self-distillation for transferring semantics from global to local features, and global Contrast for promoting semantics of global features, to enhance discriminative representations of SSP ViTs. Consequently, our proposed method significantly reduces the overlap of intra-object and inter-object similarities, thereby facilitating effective object segmentation within an image. Our experiments reveal that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.
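A hedged sketch of the prompt-based zero-shot segmentation protocol described in the abstract is given below: pick the patch containing the prompt point, compute cosine similarity between its feature and all other patch features, and threshold the similarity map. The feature extraction step, threshold value and patch size are placeholder assumptions; the paper's exact settings may differ.

```python
# Sketch of prompt-based zero-shot segmentation from ViT patch features.
import torch
import torch.nn.functional as F

def segment_from_prompt(patch_feats, prompt_index, grid_h, grid_w, thresh=0.5):
    """patch_feats: (N, D) patch features for one image (no CLS token),
    prompt_index: index of the patch containing the clicked point."""
    feats = F.normalize(patch_feats, dim=-1)
    sim = feats @ feats[prompt_index]                  # (N,) similarity map
    seg = (sim >= thresh).float().view(grid_h, grid_w) # patch-level mask
    # Upsample the patch-level mask back to pixel resolution (16x16 patches
    # assumed here) for visualisation or evaluation.
    return F.interpolate(seg[None, None], scale_factor=16, mode="nearest")[0, 0]
```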
Recently, masked image modeling (MIM), an important self-supervised learning (SSL) method, has drawn attention for its effectiveness in learning data representations from unlabeled data. Numerous studies underscore the advantages of MIM, highlighting how models pretrained on extensive datasets can enhance the performance of downstream tasks. However, the high computational demands of pretraining pose significant challenges, particularly within academic environments, thereby impeding progress in SSL research. In this study, we propose efficient training recipes for MIM-based SSL that focus on mitigating data-loading bottlenecks and employing progressive training techniques, among other measures, to closely maintain pretraining performance. Our library enables the training of a MAE-Base/16 model on the ImageNet-1K dataset for 800 epochs within just 18 hours, using a single machine equipped with 8 A100 GPUs. By achieving speed gains of up to 5.8 times, this work not only demonstrates the feasibility of high-efficiency SSL training but also broadens accessibility and promotes advancement in SSL research, particularly for prototyping and initial testing of SSL ideas.
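To make the two classes of measures concrete, the generic PyTorch sketch below shows (i) a data loader configured to reduce loading bottlenecks and (ii) a progressive image-resolution schedule. It does not reproduce the library's actual recipes; worker counts, image sizes and the epoch schedule are illustrative assumptions.

```python
# Generic sketch of data-loading and progressive-training measures.
from torch.utils.data import DataLoader

def make_loader(dataset, batch_size=256):
    # Mitigate data-loading bottlenecks: many workers, pinned memory,
    # persistent workers and prefetching keep the GPUs fed.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=16, pin_memory=True,
                      persistent_workers=True, prefetch_factor=4)

# Progressive training: start at low resolution and grow it, so early
# epochs are cheap and later epochs recover full-resolution accuracy.
RESOLUTION_SCHEDULE = [(0, 112), (400, 160), (600, 224)]  # (start_epoch, side)

def resolution_for_epoch(epoch):
    side = RESOLUTION_SCHEDULE[0][1]
    for start, s in RESOLUTION_SCHEDULE:
        if epoch >= start:
            side = s
    return side
```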