Dr Armin Mustafa

Associate Professor in Computer Vision and AI

PhD, M.Tech, FHEA, Royal Academy of Engineering Research Fellow

+44 (0)1483 684262

armin.mustafa@surrey.ac.uk

https://arminmustafa.github.io/

39 BA 01

Academic and research departments

Centre for Vision, Speech and Signal Processing (CVSSP), School of Computer Science and Electronic Engineering, Surrey Institute for People-Centred Artificial Intelligence (PAI).

About

Biography

I am currently a Royal Academy of Engineering Research Fellow in the Centre for Vision, Speech and Signal Processing, University of Surrey working in 4D Vision for perceptive machines. The emergence of machines that interact with their environment has led to an increasing demand for automatic visual understanding of real-world scenes. My research exploits Artificial Intelligence (AI) to better understand complex scenes so that machines can efficiently model and interpret real-world for a range of socially beneficial applications including entertainment and creative industries, autonomous systems, and augmented/virtual reality. Over the past ten years, I have pioneered advances in 4D vision, NLP and Scene Understanding to understand complex real-world dynamic scenes.

I finished PhD in general dynamic scene reconstruction from multi-view videos in 2016 from the University of Surrey, supervised by Prof. Adrian Hilton, after which I became a Research Fellow at CVSSP, University of Surrey. I have previously worked at Samsung Research Institute, Bangalore, India for 3 years (2010 - 2013) in Computer Vision.

Areas of specialism

Computer Vision; Scene Understanding; 3D/4D Vision; Virtual Reality; Light Fields; Machine Learning; Video Captioning; Augmented Reality; Artificial Intelligence; Audio-visual Video Understanding

University roles and responsibilities

Early Career Representative, CVSSP
Surrey AI Fellow, PAI

Affiliations and memberships

British Machine Vision Association (BMVA)

Member

Institute of Electrical and Electronics Engineers (IEEE)

Member

Institution of Engineering and Technology (IET)

Member

Association for Computing Machinery (ACM)

Member

The Women's Engineering Society (WES)

Member

STEM Learning, UK

STEM Ambassador

Awards

2018 - Research Fellowship, The Royal Academy of Engineering , UK.
2017 - Young Researcher award, CVPR.
2016 - Doctoral Consortium grant, CVPR.
2015 - BMVA travel grant for ICCV.
2014 - Set-Squared Research to Innovator grant, Global #1 University incubator, UK.
2013 - Overseas Research Scholarship, FEPS, The University of Surrey, UK.
2010 - Cadence Silver Medal, Indian Institute of Technology, Kanpur, India.

News

26 NOV 2025

Recent Surrey graduate named National Grid Graduate of the Year

19 JUN 2021

CVSSP academics showcase 11 papers at leading computer vision conference

In the media

19 September 2018

Academy awards eight new research fellowships

Royal Academy of Engineering

13 October 2016

Selected presentations at ECCV 2016 - 4D Match Trees for Non-rigid Surface Alignment

Computer Vision News, RSIP Vision

Research

Research projects

AI4ME - BBC Prospective Partnership on Future of Personalised Media

£15 pound UKRI Prosperity Partnership with BBC - 2021 - 2026

4D Vision for Perceptive Machines

5 year fellowship funded by Royal Academy of Engineering

ALIVE: Live Action Light Fields for Immersive VR Experiences

Innovate UK project in collaboration with Figment Productions and Foundry

IMPART: Intelligent Management Platform for Advanced Real-Time media processes

EU FP7 project in collaboration with Filmlight, Double Negative, AUTH, UPF and BUT

Supervision

Postgraduate research supervision

Current

Sarthak Batra ”Text or Video to 3D for Scenes with Multiple Interacting People”
Ayushi Dutta ”Multi-person Reconstruction and Rendering of Dynamic Scenes”
Asmar Nadeem ”Audio-visual Scene Understanding from Monocular Video”
Soon Yau on ”Automatic Storyboard Generation” from 2021 - Now
Nikolina Kubiak on ”Computation Relighting in Video” from 2020 - Now

Completed

Stephanie Stoll on ”Automatic Sign Language Production” from 2019 – 2022
Mertalp Ocal on ”Self-Supervised 3D Reconstruction of Complex Dynamic Scenes” from 2018 - Now
Akin Caliskan on ”Dynamic 3D Human Reconstruction From Video” from 2017 - 2021

Teaching

I currently contribute to teaching on the following modules:

EEE3032 Computer Vision and Pattern Recognition
EEEM004 MSc Projects
EEE3017 UG Projects

Publications

Ayushi Dutta, Marco Pesavento, Marco Volino, Adrian Hilton, Armin Mustafa (2025)Realistic Clothed Human and Object Joint Reconstruction from a Single Image, In: Hansung Kim, Claudio Guarnera, Peter Eisert, Peter Vangorp, Changjae Oh (eds.), Proceedings of the 22nd ACM SIGGRAPH European Conference on Visual Media Production11pp. 1-12 ACM

DOI: 10.1145/3756863.3769705

In recent years, there has been a growing interest towards developing personalized human avatars, for applications ranging from virtual reality, movie production, gaming and social telepresence. In the future, these avatars will be expected to also interact with everyday objects. Achieving this requires not only accurate human reconstruction, but also a joint understanding of surrounding objects and their interactions, making 3D human-object reconstruction key to its success. Existing methods that can jointly reconstruct 3D humans and objects from a single RGB image produce only coarse or template-based shapes, thus failing to capture realistic details in the reconstruction, such as loose clothing on the human body. In this work, for the first time, we propose an approach to jointly reconstruct 3D clothed humans and objects, given a monocular image of a human-object scene. At the core of our framework, is a novel attention-based model that jointly learns an implicit function for the human and the object. Given a query point, our model utilizes pixel-aligned features from the input human-object image as well as from separate, non-occluded views of the human and the object as synthesized by a diffusion model. This allows the model to reason about human-object spatial relationships as well as to recover details from both visible and occluded regions, enabling realistic reconstruction. To guide the reconstruction, we condition the neural implicit model on human-object pose estimation priors. To support training and evaluation, we also introduce a synthetic human-object dataset. We demonstrate on real-world datasets that our approach significantly improves the perceptual quality of 3D human-object reconstruction.

Soon Yau Cheong, Armin Mustafa, Andrew Gilbert (2025)ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet, In: A DelBue, C Canton, J Pont-Tuset, T Tommasi (eds.), COMPUTER VISION-ECCV 2024 WORKSHOPS, PT I15623pp. 267-285 Springer Nature

DOI: 10.1007/978-3-031-91569-7_17

This paper introduces ViscoNet, a novel one-branch-adapter architecture for concurrent spatial and visual conditioning. Our lightweight model requires trainable parameters and dataset size multiple orders of magnitude smaller than the current state-of-the-art IP-Adapter. However, our method successfully preserves the generative power of the frozen text-to-image (T2I) backbone. Notably, it excels in addressing mode collapse, a pervasive issue previously overlooked. Our novel architecture demonstrates outstanding capabilities in achieving a harmonious visual-text balance, unlocking unparalleled versatility in various human image generation tasks, including pose re-targeting, virtual try-on, stylization, person re-identification, and textile transfer. Demo and code are available from project page https://soon- yau.github.io/visconet/.

Tony Alex, Sara Atito Ali Ahmed, Armin Mustafa, Muhammad Awais, Philip J B Jackson (2025)SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes, In: ICLR 2025 - The Thirteenth International Conference on Learning Representations - Proceedings ICLR

DOI: 10.48550/arXiv.2506.12222

Self-supervised pre-trained audio networks have seen widespread adoption in real-world systems, particularly in multi-modal large language models. These networks are often employed in a frozen state, under the assumption that the SSL pre-training has sufficiently equipped them to handle real-world audio. However, a critical question remains: how well do these models actually perform in real-world conditions, where audio is typically polyphonic and complex, involving multiple overlapping sound sources? Current audio SSL methods are often benchmarked on datasets predominantly featuring monophonic audio, such as environmental sounds, and speech. As a result, the ability of SSL models to generalize to polyphonic audio, a common characteristic in natural scenarios, remains underexplored. This limitation raises concerns about the practical robustness of SSL models in more realistic audio settings. To address this gap, we introduce Self-Supervised Learning from Audio Mixtures (SSLAM), a novel direction in audio SSL research, designed to improve, designed to improve the model's ability to learn from polyphonic data while maintaining strong performance on monophonic data. We thoroughly evaluate SSLAM on standard audio SSL benchmark datasets which are predominantly monophonic and conduct a comprehensive comparative analysis against SOTA methods using a range of high-quality, publicly available polyphonic datasets. SSLAM not only improves model performance on polyphonic audio, but also maintains or exceeds performance on standard audio SSL benchmarks. Notably, it achieves up to a 3.9\% improvement on the AudioSet-2M (AS-2M), reaching a mean average precision (mAP) of 50.2. For polyphonic datasets, SSLAM sets new SOTA in both linear evaluation and fine-tuning regimes with performance improvements of up to 9.1\% (mAP).

Davide Berghi, Craig Cieciura, Farshad Einabadi, Maxine Glancy, Oliver Charles Camilleri, Philip Anthony Foster, Asmar Nadeem, Faegheh Sardari, Jinzheng Zhao, Marco Volino, Armin Mustafa, Philip J B Jackson, Adrian Hilton (2024)ForecasterFlexOBM: A Multi-View Audio-Visual Dataset for Flexible Object-Based Media Production, In: ForecasterFlexOBM: A multi-view audio-visual dataset for flexible object-based media production IEEE

DOI: 10.1109/ICME57554.2024.10687655

Leveraging machine learning techniques, in the context of object-based media production, could enable provision of personalized media experiences to diverse audiences. To fine-tune and evaluate techniques for personalization applications, as well as more broadly, datasets which bridge the gap between research and production are needed. We introduce and publicly release such a dataset, themed around a UK weather forecast and shot against a blue-screen background, of three professional actors/presenters – one male and one female (English) and one female (British Sign Language). Scenes include both production and research-oriented examples, with a range of dialogue, motions, and actions. Capture techniques consisted of a synchronized 4K resolution 16-camera array, production-typical microphones plus professional audio mix, a 16-channel microphone array with collocated Grasshopper3 camera, and a photogrammetry array. We demonstrate applications relevant to virtual production and creation of personalized media including neural radiance fields, shadow casting, action/event detection, speaker source tracking and video captioning.

Nikolina Kubiak, Armin Mustafa, Graeme Phillipson, Stephen Jolly, Simon J Hadfield (2024)S3R-Net: A Single-Stage Approach to Self-Supervised Shadow Removal, In: Proceedings of the 2024 IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR 2024) Institute of Electrical and Electronics Engineers (IEEE)

In this paper we present S3R-Net, the Self-Supervised Shadow Removal Network. The two-branch WGAN model achieves self-supervision relying on the unify-and-adaptphenomenon - it unifies the style of the output data and infers its characteristics from a database of unaligned shadow-free reference images. This approach stands in contrast to the large body of supervised frameworks. S3R-Net also differentiates itself from the few existing self-supervised models operating in a cycle-consistent manner, as it is a non-cyclic, unidirectional solution. The proposed framework achieves comparable numerical scores to recent selfsupervised shadow removal models while exhibiting superior qualitative performance and keeping the computational cost low. Code & pretrained models are available at https://github.com/n-kubiak/S3R-Net

Soon Yau Cheong, Duygu Ceylan, Armin Mustafa, Andrew Gilbert, Chun-Hao Huang (2025)Boosting Camera Motion Control for Video Diffusion Transformers

Despite recent advancements in camera control methods for U-Net based video diffusion models, these methods have been shown to be ineffective for transformer-based diffusion models (DiT). In this paper, we investigate the underlying causes of this issue and propose solutions. Our study reveals that camera control performance depends heavily on the choice of conditioning methods, rather than on camera pose representations , as is commonly believed. To address the persistent motion degradation in DiT, we introduce Camera Motion Guidance (CMG), a classifier-free guidance approach that boosts camera motion by over 400%. Additionally, we present a sparse camera control pipeline that improves training data efficiency and simplifies the process of specifying camera poses for long videos. Project page at https://soon-yau.github.io/ CameraMotionGuidance.

Tony Alex, Sara Ahmed, Armin Mustafa, Muhammad Awais, Philip JB Jackson (2024)Max-AST: Combining Convolution, Local and Global Self-Attentions for Audio Event Classification, In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)pp. 1061-1065 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP48485.2024.10447697

In the domain of audio transformer architectures, prior research has extensively investigated isotropic architectures that capture the global context through full self-attention and hierarchical architectures that progressively transition from local to global context utilising hierarchical structures with convolutions or window-based attention. However, the idea of imbuing each individual block with both local and global contexts, thereby creating a hybrid transformer block, remains relatively under-explored in the field.To facilitate this exploration, we introduce Multi Axis Audio Spectrogram Transformer (Max-AST), an adaptation of MaxViT to the audio domain. Our approach leverages convolution, local window-attention, and global grid-attention in all the transformer blocks. The proposed model excels in efficiency compared to prior methods and consistently outperforms state-of-the-art techniques, achieving significant gains of up to 2.6% on the AudioSet full set. Further, we performed detailed ablations to analyse the impact of each of these components on audio feature learning. The source code is available at https://github.com/ta012/MaxAST.git

Tony Alex, Sara Ahmed, Armin Mustafa, Muhammad Awais, Philip J. B. Jackson (2024)DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification, In: AAAI'24/IAAI'24/EAAI'24: Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence38(16)1968pp. 17647-17655 AAAI Press

DOI: 10.1609/aaai.v38i16.29716

Convolutional neural networks (CNNs) and Transformer-based networks have recently enjoyed significant attention for various audio classification and tagging tasks following their wide adoption in the computer vision domain. Despite the difference in information distribution between audio spectrograms and natural images, there has been limited exploration of effective information retrieval from spectrograms using domain-specific layers tailored for the audio domain. In this paper, we leverage the power of the Multi-Axis Vision Transformer (MaxViT) to create DTF-AT (Decoupled Time-Frequency Audio Transformer) that facilitates interactions across time, frequency, spatial, and channel dimensions. The proposed DTF-AT architecture is rigorously evaluated across diverse audio and speech classification tasks, consistently establishing new benchmarks for state-of-the-art (SOTA) performance. Notably, on the challenging AudioSet 2M classification task, our approach demonstrates a substantial improvement of 4.4% when the model is trained from scratch and 3.2% when the model is initialised from ImageNet-1K pre-trained weights. In addition, we present comprehensive ablation studies to investigate the impact and efficacy of our proposed approach. The codebase and pretrained weights are available on https://github.com/ta012/DTFAT.git

Soon Yau Cheong, Armin Mustafa, Andrew Gilbert (2022)KPE: Keypoint Pose Encoding for Transformer-based Image Generation, In: The 33rd British Machine Vision Conference Proceedings

DOI: 10.48550/arXiv.2203.04907

Transformers have recently been shown to generate high quality images from text input. However, the existing method of pose conditioning using skeleton image tokens is computationally inefficient and generate low quality images. Therefore we propose a new method; Keypoint Pose Encoding (KPE); KPE is 10× more memory efficient and over 73% faster at generating high quality images from text input conditioned on the pose. The pose constraint improves the image quality and reduces errors on body extremities such as arms and legs. The additional benefits include invariance to changes in the target image domain and image resolution, making it easily scalable to higher resolution images. We demonstrate the versatility of KPE by generating photorealistic multiperson images derived from the DeepFashion dataset [1].We also introduce a evaluation method People Count Error (PCE) that is effective in detecting error in generated human images. (a) (b) Figure 1: (a) Our pose constrained text-to-image model supports partial and full pose view, multiple people, different genders, at different scales. (b) The Architectural diagram of our pose-guided text-to-image generation model. The text, pose keypoints and image are encoded into tokens and go into an transformer. *The target image encoding section is required only for training and is not needed in inference.

Nikolina Kubiak, Armin Mustafa, Graeme Phillipson, Stephen Jolly, Simon Hadfield (2021)SILT: Self-supervised Lighting Transfer Using Implicit Image Decomposition

DOI: 10.48550/arXiv.2110.12914

We present SILT, a Self-supervised Implicit Lighting Transfer method. Unlike previous research on scene relighting, we do not seek to apply arbitrary new lighting configurations to a given scene. Instead, we wish to transfer the lighting style from a database of other scenes, to provide a uniform lighting style regardless of the input. The solution operates as a two-branch network that first aims to map input images of any arbitrary lighting style to a unified domain, with extra guidance achieved through implicit image decomposition. We then remap this unified input domain using a discriminator that is presented with the generated outputs and the style reference, i.e. images of the desired illumination conditions. Our method is shown to outperform supervised relighting solutions across two different datasets without requiring lighting supervision.

Soon Yau Cheong, Armin Mustafa, Andrew Gilbert UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer

DOI: 10.48550/arxiv.2304.08870

Existing person image generative models can do either image generation or pose transfer but not both. We propose a unified diffusion model, UPGPT to provide a universal solution to perform all the person image tasks - generative, pose transfer, and editing. With fine-grained multimodality and disentanglement capabilities, our approach offers fine-grained control over the generation and the editing process of images using a combination of pose, text, and image, all without needing a semantic segmentation mask which can be challenging to obtain or edit. We also pioneer the parameterized body SMPL model in pose-guided person image generation to demonstrate new capability - simultaneous pose and camera view interpolation while maintaining a person's appearance. Results on the benchmark DeepFashion dataset show that UPGPT is the new state-of-the-art while simultaneously pioneering new capabilities of edit and pose transfer in human image generation.

Faegheh Sardari, Armin Mustafa, Philip J B Jackson, Adrian Hilton (2024)CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing, In: Computer Vision – ECCV 2024. Lecture Notes in Computer Science15069

Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario, it negatively impacts unaligned audible or visible events by introducing irrelevant modality information. In this paper, we propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space such that the network explicitly learns to combine cross-modal information for audible-visible events while filtering them out for unaligned events. Additionally, as videos often involve complex class relationships, modelling them improves performance. However, this introduces extra computational costs into the network. Our framework is designed to leverage cross-class relationships during training without incurring additional computations at inference. Furthermore, we propose new metrics to better evaluate a method’s capabilities in performing AVVP. Our extensive experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, respectively.

Soon Yau Cheong, Armin Mustafa, Andrew Gilbert (2023)UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer, In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)pp. 4175-4184 IEEE

DOI: 10.1109/ICCVW60793.2023.00451

Text-to-image models (T2I) such as StableDiffusion have been used to generate high quality images of people. However, due to the random nature of the generation process, the person has a different appearance e.g. pose, face, and clothing, despite using the same text prompt. The appearance inconsistency makes T2I unsuitable for pose transfer. We address this by proposing a multimodal diffusion model that accepts text, pose, and visual prompting. Our model is the first unified method to perform all person image tasks-generation, pose transfer, and mask-less edit. We also pioneer using small dimensional 3D body model parameters directly to demonstrate new capability - simultaneous pose and camera view interpolation while maintaining the person's appearance.

Davide Berghi, Craig Cieciura, Farshad Einabadi, Maxine Glancy, Oliver Charles Camilleri, Philip Anthony Foster, Asmar Nadeem, Faegheh Sardari, Jinzheng Zhao, Marco Volino, Armin Mustafa, Philip J B Jackson, Adrian Douglas Mark Hilton ForecasterFlexOBM: A multi-view audio-visual dataset for flexible object-based media production, In: ForecasterFlexOBM: A Multi-View Audio-Visual Dataset for Flexible Object-Based Media Production University of Surrey

DOI: 10.15126/surreydata.900912

Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa (2024)CAD - Contextual Multi-modal Alignment for Dynamic AVQA, In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/WACV57701.2024.00709

In the context of Audio Visual Question Answering (AVQA) tasks, the audio and visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings; the audiovisual (AV) information passing through the network isn't aligned on Spatial and Temporal levels; and, intermodal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on Temporal level in a self-supervised setting , and iii) introducing a cross-attention mechanism to balance audio and visual information on Semantic level. The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to the existing methods to improve their performance without additional complexity requirements.

Mahrukh Awan, Asmar Nadeem, Muhammad Junaid Awan, Armin Mustafa, Syed Sameed Husain Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification

Exploiting both audio and visual modalities for video classification is a challenging task, as the existing methods require large model architectures, leading to high computational complexity and resource requirements. Smaller architectures, on the other hand, struggle to achieve optimal performance. In this paper, we propose Attend Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64% with only 72M parameters, which is comparable to the performance of larger baseline models such as Fully-Connected Late Fusion (75.96% F1 score, 341M parameters). Attend-Fusion achieves similar performance to the larger baseline model while reducing the model size by nearly 80%, highlighting its efficiency in terms of model complexity. Our work demonstrates that the Attend-Fusion model effectively combines audio and visual information for video classification, achieving competitive performance with significantly reduced model size. This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.

Jack Oliver Hilliard, Adrian Hilton, Jean-Yves Guillemaut (2023)HDR Illumination Outpainting with a Two-Stage GAN Model, In: Marco Volino, Armin Mustafa, Peter Vangorp (eds.), Proceedings of the 20th ACM SIGGRAPH European Conference on Visual Media Production1pp. 1-9 ACM

DOI: 10.1145/3626495.3626510

In this paper we present a method for single-view illumination estimation of indoor scenes, using image-based lighting, that incorporates state-of-the-art outpainting methods. Recent advancements in illumination estimation have focused on improving the detail of the generated environment map so it can realistically light mirror reflective surfaces. These generated maps often include artefacts at the borders of the image where the panorama wraps around. In this work we make the key observation that inferring the panoramic HDR illumination of a scene from a limited field of view LDR input can be framed as an outpainting problem (whereby the original image must be expanded beyond its original borders). We incorporate two key techniques used in outpainting tasks: i) separating the generation into multiple networks (a diffuse lighting network and a high-frequency detail network) to reduce the amount to be learnt by a single network, ii) utilising an inside-out method of processing the input image to reduce the border artefacts. Further to incorporating these outpainting methods we also introduce circular padding before the network to help remove the border artefacts. Results show the proposed approach is able to relight diffuse, specular and mirror surfaces more accurately than existing methods in terms of the position of the light sources and pixelwise accuracy, whilst also reducing the artefacts produced at the borders of the panorama.

Marco Volino, Armin Mustafa, Jean-Yves Guillemaut, Adrian Hilton (2019)Light Field Compression using Eigen Textures, In: 2019 INTERNATIONAL CONFERENCE ON 3D VISION (3DV 2019)pp. 482-490 IEEE

DOI: 10.1109/3DV.2019.00060

Light fields are becoming an increasingly popular method of digital content production for visual effects and virtual/augmented reality as they capture a view dependent representation enabling photo realistic rendering over a range of viewpoints. Light field video is generally captured using arrays of cameras resulting in tens to hundreds of images of a scene at each time instance. An open problem is how to efficiently represent the data preserving the view-dependent detail of the surface in such a way that is compact to store and efficient to render. In this paper we show that constructing an Eigen texture basis representation from the light field using an approximate 3D surface reconstruction as a geometric proxy provides a compact representation that maintains view-dependent realism. We demonstrate that the proposed method is able to reduce storage requirements by > 95% while maintaining the visual quality of the captured data. An efficient view-dependent rendering technique is also proposed which is performed in eigen space allowing smooth continuous viewpoint interpolation through the light field.

Soon Yau Cheong, Armin Mustafa, Andrew Gilbert ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet

This paper introduces ViscoNet, a novel method that enhances text-to-image human generation models with visual prompting. Unlike existing methods that rely on lengthy text descriptions to control the image structure, ViscoNet allows users to specify the visual appearance of the target object with a reference image. ViscoNet disentangles the object's appearance from the image background and injects it into a pre-trained latent diffusion model (LDM) model via a ControlNet branch. This way, ViscoNet mitigates the style mode collapse problem and enables precise and flexible visual control. We demonstrate the effectiveness of ViscoNet on human image generation, where it can manipulate visual attributes and artistic styles with text and image prompts. We also show that ViscoNet can learn visual conditioning from small and specific object domains while preserving the generative power of the LDM backbone.

Armin Mustafa, Adrian Hilton (2019)Semantically Coherent 4D Scene Flow of Dynamic Scenes, In: International Journal of Computer Vision Springer Verlag

DOI: 10.1007/s11263-019-01241-w

Simultaneous semantically coherent object-based long-term 4D scene flow estimation, co-segmentation and reconstruction is proposed exploiting the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. In this paper we propose a framework for spatially and temporally coherent semantic 4D scene flow of general dynamic scenes from multiple view videos captured with a network of static or moving cameras. Semantic coherence results in improved 4D scene flow estimation, segmentation and reconstruction for complex dynamic scenes. Semantic tracklets are introduced to robustly initialize the scene flow in the joint estimation and enforce temporal coherence in 4D flow, semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of long-term flow, appearance and shape priors that are exploited in semantically coherent 4D scene flow estimation, co-segmentation and reconstruction. Comprehensive performance evaluation against state-of-the-art techniques on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in 4D scene flow, segmentation, temporally coherent semantic labelling, and reconstruction of dynamic scenes.

Armin Mustafa, Adrian Hilton (2017)Semantically Coherent Co-segmentation and Reconstruction of Dynamic Scenes, In: CVPR 2017 Proceedingspp. 5583-5592 IEEE

DOI: 10.1109/CVPR.2017.592

In this paper we propose a framework for spatially and temporally coherent semantic co-segmentation and reconstruction of complex dynamic scenes from multiple static or moving cameras. Semantic co-segmentation exploits the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. We demonstrate that semantic coherence results in improved segmentation and reconstruction for complex scenes. A joint formulation is proposed for semantically coherent object-based co-segmentation and reconstruction of scenes by enforcing consistent semantic labelling between views and over time. Semantic tracklets are introduced to enforce temporal coherence in semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of appearance and shape priors that are exploited in joint segmentation and reconstruction. Evaluation on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in segmentation, temporally coherent semantic labelling and 3D reconstruction of dynamic scenes.

Ben-Zion A. Rubshtein, Genady Ya Grabarnik, Mustafa A. Muratov, Yulia S. Pashkova, Armin Mustafa (2016)Foundations of Symmetric Spaces of Measurable Functions Lorentz, Marcinkiewicz and Orlicz Spaces Prefacepp. IX-X Springer Nature

Soon Yau Cheong, Andrew Gilbert, Armin Mustafa (2023)UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer

Text-to-image models (T2I) such as StableDiffusion have been used to generate high quality images of people. However , due to the random nature of the generation process, the person has a different appearance e.g. pose, face, and clothing, despite using the same text prompt. The appearance inconsistency makes T2I unsuitable for pose transfer. We address this by proposing a multimodal diffusion model that accepts text, pose, and visual prompting. Our model is the first unified method to perform all person image tasks-generation, pose transfer, and mask-less edit. We also pioneer using small dimensional 3D body model parameters directly to demonstrate new capability-simultaneous pose and camera view interpolation while maintaining the per-son's appearance.

Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Annin Mustafa (2023)SEM-POS: Grammatically and Semantically Correct Video Captioning, In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)pp. 2606-2616 IEEE

DOI: 10.1109/CVPRW59228.2023.00259

Generating grammatically and semantically correct captions in video captioning is a challenging task.. The captions generated from the existing methods are either word-by-word that do not align with grammatical structure or miss key information from the input videos. To address these issues, we introduce a novel global-local fusion network, with a Global-Local Fusion Block (GLFB) that encodes and fuses features from different parts of speech (POS) components with visual-spatial features. We use novel combinations of different POS components - 'determinant + subject', 'auxiliary verb', 'verb', and 'determinant + object' for supervision of the POS blocks - Det + Subject, Aux Verb, Verb, and Det + Object respectively. The novel global-local fusion network together with POS blocks helps align the visual features with language description to generate grammatically and semantically correct captions. Extensive qualitative and quantitative experiments on benchmark MSVD and MSRVTT datasets demonstrate that the proposed approach generates more grammatically and semantically correct captions compared to the existing methods, achieving the new state-of-the-art. Ablations on the POS blocks and the GLFB demonstrate the impact of the contributions on the proposed method.

A Mustafa, H Kim, J-Y Guillemaut, ADM Hilton (2016)Temporally coherent 4D reconstruction of complex dynamic scenes, In: CVPR 2016 Proceedings

This paper presents an approach for reconstruction of 4D temporally coherent models of complex dynamic scenes. No prior knowledge is required of scene structure or camera calibration allowing reconstruction from multiple moving cameras. Sparse-to-dense temporal correspondence is integrated with joint multi-view segmentation and reconstruction to obtain a complete 4D representation of static and dynamic objects. Temporal coherence is exploited to overcome visual ambiguities resulting in improved reconstruction of complex scenes. Robust joint segmentation and reconstruction of dynamic objects is achieved by introducing a geodesic star convexity constraint. Comparative evaluation is performed on a variety of unstructured indoor and outdoor dynamic scenes with hand-held cameras and multiple people. This demonstrates reconstruction of complete temporally coherent 4D scene models with improved nonrigid object segmentation and shape reconstruction.

Rafał Mantiuk, Christian Richardt, Marco Volino, Armin Mustafa (2021)Proceedings CVMP 2021: 18th ACM SIGGRAPH European Conference on Visual Media Production The Association for Computing Machinery

Quang-Hieu Pham, Pierre Sevestre, Ramanpreet Singh Pahwa, Huijing Zhan, Chun Ho Pang, Yuda Chen, Armin Mustafa, Vijay Chandrasekhar, Jie Lin (2020)A 3D Dataset: Towards Autonomous Driving in Challenging Environments, In: 2020 IEEE International Conference on Robotics and Automation (ICRA)pp. 2267-2273 IEEE

DOI: 10.1109/ICRA40945.2020.9197385

With the increasing global popularity of self-driving cars, there is an immediate need for challenging real-world datasets for benchmarking and training various computer vision tasks such as 3D object detection. Existing datasets either represent simple scenarios or provide only day-time data. In this paper, we introduce a new challenging A*3D dataset which consists of RGB images and LiDAR data with a significant diversity of scene, time, and weather. The dataset consists of high-density images (≈ 10 times more than the pioneering KITTI dataset), heavy occlusions, a large number of nighttime frames (≈ 3 times the nuScenes dataset), addressing the gaps in the existing datasets to push the boundaries of tasks in autonomous driving research to more challenging highly diverse environments. The dataset contains 39K frames, 7 classes, and 230K 3D object annotations. An extensive 3D object detection benchmark evaluation on the A*3D dataset for various attributes such as high density, day-time/night-time, gives interesting insights into the advantages and limitations of training and testing 3D object detection in real-world setting.

Armin Mustafa, Christopher Russell, Adrian Hilton (2019)U4D: Unsupervised 4D Dynamic Scene Understanding, In: Proceedings 2019 IEEE International Conference on Computer Vision IEEE

DOI: 10.1109/ICCV.2019.01052

We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (≈ 40%) improvement in semantic segmentation, reconstruction and scene flow accuracy.

Armin Mustafa, Hansung Kim, Adrian Hilton (2018)MSFD: Multi-scale segmentation based feature detection for wide-baseline scene reconstruction, In: IEEE Transactions on Image Processing28(3)pp. 1118-1132 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TIP.2018.2872906

A common problem in wide-baseline matching is the sparse and non-uniform distribution of correspondences when using conventional detectors such as SIFT, SURF, FAST, A-KAZE and MSER. In this paper we introduce a novel segmentation based feature detector (SFD) that produces an increased number of accurate features for wide-baseline matching. A multi-scale SFD is proposed using bilateral image decomposition to produce a large number of scale-invariant features for wide-baseline reconstruction. All input images are over-segmented into regions using any existing segmentation technique like Watershed, Mean-shift, and SLIC. Feature points are then detected at the intersection of the boundaries of three or more regions. The detected feature points are local maxima of the image function. The key advantage of feature detection based on segmentation is that it does not require global threshold setting and can therefore detect features throughout the image. A comprehensive evaluation demonstrates that SFD gives an increased number of features which are accurately localised and matched between wide-baseline camera views; the number of features for a given matching error increases by a factor of 3-5 compared to SIFT; feature detection and matching performance is maintained with increasing baseline between views; multi-scale SFD improves matching performance at varying scales. Application of SFD to sparse multi-view wide-baseline reconstruction demonstrates a factor of ten increase in the number of reconstructed points with improved scene coverage compared to SIFT/MSER/A-KAZE. Evaluation against ground-truth shows that SFD produces an increased number of wide-baseline matches with reduced error.

Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton (2023)PAT: Position-Aware Transformer for Dense Multi-Label Action Detection, In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)pp. 2980-2989 IEEE

DOI: 10.1109/ICCVW60793.2023.00321

We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to the recent transformer-based approaches that use a hierarchical structure. We argue that joining the self-attention mechanism with multiple sub-sampling processes in the hierarchical approaches results in increased loss of positional information. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets, and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, respectively, thereby achieving the new state-of-the-art mAP at 26.5% and 44.6%, respectively. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.

Akin Caliskan, Armin Mustafa, Adrian Hilton Temporal Consistency Loss for High Resolution Textured and Clothed 3DHuman Reconstruction from Monocular Video

DOI: 10.48550/arxiv.2104.09259

We present a novel method to learn temporally consistent 3D reconstruction of clothed people from a monocular video. Recent methods for 3D human reconstruction from monocular video using volumetric, implicit or parametric human shape models, produce per frame reconstructions giving temporally inconsistent output and limited performance when applied to video. In this paper, we introduce an approach to learn temporally consistent features for textured reconstruction of clothed 3D human sequences from monocular video by proposing two advances: a novel temporal consistency loss function; and hybrid representation learning for implicit 3D reconstruction from 2D images and coarse 3D geometry. The proposed advances improve the temporal consistency and accuracy of both the 3D reconstruction and texture prediction from a monocular video. Comprehensive comparative performance evaluation on images of people demonstrates that the proposed method significantly outperforms the state-of-the-art learning-based single image 3D human shape estimation approaches achieving significant improvement of reconstruction accuracy, completeness, quality and temporal consistency.

Nikolina Kubiak, Elliot Wortman, Armin Mustafa, Graeme Phillipson, Stephen Jolly, Simon Hadfield (2024)RenDetNet: Weakly-supervised Shadow Detection with Shadow Caster Verification

Existing shadow detection models struggle to differentiate dark image areas from shadows. In this paper, we tackle this issue by verifying that all detected shadows are real, i.e. they have paired shadow casters. We perform this step in a physically-accurate manner by dif-ferentiably re-rendering the scene and observing the changes stemming from carving out estimated shadow casters. Thanks to this approach, the RenDetNet proposed in this paper is the first learning-based shadow detection model whose supervisory signals can be computed in a self-supervised manner. The developed system compares favourably against recent models trained on our data. As part of this publication, we release our code on github.

Faegheh Sardari, Armin Mustafa, Philip J. B Jackson, Adrian Hilton PAT: Position-Aware Transformer for Dense Multi-Label Action Detection

DOI: 10.48550/arxiv.2308.05051

We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non hierarchical network, in contrast to the recent transformer-based approaches that use a hierarchical structure. We argue that joining the self-attention mechanism with multiple sub-sampling processes in the hierarchical approaches results in increased loss of positional information. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets, and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, respectively, thereby achieving the new state-of-the-art mAP at 26.5% and 44.6%, respectively. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.

A Mustafa, H Kim, H Imre, A Hilton (2015)Segmentation based features for wide-baseline multi-view reconstruction, In: International Conference on 3D Vision (3DV)pp. 282-290

DOI: 10.1109/3DV.2015.39

A common problem in wide-baseline stereo is the sparse and non-uniform distribution of correspondences when using conventional detectors such as SIFT, SURF, FAST and MSER. In this paper we introduce a novel segmentation based feature detector SFD that produces an increased number of ‘good’ features for accurate wide-baseline reconstruction. Each image is segmented into regions by over-segmentation and feature points are detected at the intersection of the boundaries for three or more regions. Segmentation-based feature detection locates features at local maxima giving a relatively large number of feature points which are consistently detected across wide-baseline views and accurately localised. A comprehensive comparative performance evaluation with previous feature detection approaches demonstrates that: SFD produces a large number of features with increased scene coverage; detected features are consistent across wide-baseline views for images of a variety of indoor and outdoor scenes; and the number of wide-baseline matches is increased by an order of magnitude compared to alternative detector-descriptor combinations. Sparse scene reconstruction from multiple wide-baseline stereo views using the SFD feature detector demonstrates at least a factor six increase in the number of reconstructed points with reduced error distribution compared to SIFT when evaluated against ground-truth and similar computational cost to SURF/FAST.

Stephanie Stoll, Armin Mustafa, Jean-Yves Guillemaut (2022)There and Back Again: 3D Sign Language Generation from Text Using Back-Translation, In: 2022 INTERNATIONAL CONFERENCE ON 3D VISION, 3DVpp. 187-196 IEEE

DOI: 10.1109/3DV57658.2022.00031

We introduce the first method to automatically generate 3D mesh sequences from text, inspired by the challenging problem of Sign Language Production (SLP). The approach only requires simple 2D annotations for training, which can be automatically extracted from video. Rather than incorporating high-definition or motion capture data, we propose back-translation as a powerful paradigm for supervision: By first addressing the arguably simpler problem of translating 2D pose sequences to text, we can leverage this to drive a transformer-based architecture to translate text to 2D poses. These are then used to drive a 3D mesh generator. Our mesh generator Pose2Mesh uses temporal information, to enforce temporal coherence and significantly reduce processing time. The approach is evaluated by generating 2D pose, and 3D mesh sequences in DGS (German Sign Language) from German language sentences. An extensive analysis of the approach and its sub-networks is conducted, reporting BLEU and ROUGE scores, as well as Mean 2D Joint Distance. Our proposed Text2Pose model outperforms the current state-of-the-art in SLP, and we establish the first benchmark for the complex task of text-to-3D-mesh-sequence generation with our Text2Mesh model.

Marco Volino, Armin Mustafa, Jean-Yves Guillemaut, Adrian Hilton (2020)Light Field Video for Immersive Content Production, In: Real VR – Immersive Digital Realitypp. 33-64 Springer International Publishing

DOI: 10.1007/978-3-030-41816-8_2

Light field video for content production is gaining both research and commercial interest as it has the potential to push the level of immersion for augmented and virtual reality to a close-to-reality experience. Light fields densely sample the viewing space of an object or scene using hundreds or even thousands of images with small displacements in between. However, a lack of standardised formats for compression, storage and transmission, along with the lack of tools to enable editing of light field data currently make it impractical for use in real-world content production. In this chapter we address two fundamental problems with light field data, namely representation and compression. Firstly we propose a method to obtain a 4D temporally coherent representation from the input light field video. This is an essential problem to solve that will enable efficient compression editing. Secondly, we present a method for compression of light field data based on the eigen texture method that provides a compact representation and enables efficient view-dependent rendering at interactive frame rates. These approaches achieve an order of magnitude compression and temporally consistent representation that are important steps towards practical toolsets for light field video content production.

Armin Mustafa, Marco Volino, Hansung Kim, Jean-Yves Guillemaut, Adrian Hilton (2020)Temporally coherent general dynamic scene reconstruction, In: International Journal of Computer Vision Springer

DOI: 10.1007/s11263-020-01367-2

Existing techniques for dynamic scene re- construction from multiple wide-baseline cameras pri- marily focus on reconstruction in controlled environ- ments, with fixed calibrated cameras and strong prior constraints. This paper introduces a general approach to obtain a 4D representation of complex dynamic scenes from multi-view wide-baseline static or moving cam- eras without prior knowledge of the scene structure, ap- pearance, or illumination. Contributions of the work are: An automatic method for initial coarse reconstruc- tion to initialize joint estimation; Sparse-to-dense tem- poral correspondence integrated with joint multi-view segmentation and reconstruction to introduce tempo- ral coherence; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes by introducing shape constraint. Com- parison with state-of-the-art approaches on a variety of complex indoor and outdoor scenes, demonstrates im- proved accuracy in both multi-view segmentation and dense reconstruction. This paper demonstrates unsuper- vised reconstruction of complete temporally coherent 4D scene models with improved non-rigid object seg- mentation and shape reconstruction and its application to various applications such as free-view rendering and virtual reality.

Armin Mustafa, Hansung Kim, Adrian Hilton (2016)4D Match Trees for Non-rigid Surface Alignment, In: Computer Vision – ECCV 2016 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I9905(1)pp. 213-229

DOI: 10.1007/978-3-319-46448-0_13

This paper presents a method for dense 4D temporal alignment of partial reconstructions of non-rigid surfaces observed from single or multiple moving cameras of complex scenes. 4D Match Trees are introduced for robust global alignment of non-rigid shape based on the similarity between images across sequences and views. Wide-timeframe sparse correspondence between arbitrary pairs of images is established using a segmentation-based feature detector (SFD) which is demonstrated to give improved matching of non-rigid shape. Sparse SFD correspondence allows the similarity between any pair of image frames to be estimated for moving cameras and multiple views. This enables the 4D Match Tree to be constructed which minimises the observed change in non-rigid shape for global alignment across all images. Dense 4D temporal correspondence across all frames is then estimated by traversing the 4D Match tree using optical flow initialised from the sparse feature matches. The approach is evaluated on single and multiple view images sequences for alignment of partial surface reconstructions of dynamic objects in complex indoor and outdoor scenes to obtain a temporally consistent 4D representation. Comparison to previous 2D and 3D scene flow demonstrates that 4D Match Trees achieve reduced errors due to drift and improved robustness to large non-rigid deformations.

Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa CAD -- Contextual Multi-modal Alignment for Dynamic AVQA

DOI: 10.48550/arxiv.2310.16754

In the context of Audio Visual Question Answering (AVQA) tasks, the audio visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings; the audio-visual (AV) information passing through the network isn't aligned on Spatial and Temporal levels; and, inter-modal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on Temporal level in a self-supervised setting, and iii) introducing a cross-attention mechanism to balance audio and visual information on Semantic level. The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to the existing methods to improve their performance without additional complexity requirements.

Mertalp Ocal, Armin Mustafa RealMonoDepth: Self-Supervised Monocular Depth Estimation for General Scenes

DOI: 10.48550/arxiv.2004.06267

We present a generalised self-supervised learning approach for monocular estimation of the real depth across scenes with diverse depth ranges from 1--100s of meters. Existing supervised methods for monocular depth estimation require accurate depth measurements for training. This limitation has led to the introduction of self-supervised methods that are trained on stereo image pairs with a fixed camera baseline to estimate disparity which is transformed to depth given known calibration. Self-supervised approaches have demonstrated impressive results but do not generalise to scenes with different depth ranges or camera baselines. In this paper, we introduce RealMonoDepth a self-supervised monocular depth estimation approach which learns to estimate the real scene depth for a diverse range of indoor and outdoor scenes. A novel loss function with respect to the true scene depth based on relative depth scaling and warping is proposed. This allows self-supervised training of a single network with multiple data sets for scenes with diverse depth ranges from both stereo pair and in the wild moving camera data sets. A comprehensive performance evaluation across five benchmark data sets demonstrates that RealMonoDepth provides a single trained network which generalises depth estimation across indoor and outdoor scenes, consistently outperforming previous self-supervised approaches.

Armin Mustafa, Marco Volino, Jean-Yves Guillemaut, Adrian Hilton (2018)4D Temporally Coherent Light-field Video, In: 3DV 2017 Proceedings IEEE

DOI: 10.1109/3DV.2017.00014

Light-field video has recently been used in virtual and augmented reality applications to increase realism and immersion. However, existing light-field methods are generally limited to static scenes due to the requirement to acquire a dense scene representation. The large amount of data and the absence of methods to infer temporal coherence pose major challenges in storage, compression and editing compared to conventional video. In this paper, we propose the first method to extract a spatio-temporally coherent light-field video representation. A novel method to obtain Epipolar Plane Images (EPIs) from a spare lightfield camera array is proposed. EPIs are used to constrain scene flow estimation to obtain 4D temporally coherent representations of dynamic light-fields. Temporal coherence is achieved on a variety of light-field datasets. Evaluation of the proposed light-field scene flow against existing multiview dense correspondence approaches demonstrates a significant improvement in accuracy of temporal coherence.

Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa SEM-POS: Grammatically and Semantically Correct Video Captioning

DOI: 10.48550/arxiv.2303.14829

Generating grammatically and semantically correct captions in video captioning is a challenging task. The captions generated from the existing methods are either word-by-word that do not align with grammatical structure or miss key information from the input videos. To address these issues, we introduce a novel global-local fusion network, with a Global-Local Fusion Block (GLFB) that encodes and fuses features from different parts of speech (POS) components with visual-spatial features. We use novel combinations of different POS components - 'determinant + subject', 'auxiliary verb', 'verb', and 'determinant + object' for supervision of the POS blocks - Det + Subject, Aux Verb, Verb, and Det + Object respectively. The novel global-local fusion network together with POS blocks helps align the visual features with language description to generate grammatically and semantically correct captions. Extensive qualitative and quantitative experiments on benchmark MSVD and MSRVTT datasets demonstrate that the proposed approach generates more grammatically and semantically correct captions compared to the existing methods, achieving the new state-of-the-art. Ablations on the POS blocks and the GLFB demonstrate the impact of the contributions on the proposed method.

Armin Mustafa, Akin Caliskan, Lourdes Agapito, Adrian Hilton (2021)Multi-person Implicit Reconstruction from a Single Image, In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)pp. 14469-14478 IEEE

DOI: 10.1109/CVPR46437.2021.01424

We present a new end-to-end learning framework to obtain detailed and spatially coherent reconstructions of multiple people from a single image. Existing multi-person methods suffer from two main drawbacks: they are often model-based and therefore cannot capture accurate 3D models of people with loose clothing and hair; or they require manual intervention to resolve occlusions or interactions. Our method addresses both limitations by introducing the first end-to-end learning approach to perform model-free implicit reconstruction for realistic 3D capture of multiple clothed people in arbitrary poses (with occlusions) from a single image. Our network simultaneously estimates the 3D geometry of each person and their 6DOF spatial locations, to obtain a coherent multi-human reconstruction. In addition, we introduce a new synthetic dataset that depicts images with a varying number of inter-occluded humans and a variety of clothing and hair styles. We demonstrate robust, high-resolution reconstructions on images of multiple humans with complex occlusions, loose clothing and a large variety of poses and scenes. Our quantitative evaluation on both synthetic and real world datasets demonstrates state-of-the-art performance with significant improvements in the accuracy and completeness of the reconstructions over competing approaches.

Armin Mustafa, Hansung Kim, Jean-Yves Guillemaut, Adrian Hilton (2015)General Dynamic Scene Reconstruction from Multiple View Video, In: 2015 IEEE International Conference on Computer Vision (ICCV)pp. 900-908 IEEE

This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge or limiting constraints on the scene structure, appearance, or illumination. Existing techniques for dynamic scene reconstruction from multiple wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and background is known. These approaches are not robust for general dynamic scenes captured with sparse moving cameras. Previous approaches for outdoor dynamic scene reconstruction assume prior knowledge of the static background appearance and structure. The primary contributions of this paper are twofold: an automatic method for initial coarse dynamic scene segmentation and reconstruction without prior knowledge of background appearance or structure; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes from multiple wide-baseline static or moving cameras. Evaluation is performed on a variety of indoor and outdoor scenes with cluttered backgrounds and multiple dynamic non-rigid objects such as people. Comparison with state-of-the-art approaches demonstrates improved accuracy in both multiple view segmentation and dense reconstruction. The proposed approach also eliminates the requirement for prior knowledge of scene structure and appearance.

Akin Caliskan, Armin Mustafa, Adrian Hilton (2021)Temporal Consistency Loss for High Resolution Textured and Clothed 3D Human Reconstruction from Monocular Video, In: 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021pp. 1780-1790 IEEE

DOI: 10.1109/CVPRW53098.2021.00197

We present a novel method to learn temporally consistent 3D reconstruction of clothed people from a monocular video. Recent methods for 3D human reconstruction from monocular video using volumetric, implicit or parametric human shape models, produce per frame reconstructions giving temporally inconsistent output and limited performance when applied to video. In this paper we introduce an approach to learn temporally consistent features for textured reconstruction of clothed 3D human sequences from monocular video by proposing two advances: a novel temporal consistency loss function; and hybrid representation learning for implicit 3D reconstruction from 2D images and coarse 3D geometry. The proposed advances improve the temporal consistency and accuracy of both the 3D reconstruction and texture prediction from a monocular video. Comprehensive comparative performance evaluation on images of people demonstrates that the proposed method significantly outperforms the state-of-the-art learning-based single image 3D human shape estimation approaches achieving significant improvement of reconstruction accuracy, completeness, quality and temporal consistency.

Akin Caliskan, Armin Mustafa, Evren Imre, Adrian Hilton (2021)Multi-view Consistency Loss for Improved Single-Image 3D Reconstruction of Clothed People, In: Computer Vision – ACCV 2020pp. 71-88 Springer International Publishing

DOI: 10.1007/978-3-030-69525-5_5

We present a novel method to improve the accuracy of the 3D reconstruction of clothed human shape from a single image. Recent work has introduced volumetric, implicit and model-based shape learning frameworks for reconstruction of objects and people from one or more images. However, the accuracy and completeness for reconstruction of clothed people is limited due to the large variation in shape resulting from clothing, hair, body size, pose and camera viewpoint. This paper introduces two advances to overcome this limitation: firstly a new synthetic dataset of realistic clothed people, 3DVH; and secondly, a novel multiple-view loss function for training of monocular volumetric shape estimation, which is demonstrated to significantly improve generalisation and reconstruction accuracy. The 3DVH dataset of realistic clothed 3D human models rendered with diverse natural backgrounds is demonstrated to allows transfer to reconstruction from real images of people. Comprehensive comparative performance evaluation on both synthetic and real images of people demonstrates that the proposed method significantly outperforms the previous state-of-the-art learning-based single image 3D human shape estimation approaches achieving significant improvement of reconstruction accuracy, completeness, and quality. An ablation study shows that this is due to both the proposed multiple-view training and the new 3DVH dataset. The code and the dataset can be found at the project website: https://akincaliskan3d.github.io/MV3DH/.

Armin Mustafa, Adrian Hilton (2019)Understanding real-world scenes for human-like machine perception, In: Proceedings of the Machine Intelligence 21 (MI21-HLC) workshop Imperial College Press

The rise of autonomous machines in our day-to-day lives has led to an increasing demand for machine perception of real-world to be more robust, accurate and human-like. The research in visual scene un- derstanding over the past two decades has focused on machine perception in controlled environments such as indoor, static and rigid objects. There is a gap in literature for machine perception in general complex scenes (outdoor with multiple interacting people). The proposed research ad- dresses the limitations of existing methods by proposing an unsupervised framework to simultaneously model, semantically segment and estimate motion for general dynamic scenes captured from multiple view videos with a network of static or moving cameras. In this talk I will explain the proposed joint framework to understand general dynamic scenes for ma- chine perception; give a comprehensive performance evaluation against state-of-the-art techniques on challenging indoor and outdoor sequences; and demonstrate applications such as virtual, augmented, mixed reality (VR/AR/MR) and broadcast production (Free-view point video - FVV).

ARMIN MUSTAFA, Chris Russell, Adrian Hilton (2022)4D Temporally Coherent Multi-Person Semantic Reconstruction and Segmentation, In: International journal of computer vision Springer

DOI: 10.1007/s11263-022-01599-4

We introduce the first approach to solve the challenging problem of automatic 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (≈40%) improvement in semantic segmentation, reconstruction and scene flow accuracy. In addition to the evaluation on several indoor and outdoor scenes, the proposed joint 4D scene understanding framework is applied to challenging outdoor sports scenes in the wild captured with manually operated wide-baseline broadcast cameras.