Monocular self-supervised 4D semantic reconstruction of general dynamic scenes

At the Centre for Vision, Speech and Signal Processing (CVSSP), we're developing exciting and ground-breaking technologies. These include facial recognition for security, medical imaging, 3D spatial audio, and 3D reconstruction from video for visual-effects production in films, games and virtual reality.

Start date

1 October 2025

Duration

3 years

Application deadline

Friday 14 March 2025

Funding information

Awards cover UK tuition fees and provide a stipend at the UKRI minimum rate for a period of 3 years. The stipend is a tax-free maintenance payment of £20,780 per annum from 1 October 2025.

As a doctoral student, you may be able to access additional funding to cover the cost of training and development.

About

Monocular 4D reconstruction of social scenes is a challenging open problem: the dynamic elements of a scene change shape and location over time, while lighting and backgrounds also vary, making it extremely difficult to track the 3D points of each person or object through time.

3D models will first be estimated and then extended to learn per-pixel motion in dynamic scenes, using pre-annotated sparse temporal 2D/3D labels for 4D reconstruction from monocular video. This captures both the spatial (3D) and temporal (4D) evolution of a scene, giving a deeper understanding of dynamic interactions. Current per-pixel motion estimation methods often suffer from unreliable correlations and accumulated inaccuracies.

Diffusion models can improve correlation reliability and resilience to noise thanks to their intrinsic denoising behaviour, and they are well suited to modelling the long-range dependencies needed for spatio-temporal 4D semantic reconstruction of videos with multiple interacting people. A novel uncertainty-aware diffusion probabilistic model will learn temporally consistent features for 4D temporal correspondence, and a novel temporal consistency loss combined with a hybrid representation will allow multiple video frames to be processed into temporally coherent 4D reconstructions. Finally, 4D temporal reasoning will be integrated into the model using graphs that represent the relationships between people and objects, capturing both short-term and long-term dynamics for temporally coherent reconstruction.
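To make the idea of a temporal consistency loss concrete, the minimal PyTorch sketch below shows one common formulation: per-pixel features from frame t are warped into frame t+1 using the estimated motion, and the remaining disagreement is penalised, down-weighted by a predicted per-pixel uncertainty (a heteroscedastic weighting in the style of Kendall and Gal). This is an illustrative assumption about how such a loss might look, not the method the project will develop; the names warp_features, temporal_consistency_loss and log_var_t1 are hypothetical.

import torch
import torch.nn.functional as F

def warp_features(feat_t, flow_t_to_t1):
    # feat_t:       (B, C, H, W) per-pixel features at frame t
    # flow_t_to_t1: (B, 2, H, W) estimated per-pixel motion in pixels
    _, _, h, w = feat_t.shape
    # Build a normalised sampling grid in [-1, 1] for grid_sample.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat_t.device, dtype=feat_t.dtype),
        torch.arange(w, device=feat_t.device, dtype=feat_t.dtype),
        indexing="ij",
    )
    grid_x = (xs + flow_t_to_t1[:, 0]) / max(w - 1, 1) * 2.0 - 1.0
    grid_y = (ys + flow_t_to_t1[:, 1]) / max(h - 1, 1) * 2.0 - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat_t, grid, align_corners=True)

def temporal_consistency_loss(feat_t, feat_t1, flow_t_to_t1, log_var_t1):
    # log_var_t1: (B, 1, H, W) predicted per-pixel log-variance.
    warped = warp_features(feat_t, flow_t_to_t1)
    residual = (warped - feat_t1).pow(2).mean(dim=1, keepdim=True)
    # Uncertainty-weighted residual: high predicted variance discounts
    # the error, while the additive log-variance term stops the model
    # from declaring every pixel uncertain.
    loss = residual * torch.exp(-log_var_t1) + log_var_t1
    return loss.mean()

In this sketch, the exp(-log_var_t1) factor lets the model discount pixels it is unsure about (for example, occlusions between interacting people), which is the role an uncertainty-aware model would play in making 4D correspondences robust.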

The supervisory team: Dr Armin Mustafa (Lead Supervisor), Dr Marco Volino (Second Supervisor), Prof. Adrian Hilton (Third Supervisor) and an Expert Mentor at a CoSTAR partner institution (TBC).

This PhD is aligned with the CoSTAR research themes: AI Futures / Createch Futures.

Find out more about CoSTAR and its PhD opportunities.

Eligibility criteria

Open to candidates who pay UK/home rate fees. See UKCISA for further information.

You will need to meet the minimum entry requirements for our Vision, Speech and Signal Processing PhD programme.

How to apply

Stage 1: Please contact the Lead Supervisor of this PhD opportunity, Dr Armin Mustafa, via email (armin.mustafa@surrey.ac.uk), and work with the Lead Supervisor and supervisory team to submit your CV and a 500-word Expression of Interest by Friday 14 March 2025.

Studentship FAQs

Read our studentship FAQs to find out more about applying and funding.

Contact details

Armin Mustafa
Room: 39 BA 00
Telephone: +44 (0)1483 684262
E-mail: armin.mustafa@surrey.ac.uk

Studentships at Surrey

We have a wide range of studentship opportunities available.