Foundation models for multi-modal video understanding

This research will develop AI models that understand, from video, how people interact with the world around them, with a focus on retail applications in collaboration with a major UK retailer.

Start date

1 October 2025

Duration

3.5 years

Application deadline

Funding source

EPSRC DLA + industry sponsorship (Tesco)

Funding information

EPSRC stipend rate, £20,780.

About

Humans understand complex actions and interactions in day-to-day environments. Advancing scene understanding so that machines can interpret multiple human actions and interactions enables better performance and more accurate decisions, with impact in broadcast production (personalised media production, analysis of sporting action and performance) and autonomous driving (helping drivers make safer, better-informed decisions).

Existing methods in scene understanding work either for a single object's action or for simple scenes.

Detecting the actions and interactions of multiple people in video is an open problem, with challenges including complex and diverse actions and environments, varying camera viewpoints and motion, occlusions, background motion, and low-quality footage. Different human actions and interactions may require different levels of granularity and abstraction, depending on the context and the objective of the analysis.

The research will be applied to interaction understanding in retail, in collaboration with a major UK supermarket chain. Many actions take place in supermarkets, such as breaking down boxes to restock shelves, grabbing a product from a shelf, and placing a product in a basket. These actions can also be broken down into finer sub-actions: picking a product from a shelf can be treated as a single simple action, or decomposed into reaching, grasping, and lifting. The aim of this project is to develop foundation models for multi-modal action detection in large retail superstores.

The research will start with detection of a simple action, such as picking a product from a shelf, and then expand to more complex actions of various kinds. The actions detected by this approach can be applied to different problems and settings, such as stores, depots, and warehouses, giving the work a much broader scope. Potential applications beyond the scope of the PhD include healthcare and wellbeing at home.

Eligibility criteria

Open to UK and international candidates. Up to 30% of our UKRI-funded studentships can be awarded to candidates paying international-rate fees.

This is an exciting opportunity for a highly motivated individual to contribute to research at the forefront of AI and computer vision, leading to deployment in real-world applications. Applicants should have demonstrated excellent achievement in a numerate discipline at undergraduate and/or master's level, with a strong interest in, and practical experience of, AI and machine learning. Industry experience of AI and computer vision software development is also an advantage.

How to apply

Applications should be submitted via the Vision, Speech and Signal Processing PhD programme page. In place of a research proposal, you should upload a document stating the title of the project you wish to apply for and the name of the relevant supervisor.

Studentship FAQs

Read our studentship FAQs to find out more about applying and funding.

Contact details

Armin Mustafa
39 BA 00
Telephone: +44 (0)1483 684262
E-mail: armin.mustafa@surrey.ac.uk
