2pm - 3pm

Monday 16 December 2024

SceneTrilogy: On Scene Sketches and its Relationship with Text and Photo

PhD Viva Open Presentation - Pinaki Chowdhury

Hybrid event - All Welcome!

Free

21BA02 - Arthur C Clarke building
University of Surrey
Guildford
Surrey
GU2 7XH

ABSTRACT:

Sketches have been used since prehistoric times as a means for humans to express and record ideas. The level of expressiveness they carry remains unparalleled, even in the face of language -- recall the last time you instinctively reached for pen and paper (or a Zoom whiteboard) to sketch out an idea. However, research on sketches for visual understanding has largely focused on object-level sketches. In contrast, scene sketches exhibit abstraction not only of individual objects but also of global scene configurations. As research on object-level sketches matures, a promising shift is emerging towards scene-level tasks such as scene recognition, scene captioning, scene synthesis, and scene retrieval.

Despite this shift, challenges such as the lack of datasets and fear-to-sketch (i.e., “I can't sketch”) limit scene sketch research. This thesis addresses these challenges through four contributions under the theme of SceneTrilogy -- understanding freehand scene sketches and flexibly combining them with text and photos.

In the first chapter, we show that scene sketches are inherently “partial”: (i) they may omit certain objects from the corresponding photo due to subjective interpretation, or (ii) they contain significant empty (white) regions due to object-level abstraction. We address this with a cross-modal set-based approach that uses optimal transport together with intra-modal weighted adjacency matrices, yielding performance that is robust to partial scene sketches.
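As a rough illustration of the set-based idea, the sketch below compares a set of sketch region features with a set of photo region features using entropy-regularised optimal transport solved by Sinkhorn iterations. This is only a minimal sketch under stated assumptions: the abstract does not say which solver or cost the thesis uses, so Sinkhorn, the cosine cost, the uniform marginals, and the function names (`cosine_cost`, `sinkhorn`, `set_distance`) are all hypothetical choices made for illustration, and the intra-modal weighted adjacency matrices are not modelled here.

```python
import math


def cosine_cost(a, b):
    """Cost between two feature vectors: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised optimal transport with uniform marginals.

    Assumption for illustration: every region carries equal mass.
    Returns a transport plan (list of rows) of the same shape as `cost`.
    """
    n, m = len(cost), len(cost[0])
    row_mass = [1.0 / n] * n
    col_mass = [1.0 / m] * m
    # Gibbs kernel: low cost means high affinity.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def set_distance(sketch_feats, photo_feats, eps=0.1):
    """Transport cost between two feature sets of possibly different size."""
    cost = [[cosine_cost(s, p) for p in photo_feats] for s in sketch_feats]
    plan = sinkhorn(cost, eps)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(cost)) for j in range(len(cost[0])))


# Toy data (hypothetical): a 2-region sketch against two 3-region photos.
sketch = [[1.0, 0.0], [0.0, 1.0]]
matching_photo = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
other_photo = [[-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]]

d_match = set_distance(sketch, matching_photo)
d_other = set_distance(sketch, other_photo)
```

Because the two sets may differ in size, a partial sketch with fewer regions than the photo can still be compared: the plan spreads each sketch region's mass over the photo regions it resembles most, so `d_match` comes out smaller than `d_other` on this toy data.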

Second, we introduce FS-COCO, the first freehand scene sketch dataset consisting of 10,000 vector sketches by 100 non-expert individuals. Each sketch is accompanied by a text description from the same participant, along with corresponding photos from the MS-COCO dataset.

Third, we cultivate the expressiveness of sketches for the fundamental task of object detection. Our framework enables instance-aware detection, such as detecting a specific “zebra” within a herd, and part-aware detection, focusing on the desired part of an object, like the “head” of a “zebra”.

Finally, we complete the SceneTrilogy by integrating sketch, photo, and text representations into a flexible three-way embedding. This embedding supports “optionality” across two dimensions: (i) across modalities -- allowing any combination of modalities to serve as a query for downstream tasks, and (ii) across tasks -- enabling the embedding to be used for both discriminative tasks (e.g., retrieval) and generative tasks (e.g., captioning).