Dr Özge Mercanoğlu Sincan

Research Fellow in Computer Vision and Deep Learning

PhD

o.mercanoglusincan@surrey.ac.uk

Academic and research departments

Centre for Vision, Speech and Signal Processing (CVSSP).

About

Biography

Özge Mercanoğlu Sincan is a Research Fellow at the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey. She is currently working with Prof. Richard Bowden on sign language recognition and translation. She received her Ph.D. degree in computer engineering from Ankara University, Turkey, in 2021 under the supervision of Assoc. Prof. Hacer Yalim Keles. Her main research interests include computer vision and deep learning.

My qualifications

2012

BSc degree in computer engineering

Ankara University

2015

MSc degree in computer engineering

Ankara University

2021

PhD degree in computer engineering

Ankara University

Previous roles

2013 - 2021

Research Assistant

Ankara University, Turkey

2012 - 2013

Software Developer

NKR Software, Turkey

Publications

Ozge Mercanoglu Sincan, Richard Bowden (2025)Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation, In: IVA Adjunct '25: Adjunct Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents7 Association for Computing Machinery (ACM)

DOI: 10.1145/3742886.3756703

Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as an intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual visual encoder framework for gloss-free SLT, leveraging contrastive visual–language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embed-dings via a contrastive objective. During the downstream SLT task, we fuse the visual features and input them into an encoder–decoder model. On the Phoenix-2014T benchmark, our dual encoder architecture consistently outperforms its single-stream variants and achieves the highest BLEU-4 score among existing gloss-free SLT approaches.

Ozge Mercanoglu Sincan, Richard Bowden (2025)Spotter+GPT: Turning Sign Spottings into Sentences with LLMs, In: IVA Adjunct '25: Adjunct Proceedings of the 25th ACM International Conference on Intelligent Virtual AgentsIn Press(In Press) Association for Computing Machinery (ACM)

DOI: 10.1145/3742886.3756708

Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos. In this paper, we introduce a lightweight, modular SLT framework, Spotter+GPT, that leverages the power of Large Language Models (LLMs) and avoids heavy end-to-end training. Spotter+GPT breaks down the SLT task into two distinct stages. First, a sign spotter identifies individual signs within the input video. The spotted signs are then passed to an LLM, which transforms them into meaningful spoken language sentences. Spotter+GPT eliminates the requirement for SLT-specific training. This significantly reduces computational costs and time requirements. The source code and pretrained weights of the Spotter are available online.

Low Jian He, Ozge Mercanoglu Sincan, Richard Bowden (2025)Sign Spotting Disambiguation using Large Language Models, In: IVA 2025 - Adjunct Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents16pp. 1-9 Association for Computing Machinery (ACM)

DOI: 10.1145/3742886.3756720

Sign spotting, the task of identifying and localizing individual signs within continuous sign language video, plays a pivotal role in scaling dataset annotations and addressing the severe data scarcity issue in sign language translation. While automatic sign spotting holds great promise for enabling frame-level supervision at scale, it grapples with challenges such as vocabulary inflexibility and ambiguity inherent in continuous sign streams. Hence, we introduce a novel, training-free framework that integrates Large Language Models (LLMs) to significantly enhance sign spotting quality. Our approach extracts global spatio-temporal and hand shape features, which are then matched against a large-scale sign dictionary using dynamic time warping and cosine similarity. This dictionary-based matching inherently offers superior vocabulary flexibility without requiring model retraining. To mitigate noise and ambiguity from the matching process, an LLM performs context-aware gloss disambiguation via beam search, notably without fine-tuning. Extensive experiments on both synthetic and real-world sign language datasets demonstrate our method’s superior accuracy and sentence fluency compared to traditional approaches, highlighting the potential of LLMs in advancing sign spotting.

Ozge Mercanoglu Sincan, Jian He Low, Sobhan Asasi, Richard Bowden (2025)Gloss-Free Sign Language Translation: An Unbiased Evaluation of Progress in the Field, In: Computer Vision and Image Understanding261104498 Elsevier

DOI: 10.1016/j.cviu.2025.104498

Sign Language Translation (SLT) aims to automatically convert visual sign language videos into spoken language text and vice versa. While recent years have seen rapid progress, the true sources of performance improvements often remain unclear. Do reported performance gains come from methodological novelty, or from the choice of a different backbone, training optimizations, hyperparameter tuning, or even differences in the calculation of evaluation metrics? This paper presents a comprehensive study of recent gloss-free SLT models by re-implementing key contributions in a unified codebase. We ensure fair comparison by standardizing preprocessing, video encoders, and training setups across all methods. Our analysis shows that many of the performance gains reported in the literature often diminish when models are evaluated under consistent conditions, suggesting that implementation details and evaluation setups play a significant role in determining results. We make the codebase publicly available here to support transparency and reproducibility in SLT research.

Jian He Low, Harry Thomas Walsh, Ozge Mercanoglu Sincan, Richard Bowden (2025)Hands-On: Segmenting Individual Signs from Continuous Sequences, In: 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG) IEEE

DOI: 10.1109/FG61629.2025.11099255

This work tackles the challenge of continuous signlanguage segmentation, a key task with huge implications forsign language translation and data annotation. We proposea transformer-based architecture that models the temporaldynamics of signing and frames segmentation as a sequencelabeling problem using the Begin-In-Out (BIO) tagging scheme.Our method leverages the HaMeR hand features, and iscomplemented with 3D Angles. Extensive experiments show thatour model achieves state-of-the-art results on the DGS Corpus,while our features surpass prior benchmarks on BSLCorpus.

Ozge Mercanoglu Sincan, Necati Cihan Camgöz, Richard Bowden (2024)Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse, In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2023)pp. 1947-1957 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICCVW60793.2023.00210

Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos, both of which have different grammar and word/gloss order. From a Neural Machine Translation (NMT) perspective, the straightforward way of training translation models is to use sign language phrase-spoken language sentence pairs. However, human interpreters heavily rely on the context to understand the conveyed information , especially for sign language interpretation, where the vocabulary size may be significantly smaller than their spoken language equivalent. Taking direct inspiration from how humans translate, we propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would. We use the context from previous sequences and confident predictions to disambiguate weaker visual cues. To achieve this we use complementary transformer encoders, namely: (1) A Video Encoder, that captures the low-level video features at the frame-level, (2) A Spotting Encoder, that models the recognized sign glosses in the video, and (3) A Context Encoder, which captures the context of the preceding sign sequences. We combine the information coming from these encoders in a final transformer decoder to generate spoken language translations. We evaluate our approach on the recently published large-scale BOBSL dataset, which contains ∼1.2M sequences , and on the SRF dataset, which was part of the WMT-SLT 2022 challenge. We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.

Sobhan Asasi, Mohamed Ilyes Lakhal, Ozge Mercanoglu Sincan, Richard Bowden (2025)Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation, In: Proceedings of The 36th British Machine Vision Conference British Machine Vision Association

Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate finegrained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from Hand Mesh Recovery (HaMeR). Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap in pre-training. BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. Our code is available at https://github.com/elsobhano/BeyondGloss.

Anton Pelykh, Ozge Mercanoglu Sincan, Richard Bowden (2024)Giving a Hand to Diffusion Models: A Two-Stage Approach to Improving Conditional Human Image Generation, In: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG)pp. 1-10 IEEE

DOI: 10.1109/FG59268.2024.10582008

Recent years have seen significant progress in human image generation, particularly with the advancements in diffusion models. However, existing diffusion methods encounter challenges when producing consistent hand anatomy and the generated images often lack precise control over the hand pose. To address this limitation, we introduce a novel approach to pose-conditioned human image generation, dividing the process into two stages: hand generation and subsequent body outpainting around the hands. We propose training the hand generator in a multi-task setting to produce both hand images and their corresponding segmentation masks, and employ the trained model in the first stage of generation. An adapted ControlNet model is then used in the second stage to outpaint the body around the generated hands, producing the final result. A novel blending technique is introduced to preserve the hand details during the second stage that combines the results of both stages in a coherent way. This involves sequential expansion of the outpainted region while fusing the latent representations, to ensure a seamless and cohesive synthesis of the final image. Experimental evaluations demonstrate the superiority of our proposed method over state-of-the-art techniques, in both pose accuracy and image quality, as validated on the HaGRID dataset. Our approach not only enhances the quality of the generated hands but also offers improved control over hand pose, advancing the capabilities of pose-conditioned human image generation. The source code is available here. 1 1 https://github.com/apelykh/hand-to-diffusion

Alexandre Symeonidis-Herzig, Ozge Mercanoglu Sincan, Richard Bowden (2025)VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis, In: 2025 IEEE/CVF International Conference on Computer Vision (ICCV 2025) Institute of Electrical and Electronics Engineers (IEEE)

Realistic, high-fidelity 3D facial animations are crucial for expressive avatar systems in human-computer interaction and accessibility. Although prior methods show promising quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations seen in 2D computer vision and graphics. We propose VisualSpeaker, a novel method that bridges this gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. This perceptual focus naturally supports accurate mouthings, essential cues that disambiguate similar manual signs in sign language avatars.

Jian He Low, Ozge Mercanoglu Sincan, Richard Bowden (2025)SAGE: Segment-Aware Gloss-Free Encoding for Token-Efficient Sign Language Translation, In: 2025 IEEE/CVF International Conference on Computer Vision (ICCV 2025) Institute of Electrical and Electronics Engineers (IEEE)

Gloss-free Sign Language Translation (SLT) has advanced rapidly, achieving strong performances without relying on gloss annotations. However, these gains have often come with increased model complexity and high computational demands, raising concerns about scalability, especially as large-scale sign language datasets become more common. We propose a segment-aware visual tokenization framework that leverages sign segmentation to convert continuous video into discrete, sign-informed visual tokens. This reduces input sequence length by up to 50% compared to prior methods, resulting in up to 2.67× lower memory usage and better scalability on larger datasets. To bridge the visual and linguistic modalities, we introduce a token-to-token contrastive alignment objective, along with a dual-level supervision that aligns both language embeddings and intermediate hidden states. This improves fine-grained cross-modal alignment without relying on gloss-level supervision. Our approach notably exceeds the performance of state-of-the-art methods on the PHOENIX14T benchmark, while significantly reducing sequence length. Further experiments also demonstrate our improved performance over prior work under comparable sequence-lengths, validating the potential of our tokenization and alignment strategies.

Harry Thomas Walsh, Edward Fish, Ozge Mercanoglu Sincan, Mohamed Ilyes Lakhal, Richard Bowden (2025)SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work, In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 (CVPR 2025) Institute of Electrical and Electronics Engineers (IEEE)

Sign Language Production (SLP) is the task of generating sign language video from spoken language inputs. The field has seen a range of innovations over the last few years, with the introduction of deep learning-based approaches providing significant improvements in the realism and naturalness of generated outputs. However, the lack of standardized evaluation metrics for SLP approaches hampers meaningful comparisons across different systems. To address this, we introduce the first Sign Language Production Challenge, held as part of the third SLRTP Workshop at CVPR 2025. The competition's aims are to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses, known as Text-to-Pose (T2P) translation , over a range of metrics. For our evaluation data, we use the RWTH-PHOENIX-Weather-2014T dataset, a Ger-man Sign Language-Deutsche Gebärdensprache (DGS) weather broadcast dataset. In addition, we curate a custom hidden test set from a similar domain of discourse. This paper presents the challenge design and the winning methodologies. The challenge attracted 33 participants who submitted 231 solutions, with the top-performing team achieving BLEU-1 scores of 31.40 and DTW-MJE of 0.0574. The winning approach utilized a retrieval-based framework and a pre-trained language model. As part of the workshop, we release a standardized evaluation network , including high-quality skeleton extraction-based key-points establishing a consistent baseline for the SLP field, which will enable future researchers to compare their work against a broader range of methods.

Harry Walsh, Ozge Mercanoglu Sincan, Ben Saunders, Richard Bowden Gloss Alignment Using Word Embeddings

DOI: 10.48550/arxiv.2308.04248

Capturing and annotating Sign language datasets is a time consuming and costly process. Current datasets are orders of magnitude too small to successfully train unconstrained \acf{slt} models. As a result, research has turned to TV broadcast content as a source of large-scale training data, consisting of both the sign language interpreter and the associated audio subtitle. However, lack of sign language annotation limits the usability of this data and has led to the development of automatic annotation techniques such as sign spotting. These spottings are aligned to the video rather than the subtitle, which often results in a misalignment between the subtitle and spotted signs. In this paper we propose a method for aligning spottings with their corresponding subtitles using large spoken language models. Using a single modality means our method is computationally inexpensive and can be utilized in conjunction with existing alignment techniques. We quantitatively demonstrate the effectiveness of our method on the \acf{mdgs} and \acf{bobsl} datasets, recovering up to a 33.22 BLEU-1 score in word alignment.

Ozge Mercanoglu Sincan, Necati Cihan Camgoz, Richard Bowden Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse

DOI: 10.48550/arxiv.2308.09622

Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos, both of which have different grammar and word/gloss order. From a Neural Machine Translation (NMT) perspective, the straightforward way of training translation models is to use sign language phrase-spoken language sentence pairs. However, human interpreters heavily rely on the context to understand the conveyed information, especially for sign language interpretation, where the vocabulary size may be significantly smaller than their spoken language equivalent. Taking direct inspiration from how humans translate, we propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would. We use the context from previous sequences and confident predictions to disambiguate weaker visual cues. To achieve this we use complementary transformer encoders, namely: (1) A Video Encoder, that captures the low-level video features at the frame-level, (2) A Spotting Encoder, that models the recognized sign glosses in the video, and (3) A Context Encoder, which captures the context of the preceding sign sequences. We combine the information coming from these encoders in a final transformer decoder to generate spoken language translations. We evaluate our approach on the recently published large-scale BOBSL dataset, which contains ~1.2M sequences, and on the SRF dataset, which was part of the WMT-SLT 2022 challenge. We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.

Ozge Mercanoglu Sincan, Hacer Yalim Keles, Ozge Mercanoglu Sincan (2020)AUTSL: A Large Scale Multi-Modal Turkish Sign Language Dataset and Baseline Methods, In: IEEE access8181340pp. 181340-181355 IEEE

DOI: 10.1109/ACCESS.2020.3028072

Sign language recognition is a challenging problem where signs are identified by simultaneous local and global articulations of multiple sources, i.e. hand shape and orientation, hand movements, body posture, and facial expressions. Solving this problem computationally for a large vocabulary of signs in real life settings is still a challenge, even with the state-of-the-art models. In this study, we present a new large-scale multi-modal Turkish Sign Language dataset (AUTSL) with a benchmark and provide baseline models for performance evaluations. Our dataset consists of 226 signs performed by 43 different signers and 38,336 isolated sign video samples in total. Samples contain a wide variety of backgrounds recorded in indoor and outdoor environments. Moreover, spatial positions and the postures of signers also vary in the recordings. Each sample is recorded with Microsoft Kinect v2 and contains color image (RGB), depth, and skeleton modalities. We prepared benchmark training and test sets for user independent assessments of the models. We trained several deep learning based models and provide empirical evaluations using the benchmark; we used Convolutional Neural Networks (CNNs) to extract features, unidirectional and bidirectional Long Short-Term Memory (LSTM) models to characterize temporal information. We also incorporated feature pooling modules and temporal attention to our models to improve the performances. We evaluated our baseline models on AUTSL and Montalbano datasets. Our models achieved competitive results with the state-of-the-art methods on Montalbano dataset, i.e. 96.11% accuracy. In AUTSL random train-test splits, our models performed up to 95.95% accuracy. In the proposed user-independent benchmark dataset our best baseline model achieved 62.02% accuracy. The gaps in the performances of the same baseline models show the challenges inherent in our benchmark dataset. AUTSL benchmark dataset is publicly available at https://cvml.ankara.edu.tr .

Harry Walsh, Ozge Mercanoglu Sincan, Ben Saunders, Richard Bowden (2023)Gloss Alignment using Word Embeddings, In: 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)pp. 1-5 IEEE

DOI: 10.1109/ICASSPW59220.2023.10193013

Capturing and annotating Sign language datasets is a time consuming and costly process. Current datasets are orders of magnitude too small to successfully train unconstrained Sign Language Translation (SLT) models. As a result, research has turned to TV broadcast content as a source of large-scale training data, consisting of both the sign language interpreter and the associated audio subtitle. However, lack of sign language annotation limits the usability of this data and has led to the development of automatic annotation techniques such as sign spotting. These spottings are aligned to the video rather than the subtitle, which often results in a misalignment between the subtitle and spotted signs. In this paper we propose a method for aligning spottings with their corresponding subtitles using large spoken language models. Using a single modality means our method is computationally inexpensive and can be utilized in conjunction with existing alignment techniques. We quantitatively demonstrate the effectiveness of our method on the Meine DGS-Annotated (MeineDGS) and BBC-Oxford British Sign Language (BOBSL) datasets, recovering up to a 33.22 BLEU-1 score in word alignment.

Ozge Mercanoglu Sincan, Hacer Yalim Keles (2022)Using Motion History Images With 3D Convolutional Networks in Isolated Sign Language Recognition, In: IEEE access10pp. 18608-18618 IEEE

DOI: 10.1109/ACCESS.2022.3151362

Sign language recognition using computational models is a challenging problem that requires simultaneous spatio-temporal modeling of the multiple sources, i.e. faces, hands, body, etc. In this paper, we propose an isolated sign language recognition model based on a model trained using Motion History Images (MHI) that are generated from RGB video frames. RGB-MHI images represent spatio-temporal summary of each sign video effectively in a single RGB image. We propose two different approaches using this RGB-MHI model. In the first approach, we use the RGB-MHI model as a motion-based spatial attention module integrated into a 3D-CNN architecture. In the second approach, we use RGB-MHI model features directly with the features of a 3D-CNN model using a late fusion technique. We perform extensive experiments on two recently released large-scale isolated sign language datasets, namely AUTSL and BosphorusSign22k. Our experiments show that our models, which use only RGB data, can compete with the state-of-the-art models in the literature that use multi-modal data.

Ozge Mercanoglu Sincan, Julio C. S. Jacques Junior, Sergio Escalera, Hacer Yalim Keles, Ozge Mercanoglu Sincan (2021)ChaLearn LAP Large Scale Signer Independent Isolated Sign Language Recognition Challenge: Design, Results and Future Research, In: 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021pp. 3467-3476 IEEE

DOI: 10.1109/CVPRW53098.2021.00386

The performances of Sign Language Recognition (SLR) systems have improved considerably in recent years. However, several open challenges still need to be solved to allow SLR to be useful in practice. The research in the field is in its infancy in regards to the robustness of the models to a large diversity of signs and signers, and to fairness of the models to performers from different demographics. This work summarises the ChaLearn LAP Large Scale Signer Independent Isolated SLR Challenge, organised at CVPR 2021 with the goal of overcoming some of the aforementioned challenges. We analyse and discuss the challenge design, top winning solutions and suggestions for future research. The challenge attracted 132 participants in the RGB track and 59 in the RGB+Depth track, receiving more than 1.5K submissions in total. Participants were evaluated using a new large-scale multi-modal Turkish Sign Language (AUTSL) dataset, consisting of 226 sign labels and 36,302 isolated sign video samples performed by 43 different signers. Winning teams achieved more than 96% recognition rate, and their approaches benefited from pose/hand/face estimation, transfer learning, external data, fusion/ensemble of modalities and different strategies to model spatio-temporal information. However, methods still fail to distinguish among very similar signs, in particular those sharing similar hand trajectories.