11am - 12 noon
Friday 25 September 2020
Neural sign language recognition and translation
PhD Viva Presentation by Necati Cihan Camgoz. All are welcome.
Free
This event has passed
Abstract
Sign languages have been studied by computer vision researchers for the last three decades. One of the end goals of vision-based sign language research is to build systems that can understand and translate sign languages to spoken/written languages, or vice versa, to create a more natural medium of communication between the hearing and the Deaf. However, most research to date has focused on isolated sign recognition and spotting, neglecting the rich underlying grammatical and linguistic structures of sign language that differ from spoken language. More recently, Continuous Sign Language Recognition (CSLR) has become feasible with the availability of large benchmark datasets, such as RWTH-PHOENIX-Weather-2014 (PHOENIX14), and the development of algorithms that can learn from weak annotations. Although CSLR is able to recognize sign gloss sequences, further progress is required to produce meaningful spoken/written language interpretations of continuous sign language videos.
In this thesis, we introduce the Sign Language Translation (SLT) problem and lay the groundwork for future research on this topic. The objective of SLT is to generate spoken/written language translations from continuous sign language videos, taking into account the different word orders and grammar. We evaluate our approaches on the RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, the first and currently the only publicly available continuous SLT dataset aimed at vision-based sign language research. It provides spoken language translations and gloss-level annotations for German Sign Language videos of weather broadcasts. We lay down several evaluation protocols to underpin future research in this newly established field.
In the first contribution chapter of this thesis, we formalize SLT in the framework of Neural Machine Translation (NMT) and propose the first SLT approach, Neural Sign Language Translation. We combine Convolutional Neural Networks (CNNs) and attention-based encoder-decoder models, which allows us to jointly learn the spatial representations, the underlying language model, and the mapping between sign and spoken language. We investigate different configurations of the proposed network in both end-to-end and pretrained settings (using expert gloss annotations). In our experiments, recognizing glosses and then translating them to spoken language (Sign2Gloss2Text) drastically outperforms an end-to-end direct translation approach (Sign2Text). Sign2Gloss2Text utilizes a state-of-the-art CSLR model to predict gloss sequences from sign language videos and then solves SLT as a text-to-text translation problem. This suggests that using gloss-level intermediate representations, essentially dividing the process into two stages, is necessary to train accurate SLT models.
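To make the Sign2Text idea concrete, the following is a minimal, illustrative PyTorch sketch of an attention-based encoder-decoder over pre-extracted CNN frame features. It is not the thesis code; all layer sizes, module choices and names are assumptions for illustration only.

```python
# Minimal sketch (assumed configuration, not the thesis implementation):
# per-frame CNN features feed an attention-based encoder-decoder that
# emits spoken-language tokens.
import torch
import torch.nn as nn

class Sign2TextSketch(nn.Module):
    def __init__(self, feat_dim=1024, hidden=512, vocab_size=3000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, 2 * hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=1, batch_first=True)
        self.out = nn.Linear(4 * hidden, vocab_size)

    def forward(self, frame_feats, prev_tokens):
        # frame_feats: (B, T, feat_dim) CNN features of the sign video frames
        # prev_tokens: (B, U) previously generated spoken-language tokens
        memory, _ = self.encoder(frame_feats)                   # (B, T, 2*hidden)
        dec_states, _ = self.decoder(self.embed(prev_tokens))   # (B, U, 2*hidden)
        context, _ = self.attn(dec_states, memory, memory)      # attend over the video
        return self.out(torch.cat([dec_states, context], dim=-1))  # (B, U, vocab)

model = Sign2TextSketch()
logits = model(torch.randn(2, 120, 1024), torch.randint(0, 3000, (2, 12)))
```

In the Sign2Gloss2Text configuration, the same decoder-side machinery would instead consume predicted gloss tokens from a CSLR model rather than raw video features, turning SLT into a text-to-text problem.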
Glosses are incomplete text-based representations of the continuous, multi-channel visual signals that constitute sign languages. Thus, the best-performing two-step configuration of Neural Sign Language Translation has an inherent information bottleneck that limits translation. To address this issue, in the second contribution chapter of this thesis we formulate SLT as a multi-task learning problem. We introduce a novel transformer-based architecture, Sign Language Transformers, which jointly learns CSLR and SLT while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information, simultaneously solves two co-dependent sequence-to-sequence learning problems, and leads to significant performance gains. We report state-of-the-art CSLR and SLT results achieved by our Sign Language Transformers. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models, in some cases more than doubling the performance of Neural Sign Language Translation (Sign2Text configuration: 9.58 vs. 21.80 BLEU-4).
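A minimal sketch of the joint objective, assuming it combines a CTC recognition loss over gloss logits from the encoder with a cross-entropy translation loss over spoken-language logits from the decoder; the loss weights, padding conventions and tensor shapes below are illustrative assumptions rather than the exact thesis configuration.

```python
# Sketch of a joint recognition + translation loss (assumed form).
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)      # recognition over gloss vocabulary
xent_loss = nn.CrossEntropyLoss(ignore_index=0)          # translation; 0 = assumed padding index

def joint_loss(gloss_logits, gloss_targets, input_lens, gloss_lens,
               word_logits, word_targets, w_recognition=1.0, w_translation=1.0):
    """gloss_logits: (T, B, gloss_vocab) from the transformer encoder.
    word_logits:  (B, U, word_vocab) from the transformer decoder."""
    recognition = ctc_loss(gloss_logits.log_softmax(-1),
                           gloss_targets, input_lens, gloss_lens)
    translation = xent_loss(word_logits.reshape(-1, word_logits.size(-1)),
                            word_targets.reshape(-1))
    return w_recognition * recognition + w_translation * translation
```

Because CTC marginalizes over all possible alignments, the recognition side needs only the gloss sequence itself, not frame-level timing annotations, which is what allows the two tasks to be trained together end-to-end.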
The models we introduce in the first and second contribution chapters rely heavily on gloss information, either in the form of direct supervision or for pretraining. To realize sign language translation at a scale on par with its spoken/written language counterparts, we require larger parallel datasets. However, annotating sign glosses is a laborious task, and acquiring such annotations for large datasets is infeasible. To address this issue, in our last contribution chapter we propose modelling SLT based on sign articulators instead of glosses. Contrary to previous research, which has mainly focused on manual features, we incorporate both manual and non-manual features of the sign. We utilize hand shape, mouthing and upper body pose representations to model the sign in a holistic manner.
We propose a novel transformer-based architecture, called Multi-Channel Transformers, aimed at sequence-to-sequence learning problems where the source information is embedded over several channels. This approach allows the networks to model both the inter- and intra-channel relationships between asynchronous source channels. We also introduce a channel anchoring loss to help our models preserve channel-specific information while also regularizing training against overfitting.
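A minimal sketch of one plausible form of such an anchoring term, assuming it penalizes the distance between each channel's contextualized encoder output and that channel's original input embedding; this is an illustrative reading of the idea, not the exact formulation used in the thesis.

```python
# Sketch of a channel anchoring loss (assumed form): keep each channel's
# contextualized representation close to its pre-encoder embedding so that
# channel-specific information is not washed out by cross-channel attention.
import torch
import torch.nn.functional as F

def channel_anchoring_loss(channel_inputs, channel_outputs):
    """channel_inputs / channel_outputs: lists of (B, T_c, D) tensors,
    one pair per articulator channel (e.g. hand shape, mouthings, pose)."""
    losses = [F.mse_loss(out, inp.detach())   # anchor to the pre-encoder embedding
              for inp, out in zip(channel_inputs, channel_outputs)]
    return torch.stack(losses).mean()
```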
We apply Multi-Channel Transformers to the task of SLT and realize the first multi-articulatory translation approach. Our experiments on PHOENIX14T demonstrate that our approach achieves translation performance on par with or better than several baselines, overcoming the reliance on gloss information that underpins previous approaches. Now that we have broken the dependency on gloss information, future work will be to scale learning to larger datasets, such as broadcast footage, where gloss information is not available.