Published: 04 July 2022

CTS research presented at LREC 2022

From 20 to 25 June 2022, members of the Centre for Translation Studies (CTS) represented the centre at the Language Resources and Evaluation Conference (LREC 2022), held this year in Marseille, France.

Research carried out in the Centre for Translation Studies in collaboration with the Surrey Institute for People-Centred AI was presented at the 13th edition of the Language Resources and Evaluation Conference in Marseille, France. Two of the papers introduce the largest resources of their kind:

  • Leonardo Zilio, Hadeel Saadany, Prashant Sharma, Diptesh Kanojia and Constantin Orăsan (2022) PLOD: An Abbreviation Detection Dataset for Scientific Documents. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, pp. 680–688. [PDF] [Presentation] [Recording]

This paper presents PLOD, a large-scale dataset for abbreviation detection and extraction that can be used to identify both the short forms and the long forms of abbreviations. Such a resource is important for developing automatic abbreviation detection methods that can help translators and interpreters in their daily tasks. The PLOD dataset, codebase, and models are available at https://github.com/surrey-nlp/PLOD-AbbreviationDetection.

  • Rudra Murthy, Pallab Bhattacharjee, Rahul Sharnagat, Jyotsana Khatri, Diptesh Kanojia and Pushpak Bhattacharyya (2022) HiNER: A large Hindi Named Entity Recognition Dataset. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, pp. 4467–4476. [PDF] [Poster] [Recording]

Named entity recognition is a fundamental step in many Natural Language Processing (NLP) applications, as it identifies names of persons, organisations, locations, etc., but languages like Hindi lack the resources needed to develop such systems. This paper presents the largest Hindi dataset for this task and evaluates it exhaustively using various language models. The HiNER dataset, codebase, and models are available at https://github.com/cfiltnlp/HiNER.

An evaluation of interlingual communication workflows is presented in:

  • Tomasz Korybski, Elena Davitti, Constantin Orăsan and Sabine Braun (2022) A Semi-Automated Live Interlingual Communication Workflow Featuring Intralingual Respeaking: Evaluation and Benchmarking. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, pp. 4405–4413. [PDF] [Presentation] [Recording]

The paper compares a semi-automated workflow which involves intralingual respeaking and machine translation with a traditional workflow which relies on professional interpreters. The paper shows that the semi-automated workflow is capable of generating outputs that are similar in terms of accuracy and completeness to the outputs produced in the benchmarking workflow.