Leading open data practices in linguistics

The Surrey Morphology Group (SMG) is a world-leading research group that has been creating open data documenting the world's languages for 30 years. In the past few years, we have focused our efforts on leading improved open data practices in our field.

Sacha Beniamine

Open research practices

Our approach to modernising data practices in linguistic morphology has consisted of data-rescue operations for existing datasets, the creation of new data principles, a data standard, and a set of open source tools.

Data rescue

Many databases created decades ago are at risk of disappearing. They may be trapped in legacy software (such as Flash Player), or may simply no longer be maintained. To remedy this situation, we salvage, reorganise, normalise, standardise, and openly release this vital data. The first two examples of this effort are the Romance Verbal Inflection Dataset 2.0 (Beniamine et al. 2020, http://dx.doi.org/10.5281/zenodo.3552167) and the Surrey Morphological Complexity Database (Baerman et al. 2023, http://dx.doi.org/10.15126/SMG.23/1).

Data principles

Data principles such as FAIR (namely, that data be Findable, Accessible, Interoperable and Reusable) and CARE (namely, Collective benefit, Authority to control, Responsibility, and Ethics) are crucial for open scientific data. Complementing these, we devised the novel DeAR principles to support researchers facing the colossal task of scientific documentation. We encourage a Decentralised strategy, based on international cooperation rather than centralising data in a single institution; Automated validation and verification of data to ensure its quality; and Revisable pipelines that make it easy to regenerate data presentations (such as websites) when the scientific data is updated.
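To make the idea of Automated validation concrete, here is a minimal sketch in Python. It is not the SMG's actual tooling: the column names (form_id, lexeme, cell, phon_form), the sample rows and the validate_forms helper are all assumptions chosen for illustration. It simply checks that every row of a table of inflected forms has its required values and that identifiers are unique.

```python
import csv
import io

# Hypothetical required columns for a table of inflected forms (illustrative only).
REQUIRED_COLUMNS = ("form_id", "lexeme", "cell", "phon_form")


def validate_forms(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    seen_ids = set()
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        # Every required column must hold a non-empty value.
        for column in REQUIRED_COLUMNS:
            if not (row.get(column) or "").strip():
                problems.append(f"line {line_no}: missing value for '{column}'")
        # Identifiers must be unique across the table.
        form_id = (row.get("form_id") or "").strip()
        if form_id:
            if form_id in seen_ids:
                problems.append(f"line {line_no}: duplicate form_id '{form_id}'")
            seen_ids.add(form_id)
    return problems


# Invented sample data with two deliberate mistakes.
SAMPLE = """form_id,lexeme,cell,phon_form
f1,cantar,prs.1sg,kanto
f2,cantar,prs.1pl,kantamos
f2,vivir,prs.1sg,bibo
f4,vivir,prs.1pl,
"""

problems = validate_forms(csv.DictReader(io.StringIO(SAMPLE)))
for problem in problems:
    print(problem)
```

Run on every change to a dataset (for example in a continuous integration job), a check of this kind catches duplicated identifiers and empty cells before a release is archived.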

Data standard

For future datasets, we created a new data standard (Paralex, https://paralex-standard.org/, Beniamine et al. 2023). A standard is crucial for datasets to fit the FAIR and DeAR principles, and pivotal in enabling cross-language comparative work. Paralex promotes clean organisation of data and the adoption of common formal conventions. It builds on existing initiatives, such as the OntoLex-Lemon model for ontologies and the Frictionless standard for metadata.
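As a rough illustration of how machine-readable metadata can enforce common formal conventions, the sketch below describes a table with a small descriptor, loosely modelled on the general shape of a Frictionless-style data package, and checks a CSV header against it. The descriptor contents, the file name, the column names and the missing_columns helper are assumptions made for this example; they are not taken from the Paralex specification.

```python
import csv
import io
import json

# Hypothetical descriptor: one table ("forms") and the columns it declares.
DESCRIPTOR = {
    "name": "example-lexicon",
    "resources": [
        {
            "name": "forms",
            "path": "forms.csv",
            "schema": {
                "fields": [
                    {"name": "form_id", "type": "string"},
                    {"name": "lexeme", "type": "string"},
                    {"name": "cell", "type": "string"},
                    {"name": "phon_form", "type": "string"},
                ],
                "primaryKey": "form_id",
            },
        }
    ],
}


def missing_columns(descriptor, resource_name, csv_text):
    """List the declared columns that are absent from the CSV header."""
    resource = next(r for r in descriptor["resources"] if r["name"] == resource_name)
    declared = [field["name"] for field in resource["schema"]["fields"]]
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in declared if name not in header]


# Invented data: all declared columns are present, plus an extra 'notes' column.
CSV_TEXT = "form_id,lexeme,cell,phon_form,notes\nf1,cantar,prs.1sg,kanto,\n"

print(json.dumps(DESCRIPTOR, indent=2))           # the metadata, as it could be stored on disk
print(missing_columns(DESCRIPTOR, "forms", CSV_TEXT))
```

In this sketch the declared columns are mandatory while extra columns are tolerated, which is one simple way of being strict about form without constraining the scientific content of a dataset.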

Data tools

Finally, we are producing a suite of tools to facilitate the use of the Paralex standard, the user-friendly manipulation of datasets, the automatic generation of interactive static websites, and automated releases of open data to archival sites (gitlab2zenodo).

Barriers and challenges

We faced two main challenges. First, standardisation runs the risk of hemming in scientific decisions. To avoid this, we ensured that our standard is strict with regard to form, but flexible regarding content. Second, scientific data must be long-lasting, yet it is created through short-term funding. Our solution is to simplify the technical infrastructure, relying on archiving services (such as Zenodo or the Surrey Open Research Repository) and version management services (GitLab).

Benefits

The main benefit of our work is the availability of high-quality datasets and an improvement in data practices and longevity. Salvaged datasets have supported publications (e.g. Cathcart et al. 2022, Herce Calleja & Cathcart 2023), and international coordination from the inception of the Paralex standard has led to its worldwide uptake. The DeAR principles are being taught in data management graduate classes from Canberra to Uppsala. Finally, the gitlab2zenodo tool has been widely used, with 1,670 downloads to date. It is used by scientists from various disciplines beyond linguistics, such as immunology and agriculture.

Conclusion

We have led the open data transformation in our field through data-rescue operations, the creation of new data principles and a data standard, and the publication of open source tools. These initiatives have garnered positive feedback from the community and many early adopters. This work taught us that the adoption of sound data practices crucially hinges on good documentation and incentives. Moreover, we learnt the importance of international cooperation: there is no interoperability if research groups follow independent standards, and documenting the 7,000 languages of the world is not something a single team can achieve.

Authors

Sacha Beniamine, Surrey Morphology Group, SLL; FASS. ORCID: 0000-0003-2584-3576
Erich Round, Surrey Morphology Group, SLL; FASS. ORCID: 0000-0002-7533-8052
Helen Sims-Williams, Surrey Morphology Group, SLL; FASS. ORCID: 0000-0001-9895-5435
Greville Corbett, Surrey Morphology Group, SLL; FASS. ORCID: 0000-0002-2667-9870
Matthew Baerman, Surrey Morphology Group, SLL; FASS. ORCID: 0000-0002-3060-0643
Matteo Pellegrini, U. of Milano. ORCID: 0000-0003-4378-5824
- Visiting Fellow in the Surrey Morphology Group, SLL; FASS.

Mae Carroll, University of Melbourne. ORCID: 0000-0002-8419-0539
- Visiting Fellow in the Surrey Morphology Group, SLL; FASS.

References

Beniamine, S., Maiden, M., & Round, E. (2020). Opening the Romance Verbal Inflection Dataset 2.0: A CLDF lexicon. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3027–3035). Marseille, France: European Language Resources Association.
Beniamine, S., Anderson, C., Carroll, M., Guzmán Naranjo, M., Herce, B., Pellegrini, M., Round, E., Sims-Williams, H., & Tresoldi, T. (2023). Paralex: a DeAR standard for rich lexicons of inflected forms. Presentation at the International Symposium of Morphology, Nancy. https://www.paralex-standard.org. URL https://ismo2023.ovh/fichiers/abstracts/4_ISMO_2023_Paralex.pdf
Baerman, M., Brown, D., Evans, R., Corbett, G. G., Cahill, L., & Beniamine, S. (2023). Surrey Morphological Complexity Database. University of Surrey. http://dx.doi.org/10.15126/SMG.23/1
Cathcart, C., Herce, B., & Bickel, B. (2022). Decoupling speed of change and long-term preference in language evolution: insights from Romance verb stem alternations.
Herce Calleja, B., & Cathcart, C. (2023). Short vs long stem alternations in Romance verbal inflection: the S-morphome. Transactions of the Philological Society, Epub-ahead.
https://www.smg.surrey.ac.uk/approaches/open-research/

Featured Academics

Dr Sacha Beniamine

Leverhulme Early Career Fellow