Leading open data practices in linguistics
The Surrey Morphology Group (SMG) is a world-leading research group that has spent 30 years creating open data documenting the world's languages. In the past few years, we have focused our efforts on leading improved open data practices in our field.
Open research practices
Our approach to modernising data practices in linguistic morphology has four strands: data-rescue operations for existing datasets, the creation of new data principles, a data standard, and a set of open source tools.
Data rescue
Many databases created decades ago are at risk of disappearing. They may be trapped in legacy software (such as Flash Player), or are simply no longer maintained. To remedy this situation, we salvage, reorganise, normalise, standardise, and openly release this vital data. The first two examples of this effort are the Romance Verbal Inflection Dataset 2.0 (Beniamine et al. 2020, http://dx.doi.org/10.5281/zenodo.3552167) and the Surrey Morphological Complexity Database (http://dx.doi.org/10.15126/SMG.23/1).
Data principles
Data principles such as FAIR (that data be Findable, Accessible, Interoperable and Reusable) and CARE (Collective benefit, Authority to control, Responsibility & Ethics) are crucial for open scientific data. To complement these, we devised the novel DeAR principles, which support researchers facing the colossal task of scientific documentation. We encourage a Decentralised strategy, based on international cooperation rather than on centralising data in a single institution; Automated validation and verification of data to ensure its quality; and Revisable pipelines that make it easy to regenerate data presentations (such as websites) whenever the scientific data is updated.
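As an illustration of what automated validation can look like in practice, here is a minimal sketch in Python. It is not the group's own tooling: the column names, the checks, and the sample rows are assumptions made up for this example, loosely modelled on a table of inflected forms.

```python
import csv
import io

# Hypothetical required columns for a table of inflected forms
# (an assumption for this sketch, not an official specification).
REQUIRED_COLUMNS = ("form_id", "lexeme", "cell", "phon_form")


def validate_forms(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    seen_ids = set()
    # Line 1 of the file is the header, so data rows start at line 2.
    for line_no, row in enumerate(rows, start=2):
        # Check 1: every required column must have a non-empty value.
        for col in REQUIRED_COLUMNS:
            if not (row.get(col) or "").strip():
                problems.append(f"line {line_no}: missing value for '{col}'")
        # Check 2: form identifiers must be unique across the table.
        form_id = row.get("form_id")
        if form_id:
            if form_id in seen_ids:
                problems.append(f"line {line_no}: duplicate form_id '{form_id}'")
            seen_ids.add(form_id)
    return problems


# Made-up sample data containing two deliberate errors on the last row.
SAMPLE = """form_id,lexeme,cell,phon_form
f1,cantar,prs.1sg,kanto
f2,cantar,prs.2sg,kantas
f2,cantar,prs.3sg,
"""

problems = validate_forms(csv.DictReader(io.StringIO(SAMPLE)))
for problem in problems:
    print(problem)
```

Run automatically on every change to a dataset (for example in a continuous integration job), a check of this kind reports errors before a release is made, rather than leaving them for users to discover.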
Data standard
For future datasets, we created a new data standard (Paralex, https://paralex-standard.org/, Beniamine et al. 2023). A standard is crucial for datasets to fit the FAIR and DeAR principles, and pivotal in enabling cross-language comparative work. Paralex promotes clean organisation of data and the adoption of common formal conventions. It builds on other initiatives, such as the Ontolex-Lemon model for ontologies and the Frictionless standard for metadata.
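To give a flavour of machine-readable metadata, the sketch below builds a small descriptor in the general shape of a Frictionless Data Package (a `resources` list, each with a `path` and a table `schema`). The dataset name, file name, and columns are hypothetical choices for this example and should not be read as the official Paralex specification.

```python
import json

# A minimal, hypothetical descriptor: one CSV table and its schema.
descriptor = {
    "name": "example-paradigms",  # hypothetical dataset name
    "resources": [
        {
            "name": "forms",
            "path": "forms.csv",  # hypothetical file name
            "schema": {
                "fields": [
                    {"name": "form_id", "type": "string",
                     "constraints": {"required": True, "unique": True}},
                    {"name": "lexeme", "type": "string",
                     "constraints": {"required": True}},
                    {"name": "cell", "type": "string",
                     "constraints": {"required": True}},
                    {"name": "phon_form", "type": "string"},
                ],
                "primaryKey": "form_id",
            },
        }
    ],
}


def required_fields(descriptor, resource_name):
    """List the names of fields marked as required in one resource."""
    for resource in descriptor["resources"]:
        if resource["name"] == resource_name:
            return [
                field["name"]
                for field in resource["schema"]["fields"]
                if field.get("constraints", {}).get("required", False)
            ]
    raise KeyError(f"no resource named {resource_name!r}")


# Serialise the descriptor as JSON, as it would be stored next to the data.
print(json.dumps(descriptor, indent=2))
print(required_fields(descriptor, "forms"))
```

Because the description of each table lives in a plain JSON file alongside the data, both people and programs can discover what each column means without opening the data itself.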
Data tools
Finally, we are producing a suite of tools to facilitate the use of the Paralex standard, the user-friendly manipulation of datasets, the automatic generation of interactive static websites, and automated releases of open data to archival sites (gitlab2zenodo).
Barriers and challenges
There are many challenges. First, standardisation runs the risk of constraining scientific decisions. To avoid this, we ensured that our standard is strict with regard to form, but flexible regarding content. Second, scientific data must be long-lasting, yet it is created through short-term funding. Our solution is to simplify the technical infrastructure, relying on archiving services (such as Zenodo or the Surrey Open Research Repository) and version management services (GitLab).
Benefits
The main benefits of our work are the availability of high-quality datasets and an improvement in data practices and longevity. Salvaged datasets have supported publications (e.g. Cathcart 2022, Herce 2023), and international coordination from the outset has led to world-wide uptake of the Paralex standard. The DeAR principles are being taught in graduate data management classes from Canberra to Uppsala. Finally, the gitlab2zenodo tool has been widely used, with 1,670 downloads to date, including by scientists in disciplines beyond linguistics, such as immunology and agriculture.
Conclusion
We have led the open data transformation in our field through data-rescue operations, the creation of new data principles and a new standard, and the publication of open source tools. These initiatives have garnered positive feedback from the community and many early adopters. This work taught us that the adoption of sound data practices hinges crucially on good documentation and incentives. Moreover, we learnt the importance of international cooperation: there is no interoperability if research groups follow independent standards, and documenting the 7,000 languages of the world is not something a single team can achieve.
Authors
- Sacha Beniamine, Surrey Morphology Group, SLL; FASS. ORCID: 0000-0003-2584-3576
- Erich Round, Surrey Morphology Group, SLL; FASS. ORCID: 0000-0002-7533-8052
- Helen Sims-Williams, Surrey Morphology Group, SLL; FASS. ORCID: 0000-0001-9895-5435
- Greville Corbett, Surrey Morphology Group, SLL; FASS. ORCID: 0000-0002-2667-9870
- Matthew Baerman, Surrey Morphology Group, SLL; FASS. ORCID: 0000-0002-3060-0643
- Matteo Pellegrini, U. of Milano; Visiting Fellow in the Surrey Morphology Group, SLL; FASS. ORCID: 0000-0003-4378-5824
- Mae Carroll, University of Melbourne; Visiting Fellow in the Surrey Morphology Group, SLL; FASS. ORCID: 0000-0002-8419-0539
References
- Sacha Beniamine, Martin Maiden, and Erich Round. 2020. Opening the Romance Verbal Inflection Dataset 2.0: A CLDF lexicon. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3027–3035, Marseille, France. European Language Resources Association.
- Beniamine, S., Anderson, C., Carroll, M., Guzmán Naranjo, M., Herce, B., Pellegrini, M., Round, E., Sims-Williams, H., & Tresoldi, T. (2023). Paralex: a DeAR standard for rich lexicons of inflected forms. Presentation at the International Symposium of Morphology, Nancy. https://www.paralex-standard.org. URL https://ismo2023.ovh/fichiers/abstracts/4_ISMO_2023_Paralex.pdf
- Baerman, Matthew, Dunstan Brown, Roger Evans, Greville G. Corbett, Lynne Cahill & Sacha Beniamine. 2023. Surrey Morphological Complexity Database. University of Surrey. http://dx.doi.org/10.15126/SMG.23/1
- Cathcart, C., Herce, B., & Bickel, B. (2022). Decoupling speed of change and long-term preference in language evolution: insights from Romance verb stem alternations.
- Herce Calleja, B., & Cathcart, C. (2023). Short vs long stem alternations in Romance verbal inflection: the S-morphome. Transactions of the Philological Society, Epub-ahead.
- https://www.smg.surrey.ac.uk/approaches/open-research/