case study
Published: 13 May 2024

Leading open data practices in linguistics

The Surrey Morphology Group (SMG) is a world-leading research group that has been creating open data documenting the world's languages for 30 years. In the past few years, we have focused our efforts on leading improved open data practices in our field.

Sacha Beniamine

Open research practices 

Our approach to modernising data practices in linguistic morphology has consisted of data-rescue operations for existing datasets, the creation of new data principles, a data standard, and a set of open source tools.

Data rescue 

Many databases created decades ago are at risk of disappearing. They may be trapped in legacy software (such as Flash Player), or are simply not being maintained. To remedy this situation, we salvage, reorganise, normalise, standardise, and openly release this vital data. The first two examples of this effort are the Romance Verbal Inflection Dataset 2.0 (Beniamine et al. 2020, http://dx.doi.org/10.5281/zenodo.3552167) and the Surrey Morphological Complexity Database (http://dx.doi.org/10.15126/SMG.23/1).

Data principles 

Data principles such as FAIR (namely, that data be Findable, Accessible, Interoperable and Reusable) and CARE (namely, Collective benefit, Authority to control, Responsibility, and Ethics) are crucial for open scientific data. Complementing these, we devised the novel DeAR principles to support researchers facing the colossal task of scientific documentation. We encourage a Decentralised strategy, based on international cooperation rather than centralising data in a single institution; Automated validation and verification of data to ensure its quality; and Revisable pipelines that make it easy to regenerate data presentations (such as websites) when the scientific data is updated.
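To make the "Automated validation" principle concrete, here is a minimal sketch in Python of the kind of check that could run every time a dataset is updated. It is an illustration only, not the group's actual tooling: the column names (form_id, lexeme, cell, phon_form), the function validate_forms, and the sample rows are all hypothetical, and the forms are invented placeholders rather than real language data.

```python
import csv
import io

# Hypothetical schema: these column names are assumptions for illustration.
REQUIRED_COLUMNS = ("form_id", "lexeme", "cell", "phon_form")


def validate_forms(rows):
    """Return a list of human-readable problems found in a table of forms."""
    problems = []
    seen_ids = set()
    # Data rows start on line 2 of the file, because line 1 is the header.
    for line_number, row in enumerate(rows, start=2):
        # Every required column must hold a non-empty value.
        for column in REQUIRED_COLUMNS:
            if not (row.get(column) or "").strip():
                problems.append(f"line {line_number}: missing value for '{column}'")
        # Identifiers must be unique across the whole table.
        form_id = (row.get("form_id") or "").strip()
        if form_id:
            if form_id in seen_ids:
                problems.append(f"line {line_number}: duplicate form_id '{form_id}'")
            seen_ids.add(form_id)
    return problems


# Invented placeholder data: the last row reuses an identifier and lacks a form.
SAMPLE = """form_id,lexeme,cell,phon_form
f1,kanta,prs.1sg,kanta
f2,kanta,prs.2sg,kantas
f2,kanta,prs.3sg,
"""

problems = validate_forms(csv.DictReader(io.StringIO(SAMPLE)))
for problem in problems:
    print(problem)
```

Run automatically on each change, a check of this kind reports problems before a new version of the data is released, rather than leaving them for users to discover.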

Data standard 

Targeting future datasets, we created a new data standard (Paralex, https://paralex-standard.org/, Beniamine et al. 2023). A standard is crucial for datasets to fit the FAIR and DeAR principles, and pivotal in enabling cross-language comparative work. Paralex promotes clean organisation of data and the adoption of common formal conventions. It builds on other initiatives, such as the Ontolex-Lemon model for ontologies and the Frictionless standard for metadata.

Data tools 

Finally, we are producing a suite of tools to facilitate the use of the Paralex standard, the user-friendly manipulation of datasets, the automatic generation of interactive static websites, and automated releases of open data to archival sites (gitlab2zenodo). 

Barriers and challenges 

There are many challenges. First, standardisation runs the risk of hemming in scientific decisions. To avoid this, we ensured our standard is strict with regard to form, but flexible regarding content. Second, scientific data must be long-lasting, but is created through short-term funding. Our solution is to simplify the technical infrastructure, relying on archiving services (such as Zenodo or the Surrey Open Research Repository) and version management services (GitLab).

Benefits 

The main benefit of our work is the availability of high-quality datasets and an improvement in data practices and longevity. Salvaged datasets have supported publications (e.g. Cathcart 2022, Herce 2023), and international coordination from the outset has led to worldwide uptake of the Paralex standard. The DeAR principles are being taught in data management graduate classes from Canberra to Uppsala. Finally, the gitlab2zenodo tool has been widely used, with 1,670 downloads to date. It is used by scientists from various disciplines beyond linguistics, such as immunology and agriculture.

Conclusion 

We have led the open data transformation in our field through data-rescue operations, the creation of new data principles and a data standard, and the publication of open source tools. These initiatives have garnered positive feedback from the community and many early adopters. This work taught us that the adoption of sound data practices crucially hinges on good documentation and incentives. Moreover, we learnt the importance of international cooperation: there is no interoperability if research groups follow independent standards, and documenting the 7,000 languages of the world cannot be achieved by a single team.

Authors 

  • Mae Carroll, University of Melbourne. ORCID: 0000-0002-8419-0539 
    • Visiting Fellow in the Surrey Morphology Group, SLL; FASS. 

References 
