Dr Sacha Beniamine

Leverhulme Early Career Fellow

+44 1483 682619

https://sacha.beniamine.net

Academic and research departments

About

Biography

I am a Leverhulme Early Career Fellow in the Surrey Morphology Group (SMG), a research centre based in the School of Literature & Languages. Before joining Surrey in February 2021 as a British Academy Newton International Fellow, I was a post-doctoral researcher at the Department of Linguistic and Cultural Evolution (DLCE) of the Max Planck Institutes in Jena and Leipzig. I did my PhD in the Laboratoire de Linguistique Formelle (LLF) of the University of Paris.

Areas of specialism

Linguistics; Typology; Morphology; Computational linguistics; Language change

My qualifications

2012

BA in Language Sciences & Natural Language Processing

Université de Paris (Paris 7)

2014

MA in Language Sciences & Natural Language Processing

Université de Paris (Paris 7)

2018

PhD in Linguistics

Université de Paris (Paris 7)

News

13 MAY 2024

Leading open data practices in linguistics

19 JUL 2022

Two decades of Open Data for language diversity at the Surrey Morphology Group

Research

Research interests

My research in computational linguistics focuses on language evolution and typology. The title of my current project is: Solving the word puzzle: morphological analysis beyond stem and affixes. I see computational tools as an opportunity to systematize linguistic analyses, a way to study large amounts of data precisely, and a necessary methodological step towards typological investigation.

Before joining the SMG, I was a post-doctoral researcher at the DLCE (MPG EVA), where I worked on inflectional lexicons, evolutionary models of inflectional paradigms and sound correspondence. During my PhD, I studied the typological variation of inflection classes (declensions or conjugations) using computational methods.

Research projects

Solving the word puzzle: morphological analysis beyond stem and affixes

In the few milliseconds necessary for speakers to say a word and for listeners to understand it, they both make several elaborate deductions. The internal structure of words can be a crucial source of information for these deductions, particularly when words have multiple grammatical forms, a process known as inflection. Across languages, the nature and number of contrasts expressed through inflection can vary greatly. While a language such as English has only a handful of grammatical distinctions, some languages can have up to thousands. Moreover, these distinctions can be manifested by diverse, intricate sound contrasts. For example, the verbal system of English would be simple if all verbs conformed to the pattern of jump~jumped, which can be neatly segmented into a stem (jump) and an affix (-ed). But across languages, many words behave more like the pair think~thought, which resists segmentation. In many languages, layers of regularity and idiosyncrasy further complicate the matter. Understanding the puzzling complexity of inflection is essential to explain the structure and evolution of the world's languages. Yet, linguistics still lacks a consistent, predictable methodology to study inflection.

To assess inflectional complexity across languages, this project investigates word structures across typologically diverse languages, using quantitative, computational tools.

Current studies in this area have two main – but related – shortcomings. First, they often start from pre-analysed paradigms, where forms have been segmented by hand into stems (removed from the data) and affixes. These affixal tables are not commensurate across languages. Second, studies focus on assessing how difficult it is for speakers to predict forms for a given meaning, and ignore the parallel problem of deducing the grammatical meaning of a given form. This question is key to automating word structure analysis.

The project remedies both by providing data, developing computational tools to analyse inflected words, and studying the organisation of inflectional exponence. We work on gathering, digitising, and standardising inflectional lexicons, coordinating with the international morphology community to spread the use of common standards and ensure interoperability. To solve the long-standing Segmentation Problem, we write computational tools which focus on characterising gradient information in words. Finally, our goal is to build a quantitative typology of inflected word structure.

Publications

Sacha Beniamine, Dunstan Patrick Brown, Matías Guzmán Naranjo, Andrea Sims (2025) Zalilex: Russian Nominal Paradigms Lexicon Zenodo

DOI: 10.5281/zenodo.15235589

This inflected lexicon is extracted from the digitized version of Zaliznyak's dictionary.

Cormac Anderson, Sacha Beniamine, Theodorus Fransen (2024) Goidelex: A Lexical Resource for Old Irish Zenodo

DOI: 10.5281/zenodo.10898227

Goidelex is an openly accessible relational database in CSV format, linked by formal relationships. The launch version documents 695 headwords with extensive linguistic annotations, including orthographic forms using a normalised orthography, automatically generated phonemic transcriptions, and information about morphosyntactic features, such as gender, inflectional class, etc. Metadata in JSON format, following the Frictionless standard, provides detailed descriptions of the tables and dataset. The database is designed to be fully compatible with the Paralex and CLDF standards and is interoperable with existing lexical resources for Old Irish such as CorPH and DIL. It is suited to both qualitative and quantitative investigation into Old Irish morphology and lexicon, as well as to comparative research.

Sacha Beniamine, Olivier Bonami, Maria Copot (2025) Morphologie implicative et conjugaison du français, In: Langue Française 4/2025(228) pp. 23-39

This article documents the shift in studying French conjugation from Bonami & Boyé’s (2003) thematic approach to contemporary implicative approaches. Given the structural problems with the thematic approach, the implicative approach takes an abstractive and probabilistic perspective that quantifies predictive relationships between inflected forms. Two empirical studies validate this theoretical framework: a computational analysis using QUMIN and VLEXIQUE2 employs information theory to confirm linguistic intuitions about French paradigmatic organization, while a behavioural study shows that speakers are sensitive to the implicative relationships the theory predicts. The alignment between computational formalization, linguistic description, and behavioural evidence establishes the implicative approach as a unified framework for understanding morphological organization.

Mae Carroll, Sacha Beniamine (2025) Exponence and the theory of discriminative information in paradigms, In: Morphology 35(2) pp. 227-269 Springer Nature

DOI: 10.1007/s11525-025-09437-2

Many linguistic theories focus on producing utterances from an economical set of abstract units and/or devices. There exists however a contrasting perspective: that of comprehension, where the key matter is to discriminate meanings from surface observations. In the case of inflectional morphology, we show that the two perspectives are substantively different: they face different challenges and result in different analyses. In comprehension, the question of exponence can be phrased as the Paradigm Cell Recognition Problem: what recurrent patterns in words could language users use to discriminate inflectional meanings? We provide a formal and implemented theory of exponence from this perspective. The resulting units (formatives) do not coincide with traditional morphemic segmentations, but rather, constitute the smallest discriminative cues which speakers could attend to for the comprehension task. We show that starting from the perspective of discriminative information, very simple principles lead to unambiguous analyses in terms of both segmentation and meaning. This new theory is especially promising for cross-linguistic research as it results in deterministic, comparable analyses across languages, and is defined independently of any model of cognition.

Emily Lindsay-Smith, Matthew Baerman, Sacha Beniamine, Helen Sims-Williams, Erich R. Round (2024) Analogy in Inflection, In: Annual Review of Linguistics 10(1)

DOI: 10.1146/annurev-linguistics-030521-040935

Analogy has returned to prominence in the field of inflectional morphology as a basis for new explanations of inflectional productivity. Here we review the rising profile of analogy, identifying key theoretical and methodological developments, areas of success, and priorities for future work. In morphological theory, work within so-called abstractive approaches places analogy at the center of productive processes, though significant conceptual and technical details remain to be settled. The computational modeling of inflectional analogy has a rich and diverse history, and attention is now increasingly directed to understanding inflectional systems through their internal complexity and cross-linguistic diversity. A tension exists between the prima facie promise of analogy to lead to new explanations and its relative lack of theoretical articulation. We bring this to light as we examine questions regarding inflectional defectiveness and whether analogy is reducible to grammar optimization resulting from simplicity biases in learning and language use.

Sacha Beniamine, Mari Aigro, Matthew Baerman, Jules Bouton, Maria Copot (2024) Eesthetic: Estonian Paradigms in Phonemic Notation Zenodo

DOI: 10.5281/zenodo.8383522

Eesthetic is a collection of Estonian verbal and nominal paradigms, in phonemic and orthographic notation. They are suited for both computational and manual analysis. The dataset conforms to the Paralex standard.

Erich Round, Sacha Beniamine, Louise Esher (2021) Spontaneous emergence of inflectional class systems via attraction–repulsion dynamics HAL CCSD

Inflectional classes are ubiquitous in the world’s inflectional systems, but where do they come from? We introduce a simple, computational iterated learning model in which inflectional classes emerge spontaneously; this contrasts with a prominent earlier model (Ackerman & Malouf 2015) which reliably evolves orderliness and eventual uniformity, but never stable inflectional classes. Inflectional classes are a ‘morphomic’ morphological-organisational structure, mediating the mapping between content and form in inflectional systems. In natural language, they are common, productive, psychologically real for speakers (Enger 2014, Maiden 2018), and limit the complexity of the inflectional system by increasing the systematicity of exponent distribution (cf. Carstairs-McCarthy 2010, Round 2015, Blevins 2016). We contribute to ongoing debate over the dynamics that could lead to inflectional class structure (Maiden 2018, Carstairs-McCarthy 2010), by identifying a key evolutionary ingredient: change based on dissimilarity. Since attraction-only models (in which lexemes only grow more similar to each other) inevitably remove all variation, they cannot evolve the stable, structured diversity characteristic of inflectional systems; by contrast, models with both an attraction and repulsion dynamic enable stable, morphome-like structure to emerge consistently. A model implementing a simple paradigm cell filling task (Ackerman, Blevins & Malouf 2009) is described in Ackerman and Malouf (2015) and illustrated in Figure 1. Within this attraction-only model, lexemes only ever change to be more like others. The core dynamic is one of preferential attraction towards exponents that are already more frequent than their competitors, ensuring that all lexemes eventually converge on a single class. Thus, the model exhibits self-organisation, but only of a radically homogenising kind. We investigated a family of minimally different dynamics, introducing modulable parameters for the paradigm cell filling task: prediction based on multiple forms (Stump & Finkel 2013, Bonami & Beniamine 2016); frequency weighting (Blevins, Milin et al. 2016); and a repulsion dynamic, by which already dissimilar lexemes can increase in dissimilarity.

Micha Elsner, Sacha Beniamine (2024) Computational approaches to morphological typology, In: Journal of Language Modelling 12(2)

DOI: 10.15398/jlm.v12i2.431

Introduction to the Special Issue.

Erich Round, Louise Esher, Sacha Beniamine (2024) The natural stability of autonomous morphology: how an attraction–repulsion dynamic emerges from paradigm cell filling, In: Morphology (Dordrecht)

DOI: 10.1007/s11525-024-09433-y

Autonomous morphology, such as inflection class systems and paradigmatic distribution patterns, is widespread and diachronically resilient in natural language. Why this should be so has remained unclear given that autonomous morphology imposes learning costs, offers no clear benefit relative to its absence and could easily be removed by the analogical forces which are constantly reshaping it. Here we propose an explanation for the resilience of autonomous morphology, in terms of a diachronic dynamic of attraction and repulsion between morphomic categories, which emerges spontaneously from a simple paradigm cell filling process. Employing computational evolutionary models, our key innovation is to bring to light the role of ‘dissociative evidence’, i.e., evidence for inflectional distinctiveness which a rational reasoner will have access to during analogical inference. Dissociative evidence creates a repulsion dynamic which prevents morphomic classes from collapsing together entirely, i.e., undergoing complete levelling. As we probe alternative models, we reveal the limits of conditional entropy as a measure for predictability in systems that are undergoing change. Finally, we demonstrate that autonomous morphology, far from being ‘unnatural’, is rather the natural (emergent) consequence of a natural (rational) process of inference applied to inflectional systems.

Olivier Bonami, Sarah Beniamine (2016) Joint predictiveness in inflectional paradigms, In: Word structure 9(2) pp. 156-182 Edinburgh Univ Press

DOI: 10.3366/word.2016.0092

This paper contributes to addressing the Paradigm Cell Filling Problem (PCFP) in inflectional paradigms, as defined by Ackerman et al. (2009). We define a method for extending the use of conditional entropy to address the PCFP to prediction based on multiple paradigm cells. We apply this method to French and European Portuguese and show that, on average, knowledge of multiple paradigm cells is dramatically more predictive than knowledge of a single cell. Moreover, this new entropy measure proves useful in studying principal parts systems, which correspond to sets of predictors yielding a null entropy. Using a graded measure allows us to highlight the relevance of non-categorical or "good enough" principal parts systems.
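As a rough illustration of the idea in this abstract, the sketch below estimates the conditional entropy of one paradigm cell given one or several predictor cells. It is a simplified toy, not the paper's implementation: the published method works on alternation patterns between forms, whereas here each cell simply holds a categorical label. The function name `conditional_entropy`, the cell names `A`, `B`, `C`, and the four-lexeme lexicon are all hypothetical.

```python
from collections import Counter
from math import log2


def conditional_entropy(rows, predictors, target):
    """Estimate H(target | predictors) in bits from type counts.

    rows: list of dicts mapping cell names to (hypothetical) exponent labels.
    predictors: list of cell names assumed to be known.
    target: name of the cell to be predicted.
    """
    n = len(rows)
    # Count joint occurrences of (predictor values, target value).
    joint = Counter((tuple(r[p] for p in predictors), r[target]) for r in rows)
    # Count occurrences of the predictor values alone.
    context = Counter(tuple(r[p] for p in predictors) for r in rows)
    return -sum(
        (count / n) * log2(count / context[ctx])
        for (ctx, _), count in joint.items()
    )


# Hypothetical toy lexicon: cell C is unpredictable from A or B alone,
# but fully determined by A and B together.
LEXICON = [
    {"A": "x", "B": "p", "C": "1"},
    {"A": "x", "B": "q", "C": "2"},
    {"A": "y", "B": "p", "C": "2"},
    {"A": "y", "B": "q", "C": "1"},
]

h_single = conditional_entropy(LEXICON, ["A"], "C")
h_joint = conditional_entropy(LEXICON, ["A", "B"], "C")
print(h_single, h_joint)
```

In this toy lexicon, knowing cell `A` alone leaves one bit of uncertainty about `C`, while knowing `A` and `B` jointly leaves none, so `{A, B}` behaves like a principal parts system for `C` in the sense of a predictor set with null entropy.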

Erich R. Round, Jayden L. Macklin-Cordes, T. Mark Ellison, Sacha Beniamine (2020) Automated Parsing of Interlinear Glossed Text From Page Images of Grammatical Descriptions, In: N Calzolari, F Bechet, P Blache, K Choukri, C Cieri, T Declerck, S Goggi, H Isahara, B Maegaard, J Mariani, H Mazo, A Moreno, J Odijk, S Piperidis (eds.), Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020) pp. 2878-2883 European Language Resources Assoc-Elra

Linguists seek insight from all human languages; however, accessing information from most of the full store of extant global linguistic descriptions is not easy. One of the most common kinds of information that linguists have documented is vernacular sentences, as recorded in descriptive grammars. Typically these sentences are formatted as interlinear glossed text (IGT). Most descriptive grammars, however, exist only as hardcopy or scanned PDF documents. Consequently, parsing IGTs in scanned grammars is a priority, in order to significantly increase the volume of documented linguistic information that is readily accessible. Here we demonstrate fundamental viability for a technology that can assist in making a large number of linguistic data sources machine readable: the automated identification and parsing of interlinear glossed text from scanned page images. For example, we attain high median precision and recall (>0.95) in the identification of example sentences in IGT format. Our results will be of interest to those who are keen to see more of the existing documentation of human language, especially for less-resourced and endangered languages, become more readily accessible.

Sacha Beniamine, Martin Maiden, Erich Round (2020) Opening the Romance Verbal Inflection Dataset 2.0: a CLDF Lexicon, In: N Calzolari, F Bechet, P Blache, K Choukri, C Cieri, T Declerck, S Goggi, H Isahara, B Maegaard, J Mariani, H Mazo, A Moreno, J Odijk, S Piperidis (eds.), Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020) pp. 3027-3035 European Language Resources Assoc-Elra

We introduce the Romance Verbal Inflection Dataset 2.0, a multilingual lexicon of Romance inflection covering 74 varieties. The lexicon provides verbal paradigm forms in broad IPA phonemic notation. Both lexemes and paradigm cells are organized to reflect cognacy. Such multilingual inflected lexicons annotated for two dimensions of cognacy are necessary to study the evolution of inflectional paradigms, and test linguistic hypotheses systematically. However, these resources seldom exist, and when they do, they are not usually encoded in computationally usable ways. The Oxford Online Database of Romance Verb Morphology provides this kind of information; however, it is no longer maintained and is only available as a web service without interfaces for machine-readability. We collect its data and clean and correct it for consistency using both heuristics and expert annotator judgements. Most resources used to study language evolution computationally rely strictly on multilingual contemporary information, and lack information about prior stages of the languages. To provide such information, we augmented the database with Latin paradigms from the LatInfLexi lexicon. Finally, to make it widely available, the resource is released under a GPLv3 license in CLDF format.

Sacha Beniamine, Olivier Bonami, Ana R. Luís (2021) The fine implicative structure of European Portuguese conjugation, In: Isogloss. Open Journal of Romance Linguistics 7(9) pp. 1-35

DOI: 10.5565/rev/isogloss.109

Recent literature has highlighted the extent to which inflectional paradigms are organised into systems of implications allowing speakers to make full use of the inflection system on the basis of exposure to only a few forms of each word. The present paper contributes to this line of research by investigating in detail the implicative structure of European Portuguese verbal paradigms. After outlining the computational methods we use to that effect, we deploy these methods on a lexicon of about 5000 verbs, and show how the morphological and phonological properties of European Portuguese verbs lead to the observed patterns of predictability.

Olivier Bonami, Sacha Beniamine (2021) Leaving the stem by itself, In: All Things Morphology: Its independence and its interfaces pp. 81-98 John Benjamins

DOI: 10.1075/cilt.353.05bon

Stem allomorphy plays a central role in the recent history of morphology, in no small part thanks to a research program initiated by Aronoff (1994). Yet, there is no agreed upon way of deciding whether some bit of form should be considered a proper part of a stem allomorph or an independent exponent. We explore the possibility of just doing away with the notion of stem allomorphy in inflection. We use computational methods to identify within each word a sequence of strings that do not take part in any alternation within that word’s paradigm. We then discuss the relationship of such sequences to the classical notion of a stem, and argue that discontinuous stems are both conceptually and empirically more satisfactory.

Sacha Beniamine (2021) One lexeme, many classes: Inflection class systems as lattices, In: One-to-many-relations in morphology, syntax, and semantics pp. 23-51 Language Science Press

This paper discusses the nature of inflection classes (ICs) and provides a fully implemented methodology to conduct typological investigations into their structure. ICs (conjugations or declensions) are sets of lexemes which inflect similarly. They are often described as partitioning the set of lexemes, but similarities across classes lead some authors to favor hierarchical descriptions. While some formalisms allow for multiple inheritance, where one class takes after two or more others, it is usually taken as an exceptional situation. I submit that the structure of ICs is a typological property of inflectional systems. As a result, ICs are best modelled as semi-lattices, which by design capture non-canonical phenomena. I show how these monotonous multiple inheritance hierarchies can be inferred automatically from raw paradigms using alternation patterns and formal concept analysis. Using quantitative measures of canonicity, I compare six inflectional systems and show that multiple inheritance is in fact pervasive across inflectional systems.

Fernando Perdigão, Sacha Beniamine, Ana R. Luís, Olivier Bonami (2021) European Portuguese Verbal Paradigms in Phonemic Notation Zenodo

DOI: 10.5281/zenodo.5121543

This is a collection of European Portuguese verbal paradigms, in phonemic notation. They are suited for both computational and manual analysis.

Sacha Beniamine, Matías Guzmán Naranjo (2021) Multiple alignments of inflectional paradigms, In: Proceedings of the Society for Computation in Linguistics 4

DOI: 10.7275/ymc0-p491

Most models of inflectional morphology rely at their core on the identification of recurrent and diverging material across inflected forms. Across theoretical frameworks, this can be expressed in terms of morpheme segmentation, rules, processes, patterns or analogies. Finding these recurrences in large structured lexicons is an important step in empirical computational morphology, where analyses are induced bottom-up from inflected forms. This can be done by aligning all the forms in each paradigm, a task of Multiple Sequence Alignments which is well known in other fields such as evolutionary biology and historical linguistics. In this paper, we present the specific problems which arise when aligning inflected forms, provide a simple alignment format, define evaluation measures and compare two implemented methods on 13 inflectional lexicons. Our intent is to provide the conditions for the inter-operability of future systems, and for incremental improvements in this fundamental step for quantitative morphology.

Sacha Beniamine, Olivier Bonami (2019) Segmentation in morphology: wh-en, wh-ere, how?

Erich R Round, Sacha Beniamine, Louise Esher The role of attraction-repulsion dynamics in simulating the emergence of inflectional class systems

DOI: 10.48550/arxiv.2111.08465

Dynamic models of paradigm change can elucidate how the simplest of processes may lead to unexpected outcomes, and thereby can reveal new potential explanations for observed linguistic phenomena. Ackerman & Malouf (2015) present a model in which inflectional systems reduce in disorder through the action of an attraction-only dynamic, in which lexemes only ever grow more similar to one another over time. Here we emphasise that: (1) Attraction-only models cannot evolve the structured diversity which characterises true inflectional systems, because they inevitably remove all variation; and (2) Models with both attraction and repulsion enable the emergence of systems that are strikingly reminiscent of morphomic structure such as inflection classes. Thus, just one small ingredient -- change based on dissimilarity -- separates models that tend inexorably to uniformity, and which therefore are implausible for inflectional morphology, from those which evolve stable, morphome-like structure. These models have the potential to alter how we attempt to account for morphological complexity.

Sacha Beniamine, Olivier Bonami (2022) Inflection class systems

Stephen Mann, Sacha Beniamine, Emily Lindsay-Smith, Louise Esher, Matt Spike, Erich Ross Round (2022) Cognition and the stability of evolving complex morphology: an agent-based model, In: The Evolution of Language: Proceedings of the Joint Conference on Language Evolution (JCoLE) pp. 635-642 Joint Conference on Language Evolution (JCoLE)

Cultural attractors enable evolving cultural traits to gain the stability that underpins cumulative cultural evolution, yet the conditions that support their existence are poorly understood. We examine conditions affecting the stability of a salient kind of complex cultural attractor in human language, known as inflectional classes. We present a model of the evolution of inflectional classes, as they are reconstructed across generations via a combination of direct transmission and analogical inference. Parameters examined pertain to diversity of the lexicon and the cognitive policies governing inferential reasoning. We discover that persistence of stable inflection classes interacts in complex ways with features which affect how inflection classes are inferred. Thus we contribute to a greater understanding of factors affecting cultural attractors' existence, and to insights into a widespread and complex trait of human language.

Additional publications

A full list of my publications is available on my personal site.