Dr Tom Thorne
Academic and research departments
Centre for Mathematical and Computational Biology, Computer Science Research Centre, School of Computer Science and Electronic Engineering.
About
Biography
I am currently the Programme Lead of the MSc Data Science in the School of Computer Science and Electronic Engineering.
I joined Computer Science at the University of Surrey in June 2020 as a Senior Lecturer, having previously been a lecturer at the University of Reading. Prior to that I was a Safra research fellow in the Division of Brain Sciences at Imperial College London and a Chancellor's Fellow in the School of Informatics at the University of Edinburgh. I originally studied Computer Science at King's College Cambridge before taking an MSc and PhD in Bioinformatics at Imperial College London.
My research applies statistical methods to analysing large biological data sets, especially in learning networks and studying their structure. You can read more about my current research interests under the research section.
Areas of specialism
University roles and responsibilities
- Programme lead of the MSc Data Science
I am open to applications from funded PhD students with interests in statistical methods in Systems Biology.
Research
Research interests
My research focusses on statistical methods in Systems Biology and Bioinformatics, including the inference of biological networks, and the statistical analysis of networks. I am also interested in the use of approximate Bayesian methods for complex models and large data sets, and the acceleration of statistical methods with GPUs.
Recent research has involved developing tools for the analysis of single cell transcriptomic data, which provides information about the heterogeneity of gene expression across a population of cells. Another area of interest is comparing data sets between different conditions to learn differential networks that jointly infer network structures across data sets but also allow differences to be identified.
Teaching
I am currently leading the teaching team of COMM054 Data Science Principles and Practices on the MSc Data Science at the University of Surrey. This module introduces the fundamental concepts in probability and statistics that provide a solid background for students studying data science.
Publications
Understanding how genetically encoded rules drive and guide complex neuronal growth processes is essential to comprehending the brain's architecture, and agent-based models (ABMs) offer a powerful simulation approach to further develop this understanding. However, accurately calibrating these models remains a challenge. Here, we present a novel application of Approximate Bayesian Computation (ABC) to address this issue. ABMs are based on parametrized stochastic rules that describe the time evolution of small components -- the so-called agents -- discretizing the system, leading to stochastic simulations that require appropriate treatment. Mathematically, the calibration defines a stochastic inverse problem. We propose to address it in a Bayesian setting using ABC. We facilitate the repeated comparison between data and simulations by quantifying the morphological information of single neurons with so-called morphometrics and resort to statistical distances to measure discrepancies between populations thereof. We conduct experiments on synthetic as well as experimental data. We find that ABC utilizing Sequential Monte Carlo sampling and the Wasserstein distance finds accurate posterior parameter distributions for representative ABMs. We further demonstrate that these ABMs capture specific features of pyramidal cells of the hippocampus (CA1). Overall, this work establishes a robust framework for calibrating agent-based neuronal growth models and opens the door for future investigations using Bayesian techniques for model building, verification, and adequacy assessment.
Background: Gene interaction networks are graphs in which nodes represent genes and edges represent functional interactions between them. These interactions can be at multiple levels, for instance, gene regulation, protein-protein interaction, or metabolic pathways. To analyse gene interaction networks at a large scale, gene co-expression network analysis is often applied on high-throughput gene expression data such as RNA sequencing data. With the advance in sequencing technology, expression of genes can be measured in individual cells. Single-cell RNA sequencing (scRNAseq) provides insights into cellular development, differentiation and characteristics at the transcriptomic level. High sparsity and high-dimensional data structures pose challenges in scRNAseq data analysis. Results: In this study, a sparse inverse covariance matrix estimation framework for scRNAseq data is developed to capture direct functional interactions between genes. Comparative analyses highlight high performance and fast computation of Stein-type shrinkage in high-dimensional data using simulated scRNAseq data. Data transformation approaches also show improvement in performance of shrinkage methods in non-Gaussian distributed data. Zero-inflated modelling of scRNAseq data based on a negative binomial distribution enhances shrinkage performance in zero-inflated data without interference on non zero-inflated count data. Conclusion: The proposed framework broadens application of graphical models in scRNAseq analysis with flexibility in sparsity of count data resulting from dropout events, high performance, and fast computational time. Implementation of the framework is in a reproducible Snakemake workflow https://github.com/calathea24/ZINBGraphicalModel and R package ZINBStein https://github.com/calathea24/ZINBStein.
Approximate Bayesian computation (ABC) is a well-established family of Monte Carlo methods for performing approximate Bayesian inference in the case where an "implicit" model is used for the data: when the data model can be simulated, but the likelihood cannot easily be pointwise evaluated. A fundamental property of standard ABC approaches is that the number of Monte Carlo points required to achieve a given accuracy scales exponentially with the dimension of the data. Prangle et al. (2018) propose a Markov chain Monte Carlo (MCMC) method that uses a rare event sequential Monte Carlo (SMC) approach to estimating the ABC likelihood that avoids this exponential scaling, and thus allows ABC to be used on higher dimensional data. This paper builds on the work of Prangle et al. (2018) by using the rare event SMC approach within an SMC algorithm, instead of within an MCMC algorithm. The new method has a similar structure to SMC² (Chopin et al., 2013), and requires less tuning than the MCMC approach. We demonstrate the new approach, compared to existing ABC-SMC methods, on a toy example and on a duplication-divergence random graph model used for modelling protein interaction networks.
Altered lipid metabolism is a feature of chronic inflammatory disorders. Increased plasma lipids and lipoproteins have been associated with multiple sclerosis (MS) disease activity. Our objective was to characterise the specific lipids and associated plasma lipoproteins increased in MS and to test for an association with disability. Plasma samples were collected from 27 RRMS patients (median EDSS, 1.5, range 1–7) and 31 healthy controls. Concentrations of lipids within lipoprotein sub-classes were determined from NMR spectra. Plasma cytokines were measured using the MesoScale Discovery V-PLEX kit. Associations were tested using multivariate linear regression. Differences between the patient and volunteer groups were found for lipids within VLDL and HDL lipoprotein sub-fractions (p < 0.05). Multivariate regression demonstrated a high correlation between lipids within VLDL sub-classes and the Expanded Disability Status Scale (EDSS) (p < 0.05). An optimal model for EDSS included free cholesterol carried by VLDL-2, gender and age (R² = 0.38, p < 0.05). Free cholesterol carried by VLDL-2 was highly correlated with plasma cytokines CCL-17 and IL-7 (R² = 0.78, p < 0.0001). These results highlight relationships between disability, inflammatory responses and systemic lipid metabolism in RRMS. Altered lipid metabolism with systemic inflammation may contribute to immune activation.
Pathogenic microbes exist in dynamic niches and have evolved robust adaptive responses to promote survival in their hosts. The major fungal pathogens of humans, Candida albicans and Candida glabrata, are exposed to a range of environmental stresses in their hosts including osmotic, oxidative and nitrosative stresses. Significant efforts have been devoted to the characterization of the adaptive responses to each of these stresses. In the wild, cells are frequently exposed simultaneously to combinations of these stresses and yet the effects of such combinatorial stresses have not been explored. We have developed a common experimental platform to facilitate the comparison of combinatorial stress responses in C. glabrata and C. albicans. This platform is based on the growth of cells in buffered rich medium at 30°C, and was used to define relatively low, medium and high doses of osmotic (NaCl), oxidative (H2O2) and nitrosative stresses (e.g., dipropylenetriamine (DPTA)-NONOate). The effects of combinatorial stresses were compared with the corresponding individual stresses under these growth conditions. We show for the first time that certain combinations of combinatorial stress are especially potent in terms of their ability to kill C. albicans and C. glabrata and/or inhibit their growth. This was the case for combinations of osmotic plus oxidative stress and for oxidative plus nitrosative stress. We predict that combinatorial stresses may be highly significant in host defences against these pathogenic yeasts.
Sentiment analysis is one of the key tasks of natural language understanding. Sentiment Evolution models the dynamics of sentiment orientation over time. It can help people have a more profound and deep understanding of opinion and sentiment implied in user generated content. Existing work mainly focuses on sentiment classification, while the analysis of how the sentiment orientation of a topic has been influenced by other topics or the dynamic interaction of topics from the aspect of sentiment has been ignored. In this paper, we propose to construct a Gaussian Process Dynamic Bayesian Network to model the dynamics and interactions of the sentiment of topics on social media such as Twitter. We use Dynamic Bayesian Networks to model time series of the sentiment of related topics and learn relationships between them. The network model itself applies Gaussian Process Regression to model the sentiment at a given time point based on related topics at previous time. We conducted experiments on a real world dataset that was crawled from Twitter with 9.72 million tweets. The experiment demonstrates a case study of analysing the sentiment dynamics of topics related to the event Brexit.
Transcriptomic data quantifying gene expression states for single cells or cell populations at a genomic level is now readily available. Changes in transcriptional state during cell development and function are governed by gene regulatory networks, comprising a collection of genes and regulatory interactions between these genes (or gene products). Network inference algorithms aim to infer functional interactions between genes from experimentally observed expression profiles, and identify the structure of the underlying regulatory networks. Here we describe popular classes of network inference algorithms, highlighting their respective strengths and weaknesses, along with some general challenges faced by these methods. Analyzing inferred network structures can provide insight into the genes, transcriptional changes, and regulatory interactions that play key roles in biological and disease-related processes of interest.
Objective: To infer molecular effectors of therapeutic effects and adverse events for dimethyl fumarate (DMF) in patients with relapsing-remitting MS (RRMS) using untargeted plasma metabolomics. Methods: Plasma from 27 patients with RRMS was collected at baseline and 6 weeks after initiating DMF. Patients were separated into discovery (n = 15) and validation cohorts (n = 12). Ten healthy controls were also recruited. Metabolomic profiling using ultra-high-performance liquid chromatography mass spectrometry (UPLC-MS) was performed on the discovery cohort and healthy controls at Metabolon Inc (Durham, NC). UPLC-MS was performed on the validation cohort at the National Phenome Centre (London, UK). Plasma neurofilament concentration (pNfL) was assayed using the Simoa platform (Quanterix, Lexington, MA). Time course and cross-sectional analyses were performed to identify pharmacodynamic changes in the metabolome secondary to DMF and relate these to adverse events. Results: In the discovery cohort, tricarboxylic acid (TCA) cycle intermediates fumarate and succinate, and TCA cycle metabolites succinyl-carnitine and methyl succinyl-carnitine increased 6 weeks following treatment (q < 0.05). Methyl succinyl-carnitine increased in the validation cohort (q < 0.05). These changes were not observed in the control population. Increased succinylcarnitine and methyl succinyl-carnitine were associated with adverse events from DMF (flushing and abdominal symptoms). pNfL concentration was higher in patients with RRMS than in controls and reduced over 15 months of treatment. Conclusion: TCA cycle intermediates and metabolites are increased in patients with RRMS treated with DMF. The results suggest reversal of flux through the succinate dehydrogenase complex. The contribution of succinyl-carnitine ester agonism at hydroxycarboxylic acid receptor 2 to both therapeutic effects and adverse events requires investigation.
Differential networks allow us to better understand the changes in cellular processes that are exhibited in conditions of interest, identifying variations in gene regulation or protein interaction between, for example, cases and controls, or in response to external stimuli. Here we present a novel methodology for the inference of differential gene regulatory networks from gene expression microarray data. Specifically we apply a Bayesian model selection approach to compare models of conserved and varying network structure, and use Gaussian graphical models to represent the network structures. We apply a variational inference approach to the learning of Gaussian graphical models of gene regulatory networks, that enables us to perform Bayesian model selection that is significantly more computationally efficient than Markov Chain Monte Carlo approaches. Our method is demonstrated to be more robust than independent analysis of data from multiple conditions when applied to synthetic network data, generating fewer false positive predictions of differential edges. We demonstrate the utility of our approach on real world gene expression microarray data by applying it to existing data from amyotrophic lateral sclerosis cases with and without mutations in C9orf72, and controls, where we are able to identify differential network interactions for further investigation.
We present an analysis of protein interaction network data via the comparison of models of network evolution to the observed data. We take a Bayesian approach and perform posterior density estimation using an approximate Bayesian computation with sequential Monte Carlo method. Our approach allows us to perform model selection over a selection of potential network growth models. The methodology we apply uses a distance defined in terms of graph spectra which captures the network data more naturally than previously used summary statistics such as the degree distribution. Furthermore, we include the effects of sampling into the analysis, to properly correct for the incompleteness of existing datasets, and have analysed the performance of our method under various degrees of sampling. We consider a number of models focusing not only on the biologically relevant class of duplication models, but also including models of scale-free network growth that have previously been claimed to describe such data. We find a preference for a duplication-divergence with linear preferential attachment model in the majority of the interaction datasets considered. We also illustrate how our method can be used to perform multi-model inference of network parameters to estimate properties of the full network from sampled data.
Developing mechanistic models has become an integral aspect of systems biology, as has the need to differentiate between alternative models. Parameterizing mathematical models has been widely perceived as a formidable challenge, which has spurred the development of statistical and optimisation routines for parameter inference. But now focus is increasingly shifting to problems that require us to choose from among a set of different models to determine which one offers the best description of a given biological system. We will here provide an overview of recent developments in the area of model selection. We will focus on approaches that are both practical as well as build on solid statistical principles and outline the conceptual foundations and the scope for application of such methods in systems biology.
It has previously been shown that subnets differ from global networks from which they are sampled for all but a very limited number of theoretical network models. These differences are of qualitative as well as quantitative nature, and the properties of subnets may be very different from the corresponding properties in the true, unobserved network. Here we propose a novel approach which allows us to infer aspects of the true network from incomplete network data in a multi-model inference framework. We develop the basic theoretical framework, including procedures for assessing confidence intervals of our estimates and evaluate the performance of this approach in simulation studies and against subnets drawn from the presently available PIN network data in Saccharomyces cerevisiae. We then illustrate the potential power of this new approach by estimating the number of interactions that will be detectable with present experimental approaches in four eukaryotic species, including humans. Encouragingly, where independent datasets are available we obtain consistent estimates from different partial protein interaction networks. We conclude with a discussion of the scope of this approach and areas for further research.
Reconstructing continuous signals from discrete time-points is a challenging inverse problem encountered in many scientific and engineering applications. For oscillatory signals classical results due to Nyquist set the limit below which it becomes impossible to reliably reconstruct the oscillation dynamics. Here we revisit this problem for vector-valued outputs and apply Bayesian non-parametric approaches in order to solve the function estimation problem. The main aim of the current paper is to map how we can make use of correlations among different outputs to reconstruct signals at a sampling rate that lies below the Nyquist rate. We show that it is possible to use multiple-output Gaussian processes to capture dependences between outputs which facilitate reconstruction of signals in situations where conventional Gaussian processes (i.e. those aimed at describing scalar signals) fail, and we delineate the phase and frequency dependence of the reliability of this type of approach. In addition to simple toy-models we also consider the dynamics of the tumour suppressor gene p53, which exhibits oscillations under physiological conditions, and which can be reconstructed more reliably in our new framework.
In the study of single-cell RNA-seq (scRNA-Seq) data, a key component of the analysis is to identify subpopulations of cells in the data. A variety of approaches to this have been considered, and although many machine learning-based methods have been developed, these rarely give an estimate of uncertainty in the cluster assignment. To allow for this, probabilistic models have been developed, but scRNA-Seq data exhibit a phenomenon known as dropout, whereby a large proportion of the observed read counts are zero. This poses challenges in developing probabilistic models that appropriately model the data. We develop a novel Dirichlet process mixture model that employs both a mixture at the cell level to model multiple populations of cells and a zero-inflated negative binomial mixture of counts at the transcript level. By taking a Bayesian approach, we are able to model the expression of genes within clusters, and to quantify uncertainty in cluster assignments. It is shown that this approach outperforms previous approaches that applied multinomial distributions to model scRNA-Seq counts and negative binomial models that do not take into account zero inflation. Applied to a publicly available data set of scRNA-Seq counts of multiple cell types from the mouse cortex and hippocampus, we demonstrate how our approach can be used to distinguish subpopulations of cells as clusters in the data, and to identify gene sets that are indicative of membership of a subpopulation.
For nearly any challenging scientific problem evaluation of the likelihood is problematic if not impossible. Approximate Bayesian computation (ABC) allows us to employ the whole Bayesian formalism to problems where we can use simulations from a model, but cannot evaluate the likelihood directly. When summary statistics of real and simulated data are compared, rather than the data directly, information is lost, unless the summary statistics are sufficient. Sufficient statistics are, however, not common, and without them statistical inferences in ABC are to be considered with caution. Previously other authors have attempted to combine different statistics in order to construct (approximately) sufficient statistics using search and information heuristics. Here we employ an information-theoretical framework that can be used to construct appropriate (approximately sufficient) statistics by combining different statistics until the loss of information is minimized. We start from a potentially large number of different statistics and choose the smallest set that captures (nearly) the same information as the complete set. We then demonstrate that such sets of statistics can be constructed for both parameter estimation and model selection problems, and we apply our approach to a range of illustrative and real-world model selection problems.
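As a rough illustration of the idea in the abstract above, the following is a minimal sketch of plain ABC rejection sampling, not the information-theoretic method of the paper. The toy model (a normal distribution with unknown mean and unit variance), the flat prior on [-5, 5], the tolerance `epsilon` and all function names are assumptions made purely for illustration. The sample mean is used as the summary statistic; for this particular toy model it is a sufficient statistic, which is exactly the property the abstract notes is rarely available in practice.

```python
import random
import statistics


def simulate(mu, n, rng):
    """Simulate n observations from the toy model Normal(mu, 1)."""
    return [rng.gauss(mu, 1.0) for _ in range(n)]


def abc_rejection(observed, n_draws, epsilon, rng):
    """Keep prior draws whose simulated summary is within epsilon of the observed one."""
    obs_summary = statistics.fmean(observed)  # summary statistic: sample mean
    accepted = []
    for _ in range(n_draws):
        mu = rng.uniform(-5.0, 5.0)  # draw a candidate from the (assumed) flat prior
        sim = simulate(mu, len(observed), rng)  # simulate data; no likelihood is evaluated
        if abs(statistics.fmean(sim) - obs_summary) <= epsilon:
            accepted.append(mu)  # accepted draws approximate the posterior
    return accepted


rng = random.Random(0)
observed = simulate(2.0, 50, rng)  # synthetic "observed" data with true mean 2.0
posterior_sample = abc_rejection(observed, n_draws=5000, epsilon=0.1, rng=rng)
```

The accepted values in `posterior_sample` concentrate around the observed sample mean; shrinking `epsilon` tightens the approximation at the cost of fewer accepted draws.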
When analysing gene expression time series data, an often overlooked but crucial aspect of the model is that the regulatory network structure may change over time. Although some approaches have addressed this problem previously in the literature, many are not well suited to the sequential nature of the data. Here, we present a method that allows us to infer regulatory network structures that may vary between time points, using a set of hidden states that describe the network structure at a given time point. To model the distribution of the hidden states, we have applied the Hierarchical Dirichlet Process Hidden Markov Model, a non-parametric extension of the traditional Hidden Markov Model, which does not require us to fix the number of hidden states in advance. We apply our method to existing microarray expression data as well as demonstrating its efficacy on simulated test data.
In vivo studies allow us to investigate biological processes at the level of the organism. But not all aspects of in vivo systems are amenable to direct experimental measurements. In order to make the most of such data we therefore require statistical tools that allow us to obtain reliable estimates for e.g. kinetic in vivo parameters. Here we show how we can use approximate Bayesian computation approaches in order to analyse leukocyte migration in zebrafish embryos in response to injuries. We track individual leukocytes using live imaging following surgical injury to the embryos' tail-fins. The signalling gradient that leukocytes follow towards the site of the injury cannot be directly measured but we can estimate its shape and how it changes with time from the directly observed patterns of leukocyte migration. By coupling simple models of immune signalling and leukocyte migration with the unknown gradient shape into a single statistical framework we can gain detailed insights into the tissue-wide processes that are involved in the innate immune response to wound injury. In particular we find conclusive evidence for a temporally and spatially changing signalling gradient that modulates the changing activity of the leukocyte population in the embryos. We conclude with a robustness analysis which highlights the most important factors determining the leukocyte dynamics. Our approach relies only on the ability to simulate numerically the process under investigation and is therefore also applicable in other in vivo contexts and studies.
The osmotic stress response signalling pathway of the model yeast Saccharomyces cerevisiae is crucial for the survival of cells under osmotic stress, and is preserved to varying degrees in other related fungal species. We apply a method for inference of ancestral states of characteristics over a phylogeny to 17 fungal species to infer the maximum likelihood estimate of presence or absence in ancestral genomes of genes involved in osmotic stress response. The same method allows us furthermore to perform a statistical test for correlated evolution between genes. Where such correlations exist within the osmotic stress response pathway of S. cerevisiae, we have used this in order to predict and subsequently test for the presence of physical protein-protein interactions in an attempt to detect novel interactions. Finally we assess the relevance of observed evolutionary correlations in predicting protein interactions in light of the experimental results. We do find that correlated evolution provides some useful information for the prediction of protein-protein interactions, but that these alone are not sufficient to explain detectable patterns of correlated evolution.
Background: Inference of gene regulatory network structures from RNA-Seq data is challenging due to the nature of the data, as measurements take the form of counts of reads mapped to a given gene. Here we present a model for RNA-Seq time series data that applies a negative binomial distribution for the observations, and uses sparse regression with a horseshoe prior to learn a dynamic Bayesian network of interactions between genes. We use a variational inference scheme to learn approximate posterior distributions for the model parameters. Results: The methodology is benchmarked on synthetic data designed to replicate the distribution of real world RNA-Seq data. We compare our method to other sparse regression approaches and find improved performance in learning directed networks. We demonstrate an application of our method to a publicly available human neuronal stem cell differentiation RNA-Seq time series data set to infer the underlying network structure. Conclusions: Our method is able to improve performance on synthetic data by explicitly modelling the statistical distribution of the data when learning networks from RNA-Seq time series. Applying approximate inference techniques we can learn network structures quickly with only moderate computing resources.
Background: In the analysis of networks we frequently require the statistical significance of some network statistic, such as measures of similarity for the properties of interacting nodes. The structure of the network may introduce dependencies among the nodes and it will in general be necessary to account for these dependencies in the statistical analysis. To this end we require some form of Null model of the network: generally rewired replicates of the network are generated which preserve only the degree (number of interactions) of each node. We show that this can fail to capture important features of network structure, and may result in unrealistic significance levels, when potentially confounding additional information is available. Methods: We present a new network resampling Null model which takes into account the degree sequence as well as available biological annotations. Using gene ontology information as an illustration we show how this information can be accounted for in the resampling approach, and the impact such information has on the assessment of statistical significance of correlations and motif-abundances in the Saccharomyces cerevisiae protein interaction network. An algorithm, GOcardShuffle, is introduced to allow for the efficient construction of an improved Null model for network data. Results: We use the protein interaction network of S. cerevisiae; correlations between the evolutionary rates and expression levels of interacting proteins and their statistical significance were assessed for Null models which condition on different aspects of the available data. The novel GOcardShuffle approach results in a Null model for annotated network data which appears better to describe the properties of real biological networks. 
Conclusion: An improved approach to the statistical analysis of biological network data, which conditions on the available biological information, leads to qualitatively different results compared to approaches which ignore such annotations. In particular we demonstrate that the effects of the biological organization of the network can be sufficient to explain the observed similarity of interacting proteins.
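The baseline Null model described in this abstract, rewired replicates that preserve only the degree of each node, can be sketched with repeated double edge swaps. This is a minimal illustration of the standard degree-preserving baseline only. It is not the GOcardShuffle algorithm, which additionally conditions on annotations. The function names and the toy edge list are hypothetical.

```python
import random
from collections import Counter


def degree_sequence(edges):
    """Map each node to its degree in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Rewire a simple undirected graph by double edge swaps.

    Each accepted swap replaces edges (a, b) and (c, d) with (a, d) and (c, b),
    so every node keeps its degree. Swaps that would create a self-loop or a
    duplicate edge are rejected.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    if len(edge_list) < 2:
        return edge_list
    edge_set = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # guard against graphs with few valid swaps
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # pick one of the two possible pairings
            c, d = d, c
        if a == d or c == b:  # would create a self-loop
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:  # would create a duplicate
            continue
        edge_set.discard(frozenset((a, b)))
        edge_set.discard(frozenset((c, d)))
        edge_set.update((new1, new2))
        edge_list[i] = (a, d)
        edge_list[j] = (c, b)
        done += 1
    return edge_list


# Toy example (hypothetical network): degrees are unchanged after rewiring.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
replicate = degree_preserving_rewire(toy_edges, n_swaps=20, seed=1)
```

A network statistic computed on many such replicates gives a reference distribution. The abstract's point is that this reference can be misleading when annotations such as gene ontology terms are ignored.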
The availability of large quantities of transcriptomic data in the form of RNA-seq count data has necessitated the development of methods to identify genes differentially expressed between experimental conditions. Many existing approaches apply a parametric model of gene expression and so place strong assumptions on the distribution of the data. Here we explore an alternate nonparametric approach that applies an empirical likelihood framework, allowing us to define likelihoods without specifying a parametric model of the data. We demonstrate the performance of our method when applied to gold standard datasets, and to existing experimental data. Our approach outperforms or closely matches performance of existing methods in the literature, and requires modest computational resources. An R package, EmpDiff, implementing the methods described in the paper, is available from: http://homepages.inf.ed.ac.uk/tthorne/software/packages/EmpDiff_0.99.tar.gz.
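To illustrate what defining a likelihood without a parametric model can look like, here is a sketch of the textbook empirical likelihood ratio for a single mean. It is not the EmpDiff implementation and does not reproduce the paper's differential expression test. The function name and the bisection solver are assumptions made for illustration.

```python
import math


def el_neg2_log_ratio(x, mu):
    """Return -2 log of the empirical likelihood ratio for the hypothesis mean == mu.

    Observation weights are w_i = 1 / (n * (1 + lam * (x_i - mu))), where lam
    solves sum((x_i - mu) / (1 + lam * (x_i - mu))) == 0.
    """
    z = [xi - mu for xi in x]
    zmin, zmax = min(z), max(z)
    if not (zmin < 0 < zmax):
        raise ValueError("mu must lie strictly inside the range of the data")

    # All weights are positive only for lam in the open interval (-1/zmax, -1/zmin).
    lo, hi = -1.0 / zmax, -1.0 / zmin
    span = hi - lo
    lo += 1e-10 * span
    hi -= 1e-10 * span

    def g(lam):
        return sum(zi / (1.0 + lam * zi) for zi in z)

    # g is strictly decreasing on the interval, so bisection finds its root.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(lam * zi) for zi in z)


# Toy data (hypothetical): the statistic is zero at the sample mean and grows
# as the hypothesised mean moves away from it.
toy = [1.0, 2.0, 3.0, 4.0, 5.0]
stat_at_mean = el_neg2_log_ratio(toy, 3.0)
stat_off_mean = el_neg2_log_ratio(toy, 2.0)
```

No distributional family is specified anywhere: the data themselves determine the weights.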
Sensing the environment and responding appropriately to it are key capabilities for the survival of an organism. All extant organisms must have evolved suitable sensors, signaling systems, and response mechanisms allowing them to survive under the conditions they are likely to encounter. Here, we investigate in detail the evolutionary history of one such system: The phage shock protein (Psp) stress response system is an important part of the stress response machinery in many bacteria, including Escherichia coli K12. Here, we use a systematic analysis of the genes that make up and regulate the Psp system in E. coli in order to elucidate the evolutionary history of the system. We compare gene sharing, sequence evolution, and conservation of protein-coding as well as noncoding DNA sequences and link these to comparative analyses of genome/operon organization across 698 bacterial genomes. Finally, we evaluate experimentally the biological advantage/disadvantage of a simplified version of the Psp system under different oxygen-related environments. Our results suggest that the Psp system evolved around a core response mechanism by gradually co-opting genes into the system to provide more nuanced sensory, signaling, and effector functionalities. We find that recruitment of new genes into the response machinery is closely linked to incorporation of these genes into a psp operon as is seen in E. coli, which contains the bulk of genes involved in the response. The organization of this operon allows for surprising levels of additional transcriptional control and flexibility. The results discussed here suggest that the components of such signaling systems will only be evolutionarily conserved if the overall functionality of the system can be maintained.
After the completion of the human and other genome projects it emerged that the number of genes in organisms as diverse as fruit flies, nematodes, and humans does not reflect our perception of their relative complexity. Here, we provide reliable evidence that the size of protein interaction networks in different organisms appears to correlate much better with their apparent biological complexity. We develop a stable and powerful, yet simple, statistical procedure to estimate the size of the whole network from subnet data. This approach is then applied to a range of eukaryotic organisms for which extensive protein interaction data have been collected and we estimate the number of interactions in humans to be approximately 650,000. We find that the human interaction network is one order of magnitude bigger than the Drosophila melanogaster interactome and approximately 3 times bigger than in Caenorhabditis elegans.
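The idea of scaling up from subnet data can be sketched under one simplifying assumption: the observed subnet is the induced subgraph on a uniformly random subset of nodes. An edge is then observed only if both of its endpoints are sampled, which gives a simple scaling estimator. This is an illustrative sketch under that assumption, not necessarily the procedure used in the paper. The function name and the numbers are hypothetical.

```python
def estimate_total_edges(sub_edges, sub_nodes, total_nodes):
    """Estimate the number of edges in the full network from a node-sampled subnet.

    With sub_nodes of total_nodes sampled uniformly without replacement, a given
    edge has both endpoints in the sample with probability
    sub_nodes * (sub_nodes - 1) / (total_nodes * (total_nodes - 1)).
    Dividing the observed edge count by this probability gives an unbiased estimate.
    """
    if not 2 <= sub_nodes <= total_nodes:
        raise ValueError("need 2 <= sub_nodes <= total_nodes")
    inclusion_prob = (sub_nodes * (sub_nodes - 1)) / (total_nodes * (total_nodes - 1))
    return sub_edges / inclusion_prob


# Hypothetical numbers: 9 interactions seen among 10 sampled proteins out of 100.
estimate = estimate_total_edges(sub_edges=9, sub_nodes=10, total_nodes=100)  # 990.0
```

Because the inclusion probability falls roughly with the square of the sampled fraction, small subnets imply large corrections.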
We introduce a procedure for deciding when a mass-action model is incompatible with observed steady-state data that does not require any parameter estimation. Thus, we avoid the difficulties of nonlinear optimization typically associated with methods based on parameter fitting. Instead, we borrow ideas from algebraic geometry to construct a transformation of the model variables such that any set of steady states of the model under that transformation lies on a common plane, irrespective of the values of the model parameters. Model rejection can then be performed by assessing the degree to which the transformed data deviate from coplanarity. We demonstrate our method by applying it to models of multisite phosphorylation and cell death signaling. Our framework offers a parameter-free perspective on the statistical model selection problem, which can complement conventional statistical methods in certain classes of problems where inference has to be based on steady-state data and the model structures allow for suitable algebraic relationships among the steady-state solutions.
In the study of single cell RNA-seq data, a key component of the analysis is to identify sub-populations of cells in the data. A variety of approaches to this have been considered, and although many machine learning based methods have been developed, these rarely give an estimate of uncertainty in the cluster assignment. To allow for this, probabilistic models have been developed, but single cell RNA-seq data exhibit a phenomenon known as dropout, whereby a large proportion of the observed read counts are zero. This poses challenges in developing probabilistic models that appropriately model the data. We develop a novel Dirichlet process mixture model which employs both a mixture at the cell level to model multiple populations of cells, and a zero-inflated negative binomial mixture of counts at the transcript level. By taking a Bayesian approach we are able to model the expression of genes within clusters, and to quantify uncertainty in cluster assignments. It is shown that this approach out-performs previous approaches that applied multinomial distributions to model single cell RNA-seq counts and negative binomial models that do not take into account zero-inflation. Applied to a publicly available data set of single cell RNA-seq counts of multiple cell types from the mouse cortex and hippocampus, we demonstrate how our approach can be used to distinguish sub-populations of cells as clusters in the data, and to identify gene sets that are indicative of membership of a sub-population. The methodology is implemented as an open source Snakemake pipeline available from https://github.com/tt104/scmixture.
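The count-level component named in this abstract, a zero-inflated negative binomial, can be written down in a few lines. The sketch below shows only that distribution, in a mean and dispersion parameterisation chosen here for illustration. It does not implement the Dirichlet process mixture or the scmixture pipeline, and the function names and parameter values are hypothetical.

```python
import math


def nb_log_pmf(k, mu, r):
    """Log pmf of a negative binomial with mean mu > 0 and dispersion r > 0."""
    return (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )


def zinb_pmf(k, mu, r, pi):
    """Pmf of a zero-inflated negative binomial.

    With probability pi the count is a structural zero (dropout); otherwise it
    is drawn from the negative binomial with mean mu and dispersion r.
    """
    nb = math.exp(nb_log_pmf(k, mu, r))
    if k == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


# Hypothetical parameters: 30% dropout inflates the probability of a zero count
# well above what the negative binomial alone would give.
p_zero_zinb = zinb_pmf(0, mu=5.0, r=2.0, pi=0.3)
p_zero_nb = math.exp(nb_log_pmf(0, mu=5.0, r=2.0))
```

Setting `pi` to zero recovers the plain negative binomial, which is the kind of model the abstract contrasts with.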
Recent work leading to new insights into the molecular architecture underlying complex cellular phenotypes enables researchers to investigate evolutionary processes in unprecedented detail. Protein interaction network data, which are now available for an increasing number of species, promise new insights and there have been many recent studies investigating evolutionary aspects of these interaction networks, from mathematical studies of growing networks to detailed phylogenetic surveys of proteins in their interaction network context. Here, we review the spectrum of such approaches, and assess issues associated with analyzing such data from an evolutionary perspective. Currently, such analyses are statistically challenging, but could link present initiatives in systems biology with results and methodologies that have developed in evolutionary biology over the past 60 years.
Here we present a novel statistical methodology that allows us to analyze gene expression data that have been collected from a number of different cases or conditions in a unified framework. Using a Bayesian nonparametric framework we develop a hierarchical model wherein genes can maintain a shared set of interactions between different cases, whilst also exhibiting behaviour that is unique to specific cases, sets of conditions, or groups of data points. By doing so we are able to not only combine data from different cases but also to discern the unique regulatory interactions that differentiate the cases. We apply our method to clinical data collected from patients suffering from sporadic Inclusion Body Myositis (sIBM), as well as control samples, and demonstrate the ability of our method to infer regulatory interactions that are unique to the disease cases of interest. The method thus balances the statistical need to include as many patients and controls as possible, and the clinical need to maintain potentially cryptic differences among patients and between patients and controls at the regulatory level.
Motivation: One of the challenging questions in modelling biological systems is to characterize the functional forms of the processes that control and orchestrate molecular and cellular phenotypes. Recently proposed methods for the analysis of metabolic pathways, for example, dynamic flux estimation, can only provide estimates of the underlying fluxes at discrete time points but fail to capture the complete temporal behaviour. To describe the dynamic variation of the fluxes, we additionally require the assumption of specific functional forms that can capture the temporal behaviour. However, it also remains unclear how to address the noise which might be present in experimentally measured metabolite concentrations. Results: Here we propose a novel approach to modelling metabolic fluxes: derivative processes that are based on multiple-output Gaussian processes (MGPs), which are a flexible non-parametric Bayesian modelling technique. The main advantages that follow from the MGP approach include the natural non-parametric representation of the fluxes and the ability to impute the missing data in between the measurements. Our derivative process approach allows us to model changes in metabolite derivative concentrations and to characterize the temporal behaviour of metabolic fluxes from time course data. Because the derivative of a Gaussian process is itself a Gaussian process, we can readily link metabolite concentrations to metabolic fluxes and vice versa. Here we discuss how this can be implemented in an MGP framework and illustrate its application to simple models, including nitrogen metabolism in Escherichia coli.
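The statement that the derivative of a Gaussian process is itself a Gaussian process can be made concrete as follows. The general identities are standard. The squared exponential kernel, with variance $\sigma^2$ and length scale $\ell$, is an assumption made here for illustration, since the abstract does not specify a kernel.

```latex
% If f ~ GP(m, k) with a differentiable mean m and a twice-differentiable kernel k, then
\begin{align}
  f'(t) &\sim \mathcal{GP}\!\left( m'(t),\ \frac{\partial^2 k(t,t')}{\partial t\,\partial t'} \right), \\
  \operatorname{cov}\!\big(f(t),\, f'(t')\big) &= \frac{\partial k(t,t')}{\partial t'} .
\end{align}
% Illustrative case: squared exponential kernel
% k(t,t') = \sigma^2 \exp\!\big(-(t-t')^2 / (2\ell^2)\big)
\begin{align}
  \operatorname{cov}\!\big(f(t),\, f'(t')\big)
    &= \sigma^2 \exp\!\left(-\frac{(t-t')^2}{2\ell^2}\right) \frac{t-t'}{\ell^2}, \\
  \operatorname{cov}\!\big(f'(t),\, f'(t')\big)
    &= \sigma^2 \exp\!\left(-\frac{(t-t')^2}{2\ell^2}\right)
       \left( \frac{1}{\ell^2} - \frac{(t-t')^2}{\ell^4} \right).
\end{align}
```

Together these blocks define a joint Gaussian distribution over a function and its derivative, which is what allows concentrations and their rates of change to be modelled in one multiple-output framework.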
Motivation: Inferring the parameters of models describing biological systems is an important problem in the reverse engineering of the mechanisms underlying these systems. Much work has focused on parameter inference of stochastic and ordinary differential equation models using Approximate Bayesian Computation (ABC). While there is some recent work on inference in spatial models, this remains an open problem. Simultaneously, advances in topological data analysis (TDA), a field of computational mathematics, have enabled spatial patterns in data to be characterized. Results: Here, we focus on recent work using TDA to study different regimes of parameter space for a well-studied model of angiogenesis. We propose a method for combining TDA with ABC to infer parameters in the Anderson–Chaplain model of angiogenesis. We demonstrate that this topological approach outperforms ABC approaches that use simpler statistics based on spatial features of the data. This is a first step toward a general framework of spatial parameter inference for biological systems, for which there may be a variety of filtrations, vectorizations and summary statistics to be considered. Availability and implementation: All code used to produce our results is available as a Snakemake workflow from github.com/tt104/tabc_angio.
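The ABC step referred to in this abstract can be sketched in its simplest form, rejection sampling. In the sketch below the summary statistic is a plain sample mean and the simulator is a toy Gaussian model. Both are stand-ins chosen for illustration; the paper uses topological summaries of an angiogenesis model instead. All function names and settings are hypothetical and are not taken from the tabc_angio workflow.

```python
import random
import statistics


def abc_rejection(observed, simulate, prior_sample, summary, distance,
                  eps, n_draws, seed=0):
    """Basic ABC rejection sampler.

    Draw a parameter from the prior, simulate data, and keep the parameter if
    the simulated summary is within eps of the observed summary.
    """
    rng = random.Random(seed)
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        s_sim = summary(simulate(theta, rng))
        if distance(s_sim, s_obs) <= eps:
            accepted.append(theta)
    return accepted


def simulate_normal(theta, rng, n=50):
    """Toy simulator (assumption): n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


# Toy data generated with a known parameter value of 2.0.
observed = simulate_normal(2.0, random.Random(1))

posterior = abc_rejection(
    observed,
    simulate=simulate_normal,
    prior_sample=lambda rng: rng.uniform(-5.0, 5.0),  # uniform prior (assumption)
    summary=statistics.mean,                          # stand-in for a topological summary
    distance=lambda a, b: abs(a - b),
    eps=0.1,
    n_draws=5000,
)
```

Replacing `summary` with a function that vectorizes topological features of simulated spatial data is the kind of substitution the abstract describes. The rest of the sampler is unchanged.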
Glioblastoma (GBM) is an aggressive malignant primary brain tumor with limited therapeutic options. We show that the angiotensin II (AngII) type 2 receptor (AT2R) is a novel therapeutic target for GBM and that AngII, endogenously produced in GBM cells, promotes proliferation through AT2R. We repurposed EMA401, an AT2R antagonist originally developed as a peripherally restricted analgesic, for GBM and showed that it inhibits the proliferation of AT2R-expressing GBM spheroids and blocks their invasiveness and angiogenic capacity. The crystal structure of AT2R bound to EMA401 was determined and revealed the receptor to be in an active-like conformation with helix-VIII blocking G protein or β-arrestin recruitment. The architecture and interactions of EMA401 in AT2R differ drastically from complexes of AT2R with other relevant compounds. To enhance central nervous system (CNS) penetration of EMA401, we exploited the crystal structure to design an angiopep-2 tethered EMA401 derivative, A3E. A3E exhibited enhanced CNS penetration, leading to reduced tumor volume, inhibition of proliferation and increased levels of apoptosis in an orthotopic xenograft model of GBM.