Duong Vo
About
My research project
Statistical methods for the analysis of large-scale single-cell RNA-seq data
With advances in next-generation sequencing technologies, single-cell RNA sequencing (scRNA-seq) allows researchers to analyze the transcriptomic information of individual cells. To dissect scRNA-seq data, computational methods are applied in several steps, including mapping, quality control, quantification, clustering and differential expression analysis. Challenges remain in the computational analysis of scRNA-seq data: high levels of noise, sparsity and batch effects are all reported properties of such data. The importance of scRNA-seq analysis has been demonstrated in several studies. For example, differential expression analysis of scRNA-seq data comparing circulating tumor cells with primary tumor cells from hepatocellular carcinoma patients identified the chemokine CCL5 as the mediator of immune evasion by circulating tumor cells. The main goal of our research is to develop computational tools, based on gene regulatory network inference, that overcome challenges such as noise and batch effects in scRNA-seq data analysis. Such tools can support single-cell analyses such as cell-type and cell-state identification.
Supervisors
Publications
Background: Gene interaction networks are graphs in which nodes represent genes and edges represent functional interactions between them. These interactions can be at multiple levels, for instance, gene regulation, protein-protein interaction, or metabolic pathways. To analyse gene interaction networks at a large scale, gene co-expression network analysis is often applied on high-throughput gene expression data such as RNA sequencing data. With the advance in sequencing technology, expression of genes can be measured in individual cells. Single-cell RNA sequencing (scRNAseq) provides insights of cellular development, differentiation and characteristics at the transcriptomic level. High sparsity and high-dimensional data structures pose challenges in scRNAseq data analysis.
Results: In this study, a sparse inverse covariance matrix estimation framework for scRNAseq data is developed to capture direct functional interactions between genes. Comparative analyses highlight high performance and fast computation of Stein-type shrinkage in high-dimensional data using simulated scRNAseq data. Data transformation approaches also show improvement in performance of shrinkage methods in non-Gaussian distributed data. Zero-inflated modelling of scRNAseq data based on a negative binomial distribution enhances shrinkage performance in zero-inflated data without interference on non zero-inflated count data.
Conclusion: The proposed framework broadens application of graphical model in scRNAseq analysis with flexibility in sparsity of count data resulting from dropout events, high performance, and fast computational time. Implementation of the framework is in a reproducible Snakemake workflow https://github.com/calathea24/ZINBGraphicalModel and R package ZINBStein https://github.com/calathea24/ZINBStein.
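To make the idea of Stein-type shrinkage for inverse covariance estimation concrete, the following is a minimal sketch in Python. It is not the ZINBStein implementation. It assumes one common Stein-type estimator: the sample correlation matrix is shrunk toward the identity with an analytically estimated intensity, in the style of Schäfer and Strimmer, and partial correlations are then read off the inverse. The function names `shrink_correlation` and `partial_correlation`, the simulated Poisson counts and the `log1p` transformation are all illustrative assumptions.

```python
import numpy as np


def shrink_correlation(X):
    """Shrink the sample correlation matrix of X (cells x genes) toward the identity.

    Assumes every gene (column) has non-zero variance.
    Returns the shrunk correlation matrix and the estimated shrinkage intensity.
    """
    n, p = X.shape
    # Standardise each gene to zero mean and unit sample variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Per-cell products z_ki * z_kj, shape (n, p, p); fine for small p only.
    W = Z[:, :, None] * Z[:, None, :]
    W_bar = W.mean(axis=0)
    R = W_bar * n / (n - 1)  # sample correlation matrix
    # Estimated variance of each correlation entry.
    var_R = ((W - W_bar) ** 2).sum(axis=0) * n / (n - 1) ** 3
    off = ~np.eye(p, dtype=bool)  # mask selecting off-diagonal entries
    denom = (R[off] ** 2).sum()
    # Shrinkage intensity: summed variances over summed squared correlations.
    lam = 1.0 if denom == 0 else float(np.clip(var_R[off].sum() / denom, 0.0, 1.0))
    R_shrunk = (1.0 - lam) * R
    np.fill_diagonal(R_shrunk, 1.0)
    return R_shrunk, lam


def partial_correlation(R):
    """Partial correlations from an invertible correlation matrix."""
    P = np.linalg.inv(R)  # precision (inverse correlation) matrix
    d = np.sqrt(np.diag(P))
    pcor = -P / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Illustrative usage on simulated counts with more genes than cells.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(20, 50))
R_shrunk, lam = shrink_correlation(np.log1p(counts))
pcor = partial_correlation(R_shrunk)
print(f"estimated shrinkage intensity: {lam:.3f}")
```

Whenever the estimated intensity is positive, the shrunk matrix is positive definite and can be inverted even with more genes than cells, which is the high-dimensional setting the abstract describes. Large off-diagonal entries of `pcor` would be candidate direct interactions. How to threshold them is left open here.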
In the study of single-cell RNA-seq (scRNA-Seq) data, a key component of the analysis is to identify subpopulations of cells in the data. A variety of approaches to this have been considered, and although many machine learning-based methods have been developed, these rarely give an estimate of uncertainty in the cluster assignment. To allow for this, probabilistic models have been developed, but scRNA-Seq data exhibit a phenomenon known as dropout, whereby a large proportion of the observed read counts are zero. This poses challenges in developing probabilistic models that appropriately model the data. We develop a novel Dirichlet process mixture model that employs both a mixture at the cell level to model multiple populations of cells and a zero-inflated negative binomial mixture of counts at the transcript level. By taking a Bayesian approach, we are able to model the expression of genes within clusters, and to quantify uncertainty in cluster assignments. It is shown that this approach outperforms previous approaches that applied multinomial distributions to model scRNA-Seq counts and negative binomial models that do not take into account zero inflation. Applied to a publicly available data set of scRNA-Seq counts of multiple cell types from the mouse cortex and hippocampus, we demonstrate how our approach can be used to distinguish subpopulations of cells as clusters in the data, and to identify gene sets that are indicative of membership of a subpopulation.
In the study of single cell RNA-seq data, a key component of the analysis is to identify sub-populations of cells in the data. A variety of approaches to this have been considered, and although many machine learning based methods have been developed, these rarely give an estimate of uncertainty in the cluster assignment. To allow for this, probabilistic models have been developed, but single cell RNA-seq data exhibit a phenomenon known as dropout, whereby a large proportion of the observed read counts are zero. This poses challenges in developing probabilistic models that appropriately model the data. We develop a novel Dirichlet process mixture model which employs both a mixture at the cell level to model multiple populations of cells, and a zero-inflated negative binomial mixture of counts at the transcript level. By taking a Bayesian approach, we are able to model the expression of genes within clusters, and to quantify uncertainty in cluster assignments. It is shown that this approach out-performs previous approaches that applied multinomial distributions to model single cell RNA-seq counts and negative binomial models that do not take into account zero-inflation. Applied to a publicly available data set of single cell RNA-seq counts of multiple cell types from the mouse cortex and hippocampus, we demonstrate how our approach can be used to distinguish sub-populations of cells as clusters in the data, and to identify gene sets that are indicative of membership of a sub-population. The methodology is implemented as an open source Snakemake pipeline available from https://github.com/tt104/scmixture.
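The zero-inflated negative binomial (ZINB) count model mentioned in these abstracts can be illustrated with a short sketch. This is not code from the scmixture pipeline. It assumes the common mean and size parameterisation of the negative binomial plus a single extra probability of a structural zero, and the names `nb_log_pmf`, `zinb_pmf`, `mu`, `r` and `pi` are illustrative.

```python
import math


def nb_log_pmf(y, mu, r):
    """Log-probability of count y under a negative binomial with mean mu > 0 and size r > 0."""
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def zinb_pmf(y, mu, r, pi):
    """Probability of count y under a zero-inflated negative binomial.

    With probability pi the count is a structural zero (dropout);
    otherwise it is drawn from the negative binomial.
    """
    nb = math.exp(nb_log_pmf(y, mu, r))
    if y == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


# Illustrative comparison: zero inflation raises the probability of observing a zero.
p_zero_nb = zinb_pmf(0, mu=5.0, r=2.0, pi=0.0)
p_zero_zinb = zinb_pmf(0, mu=5.0, r=2.0, pi=0.3)
print(p_zero_nb, p_zero_zinb)
```

Setting `pi` to zero recovers the plain negative binomial, so the same function covers both the zero-inflated and the non zero-inflated case.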