Dr Bingxin Lu
About
Biography
I am currently a Surrey Future Fellow in the Section of Systems Biology, University of Surrey.
I am setting up my research group, so please do not hesitate to contact me if you are interested in my research. Details of potential projects and opportunities are on my personal webpage.
Previously, I was a postdoc in Chris Barnes’s group at the Department of Cell and Developmental Biology, University College London, where I worked on dynamical modeling of chromosomal instability (CIN) in cancer genomes. Before this, I was a Postdoctoral Fellow at the Genome Institute of Singapore (Weiwei Zhai’s group), where I mainly developed pipelines and methods to analyse tumour heterogeneity and clonal evolution in liver and lung cancer genomes. I completed my PhD in Computational Biology under the supervision of Hon Wai Leong at the School of Computing, National University of Singapore, where I developed machine learning and phylogenetic methods for problems related to lateral gene transfer. I obtained my Master’s and Bachelor’s degrees from the Software Engineering Institute, East China Normal University, where I led the development of platforms for high-throughput biological data analysis, including RNA-Seq and proteomic data.
Research
Research interests
My research is in the broad field of computational biology, which bridges software engineering, machine learning, algorithms, statistics, phylogenetics, population genetics, and omics. I am particularly interested in developing new computational methods and models to address important biological problems related to human health. My goal is to help the biological and biomedical community mine new knowledge from the huge and growing volume of available data. I have developed several new methods and applied existing methods to tackle basic questions arising in the study of species and cancer evolution. My current primary interest is the evolutionary dynamics of cancer genomes, especially those driven by chromosomal instability, which remains less well studied than point mutations but is critical in tumorigenesis and patient treatment.
Publications
This repository contains the data used to generate the figures in the paper "Cell-cycle dependent DNA repair and replication unifies patterns of chromosome instability".
Ovarian high-grade serous carcinoma (HGSC) originates in the fallopian tube, with secretory cells carrying a TP53 mutation, known as p53 signatures, identified as potential precursors. p53 signatures evolve into serous tubal intraepithelial carcinoma (STIC) lesions, which in turn progress into invasive HGSC, which readily spreads to the ovary and disseminates around the peritoneal cavity. We recently investigated the genomic landscape of early- and late-stage HGSC and found higher ploidy in late-stage (median 3.1) than early-stage (median 2.0) samples. Here, to explore whether the high ploidy and possible whole-genome duplication (WGD) observed in late-stage disease were determined early in the evolution of HGSC, we analysed archival formalin-fixed paraffin-embedded (FFPE) samples from five HGSC patients. p53 signatures and STIC lesions were laser-capture microdissected and sequenced using shallow whole-genome sequencing (sWGS), while invasive ovarian/fallopian tube and metastatic carcinoma samples underwent macrodissection and were profiled using both sWGS and targeted next-generation sequencing. Results showed highly similar patterns of global copy number change between STIC lesions and invasive carcinoma samples within each patient. Ploidy changes were evident in STIC lesions, but not p53 signatures, and there was a strong correlation between ploidy in STIC lesions and invasive ovarian/fallopian tube and metastatic samples in each patient. The reconstruction of sample phylogeny for each patient from relative copy number indicated that high ploidy, when present, occurred early in the evolution of HGSC, which was further validated by copy number signatures in ovarian and metastatic tumours. These findings suggest that aberrant ploidy, suggestive of WGD, arises early in HGSC and is detected in STIC lesions, implying that the trajectory of HGSC may be determined at the earliest stages of tumour development. © 2024 The Author(s). 
The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.
Cancer is an evolutionary process involving the accumulation of diverse somatic mutations and clonal evolution over time. Phylogenetic inference from samples obtained from an individual patient offers a powerful approach to unraveling the intricate evolutionary history of cancer and provides insights that can inform cancer treatment. Somatic copy number alterations (CNAs) are important in cancer evolution and are often used as markers, alone or with other somatic mutations, for phylogenetic inferences, particularly in low-coverage DNA sequencing data. Many phylogenetic inference methods using CNAs detected from bulk or single-cell DNA sequencing data have been developed over the years. However, there have been no systematic reviews on these methods. To summarize the state-of-the-art of the field and inform future development, this review presents a comprehensive survey on the major challenges in inference, different types of methods, and applications of these methods. The challenges are discussed from the aspects of input data, models of evolution, and inference algorithms. The different methods are grouped according to the markers used for inference and the types of the reconstructed trees. The applications include using phylogenetic inference to understand intra-tumor heterogeneity, metastasis, treatment resistance, and early cancer development. This review also sheds light on future directions of cancer phylogenetic inference using CNAs, including the improvement of scalability, the utilization of new types of data, and the development of more realistic models of evolution. 
• Phylogenetic trees built with copy number alterations are important for understanding the evolution of cancers.
• Reconstructing phylogenetic trees with copy number alterations is challenging.
• A phylogenetic tree can be built using only copy number alterations or by integrating them with other somatic variants.
• Efficient and accurate phylogenetic inference can aid oncological research and offer insights into cancer treatment.
Phylogenetic trees based on copy number profiles from multiple samples of a patient are helpful to understand cancer evolution. Here, we develop a new maximum likelihood method, CNETML, to infer phylogenies from such data. CNETML is the first program to jointly infer the tree topology, node ages, and mutation rates from total copy numbers of longitudinal samples. Our extensive simulations suggest CNETML performs well on copy numbers relative to ploidy and under slight violation of model assumptions. The application of CNETML to real data generates results consistent with previous discoveries and provides novel early copy number events for further investigation.
The accurate detection of genomic islands (GIs) in microbial genomes is important for both evolutionary study and medical research, because GIs may promote genome evolution and contain genes involved in pathogenesis. Various computational methods have been developed to predict GIs over the years. However, most of them cannot make full use of GI-associated features to achieve desirable performance. Additionally, many methods cannot be directly applied to newly sequenced genomes. We develop a new method called GI-Cluster, which provides an effective way to integrate multiple GI-related features via consensus clustering. GI-Cluster does not require training datasets or existing genome annotations, but it can still achieve comparable or better performance than supervised learning methods in comprehensive evaluations. Moreover, GI-Cluster is widely applicable, either to complete and incomplete genomes or to initial GI predictions from other programs. GI-Cluster also provides plots to visualize the distribution of predicted GIs and related features. GI-Cluster is available at https://github.com/icelu/GI_Cluster.
Clusters of genes acquired by lateral gene transfer in microbial genomes are broadly referred to as genomic islands (GIs). GIs often carry genes important for genome evolution and adaptation to niches, such as genes involved in pathogenesis and antibiotic resistance. Therefore, GI prediction has gradually become an important part of microbial genome analysis. Despite inherent difficulties in identifying GIs, many computational methods have been developed and show good performance. In this mini-review, we first summarize the general challenges in predicting GIs. Then we group existing GI detection methods by their input, briefly describe representative methods in each group, and discuss their advantages as well as limitations. Finally, we look into the potential improvements for better GI prediction.
Analysis of live-cell imaging and single-cell genome sequencing data of colorectal cancer organoids identifies temporal dynamics of sub-chromosomal copy-number amplifications. Central to tumor evolution is the generation of genetic diversity. However, the extent and patterns by which de novo karyotype alterations emerge and propagate within human tumors are not well understood, especially at single-cell resolution. Here, we present 3D Live-Seq, a protocol that integrates live-cell imaging of tumor organoid outgrowth and whole-genome sequencing of each imaged cell to reconstruct evolving tumor cell karyotypes across consecutive cell generations. Using patient-derived colorectal cancer organoids and fresh tumor biopsies, we demonstrate that karyotype alterations of varying complexity are prevalent and can arise within a few cell generations. Sub-chromosomal acentric fragments were prone to replication and collective missegregation across consecutive cell divisions. In contrast, gross genome-wide karyotype alterations were generated in a single erroneous cell division, providing support that aneuploid tumor genomes can evolve via punctuated evolution. Mapping the temporal dynamics and patterns of karyotype diversification in cancer enables reconstructions of evolutionary paths to malignant fitness.
Lung cancer is the world's leading cause of cancer death and shows strong ancestry disparities. By sequencing and assembling a large genomic and transcriptomic dataset of lung adenocarcinoma (LUAD) in individuals of East Asian ancestry (EAS; n = 305), we found that East Asian LUADs had more stable genomes characterized by fewer mutations and fewer copy number alterations than LUADs from individuals of European ancestry. This difference is much stronger in smokers as compared to nonsmokers. Transcriptomic clustering identified a new EAS-specific LUAD subgroup with a less complex genomic profile and upregulated immune-related genes, allowing the possibility of immunotherapy-based approaches. Integrative analysis across clinical and molecular features showed the importance of molecular phenotypes in patient prognostic stratification. EAS LUADs had better prediction accuracy than those of European ancestry, potentially due to their less complex genomic architecture. This study elucidated a comprehensive genomic landscape of EAS LUADs and highlighted important ancestry differences between the two cohorts.
Motivation: Genetic material is transferred in a non-reproductive manner across species more frequently than commonly thought, particularly in the bacteria kingdom. On one hand, extant genomes are thus more properly considered as a fusion product of both reproductive and nonreproductive genetic transfers. This has motivated researchers to adopt phylogenetic networks to study genome evolution. On the other hand, a gene's evolution is usually tree-like and has been studied for over half a century. Accordingly, the relationships between phylogenetic trees and networks are the basis for the reconstruction and verification of phylogenetic networks. One important problem in verifying a network model is determining whether or not certain existing phylogenetic trees are displayed in a phylogenetic network. This problem is formally called the tree containment problem. It is NP-complete even for binary phylogenetic networks. Results: We design an exponential time but efficient method for determining whether or not a phylogenetic tree is displayed in an arbitrary phylogenetic network. It is developed on the basis of the so-called reticulation-visible property of phylogenetic networks.
Genomic islands (GIs) are clusters of functionally related genes acquired by lateral genetic transfer (LGT), and they are present in many bacterial genomes. GIs are extremely important for bacterial research, because they not only promote genome evolution but also contain genes that enhance adaptation and enable antibiotic resistance. Many methods have been proposed to predict GIs, but most of them rely on either annotations or comparisons with other closely related genomes. Hence these methods cannot be easily applied to new genomes. As the number of newly sequenced bacterial genomes rapidly increases, there is a need for methods to detect GIs based solely on the sequence of a single genome. In this paper, we propose a novel method, GI-SVM, to predict GIs given only the unannotated genome sequence. GI-SVM is based on a one-class support vector machine (SVM), utilizing composition bias in terms of k-mer content. In our evaluations on three real genomes, GI-SVM achieves higher recall than current methods, without much loss of precision. In addition, GI-SVM allows flexible parameter tuning to get optimal results for each genome. In short, GI-SVM provides a more sensitive method for researchers interested in a first-pass detection of GIs in newly sequenced genomes.
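To make "composition bias in terms of k-mer content" concrete, the following minimal sketch computes normalized k-mer frequencies in sliding windows and scores each window by its distance from the genome-wide profile. It is not GI-SVM's implementation: the function names, window size, step, k, and the Euclidean distance score are all assumptions chosen for illustration, and the one-class SVM step described in the abstract is replaced here by this much simpler score.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=2):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    k-mers containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def window_scores(genome, window=1000, step=500, k=2):
    """Score each window by the Euclidean distance between its k-mer
    profile and the genome-wide profile (hypothetical scoring choice)."""
    background = kmer_profile(genome, k)
    scores = []
    for start in range(0, max(len(genome) - window, 0) + 1, step):
        profile = kmer_profile(genome[start:start + window], k)
        dist = math.sqrt(
            sum((a - b) ** 2 for a, b in zip(profile, background))
        )
        scores.append((start, dist))
    return scores


if __name__ == "__main__":
    # Toy genome: an AT-rich background with a GC-rich insert in the middle.
    toy = "AT" * 2000 + "GC" * 250 + "AT" * 2000
    for start, dist in window_scores(toy, window=500, step=500):
        print(start, round(dist, 3))
```

Windows whose composition departs strongly from the background receive high scores; in a sequence-only approach of the kind the abstract describes, such per-window feature vectors would be passed to an outlier detector rather than ranked by a single distance.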
Summary: Simulating realistic clonal dynamics of tumors is an important topic in cancer genomics. Here, we present Phylogeny guided Simulator for Tumor Evolution, a tool that can simulate different types of tumor samples, including single-sector and multi-sector bulk tumor data as well as single-cell tumor data, under a wide range of evolutionary trajectories. Phylogeny guided Simulator for Tumor Evolution provides an efficient tool for understanding clonal evolution of cancer.
The earliest events during human tumor initiation are poorly characterized but may hold clues as to how to detect and prevent malignancy. Here we model this occult process by engineering TP53 deficiency in primary human gastric organoids and performing experimental evolution in multiple clonally derived cultures over two years, thereby defining causal relationships between this common initiating genetic lesion and resulting phenotypes. TP53 loss elicited progressive aneuploidy, including copy number alterations and complex structural variants that are common in gastric cancers and which follow preferred temporal orders. Longitudinal single cell sequencing of TP53 deficient gastric organoids similarly indicates progression towards malignant transcriptional programs. Moreover, lineage tracing with expressed cellular barcodes demonstrates reproducible dynamics whereby initially rare subclones with shared transcriptional programs repeatedly attain clonal dominance. This powerful platform for experimental evolution exposes stringent selection, clonal interference and a striking degree of phenotypic convergence in pre-malignant epithelial organoids, implying that the earliest stages of tumorigenesis may be predictable while illuminating evolutionary constraints and barriers to malignant transformation. Competing Interest Statement: C.C. is an advisor and holds equity in Grail, Ravel, DeepCell and an advisor to Genentech and NanoString. All other authors declare no competing interests.
Transcriptome reconstruction is an important application of RNA-Seq, providing critical information for further analysis of the transcriptome. Although RNA-Seq offers the potential to identify the whole picture of the transcriptome, it still presents special challenges. To handle these difficulties and reconstruct the transcriptome as completely as possible, current computational approaches mainly employ two strategies: de novo assembly and genome-guided assembly. To find the similarities and differences between them, we first chose five representative assemblers from the two classes, and then investigated and compared their algorithmic features in theory and their real performance in practice. We found that all the methods can be reduced to graph reduction problems, yet they have different conceptual and practical implementations. Each assembly method therefore has its specific advantages and disadvantages, performing worse than others in certain aspects while outperforming them in other aspects. Finally, we merged the assemblies of the five assemblers and obtained a much better assembly. Additionally, we evaluated an assembler using a genome-guided de novo assembly approach, and it achieved good performance. Based on these results, we suggest that to obtain a comprehensive set of recovered transcripts, it is better to use a combination of de novo assembly and genome-guided assembly.
Phylogenetic trees based on copy number alterations (CNAs) for multi-region samples of a single cancer patient are helpful to understand the spatio-temporal evolution of cancers, especially in tumours driven by chromosomal instability. Due to the high cost of deep sequencing data, low-coverage data are more accessible in practice, which only allow the calling of (relative) total copy numbers due to the lower resolution. However, methods to reconstruct sample phylogenies from CNAs often use allele-specific copy numbers, and those using total copy number are mostly distance matrix or maximum parsimony methods, which do not handle temporal data or estimate mutation rates. In this work, we developed a new maximum likelihood method based on a novel evolutionary model of CNAs, CNETML, to infer phylogenies from spatio-temporal samples taken within a single patient. CNETML is the first program to jointly infer the tree topology, node ages, and mutation rates from total copy numbers when samples were taken at different time points. Our extensive simulations suggest CNETML performed well even on relative copy numbers with subclonal whole genome doubling events and under slight violation of model assumptions. The application of CNETML to real data from Barrett's esophagus patients also generated results consistent with previous discoveries and novel early CNAs for further investigation. Competing Interest Statement: The authors have declared no competing interest.
Si-Wu-Tang (SWT) is a Traditional Chinese Medicine (TCM) formula widely used for the treatment of gynecological diseases. To explore the pharmacological mechanism of SWT, we incorporated microarray data of SWT with our herbal target database TCMID to analyze the potential activity mechanism of SWT's herbal ingredients and targets. We detected 2,405 differentially expressed genes (DEGs) in the microarray data; 20 of the 102 proteins targeted by SWT were encoded by these DEGs and can be targeted by 2 FDA-approved drugs and 39 experimental drugs. The results of pathway enrichment analysis of the 20 predicted targets were consistent with those of the 2,405 DEGs, elaborating the potential pharmacological mechanisms of SWT. Further study from the perspective of the protein-protein interaction (PPI) network showed that the predicted targets of SWT function cooperatively to perform their multi-target effects. We also constructed a network combining herbs, ingredients, targets and drugs, which bridges the gap between SWT and conventional medicine, and used it to infer the potential mechanisms of herbal ingredients. Moreover, based on the hypothesis that the same or similar effects between different TCM formulae may result from targeting the same proteins, we analyzed 27 other TCM formulae that can also treat gynecological diseases; the results provide additional insight into the potential mechanisms of SWT in treating amenorrhea. Our bioinformatics approach to investigating the pharmacology of SWT may shed light on drug discovery for gynecological diseases and could be used to investigate other TCM formulae as well.
The human reference genome is still incomplete and a number of gene sequences are missing from it. Approaches to uncover them, the reasons for their absence, and their functions remain underexplored. Here, we comprehensively identified and characterized the missing genes of the human reference genome with RNA-Seq data from 16 different human tissues. By using a combined approach of genome-guided transcriptome reconstruction coupled with genome-wide comparison, we uncovered 3.78 and 2.37 Mb of transcribed regions in the human genome assemblies of Celera and HuRef, respectively, that are either missing from their homologous chromosomes in NCBI human reference genome build 37.2 or partially or entirely absent from the reference. We further identified a significant number of novel transcript contigs in each tissue from de novo transcriptome assembly that cannot be aligned to NCBI build 37.2 but can be aligned to at least one of the genomes of Celera, HuRef, chimpanzee, macaque or mouse. Our analyses indicate that the missing genes could result from genome misassembly, transposition, copy number variation, translocation and other structural variations. Moreover, our results suggest that a large portion of these missing genes are conserved between human and other mammals, implying that they have important biological functions. In total, 1,233 functional protein domains were detected in these missing genes. Collectively, our study not only provides approaches for uncovering the missing genes of a genome, but also proposes potential reasons why genes are missing from the genome and highlights the importance of uncovering the missing genes of incomplete genomes.
Intra-tumor heterogeneity (ITH) is a key challenge in cancer treatment, but previous studies have focused mainly on the genomic alterations without exploring phenotypic (transcriptomic and immune) heterogeneity. Using one of the largest prospective surgical cohorts for hepatocellular carcinoma (HCC) with multi-region sampling, we sequenced whole genomes and paired transcriptomes from 67 HCC patients (331 samples). We found that while genomic ITH was rather constant across stages, phenotypic ITH had a very different trajectory and quickly diversified in stage II patients. Most strikingly, 30% of patients were found to contain more than one transcriptomic subtype within a single tumor. Such phenotypic ITH was found to be much more informative in predicting patient survival than genomic ITH and explains the poor efficacy of single-target systemic therapies in HCC. Taken together, we not only revealed an unprecedentedly dynamic landscape of phenotypic heterogeneity in HCC, but also highlighted the importance of studying phenotypic evolution across cancer types. Using a prospective cohort for Hepatocellular Carcinoma (the PLANET study), this work revealed a dynamic landscape of phenotypic intra-tumor heterogeneity, providing several novel approaches for patient treatment and prognosis prediction.
The earliest events during human tumour initiation, although poorly characterized, may hold clues to malignancy detection and prevention. Here we model occult preneoplasia by biallelic inactivation of TP53, a common early event in gastric cancer, in human gastric organoids. Causal relationships between this initiating genetic lesion and resulting phenotypes were established using experimental evolution in multiple clonally derived cultures over 2 years. TP53 loss elicited progressive aneuploidy, including copy number alterations and structural variants prevalent in gastric cancers, with evident preferred orders. Longitudinal single-cell sequencing of TP53-deficient gastric organoids similarly indicates progression towards malignant transcriptional programmes. Moreover, high-throughput lineage tracing with expressed cellular barcodes demonstrates reproducible dynamics whereby initially rare subclones with shared transcriptional programmes repeatedly attain clonal dominance. This powerful platform for experimental evolution exposes stringent selection, clonal interference and a marked degree of phenotypic convergence in premalignant epithelial organoids. These data imply predictability in the earliest stages of tumorigenesis and show evolutionary constraints and barriers to malignant transformation, with implications for earlier detection and interception of aggressive, genome-instable tumours.