SELECTED PUBLICATIONS

2025
Uncovering the regulatory landscape of early human B cell lymphopoiesis and its implications in the pathogenesis of B-ALL
Science Advances, 10/2025. Science Advances | Atlas

A multiomics atlas of chromatin accessibility and gene expression across early human B cell precursors reveals cell type–specific regulatory elements and reconstructs differentiation networks. Candidate regulons, such as ELK3, were validated using single-cell data, refining the regulatory landscape. This publicly available resource provides key insights into B cell development and disease, supporting studies of immunity and hematologic malignancies.

GeneSetCluster 2.0: a comprehensive toolset for summarizing and integrating gene-sets analysis
BMC Bioinformatics, 08/2025. BMC Bioinformatics | GitHub | ShinyApp

Gene-Set Analysis (GSA) often struggles with redundancies that complicate clustering and interpretation. GeneSetCluster 2.0 addresses this with improved methods for handling duplicated gene-sets, a seriation-based clustering algorithm, faster computation, and enhanced cluster annotations linking results to tissues and biological processes. A user-friendly web application and R package make the tool accessible to both programmers and non-programmers, enabling efficient and interpretable gene-set analyses.

SPELL: Spatial Prompting with Chain-of-Thought for Zero-Shot Learning in Spatial Transcriptomics
ICLR Singapore Conference, 04/2025. ICLR | Poster

SPELL introduces a zero-shot learning framework for cell-type classification in spatial transcriptomics, integrating spatial embeddings and chain-of-thought prompting. Using graph autoencoders and BART models, it achieves high accuracy (e.g., 64% on MERFISH) without task-specific fine-tuning. Spatial context significantly enhances performance, highlighting its critical role in biologically interpretable classification across diverse datasets.

stDiffusion: A Diffusion Based Model for Generative Spatial Transcriptomics
ICLR Singapore Conference, 04/2025. ICLR | Poster

stDiffusion employs a denoising diffusion probabilistic model to generate spatial transcriptomics data, predicting unseen tissue slices. It learns 2D gene expression patterns and interpolates between finite ST slices, advancing AI-augmented spatial transcriptomics. The model sets the stage for predictive 3D tissue modeling from limited data.

Interpretable Causal Representation Learning for Biological Data in the Pathway Space
ICLR Singapore Conference, 04/2025. ICLR | Poster

SENA-discrepancyVAE enhances causal representation learning by mapping latent factors to interpretable biological processes. It predicts the effects of genomic and drug perturbations while maintaining performance comparable to non-interpretable models. The SENA-δ encoder ensures biologically meaningful causal factors, improving therapy development.

NLRP3-mediated glutaminolysis controls microglial phagocytosis to promote Alzheimer’s disease progression
Immunity, 02/2025. Immunity

This study shows NLRP3 activation in Alzheimer’s disease triggers microglial glutaminolysis, enhancing Aβ clearance but promoting disease progression. Loss of NLRP3 boosts metabolic activity and Slc1a3 expression. Chronic NLRP3 inhibition mimics these effects, suggesting therapeutic potential.

2024
Reviewability and supportability: New complementary principles to empower research software practices
Computational and Structural Biotechnology Journal, 12/2024. Computational and Structural Biotechnology Journal | GitHub

This review proposes reviewability and supportability as principles to enhance research software, complementing FAIR principles. It highlights software’s role in reproducibility and transparency in life sciences. The principles aim to improve peer review efficiency and guide scientists in developing robust research software.

ClustAll: An R package for patient stratification in complex diseases
PLOS Computational Biology, 12/2024. PLOS Computational Biology | Bioconductor | GitHub

ClustAll is a Bioconductor R package for unsupervised patient stratification in complex diseases using clinical data. Built on a validated clustering framework, it handles mixed data types, missing values, and collinearity, identifying multiple robust stratifications within a population. It supports parallel computing, provides user-friendly tools, and has been validated on public clinical datasets to support personalized medicine.

A comparative analysis of blastoid models through single-cell transcriptomics
iScience, 11/2024. iScience | Atlas

This study uses single-cell RNA sequencing to compare blastoid models with human blastocysts, assessing cell-type composition and lineage profiles. Blastoids from naive pluripotent stem cells resemble blastocysts more closely than those from extended pluripotent stem cells, which show higher primitive endoderm and ambiguous cells. Gene expression heterogeneity in starting cell lines influences blastoid lineage differentiation, aiding optimization of embryogenesis models.

Derivation of two iPSC lines (KAIMRCi004-A, KAIMRCi004-B) from a Saudi patient with Biotin-Thiamine-responsive Basal Ganglia Disease (BTBGD) carrying homozygous pathogenic missense variant in the SCL19A3 gene
Human Cell, 09/2024. Human Cell

This study creates two iPSC lines from a BTBGD patient with a homozygous SLC19A3 mutation (c.1264A>G, p.Thr422Ala) to model the disease. It aims to investigate SLC19A3-related basal ganglia dysfunction mechanisms. The pluripotent iPSCs, differentiated into neural progenitors, enable exploration of disease pathways. This model supports new therapeutic development for BTBGD.

Competition shapes the landscape of X-chromosome-linked genetic diversity
Nature Genetics, 07/2024. Nature Genetics

This study investigates how X-chromosome inactivation (XCI) creates clonal diversity in XX individuals, focusing on the STAG2 gene in mouse models. It finds that Stag2 variant clones are outcompeted by wild-type clones in lymphoid tissues due to continuous cellular competition, not just intrinsic defects. The research highlights how clone interactions shape X-linked genetic diversity in a cell-type-specific manner.

Global compositional and functional states of the human gut microbiome in health and disease
Genome Research, 06/2024. Genome Research

This study analyzes 6,014 gut metagenome samples across 19 countries and 23 diseases to map microbial diversity and function. It identifies key bacteria like Fusobacterium nucleatum (enriched) and Anaerostipes hadrus (depleted) in disease cohorts, revealing distinct functional profiles in westernized and nonwesternized populations. The findings are accessible via the Human Gut Microbiome Atlas for exploring microbiota signatures.

An atlas of cells in the human tonsil
Immunity, 02/2024. Immunity | GitHub

This study creates a comprehensive tonsil cell atlas using 556,000 cells profiled via multi-modal single-cell and spatial transcriptomics. The atlas identifies 121 cell types and states, mapping developmental trajectories and functional units critical for immunological defense against pathogens. It defines immune cell functions, validates findings in lymphoma, and reveals age-related shifts in tonsillar composition.

2023
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 12/2023. Conference | GitHub | Video

Whispering LLaMA introduces a cross-modal framework for generative error correction in speech recognition, using acoustic and linguistic data. It improves word error rate by 37.66% compared to n-best Oracle, leveraging pre-trained models. The open-source code encourages further research.

Reusability report: Learning the transcriptional grammar in single-cell RNA-sequencing data using transformers
Nature Machine Intelligence, 11/2023. Nature Machine Intelligence | GitHub

This report evaluates scBERT, a transformer-based model, for annotating cell types in single-cell RNA-seq data. It leverages pretraining and self-attention to learn transcriptional patterns but is sensitive to imbalanced cell-type distributions. Subsampling and oversampling techniques mitigate this, enhancing generalizability in single-cell genomics.

Gene therapy restores the transcriptional program of hematopoietic stem cells in Fanconi anemia
Haematologica, 10/2023. Haematologica

This study uses single-cell RNA sequencing to show that lentiviral gene therapy corrects the transcriptional defects in hematopoietic stem and progenitor cells (HSPCs) of Fanconi anemia patients. It demonstrates that corrected HSPCs resemble healthy cells, with downregulated TGF-β and p21 and upregulated DNA repair pathways, suggesting gene therapy can reverse molecular defects in Fanconi anemia HSPCs.

A second update on mapping the human genetic architecture of COVID-19
Nature, 09/2023. Nature

This genome-wide association study (GWAS) meta-analysis of 219,692 COVID-19 cases and over 3 million controls identifies 51 significant loci, adding 28 new loci since the prior release. It maps viral entry, airway defense, and type I interferon pathways, enhancing understanding of genetic factors for drug development.

LEP-AD: Language Embeddings of Proteins and Attention to Drugs predicts drug target interactions
ICLR Conference, 04/2023. ICLR | GitHub

LEP-AD combines Evolutionary Scale Modeling (ESM-2) and Transformer-GCN to predict drug-target interactions, outperforming methods like SimBoost and DeepCPI. It achieves state-of-the-art binding affinity predictions using pre-trained protein language models across multiple datasets (e.g., Davis, KIBA). Pre-trained protein embeddings surpass AlphaFold 3D representations, scaling well with training data size.

Preclinical models for prediction of immunotherapy outcomes and immune evasion mechanisms in genetically heterogeneous multiple myeloma
Nature Medicine, 03/2023. Nature Medicine

This study develops 15 genetically diverse mouse models of multiple myeloma to study immunotherapy outcomes. A MAPK–MYC pathway accelerates progression, influencing immune evasion. Rapid MYC-driven progressors show high CD8+ T cell activation with low Treg cells, while slow progressors have higher Treg infiltration. High CD8+ T/Treg ratios predict immunotherapy response, guiding strategies to overcome resistance.

Translating single-cell genomics into cell types
Nature Machine Intelligence, 01/2023. Nature Machine Intelligence

This news piece discusses machine translation techniques that automatically classify cell types from single-cell transcriptomic data. It highlights potential for analyzing complex clinical samples like tumors at scale, advancing precision medicine.

2022
Data-driven bioinformatics to disentangle cells within a tissue microenvironment
Trends in Cell Biology, 06/2022. Trends in Cell Biology

This spotlight showcases how machine learning advances molecular profiling of clinical tissues by enabling the deconvolution of mixed cell types and the identification of population shifts in response to infections or drug treatments. These capabilities support precision medicine.

Deconvolution of the hematopoietic stem cell microenvironment reveals a high degree of specialization and conservation
iScience, 04/2022. iScience

This study integrates single-cell RNA-seq datasets to map the hematopoietic stem cell microenvironment, identifying 14 endothelial and 11 mesenchymal cell states. It reveals high specialization and conserved regulatory features across species, advancing bone marrow microenvironment understanding.

2021
Mapping the human genetic architecture of COVID-19
Nature, 07/2021. Nature | GitHub

This GWAS of 49,562 COVID-19 cases across 46 studies identifies 13 loci associated with SARS-CoV-2 infection and severity, implicating lung, autoimmune, and inflammatory pathways. Mendelian randomization supports smoking and BMI as causal risk factors. It identifies actionable mechanisms for therapeutic development and informs future genetic studies of pandemics.

A robust machine learning framework to identify signatures for frailty: a nested case-control study in four aging European cohorts
GeroScience, 02/2021. GeroScience

This study uses machine learning to identify frailty biomarkers across four aging cohorts, analyzing genomic, proteomic, and metabolomic data. It finds protective (vitamin D3, lutein zeaxanthin, miRNA125b-5p) and risk (cardiac troponin T, pro-BNP, sRAGE) biomarkers. Oxidative stress, vitamin D, and cardiovascular markers vary by disability status, offering insights into multi-systemic pathological processes.

2020
Harmonization of quality metrics and power calculation in multi-omic studies
Nature Communications, 06/2020. Nature Communications | GitHub

The MultiPower method harmonizes quality metrics across omic platforms, estimating optimal sample sizes for multi-omic experiments. Complemented by MultiML, it supports machine learning classification tasks, offering graphical tools for experimental design. The approach ensures robust multi-omic data analysis, enhancing the reliability of comprehensive cellular models in diverse experimental settings.

2019
Building gene regulatory networks from scATAC-seq and scRNA-seq using Linked Self Organizing Maps
PLOS Computational Biology, 11/2019. PLOS Computational Biology

SOMatic uses self-organizing maps to integrate scATAC-seq and scRNA-seq, building gene regulatory networks. Applied to a B cell differentiation time-course with Ikaros overexpression, it recovers known interactions and predicts new Ikaros targets. The method overcomes challenges of sparse and noisy single-cell data, enabling integrative analysis of heterogeneous genomic datasets and advancing regulatory network discovery.

Multiple sclerosis genomic map implicates peripheral immune cells and microglia in susceptibility
Science, 09/2019. Science

This study identifies over 200 risk loci for multiple sclerosis (MS) using genome-wide association studies, implicating peripheral immune cells and microglia in susceptibility. The analysis explains up to 48% of MS’s genetic contribution and highlights immune pathways, with enrichment in brain-resident immune cells, clarifying MS susceptibility mechanisms.

An Algorithmic Information Calculus for Causal Discovery and Reprogramming Systems
iScience, 09/2019. iScience

This method uses algorithmic information content to control and reprogram systems via controlled interventions. Validated on cellular automata, graphs, and biological networks (e.g., E. coli, Th17 cells), it reconstructs phase spaces and predicts causal interactions for therapeutic applications.

Therapeutic efficacy of dimethyl fumarate in relapsing-remitting multiple sclerosis associates with ROS pathway in monocytes
Nature Communications, 07/2019. Nature Communications

This study links dimethyl fumarate’s efficacy in multiple sclerosis to increased monocytic ROS, with changes in monocyte methylome and transcriptome preceding T cell effects. A NOX3 gene variant is linked to beneficial treatment response. Monocyte counts and redox state serve as potential biomarkers for DMF therapy decisions, implicating oxidative processes in autoimmune disease treatment.

Combining evidence from four immune cell types identifies DNA methylation patterns that implicate functionally distinct pathways during Multiple Sclerosis progression
EBioMedicine, 05/2019. EBioMedicine

The omicsNPC framework integrates DNA methylation data from four immune cell types, identifying changes in relapsing-remitting (RRMS) and secondary progressive (SPMS) multiple sclerosis. RRMS shows lymphocyte signaling and T cell activation, while SPMS implicates myeloid metabolism and neuronal pathways. Shared methylation patterns co-localize with MS risk loci, offering insights into disease progression and pathogenesis.

Causal deconvolution by algorithmic generative models
Nature Machine Intelligence, 01/2019. Nature Machine Intelligence | GitHub

This study introduces a parameter-free method based on algorithmic probability to deconvolve complex interactions into generative models. It successfully infers generative models for bit strings, images, and networks, complementing statistical approaches to tackle causation in complex systems.

2018
DNA methylation as a mediator of HLA-DRB1*15:01 and a protective variant in multiple sclerosis
Nature Communications, 06/2018. Nature Communications

The HLA-DRB1*15:01 haplotype, a major multiple sclerosis risk factor, is hypomethylated in monocytes, driving increased expression. A differentially methylated region in HLA-DRB1 exon 2 regulates expression, while a protective variant (rs9267649) increases methylation, reducing risk. Causal inference supports HLA variants’ role in MS via methylation changes, suggesting therapeutic strategies targeting epigenetic regulation.

A Decomposition Method for Global Evaluation of Shannon Entropy and Local Estimations of Algorithmic Complexity
Entropy, 04/2018. Entropy

The Block Decomposition Method extends the Coding Theorem Method to estimate algorithmic complexity by decomposing objects into small programs. It performs well on low-complexity objects but aligns with Shannon entropy when less accurate, offering multi-dimensional applications.

2017
Low-algorithmic-complexity entropy-deceiving graphs
Physical Review E, 07/2017. Physical Review E

Using Borel-normal integer sequences, this study constructs recursive and nonrecursive graphs to expose limitations of entropy-based complexity measures. Different lossless descriptions of the same graph yield disparate entropy values, misrepresenting causal likelihood. The approach highlights the dependence of computable measures on object representation, advocating for algorithmic complexity metrics in graph analysis.

2016
A survey of best practices for RNA-seq data analysis
Genome Biology, 01/2016. Genome Biology

This review outlines RNA-seq analysis steps, including experimental design, quality control, read alignment, and differential expression analysis. It addresses challenges in quantifying gene and transcript levels and detecting alternative splicing, and discusses small RNA analysis and integration with other genomics techniques.