David Novak, PhD
Hi there! I'm a bioinformatics and machine learning consultant working with Burns LSC and Ionic.
With both hands-on and leadership experience in bioinformatics, I focus on a wide array of data modalities, and their various combinations. These are, mostly but not exclusively, flow cytometry, CyTOF, CITE-seq, bulk and single-cell RNA-seq, and other NGS data.
I am excited about responsible and interpretable AI and machine learning applications in biology.
Are you having trouble with your flow/spectral/mass cytometry, bulk/single-cell sequencing, image, or spatial data analysis?
I can help.
Feel free to e-mail me, and we'll discuss what I can do for you.
During my MSc at Childhood Leukaemia Investigation Prague (Czech Republic) and PhD at Saeys Lab (Belgium), I advanced exploratory data analysis and statistical modelling of high-dimensional biological datasets.
I always put an emphasis on creating interactive solutions to keep domain experts in the loop.
I collaborate with immunologists, bioinformaticians, and computer scientists alike.
My background
- A Biology undergrad, I shifted toward computer science & bioinformatics a year into my studies, completing a Bioinformatics MSc at Charles University.
- My research started at Childhood Leukaemia Investigation Prague (CLIP), a clinical and research lab. Focusing chiefly on flow & CyTOF data, I helped develop tviblindi: a human-in-the-loop trajectory inference framework powered by persistent homology. This allowed us to build multi-organ models of human B-cell and T-cell development.
- Having secured a personal FWO Strategic Basic Research grant, I accepted a PhD position at Saeys Lab, Center for Inflammation Research, VIB-UGent.
- Heading a collab with colleagues at UCLouvain, I led the development of ViVAE and ViScore: a novel trustworthy dimension-reduction model with QC measures grounded in differential geometry, and a framework for robustly evaluating low-dimensional data embeddings.
- I'm leading a collaborative project with immunologists from Mario Roederer's lab, Vaccine Research Center, NIH. I designed iidx: an end-to-end workflow for large-scale statistical analysis of complex immunophenotype changes in cytometry data. We managed to put together the largest high-dimensional cytometry map of immune system changes linked to age and sex to date, with a cohort of 2196 human donors.
- I designed and taught practical sessions for over 200 machine learning students over 4 years at Ghent University, and guided individuals and groups in their projects. Additionally, I co-organised the inaugural Computational Cytometry Summer School, teaching statistical analysis for computational cytometry.
Blog posts
Some of my most interesting experiments and workflows: open-source, reproducible, and fully documented.
Here they are:
Portfolio
A selection of my projects (incl. collaborations) can be found below.
For most, code and documentation are up on GitHub.
In some cases, the release of all materials is pending journal publication.
Dimensionality reduction and structure learning
ViVAE is a framework for generating low-dimensional embeddings of single-cell genomics/cytometry datasets.
We show ViVAE to improve multi-scale structure preservation, interpretability, and QC mechanisms.
Using VAEs, a novel stochastic-MDS loss (based on SQuadMDS), and data denoising, we achieve a better balance of local and global structure preservation.
The model is equipped with a new algorithm for detecting latent space distortions (encoder indicatrices) and integrates with FlowSOM for exploratory analysis.
I am the first author of the associated manuscript, which I penned with my co-authors from Ghent University and UCLouvain (under review at Cell Systems; read current pre-print here).
The work was presented at CYTO 2024.
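To give a rough feel for the stochastic-MDS idea mentioned above, here is a minimal Python sketch. It is not the ViVAE loss and not SQuadMDS itself (which works on quartets of points): it is a simplified pairwise variant, and the function name `stochastic_mds_stress` and its parameters are made up for this illustration. It samples random pairs of points and compares their normalised distances in the original data and in the embedding.

```python
import math
import random

def stochastic_mds_stress(high, low, n_pairs=500, seed=0):
    """Distance mismatch between data and embedding, estimated on random pairs.

    Hypothetical helper for illustration only; not part of ViVAE.
    `high` and `low` are lists of points (lists of floats) in matching order.
    """
    rng = random.Random(seed)
    n = len(high)
    d_high, d_low = [], []
    for _ in range(n_pairs):
        i, j = rng.sample(range(n), 2)  # two distinct point indices
        d_high.append(math.dist(high[i], high[j]))
        d_low.append(math.dist(low[i], low[j]))
    # Normalise both sets of distances so that only relative distances matter.
    s_high, s_low = sum(d_high), sum(d_low)
    if s_high == 0 or s_low == 0:
        raise ValueError("all sampled distances are zero")
    return sum((a / s_high - b / s_low) ** 2 for a, b in zip(d_high, d_low))

# Toy data: 50 random points in 5 dimensions.
rng = random.Random(1)
data = [[rng.random() for _ in range(5)] for _ in range(50)]
rescaled = [[3 * v for v in p] for p in data]   # same geometry, other scale
collapsed = [[p[0], 0.0] for p in data]         # keeps only one coordinate

print(stochastic_mds_stress(data, rescaled))
print(stochastic_mds_stress(data, collapsed))
```

Because the stress is evaluated on a random subset of pairs rather than on all of them, the cost per evaluation stays fixed as the dataset grows, which is the main appeal of the stochastic formulation.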
GroupEnc is a proof-of-concept project for parametric multi-dimensional scaling (MDS) on the GPU, which I presented at BNAIC/BeNeLearn 2023.
Check out the conference paper here.
Topological trajectory inference
tviblindi is a semi-supervised single-cell trajectory inference (TI) tool.
For my master's thesis, I implemented parts of the topological data analysis (TDA) pipeline in C++ and created a method for clustering trajectories based on persistent homology, as well as a GUI implemented in R Shiny.
This allowed for a human-in-the-loop solution to interrogating developmental trajectories and building multi-organ models of B- and T-lymphopoiesis.
Check out the related publications pertaining, respectively, to B-cell and T-cell development.
Large-scale differential expression analysis
iidx is the most comprehensive workflow for pre-processing and differential expression analysis in large cytometry cohorts to date.
Thomas Liechti and I presented this work at CYTO 2025.
The repository already contains the code for reproducing our analysis.
The data will be available once the manuscript (which is in preparation) is published.
tidycell is a basic differential expression analysis tool written in R for cytometry data.
I developed this ad hoc during my time at CLIP.
It has been applied to GvHD data and in a project on head & neck cancers at Biocev.
It is less elaborate than iidx, but it integrates CellCnn as an interesting approach to supervised feature extraction and addressing the multiple testing correction problem in smaller datasets.
This is done in addition to Wilcoxon rank-sum testing of differential abundance.
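For readers unfamiliar with rank-sum testing of differential abundance, the idea can be sketched in a few lines of Python. This is not code from tidycell (which is written in R); the function `rank_sum_test` and the example proportions are invented for this illustration. It uses the normal approximation with a tie correction and no continuity correction, so it is only a rough guide for very small samples.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test, normal approximation.

    Hypothetical helper for illustration. Returns (U statistic of x, p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0.0
    k = 0
    while k < n:
        m = k
        while m + 1 < n and pooled[m + 1][0] == pooled[k][0]:
            m += 1                      # extend the block of tied values
        avg_rank = (k + m) / 2 + 1      # average 1-based rank of the block
        for t in range(k, m + 1):
            ranks[t] = avg_rank
        size = m - k + 1
        tie_term += size ** 3 - size
        k = m + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:                      # all values tied: no evidence either way
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value

# Made-up per-sample proportions of one cell population in two groups.
controls = [0.12, 0.15, 0.11, 0.14, 0.13]
cases = [0.18, 0.21, 0.17, 0.16, 0.20]
u, p = rank_sum_test(controls, cases)
print(f"U = {u}, p = {p:.4f}")
```

In a real analysis, one such test is run per cell population, which is where the multiple testing correction problem mentioned above comes from.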
Accelerating single-cell data annotation and exploration
SingleBench will get you from data to discovery faster.
It is an R framework for better interpretation of cytometry clustering, hyperparameter tuning & benchmarking.
In particular, it makes exploratory cluster analysis fast and clear.
It also allows you to test the influence of iterative data denoising (smoothing), which is poised to become more relevant as the dimensionality of cytometry data increases (with spectral and, to some extent, CyTOF).
Featured in my blog post on exploratory cluster analysis in cytometry.
cytoSNOW takes the standard FlowSOM protocol and speeds it up, to work fast with big data.
I'm interested in making computational cytometry accessible to anyone, even without fancy hardware--this is a step in that direction.
I wrote up a small blog post on my cytoSNOW workflow, showing how it gave a 4.6-fold speed-up in a large computational cytometry workflow on my laptop.
hloss is work that I presented at the ABLS 2022 bioinformatics conference.
It tackles the issue of evaluating cell type classification in single-cell data in a way that reflects known hierarchies and ontologies.
A novel scoring approach incorporates a biological prior to assess error based on degrees of relatedness.
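As a minimal sketch of what hierarchy-aware scoring can look like (not the actual hloss scoring, whose details are not given here), the Python snippet below penalises a misclassification by the number of edges between the true and the predicted label in a cell type tree. The toy hierarchy `PARENT` and the function names are assumptions made for this example.

```python
# Toy cell type hierarchy (child -> parent), invented for illustration.
PARENT = {
    "leukocyte": None,
    "lymphocyte": "leukocyte",
    "myeloid cell": "leukocyte",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "CD4 T cell": "T cell",
    "CD8 T cell": "T cell",
    "monocyte": "myeloid cell",
}

def path_to_root(label):
    """List of labels from `label` up to the root of the hierarchy."""
    path = [label]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def tree_distance(a, b):
    """Number of edges between two labels via their lowest common ancestor."""
    depth_from_a = {node: d for d, node in enumerate(path_to_root(a))}
    for depth_from_b, node in enumerate(path_to_root(b)):
        if node in depth_from_a:
            return depth_from_a[node] + depth_from_b
    raise ValueError("labels do not share an ancestor")

def hierarchical_error(true_labels, predicted_labels):
    """Mean tree distance between true and predicted labels (0 = all correct)."""
    pairs = list(zip(true_labels, predicted_labels))
    return sum(tree_distance(t, p) for t, p in pairs) / len(pairs)

# Confusing CD4 with CD8 T cells costs less than confusing them with monocytes.
print(tree_distance("CD4 T cell", "CD8 T cell"))
print(tree_distance("CD4 T cell", "monocyte"))
```

A plain accuracy score would treat both mistakes in the example as equally bad; a distance over the hierarchy is one simple way to encode degrees of relatedness.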
Work in progress on alternatives to hierarchical metaclustering used in FlowSOM.
Clusters are merged so as to preserve reasonable signal distributions per channel.
In practice, this can be done through preserving unimodality of marker expression, especially for markers that denote cell types (for cytometry data).
This is an ongoing effort, since the requirement of preserving some distribution modalities in metaclustering arises now and then in different projects.
Evaluating single-cell data embeddings
ViScore is a collection of evaluation metrics for dimensionality reduction that address past problems with fairness and scalability.
Together with collaborators from UCLouvain, we put together a battery of both unsupervised and supervised evaluation algorithms and an extensible HPC benchmarking framework.
We build on RNX curves and the Neighbourhood Proportion Error to provide novel embedding-level and population-level scores.
This is described in our ViVAE pre-print.
We're incorporating some of the evaluation metrics from ViScore into TRACE, as presented at CYTO 2025 by Laura Hajzoková.
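For context on the RNX curves mentioned above, here is a small brute-force Python sketch of the standard definition: Q_NX(K) is the average overlap between K-nearest-neighbour sets in the original data and in the embedding, and R_NX(K) rescales it so that a random embedding scores around 0 and a perfect one scores 1. This is not ViScore code; the function names are hypothetical and the quadratic-time neighbour search is only suitable for toy data.

```python
import math

def knn_sets(points, k):
    """Set of indices of the k nearest neighbours of each point (brute force)."""
    n = len(points)
    neighbours = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours

def rnx(high, low, k):
    """R_NX(K) for one neighbourhood size K.

    Q_NX(K) = mean overlap of K-NN sets / K
    R_NX(K) = ((N - 1) * Q_NX(K) - K) / (N - 1 - K)
    """
    n = len(high)
    if not 1 <= k <= n - 2:
        raise ValueError("k must satisfy 1 <= k <= N - 2")
    nn_high, nn_low = knn_sets(high, k), knn_sets(low, k)
    q_nx = sum(len(a & b) for a, b in zip(nn_high, nn_low)) / (k * n)
    return ((n - 1) * q_nx - k) / (n - 1 - k)

# Toy example: six 3-d points and a 1-d "embedding" that keeps one coordinate.
points = [[0, 0, 0], [1, 0, 0], [0, 2, 0], [3, 3, 0], [5, 1, 1], [2, 4, 2]]
flat = [[p[0]] for p in points]
curve = [rnx(points, flat, k) for k in range(1, len(points) - 1)]
print(curve)
```

Evaluating `rnx` over a range of K values gives the curve; small K reflects local structure preservation and large K reflects global structure.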
Miscellaneous
qctoy is an R package for simulating aberrances in flow cytometry measurements that are relevant in designing QC tools and pipelines.
I developed this small tool during a summer internship in 2019 at Saeys Lab to help with designing the QC algorithm that eventually became PeacoQC.
auto_compensate is an automated pipeline for large-scale cytometry data compensation which I designed for CLIP.
RCondaRun is a tiny package for switching between Conda environments within a single R session when interfacing with Python.
hidden is a hidden Markov model simulator in R.
I wrote it because I find HMMs fun and wanted to understand them better.
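If you have not met HMMs before, simulating one takes only a few lines. The Python sketch below is not a port of hidden (which is an R package); the function `simulate_hmm` and the two-state weather-style parameters are invented for this example. At each step, it emits an observation from the current hidden state and then moves to the next state according to the transition probabilities.

```python
import random

def simulate_hmm(start, trans, emit, length, seed=None):
    """Sample hidden states and observations from a discrete HMM.

    Hypothetical helper for illustration. All distributions are dicts
    mapping an outcome to its probability.
    """
    rng = random.Random(seed)

    def draw(dist):
        return rng.choices(list(dist), weights=list(dist.values()))[0]

    states, observations = [], []
    state = draw(start)
    for _ in range(length):
        states.append(state)
        observations.append(draw(emit[state]))  # emit from the current state
        state = draw(trans[state])              # then move to the next state
    return states, observations

# Made-up two-state model.
START = {"rainy": 0.5, "sunny": 0.5}
TRANS = {
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.2, "sunny": 0.8},
}
EMIT = {
    "rainy": {"umbrella": 0.9, "no umbrella": 0.1},
    "sunny": {"umbrella": 0.2, "no umbrella": 0.8},
}

states, observations = simulate_hmm(START, TRANS, EMIT, length=10, seed=42)
for s, o in zip(states, observations):
    print(f"{s:>5} -> {o}")
```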
CommandLineParser is a C#/.NET API for a command line interface that I co-wrote with Kačka Břicháčková.
This is a course project we teamed up for during our Master's in Bioinformatics at Charles University.
avl_tree is an Adelson-Velsky and Landis tree implementation in Pascal.
It's some of my earliest code, written during my Bachelor's in Biology, when I took elective computer science courses.