David Novak

davidnovak9000 at gmail dot com

CV | Google Scholar page | GitHub page

Hi there! I'm a bioinformatics and machine learning consultant working with Burns LSC. I focus mostly but not exclusively on flow cytometry, CyTOF, bulk and single-cell RNA-seq, and other NGS data. I have a track record in algorithm development as well as more data-driven analytical workflows.

In my PhD at Saeys Lab, Flemish Institute of Biotechnology, I've advanced exploratory data analysis and statistical modelling of high-dimensional biological datasets. I always put an emphasis on creating interactive solutions to keep domain experts in the loop. I collaborate with immunologists, bioinformaticians, and computer scientists alike.

I'm open to work, looking for interesting positions primarily (but not exclusively) in Canada.

I am excited about responsible and interpretable AI and machine learning in biology. At Ghent University, I have designed and taught practical sessions for over 200 post-grad students over 4 years, and guided individuals and groups through their ML projects. I also co-organised the inaugural Computational Cytometry Summer School, guiding participants on statistical analysis within computational cytometry.

My background
A biology undergrad, I shifted toward bioinformatics a year into my studies, and went on to do a Master's and PhD in it. My research started out at Childhood Leukaemia Investigation Prague (CLIP), where I worked with flow & CyTOF data, helping to develop tviblindi: an interactive trajectory inference tool powered by persistent homology. This allowed us to build multi-organ models of human B-cell development (here) and T-cell development (here). I started a collaboration with UCLouvain to develop ViVAE and ViScore: a novel VAE-based dimension-reduction model with QC measures grounded in differential geometry, and a framework for evaluating embeddings of single-cell datasets. Our manuscript (here) is under review now. Working with immunologists from the NIH, I designed and validated iidx: an end-to-end pipeline for large-scale statistical analysis of complex age- and sex-associated immunophenotype changes, and put it to use with a 2196-donor flow cytometry data cohort.

My projects

I list some of my projects, including collaborations, below. They are sorted into categories.

Dimensionality reduction and structure learning

ViVAE

Lower-dimensional embedding framework that demonstrably improves structure preservation, interpretability and QC in scRNA-seq dimensionality reduction. Using VAEs, a novel stochastic-MDS loss (based on SQuadMDS) and data de-noising, we achieve a better balance of local and global structure preservation with scRNA-seq data. Additionally, the model is equipped with a novel and generalisable algorithm for detecting latent space distortions (encoder indicatrices) and integrates with FlowSOM. I am the first author of the associated manuscript, penned with my co-authors from Ghent University and UCLouvain (read current pre-print here). The work was presented at CYTO 2024.

GroupEnc

GroupEnc is a proof-of-concept project for parametric multi-dimensional scaling (MDS) on GPU, presented at BNAIC/BeNeLearn 2023. Check out the conference paper here.
Topological trajectory inference

tviblindi

tviblindi is a semi-supervised single-cell TI tool that uses TDA and persistent homology to work with high-dimensional data. For my Master's thesis, I implemented parts of the TDA pipeline in C++ and created a method for clustering trajectories based on persistent homology, as well as a GUI in Shiny. The tool has since been applied successfully to create multi-organ models of development that refine descriptions of human B-cell and T-cell development.
Differential expression analysis

iidx: interpretable and interactive differential expression in cytometry

iidx is the most comprehensive workflow for pre-processing and differential expression analysis in large cytometry cohorts to date. The work will be presented by me & Thomas Liechti at CYTO 2025.

tidycell

tidycell is a basic differential expression analysis tool written in R for cytometry data. I developed it during my time at CLIP. It has been applied to GvHD data and in a project on head & neck cancers at Biocev.
Accelerating discovery in cytometry data

SingleBench

SingleBench will get you from data to discovery quicker. It is an R framework for better interpretation of cytometry clustering, hyperparameter tuning & benchmarking.
Semi-automated single-cell data annotation

hloss

hloss is work I presented at the ABLS 2022 bioinformatics conference. It tackles the issue of evaluating cell type classification in single-cell data in a way that reflects known hierarchies and ontologies. A novel scoring approach incorporates a biological prior to assess error based on degrees of relatedness.
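
To give a flavour of what hierarchy-aware scoring can look like, here is a minimal Python sketch. It is not the hloss implementation: the toy cell-type hierarchy, the function names and the choice of tree distance as the penalty are all assumptions made for illustration. The idea is only that confusing two closely related populations should cost less than confusing distant ones.

```python
# Hypothetical toy hierarchy (child -> parent); not taken from hloss itself.
PARENT = {
    "lymphocyte": "leukocyte",
    "myeloid cell": "leukocyte",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "CD4 T cell": "T cell",
    "CD8 T cell": "T cell",
    "monocyte": "myeloid cell",
}


def path_to_root(label):
    """Return the list of labels from `label` up to the root of the hierarchy."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a, b):
    """Number of edges separating two labels in the hierarchy."""
    path_a = path_to_root(a)
    path_b = path_to_root(b)
    depth_in_b = {node: i for i, node in enumerate(path_b)}
    # The first ancestor of `a` that also lies above `b` is their closest common ancestor.
    for i, node in enumerate(path_a):
        if node in depth_in_b:
            return i + depth_in_b[node]
    raise ValueError("labels do not share a root")


def hierarchical_error(true_labels, predicted_labels):
    """Mean tree distance between true and predicted labels (0 means perfect)."""
    distances = [tree_distance(t, p) for t, p in zip(true_labels, predicted_labels)]
    return sum(distances) / len(distances)
```

With this toy hierarchy, mistaking a CD4 T cell for a CD8 T cell is penalised less than mistaking it for a monocyte, which a flat accuracy score cannot express.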

SplitScore

Work in progress on alternatives to hierarchical metaclustering done by FlowSOM. Clusters are merged so as to preserve reasonable signal distributions per channel. In practice, this can be done through preserving unimodality of marker expression (for cytometry data). This is an ongoing effort, since the requirement of preserving some distribution modalities in metaclustering arises now and then in different projects.
Evaluation & benchmarking of dimension reduction

ViScore

ViScore is a collection of evaluation metrics for dimensionality reduction that address past problems with fairness and scalability. Together with collaborators from UCLouvain, we have released a battery of both unsupervised and supervised evaluation algorithms and an extensible HPC benchmarking framework. We build on RNX curves and the Neighbourhood Proportion Error to provide novel embedding-level and population-level scores. This is described in our ViVAE pre-print. We're incorporating some of the evaluation metrics from ViScore into TRACE, which will be presented at CYTO 2025 by Laura Hajzoková.
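
For readers unfamiliar with RNX curves, the following is a minimal Python sketch of the standard textbook definition, not ViScore's API or its scalable approximation. The function names are made up for illustration, and the brute-force neighbour search is only suitable for tiny inputs. Q_NX(K) is the average overlap between K-nearest-neighbour sets in the original and embedded space, and R_NX(K) rescales it so that a random embedding scores around 0 and a perfect one scores 1.

```python
import math


def knn_sets(points, k):
    """For each point, the set of indices of its k nearest neighbours (Euclidean)."""
    n = len(points)
    result = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        result.append(set(others[:k]))
    return result


def q_nx(high, low, k):
    """Average fraction of k-nearest neighbours shared between the two spaces."""
    n = len(high)
    shared = sum(len(a & b) for a, b in zip(knn_sets(high, k), knn_sets(low, k)))
    return shared / (k * n)


def r_nx(high, low, k):
    """Q_NX rescaled against the overlap expected by chance; requires k < n - 1."""
    n = len(high)
    return ((n - 1) * q_nx(high, low, k) - k) / (n - 1 - k)
```

Evaluating `r_nx` for K = 1, ..., N - 2 traces out the curve; summarising it over small K emphasises local structure, while large K emphasises global structure.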
Utilities for computational cytometry

qctoy

qctoy is an R package for simulating aberrances in flow cytometry measurements that are relevant in designing QC tools and pipelines. I developed this small tool during a summer internship at Saeys Lab to help with designing the QC algorithm that eventually became PeacoQC.

auto_compensate

auto_compensate is an automated pipeline for large-scale cytometry data compensation I designed for CLIP.
Miscellaneous

hidden

hidden is a hidden Markov model simulator in R. I wrote it to understand HMMs better.
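
As a rough idea of what simulating an HMM involves, here is a small Python sketch (hidden itself is written in R, and this does not mirror its interface). The function name, argument layout and use of dictionaries for the probability tables are assumptions for illustration. At each step, the current hidden state emits an observation and then transitions to the next hidden state.

```python
import random


def simulate_hmm(initial, transition, emission, length, seed=None):
    """Sample hidden states and observations from a discrete HMM.

    initial:    {state: probability}
    transition: {state: {next_state: probability}}
    emission:   {state: {symbol: probability}}
    """
    rng = random.Random(seed)

    def draw(dist):
        # Sample one outcome from a {outcome: probability} table.
        outcomes = list(dist)
        weights = [dist[o] for o in outcomes]
        return rng.choices(outcomes, weights=weights, k=1)[0]

    states, observations = [], []
    state = draw(initial)
    for _ in range(length):
        states.append(state)
        observations.append(draw(emission[state]))
        state = draw(transition[state])
    return states, observations
```

Passing a `seed` makes a run reproducible, which is handy when comparing simulated sequences against what a decoding algorithm recovers.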

CommandLineParser

CommandLineParser is a C#/.NET API I co-wrote with Kačka Břicháčková. This is a course project we teamed up for during our Master's in Bioinformatics at Charles University.

avl_tree

avl_tree is an Adelson-Velsky and Landis tree implementation in Pascal. It's some of my earliest code, written during my Bachelor's in Biology, when I took elective comp sci courses.

RCondaRun

RCondaRun is a tiny package for switching between Conda environments within a single R session when interfacing with Python.