Our paper has been accepted to RECOMB 2025


Our paper titled “A k-mer-based maximum likelihood method for estimating distances of reads to genomes enables genome-wide phylogenetic placement” has been accepted to RECOMB 2025. I’m looking forward to presenting it in Seoul, which will be my first visit to East Asia.

We introduce a technique to estimate read-to-genome distances from k-mer hits. The method searches a colored k-mer index for k-mers that match those of the read up to a certain Hamming distance and then, for each reference with hits, finds the maximum likelihood distance given the k-mer matches and their Hamming distances. These distances approximate the alignment Hamming distance at the read level and, when averaged across all reads, 1-ANI between a query genome and the matching reference(s). We then propose an intuitive and accurate heuristic that places reads on an existing backbone phylogeny in a principled way using a likelihood ratio test.
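To make the maximum likelihood step more concrete, here is a minimal sketch in Python. It is not the estimator from the paper: it assumes a deliberately simplified model in which every base differs from the reference independently with probability `d`, every k-mer of the read is treated as an independent observation (overlapping k-mers are not independent in reality), and the index never misses a k-mer that lies within the Hamming threshold. The function names, the grid search, and the values `k=29` and `max_h=4` are illustrative assumptions.

```python
import math


def log_likelihood(d, k, hit_counts, n_miss, max_h):
    """Log-likelihood of a per-base distance d under the simplified model.

    hit_counts maps a Hamming distance h (0..max_h) to the number of read
    k-mers whose match in the reference lies at distance h; n_miss is the
    number of read k-mers with no match within max_h mismatches.
    """
    # P(a k-mer is at Hamming distance exactly h) under independent substitutions
    p = [math.comb(k, h) * d**h * (1 - d) ** (k - h) for h in range(max_h + 1)]
    ll = sum(hit_counts.get(h, 0) * math.log(p[h]) for h in range(max_h + 1))
    if n_miss:
        # P(no match within max_h); clamped to avoid log(0) from rounding error
        p_miss = max(1.0 - sum(p), 1e-300)
        ll += n_miss * math.log(p_miss)
    return ll


def estimate_distance(hit_counts, n_miss, k=29, max_h=4, grid=1000):
    """Grid-search the distance in (0, 0.5) that maximizes the log-likelihood."""
    candidates = [i / grid for i in range(1, grid // 2)]
    return max(
        candidates,
        key=lambda d: log_likelihood(d, k, hit_counts, n_miss, max_h),
    )
```

For example, `estimate_distance({0: 30, 1: 45, 2: 30, 3: 12, 4: 3}, n_miss=2)` returns the grid value that best explains a read whose k-mers mostly match one reference with one or two mismatches. In practice this would be repeated for each reference that has hits, giving one distance per reference.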

Abstract: Consider comparing a sequencing read of unknown origin to a set of reference genomes. This problem underlies many applications, including metagenomic analyses. The exact genome generating the read is not in the reference set but may be evolutionarily related to some references. Ideally, we need not just the identity of the closest references to the read but also their distance to the read. The distances can help us identify the read at the right taxonomic level and, more ambitiously, place it on a reference phylogeny. Aligning reads to reference genomes, the only available approach for computing such distances, becomes impractical for very large reference sets. It is also not effective at higher distances when used with efficient indexes (e.g., Bowtie2). While k-mers can create scalable indexes, existing k-mer-based methods are incapable of distance calculation. Thus, estimating distances between short reads and large, diverse reference sets remains challenging and is seldom done. We introduce a method called krepp that combines four ideas to solve this challenge and to further enable placing reads on a reference phylogeny. We use i) locality-sensitive hashing to find inexact k-mer matches, ii) a phylogeny-guided colored k-mer index to map each k-mer to all references containing it, iii) a maximum likelihood framework to estimate read-genome distances using k-mer matches, and iv) an extension of distances to clades of the reference tree, which enables placement using a likelihood ratio test. We show that krepp matches true distances using a fraction of the time compared to alignment, extends to higher distances, and accurately places short reads coming from any part of the genome (not just marker genes) on the reference phylogeny. We demonstrate that krepp easily extends to databases with tens of thousands of reference genomes and performs well in characterizing real microbial samples.
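Ideas (i) and (ii) from the abstract can be illustrated with a toy index. The sketch below uses bit sampling, a textbook locality-sensitive hashing scheme for Hamming distance: a fixed random subset of k-mer positions serves as the hash key, so two k-mers that differ only outside those positions land in the same bucket. Each bucket stores the reference IDs ("colors") of its k-mers. This is a generic illustration and should not be read as krepp's actual data structure; the class name, parameters, and the use of a single hash table are all assumptions.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class SampledKmerIndex:
    """Toy colored k-mer index keyed by a random subset of k-mer positions."""

    def __init__(self, k, n_sampled, seed=0):
        rng = random.Random(seed)
        # Positions that form the hash key (hypothetical choice of scheme)
        self.positions = sorted(rng.sample(range(k), n_sampled))
        self.buckets = defaultdict(set)  # key -> {(kmer, reference_id)}

    def _key(self, kmer):
        return "".join(kmer[i] for i in self.positions)

    def add(self, kmer, ref_id):
        self.buckets[self._key(kmer)].add((kmer, ref_id))

    def query(self, kmer, max_h):
        """Return {reference_id: smallest Hamming distance <= max_h}."""
        hits = {}
        for ref_kmer, ref_id in self.buckets.get(self._key(kmer), ()):
            h = hamming(kmer, ref_kmer)
            if h <= max_h:
                hits[ref_id] = min(h, hits.get(ref_id, h))
        return hits
```

A single table like this misses any k-mer whose mismatches fall on a sampled position. LSH schemes typically offset this with several tables built from different position subsets, and a likelihood model can also account for the chance of a missed match.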

Availability: The tool is available on GitHub and is under active development. All results, auxiliary data, and scripts used in the analyses can be found here.