Our paper, “Memory-bound k-mer selection for large and evolutionarily diverse reference libraries,” has been accepted to RECOMB 2024. I hope to be in Cambridge for the conference (April 29 - May 2, 2024).
The paper introduces KRANK, a tool that selects a subset of k-mers from a large set of reference genomes for taxonomic identification and profiling. The abstract of the paper is below.
Using long k-mers to find sequence matches is increasingly used in many bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which, luckily, are rapidly growing. While the increased density provides hope for dramatic improvements in accuracy, scalability is a concern. The k-mers are kept in the memory during the query time, and saving all k-mers of these ever-expanding databases is fast becoming impractical. Several strategies for subsampling k-mers have been proposed, including minimizers and finding taxon-specific k-mers. However, we contend that these strategies are inadequate, especially when reference sets are taxonomically imbalanced, as are most microbial libraries. In this paper, we specifically ask the question: Given limited memory, what is the best strategy to select a subset of k-mers from an ultra-large dataset to include in a library such that the classification of reads suffers the least? We make an attempt to make this goal more formal and show a set of experiments demonstrating the limitations of existing approaches, especially for novel and poorly sampled groups. We propose a library construction algorithm called KRANK (k-mer RANKer) that combines several components, including a hierarchical selection strategy with adaptive size restrictions and a min-max coverage strategy. We implement KRANK in highly optimized code and combine it with the locality-sensitive-hashing classifier CONSULT-II. Our method is able to reduce the memory consumption from roughly 144Gb down to 6, 12, or 24Gb, with only a 3.8%, 2.5%, or 0.5% loss in F1 score. We show in extensive analyses that KRANK outperforms alternatives in both taxonomic classification and taxonomic profiling, using reasonable memory sizes.
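To make the imbalance problem in the abstract concrete, here is a toy Python sketch. It is not KRANK's algorithm: the function names, the MD5-based ranking, the equal per-taxon quota, and the tiny sequences are all assumptions made up for illustration. It compares filling a fixed k-mer budget from one pooled list against giving every taxon the same share of the budget, which is roughly why a poorly sampled group can be starved by naive subsampling.

```python
import hashlib


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank(kmer):
    """Deterministic pseudo-random rank; a stand-in for a real ranking score."""
    return hashlib.md5(kmer.encode()).hexdigest()


def taxon_kmers(genomes_by_taxon, k):
    """Map each taxon to the distinct k-mers found in any of its genomes."""
    return {
        taxon: set().union(*(kmers(g, k) for g in genomes))
        for taxon, genomes in genomes_by_taxon.items()
    }


def pooled_selection(genomes_by_taxon, k, budget):
    """Keep the `budget` best-ranked (taxon, k-mer) pairs, ignoring taxonomy."""
    pairs = sorted(
        (rank(km), taxon, km)
        for taxon, kms in taxon_kmers(genomes_by_taxon, k).items()
        for km in kms
    )
    selected = {taxon: [] for taxon in genomes_by_taxon}
    for _, taxon, km in pairs[:budget]:
        selected[taxon].append(km)
    return selected


def quota_selection(genomes_by_taxon, k, budget):
    """Split the budget evenly across taxa, then rank within each taxon."""
    quota = budget // len(genomes_by_taxon)
    return {
        taxon: sorted(kms, key=rank)[:quota]
        for taxon, kms in taxon_kmers(genomes_by_taxon, k).items()
    }


# Hypothetical, hand-written sequences: taxon "A" is well sampled, "B" is not.
toy_library = {
    "A": [
        "ACGTACGTTAGCCGATAGGCTTAACCGGTTAAGGCCTTAG",
        "TTGACCAGTACGGATCCATGCAAGTCGGATTACCGGATAC",
        "GGCATTCGAACTGGTCAATCGGCTAAGTCCGATTGCAACG",
    ],
    "B": ["ACGTTGCAAGCT"],
}

for name, select in [("pooled", pooled_selection), ("quota", quota_selection)]:
    chosen = select(toy_library, k=4, budget=10)
    print(name, {taxon: len(kms) for taxon, kms in chosen.items()})
```

In the pooled version, the number of k-mers each taxon keeps depends on how many it contributed, so the well-sampled taxon tends to dominate. The quota version guarantees the small taxon its share. Note that this toy counts a k-mer shared by two taxa twice, which a real library would not do.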
MuDCoD has been published in Bioinformatics
CONSULT-II has been (finally) published in Bioinformatics