Binning mitochondrial reads from skimming data using krepp


Installing krepp

We start by installing krepp, the easiest way is through bioconda.

1conda install bioconda::krepp
2krepp --help

To perform any kind of query, you need to either build an index from reference genomes or download a public krepp index from one of our servers. It is quite straightforward to download an index and get started (jump to Downloading an existing index). Alternatively, if you have your own reference mitogenomes, you can index them from scratch (see Indexing mitogenomes). Once you do either of these, the next step is querying your skimming data (e.g., a FASTA or FASTQ file), as we will discuss in detail.

Downloading an existing index

Download a relatively up-to-date mitogenome index by running the below wget command (or curl if wget is not available). Then, untar and decompress it.

1wget --no-check-certificate https://ter-trees.ucsd.edu/data/krepp/index_mitochondrion-RefSeq20240913.tar.gz
2tar -xzvf index_mitochondrion-RefSeq20240913.tar.gz

The resulting directory index_mitochondrion-RefSeq20240913 stores the index that we can perform queries against. You can jump to the next step.

Indexing mitogenomes

For example, we will use a recent RefSeq release of all available mitochondrial genomes. The very first step is to download all the mitogenomes available on RefSeq as of 2024/09/13, and decompress the downloaded file.

1# 2024-09-13
2wget https://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/mitochondrion.1.1.genomic.fna.gz
3gunzip -d mitochondrion.1.1.genomic.fna.gz

Convert multi-line FASTA to 2-line FASTA for easier processing.

1seqtk seq -l0 mitochondrion.1.1.genomic.fna > /tmp/mitochondrion_l0.1.1.genomic.fna && mv /tmp/mitochondrion_l0.1.1.genomic.fna mitochondrion.1.1.genomic.fna

Since krepp requires one reference sequence per file, we have to split concatenated mitogenomes into separate files:

1mkdir mitochondrion_genomic_peracc
2cd mitochondrion_genomic_peracc
3cat ../mitochondrion.1.1.genomic.fna | sed 's/^>/>>/' | awk 'BEGIN{RS=">>";FS="\n"} NR>1{split($0,acc," "); fnme=acc[1]".fasta"; print ">" $0 > fnme; close(fnme);}'
4cd ../

Then, list all sequence identifiers in another file and create a mapping between accession IDs and each file corresponding to a reference sequence.

1grep "^>" mitochondrion.1.1.genomic.fna | sed 's/^>//' > seqids_mitochondrion.txt
2cut -f1 -d' ' seqids_mitochondrion.txt | awk '{print $1"\t./mitochondrion_genomic_peracc/"$1".fasta";}' > input_map-mitochondrion.tsv

This file (input_map-mitochondrion.tsv) is going to be used as the input to krepp.

We are almost there. Now that we have a mapping, we can index mitogenomes by simply running the following krepp index command.

1seq 0 7 | xargs -I{} bash -c "krepp index -i input_map-mitochondrion.tsv -o index_mitochondrion-RefSeq20240913 -k 29 -w 32 -h 13 -m 8 -r {} --no-frac --sdust-t 0 --num-threads 24"

Here, index_mitochondrion-RefSeq20240913 is the name of the directory where we store the index. For all other parameters, you can refer to the tutorial in the GitHub repository.

Querying your skimming data

We will use the dist subcommand of krepp via krepp dist. Suppose you have a FASTQ file named xyz.fq and an index in a directory index_mitochondrion-RefSeq20240913. To compute the distance between each read in xyz.fq and all sufficiently close mitogenomes indexed, run the command below.

1# Compute distances and filter any mapping above 15% sequence similarity
2krepp dist -i index_mitochondrion-RefSeq20240913 -q xyz.fq --dist-max 0.15 --filter

This will output estimated distances to stdout in a tab-separated format. You can save this to a file with the -o option (e.g., add -o xyz-distances.tsv to the command above). The first 10 lines would look something like this.

1# Print distances to stdout and get the first 10 lines
2krepp dist -i index_mitochondrion-RefSeq20240913 -q xyz.fq --dist-max 0.15 --filter | head

Output:

1#software: krepp  #version: v0.5.1  #invocation :krepp dist -i index_mitochondrion-RefSeq20240913 -q xyz.fq --dist-max 0.15 --filter
2SEQ_ID	REFERENCE_NAME	DIST
3AG-359-G18_1-1273	NaN	NaN
4AG-359-G18_1-1270	NC_065012.1 Amynthas deogyusanensis mitochondrion	2.00371e-05
5AG-359-G18_1-1272	NaN	NaN
6AG-359-G18_1-1269	NC_065012.1 Amynthas deogyusanensis mitochondrion	0.0137122
7AG-359-G18_1-1268	NC_065012.1 Amynthas deogyusanensis mitochondrion	0.00638889
8AG-359-G18_1-1267	NC_065012.1 Amynthas deogyusanensis mitochondrion	0.0163656
9AG-359-G18_1-1271	NaN	NaN

The first column is for the read ID, the second column is for the accession ID of the references the corresponding read maps to, and the third column is for the distances between the reads and the mapped references. Notice that we used a threshold of 15% sequence similarity (--dist-max 0.15). This means that any read without a hit below 15% similarity will have NaN in the second and third columns. These are most likely non-mitochondrial reads (e.g., bacterial). You can simply filter such reads using their IDs and move forward with the rest. You may want to change the threshold value (e.g., --dist-max 0.1 for a more aggressive filtering or --dist-max 0.2 for a more conservative filtering), but 15% should do a decent job for this purpose.

To test these commands, I created a toy example of a skimming file consisting of 500 mitochondrial reads (generated from a randomly selected set of mitogenomes) and 500 bacterial reads. When I run the below command, it outputs 500 mitochondrial read IDs, meaning that we filter all bacterial reads and retain all mitochondrial reads, which is the desired output.

1krepp dist -i index_mitochondrion-RefSeq20240913 -q xyz.fq --dist-max 0.15 --filter | grep -v NaN | cut -f1 | uniq

Good luck with your data!