Mapping mitochondrial baits to taxonomic groups using KRANK


We are starting from scratch by creating a directory for the project:

1
2
mkdir BaitsProject-ONR
cd BaitsProject-ONR

The very first step is to download all the mitochondrion genomes available on RefSeq as of 2024/09/13, we also retrieve the corresponding catalog to later map each genome to a taxonomic group.

1
2
3
# 2024-09-13
wget https://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/RefSeq-release226.catalog.gz
wget https://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/mitochondrion.1.1.genomic.fna.gz

Next step is to simply decompress everything we downloaded:

1
2
gunzip -d mitochondrion.1.1.genomic.fna.gz
gunzip -d RefSeq-release226.catalog.gz

Convert multi-line FASTA to 2-line FASTA for easier processing:

1
2
seqtk seq -l0 mitochondrion.1.1.genomic.fna > /tmp/mitochondrion_l0.1.1.genomic.fna && \
  mv /tmp/mitochondrion_l0.1.1.genomic.fna mitochondrion.1.1.genomic.fna

Then list all sequence identifiers to another file:

1
 grep "^>" mitochondrion.1.1.genomic.fna | sed 's/^>//' > seqids_mitochondrion.txt

This part is somewhat idiosyncratic because of the input KRANK requires: a mapping between taxonomic IDs and files. So we have to split concatenated mitochondrion genomes into separate files:

1
2
3
4
5
6
mkdir mitochondrion_genomic_peracc
cd mitochondrion_genomic_peracc
cat ../mitochondrion.1.1.genomic.fna \
  | sed 's/^>/>>/' \
  | awk 'BEGIN{RS=">>";FS="\n"} NR>1{split($0,acc," "); fnme=acc[1]".fasta"; print ">" $0 > fnme; close(fnme);}'
cd ../

Now, we can map each mitochondrion sequence to a taxonomic group which will be found in RefSeq-release226.catalog. The file input_map-mitochondrion.tsv is just a mapping between taxonomic IDs and mitochondrion sequences we found in RefSeq which we have already split into separate files. This file is going to be the input to KRANK.

1
2
3
cut -f1 -d' ' seqids_mitochondrion.txt > acc_mitochondrion.txt
grep -f acc_mitochondrion.txt RefSeq-release226.catalog > metadata_mitochondrion.txt
cut -f1,3 metadata_mitochondrion.txt | awk '{print $1"\t./mitochondrion_genomic_peracc/"$2".fasta";}' > input_map-mitochondrion.tsv

We are almost there; KRANK also needs the corresponding taxonomy tree to relate different groups in their hierarchy. We download the most relevant NCBI taxonomy that we could find based on the release date of the catalog and the taxonomy:

1
2
3
4
wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2024-09-01.zip
mkdir taxdump && mv taxdmp_2024-09-01.zip taxdump
cd taxdump && unzip taxdmp_2024-09-01.zip&& rm taxdmp_2024-09-01.zip
cd ../

There is only one missing/conflicting entry in this taxonomy and it is somewhat irrelevant to our species of interest so we just exclude it:

1
2
# Exclude Erynnis montana (3342064), it is the only conflicting tax ID w/ the taxonomy we are using
grep -v -w 3342064 input_map-mitochondrion.tsv > .input_map.tsv && mv .input_map.tsv input_map-mitochondrion.tsv

Clone KRANK from its GitHub repository and compile it:

1
2
3
4
git clone https://github.com/bo1929/KRANK.git
cd KRANK
make
cd ..

After that, we can directly start by initializing a library which we will build using all mitochondrion genomic data, this might take a while reading all the data and extracting k-mers:

1
2
3
4
5
6
7
KRANK/krank --seed 0 build \
  -l mitochondrion_lib -t ./taxdump/ \
  -i ./input_map-mitochondrion.tsv  \
  -k 29 -w 30 -h 13 -b 32 -s 5 \
  --from-scratch --input-sequences \
  --kmer-ranking random --free-size --lca soft \
  --num-threads 16

Here, mitochondrion_lib is the name of the directory which we will save the library in. For all other parameters you can refer to the tutorial in the GitHub repository.

Now we can build the LSH tables, this will take quite a while so make sure to have a stable connecting (or use screen/tmux etc.) if you are on a remote server. The resource usage is also going to be high so using a machine with high-memory would help:

1
2
3
4
5
6
7
KRANK/krank --seed 0 build \
  -l mitochondrion_lib -t ./taxdump/ \
  -i ./input_map-mitochondrion.tsv  \
  -k 29 -w 30 -h 13 -b 32 -s 5 \
  --target-batch 0 --fast-mode --from-library --input-sequences \
  --kmer-ranking random --free-size --lca soft \
  --num-threads 16

If you fail to run this command due to insufficient memory; increase -s to 6 or 7 and don’t forget to initialize again!

We can finally query our baits:

1
KRANK/krank query -l mitochondrion_lib -q N_elephant_seal/angustirostris_mitogenome_chunks.fasta -o N_elephant_seal
1
KRANK/krank query -l mitochondrion_lib -q Humpback_whale/novaeangliae_mitogenome_chunks.fasta -o Humpback_whale/

Resulting file will associate each candidate sequence with a taxonomic group, choose higher ranks for increased sensitivity and lower ranks for better specificity.