We are starting from scratch by creating a directory for the project:
1mkdir BaitsProject-ONR
2cd BaitsProject-ONR
The very first step is to download all the mitochondrion genomes available on RefSeq as of 2024/09/13, we also retrieve the corresponding catalog to later map each genome to a taxonomic group.
1# 2024-09-13
2wget https://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/RefSeq-release226.catalog.gz
3wget https://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/mitochondrion.1.1.genomic.fna.gz
Next step is to simply decompress everything we downloaded:
1gunzip -d mitochondrion.1.1.genomic.fna.gz
2gunzip -d RefSeq-release226.catalog.gz
Convert multi-line FASTA to 2-line FASTA for easier processing:
1seqtk seq -l0 mitochondrion.1.1.genomic.fna > /tmp/mitochondrion_l0.1.1.genomic.fna && \
2  mv /tmp/mitochondrion_l0.1.1.genomic.fna mitochondrion.1.1.genomic.fna
Then list all sequence identifiers to another file:
1 grep "^>" mitochondrion.1.1.genomic.fna | sed 's/^>//' > seqids_mitochondrion.txt
This part is somewhat idiosyncratic because of the input KRANK requires: a mapping between taxonomic IDs and files. So we have to split concatenated mitochondrion genomes into separate files:
1mkdir mitochondrion_genomic_peracc
2cd mitochondrion_genomic_peracc
3cat ../mitochondrion.1.1.genomic.fna \
4  | sed 's/^>/>>/' \
5  | awk 'BEGIN{RS=">>";FS="\n"} NR>1{split($0,acc," "); fnme=acc[1]".fasta"; print ">" $0 > fnme; close(fnme);}'
6cd ../
Now, we can map each mitochondrion sequence to a taxonomic group which will be found in RefSeq-release226.catalog.
The file input_map-mitochondrion.tsv is just a mapping between taxonomic IDs and mitochondrion sequences we found in RefSeq which we have already split into separate files.
This file is going to be the input to KRANK.
1cut -f1 -d' ' seqids_mitochondrion.txt > acc_mitochondrion.txt
2grep -f acc_mitochondrion.txt RefSeq-release226.catalog > metadata_mitochondrion.txt
3cut -f1,3 metadata_mitochondrion.txt | awk '{print $1"\t./mitochondrion_genomic_peracc/"$2".fasta";}' > input_map-mitochondrion.tsv
We are almost there; KRANK also needs the corresponding taxonomy tree to relate different groups in their hierarchy. We download the most relevant NCBI taxonomy that we could find based on the release date of the catalog and the taxonomy:
1wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2024-09-01.zip
2mkdir taxdump && mv taxdmp_2024-09-01.zip taxdump
3cd taxdump && unzip taxdmp_2024-09-01.zip&& rm taxdmp_2024-09-01.zip
4cd ../
There is only one missing/conflicting entry in this taxonomy and it is somewhat irrelevant to our species of interest so we just exclude it:
1# Exclude Erynnis montana (3342064), it is the only conflicting tax ID w/ the taxonomy we are using
2grep -v -w 3342064 input_map-mitochondrion.tsv > .input_map.tsv && mv .input_map.tsv input_map-mitochondrion.tsv
Clone KRANK from its GitHub repository and compile it:
1git clone https://github.com/bo1929/KRANK.git
2cd KRANK
3make
4cd ..
After that, we can directly start by initializing a library which we will build using all mitochondrion genomic data, this might take a while reading all the data and extracting k-mers:
1KRANK/krank --seed 0 build \
2  -l mitochondrion_lib -t ./taxdump/ \
3  -i ./input_map-mitochondrion.tsv  \
4  -k 29 -w 30 -h 13 -b 32 -s 5 \
5  --from-scratch --input-sequences \
6  --kmer-ranking random --free-size --lca soft \
7  --num-threads 16
Here, mitochondrion_lib is the name of the directory which we will save the library in. For all other parameters you can refer to the tutorial in the GitHub repository.
Now we can build the LSH tables, this will take quite a while so make sure to have a stable connecting (or use screen/tmux etc.) if you are on a remote server. The resource usage is also going to be high so using a machine with high-memory would help:
1KRANK/krank --seed 0 build \
2  -l mitochondrion_lib -t ./taxdump/ \
3  -i ./input_map-mitochondrion.tsv  \
4  -k 29 -w 30 -h 13 -b 32 -s 5 \
5  --target-batch 0 --fast-mode --from-library --input-sequences \
6  --kmer-ranking random --free-size --lca soft \
7  --num-threads 16
If you fail to run this command due to insufficient memory; increase -s to 6 or 7 and don’t forget to initialize again!
We can finally query our baits:
1KRANK/krank query -l mitochondrion_lib -q N_elephant_seal/angustirostris_mitogenome_chunks.fasta -o N_elephant_seal
1KRANK/krank query -l mitochondrion_lib -q Humpback_whale/novaeangliae_mitogenome_chunks.fasta -o Humpback_whale/
Resulting file will associate each candidate sequence with a taxonomic group, choose higher ranks for increased sensitivity and lower ranks for better specificity.
basty for Behavioral Analysis of Sleep in *Drosophila melanogaster*
Binning mitochondrial reads from skimming data using krepp