We are starting from scratch by creating a directory for the project:
|
|
The very first step is to download all the mitochondrion genomes available on RefSeq as of 2024/09/13, we also retrieve the corresponding catalog to later map each genome to a taxonomic group.
|
|
Next step is to simply decompress everything we downloaded:
|
|
Convert multi-line FASTA to 2-line FASTA for easier processing:
|
|
Then list all sequence identifiers to another file:
|
|
This part is somewhat idiosyncratic because of the input KRANK requires: a mapping between taxonomic IDs and files. So we have to split concatenated mitochondrion genomes into separate files:
|
|
Now, we can map each mitochondrion sequence to a taxonomic group which will be found in RefSeq-release226.catalog
.
The file input_map-mitochondrion.tsv
is just a mapping between taxonomic IDs and mitochondrion sequences we found in RefSeq which we have already split into separate files.
This file is going to be the input to KRANK.
|
|
We are almost there; KRANK also needs the corresponding taxonomy tree to relate different groups in their hierarchy. We download the most relevant NCBI taxonomy that we could find based on the release date of the catalog and the taxonomy:
|
|
There is only one missing/conflicting entry in this taxonomy and it is somewhat irrelevant to our species of interest so we just exclude it:
|
|
Clone KRANK from its GitHub repository and compile it:
|
|
After that, we can directly start by initializing a library which we will build using all mitochondrion genomic data, this might take a while reading all the data and extracting k-mers:
|
|
Here, mitochondrion_lib
is the name of the directory which we will save the library in. For all other parameters you can refer to the tutorial in the GitHub repository.
Now we can build the LSH tables, this will take quite a while so make sure to have a stable connecting (or use screen
/tmux
etc.) if you are on a remote server. The resource usage is also going to be high so using a machine with high-memory would help:
|
|
If you fail to run this command due to insufficient memory; increase -s
to 6 or 7 and don’t forget to initialize again!
We can finally query our baits:
|
|
|
|
Resulting file will associate each candidate sequence with a taxonomic group, choose higher ranks for increased sensitivity and lower ranks for better specificity.