Relying on locality-sensitive hashing, CONSULT-II extracts k-mers from a query set and tests whether they fall within a user-specified Hamming distance of k-mers in the reference dataset. Using this information, it can remove contamination, i.e., reads that do not belong to the reference set, or at least are predicted not to.
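The matching criterion itself can be illustrated with a brute-force sketch. This is not how CONSULT is implemented (it uses locality-sensitive hashing to avoid comparing against every reference k-mer); the function names `hamming_distance` and `has_match` are hypothetical and exist only for illustration.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length k-mers differ."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def has_match(query_kmer: str, reference_kmers, threshold: int = 3) -> bool:
    """True if any reference k-mer is within `threshold` mismatches of the query.

    Brute-force scan for illustration only; an LSH index would narrow the
    candidates before computing distances.
    """
    return any(
        hamming_distance(query_kmer, ref) <= threshold for ref in reference_kmers
    )
```

The default of 3 mirrors the documented default of the distance threshold; any other value can be passed explicitly.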
Building CONSULT
To compile, go to the directory containing the core programs for map construction and query search, and run the commands below.
- You can use make to compile CONSULT.

    make all       # for all components of CONSULT
    # OR
    make minimize  # for the minimization script
    make map       # for consult_map to construct a library
    make search    # for consult_search to query
- Alternatively, you can run g++ directly.

    g++ minimize.cpp -std=c++11 -o minimize                              # for the minimization script
    g++ consult_map.cpp -std=c++11 -O3 -o consult_map                    # for consult_map to construct a library
    g++ consult_search.cpp -std=c++11 -fopenmp -O3 -o consult_search     # for consult_search to query
Constructing a CONSULT reference library
- Our custom CONSULT libraries constructed using different genomic reference sets:
- Alternatively our custom CONSULT libraries can be downloaded from drive:
To construct a reference library, go to the directory where CONSULT was compiled and run consult_map. Both the short and the long forms of its command-line arguments, described below, are accepted.
Description of command-line arguments for consult_map:
- -i or --input-fasta-file: input .fasta file of the list of k-mers used to construct the library.
- -o or --output-library-dir: output path to the directory that will store the CONSULT library.
- -t or --tag-size: (optional) number of bits to be used as tag; default value is 2.
- -p or --distance-threshold: (optional) Hamming distance threshold for a match to be counted; default value is 3.
- -h or --number-of-postitions: (optional) number of randomly positioned bits used to compute the LSH.
- -l or --number-of-tables: (optional) number of tables, i.e., number of hash functions.
- -b or --column-per-tag: (optional) number of columns per tag partition, i.e., number of k-mers each encoding can map to.
Default parameter values of optional arguments are determined by a heuristic using the genome size given with the -i argument.
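When library construction is scripted, the arguments above can be assembled programmatically. The sketch below is a hypothetical helper, not part of CONSULT: the name `build_consult_map_command`, the `./consult_map` executable path, and the assumption that each flag is followed by its value as a separate token are all illustrative. Only short flags documented above are used.

```python
from typing import List, Optional


def build_consult_map_command(
    input_fasta: str,
    library_dir: str,
    tag_size: Optional[int] = None,
    distance_threshold: Optional[int] = None,
    executable: str = "./consult_map",  # assumed location of the compiled binary
) -> List[str]:
    """Assemble an argument list for consult_map from the documented short flags."""
    cmd = [executable, "-i", input_fasta, "-o", library_dir]
    # Optional flags are added only when given, so the tool's own defaults apply otherwise.
    if tag_size is not None:
        cmd += ["-t", str(tag_size)]
    if distance_threshold is not None:
        cmd += ["-p", str(distance_threshold)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the executable.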
Searching Queries against a Reference Library
To query a set of sequences against a reference, go to the directory containing the executables and run consult_search. Both the short and the long forms of its command-line arguments, described below, are accepted.
Description of command-line arguments for consult_search:
- -i or --input-library-dir: directory of the CONSULT library that will be used as the reference database.
- -q or --query-path: path to the query file, or to the directory containing query files.
- -o or --output-result-dir: (optional) directory in which classified reads, unclassified reads, and matched k-mer counts will be saved. Default is the current working directory.
- -c or --number-of-matches: (optional) minimum number of matched k-mers required to call a sequencing read classified. For instance, if a single k-mer match is enough to classify a read (the default setting mentioned in the paper), -c should be set to 1. If at least two k-mer matches are required to call the entire read a match, -c should be set to 2. Default value is 1.
- --thread-count: (optional) number of threads to be used; default value is 1.
- --unclassified-out: writes unclassified reads to a file named after the query file, prefixed with “unclassified-seq_”. This is enabled by default.
- --classified-out: writes classified reads to a file named after the query file, prefixed with “classified-seq_”.
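The effect of the -c threshold can be sketched as follows. The function `split_by_match_count` and its input (a mapping from read ID to the number of matched k-mers) are hypothetical; they only illustrate the rule described above, not CONSULT's internal data structures.

```python
from typing import Dict, List, Tuple


def split_by_match_count(
    match_counts: Dict[str, int], min_matches: int = 1
) -> Tuple[List[str], List[str]]:
    """Split read IDs into (classified, unclassified) by matched k-mer count.

    A read is called classified when its number of matched k-mers is at
    least `min_matches`, mirroring the documented meaning of -c.
    """
    classified = [rid for rid, n in match_counts.items() if n >= min_matches]
    unclassified = [rid for rid, n in match_counts.items() if n < min_matches]
    return classified, unclassified


counts = {"read1": 0, "read2": 1, "read3": 2}
print(split_by_match_count(counts, min_matches=1))
print(split_by_match_count(counts, min_matches=2))
```

Raising the threshold from 1 to 2 moves `read2` from the classified group to the unclassified group, which is the trade-off the -c option controls.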
Input
The files containing query sequences to be classified should be located in /path/to/queries and be in FASTQ format (one uncompressed .fq/.fastq file per sample).
The path /path/to/queries can be a directory or a single .fastq file.
If it is a directory, each query file in the directory will be queried against the library, and separate outputs will be generated for each.
FASTA format is not supported at the moment.
Quality scores are not used by CONSULT, but FASTQ labels are used to identify the sequences in the output file.
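To make the input expectations concrete, here is a minimal sketch of reading such a file. It assumes the common four-lines-per-record FASTQ layout and keeps only the label and the sequence; `read_fastq` is a hypothetical helper, not CONSULT's own parser.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (label, sequence) pairs from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header line, got: {header!r}")
        sequence = handle.readline().strip()
        handle.readline()       # '+' separator line, ignored
        handle.readline()       # quality line, ignored
        yield header[1:].strip(), sequence


example = io.StringIO("@read1 sample\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")
for label, seq in read_fastq(example):
    print(label, seq)
```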
Output
CONSULT is designed for filtering out contaminants from sequencing reads.
Its default output is therefore a FASTQ file that contains the unclassified reads and their corresponding sequence IDs, obtained from the input FASTQ headers.
Files are stored in the directory given with --output-result-dir (or -o); the default is the directory where the software is run.
Every sample retains its original file name, prefixed with “unclassified-seq_”.
CONSULT can also generate a file that contains the classified reads, in the same format as the unclassified output described above; its file name is prefixed with “classified-seq_”.
To enable this, give the --classified-out flag.
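The naming convention can be sketched as below, for example to locate the results in a downstream script. `output_paths` is a hypothetical helper that assumes the prefix is prepended to the full original file name, extension included, as the description above suggests.

```python
import os
from typing import Tuple


def output_paths(query_path: str, output_dir: str = ".") -> Tuple[str, str]:
    """Return the expected (unclassified, classified) output paths for one query file."""
    name = os.path.basename(query_path)  # original file name, extension kept
    return (
        os.path.join(output_dir, "unclassified-seq_" + name),
        os.path.join(output_dir, "classified-seq_" + name),
    )


print(output_paths("/path/to/queries/sample1.fq", "results"))
```

The second path is only relevant when the --classified-out flag is given.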