CONSULT for Contamination Removal


Relying on locality-sensitive hashing, CONSULT-II extracts k-mers from a query set and tests whether they fall within a user-specified Hamming distance of k-mers in the reference dataset. Using this information, it can remove contamination, i.e., reads that are predicted not to belong to the reference set.
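To make the matching criterion concrete, here is a minimal Python sketch of the k-mer test described above. It is not CONSULT's implementation: it replaces locality-sensitive hashing with a brute-force comparison against every reference k-mer, and the function names (extract_kmers, hamming, count_matching_kmers) and the toy sequences are hypothetical, chosen only for illustration.

```python
def extract_kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("k-mers must have equal length")
    return sum(x != y for x, y in zip(a, b))


def count_matching_kmers(read, reference_kmers, k, max_dist):
    """Count k-mers of the read within max_dist of some reference k-mer.

    Brute force stand-in for the LSH lookup (assumption, for clarity only).
    """
    return sum(
        any(hamming(q, r) <= max_dist for r in reference_kmers)
        for q in extract_kmers(read, k)
    )


# Toy reference and reads (hypothetical data).
reference = set(extract_kmers("ACGTACGTGACC", 5))
print(count_matching_kmers("TTACGAACGTTT", reference, 5, 1))
print(count_matching_kmers("TTTTTTTTTTTT", reference, 5, 1))
```

The brute-force loop costs one comparison per query/reference k-mer pair, which is the cost that a hashing-based index is meant to avoid on real reference sets.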

Building CONSULT

To compile, go to the directory containing the core programs for map construction and query search, and run the commands below.

Constructing a CONSULT reference library

To construct a reference library, go to the directory where CONSULT was compiled and run consult_map with the following command.

./consult_map -i /path/to/file --output-library-dir /path/to/directory

Alternatively, one can use long command-line arguments:

./consult_map --input-fasta-file /path/to/file --output-library-dir /path/to/directory

Description of command-line arguments for consult_map:

Default values of the optional arguments are chosen by a heuristic based on the size of the genome given with the -i argument.

Searching Queries against a Reference Library

To query a set of sequences against a reference library, go to the directory containing the executables and run the consult_search command:

 ./consult_search -i /path/to/directory -q /path/to/queries -o /path/to/directory

Alternatively, one can use long command-line arguments:

 ./consult_search --input-library-dir /path/to/directory --query-path /path/to/queries --output-library-dir /path/to/directory
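When consult_search is called from a pipeline, it can help to assemble the argument list in one place. The Python sketch below does only that, using the long options shown above plus the --classified-out flag described in the Output section. The helper name build_search_command and its parameters are hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_search_command(library_dir, query_path, output_dir,
                         classified_out=False,
                         executable="./consult_search"):
    """Build the consult_search argument list from the documented long options."""
    cmd = [
        executable,
        "--input-library-dir", str(library_dir),
        "--query-path", str(query_path),
        "--output-library-dir", str(output_dir),
    ]
    if classified_out:
        # Optional flag that also writes the classified reads.
        cmd.append("--classified-out")
    return cmd


cmd = build_search_command("/path/to/directory", "/path/to/queries",
                           "/path/to/directory")
print(shlex.join(cmd))
```

To actually run the search, the list can be passed to subprocess.run(cmd, check=True) from the directory that holds the executable.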

Description of command-line arguments for consult_search:

Input

The files containing the query sequences to be classified should be located in /path/to/queries and be in FASTQ format (one uncompressed .fq/.fastq file per sample). The path /path/to/queries can be either a directory or a single .fastq file. If it is a directory, each query file in it is searched against the library and a separate output is generated for each. FASTA format is not supported at the moment. CONSULT does not use the quality scores, but the FASTQ labels are used to identify the sequences in the output file.
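The input rules above can be sketched in Python as follows. This is an illustration, not CONSULT's own parser: the function names are hypothetical, the directory case is assumed to select files by their .fq/.fastq suffix, and each FASTQ record is assumed to span exactly four lines. Quality strings are read and discarded, mirroring the fact that they are not used.

```python
from pathlib import Path

FASTQ_SUFFIXES = {".fq", ".fastq"}


def list_query_files(query_path):
    """Return the query files for a path that is a directory or a single file."""
    path = Path(query_path)
    if path.is_dir():
        # Assumption: query files are recognized by their suffix.
        return sorted(f for f in path.iterdir()
                      if f.is_file() and f.suffix in FASTQ_SUFFIXES)
    return [path]


def read_fastq(path):
    """Yield (label, sequence) pairs from an uncompressed FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            handle.readline()  # quality string, ignored
            if not header.startswith("@"):
                raise ValueError(f"malformed FASTQ header: {header!r}")
            yield header[1:], sequence
```

Calling list_query_files on a directory and then read_fastq on each returned file gives one stream of labeled sequences per sample, matching the one-output-per-query-file behavior.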

Output

CONSULT is designed for filtering contaminants out of sequencing reads, so its default output is a FASTQ file that contains the unclassified reads and their sequence IDs, taken from the input FASTQ headers. Output files are stored in the directory given with --output-library-dir (or -o); the default is the directory where the software is run. Each output file keeps the original file name of its sample, prefixed with “unclassified-seq_”. CONSULT can also generate a file that contains the classified reads, in the same format as the unclassified output, with the file name prefixed with “classified-seq_”. To enable this, pass the --classified-out flag.
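The naming rule can be summarized with a small Python sketch, which may be useful when a downstream script needs to locate the outputs. The helper name output_paths is hypothetical, and the sketch assumes the prefix is prepended to the full original file name, extension included.

```python
from pathlib import Path


def output_paths(query_file, output_dir=".", classified_out=False):
    """Return the expected output file paths for one query file."""
    name = Path(query_file).name  # original file name, directories stripped
    out = Path(output_dir)
    paths = [out / f"unclassified-seq_{name}"]
    if classified_out:
        # Written only when --classified-out is given.
        paths.append(out / f"classified-seq_{name}")
    return paths


for p in output_paths("/path/to/queries/sample1.fq", "/path/to/directory",
                      classified_out=True):
    print(p)
```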