CONSULT for Contamination Removal | Ali Osman Berk Şapcı

Relying on locality-sensitive hashing, CONSULT extracts k-mers from a query set and tests whether they fall within a user-specified Hamming distance of k-mers in the reference dataset. Using this information, it can remove contamination, i.e., reads that do not belong to the reference set, or at least are predicted not to.
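
To make the distance test concrete, here is a minimal Python sketch of checking whether a query k-mer lies within a given Hamming distance of any reference k-mer. It is a brute-force illustration only, not CONSULT's implementation (which avoids comparing against every reference k-mer by using locality-sensitive hashing); the function names `hamming_distance` and `within_distance` are made up for this example.

```python
from typing import Iterable


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length k-mers differ."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def within_distance(query: str, references: Iterable[str], p: int) -> bool:
    """Return True if any reference k-mer is within Hamming distance p of the query."""
    return any(hamming_distance(query, ref) <= p for ref in references)


reference_kmers = ["ACGTACGT", "TTTTCCCC"]
print(within_distance("ACGTACGA", reference_kmers, p=3))  # True: one mismatch with ACGTACGT
print(within_distance("GGGGGGGG", reference_kmers, p=3))  # False: 6 and 8 mismatches
```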

Building CONSULT

To compile, go to the directory where the core programs for map construction and query search are located and run the commands below.

You can use make to compile CONSULT.

make all # for all components of CONSULT
# OR
make minimize # for the minimization script
make map # for consult_map to construct a library
make search # for consult_search to query

Alternatively, you can run g++ directly.

g++ minimize.cpp -std=c++11 -o minimize # for the minimization script
g++ consult_map.cpp -std=c++11 -O3 -o consult_map  # for consult_map to construct a library
g++ consult_search.cpp -std=c++11 -fopenmp -O3 -o consult_search # for consult_search to query

Constructing a CONSULT reference library

Our custom CONSULT libraries constructed using different genomic reference sets:
Alternatively our custom CONSULT libraries can be downloaded from drive:

To construct a reference library, go to the place where CONSULT is compiled and use the following command to run consult_map.

./consult_map -i /path/to/file -o /path/to/directory

Alternatively, one can use long command-line arguments:

./consult_map --input-fasta-file /path/to/file --output-library-dir /path/to/directory

Description of command-line arguments for consult_map:

-i or --input-fasta-file: input FASTA file containing the list of k-mers from which the library is constructed.
-o or --output-library-dir: output path to the directory that will store the CONSULT library.
-t or --tag-size: (optional) number of bits to be used as tag, default value is 2.
-p or --distance-threshold: (optional) Hamming distance threshold value for a match to be counted, default value is 3.
-h or --number-of-positions: (optional) number of randomly positioned bits to compute LSH.
-l or --number-of-tables: (optional) number of tables, i.e., number of hash functions.
-b or --column-per-tag: (optional) number of columns per each tag partition, i.e., number of k-mers each encoding can map to.

Default parameter values of optional arguments are determined by a heuristic using the genome size given with the -i argument.
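
How the -h and -l arguments relate to each other can be illustrated with a small bit-sampling LSH sketch in Python. This is a simplified illustration under stated assumptions, not CONSULT's actual data structure: the 2-bit encoding, the function names, and the fixed random seed are all hypothetical, and tags and columns (-t, -b) are not modeled. Each of the `l` tables uses its own hash function, which reads `h` randomly chosen bit positions of the encoded k-mer; two k-mers that differ in only a few positions are likely to share a key in at least one table.

```python
import random

# Assumed 2-bit encoding of nucleotides (illustrative only).
ENCODING = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode(kmer):
    """Turn a k-mer into a string of 2 * k bits."""
    return "".join(ENCODING[base] for base in kmer)


def make_hash_functions(num_bits, h, l, seed=0):
    """Create l hash functions, each a sorted list of h sampled bit positions."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(num_bits), h)) for _ in range(l)]


def lsh_keys(kmer, hash_functions):
    """Compute one key per table by reading the sampled bit positions."""
    bits = encode(kmer)
    return ["".join(bits[i] for i in positions) for positions in hash_functions]


def build_tables(reference_kmers, hash_functions):
    """Insert every reference k-mer into each of the l tables."""
    tables = [dict() for _ in hash_functions]
    for kmer in reference_kmers:
        for table, key in zip(tables, lsh_keys(kmer, hash_functions)):
            table.setdefault(key, []).append(kmer)
    return tables


def candidates(query, tables, hash_functions):
    """Collect reference k-mers that share a key with the query in any table."""
    found = set()
    for table, key in zip(tables, lsh_keys(query, hash_functions)):
        found.update(table.get(key, []))
    return found


k = 8
functions = make_hash_functions(num_bits=2 * k, h=6, l=4)
tables = build_tables(["ACGTACGT", "TTTTCCCC"], functions)
print(candidates("ACGTACGT", tables, functions))
```

Candidates returned this way would still need an explicit Hamming distance check against the threshold, since sharing a key does not by itself guarantee closeness.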

Searching Queries against a Reference Library

To query a set of sequences against a reference, go to the directory where executables are and execute the consult_search command:

./consult_search -i /path/to/directory -q /path/to/queries -o /path/to/directory

Alternatively, one can use long command-line arguments:

./consult_search --input-library-dir /path/to/directory --query-path /path/to/queries --output-result-dir /path/to/directory

Description of command-line arguments for consult_search:

-i or --input-library-dir: directory of the CONSULT library that will be used as the reference database.
-q or --query-path: the path to the query file, or to the directory containing query files.
-o or --output-result-dir: (optional) directory in which classified reads, unclassified reads and matched k-mer counts will be saved. Default is the current working directory.
-c or --number-of-matches: (optional) the minimum number of matched k-mers required to call a sequencing read classified. For instance, if a single k-mer match is enough to classify a read (the default setting described in the paper), -c should be set to 1. If at least two k-mer matches are required to call the entire read a match, -c should be set to 2. Default value is 1.
--thread-count: (optional) number of threads to be used, default value is 1.
--unclassified-out: write unclassified reads to a file named after the query file, prefixed with “unclassified-seq_”. This is enabled by default.
--classified-out: write classified reads to a file named after the query file, prefixed with “classified-seq_”.
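
The effect of the -c threshold can be sketched in a few lines of Python. This is an illustration only: for brevity, a k-mer counts as "matched" here when it is found exactly in a reference set, whereas CONSULT matches k-mers within a Hamming distance threshold. The helper names (`kmers`, `count_matches`, `is_classified`) are hypothetical.

```python
def kmers(read, k):
    """List all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_matches(read, reference, k):
    """Count the read's k-mers that are present in the reference set (exact match)."""
    return sum(1 for kmer in kmers(read, k) if kmer in reference)


def is_classified(read, reference, k, c=1):
    """A read is called classified when at least c of its k-mers match."""
    return count_matches(read, reference, k) >= c


reference = {"ACGT", "CGTA"}
read = "ACGTAAAA"  # its 4-mers: ACGT, CGTA, GTAA, TAAA, AAAA
print(count_matches(read, reference, k=4))       # 2
print(is_classified(read, reference, k=4, c=1))  # True
print(is_classified(read, reference, k=4, c=3))  # False
```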

Input

The files containing query sequences to be classified should be located in /path/to/queries and be in FASTQ format (one uncompressed .fq/.fastq file per sample). The path /path/to/queries can be a directory or a single .fastq file. If it is a directory, each query file in the directory will be queried against the library, and separate outputs will be generated for each. FASTA format is not supported at the moment. Quality scores are not used by CONSULT, but FASTQ labels are used to identify the sequences in the output file.
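
As an illustration of which parts of a FASTQ record matter here, the following Python sketch reads records, keeps the label and the sequence, and skips the quality line. It assumes the common four-lines-per-record layout with unwrapped sequences and stops at the first blank header line; the function name `read_fastq` is hypothetical and this is not CONSULT's own parser.

```python
import io


def read_fastq(handle):
    """Yield (label, sequence) pairs from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a blank line) ends the stream
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        handle.readline()  # quality line, ignored in this sketch
        yield header.strip()[1:], sequence  # drop the leading '@'


example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nTTTTCCCC\n+\nIIIIIIII\n"
)
for label, sequence in read_fastq(example):
    print(label, sequence)
```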

Output

CONSULT is designed for filtering out contaminants from sequencing reads. Its default output is therefore a FASTQ file that contains the unclassified reads and their corresponding sequence IDs, obtained from the input FASTQ headers. Files are stored in the directory given by --output-result-dir (or -o); the default is the directory from which the software is run. Every sample retains its original file name, prefixed with “unclassified-seq_”. CONSULT can also generate a file that contains the classified reads, in the same format as the unclassified output described above, with the output file name prefixed with “classified-seq_”. To enable this, pass the --classified-out flag.
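
The naming convention described above can be summarized with a short Python sketch, which may be handy when collecting results in a downstream script. The function `output_paths` is a hypothetical helper, not part of CONSULT; it only reproduces the prefixes stated in this section.

```python
from pathlib import PurePosixPath


def output_paths(query_file, output_dir=".", classified_out=False):
    """Return the expected output file paths for one query file."""
    name = PurePosixPath(query_file).name
    out = PurePosixPath(output_dir)
    paths = {"unclassified": out / ("unclassified-seq_" + name)}
    if classified_out:
        # Only expected when --classified-out is given.
        paths["classified"] = out / ("classified-seq_" + name)
    return paths


print(output_paths("/data/sample1.fq", "/results", classified_out=True))
```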
