Template Programming in Cpp

1. Programming Languages and Type Systems

Definition: A type system is a logical system comprising a set of rules that assigns a property called a type (for example, integer, floating point, string) to every term (a word, phrase, or other set of symbols).

Type systems are built into interpreters (Python, R) or compilers (C/C++, Rust).
Type checking can happen statically (at compile time, e.g., C), dynamically (at run time, e.g., Python), or as a combination of both.

2. Dynamic types and duck typing

In dynamically typed languages, values, rather than variables, have types and type checking happens at run time. Python uses duck typing and dynamic types, so we have typed objects but untyped variable names.

Any variable can be assigned to any value:

{type_system.py 2}

var1 = "This a string!"
print("Is var1 a string?", isinstance(var1, str))
var1 = 9
print("Has var1 become an integer?", isinstance(var1, int))

Added to in section 2

Output:

Is var1 a string? True

Has var1 become an integer? True

A function can return any type:

{type_system.py 2} +=

def max(a, b):
  return a if a>b else b
var2 = 9.1
print("Is var1 a string?", isinstance(var2, float))
print("max(var1, var2) = ", max(var1, var2))
print("Can max return a float?", isinstance(max(0.2, 0.3), float))
print("Can max return a string?", isinstance(max("ABC", "XYZ"), str))

Added to in section 2

Output:

Is var2 a float? True

max(var1, var2) = 9.1

Can max return a float? True

Can max return a string? True

As long as all operations are defined on the value types, everything will work just fine:

{type_system.py 2} +=

var3 = "This is a string!"
try:
  print("max(var1, var3) =", max(var1, var3))
except TypeError as err:
  print("This is a TypeError: ", err)
var4 = "This is another string!"
print("max(var3, var4) = ", max(var3, var4))

Added to in section 2

Output:

This is a TypeError: '>' not supported between instances of 'int' and 'str'

max(var3, var4) = This is another string!

You do not need to worry about the types of your variable, the interpreter will take care.

3. Strongly typed languages and static type-checking

We do not have implicit type conversion and type checking is usually performed at compile time. Each variable has a type, and the return values of functions are also typed. Having a stricter typing system makes your code less flexible, but it is usually much more efficient both in terms of memory usage and running time.

{type_system.cpp 3}

#include <iostream>
#include <string>

{Define max1 to find the maximum value, 3}
{Define max2 to find the maximum value, 3}
{We need another function for finding the maximum string, 3}

int main(){
{Valid declaration, 3}
{Type casting, 3}
{Compare a float and an integer, 3}
{Find the maximum using max1, 3}
{Find the maximum using max2, 3}
{Comparing two strings, 3}
return 0;
}

You need to declare each variable type before using it.

{Valid declaration 3}

std::string s = "A simple string.";

If the assigned value is not suitable for casting, the compiler will give you an error.

{Invalid assignment 3}

std::string s = 9;

Some values and some types are compatible together for casting:

{Type casting 3}

int i = 2;
i = 2.1;

But if you should manage automatic casting carefully:

{Type casting 3} +=

if (i > 2.0)
  std::cout << "i is greater than 2.0!" << std::endl;
else
  std::cout << "i is not greater than 2.0!" << std::endl;

std::cout << "The value of i: " << i << std::endl;

Output:

i is not greater than 2.0!

The value of i: 2

You can apply operators/functions to any type(s) as long as it is defined:

{Compare a float and an integer 3}

float f = 2.1;
if (f > i) // > operator is defined for float and integer
  std::cout << "f is greater than i." << std::endl;

Output:

f is greater than i.

Note that there is no global < operator, there are multiple overloaded < operators and compiler picks which one to use for us.

If there is none, it would result in a compile error:

{Undefined operator for std::string and int 3}

std::string s = "A random string.";
if (f > s) // > operator is defined for std::string and integer
  std::cout << "s is greater than i." << std::endl;

Let's define a simple function which computes and returns the maximum of given two values:

{Define max1 to find the maximum value 3}

inline int const& max1 (int const& a, int const& b) {
  return a<b?b:a; // if a < b then use b else use a
}

As you can see, we defined two input arguments and the return type to be integer.

Let's use it with both integer and floating-point numbers and see what happens:

{Find the maximum using max1 3}

std::cout << "max1(2, 3) = " << max1(2, 3) << std::endl;
std::cout << "max1(2, 2.1) = " << max1(2, 2.1) << std::endl;
 // i (int) is 2 and f (float) is 2.1
std::cout << "max1(i, 1) = " << max1(i, 1) << std::endl;
std::cout << "max1(i, 2.1) = " << max1(i, 2.1) << std::endl;
std::cout << "max1(i, f) = " << max1(i, f) << std::endl;

Output:

max1(2, 3) = 3

max1(2, 2.1) = 2

max1(i, 1) = 2

max1(i, 2.1) = 2

max1(i, f) = 2

max1 fails to compute the maximum with floating point values due to automatic casting from float to integer, as expected.

Apparently, we need to define another function which we can use with floating point values:

{Define max2 to find the maximum value 3}

inline float const& max2 (float const& a, float const& b) {
  return a<b?b:a; // if a < b then use b else use a
}

Let's use it with the same set of values:

{Find the maximum using max2 3}

std::cout << "max2(2, 3) = " << max2(2, 3) << std::endl;
std::cout << "max2(2, 2.1) = " << max2(2, 3) << std::endl;
// i (int) is 2 and f (float) is 2.1
std::cout << "max2(i, 1) = " << max2(i, 1) << std::endl;
std::cout << "max2(i, 2.1) = " << max2(i, 2.1) << std::endl;
std::cout << "max2(i, f) = " << max2(i, f) << std::endl;

Output:

max2(2, 3) = 3

max2(2, 2.1) = 3

max2(i, 1) = 2

max2(i, 2.1) = 2.1

max2(i, f) = 2.1

It works just fine but, we still had to cast from integer to float just to compare two values. This might seem like a minor issue but this is just extra computation and not good for performance. If you want to avoid casting, then you have to implement four different functions to compare all possible pairs of int and float. We also have double, short int, long int, and others...

For instance, we can also compare two strings and find the lexicographical maximum:

{Comparing two strings 3}

std::string s1 =  "ABCDEFG";
std::string s2 =  "BBCDEFG";
std::string s3 = s1<s2?s1:s2; // if s1 < s2 then use s2 else use s2
std::cout << "The maximum of ABCDEFG and BBCDEFG: " << s3 << std::endl;

Output:

The maximum of ABCDEFG and BBCDEFG: ABCDEFG

So, let's implement one more function to compute the lexicographical maximum of two strings:

{We need another function for finding the maximum string 3}

inline std::string const& max3 (std::string const& a, std::string const& b) {
  return a<b?b:a; // if a < b then use b else use a
}

As you can see, this is very unpleasant and results in code redundancy. Luckily, there are ways to work around this issue.

Function Templates

4. Using templates for flexible functions

Function templates provide a functional behavior that can be called for different types. The representation looks similar to an ordinary function but some elements of the definition are left undetermined: types. These elements are parameterized.

{function_templates.cpp 4}

#include <iostream>
#include <string>

{First template function, 5}
{Template function with multiple types, 6}
{A function with a template return type, 7}

int main(){
{Using our first template function, 5}
{Be careful about the return type, 7}
{How to specify the return type, 7}
return 0;
}

5. A simple template function

The following is a function that returns the maximum of the values:

{First template function 5}

template <typename T>
inline T const& max_t1 (T const& a, T const& b) {
return a<b?b:a; // if a < b then use b else use a
}

Used in section 4

Here T stands for the type, and this function will compile and work as long as the < operator is defined for the given type.

We can have more than one parameter for type names, the overall form of the syntax looks like this:

{Template function syntax 5}

template <typename T1, typename T2, typename TN>

We can use max_t1 to compare two integers, doubles, or strings:

{Using our first template function 5}

int i = 42;
std::cout << "max_t1(7, i) = " << max_t1(7, i) << std::endl;

double f1 = 3.4;
double f2 = -6.7;
std::cout << "max_t1(f1, f2) = " << max_t1(f1, f2) << std::endl;

std::string s1 = "mathematics";
std::string s2 = "math";
std::cout << "max_t1(s1, s2) = " << max_t1(s1, s2) << std::endl;

Used in section 4

Output:

max_t1(7, i) = 42

max_t1(f1, f2) = 3.4

max_t1(s1, s2) = mathematics

Normally, templates are not compiled into single entities that can handle any type. Instead, different entities are generated from the template for every type for which the template is used. Thus, max_t1 is compiled for each of these three types.

An attempt to instantiate a template for a type that does not support all the operations used within it will result in a compile-time error.

{Failed instantiation 5}

std::complex<float> c1{0, 0}, c2{0, 0}; // does not provide operator <
max_t1(c1, c2); // ERROR at compile time

6. The need for multiple template arguments

Still, the compiler might fail to deduce the correct types when you use it ambiguously:

{Failed argument deduction 6}

max_t1(4, 7) // OK: T is int for both arguments
max_t1(4, 4.2) // ERROR: first T is int, second T is double

The second one will lead to a compile-time error.

There are three ways to handle this:

Cast arguments to the same type to match.

{Cast to the same type 6}

  max_t1(static_cast<double>(4), 4.2) // OK
  max_t1(4, static_cast<int>(4.2)) // OK, but you will lose decimal part

Explicitly state the type for the template, so that casting can be done.

{Explicitly specify 6}

  max_t1<double>(4, 4.2) // OK
  max_t1<int>(4, 4.2) // OK, but you will lose the decimal part

Specify that arguments can be of different types.

{Template function with multiple types 6}

  template <typename T1, typename T2>
  inline T1 max_t2 (T1 const& a, T2 const& b) {
  return a<b?b:a;
  }

Used in section 4

This is a better and more flexible solution. Note that, the return type is the same as the first argument's type.

7. Specifying the return type

max_t2 is OK, but the type of the first argument defines the return type.

{Be careful about the return type 7}

// return type of max_t2 will be the same with the first argument
if (max_t2(4, 4.2) != max_t2(4.2, 4))
  std::cout << "The order determines the return type." << std::endl;

Used in section 4

Output:

The order determines the return type.

You can simply introduce a third template argument, called RT, to define the return type of a function template:

{A function with a template return type 7}

template <typename RT, typename T1, typename T2>
inline RT max_t3 (T1 const& a, T2 const& b) {
return a<b?b:a;
}

Used in section 4

Note that parameter names (T1, T2, RT) are arbitrary, you can use anything.

Now, we can also specify the return type as follows:

{How to specify the return type 7}

max_t3<int,double,double>(4, 4.2); // OK, but tedious
max_t3<double>(4,4.2); // OK: return type is double
if (max_t3<double>(4, 4.2) == max_t3<double>(4.2, 4))
  std::cout << "The return type is double." << std::endl;

Used in section 4

Output:

The return type is double.

In the second call, max_t3<double> explicitly sets RT to double, but the parameters T1 and T2 are deduced to be int and double from the arguments

You may want to keep it simple: max_t1 might be sufficient most of the time.

Class Templates

8. Using parametric polymorphism for flexible data structures

Similar to functions, classes can also be parameterized with one or more types. A class template definition looks like a regular class definition, except it is prefixed by the keyword template.

We will examine class templates with an example.

Consider a class for k-mers (k \leq 32) that can provide the following:

Given a k-mer sequence as a string, it encodes it to an integer.
Given another instance of the same class, it can compute the Hamming distance.
In order to be memory efficient, our class will use 16bit encodings if 8 \leq k, 32bit encodings if 8 < k \leq 16, and 64bit encodings otherwise.

This is the overview of the class:

{class_templates.cpp 8}

#include <iostream>
#include <string>

template <typename ET>
class Tkmer {
  {Member variables, 8}

  public:
    {Class constructor, 8}

      {Encoding k-mers from sequences to integers, 9}
    
      {Computing the Hamming distance with a given k-mer, 10}

};

int main() {
    {Using our templated class for short k-mers, 10}
    {Using our templated class for long k-mers, 10}
}

ET is a type parameter and it can be any type. But all operations performed within the member functions must be supported by ET.

{Member variables 8}

ET enc; // this variable will store the k-mer encoding
uint8_t num_letter; // number of letters that we can encode using ET

We expect enc to be one of the following: uint32_t and uint64_t. Other types could result in a compile-error or unexpected behavior.

We have a simple constructor, that takes a string as the input, checks its length, and calls encode to set enc.

{Class constructor 8}

Tkmer(std::string& input_seq) {
  num_letter = (std::numeric_limits<ET>::digits / 2);

  if (input_seq.size() > num_letter)
    throw std::invalid_argument("Given k-mer is too long for this type!");

  encode(input_seq);
}

Now, we need to make member functions encode and hamming_distance work for different possible encoding types.

9. Encoding k-mers into integers

For each nucleotide, we will need 2 bits. Note that when we store billions of k-mers, this will make a huge difference, and memory usage will decrease when possible.

This function adds a different value to enc for different letters, and shifts it to the left for each letter in the sequence:

{Encoding k-mers from sequences to integers 9}

void encode(std::string& seq) {
    enc = 0;
    ET hv = 1;
    hv = hv << num_letter;
    for (unsigned int i = 0; i < seq.size(); i++) {
        enc = enc << 1;
        if (seq[i] == 'T') {
            enc += (hv + 1);
        } else if (seq[i] == 'G') {
            enc += hv;
        } else if (seq[i] == 'C') {
            enc += 1;
        } else {
            enc += 0;
        }
    }
}

Used in section 8

The resulting encoding stores the ith letter of the input sequence as two bits (stored at ith and i+\texttt{num_letter}th bits of the enc integer).

10. Computing the Hamming distance using the template argument

This function computes the Hamming distance between two k-mers:

{Computing the Hamming distance with a given k-mer 10}

unsigned int hamming_distance(const Tkmer<ET>& kmer) {
    ET z = enc ^ kmer.enc; // XOR
    return __builtin_popcountll((z | (z >> num_letter)) << num_letter); // count different positions
}

Used in section 8

Note that the type ET must be the same as input kmer's type.

An example with short k-mers with length 7:

{Using our templated class for short k-mers 10}

  std::string x_seq_short = "AATTAGT";
  std::string y_seq_short = "GCTTAGT";

    Tkmer<uint32_t> x_short(x_seq_short);
    Tkmer<uint32_t> y_short(y_seq_short);
    
    std::cout << "HD(x, y) = " << x_short.hamming_distance(y_short) << std::endl;

Used in section 8

Output:

HD(x, y) = 2

An example with log k-mers with length 31:

{Using our templated class for long k-mers 10}

  std::string x_seq_long = "TTCGCGCGTYTTAGCGCAAACGTTAAGGGCT";
  std::string y_seq_long = "TTCGCGCGTYTTAGCGCAAACGTTAAAAAAA";

    Tkmer<uint64_t> x_long(x_seq_long);
    Tkmer<uint64_t> y_long(y_seq_long);
    
    std::cout << "HD(x, y) = " << x_long.hamming_distance(y_long) << std::endl;

Used in section 8

Output:

HD(x, y) = 5

Our templated class makes sure that the Hamming distance between encodings with different types is not computed:

{This will not compile 10}

    std::cout << "HD(x, y) = " << x_long.hamming_distance(y_short) << std::endl; // ERROR: x_long uses uint64_t, y_short uses uint32_t

The compiler will fail to deduce ET: uint64_t or uint32_t?