Definition: A type system is a logical system comprising a set of rules that assigns a property called a type (for example, integer, floating point, string) to every term (a word, phrase, or other set of symbols).
In dynamically typed languages, values, rather than variables, have types and type checking happens at run time. Python uses duck typing and dynamic types, so we have typed objects but untyped variable names.
Any variable can be assigned to any value:
var1 = "This a string!" print("Is var1 a string?", isinstance(var1, str)) var1 = 9 print("Has var1 become an integer?", isinstance(var1, int))
Added to in section 2
Output:
Is var1 a string? True
Has var1 become an integer? True
A function can return any type:
def max(a, b): return a if a>b else b var2 = 9.1 print("Is var1 a string?", isinstance(var2, float)) print("max(var1, var2) = ", max(var1, var2)) print("Can max return a float?", isinstance(max(0.2, 0.3), float)) print("Can max return a string?", isinstance(max("ABC", "XYZ"), str))
Added to in section 2
Output:
Is var2 a float? True
max(var1, var2) = 9.1
Can max return a float? True
Can max return a string? True
As long as all operations are defined on the value types, everything will work just fine:
var3 = "This is a string!" try: print("max(var1, var3) =", max(var1, var3)) except TypeError as err: print("This is a TypeError: ", err) var4 = "This is another string!" print("max(var3, var4) = ", max(var3, var4))
Added to in section 2
Output:
This is a TypeError: '>' not supported between instances of 'int' and 'str'
max(var3, var4) = This is another string!
You do not need to worry about the types of your variable, the interpreter will take care.
We do not have implicit type conversion and type checking is usually performed at compile time. Each variable has a type, and the return values of functions are also typed. Having a stricter typing system makes your code less flexible, but it is usually much more efficient both in terms of memory usage and running time.
#include <iostream> #include <string> {Define max1 to find the maximum value, 3} {Define max2 to find the maximum value, 3} {We need another function for finding the maximum string, 3} int main(){ {Valid declaration, 3} {Type casting, 3} {Compare a float and an integer, 3} {Find the maximum using max1, 3} {Find the maximum using max2, 3} {Comparing two strings, 3} return 0; }
You need to declare each variable type before using it.
std::string s = "A simple string.";
If the assigned value is not suitable for casting, the compiler will give you an error.
std::string s = 9;
Some values and some types are compatible together for casting:
int i = 2; i = 2.1;
But if you should manage automatic casting carefully:
if (i > 2.0) std::cout << "i is greater than 2.0!" << std::endl; else std::cout << "i is not greater than 2.0!" << std::endl; std::cout << "The value of i: " << i << std::endl;
Output:
i is not greater than 2.0!
The value of i: 2
You can apply operators/functions to any type(s) as long as it is defined:
float f = 2.1; if (f > i) // > operator is defined for float and integer std::cout << "f is greater than i." << std::endl;
Output:
f is greater than i.
Note that there is no global <
operator, there are multiple overloaded <
operators and compiler picks which one to use for us.
If there is none, it would result in a compile error:
std::string s = "A random string."; if (f > s) // > operator is defined for std::string and integer std::cout << "s is greater than i." << std::endl;
Let's define a simple function which computes and returns the maximum of given two values:
inline int const& max1 (int const& a, int const& b) { return a<b?b:a; // if a < b then use b else use a }
As you can see, we defined two input arguments and the return type to be integer.
Let's use it with both integer and floating-point numbers and see what happens:
std::cout << "max1(2, 3) = " << max1(2, 3) << std::endl; std::cout << "max1(2, 2.1) = " << max1(2, 2.1) << std::endl; // i (int) is 2 and f (float) is 2.1 std::cout << "max1(i, 1) = " << max1(i, 1) << std::endl; std::cout << "max1(i, 2.1) = " << max1(i, 2.1) << std::endl; std::cout << "max1(i, f) = " << max1(i, f) << std::endl;
Output:
max1(2, 3) = 3
max1(2, 2.1) = 2
max1(i, 1) = 2
max1(i, 2.1) = 2
max1(i, f) = 2
max1
fails to compute the maximum with floating point values due to automatic casting from float to integer, as expected.
Apparently, we need to define another function which we can use with floating point values:
inline float const& max2 (float const& a, float const& b) { return a<b?b:a; // if a < b then use b else use a }
Let's use it with the same set of values:
std::cout << "max2(2, 3) = " << max2(2, 3) << std::endl; std::cout << "max2(2, 2.1) = " << max2(2, 3) << std::endl; // i (int) is 2 and f (float) is 2.1 std::cout << "max2(i, 1) = " << max2(i, 1) << std::endl; std::cout << "max2(i, 2.1) = " << max2(i, 2.1) << std::endl; std::cout << "max2(i, f) = " << max2(i, f) << std::endl;
Output:
max2(2, 3) = 3
max2(2, 2.1) = 3
max2(i, 1) = 2
max2(i, 2.1) = 2.1
max2(i, f) = 2.1
It works just fine but, we still had to cast from integer to float just to compare two values.
This might seem like a minor issue but this is just extra computation and not good for performance.
If you want to avoid casting, then you have to implement four different functions to compare all possible pairs of int
and float
. We also have double
, short int
, long int
, and others...
For instance, we can also compare two strings and find the lexicographical maximum:
std::string s1 = "ABCDEFG"; std::string s2 = "BBCDEFG"; std::string s3 = s1<s2?s1:s2; // if s1 < s2 then use s2 else use s2 std::cout << "The maximum of ABCDEFG and BBCDEFG: " << s3 << std::endl;
Output:
The maximum of ABCDEFG and BBCDEFG: ABCDEFG
So, let's implement one more function to compute the lexicographical maximum of two strings:
inline std::string const& max3 (std::string const& a, std::string const& b) { return a<b?b:a; // if a < b then use b else use a }
As you can see, this is very unpleasant and results in code redundancy. Luckily, there are ways to work around this issue.
Function templates provide a functional behavior that can be called for different types. The representation looks similar to an ordinary function but some elements of the definition are left undetermined: types. These elements are parameterized.
#include <iostream> #include <string> {First template function, 5} {Template function with multiple types, 6} {A function with a template return type, 7} int main(){ {Using our first template function, 5} {Be careful about the return type, 7} {How to specify the return type, 7} return 0; }
The following is a function that returns the maximum of the values:
template <typename T> inline T const& max_t1 (T const& a, T const& b) { return a<b?b:a; // if a < b then use b else use a }
Used in section 4
Here T
stands for the type, and this function will compile and work as long as the <
operator is defined for the given type.
We can have more than one parameter for type names, the overall form of the syntax looks like this:
template <typename T1, typename T2, typename TN>
We can use max_t1
to compare two integers, doubles, or strings:
int i = 42; std::cout << "max_t1(7, i) = " << max_t1(7, i) << std::endl; double f1 = 3.4; double f2 = -6.7; std::cout << "max_t1(f1, f2) = " << max_t1(f1, f2) << std::endl; std::string s1 = "mathematics"; std::string s2 = "math"; std::cout << "max_t1(s1, s2) = " << max_t1(s1, s2) << std::endl;
Used in section 4
Output:
max_t1(7, i) = 42
max_t1(f1, f2) = 3.4
max_t1(s1, s2) = mathematics
Normally, templates are not compiled into single entities that can handle any type.
Instead, different entities are generated from the template for every type for
which the template is used. Thus, max_t1
is compiled for each of these three
types.
An attempt to instantiate a template for a type that does not support all the operations used within it will result in a compile-time error.
std::complex<float> c1{0, 0}, c2{0, 0}; // does not provide operator < max_t1(c1, c2); // ERROR at compile time
Still, the compiler might fail to deduce the correct types when you use it ambiguously:
max_t1(4, 7) // OK: T is int for both arguments max_t1(4, 4.2) // ERROR: first T is int, second T is double
The second one will lead to a compile-time error.
There are three ways to handle this:
max_t1(static_cast<double>(4), 4.2) // OK max_t1(4, static_cast<int>(4.2)) // OK, but you will lose decimal part
max_t1<double>(4, 4.2) // OK max_t1<int>(4, 4.2) // OK, but you will lose the decimal part
template <typename T1, typename T2> inline T1 max_t2 (T1 const& a, T2 const& b) { return a<b?b:a; }
Used in section 4
This is a better and more flexible solution. Note that, the return type is the same as the first argument's type.
max_t2
is OK, but the type of the first argument defines the return type.
// return type of max_t2 will be the same with the first argument if (max_t2(4, 4.2) != max_t2(4.2, 4)) std::cout << "The order determines the return type." << std::endl;
Used in section 4
Output:
The order determines the return type.
You can simply introduce a third template argument, called RT
, to define the return type of a function template:
template <typename RT, typename T1, typename T2> inline RT max_t3 (T1 const& a, T2 const& b) { return a<b?b:a; }
Used in section 4
Note that parameter names (T1
, T2
, RT
) are arbitrary, you can use anything.
Now, we can also specify the return type as follows:
max_t3<int,double,double>(4, 4.2); // OK, but tedious max_t3<double>(4,4.2); // OK: return type is double if (max_t3<double>(4, 4.2) == max_t3<double>(4.2, 4)) std::cout << "The return type is double." << std::endl;
Used in section 4
Output:
The return type is double.
In the second call, max_t3<double>
explicitly sets RT
to double, but the
parameters T1
and T2
are deduced to be int
and double
from the arguments
You may want to keep it simple: max_t1
might be sufficient most of the time.
Similar to functions, classes can also be parameterized with one or more types. A class template definition looks like a regular class definition, except it is prefixed by the keyword template.
We will examine class templates with an example.
Consider a class for k-mers (k \leq 32) that can provide the following:
Given a k-mer sequence as a string, it encodes it to an integer.
Given another instance of the same class, it can compute the Hamming distance.
In order to be memory efficient, our class will use 16bit encodings if 8 \leq k, 32bit encodings if 8 < k \leq 16, and 64bit encodings otherwise.
This is the overview of the class:
#include <iostream> #include <string> template <typename ET> class Tkmer { {Member variables, 8} public: {Class constructor, 8} {Encoding k-mers from sequences to integers, 9} {Computing the Hamming distance with a given k-mer, 10} }; int main() { {Using our templated class for short k-mers, 10} {Using our templated class for long k-mers, 10} }
ET
is a type parameter and it can be any type.
But all operations performed within the member functions must be supported by ET
.
ET enc; // this variable will store the k-mer encoding uint8_t num_letter; // number of letters that we can encode using ET
We expect enc
to be one of the following: uint32_t
and uint64_t
.
Other types could result in a compile-error or unexpected behavior.
We have a simple constructor, that takes a string as the input, checks its length, and calls encode
to set enc
.
Tkmer(std::string& input_seq) { num_letter = (std::numeric_limits<ET>::digits / 2); if (input_seq.size() > num_letter) throw std::invalid_argument("Given k-mer is too long for this type!"); encode(input_seq); }
Now, we need to make member functions encode
and hamming_distance
work for different possible encoding types.
For each nucleotide, we will need 2 bits. Note that when we store billions of k-mers, this will make a huge difference, and memory usage will decrease when possible.
This function adds a different value to enc
for different letters, and shifts it to the left for each letter in the sequence:
void encode(std::string& seq) { enc = 0; ET hv = 1; hv = hv << num_letter; for (unsigned int i = 0; i < seq.size(); i++) { enc = enc << 1; if (seq[i] == 'T') { enc += (hv + 1); } else if (seq[i] == 'G') { enc += hv; } else if (seq[i] == 'C') { enc += 1; } else { enc += 0; } } }
Used in section 8
The resulting encoding stores the i
th letter of the input sequence as two bits (stored at i
th and i+\texttt{num_letter}
th bits of the enc
integer).
This function computes the Hamming distance between two k-mers:
unsigned int hamming_distance(const Tkmer<ET>& kmer) { ET z = enc ^ kmer.enc; // XOR return __builtin_popcountll((z | (z >> num_letter)) << num_letter); // count different positions }
Used in section 8
Note that the type ET
must be the same as input kmer
's type.
An example with short k-mers with length 7:
std::string x_seq_short = "AATTAGT"; std::string y_seq_short = "GCTTAGT"; Tkmer<uint32_t> x_short(x_seq_short); Tkmer<uint32_t> y_short(y_seq_short); std::cout << "HD(x, y) = " << x_short.hamming_distance(y_short) << std::endl;
Used in section 8
Output:
HD(x, y) = 2
An example with log k-mers with length 31:
std::string x_seq_long = "TTCGCGCGTYTTAGCGCAAACGTTAAGGGCT"; std::string y_seq_long = "TTCGCGCGTYTTAGCGCAAACGTTAAAAAAA"; Tkmer<uint64_t> x_long(x_seq_long); Tkmer<uint64_t> y_long(y_seq_long); std::cout << "HD(x, y) = " << x_long.hamming_distance(y_long) << std::endl;
Used in section 8
Output:
HD(x, y) = 5
Our templated class makes sure that the Hamming distance between encodings with different types is not computed:
std::cout << "HD(x, y) = " << x_long.hamming_distance(y_short) << std::endl; // ERROR: x_long uses uint64_t, y_short uses uint32_t
The compiler will fail to deduce ET
: uint64_t
or uint32_t
?