map – Simple dictionary in C++

You can use the following syntax:

#include <map>

std::map<char, char> my_map = {
    { 'A', '1' },
    { 'B', '2' },
    { 'C', '3' }
};

If you are into optimization, and assuming the input is always one of the four characters A, T, C, or G, the function below might be worth a try as a replacement for the map:

// ASCII: 'A' + 'T' == 0x95 and 'C' + 'G' == 0x8a
char map(const char in)
{ return static_cast<char>((in & 2) ? 0x8a - in : 0x95 - in); }

It works because you are dealing with two symmetric pairs. The conditional tells the A/T pair apart from the G/C pair (G and C both have the second-least-significant bit set, while A and T do not). The remaining arithmetic performs the symmetric mapping. It is based on the fact that a = (a + b) - b is true for any a and b, so subtracting the input from the sum of its pair yields the partner.
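To check the arithmetic, all four inputs can be verified at compile time. This is a minimal sketch, assuming ASCII and a C++11 compiler; complement_base is a hypothetical name, chosen so that the check does not clash with the function above.

```cpp
// Hypothetical helper using the same arithmetic as the function above.
// Assumes ASCII and that the input is one of 'A', 'T', 'C', 'G'.
constexpr char complement_base(char in)
{
    // 'A' (0x41) and 'T' (0x54): bit 1 clear, sum 0x95.
    // 'C' (0x43) and 'G' (0x47): bit 1 set,   sum 0x8a.
    return static_cast<char>((in & 2) ? 0x8a - in : 0x95 - in);
}

// Compile-time checks covering every valid input.
static_assert(complement_base('A') == 'T', "A <-> T");
static_assert(complement_base('T') == 'A', "A <-> T");
static_assert(complement_base('C') == 'G', "C <-> G");
static_assert(complement_base('G') == 'C', "C <-> G");
```

Any character outside those four produces a meaningless result, so the input has to be validated elsewhere if it cannot be trusted.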

While using a std::map or a 256-entry char table would be fine, you could save yourself a lot of agony over space by simply using an enum. If you have C++11 features, you can use enum class for strong typing:

// First, we define base pairs. Because regular enums
// pollute the global namespace, I'm using enum class.
enum class BasePair {
    A,
    T,
    C,
    G
};

// Let's cut out the nonsense and make this easy:
// A is 0, T is 1, C is 2, G is 3.
// These are indices into our table.
// Now, everything can be so much easier
BasePair Complimentary[4] = {
    BasePair::T, // Complement of A
    BasePair::A, // Complement of T
    BasePair::G, // Complement of C
    BasePair::C, // Complement of G
};

Usage becomes simple:

int main (int argc, char* argv[] ) {
    BasePair bp = BasePair::A;
    BasePair complimentbp = Complimentary[(int)bp];
}

If this is too much for you, you can define some helpers to get human-readable ASCII characters and to get the base pair complement, so you're not doing (int) casts all the time:

BasePair Compliment ( BasePair bp ) {
    return Complimentary[(int)bp]; // Move the pain here
}

// Define a conversion table somewhere in your program
char BasePairToChar[4] = { 'A', 'T', 'C', 'G' };
char ToCharacter ( BasePair bp ) {
    return BasePairToChar[ (int)bp ];
}

It's clean, it's simple, and it's efficient.
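Going the other way, from an input character to the enum, might look like the sketch below. FromCharacter is a hypothetical helper that is not part of the code above, and throwing on unexpected input is just one possible choice. The enum is repeated so the snippet stands alone.

```cpp
#include <stdexcept>

enum class BasePair { A, T, C, G };  // same order as the table indices

// Hypothetical inverse of ToCharacter: parse one ASCII letter.
// Throws std::invalid_argument for anything other than A, T, C, or G.
constexpr BasePair FromCharacter(char c)
{
    return c == 'A' ? BasePair::A
         : c == 'T' ? BasePair::T
         : c == 'C' ? BasePair::C
         : c == 'G' ? BasePair::G
         : throw std::invalid_argument("not a base pair letter");
}
```

Because the function is constexpr, known letters can be converted at compile time, while bad input at run time still raises an exception.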

Now, suddenly, you don't have a 256-byte table. You're also not storing characters (1 byte each), so if you're writing this to a file, you can write 2 bits per base pair instead of 1 byte (8 bits) per base pair. I had to work with bioinformatics files that stored data as 1 character each. The benefit is that it was human-readable. The con is that what should have been a 250 MB file ended up taking 1 GB of space. Moving, storing, and using it was a nightmare. Of course, 250 MB is being generous when accounting for even worm DNA. No human is going to read through 1 GB worth of base pairs anyhow.
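The 2-bits-per-base-pair idea can be sketched as follows. This is only an illustration, not a file format: pack4 and unpack are hypothetical helpers, and putting the first base in the lowest bits is an arbitrary assumption.

```cpp
#include <cstdint>

enum class BasePair { A, T, C, G };  // values 0..3 fit in 2 bits

// Pack four base pairs into one byte, first base in the lowest 2 bits.
constexpr std::uint8_t pack4(BasePair b0, BasePair b1, BasePair b2, BasePair b3)
{
    return static_cast<std::uint8_t>(
          static_cast<unsigned>(b0)
        | (static_cast<unsigned>(b1) << 2)
        | (static_cast<unsigned>(b2) << 4)
        | (static_cast<unsigned>(b3) << 6));
}

// Read back the base pair stored in slot 0..3 of a packed byte.
constexpr BasePair unpack(std::uint8_t byte, unsigned slot)
{
    return static_cast<BasePair>((byte >> (slot * 2)) & 3u);
}
```

Four base pairs then occupy one byte instead of four, which is where the 1 GB versus 250 MB difference comes from. A real format would also need to record the sequence length, since the last byte may be only partly filled.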
