
A Simple Bioinformatics Generic C++ Toolkit

0.0.1

"In our struggle to develop quality software, if we must, let our code take hours to compile but seconds to run, for we have gained nothing if it takes seconds to compile and hours to run!"

-- General Template rallying his generic programming army before the battle of namespace bioinfo { }.

Introduction

This first phase of library development includes basic data types useful in a wide range of bioinformatics applications. That means no (even semi-) sophisticated algorithms. The criterion for inclusion in this first phase is that a tool should rarely need to be written more than once. These are the things that one feels silly about having to write, the kind where you say to yourself "surely someone has done this exact same simple thing already". For example, reading and writing FASTA files and simple string handling will be included; statistical analysis algorithms such as GC % counting will not. The focus of this library should be on doing the simple things really well. Once the dust settles on this phase, additional components may implement more sophisticated algorithms on top of these well-written building blocks.

Software Design Principles

Three basic principles should dictate the design and implementation of the library.

1) Employ software design principles that are well established and widely used. Take a generic programming approach.

In this regard, look to the Standard Template Library design principles and employ design patterns where appropriate. The library should work with the STL wherever possible.

2) Minimal commitment principle (modularity). If I want to use a particular object, doing so should require minimal effort. Objects should be kept reasonably modular and separate, so that including one object does not require including 100 other objects (or, if it does, there is a very good reason for it). This means using templates that let me use standard objects like char or string wherever possible (as an option), so that I can always do the "simple" things with minimal programming overhead. The idea is that if a programmer finds just one thing they like in the library, they should only need to reference that part of the library, without being forced to include a lot of code they don't need or want.

3) Efficiency. There are as many ways to write inefficient C++ code as there are ways to write efficient code: know the differences and understand the trade-offs. This is a big challenge and will likely require serious attention.

One challenge to address in this library is to develop various bit-compressed representations of strings that are compatible with the Standard Template Library: a string that behaves like a "basic_string" (or std::string) but is implemented in ways that take the bioinformatics string domain into account.

This means that in cases where we know there are only 4 bases, each character can take up as little as 2 (or 3) bits, and 4 bases can be compared at the same time. Some cases may require 16 codes, which is still only 4 bits per character, and amino acids should fit in 5 bits. More ambitious would be a basic_string that uses suffix trees. The whole point is to maintain the standard string interface so that client code doesn't have to change; it shouldn't matter what the implementation is. If we put in the work to develop specialized string implementations that let future programs run faster without any extra work from the programmer, then great. Everything should still work with basic_string, to conform with principle 2.
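To make the 2-bit idea concrete, here is a minimal sketch of a packed DNA container. The names packed_dna, encode_base and decode_base are hypothetical and not part of the library, and the class does not attempt the full basic_string interface; it only shows four bases sharing one byte and an equality test that compares up to four bases per byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: map a base to its 2-bit code (A=0, C=1, G=2, T=3).
inline std::uint8_t encode_base(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: throw std::invalid_argument("not one of A, C, G, T");
    }
}

// Hypothetical helper: map a 2-bit code back to an upper-case base.
inline char decode_base(std::uint8_t code) {
    static const char bases[] = {'A', 'C', 'G', 'T'};
    assert(code < 4);
    return bases[code];
}

// Hypothetical class (an assumption for illustration): four bases per byte.
class packed_dna {
public:
    packed_dna() : size_(0) {}

    explicit packed_dna(const std::string& s) : size_(0) {
        for (std::string::size_type i = 0; i < s.size(); ++i) push_back(s[i]);
    }

    void push_back(char c) {
        const std::uint8_t code = encode_base(c);  // may throw; state untouched
        if (size_ % 4 == 0) bytes_.push_back(0);   // start a fresh byte
        bytes_.back() = static_cast<std::uint8_t>(
            bytes_.back() | (code << (2 * (size_ % 4))));
        ++size_;
    }

    char operator[](std::size_t i) const {
        assert(i < size_);
        return decode_base(
            static_cast<std::uint8_t>((bytes_[i / 4] >> (2 * (i % 4))) & 3u));
    }

    std::size_t size() const { return size_; }
    std::size_t byte_count() const { return bytes_.size(); }

    std::string str() const {
        std::string out;
        out.reserve(size_);
        for (std::size_t i = 0; i < size_; ++i) out.push_back((*this)[i]);
        return out;
    }

    // Unused high bits of the last byte are always zero, so equal sizes plus
    // equal bytes means equal sequences. Each byte compared covers up to
    // four bases at once.
    friend bool operator==(const packed_dna& a, const packed_dna& b) {
        return a.size_ == b.size_ && a.bytes_ == b.bytes_;
    }

private:
    std::vector<std::uint8_t> bytes_;
    std::size_t size_;
};
```

With this layout, packed_dna("ACGTAC") occupies two bytes instead of six. A real implementation would still need iterators, a traits class and the rest of the string interface before it could stand in for basic_string.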

The bottom line is that enough bioinformatics problems share a similar underlying structure to merit different string implementations. k-mers, for example, will be the first domain to implement.
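As a rough illustration of why k-mers suit a specialized representation, the sketch below packs each k-mer (k up to 32) into one 64-bit integer using a rolling update. The function names base_code, packed_kmers and unpack_kmer are hypothetical, and skipping windows that contain a non-ACGT character is an assumption made here, not a stated design decision of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns the 2-bit code of a base, or -1 for anything else (e.g. 'N').
inline int base_code(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Hypothetical function: pack every k-mer of seq into a 64-bit integer,
// first base in the most significant position. Windows containing a
// non-ACGT character are skipped (an assumption of this sketch).
inline std::vector<std::uint64_t> packed_kmers(const std::string& seq, unsigned k) {
    assert(k >= 1 && k <= 32);
    const std::uint64_t mask =
        (k == 32) ? ~std::uint64_t(0) : ((std::uint64_t(1) << (2 * k)) - 1);
    std::vector<std::uint64_t> out;
    std::uint64_t word = 0;
    std::size_t valid = 0;  // consecutive valid bases ending at position i
    for (std::size_t i = 0; i < seq.size(); ++i) {
        const int code = base_code(seq[i]);
        if (code < 0) { valid = 0; word = 0; continue; }
        // Shift in the new base and drop the base that left the window.
        word = ((word << 2) | static_cast<std::uint64_t>(code)) & mask;
        if (++valid >= k) out.push_back(word);
    }
    return out;
}

// Hypothetical inverse: turn a packed k-mer back into text.
inline std::string unpack_kmer(std::uint64_t word, unsigned k) {
    std::string s(k, 'A');
    for (unsigned i = 0; i < k; ++i) {
        s[k - 1 - i] = "ACGT"[word & 3u];
        word >>= 2;
    }
    return s;
}
```

Because each k-mer becomes a plain integer, comparison, hashing and sorting work with ordinary integer operations, and the values can go straight into standard containers.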

This n-teen-th effort is encapsulated in a namespace called "bioinfo" to denote "Bioinformatic C++ Toolkit". I think I'll do better than my (n-1)-teen-th try - I'm an optimist!

References

Exceptional C++ (Sutter)
More Exceptional C++ (Sutter)
Effective C++ (Meyers)
More Effective C++ (Meyers)
Efficient C++ (...)
The C++ Programming Language (Stroustrup)
Design Patterns (Gamma et al.)
Refactoring (Fowler)

Other Bioinformatic C++ libraries to look at (from bioinformatics.org):

ALiBio: http://bioinformatics.org/ALiBio/project.html Appears to dive right into complex problem domains and requires boost/graph. Basically a slightly different set of goals. I want to start off very simple, handling just reading and writing FASTA and string representation.

BTL (Bioinformatics Template Library): Another suspect library development attempt. First of all, what's the justification for defining your own graph class over the boost/graph classes? Again, it doesn't emphasize the simple things I'm interested in. It doesn't appear to be under active development.

libGenome: http://www.libgenome.org/documentation.html This looks like something similar; should I just use this? I get the feeling that this library does more than just focus on doing a few basic things really well. It appears to be reasonably portable, and uses configure.

NCBI C++ toolkit: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/libs.html Definitely important to look at. I still think we can do better.

Links:
http://www.boost.org/more/generic_programming.html
http://www.cs.utexas.edu/users/lavender/courses/c%2B%2B/cd2
http://www.research.att.com/~bs/C++.html

Updates

Wow, I think I'm getting basic_string to work with bitset. And don't forget crope, anyone.

06/06/03 For now I'm separating the library into separate directories based on logical subunits, in accordance with principle 2. These directories should reflect typical cases where you would conceivably want to include only one of them in a project.

The directory "string" holds the string utilities I'm working on, which define different instantiations of basic_string.

The directory "fasta" is totally separate, and if you just want to read and write FASTA files you shouldn't need to include any additional crap, err, code.
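As a sketch of how small a standalone FASTA component could be, the code below reads and writes records using only the standard library. The names fasta_record, read_fasta and write_fasta are hypothetical, as is the default line width of 60; none of this is the library's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type: header line (without '>') plus the sequence.
struct fasta_record {
    std::string header;
    std::string sequence;
};

// Read all records from a stream. Sequence lines are concatenated;
// blank lines and a trailing '\r' (Windows line endings) are ignored.
inline std::vector<fasta_record> read_fasta(std::istream& in) {
    std::vector<fasta_record> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1);
        if (line.empty()) continue;
        if (line[0] == '>') {
            fasta_record rec;
            rec.header = line.substr(1);
            records.push_back(rec);
        } else if (!records.empty()) {
            records.back().sequence += line;
        }
        // Sequence text before the first '>' is silently dropped in this sketch.
    }
    return records;
}

// Write one record, wrapping the sequence at `width` characters per line.
inline void write_fasta(std::ostream& out, const fasta_record& rec,
                        std::size_t width = 60) {
    assert(width > 0);
    out << '>' << rec.header << '\n';
    for (std::size_t pos = 0; pos < rec.sequence.size(); pos += width)
        out << rec.sequence.substr(pos, width) << '\n';
}
```

Since both functions take std::istream and std::ostream, they work equally with files (std::ifstream, std::ofstream), string streams, and standard input or output, and they pull in nothing from the "string" directory.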

This idea could also help with multi-developer work, where different people can be responsible for managing different "directories" or components.