There are analogous problems in other disciplines, like recognizing the source of a text based upon grammatical and word-usage or determining the path of a video game; nevertheless, we’ll look at an example in molecular biology. So organisms have genomes and these genomes are made up of nucleotides. There are 4 (A, T, C & G). In other words, the alphabet of life is just four characters. Okay, so these nucleotides belong to one of two groups – purines (A & G) and pyrimidines (C &T) – which we often abbreviate in a 2 letter code as R and Y respectively. Let’s say that we have the complete genome sequence for some horrible virus strain. We then are sampling people showing symptoms and get fragments (aka words) of the virus which infects these. (Think the movie Contagion.) Anyways, our goal is to determine if the patient in fact has the terrible virus or just some boring one like a cold or the flu. We can solve this problem using a tree.
How??
Let’s see for example, if I have the words rrry, rryr and ryry, I could represent them with this tree: See image.
A few things to note. The link between parent and child implies that the letter of the child is seen subsequent to the letter of the parent in the word.
Secondly, notice that r followed by y occurs three times in the tree. A space efficient tree would link back to such occurrences. This is complicated. Let’s just be simple.
Let’s say I have the sequence rrry. Is it in the genome? See image.
Is rryy in the genome? See image.
To get you started, I’ve created two classes BTreeNode and QTreeNode.
class BTreeNode
{
public:
char Data;
BTreeNode *Lchild;
BTreeNode *Rchild;
};
class QTreeNode
{
public:
char Data;
QTreeNode *child1;
QTreeNode *child2;
QTreeNode *child3;
QTreeNode *child4;
};
I’ve also created two functions get_words(…) and get_reads(…) which will get the words from the genome file and reads from the file respectively. These functions are written to read in the files of a 4 character alphabet and either convert them to a two character alphabet (in the case that you will be using BTreeNode) or stay as a four character alphabet (in the case that you will be using QTreeNode).
Your task is to develop a class which either uses BTreeNode or QTreeNode that will take the words from the vector created by the get_words(…) function and create either a binary tree or a quaternary tree using BTreeNode or QTreeNode, respectively. This class will then have functionality to traverse particular paths through the tree, as specified given a read contained within the vector created by get_reads(…). The file will be generated as specified previously. The number of reads that map (ie, “Yes”) and the total number of reads will be reported to the user via the console.
In addition to your code file(s). Your submission must also include a 1/2 - 3/4 page report. In this report, you must clearly state: