The charm and grace of computing science lies not just in the theoretical aspect nor in the programming aspect. It is the mixture of both that makes computing science the most exquisite of all the sciences and this is best epitomised by the study of compiler theory and design.
A compiler for DNA? As we examine the genetic and molecular mechanisms involved in transcription and translation, we will observe that is exactly what happens in life. It scans, parses, and generates code in the form of proteins that can then be "run" to perform various tasks.
Justifying a claim that DNA is actually "compiled" in the cell to produce proteins that are "executed" is nontrivial. If we completely understood the process, then we would be able to compile DNA in a machine (which is part of our ultimate goal), but that is not the case. However, attempting to understand how DNA is transcribed and translated in terms of formal languages is extremely useful because it provides us with yet another means of representing the nucleic acid chain, and it also enables us to derive mRNA and protein structure from sequence.
The first operation that a compiler performs is scanning or lexical analysis. The input is in the form of tokens or lexicals that are examined by the scanner, an action may be performed, and the resulting token is passed on to the parser. The token is encoded into a unique integer so the parser and the scanner both recognise the same token. The parser then parses the tokens according to the rules specified by a grammar and performs certain semantic actions.
Lexical scanning is rather simple in this particular instance. David Searls, who has done research on the liguistics of DNA, has come up with a grammer that formalises a gene. If we try to write a compiler for a strand of DNA based on his grammar, then the only tokens that we need are the four nucleotide bases. Since Searls has already written a parser (a recogniser and a generator) for DNA sequences, I shall not attempt to reinvent the wheel and come up with a new grammar; instead, I shall attempt to write a grammar that will parse mRNA and produce output in the form of amino acid chains (next chapter). A device that examines its input, parses it, and writes output according to a fixed set of rules or productions is called a transducer. What Searls has done is to come up with a context-sensitive transducer (he also proves that DNA is inherently context-sensitive) which takes as input a strand of DNA and produces a strand of mRNA as its output. The syntax tree produced by the parser gives an indication (I am not sure how good this is) of the secondary structure of the mRNA strand. The interested reader will examine David Searls' work, in particular, The Linguistics of DNA.
A transducer for DNA transcription (such as the one constructed by Searls, mentioned above) will produce mRNA as final output. It will also enable us to determine the numbers and structures of tRNA and rRNA molecules coded for by a given DNA strand. Using Searls' parser (which also handles intron splicing) and redirecting its output as input for our mRNA -> protein transducer, we can can compile a strand of DNA into protein. Thus, the most essential components of the cell will have now been formalised in a computer: DNA, hnRNA, mRNA, tRNA, rRNA and proteins. The next crucial step is the prediction of the structure and function of the protein sequences generated from the mRNA sequences. The rest of the activity of the cell then involves protein and enzyme interaction, replication of DNA, and the division of the newly constructed ``cell''. All these issues will be addressed in later chapters.
The practical aspects of this undertaking must once again be emphasised, because what we have done in this stack frame is to attempt to come up with a way to derive protein structure from sequence using formal language theory; if successful, this will have enormous implications in the field of biochemical genetics.