The DNA nucleotides are adenine, cytosine, guanine, and thymine, usually represented as A, C, G, and T. RNA differs from DNA in that it contains the closely related base uracil (U) instead of thymine. The base-pairing rules state that an adenine pairs only with a thymine in the complementary strand, through two hydrogen bonds. Cytosine and guanine likewise pair only with each other, through three hydrogen bonds. A and G are called purines; C, T, and U are called pyrimidines.
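The pairing and classification rules above can be written down as small lookup tables. The following is a minimal sketch in Python; the names `PURINES`, `PYRIMIDINES`, `DNA_PAIR`, `complement`, and `transcribe` are illustrative assumptions and do not come from the compiler itself. Strand direction (5' to 3') is ignored for simplicity.

```python
# Illustrative sketch only: all names here are assumptions, not part of mRNA.l or mRNA.y.
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CTU")

# Watson-Crick partners for DNA bases.
DNA_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand (direction ignored)."""
    return "".join(DNA_PAIR[base] for base in strand)


def transcribe(template: str) -> str:
    """Return the mRNA paired against a DNA template strand: complement, with U for T."""
    return complement(template).replace("T", "U")


def is_purine(base: str) -> bool:
    """True for A and G, False for C, T, and U."""
    return base in PURINES
```

For example, `transcribe("TACG")` pairs each template base and substitutes uracil for thymine, giving `"AUGC"`. Every pair in `DNA_PAIR` joins one purine with one pyrimidine.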
The lexical tokens considered here are the nucleotide bases A, C, G, and U, since the primary concern is the translation process. One could instead take codons as the tokens, which might prove to be a better approach, but the nucleotide-level approach is the one used during the development of this compiler, with LR(1) parsing and some context sensitivity provided by tools such as Lex and Yacc. Note that at this point we are dealing with mRNA and not hnRNA (which has to undergo post-transcriptional modification before it can be translated). We therefore have no need to formally specify intron splicing mechanisms. See mRNA.l for the Lex specifications.
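To make the scanning step concrete, here is a minimal sketch of a scanner written in Python rather than Lex. It is not the content of mRNA.l, which is not reproduced here; the function names `scan` and `codons` are assumptions chosen for illustration. `scan` emits one token per nucleotide base, and `codons` shows the alternative of grouping those tokens into triplets.

```python
from typing import Iterable, Iterator

# The four mRNA bases accepted as tokens.
BASES = frozenset("ACGU")


def scan(mrna: str) -> Iterator[str]:
    """Yield one token per nucleotide base, skipping whitespace.

    Raises ValueError on any character that is not A, C, G, or U.
    """
    for offset, char in enumerate(mrna):
        if char.isspace():
            continue
        base = char.upper()
        if base not in BASES:
            raise ValueError(f"unexpected character {char!r} at offset {offset}")
        yield base


def codons(bases: Iterable[str]) -> Iterator[str]:
    """Group base tokens into codon tokens; a trailing partial codon is dropped."""
    triplet = []
    for base in bases:
        triplet.append(base)
        if len(triplet) == 3:
            yield "".join(triplet)
            triplet = []
```

Feeding `scan` into `codons` gives the codon-level token stream, so a parser could be written against either granularity.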
Once scanning is complete, the rest of the compilation process has to take several things into consideration. Even though we are dealing with mRNA and not hnRNA (intron splicing is a problem in itself, which will have to be dealt with by another transducer specifically constructed for that purpose), we have to consider the fact that amino acid chains assume a 3D configuration, and this is how they interact with other proteins. Protein secondary structure is derived from several sources. The sequence of the mRNA strand has an important role in determining the primary peptide chain structure. The secondary structure of the mRNA strand also plays a definitive role. The structures of the various translational mechanisms (ribosomes, tRNA, etc.) help in resolving the protein secondary structure. Finally, as the protein is being translated, the part that has already undergone ribosomal processing takes on a 3D configuration and determines the final structure of the protein.
The idea is to come up with a syntax-directed translation that can produce code (polypeptide chains) by (i) looking at the mRNA sequence and determining the amino acid chain, (ii) looking at the parse trees of the mRNA sequence and using that information during the encoding, (iii) taking into account factors such as tRNA and rRNA structure and the wobble hypothesis, and perhaps performing some sort of ``error checking'' to account for mutations, and (iv) performing post-translational modifications as part of a ``global optimisation'' process.
It is not practical to do all of this in a single step. Rather, a multipass approach will have to be used. The first pass will examine a linear strand of mRNA and produce the corresponding polypeptide. This can be accomplished by specifying a formal grammar for translation.
The following set of productions describes how an amino acid chain can be obtained from an mRNA strand:
    protein --> mRNA
    mRNA --> untranslated_region translated_region untranslated_region
    translated_region --> start_codon amino_acid_chain terminator
    start_codon --> met
    amino_acid_chain --> amino_acid_chain codon | null
    untranslated_region --> untranslated_region untranslated_codon | null
    codon --> hydrophobic_side_chain | charged_side_chain | polar_side_chain | gly
    hydrophobic_side_chain --> ala | val | leu | ile | phe | pro | met
    charged_side_chain --> asp | glu | lys | arg
    polar_side_chain --> ser | tyr | cys | asn | gln | his | thr | trp
    terminator --> ochre | opal | amber
    untranslated_codon --> leu | phe | cys | trp | tyr | val | gly | ala | glu | asp | pro | arg | his | gln | ser | thr | ile | asn | lys | ochre | opal | amber
    lys --> A A purine
    asn --> A A pyrimidine
    ile --> A UR pyrimidine | A UR A
    thr --> A C base
    met --> A UR G
    ser --> A G pyrimidine | UR C base
    gln --> C A purine
    his --> C A pyrimidine
    arg --> C G base | A G purine
    pro --> C C base
    asp --> G A pyrimidine
    glu --> G A purine
    ala --> G C base
    gly --> G G base
    val --> G UR base
    tyr --> UR A pyrimidine
    trp --> UR G G
    cys --> UR G pyrimidine
    phe --> UR UR pyrimidine
    leu --> UR UR purine | C UR base
    amber --> UR A G
    ochre --> UR A A
    opal --> UR G A
    base --> purine | pyrimidine
    pyrimidine --> C | UR
    purine --> A | G
The file mRNA.y contains the Yacc specifications for the above grammar rules.
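As a rough illustration of what the first pass computes, the productions can be flattened into a codon table and applied in frame. The sketch below is written in Python and is not the Yacc specification in mRNA.y; the names `RULES`, `CODON_TABLE`, `TERMINATORS`, and `translate` are assumptions made for this example. The table follows the standard genetic code, so all 64 codons are covered, including AUA for ile and the UC-prefixed codons for ser. Like the grammar, it reads codons from the first base, treats everything before the first met as untranslated, and stops at the first terminator.

```python
from typing import List

PURINE = "AG"
PYRIMIDINE = "CU"
BASE = PURINE + PYRIMIDINE

# (first two bases, permitted third bases, grammar symbol)
RULES = [
    ("AA", PURINE, "lys"), ("AA", PYRIMIDINE, "asn"),
    ("AU", PYRIMIDINE + "A", "ile"), ("AU", "G", "met"),
    ("AC", BASE, "thr"),
    ("AG", PYRIMIDINE, "ser"), ("UC", BASE, "ser"),
    ("AG", PURINE, "arg"), ("CG", BASE, "arg"),
    ("CA", PURINE, "gln"), ("CA", PYRIMIDINE, "his"),
    ("CC", BASE, "pro"),
    ("GA", PYRIMIDINE, "asp"), ("GA", PURINE, "glu"),
    ("GC", BASE, "ala"), ("GG", BASE, "gly"), ("GU", BASE, "val"),
    ("UA", PYRIMIDINE, "tyr"), ("UG", "G", "trp"), ("UG", PYRIMIDINE, "cys"),
    ("UU", PYRIMIDINE, "phe"), ("UU", PURINE, "leu"), ("CU", BASE, "leu"),
    ("UA", "G", "amber"), ("UA", "A", "ochre"), ("UG", "A", "opal"),
]

# Expand each rule into its concrete codons.
CODON_TABLE = {
    prefix + third: symbol for prefix, thirds, symbol in RULES for third in thirds
}
TERMINATORS = frozenset({"amber", "ochre", "opal"})


def translate(mrna: str) -> List[str]:
    """Return the amino acid chain from the first in-frame met up to the terminator."""
    symbols = [CODON_TABLE[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3)]
    if "met" not in symbols:
        raise ValueError("no start codon (AUG) found in frame")
    chain = ["met"]
    for symbol in symbols[symbols.index("met") + 1:]:
        if symbol in TERMINATORS:
            return chain
        chain.append(symbol)
    raise ValueError("no terminator codon after the start codon")
```

For instance, `translate("GCCAUGGCUUUUUAAGGG")` skips the leading GCC, starts at AUG, and stops at UAA (ochre). Unlike a generated parser, this sketch does not check the trailing untranslated region against the grammar.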
A representation (data structure) for the amino acids involved is essential to manage their interactions. After the first pass of the compiler, an abstract syntax tree (AST) of amino acids is obtained in the form of a linear linked list. A ``symbol table'' will have to be developed to account for the number and type of tRNA molecules, ribosomes, and other components involved in translation. A study of how amino acids interact, together with data provided by techniques such as X-ray diffraction and NMR, will enable us to add more semantic actions to the parser so that we can change the structure of the linked list to reflect protein secondary structure (spatial relationships between amino acids). This step could be considered analogous to ``type matching'' in a programming language compiler. Further, a greater understanding of how secondary structure elements interact will possibly enable us to decipher tertiary (3D) protein structure, thus completing a compiler for mRNA translation.
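One possible shape for these data structures is sketched below in Python. Everything in it is a hypothetical illustration: the class `Residue`, the helpers `build_chain`, `walk`, and `link_contact`, and the placeholder counts in `symbol_table` are assumptions, not part of the existing compiler and not measured data. The `contacts` field stands in for whatever a later pass would record about spatial relationships.

```python
from typing import Iterable, Iterator, Optional


class Residue:
    """One amino acid node in the linked list produced by the first pass."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.next: Optional["Residue"] = None  # following residue in the primary sequence
        self.contacts: list = []  # hypothetical: residues a later pass marks as spatially close


def build_chain(names: Iterable[str]) -> Optional[Residue]:
    """Build a singly linked list of residues; returns None for an empty chain."""
    head = tail = None
    for name in names:
        node = Residue(name)
        if head is None:
            head = node
        else:
            tail.next = node
        tail = node
    return head


def walk(head: Optional[Residue]) -> Iterator[Residue]:
    """Iterate over the chain in primary-sequence order."""
    node = head
    while node is not None:
        yield node
        node = node.next


def link_contact(a: Residue, b: Residue) -> None:
    """Semantic-action stand-in: record that two residues are spatially adjacent."""
    a.contacts.append(b)
    b.contacts.append(a)


# Placeholder symbol table; the entries and counts are made up for illustration.
symbol_table = {"ribosome": 1, "tRNA": {"met": 1, "ala": 1, "phe": 1}}
```

A later pass could walk the list and call `link_contact` wherever structural data suggests two residues sit close together, leaving the primary sequence (the `next` links) untouched.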