Among the most mysterious features of evolving genomes are stretches of DNA that carry two or more kinds of information in a single sequence.
From the 1950s through the 1970s, molecular biologists were sure that a given DNA sequence could not encode more than one polypeptide chain. The reasoning was that a single mutation could then alter two different proteins at once, greatly reducing the chance that any change would be beneficial. But in 1977 Fred Sanger and his colleagues sequenced the genome of the small bacterial virus ΦX174. They discovered overlapping coding regions where two different reading frames* were used simultaneously; a single DNA sequence encoded segments of two different proteins.
Initially the Sanger et al. discovery was attributed to the need to economize coding capacity in the genome of a small virus. Nonetheless, other coincident messages began to pop up all over the place as sequence data accumulated. For example, start sites for initiating and controlling RNA synthesis appeared repeatedly inside coding sequences for bacterial proteins. Once again, the need to maintain a streamlined genome size was invoked as an explanation for the unexpected results. But in the early 1980s, Antonio Cascino and his colleagues analyzed published sequences from mammalian genomes and showed that both strands of at least 50 genetic loci encoded proteins of over 100 amino acids. As Cascino said almost thirty years ago in a seminar about human DNA encoding hemoglobin, "If you did not already know the protein sequence, you couldn't tell which strand was the important one."
A recent paper has taken advantage of the alignment of related segments from 29 mammalian genomes to show that coincident messaging inside coding sequences is widespread. Lin et al. searched for shared coding regions where synonymous base substitutions (ones that would not alter the protein amino acid sequence) were exceptionally rare. Constraint on synonymous changes indicates that selection for some function other than protein coding keeps the nucleotide sequence unchanged. More than 25% of all protein-coding loci contained such "synonymous constraint" regions, and the selected function could often be identified. In the authors' words: "...splicing regulatory elements, dual-coding genes, RNA secondary structures, microRNA target sites, and developmental enhancers. Our results show that overlapping functional elements are common in mammalian genes, despite the vast genomic landscape."
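To make the idea of a synonymous substitution concrete, here is a minimal Python sketch. It is not the statistical method used by Lin et al.; it only lists, for one codon, the single-base changes that leave the encoded amino acid unchanged. The function name synonymous_neighbors is my own illustrative choice, and the table is the standard triplet code written in the conventional T, C, A, G order.

```python
from itertools import product

# Standard triplet code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_neighbors(codon):
    """Return the codons one substitution away that encode the same amino acid.

    Illustrative helper (hypothetical name), not taken from any cited paper.
    """
    same = []
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                same.append(mutant)
    return same


if __name__ == "__main__":
    for codon in ("GCC", "CTG", "ATG"):
        print(codon, CODON_TABLE[codon], synonymous_neighbors(codon))
```

In this sketch an alanine codon such as GCC tolerates any change at its third position, while ATG (methionine) has no synonymous neighbors at all. A region where the tolerated changes are nonetheless missing across many species is the kind of signal the "synonymous constraint" search looks for.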
This multiplicity of coincident messages raises a number of intriguing questions. We can answer at least one of them: Is there something special about the triplet code for amino acids (sometimes erroneously referred to as "the genetic code") that allows multiple messages to overwrite protein sequence information? Yes, there is something special. Following a question I asked during a visit to the Weizmann Institute, Shalev Itzkovitz and Uri Alon analyzed the capacity of the 13,824 possible triplet codes to carry additional messages. They discovered that the triplet code used in living cells is one of only a dozen or so possible triplet codes that are optimal for overwriting additional sequences into the string of nucleotides encoding a defined protein segment.
Another question is harder to answer: How do multiple messages come to be inscribed in a single sequence in the course of evolution? This is an evolutionary mystery, especially when the second message has a complex structure. My own particular intellectual headache comes from structures called "shufflons" found in some bacteria that use them to diversify extracellular protein structures. Variability in surface proteins is advantageous in extending the range of specific cell-cell attachments for transfer of DNA and other macromolecules.
In a shufflon, the coding sequence contains two or more copies of the intricate signals required for a DNA rearrangement process known as "site-specific recombination." When a coding region carrying two or more recombination sites undergoes an inversion, the protein sequence changes because there is now a new string of triplet codons between the recombining sites. Some shufflons have up to seven different recombination sites embedded in the coding sequence. These structures are theoretically capable of generating over 100 different protein-coding DNA sequences (33 of which have actually been isolated from one shufflon).
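The effect of an inversion on the codon string can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the sequence and the two breakpoints are arbitrary, whereas a real shufflon inverts only between specific recombination sites recognized by its recombinase. The function names invert_segment and codons are hypothetical.

```python
# Map each base to its partner on the complementary strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def invert_segment(seq, start, end):
    """Replace seq[start:end] with its reverse complement.

    This mimics what a DNA inversion does to the strand being read.
    The breakpoints here are arbitrary positions, not real recombination sites.
    """
    segment = seq[start:end]
    return seq[:start] + segment.translate(COMPLEMENT)[::-1] + seq[end:]


def codons(seq):
    """Split a sequence into complete triplets, dropping any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


if __name__ == "__main__":
    original = "AAGGCCAGCTGC"
    inverted = invert_segment(original, 3, 9)
    print(codons(original))
    print(codons(inverted))
```

Inverting the middle six nucleotides of this toy sequence leaves the flanking codons intact and puts a new pair of codons between them; inverting the same span a second time restores the original. With several recombination sites, different pairs of breakpoints give different codon strings, which is how a single coding region can yield many protein variants.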
Such remarkable protein diversifying systems in bacterial genomes pose a mystery. How do the recombination sites evolve within sequences encoding functional proteins? It does not make sense to argue that each one evolved by selection operating a few nucleotides at a time; there is no benefit until at least two complete recombination signals are present. Moreover, known mechanisms for duplicating and inserting copies of a complex DNA signal at multiple locations generally disrupt coding capacity. Further, as in mammalian dual-coding regions, we do not understand how both strands evolve simultaneously to encode functional protein segments.
At a time when we pride ourselves on being able to read DNA sequences with increasing speed, it is salutary to keep in mind that we are still far from knowing how to interpret the complex overlapping meanings contained in the genomic texts we store in our databases. DNA, like poetry, often has to be read in several ways.
* READING FRAME: Every DNA strand has three "reading frames" for translating the sequence into amino acids, which occurs three nucleotides (= 1 codon) at a time. Starting one nucleotide back or one nucleotide forward shifts the reading frame to a different sequence of codons. For example, the DNA sequence AAGGCCAGCTGC can be read as (AAG)-(GCC)-(AGC)-(TGC), (*AA)-(GGC)-(CAG)-(CTG)-(C**), or (**A)-(AGG)-(CCA)-(GCT)-(GC*) in each of the three reading frames, where each asterisk stands for a neighboring nucleotide outside the sequence shown.
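The three groupings in the footnote can be reproduced with a short Python sketch. The function name reading_frames is an illustrative choice, and the sketch pads with asterisks for the unknown neighboring nucleotides, following the notation used above.

```python
def reading_frames(seq, pad="*"):
    """Return the codon grouping of seq in each of the three reading frames.

    Frame k is built by placing k placeholder characters before the sequence,
    then filling the end with placeholders so the length is a multiple of three.
    """
    frames = []
    for left in range(3):
        padded = pad * left + seq
        padded += pad * (-len(padded) % 3)
        frames.append([padded[i:i + 3] for i in range(0, len(padded), 3)])
    return frames


if __name__ == "__main__":
    for frame in reading_frames("AAGGCCAGCTGC"):
        print("-".join(f"({codon})" for codon in frame))
```

The same twelve nucleotides give three unrelated codon strings. An overlapping gene of the kind found in ΦX174 is one where two of these readings both encode part of a functional protein.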
Grindley, N. D., K. L. Whiteson, et al. (2006). "Mechanisms of site-specific recombination." Annu Rev Biochem 75: 567-605. http://www.ncbi.nlm.nih.gov/pubmed/16756503.
Gyohda, A., S. Zhu, et al. (2006). "Asymmetry of shufflon-specific recombination sites in plasmid R64 inhibits recombination between direct sfx sequences." J Biol Chem 281(30): 20772-20779. http://www.ncbi.nlm.nih.gov/pubmed/16723350.
Itzkovitz, S. and U. Alon (2007). "The genetic code is nearly optimal for allowing additional information within protein-coding sequences." Genome Res 17(4): 405-412. http://www.ncbi.nlm.nih.gov/pubmed/17293451.
Komano, T. (1999). "Shufflons: multiple inversion systems and integrons." Annu Rev Genet 33: 171-191. http://www.ncbi.nlm.nih.gov/pubmed/10690407.
Lin, M. F., P. Kheradpour, et al. (2011). "Locating protein-coding sequences under selection for additional, overlapping functions in 29 mammalian genomes." Genome Res 21(11): 1916-1928. http://www.ncbi.nlm.nih.gov/pubmed/21994248.
Sanger, F., G. M. Air, et al. (1977). "Nucleotide sequence of bacteriophage phi X174 DNA." Nature 265(5596): 687-695. http://www.ncbi.nlm.nih.gov/pubmed/870828.
Tramontano, A., V. Scarlato, et al. (1984). "Statistical evaluation of the coding capacity of complementary DNA strands." Nucleic Acids Res 12(12): 5049-5059. http://www.ncbi.nlm.nih.gov/pubmed/6547531.