Julian Gough, a bioinformaticist at the Medical Research Council Laboratory of Molecular Biology (LMB), Cambridge, England, believes in the tradition of open science – the free-gift-to-the-world kind of science that first brought us the World Wide Web, developed at CERN. Gough’s gift to the world is SUPERFAMILY, a database all about proteins in genomes, which he created as part of his PhD a dozen or so years ago.
Plug certain information into SUPERFAMILY and it can analyze a vast assortment of genomes and assist you in building a Tree of Life using protein domains, the conserved parts of thousands and thousands of protein structures, grouped into superfamilies, i.e., domains with an evolutionary relationship. Everyone can access this robust repository online, and it looks like everyone is. Gough says the site currently gets one hit per second.
Julian John Thurstan Gough (not to be confused with Julian Gough, the Irish novelist and former singer with the band “Toasted Heretic”) received his PhD at the University of Cambridge in Theoretical and Computational Biology (“Hidden Markov models and their application to the genome analysis in the context of protein structure”). His doctoral advisor at Cambridge was Cyrus Chothia, and Gough says SUPERFAMILY grew in part from Chothia’s work and influence. Gough’s undergraduate degree is in Mathematics and Physics (joint honors) from the University of Bristol. His postdoctoral research was with Nobel laureate Michael Levitt at Stanford University and at LMB.
He’s been a research scientist at the RIKEN Genomic Sciences Centre, Japan, and at the Pasteur Institute in Paris, a professor at Tokyo Medical and Dental University, and, until his move to the MRC Laboratory of Molecular Biology earlier this year, a professor at the University of Bristol in the UK.
I spoke recently by phone with Julian Gough at his lab at LMB, Cambridge about the wonders of SUPERFAMILY.
Suzan Mazur: What is your role there at LMB? You made the move from the University of Bristol fairly recently?
Julian Gough: Yes. I started in January. My job title is program leader. So I’m currently recruiting postdocs. I didn’t move any of my group from Bristol. They’re all finishing up. My role here is to run a scientific program.
Suzan Mazur: Program leader in what area?
Julian Gough: My two main chosen research directions at the moment are cell reprogramming, that is, computational prediction of cell reprogramming, and phenotype prediction. Those are my two main projects, but underlying that in everything I do, I have a history of developing algorithms and resources, and of course, that has to continue to support all the rest of the research that I do.
Suzan Mazur: So it’s bioinformatics.
Julian Gough: Yes, bioinformatics. I make these resources public and people around the world can use them for their research too.
Suzan Mazur: That’s wonderful. You’re the father and principal architect of the SUPERFAMILY database.
Julian Gough: Yes. I think that’s fair to say. It was conceived while I was doing a PhD with Cyrus Chothia, and it’s highly dependent on the SCOP database [Structural Classification of Proteins], and the architect of that, or the main person for the intellectual content, was Alexey Murzin. SUPERFAMILY grew from their influence and their work.
Suzan Mazur: SUPERFAMILY 1.75 was part of your doctoral dissertation at Cambridge.
Julian Gough: The original one was one of my outputs for my PhD -- not 1.75. I can’t remember which number it was on then. It’s since been updated many times and its current version is 1.75.
Suzan Mazur: Would you say essentially what the database is?
Julian Gough: Three-dimensional atomic resolution protein structures are solved using experimental techniques, such as crystallography, NMR [nuclear magnetic resonance] and cryo-electron microscopy, and the coordinates of these structures get deposited in the Protein Data Bank (PDB).
The SCOP database that I alluded to before takes those structures and it breaks them into domains, which are globular units of evolution. It classifies them into evolutionary-related groups called Superfamilies. There’s actually a whole hierarchy in the classification, but the relevant level for the SUPERFAMILY database is unsurprisingly the Superfamily level.
What the SUPERFAMILY database does, using hidden Markov models and a couple of other techniques, is attempt to take those domains of known structure, as classified in the SCOP database, and map them to genome sequences.
There are about 100,000 protein structures, probably more – getting on to 150,000 protein structures in the PDB.
Suzan Mazur: Who has actually done the protein x-ray crystallography in your database?
Julian Gough: The whole world has been working on it since the first protein structure by Max Perutz of myoglobin [the myoglobin structure was solved by John Kendrew; Perutz solved hemoglobin]. Many, many labs – all structural biology labs around the world have been solving these structures and depositing them in the PDB.
Suzan Mazur: How can you be sure that the protein x-ray crystallography is solid? That the previously done work you’ve got in your database is accurate?
Julian Gough: Some protein structures may be of low quality. They have to go through a great deal of validation steps to make it into the Protein Data Bank. Some of them may still contain errors. But I think it’s very rare and there are probably very few Superfamilies in the database that are due to errors in the solution of the structures.
A Superfamily will often have many structures representing it. So even if there is a mistake in one structure, once you accumulate a whole group of structures, it should be very clear that they are not all going to have the same mistake.
Suzan Mazur: Do protein Superfamilies represent the current limits of our ability to identify common ancestry?
Julian Gough: Yes. That is exactly what their definition is. So if you want to group two protein structural domains into the same Superfamily, the question that you ask is whether there is structural, sequence and functional evidence for common evolutionary ancestry. So they’re classified based on that.
The most powerful part of that classification comes from the structure. Structure is far more conserved than sequence and so the knowledge of the structure allows you to classify very distantly related things that have no apparent or detectable similarity in sequence.
Julian Gough: Several people investigating evolution use the data, as they have. And you have, I think, in the SUPERFAMILY database -- via the SCOP classification and structures -- the most distant evolutionary classification that you can have at the moment, mapped to all genomes.
Suzan Mazur: I see you are interested in viral evolution. Can you track viruses for a Tree of Life through your database? I understand the virus genome may be too small compared to that of a cellular organism to effectively include them at this point in a ToL.
Julian Gough: Viruses can be sequenced.
Suzan Mazur: But they move nonlinearly and consortially.
Julian Gough: That’s not my area of expertise.
Some viruses are very large, larger even than small bacteria. The majority of viruses are very small and some of them don’t even contain any protein or they contain one protein. So using this approach is insufficient.
Viruses also evolve orders of magnitude more quickly. Because they evolve so rapidly, within a population of viruses you have a diversity that is not comparable to the diversity you’d get within a population of cellular organisms.
Suzan Mazur: So their presence could really affect the Tree of Life, considering how they evolve? The recent Ebola and Zika epidemics, for instance. The way the viruses spread was off the chart.
Julian Gough: I’m not aware of many people who have attempted to reconstruct an evolutionary tree of viruses.
To put the evolution of viruses in context, you have different categories of viruses. You have double-stranded DNA, single-stranded DNA, double-stranded RNA, single-stranded RNA. They’re not even using the same way of storing genetic code. So to try to make an evolutionary tree of viruses, bringing these together, I think it goes too far back.
Suzan Mazur: But the viral content of the human genome is 10%, as well as the genome content of other animals, for example.
Julian Gough: Yes. So this crosstalk, I guess it can lead to horizontal gene transfer from cellular organisms to viruses and then transported back into cellular organisms. If you’re trying to look at resolving a new evolutionary tree of cellular organisms, viral transfers may add some noise to that. But I don’t think that they’re responsible for completely rewriting it.
Suzan Mazur: The SUPERFAMILY database is free. That’s great. How widespread is its use?
Julian Gough: There are various ways of looking at how widespread the use of SUPERFAMILY is. If you look at metrics, if you like that kind of thing, then the original paper has just passed 1,000 citations. But it also forms part of the InterPro database. So if you add up all the InterPro papers and SUPERFAMILY papers, it’s many thousands of citations. If you look at it in terms of how often the web site is accessed -- on average it’s about once per second. Thousands of different Internet addresses visit SUPERFAMILY every month.
Suzan Mazur: How does SUPERFAMILY 1.75 enable researchers? They go to your site and plug in what information to get results?
Julian Gough: The site has many different kinds of users. Someone may be using it for many different things. Some people are interested in a specific gene and they might go in and type in the code for that gene and see what the predicted domains are. Somebody else might be interested in families, so they might want to look at the list of proteins that contain a globin domain in the mouse genome. Or you get people who want to look at whole genomes. They may want to look at all kinds of domains: for example, what is the most common domain in a particular bacterium? And there are lots of tools there for cross-comparing between organisms.
Power users, or people who are investigating evolution, can download the whole database or parts of it. For example, you might download all of the domains, all the Superfamilies and their arrangement into domain architectures in all of the genomes and try to do a huge mass comparison across all of this. You can download in computer-readable files.
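[As a rough illustration of the kind of whole-genome question described above, here is a minimal Python sketch that tallies Superfamily assignments per genome and reports the most common one. It assumes a tab-separated table of genome, protein and Superfamily; that layout, the genome names and the function names are hypothetical and are not SUPERFAMILY's actual download format.]

```python
from collections import Counter, defaultdict

# Hypothetical tab-separated rows: genome, protein id, superfamily name.
# This layout is an assumption for illustration only; it is NOT the real
# SUPERFAMILY download format.
SAMPLE = """\
genome_a\tp1\tglobin-like
genome_a\tp2\tP-loop NTPase
genome_a\tp3\tP-loop NTPase
genome_b\tp4\tP-loop NTPase
genome_b\tp5\tglobin-like
genome_b\tp6\tglobin-like
"""


def count_superfamilies(text):
    """Return {genome: Counter({superfamily: number of assignments})}."""
    counts = defaultdict(Counter)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        genome, _protein, superfamily = line.split("\t")
        counts[genome][superfamily] += 1
    return counts


def most_common_superfamily(counts, genome):
    """Return (superfamily, count) for the most frequent superfamily."""
    return counts[genome].most_common(1)[0]


counts = count_superfamilies(SAMPLE)
print(most_common_superfamily(counts, "genome_a"))
```

[With a real download, the parsing step would need to be adapted to whatever columns the files actually contain.]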
Suzan Mazur: Who else is building a ToL using SUPERFAMILY besides Kurland and Harish?
Julian Gough: We have a paper on doing exactly that, although maybe not in the same way. But I think the first person to publish using the SUPERFAMILY database to attempt to build a tree was a student with Phil Bourne. That was more than 10 years ago.
Suzan Mazur: Do you see this as cutting edge, because most scientists still rely on gene sequence analysis for ToL?
Julian Gough: What defines cutting edge? I think if cutting edge is the best possible technique that you can apply to the problem -- in that sense it is. But, if you’re saying that because this has been used for more than 10 years, then it’s not cutting edge. Also, people were making gene sequence trees 10 years before that, so that’s even older.
Gene trees and trees derived from domain architectures give quite a different picture and they both have valid applications. But if you want to look at deep evolution, then using domain architectures will give you more information content on reaching deeper parts of the tree than you’d get from sequence alignments or gene trees.
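[To make the idea of comparing genomes by domain architecture concrete, here is a minimal Python sketch under stated assumptions. Each genome is reduced to the set of domain architectures it contains (an architecture being an ordered tuple of Superfamily labels), and genomes are compared by Jaccard distance on those sets. The genome names, the labels and the choice of Jaccard distance are illustrative assumptions, not the method used in any of the papers mentioned here.]

```python
from itertools import combinations

# Hypothetical repertoires: each genome maps to the set of domain
# architectures found in it. An architecture is a tuple of superfamily
# labels in N- to C-terminal order. All names are placeholders.
REPERTOIRES = {
    "genome_a": {("A",), ("A", "B"), ("C",)},
    "genome_b": {("A",), ("A", "B"), ("D",)},
    "genome_c": {("E",), ("C",)},
}


def jaccard_distance(x, y):
    """1 - |x & y| / |x | y|; defined as 0.0 when both sets are empty."""
    union = x | y
    if not union:
        return 0.0
    return 1.0 - len(x & y) / len(union)


def distance_matrix(repertoires):
    """Pairwise distances keyed by an unordered pair of genome names."""
    return {
        frozenset((g1, g2)): jaccard_distance(repertoires[g1], repertoires[g2])
        for g1, g2 in combinations(sorted(repertoires), 2)
    }


matrix = distance_matrix(REPERTOIRES)
for pair, dist in sorted(matrix.items(), key=lambda item: item[1]):
    print(sorted(pair), dist)
```

[A pairwise distance table of this kind is one possible input to a standard tree-building method; presence or absence of architectures is only one of several ways such data could be encoded.]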
Suzan Mazur: What do you say to those who say numbers are meaningless because we’re dealing with life regarding evolution?
Julian Gough: You could say that the whole field of evolution is not quantifiable, and think it’s not measurable because we were not there and so we really don’t know what happened. Putting numbers to things and including ranges of numbers and errors is required in science though, so you have to try.
Suzan Mazur: What are your thoughts about directed evolution? I see it’s one of your areas of research.
Julian Gough: Yes. We did some directed evolution experiments with yeast, but we’ve finished those experiments now.
Suzan Mazur: What are your thoughts about directed evolution?
Julian Gough: The way in which we were using it was to try to discover mechanisms by which multicellularity can evolve. I think for that and for other things too, it was a useful approach.
We also have a directed evolution project ongoing at the University of Bristol, but I’ve now left the project with the move to MRC. That project looks at directed evolution with some extremophile bacteria we will launch into space. We will allow the bacteria to evolve in space before sequencing them up there and beaming their evolved genomes back down to Earth to compare with parallel bacteria that have evolved on Earth.
It’s not beyond the reach of a university now to launch its own satellites. There are things called CubeSats and they’re about 30 centimeters by 10 centimeters by 10 centimeters. You can pay to piggyback on launches to put your experiments in orbit.
Suzan Mazur: Fascinating.
Julian Gough: The idea is to learn how organisms may have evolved ancient pathways to deal with space travel, if life did not originate on Earth. It’s a very far-fetched project really. It’s a teaching project -- more of an aspirational training exercise for students. It’s not a research project.
Suzan Mazur: What is your reaction to the scientific community regarding the Superfamilies protein database approach to mapping the Tree of Life? Investigators appear to be largely more comfortable with gene sequence analysis than with this more sophisticated approach. It seems to upset their scientific models, etc.
Julian Gough: In the field of studying ancient evolution, it’s very hard to falsify or verify. As such, people tend to attach unreasonable weight to their beliefs of one possibility or another.
A lot of the objections I see are not so much to the approach but to the conclusions. But often the devil is in the details – i.e., whatever data you use as your starting point for whatever philosophical or technical approach you take.
However, if somebody comes up to you and says, I observe the Earth to be flat, and this is the experiment I did to test it -- to reject the measurement without looking at the evidence is not scientific. It’s taking a belief almost at face value, as in religion, that because it goes against your belief that the Earth is round, it should not be examined. That’s not a scientific approach. Ironically, people in the past with this attitude would have been unable to discover that the Earth is round.
The scientific approach is to look at the evidence and the work that was done and try to come up with a better interpretation of it.