Dan's Dissertation

From GodOfDarkness

General Topic: The mutational landscape of protein sequences

I am using this page to collect my thoughts about my dissertation. You are free to eavesdrop.

According to the central dogma of molecular biology, the behavior and health of the cell depend primarily on protein function. Protein function, in turn, depends on protein structure, which depends on protein sequence, which depends on genetic sequence. Therefore, if a genetic mutation modifies a cell's behavior or makes it sick, we should expect this to be because the mutation changes the sequence (and therefore the structure and function) of an important protein. Recent advances in biology have begun to uncover other mechanisms for mutations to cause disease, but in general we still believe that most damaging mutations are damaging because they alter the behavior of a protein by changing its sequence. While this general fact is accepted (for the most part), there is still no easy way to translate between a mutation's effect on protein sequence and structure and its effect on the health and behavior of the organism. My general goal is to shed some light on the question of how changes in protein sequence translate to changes in phenotype, with particular attention to the impact of protein structure.

I currently have three more specific topics of interest. Hopefully all of these will make it into my dissertation.

Subtopic 1: Interpreting the impact of sequence changes on phenotype

This is the area I have done the most work on to date. I discuss the general problem, though not so much my work on it, at length in my review (Jordan et al. 2010).

In principle, most phenotypes should arise from protein structure. As such, it should be possible to predict the effect of a mutation by predicting its effect on protein structure. This is, unfortunately, easier said than done, for three reasons:

  1. It is very difficult to predict the effect of a mutation on protein structure. In order to do so with any kind of accuracy, one would have to model correctly the entire process of protein folding, a process that is very poorly understood. Current efforts do okay but not great, and are able to predict mutations as stabilizing or destabilizing with about 70% accuracy.
  2. Even if you can predict the effect of a mutation on protein structure, it is not necessarily straightforward to relate structure to function. In general we are inclined to call destabilizing mutations "bad" and all other mutations "neutral" (actively beneficial mutations are considered very rare), but not all destabilizing mutations necessarily cause a phenotype (possibly due to the epistatic phenomenon described below), and non-destabilizing mutations can also be bad in some cases.
  3. Finally, even if you can successfully predict structure and function, we only know the structures of about 10% of proteins. Any general-purpose method for predicting function from structure cannot rely too heavily on a score that only works for this small proportion.

For these reasons, most tools for predicting the effects of mutations use a different approach. The most common approach is to use evolutionary history as a probe for phenotype. The general idea is that evolution will not allow a damaging mutation to reach a significant frequency in any population; therefore, if we see the mutation in any homologous sequence, we can usually conclude that the mutation is not damaging. The current generation of tools generally uses extensions of this method to account for chemical similarities between amino acids, and for the possibility of compensatory changes in more distant species (see below for more about this).
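The basic idea can be sketched in a few lines of Python. This is only an illustration of the naive rule described above, not how PolyPhen or any other real tool works; the function names, the toy alignment, and the output labels are all made up for the example.

```python
from typing import Dict, List

def seen_in_homologs(alignment: Dict[str, str], reference: str,
                     column: int, variant: str) -> List[str]:
    """Return the names of non-reference sequences that carry
    `variant` at alignment column `column` (0-based)."""
    return [name for name, seq in alignment.items()
            if name != reference and seq[column] == variant]

def naive_prediction(alignment: Dict[str, str], reference: str,
                     column: int, variant: str) -> str:
    """Naive rule: a residue observed in any homolog is assumed tolerated."""
    carriers = seen_in_homologs(alignment, reference, column, variant)
    return "probably tolerated" if carriers else "possibly damaging"

# Toy alignment (hypothetical sequences, already aligned, no gaps).
toy_alignment = {
    "human":     "MKTAYIAK",
    "mouse":     "MKTSYIAK",
    "chicken":   "MRTAYIAK",
    "zebrafish": "MKTSYLAK",
}

# Human has A at column 3; mouse and zebrafish have S there.
print(naive_prediction(toy_alignment, "human", 3, "S"))
print(naive_prediction(toy_alignment, "human", 3, "P"))
```

Real tools replace the yes/no rule with weighted scores, which is where the chemical similarity and compensatory-change extensions come in.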

One goal in this area is to create the next generation of these methods. My lab maintains PolyPhen, one of the more commonly used packages for this purpose, and I have been involved in some recent development on it. With the growing number of annotated human variants and of available full genome sequences, it is becoming increasingly attractive to incorporate these sources of data into these tools. Particularly interesting is the availability of whole-genome alignments (like those made with the MultiZ method, Blanchette et al. 2004) for large numbers of vertebrate species. I am involved in producing a new version of PolyPhen that incorporates these alignments, which enables new methods of interpreting sequence data.

Other research in this area focuses on more specific applications. One such application is medical: clinical geneticists require tools like PolyPhen to distinguish between causative and non-causative alleles. How well existing tools do at this task is unclear, since very few studies have been done with realistic clinical scenarios. I performed such a study testing the application of these tools to hypertrophic cardiomyopathy, a common genetic heart disease caused by a force transmission defect in the cardiac sarcomere. I found that PolyPhen-2 and similar tools were fairly bad at distinguishing pathogenic mutations from benign mutations out-of-the-box. However, the performance could be immensely improved by modifying the classifier to incorporate biological knowledge about the specific disease. The resulting classifier, called PolyPhen-HCM, is currently used in genetics clinics (Jordan, Kiezun, Baxter et al. 2011). On the other hand, a second study on Noonan syndrome, a kind of genetic dwarfism caused by a defect in the Ras/MAP kinase signaling pathway, found that PolyPhen-2 worked very well out-of-the-box, and an almost entirely unmodified version of PolyPhen-2 is now in clinical use for this disease (Ibid., unpublished data). It is unclear which of these cases is more typical; one possibility for future research would be conducting a large-scale study of multiple diseases to determine this.

Another specific application is to exome sequencing studies, such as those from the laboratory of Jay Shendure (Ng et al. 2009). These studies aim to identify rare causative alleles for uncharacterized diseases. They use tools like PolyPhen-2 to prioritize these alleles, filtering down from the thousands of coding variants found in a typical exome to a reasonable number for analysis. Some recent studies have bemoaned the lack of granularity of these tools: limiting to variants that PolyPhen-2 scores as damaging, after all, still leaves hundreds of variants, and further division by the PolyPhen score does not appear to have much predictive value. It is possible that a tool designed specifically with this task in mind would perform better at it. This work is still at the speculative stage.
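For concreteness, the filtering step looks roughly like the sketch below. The variant records, the `score` field, and the cutoff of 0.5 are all placeholders invented for the example; they are not PolyPhen-2's actual output format or thresholds.

```python
from typing import Dict, List

def prioritize(variants: List[Dict], threshold: float) -> List[Dict]:
    """Keep variants whose predicted-damage score is at least `threshold`,
    ranked from highest to lowest score."""
    kept = [v for v in variants if v["score"] >= threshold]
    return sorted(kept, key=lambda v: v["score"], reverse=True)

# Hypothetical exome variants with made-up scores between 0 and 1.
toy_variants = [
    {"id": "GENE_A:p.R12W", "score": 0.97},
    {"id": "GENE_B:p.A45T", "score": 0.10},
    {"id": "GENE_C:p.G200D", "score": 0.88},
    {"id": "GENE_D:p.L7V", "score": 0.42},
]

CUTOFF = 0.5  # arbitrary illustrative cutoff, not a recommended value

for v in prioritize(toy_variants, CUTOFF):
    print(v["id"], v["score"])
```

The granularity complaint is about what happens after this step: the ranking within the kept list is the part that does not seem to carry much information.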

Subtopic 2: Compensatory changes and covariation

One of the basic assumptions of the methods described above is that homologous genes have the same fitness landscape; in other words, that a variant that is damaging to one gene in one species is equally damaging to any related gene. It has long been known that this assumption is not entirely correct (see, for example, Kondrashov et al. 2002), but it has been difficult to quantify how common this kind of relationship really is. Using the large datasets of human mutations and multiple sequence alignments described above, we can now begin to do this kind of study. My results show that 8.5% of human disease mutations are found in another vertebrate species, an alarmingly high number (Jordan et al. manuscript in preparation). In most such cases, there are quite a lot of differences between the human sequence and the sequence that contains the human disease variant. The numbers of differences in these cases are large compared to other protein sequences, and even large compared to other proteins in the same orthologous family. Physical modeling of the protein structures involved suggests that the large number of differences in each protein hides one or two important functional sites that actually compensate for the disease variant, causing it to be tolerated. I am currently working on developing a method to identify which changes are really compensatory, with the hope of getting experimental validation in one of any number of experimental systems.
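The first step of this kind of analysis can be sketched as follows: find the species whose sequence carries the human disease residue, and count how many other positions differ from human. This is a toy illustration with invented sequences and function names, not the actual pipeline, and it says nothing about which of the differences are compensatory.

```python
from typing import Dict

def count_differences(a: str, b: str) -> int:
    """Number of aligned positions where two sequences differ,
    ignoring positions where either one has a gap ('-')."""
    return sum(1 for x, y in zip(a, b)
               if x != y and x != "-" and y != "-")

def carriers_with_divergence(alignment: Dict[str, str], reference: str,
                             column: int, disease_residue: str) -> Dict[str, int]:
    """Map each non-reference species carrying `disease_residue` at `column`
    to its total number of differences from the reference sequence."""
    ref_seq = alignment[reference]
    return {name: count_differences(ref_seq, seq)
            for name, seq in alignment.items()
            if name != reference and seq[column] == disease_residue}

# Hypothetical alignment; the "disease variant" is A->S at column 3.
toy_alignment = {
    "human": "MKTAYIAKQR",
    "mouse": "MKTAYIAKQR",
    "fugu":  "MRTSYLAKHR",
}

print(carriers_with_divergence(toy_alignment, "human", 3, "S"))
```

Note that the count includes the disease position itself; candidate compensatory sites would be among the remaining differences.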

Subtopic 3: The evolutionary landscape of protein structures

The methods described above aim to use evolutionary history to probe the functional effects of single amino acid changes. A somewhat related idea is to probe the functional effects of entire categories of changes. My preferred application is to use this method to probe protein structures: to test, for example, whether disrupting structural stability or biochemical function has a greater chance of causing disease, and how large the impact of each is. This kind of study has not been feasible until very recently, because the amount of genome data to test with has been very low. However, with the rise of widespread exome sequencing, we now have thousands of human exomes containing millions of protein-coding variants, and we can look in greater detail at the distribution of these variants between different functional and structural classes. My initial results show that mutations that disrupt biochemical function tend to have much larger effects than mutations that disrupt structural stability, but that mutations that disrupt stability are orders of magnitude more common and seem to be much more influential overall in the evolution of the genome (Tennessen et al., in press). I am looking into making more precise estimates of protein stability and determining the precise distribution of stability-affecting variants. I am also interested in using this information to estimate selection coefficients directly from protein structure, thus enabling me to model the evolution of protein structures by forward simulation.
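As a reminder of what forward simulation means here, the sketch below runs a textbook haploid Wright-Fisher model for a single variant with selection coefficient `s`. It is a generic illustration, not the model I intend to use: the mapping from protein stability to `s` is exactly the open question, so `s` is simply an input, and the population size and other parameter values are arbitrary.

```python
import random
from typing import List

def expected_next_freq(p: float, s: float) -> float:
    """Deterministic effect of selection in a haploid model where the
    variant has relative fitness 1 + s and the reference has fitness 1."""
    return p * (1 + s) / (1 + s * p)

def simulate(p0: float, s: float, pop_size: int,
             generations: int, seed: int = 0) -> List[float]:
    """Forward-simulate variant frequency: selection, then binomial
    sampling of `pop_size` individuals (genetic drift) each generation."""
    rng = random.Random(seed)
    p = p0
    trajectory = [p]
    for _ in range(generations):
        p_sel = expected_next_freq(p, s)
        copies = sum(1 for _ in range(pop_size) if rng.random() < p_sel)
        p = copies / pop_size
        trajectory.append(p)
        if p in (0.0, 1.0):  # variant lost or fixed; nothing more to simulate
            break
    return trajectory

# A mildly deleterious variant (s < 0) starting at 10% frequency.
traj = simulate(p0=0.1, s=-0.05, pop_size=200, generations=50, seed=1)
print(len(traj) - 1, "generations simulated; final frequency", traj[-1])
```

With structure-derived selection coefficients, the same loop could in principle be run over many variants at once to see what distribution of stability effects the model predicts.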

Papers

So far, I have published two papers that will (hopefully) become part of my dissertation, both on the subject of interpreting the impact of structural changes on phenotype:

One further paper is in the final stages of peer review and editing, and is expected to be printed soon in Science:

  • Tennessen, J. A., Bigham, A. W., O'Connor, T. D., et al. Evolution and functional impact of rare coding variation from deep sequencing of 2,440 human exomes. Science, in press.