A Probabilistic Approach to In Silico Protein Design — ASN Events

A Probabilistic Approach to In Silico Protein Design (#356)

Liguang Zhu 1, Benjamin Porebski 1, Ashley Buckle 1, Geoffrey Webb 1
  1. Monash University, Clayton, VIC, Australia
Proteins play an irreplaceable role in a wide variety of biochemical functions in living organisms. The increasing use of enzymes and other proteins in the pharmaceutical, chemical and agricultural industries has raised wide interest in protein engineering: to control and manipulate biological activities, and to improve stability and efficacy. For many years, successful protein designs have relied on rigid backbone structures as design templates, on mutagenesis of existing proteins, or on a combination of both approaches. We propose a novel protein design approach: applying suitable probabilistic models built upon a well-constructed multiple sequence alignment of the protein family. Such models characterize the correlations between the sites in a protein sequence, allowing us to find the protein sequence that is most representative of its protein family. The underlying hypothesis is that the known examples of a protein family can be considered to have been drawn from a distribution, where the probability of occurrence of a specific example is associated with its structural and functional efficacy. Moreover, if the hypothesis holds, we could design proteins around a specified region based on the probabilistic model of the protein family. One common probabilistic model is based upon the assumption of probabilistic independence of the amino acids in the protein sequence, which results in the consensus sequence. Under this assumption, the probability of occurrence of an example can be calculated as the product of its per-site amino acid frequencies. The consensus sequence is often used for conservation analysis as well as in some protein design cases, in which a small number of sequences were used to construct the multiple sequence alignments [1-3]. Instead, we used a large number of sequences to ensure the statistical significance of the amino acid distributions. For the Fibronectin type III (FN3) domain, we used over 2000 sequences. For the serpin superfamily, we used more than 200 sequences.
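As an illustration of the independence (consensus) model described above, the following is a minimal Python sketch, not the authors' implementation. It assumes a gap-free alignment of equal-length sequences; the toy alignment `TOY_MSA` and the function names `column_frequencies`, `consensus` and `log_probability` are hypothetical and chosen only for this example. Each column's amino acid frequencies are estimated independently, the consensus takes the most frequent residue per column, and the log-probability of a sequence is the sum of its per-site log frequencies.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy alignment: equal-length, gap-free sequences.
TOY_MSA = ["MKVL", "MKIL", "MRVL", "MKVF"]


def column_frequencies(msa, pseudocount=0.0, alphabet=AMINO_ACIDS):
    """Return one dict per alignment column mapping residue -> frequency."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    total = len(msa) + pseudocount * len(alphabet)
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in msa)
        freqs.append({a: (counts.get(a, 0) + pseudocount) / total
                      for a in alphabet})
    return freqs


def consensus(freqs):
    """Most frequent residue at each site (ties broken by alphabet order)."""
    return "".join(max(col, key=col.get) for col in freqs)


def log_probability(seq, freqs):
    """Log-probability of seq when sites are treated as independent."""
    total = 0.0
    for residue, col in zip(seq, freqs):
        p = col.get(residue, 0.0)
        if p == 0.0:
            return float("-inf")  # residue never observed at this site
        total += math.log(p)
    return total


freqs = column_frequencies(TOY_MSA)
print(consensus(freqs))  # MKVL for this toy alignment
print(log_probability("MKVL", freqs))
print(log_probability("MRVL", freqs))
```

Under the independence assumption the consensus sequence is, by construction, the sequence with the highest probability under the model. Real alignments contain gaps and unevenly sampled sequences, which this sketch does not handle; a nonzero `pseudocount` is one simple way to avoid zero probabilities for unobserved residues.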
The experimental results for our synthesized proteins show that the consensus model allowed us to create a thermostable serpin and a superstable FN3 with a Tm of 94.29°C. The Markov model we use weakens the assumption of amino acid independence: a site in the Markov model is considered conditionally dependent on its preceding site and its following site. According to the experimental results for FN3, the Markov model appeared to reduce thermal stability. The third model applies Chordalysis [4], which builds a chordal graph of sites representing the statistically significant correlations between any sites across the sequence. This model allows us to consider multi-site correlations that arise from tertiary structure.
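The Markov model mentioned above can be sketched in the same spirit. The abstract does not specify how the model was parameterised or how a design was selected from it, so the following is an assumption-laden illustration only: it fits a position-specific first-order Markov chain to a hypothetical toy alignment (`TOY_MSA`, over a reduced hypothetical alphabet `ALPHABET`), with add-one smoothing, and uses a Viterbi-style dynamic programme to find the most probable sequence under the chain. All function names are invented for this example.

```python
import math
from collections import Counter

# Hypothetical reduced alphabet and toy alignment (gap-free, equal length).
ALPHABET = "FIKLMRV"
TOY_MSA = ["MKVL", "MKIL", "MRVF", "MRVF", "MKVL"]


def fit_markov(msa, alphabet=ALPHABET, pseudocount=1.0):
    """Estimate log P(x_1) and position-specific log P(x_i | x_{i-1})."""
    n, length, k = len(msa), len(msa[0]), len(alphabet)
    first = Counter(seq[0] for seq in msa)
    log_start = {a: math.log((first[a] + pseudocount) / (n + pseudocount * k))
                 for a in alphabet}
    log_trans = []  # log_trans[i - 1] holds transitions into site i
    for i in range(1, length):
        pair = Counter((seq[i - 1], seq[i]) for seq in msa)
        prev = Counter(seq[i - 1] for seq in msa)
        log_trans.append({
            (a, b): math.log((pair[(a, b)] + pseudocount)
                             / (prev[a] + pseudocount * k))
            for a in alphabet for b in alphabet
        })
    return log_start, log_trans


def log_probability(seq, log_start, log_trans):
    """Log-probability of seq under the first-order chain."""
    total = log_start[seq[0]]
    for i in range(1, len(seq)):
        total += log_trans[i - 1][(seq[i - 1], seq[i])]
    return total


def most_probable_sequence(log_start, log_trans, alphabet=ALPHABET):
    """Viterbi-style search for the highest-probability sequence."""
    score = dict(log_start)
    back = []
    for trans in log_trans:
        new_score, pointers = {}, {}
        for b in alphabet:
            best_prev = max(alphabet, key=lambda a: score[a] + trans[(a, b)])
            new_score[b] = score[best_prev] + trans[(best_prev, b)]
            pointers[b] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final residue to the first site.
    path = [max(alphabet, key=lambda a: score[a])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


log_start, log_trans = fit_markov(TOY_MSA)
best = most_probable_sequence(log_start, log_trans)
print(best, log_probability(best, log_start, log_trans))
```

Because each site is conditioned on its predecessor, the most probable sequence under the chain need not coincide with the per-column consensus. Capturing correlations between distant sites, such as those arising from tertiary structure, requires a richer model than this chain.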
  1. Wang Q, Buckle AM, Foster NW, Johnson CM, Fersht AR (1999). Design of highly stable functional GroEL minichaperones. Protein Sci. 8:2186-2193.
  2. Lehmann M, Kostrewa D, Wyss M, Brugger R, D’Arcy A, Pasamontes L, van Loon APGM (2000). From DNA sequence to improved functionality: using protein sequence comparisons to rapidly design a thermostable consensus phytase. Protein Eng. 13:49-57.
  3. Jacobs SA, Diem MD, Luo J, Teplyakov A, Obmolova G, Malia T, Gilliland GL, O'Neil KT (2012). Design of novel FN3 domains with high stability by a consensus sequence approach. Protein Eng Des Sel. 25(3):107-17.
  4. Petitjean F, Webb GI, Nicholson AE (2013). Scaling log-linear analysis to high-dimensional data. The 13th IEEE International Conference on Data Mining.