Monday, 25 January 2016

The Protein Folding Problem

What is a protein?

Most people are familiar with the word protein because quite a lot of food contains it and protein, as we are told, is good for muscular recovery. But proteins are much more than that. Without them our bodies couldn't function at all. They are the diverse workers which reside in our cells and allow our food to be digested properly, our brains to function to their full potential and our muscles to avoid wasting away.

Proteins are polymers, meaning they are strings of repeating subunits called monomers. The monomers of proteins are amino acids, of which there are 20 standard ones (21 if selenocysteine is counted):

Table of Amino Acids.

These different amino acids are joined together by peptide bonds to form the lengthy protein chains. Each peptide bond is formed by a chemical reaction between the carboxyl group of one amino acid and the amino group of the next. (Hydrogen bonds between the N-H and C=O groups along the chain come into play later, when they help hold the folded shape together.) As you can see, there are many amino acids, so the natural question to ask is: does the ordering of amino acids matter? The answer to that question is a huge yes! It matters because different amino acid sequences lead to different folded states of the protein polymer chain. These different folded states result in the protein doing a different job. If proteins folded the same way irrespective of their amino acid sequence, life as we know it could not have evolved all those billions of years ago.

It is interesting to note that DNA, the helical molecule which contains the important genetic code, has an overall structure which, to a good approximation, does NOT depend on the sequence of the bases inside the helix. This fact ensures the helix can be used as an efficient information storage medium.

So what's the problem?

So we have gathered that proteins are important and that their folded structure depends on the sequence of the amino acids along the chain. The problem here is: what causes the protein, which is initially a floating, spaghetti-like structure in water, to fold? Most useful proteins in biological organisms contain very long chains of amino acids and have a huge number of possible states that they could be in. In theory, the protein would have to make trial-and-error movements, testing each possible state, before finding the right 'native' state which it needs to be in to do its job.

This would require huge expanses of time, and evolution would obviously have sieved such slow folders out in the very early stages of life... and in a sense it did. It is experimentally known that proteins fold on a time scale of milliseconds to seconds, which presents a paradox: how does the protein get into the 'native state' without testing all, or even most, of the possible states? This is known as Levinthal's paradox.
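To get a feel for the numbers behind the paradox, here is a rough back-of-envelope sketch. The chain length, the number of states per amino acid and the sampling rate are all illustrative assumptions of mine, not measured values.

```python
# Back-of-envelope estimate of an exhaustive conformational search.
# All default values are illustrative assumptions, not measurements.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_search_years(residues=100, states_per_residue=3, samples_per_second=1e13):
    """Return (number of conformations, years needed to visit each one once)."""
    conformations = states_per_residue ** residues  # exact integer
    seconds = conformations / samples_per_second
    return conformations, seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    conformations, years = levinthal_search_years()
    print(f"{conformations:.2e} conformations, roughly {years:.1e} years to try them all")
```

Even with these fairly generous assumptions the exhaustive search takes vastly longer than the age of the universe, which is why a random search cannot be what the protein is actually doing.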

It is assumed (yes, only assumed) in the general statistical physics and chemistry community that what helps proteins fold so quickly is thermodynamics. In particular, the driving force for protein folding is largely entropic (entropy being a measure of disorder), and the protein moves towards states which minimize a quantity known as the Gibbs free energy. The protein chain starts out as a floppy, unfolded string. Because many of its amino acid side chains are hydrophobic (repelling water), the water around them forms ordered, ice-like cages. Burying those side chains inside a compact shape frees the caged water and raises its entropy, which forces the protein into a more compact shape. The most stable of these compact shapes is called the 'native state', and it is the state which has the lowest Gibbs free energy.
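For reference, the standard textbook relation behind this argument (at constant temperature and pressure) can be written as below. The symbols are the usual ones and are not defined elsewhere in this post: \Delta H is the enthalpy change, T the absolute temperature and \Delta S the entropy change of the whole system, water included.

```latex
\Delta G = \Delta H - T\,\Delta S ,
\qquad
\text{folding is favorable when } \Delta G < 0 .
```

A large positive \Delta S from the released water can therefore make \Delta G negative even when the chain itself becomes more ordered.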

However, all is not simple. There is not just one simple minimum for proteins. The minima of the Gibbs free energy function form a rough landscape full of wells of differing depth. The protein is undergoing Brownian fluctuations while also being forced into compact globules by its hydrophobic side chains, and so it travels through this energy landscape. Sometimes its 'shape' gets stuck in a temporary minimum well, but this may not be the most stable one, so the shape relaxes slightly until it is forced into another compact shape determined by yet another minimum well.

In reality the forces causing the protein to fold act at an atomistic level: the atoms of the water solution interact with the atoms of the protein's amino acids. Simulating these atomistic collisions is an extremely computationally demanding problem, which requires tracking the positions of many, many particles and computing huge numbers of tiny interactions. These simulations are called molecular dynamics, and some ingenious approximations have been made, but to simulate a whole protein fold solely based on molecular dynamics is... quite simply... near impossible currently.
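A quick estimate shows where the cost comes from. The sketch below assumes a folding time of one millisecond, a time step of 2 femtoseconds (a typical order of magnitude for atomistic molecular dynamics) and a hypothetical system of 50,000 atoms including water. All three numbers are my own illustrative choices.

```python
def md_cost_estimate(fold_time_s=1e-3, timestep_s=2e-15, n_atoms=50_000):
    """Return (time steps needed, naive number of atom pairs per step).

    All defaults are illustrative assumptions. The pair count is a naive
    upper bound: real molecular dynamics codes use cutoffs and neighbour
    lists to avoid evaluating every pair.
    """
    steps = fold_time_s / timestep_s
    pairs = n_atoms * (n_atoms - 1) // 2
    return steps, pairs


if __name__ == "__main__":
    steps, pairs = md_cost_estimate()
    print(f"about {steps:.1e} time steps, up to {pairs:.2e} pair interactions per step")
```

Hundreds of billions of time steps, each involving a very large number of interactions, is what makes a brute-force fold so expensive.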

So there are many problems...
  1. How can we prove analytically that proteins really do fold because they want to minimize their Gibbs free energy?
  2. How much computer power does it take to fully simulate, at an atomic level, the protein during folding?
  3. Can we fully replicate the stability of the folded protein in real time?
  4. Can we form any program which... given the sequence of amino acids could determine an immediate folded state?
These are some of the pressing questions in the field of theoretical and computational protein folding. Is there another, computationally inexpensive way to tackle them?

Taking a step back

One approach which is reviving the field is not to focus on the computationally heavy molecular dynamics, but to attack the density of states directly. If we assume the native state is at a minimum of some Gibbs-like energy function, we can simply sample the density of states and 'roam' the free energy landscape until we hit the deepest minimum.

This can be done using specialized Monte Carlo techniques, e.g. umbrella sampling and the Wang-Landau algorithm. These techniques start from a partition function (a sum over all possible states of a system, each weighted according to its energy) and sample the density of states by making trial changes to the system, i.e. moving a part of the protein to another position and seeing if the move is energetically favorable according to some acceptance criterion. If you are interested in Monte Carlo algorithms I recommend reading the original 1953 paper by Metropolis, Rosenbluth, Rosenbluth, Teller and Teller on the Metropolis algorithm.
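To make the idea of a trial move and an acceptance criterion concrete, here is a minimal sketch of the plain Metropolis rule, the simplest such criterion. It is not umbrella sampling or Wang-Landau, and it is not a protein model: the one-dimensional `energy` function is a toy rough landscape I made up, and the temperature, step size and starting point are arbitrary assumptions.

```python
import math
import random


def energy(x):
    """Toy rough landscape (hypothetical): a broad bowl with ripples on top."""
    return 0.1 * x * x + math.sin(3.0 * x)


def acceptance_probability(delta_e, kT):
    """Metropolis rule: always accept downhill moves, uphill ones with exp(-dE/kT)."""
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / kT)


def metropolis(n_steps=20000, kT=0.5, step_size=0.5, x0=4.0, seed=0):
    """Random walk over the landscape; returns the lowest-energy point visited."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_trial = x + rng.uniform(-step_size, step_size)  # trial move
        e_trial = energy(x_trial)
        if rng.random() < acceptance_probability(e_trial - e, kT):
            x, e = x_trial, e_trial  # accept the move
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


if __name__ == "__main__":
    best_x, best_e = metropolis()
    print(f"lowest energy visited: {best_e:.3f} at x = {best_x:.3f}")
```

Because uphill moves are sometimes accepted, the walker can climb out of a shallow well instead of staying stuck in it, which is exactly the behaviour needed on a rough landscape.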

In the field today there are improvements to Monte Carlo algorithms which allow bigger systems to be simulated and a better sampling of the free energy landscape. 

All in all, it is a very exciting area of research, one which has huge biological significance and could impact our medical world.