Just before we look at the results, I need to point out that what we’re doing in this experiment isn’t actually the way most proteins are said to have evolved. In evolutionary theory, new things evolve mostly from older things. New functions are usually said to be adaptations of older functions, and the same is true of proteins. New ones are often said to have evolved from previous ones. However, some must have evolved from scratch, if life arose by itself, otherwise we get an infinite regress of proteins coming out of proteins, but never any actual beginning.
Anyway, let’s calculate the results of our experiment. I will be talking about ridiculously large numbers here, but I will put them into some kind of context immediately afterwards, so they make more sense.
First, we need to know how many permutations our colony has to search through, to find the correct 60 letter sequence. There are 4 possible letters (A, C, G and T) for each nucleotide of DNA in our experiment, and 60 letters in a block, so the number of possible permutations is 4 multiplied by itself 60 times, or 4 to the power of 60, which is about 1.3 multiplied by 10 to the power of 36. We need to divide this number by two since, as in our combination lock example, if we repeated the experiment endlessly, on average our colony can expect to find the right sequence after trying half of the possibilities. This gives us an average of about 6.5 multiplied by 10 to the power of 35.
However, when an individual bacterium replicates, most of the time the block won’t contain a mutation, since it’s only 60 letters long. To find out how many actual trials our colony would need to run, we need to divide the average we calculated above by the simplified probability of the block containing a mutation, which we assumed to be about 1/17,000,000. When we do this, we get about 10 to the power of 43 trials, which is 1 with 43 zeros after it. This is how many bacterial cells we would need over the one billion year length of our experiment, to have a reasonable chance of finding the first 60 letter block in our desired protein.
Due to its size, this number is pretty meaningless unless we can put it into some kind of context. As a useful comparison, it has been estimated that the number of single-celled organisms on Earth is about 5 multiplied by 10 to the power of 30, and all the bacteria that have ever existed, assuming the ordinary evolutionary timescale, probably have an upper limit of around 10 to the power of 40, or 1 with 40 zeros after it.¹
In other words, our experiment as it currently stands would require a thousand times more bacteria than have ever supposedly existed on Earth, just to find a specific sequence of 60 letters coding for a chain of 20 amino acids, which is a small fraction of the size of a protein used by bacteria today!
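The arithmetic above can be checked with a short Python sketch. It uses only the figures already assumed in the text (4 letters, a 60 letter block, and a mutation rate of about one per billion base pairs); the function and variable names are my own, chosen for illustration.

```python
# Figures assumed in the text: 4 DNA letters, a 60-letter block, and a
# mutation rate of roughly one per billion base pairs per replication.
ALPHABET_SIZE = 4
BLOCK_LENGTH = 60
MUTATION_RATE = 1e-9  # per letter, per replication (simplified)


def expected_trials(block_length, mutation_rate=MUTATION_RATE):
    """Rough number of replications needed to find one specific block."""
    sequences = ALPHABET_SIZE ** block_length          # every possible block
    average_tries = sequences / 2                      # on average, half are tried
    p_block_mutates = block_length * mutation_rate     # about 1 in 17 million for 60 letters
    return average_tries / p_block_mutates


trials = expected_trials(BLOCK_LENGTH)
print(f"{trials:.1e}")  # prints 1.1e+43
```

As in the text, the exact figure matters less than its order of magnitude.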
Now, this doesn’t really prove or disprove anything. Ultimately, it’s just a mathematical model that is based on certain starting assumptions, and we can always adjust these to get a different outcome. However, the experiment is still useful, because it gives us a better sense of what may be needed for nature to find a particular sequence of amino acids from scratch.
The experiment, at least in the way we have currently set it up, implies that finding a specific sequence of even just 20 amino acids de novo is virtually impossible within the ordinary evolutionary timescale, at least using a natural search. Yet bacteria today have thousands of proteins available to them, averaging over 250 amino acids in length, so either they evolved them in spite of our thought experiment, or they were given them.
If bacteria really did evolve at least some of their proteins from scratch, then obviously something must be wrong with our experiment. Maybe we simplified it too much. Let’s first look at what we didn’t factor in and see whether these could help us evolve our protein.
We have ignored the fact that some codons code for the same amino acid, and that some amino acids in a protein can be swapped out for different ones, but the protein can still often perform the same core function.
If we factored this built-in redundancy into our experiment, it would probably make the evolution of a protein easier. On the other hand, we wanted the end result to be a specific chain of amino acids with the same letters as a protein used by bacteria today, not just one with the same function but different letters or amino acids.
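To get a feel for how much redundancy we left out, here is a rough Python sketch that counts how many different DNA sequences would code for the same chain of amino acids under the standard genetic code. The example chain is hypothetical and picked only for illustration, and the sketch ignores the second kind of redundancy mentioned above, where one amino acid can be swapped for another.

```python
from collections import Counter
from math import prod

# Standard genetic code, written in the conventional T, C, A, G ordering.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# How many codons code for each amino acid (for example, 6 for leucine "L").
SYNONYMS = Counter(CODON_TABLE.values())


def encodings(peptide):
    """Number of distinct DNA sequences that code for this amino acid chain."""
    return prod(SYNONYMS[amino_acid] for amino_acid in peptide)


example_chain = "MKLVRSTAGD"  # hypothetical 10 amino acid chain, not from the text
print(encodings(example_chain))  # prints 221184
```

Every extra sequence that codes for the same chain is one more acceptable target, which is why factoring in redundancy would make the search easier, though only if we stop insisting on the exact same letters.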
We have focused on one type of mutation, where DNA letters get switched. This is the most common type, but there are others. Sometimes letters get added or deleted, or sequences of DNA get duplicated. These types of mutation could help, but they could just as easily break any progress made towards our new protein, so in the long run they might not actually help. I guess it depends on the type of protein we’re trying to evolve.
We have also ignored stop codons. These instruct the ribosome to stop making the protein. In the standard genetic code, which bacteria also use, UAA, UAG, and UGA are stop codons (TAA, TAG, and TGA in DNA letters). Stop codons could be a significant hindrance to long amino acid chains evolving. The longer a chain gets, the more likely it is that a random stop codon could appear in the sequence as a result of mutations, causing the evolution of our new protein to halt, at least until the stop codon mutates back into one that codes for an amino acid. If we factored these in, it could slow down protein evolution.
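As a rough illustration of why longer chains are more exposed to stop codons, the Python sketch below works out the chance that a run of codons contains no stop codon at all. It assumes every codon is drawn uniformly at random from the 64 possibilities, which is my own simplification and a harsher setting than a sequence that is only mutating occasionally.

```python
STOP_CODONS = 3    # the three stop codons of the standard genetic code
TOTAL_CODONS = 64  # 4 letters in each of 3 positions


def p_no_stop(codons):
    """Chance that a uniformly random run of codons contains no stop codon."""
    p_sense = (TOTAL_CODONS - STOP_CODONS) / TOTAL_CODONS  # 61 out of 64
    return p_sense ** codons


# Compare a 20 amino acid block with the full 240 amino acid protein.
for length in (20, 80, 240):
    print(f"{length} codons: {p_no_stop(length):.2e}")
```

Under this assumption the chance of avoiding a stop codon shrinks quickly as the chain grows, which matches the point above that stop codons matter most for long chains.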
Could our assumptions also be part of the problem? We have assumed our colony is together for a billion years, and that once one of the cells arrives at an evolutionary milestone, all bacteria instantly adopt it, as a crude approximation of natural selection.
However, if the colony is dispersed at any time, it wouldn’t be possible for all of the cells to adopt it. In reality, bacteria are living organisms that might prefer not to stay together in the same colony for a billion years. In other words, this assumption is heavily skewed in favor of evolution.
We also assumed a mutation rate of about one in every billion base pairs. We could increase this, to try and speed up evolution, but then our colony would have to work harder to preserve any gains it had achieved, which might not actually help us in getting to our desired protein. If the mutation rate were too high, this would make the genomes of our bacteria unstable for later generations, resulting in far more cell deaths.
Built into our experiment is also the underlying assumption that things can only get better. We have allowed the colony to keep an evolutionary milestone once found, and we have shielded that milestone from any further mutation. This allows the protein to evolve in cumulative steps. In evolutionary theory, these intermediate steps are preserved because they usually give the organism a survival or reproduction advantage over its fellow organisms, and its offspring inherit this advantage.
However, I think the most significant assumption we made at the start of our experiment is the number of intermediate steps allowed. We needed these steps, so our colony had a chance to evolve the protein. I called them “evolutionary milestones,” to remind us that each step would need to make a significant difference to the survival or replication ability of a cell, so the step could be preserved by natural selection.
In our experiment, the protein we were aiming for consisted of 240 amino acids taking up 720 letters of DNA. We allowed the colony to try and find this sequence in 12 intermediate steps, by breaking up the sequence into 12 blocks of 60 letters each.
The colony only needed to find the correct sequence for the first block, because if it could successfully do this, we can assume it could also find the others, given more time or a larger colony.
However, in our experiment, the colony couldn’t even find the first block of 60 letters of DNA, which would represent a sequence of 20 amino acids, even if we hired all the bacteria that have ever supposedly existed on Earth, suggesting that our desired protein couldn’t evolve from scratch, based on these initial assumptions.
The solution, at least as far as our model is concerned, is to allow for more intermediate steps. If we allowed 16 steps, the colony would only need to find blocks of 15 amino acids at a time, consisting of 45 letters of DNA. To find a specific sequence of 45 letters would require trials numbering about 10 to the power of 34, about a million times fewer than the estimated number of bacteria to have ever lived on Earth, according to the evolutionary timescale.²
If we allow 24 intermediate steps, our colony would only need to find 10 amino acid blocks at a time, made up of 30 letters of DNA. To find one block would require trials numbering around 10 to the power of 25, which is just a minuscule fraction of all the bacteria that have ever supposedly lived on Earth.³
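The effect of allowing more steps can be sketched in Python as follows. The sketch reuses the same simplified model (half of all possible sequences, divided by the chance that the block contains a mutation) and compares each result with the upper limit of 10 to the power of 40 bacteria quoted earlier. The names are my own, and the unrounded figure for 30 letter blocks comes out slightly higher than the rounded one in the footnote, though still around 10 to the power of 25.

```python
PROTEIN_AMINO_ACIDS = 240
LETTERS_PER_AMINO_ACID = 3
MUTATION_RATE = 1e-9   # per letter, per replication (simplified, as in the text)
BACTERIA_EVER = 1e40   # upper limit quoted in the text


def trials_for_steps(steps):
    """Return (block length in letters, rough trials needed for one block)."""
    block_letters = PROTEIN_AMINO_ACIDS * LETTERS_PER_AMINO_ACID // steps
    average_tries = 4 ** block_letters / 2
    p_block_mutates = block_letters * MUTATION_RATE
    return block_letters, average_tries / p_block_mutates


for steps in (12, 16, 24):
    letters, trials = trials_for_steps(steps)
    share = trials / BACTERIA_EVER
    print(f"{steps} steps: {letters} letter blocks, about {trials:.1e} trials, "
          f"{share:.1e} of the upper limit")
```

Each time the block shrinks by 15 letters, the number of trials falls by roughly nine orders of magnitude, which is why the number of intermediate steps dominates every other assumption in the model.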
In other words, in this model of protein evolution, the evolution of our desired protein consisting of 240 amino acids is at least theoretically possible, if we allow it to happen in much smaller steps.
If nothing else, this experiment has served one useful purpose. It suggests that if nature has to evolve at least some lengthy proteins from scratch, it probably needs to start with pieces much smaller than 20 amino acids in length, which could perhaps be building blocks for larger pieces. On the other hand, nature as we know it today prefers much longer proteins. As I said earlier, the average size of a protein in real bacteria is over 250 amino acids in length, because long proteins can perform functions that short ones can’t do.
Incidentally, this is why taking a line of random English letters, and turning it into a line from a Shakespeare play one letter at a time, is a poor analogy for what evolution has to do when it comes to evolving a protein.
A single “letter” change probably gives no survival or reproductive advantage to an organism, and the built-in redundancy of the genetic code ensures that the effects of this change are often minimized anyway. But if there is no advantage, then natural selection won’t preserve it. Changes that might confer an advantage probably need to be at the level of “words,” “sentences” or “paragraphs” most of the time.
Therefore, in the story of how a real protein of 250 amino acids in length evolved, any analogy involving lines from Shakespeare would need to end with a 750 letter section of his play. It would also need to show the gradual change of words and sentences into different words and sentences, and demonstrate how each of the pieces making up the section of his play could still be read, understood and make sense at each evolutionary milestone along the way.
1 Whitman, W B et al, “Prokaryotes: the unseen majority”, PNAS, 1998.
2 There are 4⁴⁵ permutations of 45 DNA letters, or about 1.2×10²⁷. If we halve this we get 6×10²⁶. With a one in a billion mutation rate, the simplified probability of a mutation occurring in 45 letters would be 45 divided by 1 billion, or 0.000000045. Dividing 6×10²⁶ by 0.000000045 gives us about 1.3×10³⁴. The exact numbers would be a little different, but what matters here is the size of the number.
3 There are 4³⁰ permutations of 30 DNA letters, or about 1×10¹⁸. Halving this gives 5×10¹⁷. With a mutation rate of one in a billion, the simplified probability of a mutation occurring in 30 letters would be 0.00000003. Dividing 5×10¹⁷ by 0.00000003 gives us about 1.6×10²⁵, roughly the number of trials that would need to be run.