As part of the biology degree at Georgia Tech, students must complete a senior thesis research project in order to graduate. Students can either independently join a research lab and conduct their own project or, if that's not an option, take BIOS 4590, a Research Project Lab where students conduct somewhat guided projects in a situation very similar to generic biology lab classes. Either way, students must also take a Communicating Biological Research course concurrently with their research. That way, we all get to see the research everyone is doing, regardless of whether the research is conducted in a lab or class.

In the spring months preceding the 2015 summer semester, I started looking around at the biology labs at GT, thinking about which professors I wanted to work with and what fields of biology I was most interested in exploring more. Dr. Eric Gaucher co-taught my Evolutionary Biology class and gave one of the few lectures I still remember individually as a stand-out, mind-blowing class a decade later (it was about panspermia). It was just my luck that a graduate student in Eric's lab, Ryan Randall (she/her), was collecting a small army of undergraduates to tackle a rather large, multi-faceted research project at the same time I reached out to Eric. Two interviews later and I had the job! I started working in the lab to get my bearings in the last month of the spring semester (so, simultaneously with both the CTA project and the neutron star research), as this project was going to really ramp up in the summer.

Unlike the unfortunate situation with the CTA project where I didn't have any of the actual research materials to include and discuss, I have a presentation, poster, and the thesis I wrote as part of this project. Tons of material to remind me, 8 years later, what I did. Let's get to it!

The Short Version

How do we generate phylogenies or evolutionary trees of extinct organisms? One way is through the use of algorithms that can compare the DNA of the living organisms and work backwards to recreate the tree as accurately as possible. The question is, how accurate are these algorithms? This project aimed at comparing the accuracy of various phylogenetic reconstruction algorithms in inferring not just the genotypes (the DNA basis of our traits) of reconstructed ancestors but the phenotypic accuracy (the actual presentation of those traits) as well. Each of the five algorithms generated a phylogeny using tip sequences from an experimentally-derived mRFP (a type of fluorescent protein) phylogeny, allowing for complete comparisons of both genotypes and phenotypes (in this instance, the color and brightness of the fluorescence).

Working as part of a team of undergraduates, I ran SDS-PAGE gels for each node, comparing the protein structure of the true ancestors with each incorrectly inferred variant generated by the algorithms. We used these data to identify variants with substantial structural differences, providing rationale for significant differences among the other data collected by other team members for the ancestors and variants. In addition, I helped grow and maintain the E. coli colonies with all that that entails and performed preliminary analyses of all the data as part of my thesis.

Let's Talk Science

This section is all about covering background information to help make the discussion of this research project more accessible. I personally get the most excited about science when I have the information necessary to both understand and contextualize the research, and I want to make that possible for anyone reading these write-ups, regardless of your own background and training.

Phylogenies, Phenotypes, and Genotypes

When an evolutionary biologist casually says 'tree', they might not mean an actual tree with leaves and bark and whatnot. They could mean a phylogenetic tree, or phylogeny, which exhibits the branching, fractal patterns common to actual trees - hence the name. Phylogenetic trees visualize the evolutionary history between a set of organisms. What evolved into what? Who is more closely related to whom?

generlle_morpholige.webp
Haeckel's tree of life, published in 1866, showing the three main branches of life at the time, Plantae, Protista, and Animalia.

Most phylogenies made today are a bit more minimal in design than the tree above. The long straight lines of a phylogeny are called branches or lineages, the very ends of the tree where individual groups or species are located are called leaves or tips, and the bifurcating points of branches are called nodes. The branches grow through time - the closer to the tips a particular lineage originates, the more recent that group or species evolved. The closer to the base of the tree, or root, a branch originates, the older that particular lineage is.

fishphylo.webp
Modern phylogenetic design showing the evolution of some vertebrates.
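
If it helps to see the vocabulary in a more concrete form, here's a minimal Python sketch of a phylogeny as a data structure. The species names and the branching order are hypothetical placeholders of mine, not taken from the figure above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A point on the tree: a tip if it has no children, otherwise an internal node."""
    name: str
    children: list["Node"] = field(default_factory=list)

    def is_tip(self) -> bool:
        return not self.children


def tips(node: Node) -> list[str]:
    """Collect the names of all tips (leaves) descending from this node."""
    if node.is_tip():
        return [node.name]
    found = []
    for child in node.children:
        found.extend(tips(child))
    return found


# Hypothetical tree: the root splits into one tip and one internal node,
# which in turn splits into two more tips.
root = Node("root", [
    Node("species A"),
    Node("node 1", [Node("species B"), Node("species C")]),
])

print(tips(root))  # ['species A', 'species B', 'species C']
```

In this toy tree, species B and C share a more recent node with each other than either does with species A.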

Before genetic testing became all the rage, phylogenies were originally created by looking at the physical characteristics or traits of different organisms. "If it looks like a duck and quacks like a duck, it might be related to the duck." This is why swamp eels and electric eels are named such, despite not being actual eels. Or why jellyfish and starfish (which have since been rebranded as sea stars) have 'fish' in the name despite not even being vertebrates. Suffice it to say, this was an imperfect science. Genetic testing is far more robust in determining relatedness of organisms.

Another way of saying this is, today we prefer phylogenies based on genotypes over phenotypes.

For example, say we're curious about a trait in humans like eye color. We can very easily describe the different phenotypes - e.g. blue, green, hazel, brown. Phenotypes are simply the observable characteristics of a given trait. Even behavior can be phenotypical, like that exhibited by people with OCD or schizophrenia. Genotypes, on the other hand, are the genetic bases for these observable differences. A genotype can refer to either an organism's entire genetic material or just the genes underlying a particular phenotype (the more specific word for this is 'allele(s)').

Difference-Between-Genotype-and-Phenotype-Comparison-Summary-1.webp
Comparison of aspects of a genotype versus phenotype, such as heritability.

Connecting traits back to the phylogeny for a quick second, traits shared between branches are called 'ancestral traits' while traits specific to a particular group or branch are called 'derived'. Using mammals and primates as an example, all mammals have hair and milk production, but only primates have opposable thumbs. Therefore hair can be considered an ancestral trait for primates while opposable thumbs are a derived trait.

Interpreting phylogenies and especially relatedness can be very tricky. This is neither the time nor the place to cover these nuances and intricacies, but if you're curious, this article covers a lot of it. I highly recommend at least reading the misconceptions section ('How Not to Read Evolutionary Trees').

Let's say the gene (aka sequence) responsible for eye color looks like 'GATTACA' when the eyes are blue and 'GATTAGT' when eyes are green. (They don't, for the record. Genes for eye color are far more complicated than this, but roll with it for now.) The differences in the letters, or nucleotides (adenine, guanine, cytosine, and thymine), are the genotypic differences that result in phenotypic differences - green vs blue.

NB: I don't mean to imply that all phenotypic differences are due to genetic differences. Phenotypic plasticity is when an organism's environment influences the organism's appearance or behavior without requiring a genetic difference arise first. This is a growing field of research, especially as it relates to biomechanics. A somewhat superficial example is human hair color getting lighter in the summer months and darker in the winter months due to sun exposure. The genes responsible for our hair aren't subject to a biannual mutation cycle, but our phenotype still changes. Plasticity, baby.

Perhaps the gene for brown eyes is something like 'GATTGGT'. Can we infer anything about eye color evolution based on the fact that the genes for blue and green eyes are more similar to each other than to the gene for brown eyes? Before we find out, a very quick primer on nucleotides.

Nucleotides

Nucleotides are literally the building blocks of our DNA and fortunately for us, there is some very strict rhyme and reason to how the building blocks are arranged. Adenine always pairs with Thymine and Guanine always pairs with Cytosine. We call these the 'base pairs'. If you know one side of the equation, you know the other because you know what the nucleotides on the other side must be.

DNA.webp
Collage of diagrams showing the helical structure and base pair composition of DNA as well as the chemical structure of the four nucleotides found in DNA, adenine which pairs with thymine and guanine which pairs with cytosine.
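
The "know one side, know the other" idea is easy to sketch in a few lines of Python. This is a simplified illustration of the pairing rule only; it ignores strand direction and everything else about real replication, and the function name is my own.

```python
# Base-pairing rules: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the partner strand, base by base."""
    return "".join(PAIRS[base] for base in strand)


print(complement("GATTACA"))  # CTAATGT
```

Applying the function twice gets you back to the strand you started with, which is the whole trick behind rebuilding the missing side.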

DNA replication is a rather hefty topic and this is the cliff notes' cliff notes version. DNA is a helical structure composed of two backbones attached to each other via the base pairs. When DNA replicates itself, it does so by splitting into the two individual backbones, separating the base pair connections. Enzymes will attach themselves to the now single backbone strands and use the pairing system to rebuild the other side. At the end of the process, we have two, ideally identical, strands of DNA. But mistakes happen.

When you have a single nucleotide difference between genes, it's called a SNP - single nucleotide polymorphism. There are different types of SNPs. Let's take the sequence 'CATG' as our example.

  • Insertion: the enzyme introduces an extra nucleotide where previously there was none, so the outcome looks like "CATTG".

  • Deletion: the enzyme performs a magic trick and makes a nucleotide disappear! The outcome might look like "CAG".

  • Substitution: the enzyme confuses one nucleotide for another, so instead the outcome might look like 'CGTG'. Structurally, adenine and guanine have very similar make-ups to each other; same with cytosine and thymine. This similarity can make 'misinterpretations' common, and some types of misinterpretations more common than others. That is, it's 'harder' or less likely for an enzyme to misinterpret thymine as adenine than as cytosine because the structures between thymine and cytosine are more similar to each other than thymine is to adenine, despite the fact that they make up a base pair. But these harder substitutions can still happen.
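
The three mistakes above can be reproduced as plain string operations. This is just a sketch of the bookkeeping on the 'CATG' example, with function names I made up, not a model of what the enzymes actually do.

```python
def insert_base(seq: str, position: int, base: str) -> str:
    """Insertion: add an extra base before the given position."""
    return seq[:position] + base + seq[position:]


def delete_base(seq: str, position: int) -> str:
    """Deletion: drop the base at the given position."""
    return seq[:position] + seq[position + 1:]


def substitute_base(seq: str, position: int, base: str) -> str:
    """Substitution: swap the base at the given position for another."""
    return seq[:position] + base + seq[position + 1:]


original = "CATG"
print(insert_base(original, 2, "T"))      # CATTG
print(delete_base(original, 2))           # CAG
print(substitute_base(original, 1, "G"))  # CGTG
```

Note that only the substitution keeps the sequence the same length.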

Our cells have various ways of dealing with these mutations when they occur, but one pretty nifty method is simply building redundancy into the transcription and translation system. DNA is transcribed into RNA, and then RNA is translated into proteins. Proteins are simply various amino acids (that the body either makes or collects from our diet) strung together into a chain. There are only 20 amino acids or so, but endless combinations of A, T, G, and C, especially if length doesn't matter. So, cleverly, translation is based on sets of 3 nucleotides, called codons.

transcription_and_translation_mrna-2198106625.jpg
Diagram of translation starting at mRNA and finishing with protein chains.

As the RNA is read by the translation enzymes, the enzymes read the nucleotides in sets of 3 and these codons correspond to a particular amino acid. Other enzymes then fetch that amino acid and chain it together with the preceding or following amino acids to create a protein chain.

You can see the issue here, I bet. SNPs could cause major havoc on this system. One letter out of place, the wrong amino acid gets attached to the protein chain, the finished protein doesn't do its job properly, if at all, and then you die. Okay, maybe not die, but nothing good comes from mutations that have a functional effect on your body's inner workings. What's a body to do? Build in redundancy.

There are more 3 letter combinations than amino acids, so several different codons call for the same amino acid. Now, if a codon is particularly important, such as the codon that signals the start of the gene sequence (AUG), it's not redundant. If SNPs mess those suckers up, you want the machinery to stop working. Otherwise, there's some wiggle room!

Figure_15_02_05-3342436607.png
Table showing the different first, second, and third letter combinations and the amino acids, included as abbreviations, that they code for.

This redundancy means there is not a perfect one-to-one correlation between genotype and phenotype. You can have a couple of variations of a gene due to SNPs that all still ultimately do the same thing. Nifty.
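
Here's a small Python sketch of that redundancy using the standard codon table (with single-letter amino acid codes). The three short mRNA strings are made-up examples of mine, and the `translate` function is a bare-bones illustration rather than a real translation tool.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed in UCAG order for each codon position.
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read an mRNA string three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


original = "AUGGGUUAA"  # Met, Gly, stop
silent = "AUGGGCUAA"    # third base of the Gly codon changed: still Met, Gly
changed = "AUGGAUUAA"   # second base of the Gly codon changed: Gly becomes Asp

print(translate(original), translate(silent), translate(changed))  # MG MG MD
```

The first two sequences differ by one nucleotide but produce the same protein, while the third produces a different one. Same kind of SNP, very different consequences.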

Phylogenetic Inference

Phylogenetic inference is a field of computational biology where algorithms are used to reconstruct an unknown phylogeny. In the case of our eye color example, we know the sequences of the tips only - we don't know what the tree looks like. While it looks like blue and green might be more closely related to each other than either to brown, how do we know for certain without a phylogeny? How do we know which eye color is older than the others, if any (i.e. is one color more ancestral than the others or are they all about the same age)? We can use algorithms to reconstruct the tree using only the information given by the tips. This process is called ancestral sequence reconstruction (ASR).

Different algorithms use different guiding principles in order to make deductions about the structure of a tree.

  • Maximum Parsimony (MP), also known as "Occam's Razor", aims to simply minimize the total number of SNPs between genes. Regarding our eye sequences, there are three SNPs between blue (GATTACA) and brown (GATTGGT), two between blue and green (GATTAGT), and only one between green and brown. Therefore we might conclude that brown and green are more closely related to each other than to blue.

  • Phylogenetic Analysis by Maximum Likelihood (PAML) uses known information about things like SNP probabilities - some mutations are more likely than others - to make inferences. Taking another look at our eye sequences, we see what changed between blue and green are the last two bases, from 'CA' to 'GT' (two harder swaps); between blue and brown, it goes from 'ACA' to 'GGT' (one easy and two hard swaps); and between green and brown, from 'AGT' to 'GGT' (one easy swap). Based on this information, we might conclude that green and brown are more closely related to each other than either is to blue. (Again, this is all an oversimplification, but it gets the idea across.) And SNP probabilities are not the only aspects of a phylogeny that we have prior knowledge on. Branch lengths, i.e. how much time there is for SNPs to happen (more time = more SNPs), also play a role.

  • Bayesian Reconstruction (PhyloBayes or PHYLO), like PAML, tries to use as much prior knowledge as possible to make calculations about the probability of the tree given the processes that must occur in order for that particular tree to exist (like SNP probabilities), calculates the likelihood that the tree is correct given the prior probabilities, and returns something called a 'posterior probability' that suggests which outcomes are more likely given the combination of prior probabilities and likelihood of tree correctness. If your brain is hurting like mine does after trying to wrap it around Bayesian statistics, check out this example using fair and loaded dice. PHYLO might take a look at our eye genes and conclude that not only are green and brown more similar genetically, but that brown is more likely to be the ancestral eye color, while blue and green are more recently mutated. It took one easy SNP to go from brown to green, and then two less likely SNPs to generate blue. PHYLO will weigh the probability of this tree against the exact opposite (two unlikely but possible SNP mutations to go from blue to green, and then just one easy random one to go from green to brown) and combine these likelihoods to generate a posterior probability of which option is favored statistically. With my very simple example, it's 50-50 odds, but once the tree gets way more complicated? Who knows. And source for the graphic below (contentful doesn't let me hyperlink in captions).

bayes.jpeg
Graphical illustration of Bayesian inference results, from Huelsenbeck et al 2001.
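
To make the counting in the parsimony and likelihood bullets concrete, here's a minimal Python sketch that tallies the differences between the three toy eye color sequences and labels each one as an 'easy' or 'hard' swap. The easy/hard split follows the structural similarity described earlier (adenine with guanine, cytosine with thymine). The function names are my own, and this is nowhere near what MP or PAML actually compute; it only reproduces the back-of-the-envelope comparison.

```python
from itertools import combinations

# Toy sequences from the eye color example (not real genes).
SEQUENCES = {"blue": "GATTACA", "green": "GATTAGT", "brown": "GATTGGT"}

# A and G are structurally similar to each other; so are C and T.
PURINES = {"A", "G"}


def classify(a: str, b: str) -> str:
    """'easy' if both bases come from the same structural family, else 'hard'."""
    return "easy" if (a in PURINES) == (b in PURINES) else "hard"


def differences(seq1: str, seq2: str) -> list[str]:
    """Label every position where two equal-length sequences differ."""
    return [classify(a, b) for a, b in zip(seq1, seq2) if a != b]


for (name1, seq1), (name2, seq2) in combinations(SEQUENCES.items(), 2):
    swaps = differences(seq1, seq2)
    print(f"{name1} vs {name2}: {len(swaps)} {swaps}")
# blue vs green: 2 ['hard', 'hard']
# blue vs brown: 3 ['easy', 'hard', 'hard']
# green vs brown: 1 ['easy']
```

Counting alone (the parsimony view) and weighting by swap difficulty (the likelihood view) both point to green and brown as the closest pair in this toy example.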

There are many, many more algorithms, but these three represent the main categories used in the project and the different approaches to phylogenetic reconstruction. We'll throw a few more into the mix in a bit. Now that we know what phylogenies are and different ways to infer them from just tip information, what phylogeny are we reconstructing?

Fluorescent Proteins

In this project, we were reconstructing a phylogeny of the protein mRFP. mRFP stands for monomeric red fluorescent protein. We'll cover what 'monomeric' means in a bit, so for now let's focus on the 'fluorescent' part of the name.

In the CTA project background information section, I talked about how light is a type of electromagnetic radiation and how we can use the wavelength of this radiation to classify light across a spectrum, the electromagnetic spectrum. Now, let's talk about what happens when light hits something.

electromagnetic-spectrum-nasa.webp
Diagram of the EM spectrum.

What happens varies based on what the light hits, or more accurately, the atomic structure of what the light hits. All material is made of atoms and bonds. Taking the structure of salt as an example of a solid material, there are sodium atoms and chlorine atoms and connecting them all together are chemical bonds. Even in something we perceive as 'solid', there is a lot of empty space. There is even more space between atoms in liquids and gases.

Sodium-Chloride-Crystal-Structure-700x490.webp
Schematic of the atomic structure of a salt crystal.

When light hits the surface of a material, four things can happen. First, it could hit something solid like an atomic nucleus and bounce back. This is called reflection. The color of the object plays a role as well. For example, a red shirt will absorb all light that has a wavelength of less than 700nm and the light that is reflected will have a wavelength of exactly 700nm - which reads as red to us. Whatever our eyes detect as color of the object is the wavelength of light not absorbed by that object. (So is the red shirt red or is it every color but red? ;) ) Reflected light normally bounces back at an angle similar to the angle of impact, but mirrored.

interference_refraction_reflection.webp
Diagram showing the different paths light can take through material before hitting our eye.

Second, the light could actually enter the object, moving through the space between the atoms into the inner layers of atoms. However, chances are high, especially in liquids, that the light will eventually hit something in the object and bounce back out again. This is called refraction. Think of it like pinball on the atomic level. The more it bounces around (and the more random the atomic arrangements are within the object), the more energy the light loses and the more random the angle at which the light exits the object becomes compared to the angle at which it entered.

refraction-plastic-block.webp
Example of light refracting through a translucent plastic block. The atomic structure of this translucent block is rather organized and uniform, so the light follows a very predictable path through.

This is also how we get that famous illustration of light hitting a crystal prism creating a rainbow. White light looks white because, weirdly, it contains light of every visible frequency. When it hits the prism, those individual frequencies will refract at different angles, splitting the white beam of light into its individual parts.

7710-004-B20BB331.webp
White light refracting through a prism into the individual colors of the rainbow.

The third outcome is that light is absorbed entirely by the object it hits. This produces shadows, naturally, but also one other side effect. The light's energy hits the atom and all that extra energy vibrates the atom's bits and baubles. We interpret this vibration as 'heat'. This is why sitting in the sun warms you - you are absorbing the light's energy and it's vibrating your atomic bits and baubles, making you feel warmer. (Vibrate your bits and baubles too much and you get cancer. Sun tanning not recommended.)

Lastly, fluorescence is a fun combination of both absorption and reflection. The light is initially absorbed by the material it hits, but after some period of time (i.e. nanoseconds) the material radiates that energy back out. The light that comes out is different from the light that initially hit the material, though. It has lost whatever amount of energy went into vibrating the bits and baubles first. One of the more common examples of this is when ultraviolet light hits an object and is radiated away as less energetic visible light. You can thank this physics for every blacklight party or mini-golf venue you've ever been to!
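
For a rough sense of the energy difference, here's a short Python sketch using the standard relation between a photon's energy and its wavelength (energy = Planck's constant × speed of light / wavelength). The two wavelengths are illustrative picks of mine for "ultraviolet" and "visible green", not measurements from any particular material.

```python
PLANCK = 6.62607015e-34          # Planck constant, in joule-seconds
LIGHT_SPEED = 299_792_458        # speed of light, in meters per second
JOULES_PER_EV = 1.602176634e-19  # conversion from joules to electronvolts


def photon_energy_ev(wavelength_nm: float) -> float:
    """Energy of one photon in electronvolts, from E = h * c / wavelength."""
    return PLANCK * LIGHT_SPEED / (wavelength_nm * 1e-9) / JOULES_PER_EV


absorbed = photon_energy_ev(365)  # illustrative ultraviolet wavelength
emitted = photon_energy_ev(520)   # illustrative green wavelength
print(f"absorbed {absorbed:.2f} eV, emitted {emitted:.2f} eV, lost {absorbed - emitted:.2f} eV")
```

The longer the wavelength, the less energy each photon carries, which is why the light coming out sits further toward the red end of the spectrum than the light going in.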

IMG_20190727_200808.jpg
Me enjoying some blacklight mini golf with friends (not pictured)! Despite my expression, I am, in fact, having a blast.

Fluorescent proteins (FPs) as a stand-alone concept should hopefully be rather straightforward now. They are proteins capable of absorbing light and radiating it back out at a different wavelength. mRFP is one such example, and remembering that the 'R' here stands for 'red', you can probably guess which wavelength of light this protein emits. FPs are found all over the place in nature, such as GFP - Green Fluorescent Protein. GFP is found naturally in a jellyfish species and has become famous for its use in biological imaging methods. Have you ever seen an image of a cell or tissue sample that was colored all sorts of funky colors? Fluorescent proteins!

GFP.webp
An image showcasing two different types of brain cells, one glowing green from GFP and the other red.

And of course researchers can't resist the urge to go all mad-scientist and tinker with the protein structure of naturally-derived FPs to make them brighter and perhaps even different, unnatural colors! mRFP is a mutated, smaller version of DsRed, a red FP found in a coral species. mRFP was made specifically for use in tissue tagging in 2002.

N.B.: Not all light produced by living organisms is 'fluorescence'. Bioluminescence and phosphorescence are two related concepts. Phosphorescence is like fluorescence but the excited state lasts much longer than nanoseconds. This stability leads to a slow and steady emission of light over time rather than dumping it all out at once like fluorescence does. Think glow-in-the-dark stickers and such. Bioluminescence is when a couple of chemicals are thrown together inside of a biological structure and light is a product of the chemical reaction. Think firefly or lanternfish.

Tags and Vectors

Tagging is more like a bonus topic that I'm opting to include because it's pretty freaking cool, but if you just want to read the first three paragraphs of this section, that's all good.

Chances are good that you have probably encountered the idea that certain hormones are associated with certain organs. Insulin with the pancreas for example, or adrenaline with the adrenal glands sitting high atop the kidneys, or perhaps the sex hormones estrogen and testosterone with our respective reproductive organs. Have you ever wondered how we know that's where the hormones are made or used? Or how we know what types of cells make up... well, every single bit of our bodies? Or how we even know what proteins are responsible for converting DNA into more proteins? You can thank tagging.

Tagging first involves identifying a specific target like a protein and identifying the genes in DNA responsible for the creation of that protein. Once you know the target genes in the DNA, you can design a vector. Vectors are basically engineered viruses. You might recall that viruses work by inserting their own genes into our DNA. This way, when our DNA is transcribed (DNA -> RNA) and translated (RNA -> proteins) from genes to proteins, the viral genes are also expressed, thereby tricking a cell's protein making factory into making more viruses that can infect more cells.

OSC_Microbio_06_02_hiv.webp
Schematic of the life cycle of the virus, from attachment to cell wall to integration of virus DNA with the host DNA, and then creation of more viruses.

Vectors are engineered viruses designed to insert a particular sequence of genes into a particular spot in the DNA. Then, if we know which part of the DNA codes for the target protein, we can use the vector to insert the genes that code for our chosen FP right after the sequence that creates the protein. This way, when the genetic sequence for the protein is transcribed and translated, the FP is also made at the same time and becomes part of the final product, hanging off some part of the protein like a clothing tag someone forgot to remove.

Lastly, we can use imaging techniques that involve shooting light at tissue causing all the little FPs everywhere to fluoresce, resulting in a psychedelic image showing us where all the proteins with the FP tag are - where they're made, where they go in the body, what other cells and tissues they might interact with. Cool? Cool.

Fluorescent-labeling-of-proteins.webp
Examples of FP labeling.

The more colors you have, the more different stuff you can label simultaneously, which is why scientists went all "see the rainbow, taste the rainbow, mutate into the rainbow" on GFP (and other naturally-occurring FPs).

174-GFPLikeProteins_GFP-like_Proteins.webp
Variations of GFP that result in different colors.

"Does inserting a foreign protein onto the original protein impact normal processes that you're trying to visualize?" you ask. The answer is, it depends on the size of the tag. DsRed, for example, is quite large and does impact function, but mRFP is a fourth of the size and doesn't seem to impact functionality of the tagged whatever it is. But this leads us to the last topic - protein structures and sizes.

And one more fun fact: tissue tagging is how we got those "glow-in-the-dark" fish that were all the rage like a decade or so ago. They tagged a protein that happens to be expressed in every cell, so the entire animal re-emits light when exposed to UV wavelengths.

Proteins and Gels

Okay, we're getting close to the end now. This is the last mini-topic I'm going to cover and is directly related to the work that I did in the project.

After RNA is translated into a string of amino acids, this chain undergoes a process called 'folding', taking it from a simple string to a complex 3D structure. A protein's structure is vitally important to all aspects of its functionality, including fluorescence. Protein folding is a massive field unto itself, including significant AI and citizen science efforts due to the sheer scale of the problems the field is facing. I highly recommend looking into it.

image39-3748058042.png
Schematic showing the folding process of a protein, from string to alpha helices and pleated sheets, and finally its complex 3D shape.

If a protein consists of many repeating units, it's called a polymeric protein, where each unit is individually a 'monomer'. If there are two, it's called a dimer. DsRed is a tetramer, while mRFP is, as covered, a monomeric version of DsRed - still capable of fluorescing (much more weakly) but a quarter of the size.

If you have a mutated protein, though, how do you know what structure it has? Visualizing an unknown protein's structure isn't easy, often involving x-ray crystallography, NMR spectroscopy, or at the very least a software program to provide an estimate of the shape based on the amino acid sequence. These are complicated techniques and honestly, overkill for our project. So we used SDS-PAGE gel electrophoresis.

Electrophoresis-Set-up-1-3053789344.jpg
Diagram of the gel electrophoresis setup.

Electrophoresis involves creating a spatially uniform electric field, positive at one end, negative at the other, within a gel matrix. The gel is like jell-o, providing a substrate that is penetrable but supportive. Then, you place whatever you want to assess - DNA, RNA, proteins - into little wells in the gel at the negative end of the field, often combined with a stain to provide contrast against the gel matrix. Because chemicals also have charges, they'll respond to the electric field and anything negatively charged will start moving towards the positive end.

As the sample moves through the gel, the gel helps break up the samples into substructures or even individual molecules. Here's the kicker - the smaller the pieces, the farther they'll move through the gel, while the large pieces will end up closer to the wells. Friction, baby. So, ultimately, gel electrophoresis is a way of determining the size of a protein or its parts. Alongside the samples, a protein ladder or size standard is often included. You run a sample containing pieces of known sizes and use their progression through the gel as a benchmark comparison for the unknown samples' sizes.
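
Here's a sketch of how a ladder can be turned into a size estimate. It leans on an assumption that isn't stated above: that migration distance is roughly linear in the logarithm of a band's size, which is a common rule of thumb for reading gels. The ladder distances and sizes below are hypothetical numbers I made up, not data from the project.

```python
import math

# Hypothetical ladder: (distance migrated in mm, known size in kDa).
LADDER = [(10, 100.0), (20, 50.0), (30, 25.0), (40, 12.5)]


def estimate_size_kda(distance_mm: float) -> float:
    """Interpolate log10(size) linearly between the two ladder bands
    that bracket the sample band."""
    for (d1, s1), (d2, s2) in zip(LADDER, LADDER[1:]):
        if d1 <= distance_mm <= d2:
            fraction = (distance_mm - d1) / (d2 - d1)
            log_size = math.log10(s1) + fraction * (math.log10(s2) - math.log10(s1))
            return 10 ** log_size
    raise ValueError("sample band falls outside the ladder")


print(round(estimate_size_kda(25), 1))  # 35.4
```

A band that stops halfway between the 50 kDa and 25 kDa markers comes out around 35 kDa rather than 37.5, because the interpolation happens on the log scale.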

SDS (sodium dodecyl sulfate)-PAGE (polyacrylamide gel electrophoresis) binds the protein sample with SDS to eliminate the influences of charge and structure on the sample. Therefore the results are determined only by size of the amino acid chains.

SDS-PAGE gels are a cheap and precise way of separating protein samples, used in everything from preparation for other analytical methods to standard HIV tests. To underscore the impact and ubiquity of this methodology, here's one last fun fact: the 1970 publication where Dr. Ulrich Laemmli showcased the SDS-PAGE methodology is the second most-cited paper. Ever. It currently has over 300,000 citations.

And with that, I think I've covered just about everything. Let's talk about the research!


The Research Project

Of all the phylogenetic reconstruction algorithms out there, of which we described three in the sections above, which ones are most accurate? This was Ryan's question. To get at the answer, she devised an experiment where she would create an artificial phylogeny of something like a protein in the lab, feed the tip sequences to the various algorithms, and then compare the trees the algorithms produced with the actual tree. Simple and elegant, my favorite type of experiment. But a phylogeny of what?

Enter stage right, mRFP. E. coli made to express mRFP using vectors were grown in lab and allowed to evolve until enough mutations had accrued in the mRFP protein genes to result in a phenotypic difference. That is, mutations in the mRFP genes eventually changed the color of the fluorescence emitted by mRFP. The population with the mutated mRFP was split into two and both were allowed to evolve until more mutations occurred. Rinse and repeat a few more times and at the end of the process, Ryan had made herself a wonderful phylogeny of mRFP evolution.
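
As a very loose illustration of the split-and-mutate process, here's a Python sketch that builds a small artificial tree from a single starting sequence. Everything in it is a simplification of mine: the starting sequence is a toy, each branch gets exactly two random substitutions, and there's no check for a change in color before a split, so treat it as a cartoon of the idea rather than the lab protocol.

```python
import random

BASES = "ACGT"


def mutate(seq: str, n_mutations: int, rng: random.Random) -> str:
    """Apply a fixed number of random single-base substitutions."""
    bases = list(seq)
    for _ in range(n_mutations):
        position = rng.randrange(len(bases))
        bases[position] = rng.choice([b for b in BASES if b != bases[position]])
    return "".join(bases)


def evolve(ancestor: str, generations: int, rng: random.Random) -> dict:
    """Split a lineage in two, mutate each copy, and repeat.
    Returns a nested record of the true tree, ancestors included."""
    if generations == 0:
        return {"sequence": ancestor, "children": []}
    children = [
        evolve(mutate(ancestor, 2, rng), generations - 1, rng) for _ in range(2)
    ]
    return {"sequence": ancestor, "children": children}


def leaves(tree: dict) -> list[str]:
    """Collect the tip sequences, i.e. all an algorithm would get to see."""
    if not tree["children"]:
        return [tree["sequence"]]
    return [seq for child in tree["children"] for seq in leaves(child)]


tree = evolve("GATTACAGATTACA", generations=3, rng=random.Random(0))
print(len(leaves(tree)))  # 8
```

The useful property, same as in the real experiment, is that the full tree is known: every ancestor's sequence is recorded, so any reconstruction built from the tips alone can be checked against it.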

FullPhylo.PNG
Screenshot of Figure 1 from Randall et al 2016. The caption: Figure 1 | Phylogram of the experimental phylogeny initiated from a single red FP gene. Scale bar represents amino acid replacements per site per unit evolutionary time. The colour of each branch reflects the colour-class phenotype (emission) of the node protein for internal branches or the leaf protein for tip branches (except for the branch connecting node 33 to leaf 7 that transitions through an orange intermediate). Nodes and tips are numbered for reference. Nonsynonymous and synonymous substitutions are shown along each branch, respectively. The experiment began near node 21 with a single red FP gene and proceeded by random-mutagenesis PCR.

The sequences of the leaves were fed to PAML, MP, PHYLO, automated PAML (LAZARUS), and Fast ML (FASTML). In each algorithm's reconstructed tree, how well do the reconstructed nodes' genotypes match the genotypes of the nodes in the original phylogeny (which we'll call the True Ancestors or TAs for brevity)? Sometimes they match and sometimes the generated genotypes don't match the TA - we're going to call these Inferred Incorrect Variants, or IIVs.
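
The genotype comparison boils down to a node-by-node check, sketched below in Python. The node labels and sequences are hypothetical stand-ins, and the sketch assumes each inferred node can be paired up with its true counterpart, which glosses over how that matching is done in practice.

```python
# Hypothetical node sequences, for illustration only.
true_ancestors = {"node A": "GATTACA", "node B": "GATTAGT", "node C": "GATTGGT"}
inferred = {"node A": "GATTACA", "node B": "GATTAGA", "node C": "GATTGGT"}


def find_iivs(truth: dict[str, str], guess: dict[str, str]) -> list[str]:
    """Return the nodes whose inferred genotype differs from the true ancestor."""
    return [node for node in truth if guess[node] != truth[node]]


iivs = find_iivs(true_ancestors, inferred)
accuracy = 1 - len(iivs) / len(true_ancestors)
print(iivs, f"{accuracy:.0%}")  # ['node B'] 67%
```

Running the same check for each algorithm's output gives a simple genotypic accuracy score to compare them by.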

However, we know that genotypes and phenotypes aren't exactly perfectly matched due to the redundancy built into the codon translation system. So, even if the reconstructed genotype doesn't match the TA's genotype exactly, perhaps the algorithms still managed to produce the correct phenotype. The question now is, how accurately do the algorithms reconstruct not just a genotypically-correct tree but a phenotypically-correct tree?
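
As a quick illustration of that codon redundancy, here is a small Python sketch with a deliberately partial codon table. The DNA snippets are invented for the example and have nothing to do with the actual mRFP gene; the codon assignments themselves come from the standard genetic code.

```python
# Partial codon table: only the codons needed for this toy example.
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A", "GCC": "A",  # two codons, same amino acid (alanine)
    "AAA": "K", "AAG": "K",  # two codons, same amino acid (lysine)
    "CGT": "R",
    "CAT": "H",
}


def translate(dna: str) -> str:
    """Translate a DNA string, read in frame, into a one-letter protein string."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical sequences for illustration only.
original = "ATGGCTAAACGT"
synonymous = "ATGGCCAAGCGT"     # two DNA changes, identical protein
nonsynonymous = "ATGGCTAAACAT"  # one DNA change, R becomes H

print(translate(original), translate(synonymous), translate(nonsynonymous))
```

The first two sequences differ at the DNA level but translate to the same protein, while the third changes an amino acid and so has at least a chance of changing the phenotype.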

That is the question for which Ryan collected a gang of undergrads to grow glowing E. coli. Ryan created the original mRFP phylogeny and ran the algorithms. Once all the algorithms had produced their version of the mRFP phylogeny, the IIVs were expressed in E. coli using vectors and grown into small colonies in petri dishes. Genotypic accuracy was the first part of the project for which we could collect data. Once these colonies were well-established, Mark Leber purified the proteins from the E. coli cells, because we don't really care about the bacteria here. With the purified IIV proteins, we could start assessing the phenotypes.

Ryan cared deeply about characterizing the phenotypic differences between the IIVs and TAs both quantitatively and qualitatively. The quantitative data took the form of Quantum Yield (QY) analysis, performed by Kelsey Roof, and Extinction Coefficient (EC) calculation, performed by Caelan Radford. Quantum Yield is the ratio of photons emitted by a protein to photons absorbed by the protein at a certain wavelength. It's a way of characterizing how efficient the protein is at converting absorbed UV light into visible light. The EC is a constant describing how strongly a fluorescent protein absorbs incoming light of a particular wavelength. (This analysis is done with a wavelength of 280nm to keep ECs consistent and comparable across different studies.) Multiply EC by QY and you have a way of quantifying how bright a fluorescent protein is. You don't have to resort to squinting at two petri dishes side by side and trying to determine which one is brighter. Quantitative data, yeah!
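
Here is a minimal Python sketch of that arithmetic: brightness as EC times QY, plus brightness relative to the TA so that the TA sits at 1. The class, the function name, and all of the numbers are hypothetical placeholders, not measurements from the project.

```python
from dataclasses import dataclass


@dataclass
class FluorescentProtein:
    name: str
    extinction_coefficient: float  # how strongly the protein absorbs light
    quantum_yield: float           # photons emitted per photon absorbed

    @property
    def brightness(self) -> float:
        """Brightness is the product of EC and QY."""
        return self.extinction_coefficient * self.quantum_yield


def relative_brightness(variant: FluorescentProtein, reference: FluorescentProtein) -> float:
    """Brightness of the variant when the reference (the TA) is normalized to 1."""
    return variant.brightness / reference.brightness


# Hypothetical values for illustration only.
ta = FluorescentProtein("TA", extinction_coefficient=50_000, quantum_yield=0.25)
iiv = FluorescentProtein("IIV", extinction_coefficient=45_000, quantum_yield=0.30)

ratio = relative_brightness(iiv, ta)
print(f"{iiv.name} brightness relative to {ta.name}: {ratio:.3f}")
```

A ratio above 1 means the IIV is brighter than its TA and a ratio below 1 means it is dimmer. Notice that in this made-up case the IIV absorbs less light but still ends up brighter because it emits more efficiently.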

Qualitative assessment, on the other hand, is a little less straightforward, and it's the part that I was in charge of. I remember at the time being very confused by qualitative data's place in a scientific publication. How do you even analyze it, let alone relate it succinctly to others? But I learned as we went on that just because data doesn't make it to a publication doesn't mean it's not worthwhile to collect. Some data inform your experimental design and analysis but aren't really useful to others, and that's okay.

I ran SDS-PAGE gels to compare the bits and bobs of the TA proteins against every IIV the algorithms generated for each TA. Additionally, because I was the only student in lab who was doing this work for their senior thesis, I tried my hand at analyzing the quantitative data as well. I knew my results would only be preliminary, as Ryan was of course going to do the final analysis for the paper. However, by the end of the summer, only three nodes had been completely assessed. The remaining 13 were awaiting QY or EC analysis. Nevertheless, it was good practice for consolidating multiple sources of data and gave me the opportunity to present more 'hard' results in my thesis.

Preliminary Results

These are the analyses for nodes 37, 30, and 32, taken from my thesis. I include them here verbatim because I think they show how all the data tie together to paint a full picture - which is the entire point of the project, of course.

Node 37

Every algorithm produced the same genotype for this node, so 37_PAML = 37_MP = 37_PHYLO = 37_LAZARUS = 37_FASTML. The results we obtained for 37_PAML should be equally applicable to the other IIVs at this node. Therefore, only one IIV, PAML, was synthesized. Table 1 contains information from each aspect of analysis of this project. The first row shows the number of amino acid sequence errors made by the algorithms. At site 43, threonine (T) was replaced with serine (S). At site 125, arginine (R) was replaced with histidine (H), and at site 194, lysine (K) was replaced with isoleucine (I). The most notable of these 3 mutations was the R -> H exchange, as it meant the protein gained a ring, imidazole. In a 2010 study of ACAD9 protein, researchers found that an arginine to histidine mutation, much like the mutation seen at site 125 here, was responsible for interfering with dimerization [8]. It is possible the same mutation is responsible for the dimer-monomer switch in phenotype, visible in the SDS PAGE gel of node 37.

node37.PNG
Table 1 in my thesis, comparing the QY, EC, Brightness, appearance, and SDS-PAGE gels between the TA and the IIVs for node 37.

It is evident that the true ancestor is a dimer as only two bands are present and the larger-sized band is roughly double the size of the smaller band (54 to 27). The IIV, however, is obviously a monomer in comparison. Lastly, QY and EC are calculated and compared between the TA and the IIV, and while extremely close in values, the IIV has a brightness of 1.267 (QY * EC) while the TA has a normalized value of 1, meaning the IIV is quantitatively brighter. This is somewhat visible in the UV pictures taken at the beginning of the project. In conclusion, the phenotype of the IIV is substantially different from the phenotype of the TA.

Node 30

Beginning with AA mutations, all five algorithms expressed the following mutations [Original AA->Mutation]: 17) C->S, 43) A->S, 95) D->E, 117) E->V, 159) G->E, 173) D->E, 177) N->K, and 125) R->H. MP experienced a mutation at site 83: M->L, and PHYLO experienced a mutation at site 206: E->D. Looking at the gel, two bands are identifiable in the TA lane while each IIV has 3, indicating a piece of protein that is remaining attached to the larger structure in the TA is breaking off in the IIVs. PyMOL analysis would be used next to identify the substantial changes and narrow down which mutations are most likely the cause of such a structural change.

node30.PNG
Table 2 is a complete analysis of node 30.

Looking at the UV pictures, one would easily predict that 30 PAML and 30 PHYLO are brighter than either 30 3r.1 or 30 MP. However, the quantitative results from the QY and EC values negate this qualitative observation and indicate that all 3 IIVs are dimmer than the TA. Unfortunately, while the UV pictures allow a quick method of qualitatively assessing protein phenotypes, it is mostly reliable in detecting color changes, not changes in color intensity and brightness (which can be very subjective qualities). This is why the quantitative data is [sic] necessary and essential to the phenotypic evaluations of the proteins. The higher QY values mean that the IIV for MP and PAML are more efficient at emitting light per photon absorbed. However, at the same frequency of absorption, all the IIVs are less efficient at absorbing light than the TA.

Node 32

Lastly, Node 32 here is an example of a node that has essentially no structural differences evident in the SDS-PAGE gel but has substantial quantitative differences. Comparing QY, EC, and Brightness values, it is evident that the IIVs are less efficient in both absorbing and emitting photons than the TA is, resulting in dimmer fluorescent proteins. Regarding mutations: 17) C->Y, 117) E->V, 125) R->H, and 173)D->E. PHYLO has an additional mutation at site 94) D->E. Again, PyMOL would be the next step to determine what changes the mutations code for structurally that make the IIVs a much less efficient protein.

node32.PNG
Table 3 shows the complete analysis for node 32.

Hopefully you can see how the gels were used to direct attention towards IIVs that exhibited substantial structural changes from the TA. Structural differences provide a rationale for any observed differences in QY or EC. However, that's as far as that data went, as investigating the detailed structure of the proteins was beyond the scope of the project. At the time, I didn't have a great way to synthesize this information, a way to change this qualitative analysis into quantitative data. To be honest, I still don't have good ideas - but according to my latest dive down the Google rabbit hole to see what analyses are possible with multiple gels, no one else does either. On one hand, I still have this feeling of dissatisfaction, of wanting my contribution to the project to have been more substantial. On the other hand, it's reassuring to know that I did, first, what was asked of me, and second, all that I could with the data.

Regarding the analysis of the quantitative data: I have a few bar graphs that compare number of genotype errors and QY and EC ratios between the algorithms. However, their axes don't start at zero, they use color chaotically, and ultimately they were made using only a fraction of the data, since that was what we had at the time. I refuse to include them on the basis that I'd like to be hired one day and these bar graphs are in violation of basic principles and, well, terrible. So, here are the figures from Randall et al 2016.

Figure2.PNG
Figure 2 | Number of incorrectly inferred amino acid sites for each node of the phylogeny. The 19 leaf sequences from Fig. 1 were subjected to ASR analyses using Bayesian (PAML, FastML, PhyloBayes) with or without rate variation modelled as a gamma distribution (Γ), as well as parsimony (MP). The inferred sequences were then compared to the true ancestral sequences from the 17 ancestral nodes in Fig. 1. Dark brown bars are PAML with a gamma distribution, light brown bars are FastML with a gamma distribution, yellow bars are PAML without gamma, light grey bars are PhyloBayes with a gamma distribution, and dark grey bars are maximum parsimony. Colour code is irrespective of FP colour emission phenotype.
Figure3.PNG
Figure 3 | Average phenotypic error across all nodes for the five ASR procedures. Extinction coefficient (ε), quantum yield (Φ), and brightness (product of ε and Φ) were determined for all incorrectly inferred ancestral FP proteins and compared to the properties of the true ancestral protein at each node and reported as a function of percent error. Dark brown bars are PAML with a gamma distribution, light brown bars are FastML with a gamma distribution, yellow bars are PAML, light grey bars are PhyloBayes with a gamma distribution, and dark grey bars are maximum parsimony. Single and double asterisks represent confidence at 95% and 99% levels, respectively, and are coloured according to the respective procedure that has significantly less error.
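
For a feel of what "percent error" against the true ancestor could look like in practice, here is a small Python sketch. The formula is the ordinary textbook percent error and is my assumption; I don't know the exact calculation used in the paper, and the function names and numbers below are hypothetical.

```python
def percent_error(inferred_value: float, true_value: float) -> float:
    """Absolute difference from the true value, as a percentage of the true value."""
    if true_value == 0:
        raise ValueError("true value must be non-zero")
    return abs(inferred_value - true_value) / abs(true_value) * 100.0


def mean_percent_error(pairs: list[tuple[float, float]]) -> float:
    """Average percent error over (inferred, true) pairs, e.g. one pair per node."""
    errors = [percent_error(inferred, true) for inferred, true in pairs]
    return sum(errors) / len(errors)


# Hypothetical (inferred, true) brightness values for two nodes.
brightness_pairs = [(0.9, 1.0), (1.2, 1.0)]

average = mean_percent_error(brightness_pairs)
print(f"average phenotypic error: {average:.1f}%")
```

Because the error is taken as an absolute value, an IIV that is 10% dimmer and one that is 10% brighter than the TA count as equally wrong.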

Ryan concludes that Bayesian methods are best and that even if the reconstructed tree includes genotypic errors, the phenotypes mostly match. "This finding should give the ASR field confidence that ancestral phenotypes are encoded correctly even if some residues are incorrectly inferred—assuming such sites do not drive phenotypes."

In the end, very little of my work made it into the final publication, and I understand. I'm mentioned in the acknowledgements, though, and that's nifty. But in all reality, I learned so much about bacteria cultivation (we all took shifts when it came to feeding, changing, and growing the E. coli colonies), proteins and protein characterization techniques, phylogenetics, collaborative research, and experimental design throughout this project. Vast amounts of learning in the span of just a few months. I didn't get a publication out of it, but I think that's probably the least important part. I feel it was a summer very well-spent.

Cheers,

Z

P.S. - The students in the BIOS 4590 class all had projects relating to yeast and beer. I remember the bulk of the projects looked at which strains produced the tallest beer head (the layer of foam generated from pouring the beer into a glass) or the longest lasting beer head, things like that.

References

Randall, Ryan N., et al. "An experimental phylogeny to benchmark ancestral sequence reconstruction." Nature Communications 7.1 (2016): 12847.