Comparative Genomics Using the Trait-to-Gene Algorithm
This tutorial guides you through a comparative genomics experiment using the data you’ve collected on bacterial traits. It is based on work by Levesque et al., who showed that a gene’s function can be inferred entirely from which bacterial genomes carry that gene and which do not. The concept is simple: we expect evolution to purge a gene from a genome once that gene is no longer needed. If a motile bacterial species adopts a non-motile lifestyle, the genes involved in motility can accrue deleterious mutations at no cost to the now immobile species. As such a gene accumulates mutations, its sequence eventually diverges so far that it no longer resembles the original. Because this process often acts efficiently on prokaryotic genomes, the presence of a bacterial trait can be correlated with the genes found only in genomes that have that trait. There are some caveats to this approach, which will be discussed later.
COGS
COGs are Clusters of Orthologous Groups of proteins, which were developed by a group of researchers at the
Trait-to-Gene
We have developed a simple tool for this class that analyzes the COGs from 66 completely sequenced genomes. It can be accessed at the following link:
When you click on the link, you’ll see a webpage listing all the bacterial species in the database, each with its three-letter abbreviation. You can look up which abbreviation corresponds to which species at:
The first step is to check the boxes on the left for the species that have the trait of interest and the boxes on the right for the species that lack it. Leave all other species unchecked; they are not included in the analysis at this time. The threshold for reporting results is set at the bottom of the page. The threshold lets us retrieve COGs whose distribution matches the selected genomes almost, but not exactly. For instance, if a species that has lost a trait has not yet lost all of the genes for that trait (or is using those genes for something else), we are still interested in finding out what those genes do. Likewise, some genes may not be absolutely essential for conferring a particular trait (i.e., their function can be substituted by another gene), so a few genomes may have lost that gene and still retained the trait. In either case, the score for that COG would be less than one, because the match is not perfect. Before you use the tool with your own trait, let’s work through an example where we know the results.
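To make the idea of a score concrete, here is a minimal Python sketch. The tutorial does not state the exact formula the tool uses, so this assumes the score is simply the fraction of selected genomes in which a COG's presence or absence agrees with the trait. The function name cog_score, the genome labels G1 to G5, the COG names, and the threshold of 0.7 are all made up for illustration and do not come from the real database.

```python
def cog_score(genomes_with_cog, have_trait, lack_trait):
    """Fraction of selected genomes whose COG presence/absence matches the trait.

    This formula is an assumption for illustration; the real tool may differ.
    """
    have_trait, lack_trait = set(have_trait), set(lack_trait)
    present = set(genomes_with_cog)
    total = len(have_trait) + len(lack_trait)
    if total == 0:
        raise ValueError("select at least one genome")
    # A genome agrees if it has the trait and the COG, or lacks both.
    matches = len(have_trait & present) + len(lack_trait - present)
    return matches / total


# Toy data with hypothetical genome labels (not from the real database).
have = {"G1", "G2", "G3"}   # genomes with the trait (left-hand boxes)
lack = {"G4", "G5"}         # genomes without the trait (right-hand boxes)
profiles = {
    "cog_perfect": {"G1", "G2", "G3"},            # present exactly where the trait is
    "cog_lost_once": {"G1", "G2"},                # one trait-positive genome lacks it
    "cog_lingering": {"G1", "G2", "G3", "G4"},    # one trait-negative genome still has it
    "cog_everywhere": {"G1", "G2", "G3", "G4", "G5"},
}

threshold = 0.7  # hypothetical reporting threshold
scores = {cog: cog_score(p, have, lack) for cog, p in profiles.items()}
reported = {cog: s for cog, s in scores.items() if s >= threshold}

for cog, s in sorted(reported.items(), key=lambda kv: -kv[1]):
    print(f"{cog}: {s:.2f}")
```

Under this assumed scoring, a COG found in every genome still scores above zero, which is one reason a threshold is useful: it filters out COGs that match only because they are present almost everywhere.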
EXAMPLE: Flagella Genomes from Levesque dataset
The original analysis using flagella as a test case was done two years ago by Levesque et al., who compared 21 fully sequenced bacterial genomes. We’ll repeat that example now with the new version of the COG dataset. First, check the boxes on the left for the following genomes, which represent the species that have flagella:
Aae, Bbu, Bsu, Cje, Eco, Jhp, Pae, Tma, and Vch
Now, click the boxes on the right for the species without flagella:
Afu, Ctr, Hin, Mge, Mja, Mth, Mtu, Nme, Pho, Rpr, Syn, Xfa
Finally, set the threshold to 0.5 so that we get a list of all the COGs that may have even a remote connection to flagella. You should see the following for the COGs that match the above set perfectly, and below that a list of COGs with a less-than-perfect match.

If you did not get the list of COGs shown above (plus the other ones with a score <1), then you did something wrong. It’s easy to check the wrong boxes, so make sure that you didn’t click a box twice or put a check in the wrong box.
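If you keep your selections written down as lists, a short script can catch the most common mistakes before you start checking boxes. The sketch below uses the two abbreviation lists from this example; the function name check_selection is hypothetical, and the script only checks the lists themselves, not what is actually ticked on the webpage.

```python
# Abbreviations copied from the flagella example above.
with_flagella = "Aae, Bbu, Bsu, Cje, Eco, Jhp, Pae, Tma, Vch".split(", ")
without_flagella = "Afu, Ctr, Hin, Mge, Mja, Mth, Mtu, Nme, Pho, Rpr, Syn, Xfa".split(", ")


def check_selection(have, lack):
    """Return a list of human-readable problems found in the two selections."""
    problems = []
    for name, group in (("have", have), ("lack", lack)):
        # An abbreviation repeated within one list suggests a copying error.
        dupes = sorted({g for g in group if group.count(g) > 1})
        if dupes:
            problems.append(f"{name}: listed twice: {dupes}")
    # A genome cannot both have and lack the trait.
    both = sorted(set(have) & set(lack))
    if both:
        problems.append(f"checked on both sides: {both}")
    return problems


problems = check_selection(with_flagella, without_flagella)
total = len(set(with_flagella) | set(without_flagella))
print(problems or "selection looks consistent")
print(f"{total} genomes selected")
```

For this example the total should be 21, matching the number of genomes in the original Levesque et al. comparison.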
This example shows the power of the approach for identifying genes involved in a particular process. The only prior knowledge that went into the analysis was which species had flagella and which did not. All of the COGs listed with a score of 1 have a presumed role in flagella formation or function (based on their annotation). Annotations are accumulated over decades of research, either by biologists studying individual genes in great detail or by homology (evolutionary relationship) to genes that have already been characterized in this way. If you scan the list below the perfect matches, you’ll also see many other known flagella COGs as well as a lot of unknown ones that may or may not be related to flagella function. You can click on a COG to find out more about it and to see which bacteria have a gene represented by that COG.
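The two-part layout of the results, perfect matches first and partial matches below, can be sketched in Python as follows. The COG names and scores here are invented placeholders, not real results, and treating the threshold as inclusive (score >= threshold) is an assumption about how the tool behaves.

```python
def split_report(scores, threshold):
    """Split scored COGs into (perfect, partial) lists, best scores first."""
    # Keep only COGs at or above the threshold (inclusive cutoff is an assumption).
    kept = [(cog, s) for cog, s in scores.items() if s >= threshold]
    # Sort by descending score, breaking ties alphabetically by COG name.
    kept.sort(key=lambda item: (-item[1], item[0]))
    perfect = [item for item in kept if item[1] == 1.0]
    partial = [item for item in kept if item[1] < 1.0]
    return perfect, partial


# Placeholder identifiers and scores, for illustration only.
example_scores = {"COG_A": 1.0, "COG_B": 0.9, "COG_C": 0.45, "COG_D": 1.0, "COG_E": 0.5}
perfect, partial = split_report(example_scores, 0.5)

print("Perfect matches:", perfect)
print("Partial matches:", partial)
```

The partial list is where candidate genes of unknown function would appear, so it is worth reading from the top down rather than stopping at the perfect matches.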