Comparative Genomics Using the Trait-to-Gene Algorithm

 

This tutorial guides you through a comparative genomics experiment using the data you’ve collected on bacterial traits. It is based on work by Levesque et al., who showed that a gene’s function can be inferred entirely from which bacterial genomes contain that gene and which do not. The concept is simple: we expect evolution to purge a gene from a genome once that gene is no longer needed. If a motile bacterial species adopts a non-motile lifestyle, its motility genes can begin accruing deleterious mutations at no cost to the now immobile species. As such a gene mutates further, its sequence diverges until it no longer resembles the original. Because this process often acts so efficiently on prokaryotic genomes, the presence of a bacterial trait can be correlated with the genes found only in the genomes of bacteria that have that trait. There are some caveats to this approach, which will be discussed later.

 

COGs

            COGs are Clusters of Orthologous Groups of proteins, developed by a group of researchers at the National Center for Biotechnology Information. You can find more information about COGs at their website. In essence, a COG is a group of proteins shared among three or more sequenced bacterial genomes. For instance, if your bacterium of interest has gene X, and that gene is also present in two closely related species, then gene X would form a COG (consisting of X, X’, and X’’). A COG therefore consists of related proteins from different species that often have the same function. This can be seen in the example of a real COG below, where a gene is present in three species and thus forms a COG. The gene has a different name in each species, but all three copies are likely to have the same function.
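One way to picture a COG is as a presence/absence record across genomes. The short Python sketch below illustrates the idea. The COG identifiers, gene names, and species abbreviations in it are made up for illustration and are not taken from the actual COG database.

```python
# Hypothetical data: each COG maps a species abbreviation to the name
# that species uses for its copy of the gene.
cogs = {
    "COG_A": {"Sp1": "geneX", "Sp2": "geneX_prime", "Sp3": "geneX_double_prime"},
    "COG_B": {"Sp1": "geneY", "Sp3": "geneY_prime", "Sp4": "geneY_double_prime"},
}


def genomes_with_cog(cog_id):
    """Return the set of species whose genome contains a member of the COG."""
    return set(cogs[cog_id])


def presence_profile(cog_id, species):
    """Return one True/False value per species, in the order given."""
    present = genomes_with_cog(cog_id)
    return [s in present for s in species]


if __name__ == "__main__":
    print(presence_profile("COG_A", ["Sp1", "Sp2", "Sp3", "Sp4"]))
```

The presence/absence profile is the only information the comparison in this tutorial relies on; the gene sequences themselves are not needed once the COGs have been built.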

 

 

Trait-to-Gene

We have developed a simple tool for this class to analyze the COGs from 66 completely sequenced genomes, which can be accessed at the following link: 

 

Trait-To-Gene

 

When you click on the link, you’ll see a webpage listing all the bacterial species in the database, each shown by its three-letter abbreviation. You can find out which abbreviation corresponds to which species at:

 

Species List

 

The first step is to check the boxes on the left for the species that have the trait of interest and the boxes on the right for the species that lack that trait. Leave all other species unchecked, since we are not including them in the analysis at this time. The threshold for reporting results is at the bottom of the page. This threshold lets us retrieve COGs whose distribution matches the selected genomes almost, but not exactly. For instance, if a species that has lost a trait has not yet lost all of the genes for that trait (or is doing something else with those genes), we are still interested in finding out what those genes do. Likewise, some genes may not be absolutely essential for conferring a particular trait (i.e. their function can be substituted by another gene), so a few genomes may have lost such a gene and still retained the trait. In either case, the score for that COG would be less than one, because the match is not perfect. Before you try using the tool with your trait, let’s work through an example where we know the results.
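The tutorial does not state the formula the tool uses to score a COG. As a rough illustration only, the Python sketch below assumes the score is the fraction of selected genomes whose presence or absence of the COG agrees with the trait, so that a perfect match scores 1. The function names and the example data are hypothetical.

```python
def match_score(cog_genomes, with_trait, without_trait):
    """Fraction of selected genomes that agree with the trait pattern.

    Assumed scoring rule (not taken from the tool itself): a genome with
    the trait agrees if it contains the COG, and a genome without the
    trait agrees if it lacks the COG.
    """
    cog_genomes = set(cog_genomes)
    with_trait = set(with_trait)
    without_trait = set(without_trait)
    total = len(with_trait) + len(without_trait)
    if total == 0:
        raise ValueError("select at least one genome")
    agree = len(with_trait & cog_genomes) + len(without_trait - cog_genomes)
    return agree / total


def report(cogs, with_trait, without_trait, threshold):
    """Return (cog_id, score) pairs at or above the threshold, best first."""
    scored = [
        (cog_id, match_score(genomes, with_trait, without_trait))
        for cog_id, genomes in cogs.items()
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical genomes A-E and hypothetical COGs.
    example_cogs = {
        "perfect": {"A", "B", "C"},
        "partial": {"A", "B", "D"},
        "unrelated": {"D", "E"},
    }
    for cog_id, score in report(example_cogs, {"A", "B", "C"}, {"D", "E"}, 0.5):
        print(cog_id, score)
```

Under this assumed rule, lowering the threshold admits COGs that are missing from a few genomes with the trait or still present in a few genomes without it, which is the behavior described above.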

 

EXAMPLE: Flagella Genomes from Levesque dataset

 

            The original analysis using flagella as a test case was done two years ago by Levesque et al., who compared 21 fully sequenced bacterial genomes. We’ll repeat that example now with the new version of the COG dataset. First, check the boxes on the left for the following genomes, which represent the species that have flagella:

Aae, Bbu, Bsu, Cje, Eco, Jhp, Pae, Tma, and Vch

 

Now, click the boxes on the right for the species without flagella:

 

            Afu, Ctr, Hin, Mge, Mja, Mth, Mtu, Nme, Pho, Rpr, Syn, Xfa

 

Finally, set the threshold to 0.5, so that we get a list of all the COGs that may have even a remote connection to flagella. You should see the following for the COGs that match the above set perfectly, and below that a list of COGs with a less than perfect match.

 

If you did not get the list of COGs shown above (plus the other ones with a score <1), then something went wrong with your selection. It’s easy to check the wrong boxes, so make sure that you didn’t click a box twice or put a check in the wrong column.
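If you keep your selections in a text file or notebook, a quick check like the Python sketch below can catch the most common mistakes before you look for them on the webpage. The species abbreviations are the ones listed in this example; the function name is hypothetical, and the check is independent of the tool itself.

```python
# Species abbreviations from the flagella example in this tutorial.
with_flagella = "Aae, Bbu, Bsu, Cje, Eco, Jhp, Pae, Tma, Vch".split(", ")
without_flagella = (
    "Afu, Ctr, Hin, Mge, Mja, Mth, Mtu, Nme, Pho, Rpr, Syn, Xfa".split(", ")
)


def check_selection(with_trait, without_trait):
    """Return a list of problems found in the two selections (empty if none)."""
    problems = []
    # A species cannot both have and lack the trait.
    both = sorted(set(with_trait) & set(without_trait))
    if both:
        problems.append("in both columns: " + ", ".join(both))
    # A species should appear only once within a column.
    for label, group in (("left", with_trait), ("right", without_trait)):
        repeated = sorted({s for s in group if group.count(s) > 1})
        if repeated:
            problems.append("repeated in " + label + " column: " + ", ".join(repeated))
    return problems


if __name__ == "__main__":
    print(check_selection(with_flagella, without_flagella))
    print(len(with_flagella) + len(without_flagella), "genomes selected")
```

For the flagella example, the two lists should contain 9 and 12 species, matching the 21 genomes in the original comparison.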

            You should now see the power of this approach in identifying genes involved in a particular process. The only prior knowledge that went into this analysis was which species had flagella and which ones didn’t. All of the COGs listed with a score = 1 have a presumed role in flagella formation or function (based on their annotation). Annotations come from decades of research by biologists studying individual genes in great detail, or are assigned based on homology (evolutionary relationship) to genes that have already been characterized in this way. If you scan the list below this score, you’ll also see many other known flagella COGs, as well as a lot of unknown ones that may or may not be related to flagella function. You can click on a COG to find out more about it and to see which bacteria have a gene represented by that COG.