Chapter 21 Blog: Genomes, Proteomes, and Bioinformatics (Semon)


A.  Blog

Prokaryotic genomes typically consist of a single circular "chromosome" (called gDNA) located in a central nucleoid region of the bacterium; it contains all the DNA coding for the essential functioning of the organism. The gDNA is supercoiled and wound around structural proteins to condense it to a reasonable size; when parts of the gDNA need to be accessed, they unwind locally and transcription and translation begin. The gDNA usually has a single origin of replication, in contrast to eukaryotic chromosomes, each of which has many. The gDNA has a very small amount of non-coding DNA - less than 15% - and many of the genes are organized in a polycistronic fashion: several genes, generally related to one another in function, sit one after the other downstream of a single promoter and are transcribed all at once by a single run of an RNA polymerase. Bacteria may also have additional small circular pieces of DNA called plasmids that contain genes that are useful but non-essential. Plasmids replicate independently of the gDNA, and can be transferred from bacterium to bacterium through conjugation or transformation. Plasmids generally contain genes for resistance to toxins, digestion of unusual substances, secretion of anti-bacterial proteins, development of virulence, and transfer of plasmids to other bacteria.

 

The human genome, like eukaryotic genomes in general, is much more complicated than any prokaryotic genome. First of all, the DNA is arranged on multiple linear strands called chromosomes, which are hyper-compacted through supercoiling and lots of proteins. The ends of the chromosomes are covered in "caps" made of huge numbers of repeats of a short DNA sequence; these caps are called telomeres. They make the linear DNA more stable and act as a buffer: the copying process cannot fully replicate the very ends of linear DNA, so what gets lost in each round is a bit of telomere rather than a bit of gene. Often, in eukaryotic genomes, there are two or more chromosomes in each individual that contain very similar information; these are homologous chromosomes, with one generally coming from each parent. Much of the eukaryotic genome consists of DNA that does not code for anything in particular; some of this "junk" DNA is repetitive, while some is not. In addition, the eukaryotic genome is littered with transposable elements (discussed later) and no-longer-functioning genes. Then, there are regulatory sequences - sequences of DNA that serve to regulate gene expression; they're generally used as protein binding sites. Finally, there are the genes: the parts of the DNA that actually code for something. Eukaryotic genes consist of non-coding introns interspersed with coding exons; the introns are cut out of the pre-mRNA before translation. That leaves just a tiny portion of the eukaryotic DNA that actually codes for protein.

 

Transposable elements are pieces of DNA that can jump from one place to another in the genome. The simplest transposable elements consist of a pair of inverted repeats, one at each end, with the gene for transposase in the middle. Transposase, by recognizing the inverted repeats, can cut a transposable element out and paste it somewhere else in the genome. If there are genes in between two such simple transposable elements, then there is a chance that a transposase enzyme will recognize the whole complex as one large transposable element (since it's still flanked by inverted repeats) and paste it somewhere else; these are complex transposable elements. There are also transposable elements that contain the genes for reverse transcriptase and integrase; these act like simple viruses, making many copies of themselves and inserting those copies into many places in the genome. Transposable elements seem quite useless at first glance, but they can significantly impact evolution! They can create a copy of a gene somewhere else in the genome, allowing different mutations to accumulate in each copy, eventually giving rise to two different but related proteins! This is considered to be one of the major factors driving the creation of gene families: sets of related but different genes that all "descended" from a common ancestor.
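
To make the cut-and-paste idea concrete, here is a minimal Python sketch of a simple transposable element being recognized by its inverted repeats and moved. Everything in it is an illustrative assumption: the function names, the toy sequences, and the 5-base repeat are made up, and real transposition involves details (such as duplication of the target site) that the sketch ignores.

```python
from typing import Optional, Tuple

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, i.e. the inverted-repeat partner of seq."""
    return seq.translate(COMPLEMENT)[::-1]


def find_element(genome: str, left_repeat: str) -> Optional[Tuple[int, int]]:
    """Find the first region that starts with left_repeat and ends with its
    reverse complement. Returns (start, end) as a half-open slice, or None."""
    right_repeat = reverse_complement(left_repeat)
    start = genome.find(left_repeat)
    if start == -1:
        return None
    right = genome.find(right_repeat, start + len(left_repeat))
    if right == -1:
        return None
    return start, right + len(right_repeat)


def cut_and_paste(genome: str, left_repeat: str, target: int) -> str:
    """Excise the element and reinsert it at index `target`, where `target`
    is counted in the genome as it looks after the excision (toy model)."""
    span = find_element(genome, left_repeat)
    if span is None:
        return genome  # nothing for the "transposase" to recognize
    start, end = span
    element = genome[start:end]
    remainder = genome[:start] + genome[end:]
    return remainder[:target] + element + remainder[target:]


# Toy genome: AAAA | GATTC TTTT GAATC | CCCC
# The TTTT in the middle stands in for the transposase gene.
genome = "AAAAGATTCTTTTGAATCCCCC"
moved = cut_and_paste(genome, "GATTC", target=6)
print(moved)
```

The whole element, repeats included, moves as one unit, which is also why anything sitting between two such elements can get carried along.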

 

The proteome is the sum of all the proteins expressed by a genome, cell, tissue, or organism, named in analogy with the genome - the complete set of DNA in a cell. The genome contains many things that do not code for proteins. The proteome, on the other hand, includes all the post-translational modifications that occur to proteins, as well as all the different proteins produced by alternatively splicing the exons of a single gene. Despite the small amount of DNA that codes for proteins, because of the latter two factors, the proteome contains far more distinct proteins than the genome contains protein-coding genes! The proteome consists of structural proteins and regulatory proteins. Structural proteins are proteins that actually do things in the cell - they let ions across membranes, they catabolize and anabolize, they do work! Regulatory proteins control the production and activity of other proteins, by acting as transcription factors or by modifying other proteins through, say, phosphorylation or binding.

 

Obviously, there is a lot of information in all the proteomes and genomes in biology. To organize all this information, large, centralized databases, called bioinformatics databases, have been created; they hold huge data banks of known gene and protein sequences. These sequences are submitted by labs around the world, and then analyzed by computational methods and annotated by humans. The databases store information about each sequence, and mark which parts of each sequence code for which protein, which parts are regulatory sequences, etc. In short, they are a catalog of the information that can be extracted from genetic/protein sequences. One can do a search for a protein, and see its most common sequence as well as various common variants of it. The sequence's evolutionary history, through gene families and species, can also be traced. This is possible because of a tool called BLAST, which allows one to compare two sequences and see how similar they are, or to find a similar but not exact match of a smaller sequence in a larger sequence. So, if one just sequenced some unknown DNA strand, that sequence could be BLASTed against the database to find what that sequence codes for, even if the sequence has a mutation in it! The intersection of computing and biology has proved invaluable to practicing biologists indeed.
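
For a feel of what "similar but not exact match" means computationally, here is a small Python sketch of local alignment scoring. This is NOT how BLAST itself is implemented: BLAST uses a fast heuristic, while the sketch below is the slower, exact dynamic-programming approach (Smith-Waterman style) that finds the same kind of local match. The scoring values, function names, and the two-entry "database" are all assumptions made up for illustration.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman style).
    Scores never drop below zero, so a match may start and end anywhere."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]  # column 0 is always zero
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def best_hit(query: str, database: dict) -> tuple:
    """Return (name, score) of the database entry most similar to query."""
    name = max(database, key=lambda n: local_alignment_score(query, database[n]))
    return name, local_alignment_score(query, database[name])


# Hypothetical mini-database; geneA contains the query with one base changed.
database = {
    "geneA": "TTTTACGAACGTTTT",
    "geneB": "GGGGGGGG",
}
print(best_hit("ACGTACGT", database))
```

The query still finds geneA even though one base differs, which is the "even if the sequence has a mutation in it" part: a mismatch costs a little score but does not kill the match.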

B.  Useful Materials

 

http://blast.ncbi.nlm.nih.gov/Blast.cgi

This had to be linked. This is the source for BLASTing sequences against the main public bioinformatics database. To a practicing biologist, it is probably one of the most useful pages on the web, trailing not far behind www.google.com. The page lets you BLAST against several species; it lets you BLAST different types of sequences; it lets you do specialized, tremendously useful variants of BLAST. A rare case of taxpayer money well spent.

 

http://www.ncbi.nlm.nih.gov/pubmed/2231712

The paper that started the BLASTing. Fairly readable, and quite well written. Worth reading if you can get access, which I think we all can through RVCC. It explains the algorithm and reviews its performance on real searches.

 

http://www.ndsu.edu/pubweb/~mcclean/plsc731/transcript/transcript5.htm

Classic example of alternative splicing in rat muscle troponin T. The 5 exons are W, X, Alpha, Beta, and Z. However, the proteins found consist of either exons W, X, Alpha, and Z or exons W, X, Beta, and Z; Alpha and Beta are used as alternatives to one another. It's kind of awesome to really see a concrete example of this strange, fairly theoretical-sounding phenomenon. The page goes into the example in more detail.
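
The troponin T example can be written out as a tiny Python sketch. It assumes a simplified model in which a gene is a series of "slots" and exactly one exon is chosen per slot; the `isoforms` function and the slot layout are my own illustration, not something from the linked page.

```python
from itertools import product


def isoforms(layout):
    """Enumerate the exon combinations for a gene.
    layout is a list of slots; each slot is a tuple of alternative exons,
    and exactly one exon is picked from each slot (simplified model)."""
    return [list(choice) for choice in product(*layout)]


# Troponin T as described above: W, X and Z are always included,
# while Alpha and Beta are alternatives for the same slot.
TROPONIN_T = [("W",), ("X",), ("Alpha", "Beta"), ("Z",)]

for exons in isoforms(TROPONIN_T):
    print("-".join(exons))

# A hypothetical gene with three two-way choices: one gene, many proteins.
hypothetical = [("A1", "A2"), ("B1", "B2"), ("C1", "C2")]
print(len(isoforms(hypothetical)))
```

The second, made-up gene shows why the proteome can be so much larger than the list of genes: every extra either/or choice doubles the number of possible proteins from that one gene.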