Beginning Biochemistry—Bioinformatics

Let's think about communication. A pretty fundamental topic, eh?

I have an abstract idea. That idea might originate from internal monitoring of my physiological state, external stimulation internalized by a sense organ or sporadic neural activity of undefined character. That abstract idea is considered in my cerebral cortex, developed and ultimately polished into a clear and concise thought.

But there hasn't been communication yet.

In order to communicate my now well-formed idea, I must first encode that idea in language, a script or encoding which I share with the person, or persons, with whom I wish to communicate my thoughts. That language is written (as I am doing now) or spoken, but the communication is still not complete. It is now a script that the reader must decode and, from that language, recreate the abstract idea.

Alright, but what does that have to do with the academic discipline of bioinformatics?

The answer, in a nutshell, is encoding. The roots of modern bioinformatics can be traced to a suggestion made by Gamow G and Yčas M at the Symposium on Information Theory in Biology in 1958.

And what was their suggestion? Single-letter codes to represent the amino acids. A protein could then be represented with one character for each amino acid, written in order, left to right on a page, corresponding to the sequence of amino acids from the N-terminus to the C-terminus. That encoding scheme would allow sequence data to be exchanged efficiently from researcher to researcher and, more importantly, from researcher to digital computer. It is an encoding which represents the physical world in a form that the pedestrian digital computer can take as input efficiently.
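To make the encoding idea concrete, here is a minimal Python sketch that converts a protein written with the older three-letter abbreviations into the one-letter code. The table holds the standard one-letter codes for the twenty common amino acids. The function name `to_one_letter` and the hyphen-separated input style are assumptions made for this illustration only.

```python
# Standard one-letter codes for the twenty common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter_sequence, separator="-"):
    """Convert e.g. 'Ala-Cys-Asp-Tyr' (N to C-terminus) into 'ACDY'.

    The hyphen-separated input style is an assumption for this sketch.
    """
    residues = three_letter_sequence.split(separator)
    # capitalize() lets 'ALA' or 'ala' match the 'Ala' key in the table.
    return "".join(THREE_TO_ONE[residue.capitalize()] for residue in residues)


short_peptide = to_one_letter("Ala-Cys-Asp-Tyr")
print(short_peptide)
```

Four residues shrink from fifteen characters to four, and the result is a plain string that any program can read.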

So, who was ultimately responsible for implementing the suggestion and making it useful?

Formal implementation fell to the IUPAC-IUB (International Union of Pure and Applied Chemistry-International Union of Biochemistry) committee on nomenclature. The draft suggestion was published in 1968: IUPAC-IUB Commission on Biochemical Nomenclature, A One-Letter Notation for Amino Acid Sequences, Tentative Rule (1968) JMB 243 3357–9.

Typical committee work. The real driver of making this idea work and sharing it with the larger community was Margaret Dayhoff who, in 1965, published the book Atlas of Protein Sequence and Structure, which listed the 65 protein sequences known at the time. She had assembled those 65 sequences by reading all the papers that listed the complete or partial sequence of a protein.

Now, I know, students who have only lived in the twenty-first century probably can't get all that excited about a book listing protein sequences. However, if you knew that the work for the Atlas of Protein Sequence and Structure would ultimately lead to electronic repositories of sequences and structures, you might say that Margaret Dayhoff was the mother of bioinformatics.

A method for aligning proteins was published in 1985 in the paper Lipman DJ and Pearson WR (1985) Rapid and Sensitive Protein Similarity Searches Science 227 1435–41. While this paper certainly made a splash, published in Science and all, the most enduring contribution to the field was not the alignment algorithm but the file format used by the program and its successors: FASTA format, named for the FASTA (FAST-All) program.

An example of the FASTA file format is shown below.

>pdb|9RNT|A Chain A, RIBONUCLEASE T1
ACDYTCGSNCYSSSDVSTAQAAGYKLHEDGETVGSNSYPHKY
NNYEGFDFSVSSPYYEWPILSSGDVYSGGSPGADRVVFNENN
QLAGVITHTGASGNNFVECT

Deceptively simple in presentation, it has become the lingua franca for sequence representation. The first line of the file begins with a greater-than character (">") and the remainder of the first line is a free-form comment about the sequence. In our example, the comment tells us the sequence is from a PDB structure, the PDB code is 9RNT and this is the sequence of chain A of RNase T1. The first line ends with the new line character. I know, a bit challenging as the new line character is a non-printing character. You don't see it, but it is there to generate the new line. Beginning with the first character on the second line and running continuously is the biological sequence in single-letter representation. Line breaks inside the sequence are only for readability and are ignored. Here it is a protein sequence, but the file format represents nucleic acid sequences equally well.

When in doubt, any method which is using a sequence as input is likely to accept that sequence in FASTA format. This representation should be considered the default sequence format. There are others, but this one should work (almost) everywhere.
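The format is simple enough that reading it takes only a few lines of code. Below is a minimal Python sketch of a FASTA reader, applied to the RNase T1 example above. The function name `parse_fasta` is an assumption for this illustration; established libraries handle more edge cases than this does.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are joined; line breaks carry no meaning.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>pdb|9RNT|A Chain A, RIBONUCLEASE T1
ACDYTCGSNCYSSSDVSTAQAAGYKLHEDGETVGSNSYPHKY
NNYEGFDFSVSSPYYEWPILSSGDVYSGGSPGADRVVFNENN
QLAGVITHTGASGNNFVECT
"""

records = parse_fasta(example)
header, sequence = records[0]
print(header)
print(len(sequence), "residues")
```

The same function works on a file containing many sequences, since each greater-than character starts a new record.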

What is the fundamental task you use the computer for on a daily basis? What single function defines a usable internet or internet site?

Want to find the specification for the FASTA format? Google search. Looking for a friend's (oh, who are we kidding, a celebrity's) profile on Instagram? Touch the magnifying glass and type the name.

So, how do you search with a sequence?

BLAST is the Basic Local Alignment Search Tool. See, search is in the name. BLAST will take a sequence as input and return sequences which have some local similarity, meaning that a stretch of letters inside the sequence matches, not necessarily the entire sequence (matching end to end is a global alignment). BLAST will work for both nucleic acid and protein sequences; just be sure to choose the correct method.

Try searching with the middle sequence line of RNase T1 above (the NN ... NN line) and see what the results are (I'll not show an example because the results will change over time, but the basic description will still hold) for that query.

I'm willing to wait ... it might take a few minutes for the search, particularly during the business day in the United States.

Alright. The highest score (first line in the table) is RNase T1. Found it! The query coverage is 100%, which means that all 42 amino acids were matched in the result sequence. The Expect (E) value is a very small number (note the negative sign on the exponent). This value represents the number of alignments scoring this well or better that you would expect to find by chance, given the length of the query, the size of the database and a few other parameters. Very small values are good; the match is unlikely to happen at random. The percent identity tells you how many letters match exactly. That value is 100% for the first result. You can be pretty confident that your query was a sequence from RNase T1.
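To get a feel for why E values become so small, here is a rough Python sketch based on the standard relationship between the bit score S' and the Expect value, E = m × n × 2^(−S'), where m is the query length and n is the total length of the database. The numbers in the example are made up for illustration. Real BLAST uses corrected "effective" lengths, so do not expect this to reproduce the values on your results page.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def percent_identity(identical_positions, alignment_length):
    """Share of aligned positions where the two letters match exactly."""
    return 100.0 * identical_positions / alignment_length


# Hypothetical numbers: a 42-residue query against a database holding
# one hundred million residues in total (both values are assumptions).
query_length = 42
database_length = 100_000_000

for bit_score in (20, 40, 80):
    e = expect_value(bit_score, query_length, database_length)
    print(f"bit score {bit_score}: E = {e:.2e}")

print(percent_identity(42, 42))
```

Every additional bit of score cuts the E value in half, which is why a strong match ends up with a large negative exponent.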

The next couple results mention RNase T1 or Aspergillus oryzae, the fungus which makes RNase T1.

Scroll down toward the end of the table. Scores have dropped, the Expect values have increased (but are still very small) and the percent identity has dropped. You should also note that the enzyme name and the organism may be different: you've found related proteins from other species. Search has worked. Just like when you discover a new person on the social network of your choice with the same affinity you have toward cats.

While you're down here, let's look at one of the actual alignments. Click on one alignment (the link on the left of the line that looks like a protein name) where the percent identity is lower, around eighty percent, give or take five percent. Following that link will bring you to the alignment. The upper portions repeat the information that you've already seen. In the alignment, the upper line is your query sequence, the lower line is the sequence matched from the sequence database and the line between is what is called the consensus sequence. If you've ever worked in a group, consensus is that thing you come to when everyone finally agrees on one thing. The consensus sequence shows letters where the amino acids match, plus signs (+) where the amino acids are chemically similar but not identical and no letter where the two amino acids are very different. The counts and percentages are in the header above the alignment as "Identities," "Positives" and "Gaps" (gaps are positions where a dash was inserted into one sequence to keep the rest lined up). Looking at the alignment often illustrates which parts of the sequence are common between the proteins. For small enzymes like these, you might identify amino acids essential for binding, catalysis or structure.
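The middle line of an alignment is easy to build yourself once two sequences are already aligned. The Python sketch below is a simplification. BLAST decides on a plus sign from a substitution matrix score, while this sketch uses a small hand-picked set of similarity groups (`SIMILAR_GROUPS`, an assumption for this illustration). The subject sequence in the example is invented, not a real database hit.

```python
# Hand-picked groups of chemically similar amino acids (a simplification;
# BLAST itself uses substitution matrix scores instead of fixed groups).
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"),
                  set("DE"), set("NQ"), set("ST")]


def midline(query, subject):
    """Build the line shown between two aligned sequences of equal length."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    marks = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            marks.append(" ")          # gap: nothing to compare
        elif q == s:
            marks.append(q)            # identical: show the letter
        elif any(q in group and s in group for group in SIMILAR_GROUPS):
            marks.append("+")          # similar but not identical
        else:
            marks.append(" ")          # very different
    return "".join(marks)


def summarize(query, subject):
    """Return (identities, positives, gaps); positives include identities."""
    middle = midline(query, subject)
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    positives = identities + middle.count("+")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identities, positives, gaps


query = "NNYEGFDFSV"      # first ten letters of the RNase T1 query line
subject = "NNYDGF-FTV"    # invented partner sequence for illustration
print(query)
print(midline(query, subject))
print(subject)
print(summarize(query, subject))
```

Note that "Positives" counts identical positions as well as the plus signs, which is why it is never smaller than "Identities."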

The current issue of Bioinformatics is a good place to start in understanding current questions in the field.

Being that Grove City College students are an active lot, they tend to spend a fair amount of time in locker rooms. Sometimes they remember to wear their shower sandals. Sometimes they don't. Your roommate has been complaining of dry skin between the toes and a burning desire to scratch during humanities lectures (as bad a social faux pas as not continuing to sit in the seat you chose on the first day of that humanities class). Since you are beginning biochemistryⓇ, you have cultured a sample from between their toes, isolated DNA from that culture and sequenced a small section. The resulting sequence is below.
```
CATCAGGTGTAACAGTAACATATAGTCATCATTCATTAA
```
Knowing that the primary cause of athlete's foot is infection by the dermatophyte fungus Trichophyton rubrum (a bit more than seventy percent of cases), does your roommate have a simple case of athlete's foot? Are you certain? Explain. If you are uncertain, how might you modify your experiment to increase your confidence?

The methods of BLAST make a local sequence alignment which allows you to find small sections of sequences which match. A closely related (mathematically) technique is global sequence alignment. This method takes two or more sequences, usually known to have some relationship, and lines them up end to end, inserting gaps where needed, until the best overall alignment is found. The global alignment is most illustrative in showing features which are shared across a group of sequences. For example, active site residues tend to be conserved in the same enzyme sampled from across a wide range of species, and the same holds for other features like binding sites or locations of phosphorylation. Below are a few sequences of the protein cytochrome c (electron transport agent in the mitochondria) from animal species.
```
>NP_061820.1 cytochrome c [Homo sapiens]
MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLE
NPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
>NP_001039526.1 cytochrome c [Bos taurus]
MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLE
NPKKYIPGTKMIFAGIKKKGEREDLIAYLKKATNE
>CAA25899.1 cytochrome c [Mus musculus]
MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAAGFSYTDANKNKGITWGEDTLMEYLE
NPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
>NP_001157486.1 cytochrome c [Equus caballus]
MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWKEETLMEYLE
NPKKYIPGTKMIFAGIKKKTEREDLIAYLKKATNE
>NP_001385228.1 cytochrome c [Gallus gallus]
MGDIEKGKKIFVQKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAEGFSYTDANKNKGITWGEDTLMEYLE
NPKKYIPGTKMIFAGIKKKSERVDLIAYLKDATSK
>NP_001123442.1 cytochrome c [Sus scrofa]
MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLE
NPKKYIPGTKMIFAGIKKKGEREDLIAYLKKATNE
```
Copy all of the above text (it is just six sequences in FASTA format, perfectly legal) and paste it into the "Query Sequences" text box at COBALT (Constraint-based Multiple Alignment Tool) and press the "Align" button. Wait a while. When you see results, scroll to the bottom "Alignments" section. Set "View Format" to "Expanded" and "Conservation Setting" to "Identity." Now the majority of residues will be red (identical) and a few will be blue (at least one amino acid differs across all six sequences). Include that alignment in your answer. Are these proteins similar? Explain. If you were asked to guess about which regions in this sequence were critical to function, which would it be? Highlight that region on your alignment.
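If you are curious what a global alignment does under the hood, here is a short Python sketch of the classic Needleman-Wunsch dynamic programming method for two sequences. It is not what COBALT runs (COBALT aligns many sequences at once with a more elaborate approach). The simple scoring of +1 for a match, −1 for a mismatch and −2 for a gap is an assumption chosen for readability. The example uses the first twenty residues of the human and chicken cytochrome c sequences above.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing but gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up or left.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b)), score[-1][-1]


human = "MGDVEKGKKIFIMKCSQCHT"    # first 20 residues, Homo sapiens
chicken = "MGDIEKGKKIFVQKCSQCHT"  # first 20 residues, Gallus gallus
top, bottom, total = needleman_wunsch(human, chicken)
print(top)
print(bottom)
print("score:", total)
```

Because every cell in the table is filled, the cost grows with the product of the two sequence lengths. That is fine for a pair of small proteins and is one reason database searches rely on faster local methods like BLAST.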

Serum albumin is a key blood serum protein responsible for osmotic balance and the transport and retention of small molecules like hormones. Serum albumin is a rather large protein, and with a protein that large it is challenging to recognize key features from the sequence alone. Choose either human serum albumin or bovine serum albumin and run the ProtParam tool on the sequence (hint: ProtParam doesn't take FASTA as input, just the naked sequence). Report the length in amino acids, the molecular weight, the amino acid composition and the chemical formula. Would you like to balance a reaction involving this protein as a reactant? Why or why not?
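Several of the numbers ProtParam reports can be estimated directly from the sequence. The Python sketch below counts residues and sums average residue masses, adding one water for the free ends of the chain. It illustrates the idea and is not a description of how ProtParam itself is implemented. The function names are assumptions, and the RNase T1 sequence from earlier stands in for albumin to keep the example short.

```python
from collections import Counter

# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528


def composition(sequence):
    """Count how many times each amino acid appears."""
    return Counter(sequence)


def molecular_weight(sequence):
    """Sum the residue masses and add one water for the chain termini."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


rnase_t1 = ("ACDYTCGSNCYSSSDVSTAQAAGYKLHEDGETVGSNSYPHKY"
            "NNYEGFDFSVSSPYYEWPILSSGDVYSGGSPGADRVVFNENN"
            "QLAGVITHTGASGNNFVECT")

print("length:", len(rnase_t1))
print("molecular weight (Da):", round(molecular_weight(rnase_t1), 1))
for amino_acid, count in composition(rnase_t1).most_common(5):
    print(amino_acid, count)
```

This simple sum ignores disulfide bonds and any modifications, so treat the result as an estimate.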

There are many fundamental things that we can learn about one protein: sequence, molecular weight or net charge, for example. When it comes to protein function, however, there is often one thing, above all, that helps you understand a protein: knowing what other proteins (or other molecules) it interacts with, binds to, uses as a substrate or makes as a product. That map of interactions is detailed in the STRING database. For our examples, each of you has been assigned an enzyme in central metabolism (glycolysis and the citric acid cycle) from the human. Follow the enzyme link with your name and explore related enzymes.

Student	Enzyme
Carter Anderson	Hexokinase
Frances Baksa	Glucose-6-phosphate isomerase
Julia Bianchin	Phosphofructokinase 1
Iain Brown	Aldolase
Sharen Buehler	Glyceraldehyde 3-phosphate dehydrogenase
Jonah Chen	Phosphoglycerate kinase
Noah Cramer	Phosphoglycerate mutase
Gabrielle Farcas	Enolase
Mark Hale	Pyruvate kinase
Simon Hershberger	Citrate synthase
Caleb Kirk	Aconitase
Victoria Leak	Isocitrate dehydrogenase
Bayleigh Miller	Succinyl-CoA synthetase
Truman Poole	Succinate dehydrogenase
Wade Springer	Fumarase
Cristiana Terhune	Malate dehydrogenase

Alright, what am I looking at with STRING? The first thing to understand is that this database is providing an interaction graph, or network. It is, more or less, just like the social network behind Facebook, Instagram or YouTube. The nodes (circles, your enzyme is a red one) represent the enzymes. The edges (lines, color denotes the connection type: experimental evidence, genetic connection, entered directly or automatically found) are the connections. Watch a cute kitten video, the machine assumes you like kittens and connects you to other users who make kitten videos, or who have just watched kitten videos in the past. It's about the same thing for enzymes.

You can trace the network of connections. Scroll to the bottom and select the node with the highest score (top of the table at the bottom of the page, the score is on the right, ordered highest to lowest). Record that enzyme name (the acronym is a couple letters and sometimes a number, the name is the long set of characters before the semicolon). Click on the score to see the evidence for the connection. Which line of evidence (categories on the left) is the strongest (score closest to one)? Is that line of evidence surprising? Would you have predicted that winner before looking? Explain.

Close the score window and follow the "∑ Analysis" link (center, below the network diagram). Scroll down to the "Cellular Component" section. Is your enzyme found in the cytosol or the mitochondria?

Wanna have some fun? Back to the grey bar under the network and click on the "+ More" button somewhere between ten and twenty times. The network is large, but after a while it becomes more than you can easily come to terms with in one graphic. But include a screenshot anyway.
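The nodes-and-edges picture is simple to represent in code. Here is a small Python sketch of a scored interaction network, with a lookup for the highest-scoring partner and a way to grow the view outward step by step. The edges and scores in `EDGES` are invented for illustration and are not taken from STRING, and the function names are assumptions of this sketch.

```python
from collections import defaultdict, deque

# Invented edges and scores for illustration only (NOT real STRING data).
EDGES = [
    ("Enolase", "Pyruvate kinase", 0.99),
    ("Enolase", "Phosphoglycerate kinase", 0.97),
    ("Pyruvate kinase", "Citrate synthase", 0.80),
    ("Citrate synthase", "Aconitase", 0.98),
]


def build_network(edges):
    """Store an undirected graph as node -> {partner: score}."""
    network = defaultdict(dict)
    for a, b, score in edges:
        network[a][b] = score
        network[b][a] = score
    return network


def strongest_partner(network, node):
    """Return the (partner, score) pair with the highest score for a node."""
    return max(network[node].items(), key=lambda item: item[1])


def reachable(network, start, max_steps):
    """Breadth-first search: every node within max_steps edges of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if distance == max_steps:
            continue  # do not expand beyond the requested radius
        for neighbor in network[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return seen


network = build_network(EDGES)
print(strongest_partner(network, "Enolase"))
for steps in (1, 2, 3):
    print(steps, sorted(reachable(network, "Enolase", steps)))
```

Each extra step pulls in the neighbors of the neighbors, which is why a network that keeps expanding quickly becomes too crowded to read.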