PS09
Phylogenetics & Sequencing
Due 11 December, 2025
1
The oxygen-binding protein of skeletal muscle is myoglobin. The sequences of myoglobin from several sea mammals are available in the file myog-sea.gz. These sequences represent several whale and dolphin species.
Use these sequences for phylogenetic analysis to answer the age-old question, "Is the orca a whale that kills (killer whale), or is it a dolphin that kills whales (killer of whales)?"
To do this, perform the multiple alignment of the sequences and draw the alignment tree with clustalw. Be sure to report the tree in beautified Newick format. Use drawtree from the phylip package to draw an unrooted visual tree as well. This should support your interpretation of the Newick tree. When running drawtree, use /usr/local/bin/phylip/font1 as your font for drawing the tree.
2
"High ethanol tolerance is an exquisite characteristic of the yeast Saccharomyces cerevisiae, which enables this microorganism to dominate in natural and industrial fermentations."
— Swinnen, et al. (2012)
(emphasis added)
For clarity, the authors of this study were interested only in the alcohol tolerance of yeast strains.
Alcohol tolerance is observed in humans (with a good metabolic explanation), but alcohol is a poison that can be avoided. Excessive alcohol consumption leads to a variety of maladies. Even the Cleveland Clinic advises strongly against the BORG. Consider this Christmas story a warning (the article does not detail on what charges the sheep were held).
**Public Service Announcement Over**
In particular, Swinnen, et al. were interested in the basis of alcohol tolerance in several strains of yeast used in an industrial setting. The unique feature of these strains is that they tolerate ethyl alcohol concentrations of seventeen percent (typical yeast strains tolerate only 10–12 percent, the typical maximum alcohol concentration in a non-distilled alcoholic beverage). The authors wanted to identify the genetic basis for this polygenic trait, so they turned to whole-genome sequencing methods in order to get all the answers in one experiment.
Their raw reads have been deposited in the Sequence Read Archive (SRA) at NCBI and can be retrieved from that database. The SRA handles raw data a bit differently: it stores the data in compressed binary formats to conserve space and bandwidth. A suite of programs called sra-tools transforms the served data back into a variety of formats that you are familiar with, such as FASTQ. To get the data from the sequencing of the most ethanol-tolerant strain (SRR403105, 1 run, 5.9 million spots, 1.2 gigabases), try the following commands.
[user@451]$ wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR403105/SRR403105
[user@451]$ fastq-dump --split-spot SRR403105
[user@451]$
Verify that you have generated FASTQ data for one run. What is the length of each read in bases?
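One quick way to check read lengths is to tally the length of the sequence line in every FASTQ record. The sketch below is only an illustration: the function name fastq_read_lengths is made up here, and it assumes plain four-line records (header, bases, '+', qualities) with no wrapped sequence lines.

```shell
# Tally read lengths in a FASTQ file.
# Assumes 4-line records: header, bases, '+', qualities (no line wrapping).
# Prints one line per distinct length: <length><TAB><number of reads>.
fastq_read_lengths() {
    # Line 2 of every 4-line record holds the bases.
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) printf "%d\t%d\n", len, n[len] }' "$1" | sort -n
}
```

If fastq-dump wrote its output to SRR403105.fastq (an assumption about the output name), then `fastq_read_lengths SRR403105.fastq` lists each read length seen and how many reads have it. A single output line means every read has the same length.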
Align these data to the Saccharomyces cerevisiae reference genome Release 64 and call all the variants (hint: please allow at least thirty minutes on the wall clock for all the required calculations).
What is a typical read depth for a base in this experiment? Feel free to estimate an average from the first couple of variants in the variant call file; that will be close enough, since the read depth will, of course, vary for each base.
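Many variant callers record the read depth at each site as a DP= entry in the INFO column (column 8) of the variant call file. Assuming your caller does the same, the sketch below averages DP over the first few variant records. The function name mean_depth and the default of ten records are choices made for this illustration only.

```shell
# Average the DP= value from the INFO column of the first N variant records.
# Assumes a tab-separated VCF whose INFO column (column 8) carries a DP= entry.
# Usage: mean_depth FILE [N]   (N defaults to 10)
mean_depth() {
    awk -F '\t' -v max="${2:-10}" '
        /^#/ { next }            # skip header lines
        seen >= max { exit }     # stop once max records have been read
        {
            seen++
            # INFO is a semicolon-separated list of KEY=VALUE entries.
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^DP=/) { sum += substr(kv[i], 4); count++ }
        }
        END { if (count) printf "%.1f\n", sum / count }' "$1"
}
```

For a variant call file named snps.vcf, `mean_depth snps.vcf 20` prints the mean depth over the first twenty variants. If nothing is printed, no DP= entry was found, so inspect the INFO column by eye.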
At this point, you would normally commit months to looking at each difference, mapping it to a gene (broadly defined, since the variation could lie in a sequence promoting expression) and trying to rationalize the effect. Since Christmas is coming, let's take the easy way out and just answer one of the big questions. How many variants are found in each of the yeast chromosomes?
OK, one more bit to help you out (season of giving, and all). The variant call file (assume it's named snps.vcf) uses tabs to separate columns, not spaces. Thus, extracting the records for chromosome I takes a bit more work than you might initially think. I use egrep with \b, which matches the word boundary at the end of the chromosome name regardless of the identity of the whitespace that follows (tab or space). Try something like the following for chromosome I (not the full solution).
[user@451]$ egrep '^I\b' snps.vcf
[user@451]$
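An alternative to the egrep hint above is to let awk split each record on tabs and compare the whole first column, which avoids boundary matching altogether (chromosome I can never be confused with II or III). The sketch below tallies records for every chromosome name at once. The function name variants_per_chrom is hypothetical, and it assumes every line not starting with '#' is one variant record.

```shell
# Count variant records per chromosome (column 1), skipping header lines.
# Assumes a tab-separated VCF; prints <chromosome><TAB><count>.
variants_per_chrom() {
    awk -F '\t' '!/^#/ { n[$1]++ }
                 END { for (c in n) printf "%s\t%d\n", c, n[c] }' "$1" | LC_ALL=C sort
}
```

Running `variants_per_chrom snps.vcf` prints one line per chromosome name found in the file. The names are sorted as plain text, so Roman numerals will not appear in chromosome order.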
Does your evidence support polygenic inheritance for ethyl alcohol tolerance? Explain.
Last updated at 08:39:49 on 2025-12-04.