# Math 127 - Fall 2016

## Mathematical and Computational Methods in Molecular Biology

Figure: EM images of Zika and a Zika 3D model by EM.

Structure and figures from Sirohi et al.

• Instructor: David Dynerman
• Lectures: Tuesdays & Thursdays, 0800-0930 in Barrows 122
• Question? Post it on bCourses!
• Mail: dynerman@berkeley.edu
• Office Hours: Tues. 0945-1045, Weds. 0830-1030 in 943 Evans

The goal of this course is to introduce some basic mathematical tools that are used to study biological problems. We will do this by learning about two such problems: inferring evolutionary trees from living species and determining the 3D structure of proteins by x-ray crystallography and cryo-electron microscopy (cryo-EM).

These problems will give us the opportunity to see two beautiful examples of how mathematics can be rigorously applied to study interesting and important scientific problems. Here are some of the topics we will encounter during the semester:

1. Rigorously modeling experimental science
2. Modeling problems using trees
3. Solving optimization problems
4. Modeling problems using Markov chains
5. Fourier Series and the Discrete Fourier Transform
7. Basic concepts in solving inverse problems
8. Writing computer simulations

Note: The above topics have applications in many other areas of pure & applied mathematics, physics, chemistry and biology, so even if you do not study any further math biology, my hope is that you will find the course useful in your future work.

## Syllabus

Reading should be completed before the indicated lecture.

| Date | Topic | Reading | Notes |
|---|---|---|---|
| Aug 25 | Course overview; Start of Unit 1: the problem of phylogenetic inference | | |
| Aug 30 | Basic concepts about DNA | Phylo 1.1-1.3 | |
| Sep 1 | Trees | | Python Tutorial 1 |
| Sep 6 | Maximum Parsimony | Phylo 2.1-2.4 | |
| Sep 8 | Maximum Parsimony | | HW1 Due, Python Tutorial 2 |
| Sep 13 | Maximum Parsimony conclusion; Probability review; Markov chains | Phylo 5.1-5.2, 6.1-6.3, 7.1 | |
| Sep 15 | Markov chains; transition matrices; linear algebra review | Phylo 5.3-5.4 | Python Tutorial 3 |
| Sep 20 | Perron-Frobenius Theorem; Properties of Markov matrices | Phylo 5.5-5.6 | |
| Sep 22 | Modeling DNA mutation using Markov chains | Stats 3.1, 3.2 | HW2 Out |
| Sep 27 | The Jukes-Cantor Model; Jukes-Cantor Distance; Introduction to distance based inference | Stats 2.1-2.4 | |
| Sep 29 | Four point condition; neighbor joining | SL 1.3, 1.4; Optional SL 1.1, 1.2 | |
| Oct 4 | Neighbor joining | | |
| Oct 6 | Start of Unit 2: DNA translation; the problem of protein structure determination; protein structure | | HW2 Due |
| Oct 11 | Protein structure; protein energetics and folding | | |
| Oct 13 | Scattering experiments; the Fourier Transform | | |
| Oct 18 | Fourier Series | | HW3 Due. Project proposals due |
| Oct 20 | Discrete Fourier Transform | | |
| Oct 25 | Discrete Fourier Transform | | |
| Oct 27 | Mathematical model for X-ray crystallography | | |
| Nov 1 | Reconstructing proteins from x-ray diffraction data | | |
| Nov 3 | The cryo-EM revolution; mathematical model for biological electron microscopy | | Project Week 1 deliverables due (Fri) |
| Nov 8 | Reconstructing proteins from EM projections | | |
| Nov 10 | Directly estimating orientations of EM projections | | Project Week 2 deliverables due (Fri) |
| Nov 15 | Recovering 3D structures by back-projection | | |
| Nov 17 | Back-projection | | Project Week 3 deliverables due (Fri) |
| Nov 22 | Reconstruction in practice: iterative methods | | |
| Nov 24 | Thanksgiving | | Project Week 4 deliverables due (Sun Nov 27) |
| Nov 29 | Reconstruction by maximum likelihood | | |
| Dec 1 | Reconstruction by maximum likelihood | | Project Week 5 deliverables due (Fri) |
| Dec 6 | RRR Week | | |
| Dec 8 | RRR Week | | Project due. Project demo night. (Fri) |

Dec 14 Final Exam
3pm-6pm, location TBA

## Homework

### Homework 1

You may work on this assignment in groups, but each student must hand in a solution that they themselves wrote up. Make sure you give the names of everyone you worked with and properly cite any sources you use (internet Q&A sites, books, etc.).

Do not hand in a first draft of your solutions. Rewrite your first draft neatly, checking for errors in reasoning and language.

1. Give the precise mathematical definitions of:
    1. a connected graph,
    3. a path in a graph,
    4. a cycle in a graph,
    5. a tree $$T$$.
2. For this problem, use two different colors of ink.
    1. Draw your favorite tree with 5 vertices. Pick your favorite two vertices in this tree and write down a path between the vertices. Draw the path on the tree in a second color of ink.
    2. Draw your favorite graph that has 5 vertices AND has no cycles AND is disconnected. Using your picture, prove that there is not a path between every pair of vertices.
    3. Draw your favorite graph that has 5 vertices AND has a cycle AND is connected. Using your picture, prove that there is not a unique path between every pair of vertices.
3. Prove the following theorem:

Theorem. If $$v_0$$ and $$v_1$$ are two vertices in the tree $$T$$, then there is a unique path from $$v_0$$ to $$v_1$$.

Note: Your solution to Problem #2 above shows why we must require $$T$$ to be a tree: the theorem is simply not true if $$T$$ is not a tree.

4. How many possible codons are there? How many amino acids are found in organisms on Earth? Give an explanation that seems plausible to you as to why these numbers are different.
5. Phylo Exercise 2.7.3
6. Phylo Exercise 2.7.4
7. Phylo Exercise 2.7.9
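
If you would like to experiment with the theorem in Problem 3 on a computer, here is a minimal sketch. It is not part of the assignment and is no substitute for a proof. It assumes that a path never repeats a vertex (check this against the definition you give in Problem 1), and the names `simple_paths`, `tree` and `graph_with_cycle` are made up for this illustration. A graph is stored as a dictionary mapping each vertex to a list of its neighbors.

```python
def simple_paths(graph, start, goal, path=None):
    """Return every path from start to goal that never repeats a vertex."""
    path = [start] if path is None else path
    if start == goal:
        return [path]
    found = []
    for neighbor in graph[start]:
        if neighbor not in path:  # never revisit a vertex already on the path
            found.extend(simple_paths(graph, neighbor, goal, path + [neighbor]))
    return found


# A tree with 5 vertices (connected, no cycles).
tree = {
    "a": ["b", "c"],
    "b": ["a", "d", "e"],
    "c": ["a"],
    "d": ["b"],
    "e": ["b"],
}

# The same graph with one extra edge between c and d, which creates a cycle.
graph_with_cycle = {
    "a": ["b", "c"],
    "b": ["a", "d", "e"],
    "c": ["a", "d"],
    "d": ["b", "c"],
    "e": ["b"],
}

print(simple_paths(tree, "d", "c"))              # [['d', 'b', 'a', 'c']]
print(simple_paths(graph_with_cycle, "d", "c"))  # [['d', 'b', 'a', 'c'], ['d', 'c']]
```

In the tree there is exactly one path from d to c, while the extra edge produces a second one. Trying other pairs of vertices, or your own graphs from Problem 2, is a quick way to build intuition before writing the proof.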

### Homework 2

1. Phylo Exercise 3.8.1
2. Phylo Exercise 3.8.2
3. Phylo Exercise 3.8.5
4. Stats Section 3.6 Exercise 3.3
5. Stats Section 3.6 Exercise 3.5
6. Stats Section 2.6.1 Exercise 2.9
7. Stats Section 2.6.2 Exercise 2.23
8. Stats Section 2.6.2 Exercise 2.26

For the remaining exercises you will need to use the free software packages numpy, scipy and matplotlib. See here for instructions on installing these on your own computer. If you do not have access to a computer, please contact me.

Before starting on the following problems, work through some exercises in SL 1.3 and SL 1.4 on your own.

1. Download this dataset containing GRE scores for 33,282 students. Using numpy and matplotlib, plot histograms of the GRE score datasets.
2. Write a function plot_normal(mu, sigma) that superimposes a normal curve with mean mu and standard deviation sigma over the top of your histogram. Hand in the source code to your function.
3. Using your plot_normal() function, plot a number of different normal curves. Try to match the histogram. Plot the normal curve best matching the histogram. What are the mean and standard deviation of this curve? Hand in plots with your best-fitting normal curves and the mean and standard deviation you deduced.

## Course Textbooks

Required readings will be taken from the following sources. You are not required to personally purchase any of these books.

• Phylo The Mathematics of Phylogenetics by Allman and (John A.) Rhodes
• Stats OpenIntro Statistics
• SL Scipy Lectures
• Bio The Molecules of Life: Physical and Chemical Principles by Kuriyan, Konforti, and Wemmer
    • Not free. This is the main textbook for MCB100 at Berkeley, which is a large course with many students, so there are many copies of this book on campus. Before you purchase a copy, you should:
        2. use a reserve copy of the book in the library (UCB Chemistry Library Reserves Desk) for 2 hours at a time,
        3. borrow the book from a friend who has taken MCB100,
        4. purchase the five chapters we will cover from the publisher for $45,
        5. purchase or rent the full book (19 chapters - Used: $60+, New: $65-$109) Cal bookstore, Amazon.
• Xtal Crystallography Made Crystal Clear by (Gale) Rhodes
• EM Electron Crystallography of Biological Macromolecules by Glaeser, Downing, DeRosier, Chiu, and Frank
    • Not free. Please do not purchase at this time - I will update with more information once we start covering the topics in this book.

## Course Format

In this course you will be required to read, do homework (including writing code; see the Writing code section below), take some quizzes, complete a big project, and take a final exam.

• Readings: Each lecture will have assigned reading meant to be completed before the lecture.
• Homework: In the first part of the course, you will have several individual homework assignments on phylogenetic inference.
• Quizzes: There will be a handful of straightforward short unannounced quizzes on: your readings, past homework, mathematical definitions we've been using, etc.
• Project: In the second half of the semester you will work in groups of 3-4 on a substantial project that will make up a large portion of your grade. This project will involve simulating electron microscope images of a 3D protein, and then recovering the 3D structure from the simulated images.
• Final: There will be a final exam during our university-assigned exam slot.

## Project

During the second half of the semester you will complete a substantial group project on protein reconstruction, including simulating 2D images of a protein and then reconstructing a 3D model of the protein from these images.

Please carefully read the project page for detailed information on the project.
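
To give a rough feel for what "simulating 2D images" can mean, here is a toy sketch. It is only an illustration under simplifying assumptions, not the project specification (the project page is the authority). It treats an image as the 3D density summed along one grid axis, and it ignores noise, arbitrary viewing directions and everything else a real microscope does. The density itself (two Gaussian blobs) and the names `toy_volume` and `project` are made up for this example.

```python
import numpy as np


def toy_volume(n=32):
    """A made-up 3D density: two Gaussian blobs on an n x n x n grid."""
    axis = np.linspace(-1.0, 1.0, n)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    blob1 = np.exp(-((x - 0.3) ** 2 + y ** 2 + z ** 2) / 0.02)
    blob2 = np.exp(-((x + 0.3) ** 2 + (y - 0.2) ** 2 + (z + 0.4) ** 2) / 0.02)
    return blob1 + blob2


def project(volume, axis=2):
    """Sum the density along one grid axis to get a 2D image."""
    return volume.sum(axis=axis)


volume = toy_volume()
image = project(volume)
print(volume.shape, image.shape)  # (32, 32, 32) (32, 32)
```

Projecting along a different axis (`axis=0` or `axis=1`) gives a different 2D view of the same 3D object, which hints at why many images are needed to recover the 3D structure.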

## Writing code

This course will require everyone to write some computer code: you will write snippets of code on several homework assignments, and work in groups on a project that will involve a substantial amount of programming. Today, math biology is inseparably linked with programming: some programming ability is basic literacy in this field. Writing code is necessary not only for making biological conclusions (e.g., analyzing a genome with $$3 \times 10^9$$ base pairs to infer an evolutionary tree) but is also useful for forming and testing mathematical hypotheses (e.g., is the Fourier basis appropriate for the data I observed in this experiment?). Moreover, I believe that writing short snippets of code to play and experiment with the topics we cover is a fantastic way to truly master them.
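
As an example of the kind of short experiment I have in mind, the sketch below applies a Jukes-Cantor style Markov matrix over and over to a starting distribution on the four DNA bases and prints where it ends up. This is an illustration under assumptions, not course-provided code: the parametrization (each base stays put with probability `1 - a` and changes to each of the other three with probability `a / 3`) may differ from the convention in our readings, and the names `jukes_cantor_matrix` and `evolve` are made up here.

```python
import numpy as np


def jukes_cantor_matrix(a):
    """4x4 Markov matrix: stay with probability 1 - a,
    change to each of the other three bases with probability a / 3."""
    off_diagonal = np.ones((4, 4)) - np.eye(4)
    return (1 - a) * np.eye(4) + (a / 3) * off_diagonal


def evolve(p0, M, steps):
    """Apply the Markov matrix M to the distribution p0 `steps` times."""
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = M @ p
    return p


M = jukes_cantor_matrix(0.1)
# Start with all probability on the first base and take 200 steps.
p = evolve([1.0, 0.0, 0.0, 0.0], M, 200)
print(p)  # every entry is very close to 0.25
```

Changing `a`, the starting distribution, or the number of steps and watching what happens is a good warm-up for the lectures on Markov matrices and the Perron-Frobenius Theorem.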

In my experience most students have written code in another course and will be able to manage this course's programming requirements. For example, if you've ever written a Python or MATLAB function for a course, you should be fine. If you've used R to analyze a dataset, you should be fine. If you've taken CS 61A, the first introductory semester course in programming, you're very well prepared.

#### "I've never written any code!" -some student, probably

If you've never written any code, don't panic. I will hold three optional evening tutorials at the start of the semester on writing basic Python code that should provide a good foundation for what you'll need to know. You should plan to spend some extra time on the course at the start of the semester to get your feet wet. I will also provide extra resources on introductory programming and am willing to help you outside of class anytime during the semester if you need it.

Writing code is a basic skill today. My hope is that developing some programming literacy in this course will be very useful to you in the future.

## Workload: 9 hours a week outside class

This is an ambitious course - in only 14 weeks we will cover a lot of interesting material. The University determines course requirements expecting that each unit will require you to work 3 hours a week, including class time. Math 127 is a 4-unit course, which comes to 12 hours a week in total. Our two 90-minute lectures account for 3 of those hours, so you should expect to spend 9 hours a week outside class working on this course's readings, homework and projects.

My goal is to introduce you to two exciting topics at the interface of mathematics and biology - my goal is certainly not to ruin your life with a crushing workload. You should expect a challenging, intellectually engaging course, and you should be prepared to work up to 9 hours a week outside class, especially near project deadlines.

Important: If you find yourself consistently working more than 9 hours/week on this course, please contact me so we can fix it.