Math and Data Science

MATP 4961/6961/CS 4961/6961 Spring 2006

The ability to collect large amounts of data and
to gain understanding from it is becoming essential in
science, engineering, and business. In
the past, data was generated to investigate specific hypotheses. With the explosion in the capacity to
generate, store, and share data, data is now frequently collected
independently of any hypothesis. In
this course, we will examine how such data can be transformed into information
for decision making. We will examine
how mathematical models of data from statistics and machine learning can be
used for tackling compelling problems in science, engineering and business such
as fighting infectious diseases, designing drugs, and screening spam. Students will conduct research projects in
data science. The course will be
targeted toward advanced undergraduates but graduate credit may be obtained through
extra assignments and a more advanced project.

Prerequisites: Multivariable Calculus and a course in
probability and/or statistics, or instructor permission.

**Instructor:** Kristin Bennett

Bennek
at rpi dot edu

Office Hours:
Tuesday 3 to 5, AE327 or by appointment

Place:
Monday and Thursday 2 to 4 Science
2C13

**Evaluation:** Graduate:

Homework/Commentaries 28%, Class Presentation 14%,
Research Presentation 14%, Research Project 28%, Participation 16%.

Undergraduate:
Homework/Commentaries 28%, Research Presentation 20%, Research Project
28%, Participation 24%.

**Homework:** Your primary homework is to read the weekly reading assignments and be prepared to discuss them in class. We will have one or two computer/homework exercises. Computer exercises will be done in Matlab, so we will presume that students have a very basic knowledge of Matlab. See the tutorial below.

In addition, undergraduate students are required to hand in 4 commentaries during the course of the semester. Graduate students are required to hand in 5 commentaries or short discussion papers during the course of the semester. Commentaries addressing the coming week's reading may be turned in any Monday. For a commentary, the student should read one or more of the research papers and/or chapters to be discussed that week and prepare a one-page paper (one and only one page, typeset with a minimum 10-point font and 1-inch margins). The paper should:

· discuss an important idea/result in a paper,

· explain why the idea/result is important,

· give thoughts on possible limitations of the work and/or how the work could be extended or applied.

The paper should be your analysis of the
paper, not a simple restatement of the contents of the paper. You can assume the reader has read the paper
and is familiar with its contents. Do
not simply restate the abstract. Use
your own words. Correct grammar is
important and will constitute a major portion of the paper grade. Commentaries must be typed. The
final grade will be based on the best 4 (for undergrads) or 5 (for grads)
commentaries handed in, so you may submit as many as you like. Note that some commentaries are mandatory. See syllabus.

Your
grade will be based on how correctly and completely you address the points
above as well as on readability (clarity, flow, grammar, spelling, punctuation,
etc.). Here is a rough grading guide
(grades are 0-3):

· 3: Excellent. Thoughtful, clear use of concepts, clear evidence of incorporating ideas from the reading, creative thoughts on limitations and/or extensions, all points are developed and supported, all requirements above adhered to. Minimal summarizing, maximal presentation of your thoughts. Few or no mechanical problems (grammar, flow, etc.).

· 2: Good. Thoughtful and clear, but connections with the reading and thoughts on limitations and/or extensions not as strong as for a “3”. Most points reasonably well-supported, most requirements adhered to. May replace some presentation of your thoughts with some summarizing. May have some mechanical problems.

· 1: Adequate. Basic response with little or no depth, and very little evidence of careful reading of the text or creative thought. Requirements above are probably not adhered to, and may have substantial mechanical problems.

**Project:** Students will do a related research project. The research may be related to your thesis or other work. Undergraduates may reimplement and explore an idea in a published paper. Graduate projects must involve a novel computational and/or theoretical analysis component. Undergraduates may work in teams of 2. Graduate projects should be done individually. Feel free to discuss potential topics with the instructor.

A one to two page project proposal is due on 3/9. A one to two page project status report summarizing your project to date, any changes in project goals, and any potential difficulties is due on 4/6. The final project paper is due on 5/3. Note that the project will be graded based on the quality of the research investigation and presentation, not on the significance of the final research result. So if you investigate a new approach and it doesn’t work as well as current methods, you can still do well.

**Graduate Presentations:** Each graduate student will lead the presentation and discussion of a research topic. The possible topics are given below. One or two students will be assigned to each date and together will be responsible for that class period. The students will present the main ideas of the topic and lead discussion on that topic. The slides for the presentation will be made available on the course web page. Here the goals are to help the class achieve understanding of the material and to practice your presentation and teaching skills. Undergrads may volunteer to give talks for extra credit, but must declare their intention to do so by the second week of class. Click here to get the evaluation form used for talks.

**Project Presentations:** In our mini-workshop, students will give 20-minute presentations about their research projects. Here the goal is to give a professional-quality research presentation suitable for a conference. Each student's presentation will be given in a mini-symposium held in class on the last four days of class. Click here to get the evaluation form used for talks.

**Participation:** Students are expected to read the assigned readings every day and participate in class discussions. If you don’t understand the readings, bring questions to class.

**Cheating Policy:** Discussion between students on all aspects of the course is encouraged. All papers, homeworks, presentations, and commentaries should be original work. Plagiarism will not be tolerated and will be grounds for failing the course.

**Resources:**

Matlab tutorial :
http://www.math.ufl.edu/help/matlab-tutorial/

Wikipedia: www.wikipedia.org

**Texts:**

J. Ecker and M.
Kupferschmid, *Introduction to Operations
Research*, Krieger, 1991. Several
chapters of this book will be used, so you should buy it.

T. Mitchell, Machine
Learning, McGraw Hill, 1997.

R. Durbin, S. Eddy, A.
Krogh, and G. Mitchison, *Biological
Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids*.

**Class
Schedule (subject to change):**

1.
1/19/06
Challenges in Data Science, with
special focus on drug discovery.

Computers
Replace Petri Dishes,

**http://news.com.com/Vision+Series+Computers+replace+petri+dishes/2030-1070_3-998622.html**

Presentation/Discussion
Leader – Kristin Bennett slides

2.
1/23/06
Why is drug development so expensive?

**About
drug development. FAQ PPDI.com**

**http://www.ppdi.com/corporate/faq/about_drug_development/home.htm**

Merck estimates $2.5B impact from pulling Vioxx
plug,
Julie Appleby and Matt Krantz, **http://www.usatoday.com/money/industries/health/drugs/2004-09-30-merck-cover_x.htm**

**G.
Banik, In silico ADME-Tox prediction: the more the merrier, Current Drug
Discovery, 2004.**

http://www.currentdrugdiscovery.com/pdf/2004/537275.pdf

http://www.argentadiscovery.com/news/pdf/cdd_2003_article.pdf

Presentation/Discussion
Leader - Kristin Bennett slides

**Mandatory
Commentary Due: 1/23**

3. 1/26/06 How
to estimate the similarity of molecules?

Nikolova N., J. Jaworska. Approaches
to Measure Chemical Similarity - a Review. QSAR Comb. Sci. 22, No. 9-10,
1006-1026, 2003.

http://ambit.acad.bg/nina/publications/Similarity%20-%20reprint.pdf

W. Jorgensen, The Many Roles of Computation in Drug Discovery, Science, 2004

http://www.rpi.edu/~bennek/class/mds/JorgensenComputationDrugDiscovery.pdf

ADME-TOX
Outlook, Winter 05.

http://www.admetoxoutlook.com/editions/adme_outlook_nl_i01_winter_05.pdf

Guest
Presentation - N. Sukumar slides

4. 1/30/06 Regression
models: least squares models

http://en.wikipedia.org/wiki/Least_squares

Presentation/Discussion
Leader - Kristin Bennett slides

**Commentary Due**

5. 2/2/06 Linear
programming based models

Linear Programming, Chapter 2, in J. Ecker and
M. Kupferschmid, Introduction to
Operations Research, Krieger, 1991; pay
special attention to pp. 24-29.

Presentation/Discussion
Leader - Kristin Bennett slides

6.
2/6/06 Kernel Methods:

Nonlinear: Chapter 2, “Kernel Methods: an overview”, in J.
Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis.

Presentation/Discussion
Leader - Kristin Bennett

**If you have not
written a commentary by now, consider this one mandatory.**

7. 2/9/06
Support Vector Machine

K. Bennett and C. Campbell, “Support Vector Machines: Hype or Hallelujah?”, SIGKDD Explorations, 2:2, 2000, 1-13.

Background reading:

Nonlinear Programming, Chapter 9, in J. Ecker and
M. Kupferschmid, Introduction to
Operations Research, Krieger, 1991

Presentation/Discussion
Leader – Kristin Bennett slides

8. 2/13/06
Duality

Presentation/Discussion
Leader – Kristin Bennett slides

9.
2/16/06 SVM methods for chemometrics

Presentation/Discussion
Leader - Kristin Bennett slides

10. 2/21/06
Background Mathematics and Computer Lab

HERG
research project --- models in action

Presentation/Discussion
Leader – Kristin Bennett

NOTE: No class on Monday 2/20; class meets on Tuesday 2/21
instead.

11.
2/23/06 Bioinformatics and Gene
Microarrays

A Scientific Primer,

www.ncbi.nlm.nih.gov/About/primer/

Biology 101 -- revisited

http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html What is a cell?

http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html What is a genome?

Bioinformatics

http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html Bioinformatics

http://www.ncbi.nlm.nih.gov/About/primer/microarrays.html Microarray Technology

Presentation/Discussion
Leader - Kristin Bennett

12.
2/27/06
Mathematical Challenges in Bioinformatics

R. Karp, Mathematical Challenges from Genomics and Molecular Biology, Notices of
the AMS, 49(5), 544-553, 2002. http://www.cs.chalmers.se/Cs/Education/Kurser/algfk/karp.pdf

Presentation/Discussion
Leader - John P.

**Commentary Due**

13. 3/2/06
SVM approaches to Microarrays

Knowledge-based
analysis of microarray gene expression data by using support vector machines,
Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini,
Charles Walsh Sugnet, Terence S. Furey, Manuel Ares, Jr., David Haussler, Proc.
Natl. Acad. Sci. USA, vol. 97, pages 262-267

http://www.pnas.org/cgi/reprint/97/1/262.pdf

Presentation/Discussion
Leader – TBD

14.
3/6/06 Principal Component Analysis

A tutorial on Principal Components Analysis, Lindsay I. Smith,
February 26, 2002.

www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Presentation/Discussion
Leader – Mike and Jed

15. 3/9/06
Principal components analysis to summarize microarray experiments: application to sporulation time …

Raychaudhuri

Presentation/Discussion
Leader – Jed

**Project Proposal
Deadline: 3/9**

16. 3/20/06
Baby intro to SPAM + Naïve Bayes

A Plan for Spam, Paul Graham,

http://www.paulgraham.com/spam.html

Bayesian
Learning, Chapter 6, in T. Mitchell, Machine Learning, McGraw Hill,
1997.

Part
1: pg 154-171 Bayesian Learning background

Bayesian
Learning, Chapter 6, in T. Mitchell, Machine Learning, McGraw Hill,
1997.

Part
2: pg 177-184 Naïve Bayes

17. 3/23/06
Bayesian SPAM filters

Mehran
Sahami, Susan Dumais, David Heckerman and Eric Horvitz. “A Bayesian Approach
to Filtering Junk E-Mail.” Proceedings of AAAI-98 Workshop on Learning for
Text Categorization.
http://research.microsoft.com/pubs/view.aspx?pubid=278

Presentation/Discussion
Leader – Wenhui

18.
3/27/06 Tuberculosis Intro + EM
Algorithm for Mixture Models

Bayesian
Learning, Chapter 6, in T. Mitchell, Machine Learning, McGraw Hill,
1997.

Part
3: pg 191-199 EM algorithm

Presentation/Discussion
Leader for EM – TBD

Presentation/Discussion
Leader for TB – Prof Bennett

19.
3/30/06
Inna Vitol, Jeffrey Driscoll, Barry Kreiswirth, Natalia Kurepina, Kristin P. Bennett, "Identifying Mycobacterium tuberculosis Complex Strain Families using Spoligotypes", Infection, Genetics, and Evolution,
to appear 2006. The SPOTCLUST program that goes with this can be found at www.rpi.edu/~bennek/EpiResearch.

Presentation/Discussion
Leader – Inna Vitol

21.
4/6/06 Integer Programming

Nonlinear Programming, Chapter 9, in J. Ecker and
M. Kupferschmid, Introduction to
Operations Research, Krieger, 1991

Presentation/Discussion
Leader – Jingye

**Project Status
Report DUE**

22. 4/10/06 An integer programming approach to
Sudoku

Presentation/Discussion
Leader – Susan/Alicia

23. 4/13/06
Crafting a good machine learning paper: how do you know your method is working?

24. 4/17/06

**Presentation Abstract Deadline:
4/17**

25. 4/20/06
Catch-up day

**Math in Data Science Mini Conference: Participant presentations**

**4/24, 4/27, 5/1, 5/4
(Undergrads get first pick of dates)**

**Final Project Due: Wednesday
5/3, 5 p.m. Prof Bennett’s Box**