Math and Data Science

MATP 4961/6961/CS 4961/6961 Spring 2006

The ability to collect large amounts of data and to gain understanding from it is becoming essential in science, engineering, and business. In the past, data was generated to investigate specific hypotheses. With the explosion of the capacity to generate, store, and share data, data is now frequently collected independently of any hypothesis. In this course, we will examine how such data can be transformed into information for decision making. We will examine how mathematical models of data from statistics and machine learning can be used to tackle compelling problems in science, engineering, and business, such as fighting infectious diseases, designing drugs, and screening spam. Students will conduct research projects in data science. The course is targeted toward advanced undergraduates, but graduate credit may be obtained through extra assignments and a more advanced project.

Prerequisites:  Multivariable Calculus and a course in probability and/or statistics, or instructor permission.

Instructor:  Kristin Bennett

Bennek at rpi dot edu

Office Hours:    Tuesday 3 to 5, AE327 or by appointment

Time and Place:  Monday and Thursday 2 to 4, Science 2C13



Graduate:   Homework/Commentaries 28%, Class Presentation 14%, Research Presentation 14%, Research Project 28%, Participation 16%.

Undergraduate:   Homework/Commentaries 28%, Research Presentation 20%, Research Project 28%, Participation 24%.
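
To illustrate how the weights above combine into a course grade, here is a minimal Python sketch. It treats the breakdown that includes a Class Presentation component as the graduate one, and it assumes each component is scored as a fraction between 0 and 1, which the syllabus does not specify. The dictionary and function names are hypothetical.

```python
# Weights copied from the grading breakdown above (percent of the final grade).
GRADUATE_WEIGHTS = {
    "homework_commentaries": 28,
    "class_presentation": 14,
    "research_presentation": 14,
    "research_project": 28,
    "participation": 16,
}
UNDERGRADUATE_WEIGHTS = {
    "homework_commentaries": 28,
    "research_presentation": 20,
    "research_project": 28,
    "participation": 24,
}


def weighted_grade(scores, weights):
    """Combine component scores into a percentage.

    Assumption: each score is a fraction in [0, 1]; the syllabus does not
    say how individual components are scaled.
    """
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the weighted components")
    return sum(weights[name] * scores[name] for name in weights)
```

Both sets of weights sum to 100, so a student with full marks in every component receives 100 under either scheme.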


• Homework:   Your primary homework is to read the weekly reading assignments and be prepared to discuss them in class. We will have one or two computer/homework exercises. Computer exercises will be done in Matlab, so we will presume that students have a very basic knowledge of Matlab. See the tutorial below. In addition, undergraduate students are required to hand in 4 commentaries during the course of the semester. Graduate students are required to hand in 5 commentaries or short discussion papers during the course of the semester. Commentaries addressing the coming week's reading may be turned in any Monday. For a commentary, the student should read one or more of the research papers and/or chapters to be discussed that week and prepare a one-page paper (one and only one page, typeset with a minimum 10-point font and 1-inch margins). The paper should:

·         discuss an important idea/result in a paper,

·         explain why the idea/result is important, and

·         give thoughts on possible limitations of the work and/or how the work could be extended or applied.

The paper should be your analysis of the paper, not a simple restatement of its contents. You can assume the reader has read the paper and is familiar with its contents. Do not simply restate the abstract. Use your own words. Correct grammar is important and will constitute a major portion of the paper grade. Commentaries must be typed. The final grade will be based on the best 4 (for undergraduates) or 5 (for graduates) commentaries handed in, so you may submit as many as you like. Note that some commentaries are mandatory. See the class schedule.

Your grade will be based on how correctly and completely you address the points above as well as on readability (clarity, flow, grammar, spelling, punctuation, etc.).  Here is a rough grading guide (grades are 0-3):

·         3:  Excellent.  Thoughtful, clear use of concepts, clear evidence of incorporating ideas from the reading,  creative thoughts on limitations and/or extensions, all points are developed and supported, all requirements above adhered to.  Minimal summarizing, maximal presentation of your thoughts.  Few or no mechanical problems (grammar, flow, etc.).

·         2:  Good.  Thoughtful and clear, but connections with the reading and thoughts on limitations and/or extensions not as strong as for a “3”.  Most points reasonably well-supported, most requirements adhered to.  May replace some presentation of your thoughts with some summarizing.  May have some mechanical problems.

·         1:  Adequate.  Basic response with little or no depth, and very little evidence of careful reading of the text or creative thought. Requirements above are probably not adhered to, and may have substantial mechanical problems.

• Project:   Students will do a related research project. The research may be related to your thesis or other work. Undergraduates may reimplement and explore an idea in a published paper. Graduate projects must involve a novel computational and/or theoretical analysis component. Undergraduates may work in teams of 2. Graduate projects should be done individually. Feel free to discuss potential topics with the instructor. A one to two page project proposal is due on 3/9. A one to two page project status report summarizing your project to date, any changes in project goals, and any potential difficulties is due on 4/6. The final project paper is due on 5/3. Note that the project will be graded on the quality of the research investigation and presentation, not on the significance of the final research result. So if you investigate a new approach and it doesn’t work as well as current methods, you can still do well.
• Graduate Presentations:    Each graduate student will lead the presentation and discussion of a research topic. The possible topics are given below. One or two students will be assigned to each date and together will be responsible for that class period. The students will present the main ideas of the topic and lead discussion on that topic. The slides for the presentation will be made available on the course web page. Here the goals are to help the class achieve understanding of the material and to practice your presentation and teaching skills. Undergraduates may volunteer to give talks for extra credit, but must declare their intention to do so by the second week of class. Click here to get the evaluation form used for talks.
• Project Presentations:  In our mini-workshop, students will give 20-minute presentations about their research projects. Here the goal is to give a professional quality research presentation suitable for a conference. Each student's presentation will be given in a mini-symposium held in class on the last four days of class. Click here to get the evaluation form used for talks.
• Participation:   Students are expected to read the assigned readings every day and participate in class discussions. If you don’t understand the readings, bring questions to class. Students are expected to demonstrate their knowledge of and engagement with the material through active participation in class discussions. Since discussion and presentation are major components of the class, students are expected to attend and participate in all classes. Participation includes attending all talks in the mini-symposium at the end of the semester.
• Cheating Policy:  Discussion between students on all aspects of the course is encouraged. All papers, homeworks, presentations, and commentaries should be original work. Plagiarism will not be tolerated and will be grounds for failing the course.
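
The rule under Homework that only the best 4 (undergraduate) or 5 (graduate) commentaries count can be sketched in Python as follows. The syllabus does not say how the counted 0-3 scores are combined, so the averaging step here is an assumption, as are the function names. The sketch also ignores the mandatory commentaries noted in the schedule.

```python
def best_commentary_scores(scores, is_graduate):
    """Return the commentary scores that count: top 5 for graduates, top 4 for undergraduates."""
    required = 5 if is_graduate else 4
    for score in scores:
        if not 0 <= score <= 3:
            raise ValueError("commentary grades are on a 0-3 scale")
    # Sort from best to worst and keep only as many as are required.
    return sorted(scores, reverse=True)[:required]


def commentary_average(scores, is_graduate):
    """Assumed aggregation: average over the required count, so a missing commentary counts as 0."""
    required = 5 if is_graduate else 4
    counted = best_commentary_scores(scores, is_graduate)
    return sum(counted) / required
```

For example, an undergraduate who submits six commentaries scored 2, 3, 1, 3, 2, and 0 would have the scores 3, 3, 2, and 2 counted, so submitting extra commentaries can only help.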

Resources:

Matlab tutorial :  http://www.math.ufl.edu/help/matlab-tutorial/

Wikipedia:  www.wikipedia.org

Texts:

Readings will be from research papers that may be downloaded from the course webpage and from the following books on 2-hour reserve in the library:

J. Ecker and M. Kupferschmid, Introduction to Operations Research, Krieger, 1991.  Several chapters of this book will be used, so you may want to buy it.

T. Mitchell, Machine Learning, McGraw Hill, 1997.

R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.

Class Schedule (subject to change):

1.        1/19/06      Challenges in Data Science, with special focus on drug discovery.

Computers Replace Petri Dishes, E. Frauenheim, CNET News.com, June 2, 2003.

Presentation/Discussion Leader – Kristin Bennett   slides

2.       1/23/06     Why is drug development so expensive?

Merck estimates \$2.5B impact from pulling Vioxx plug, Julie Appleby and Matt Krantz, USA TODAY, September 30, 2004. http://www.usatoday.com/money/industries/health/drugs/2004-09-30-merck-cover_x.htm

G. Banik, In silico ADME-Tox prediction: the more the merrier, Current Drug Discovery, 2004.

Presentation/Discussion Leader -   Kristin Bennett slides

Mandatory Commentary Due:  1/23

3.  1/26/06      How to estimate the similarity of molecules?

Nikolova N., J. Jaworska. Approaches to Measure Chemical Similarity - a Review. QSAR Comb. Sci. 22, No. 9-10, 1006-1026, 2003.

W. Jorgensen, The Many Roles of Computation in Drug Discovery, Science, 2004.

Guest Presentation -   N. Sukumar  slides

4.  1/30/06      Regression models: least squares models

Presentation/Discussion Leader -   Kristin Bennett  slides

Commentary Due

5.  2/2/06        Linear programming based models

Linear Programming, Chapter 2, in J. Ecker and M. Kupferschmid, Introduction to Operations Research, Krieger, 1991. Pay special attention to pp. 24-29.

Presentation/Discussion Leader -   Kristin Bennett   slides



6.  2/6/06   Kernel Methods

Nonlinear:  Chapter 2, “Kernel Methods: an overview”, in J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge, 2004.

Lab class     slides    Lab

If you have not written a commentary by now, consider this one mandatory.

7.  2/9/06   Support Vector Machine

K. Bennett and C. Campbell, “Support Vector Machines: Hype or Hallelujah?”, SIGKDD Explorations, 2:2, 2000, 1-13.
Background reading

Nonlinear  Programming, Chapter 9, in J. Ecker and M.  Kupferschmid, Introduction to Operations Research,  Krieger, 1991

Presentation/Discussion Leader – Kristin Bennett    slides

8.  2/13/06  Duality

Presentation/Discussion Leader – Kristin Bennett slides

9. 2/16/06   SVM methods for chemometrics

Presentation/Discussion Leader -   Kristin Bennett   slides

10.  2/21/06   Background Mathematics and Computer Lab

HERG research project --- models in action

NOTE: no class on Monday 2/20. Class meets on Tuesday 2/21 instead.

11. 2/23/06  Bioinformatics and Gene Microarrays

A Scientific Primer, National Center for Biotechnology Information

Biology 101 -- revisited

Bioinformatics

12.  2/27/06   Mathematical Challenges in Bioinformatics

R. Karp, Mathematical Challenges from Genomics and Molecular Biology, Notices of the AMS, 49(5), 544-553, 2002. http://www.cs.chalmers.se/Cs/Education/Kurser/algfk/karp.pdf

Commentary Due

13.  3/2/06   SVM approaches to Microarrays

Knowledge-based analysis of microarray gene expression data by using support vector machines, Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Walsh Sugnet, Terence S. Furey, Manuel Ares, Jr., David Haussler, Proc. Natl. Acad. Sci. USA, vol. 97, pages 262-267
http://www.pnas.org/cgi/reprint/97/1/262.pdf

14. 3/6/06  Principal Component Analysis

A tutorial on Principal Components Analysis. Lindsay I Smith. February 26, 2002

Presentation/Discussion Leader – Mike and Jed

15.   3/9/06  Principal components analysis to summarize microarray experiments: application to sporulation time …
S. Raychaudhuri, J. M. Stuart, R. B. Altman, Pacific Symposium on Biocomputing, 2000.

16.  3/20/06   Baby intro to SPAM + Naïve Bayes

A Plan for Spam, Paul Graham.

Bayesian Learning,  Chapter 6, in  T. Mitchell, Machine Learning, McGraw Hill, 1997.

Part 1:  pg 154-171   Bayesian Learning background

Bayesian Learning,  Chapter 6, in  T. Mitchell, Machine Learning, McGraw Hill, 1997.

Part 2: pg 177-184    Naïve Bayes

17.    3/23/06    Bayesian SPAM filters

Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz. A Bayesian Approach to Filtering Junk E-Mail. Proceedings of AAAI-98 Workshop on Learning for Text Categorization.  http://research.microsoft.com/pubs/view.aspx?pubid=278

In class SVM exercise Assign3

18.  3/27/06   Tuberculosis Intro + EM Algorithm for Mixture Models

Bayesian Learning,  Chapter 6, in  T. Mitchell, Machine Learning, McGraw Hill, 1997.

Part 3: pg 191-199    EM algorithm

Presentation/Discussion Leader for TB – Prof Bennett

21. 4/6/06 Integer Programming

Nonlinear  Programming, Chapter 9, in J. Ecker and M.  Kupferschmid, Introduction to Operations Research,  Krieger, 1991

Project Status Report DUE

22.   4/10/06 An integer programming approach to Sudoku

Relevant paper

23.  4/13/06  Crafting a good machine learning paper: how do you know a method is working? Topic 1: Andrew Moore's cross validation slides. Topic 2: P. Langley's how to craft a good machine learning paper.

24.  4/17/06

Phaedra Agius, Barry Kreiswirth, Steve Naidich, Kristin P. Bennett, "Typing Staphylococcus aureus using the spa gene and novel distance measures", under review, 2006. The SPOTCLUST program that goes with this can be found at www.rpi.edu/~bennek/EpiResearch.

25.  4/20/06   Catch-up day

Math in Data Science Mini Conference:  Participant presentations

4/24, 4/27, 5/1, 5/4  (Undergrads get first pick of dates)

Final Project Due:   Wednesday 5/3, 5 p.m.  Prof Bennett’s Box