Math and Data Science 

MATP 4961/6961/CS 4961/6961 Spring 2006

 

The ability to collect large amounts of data and gain understanding from it is becoming essential in science, engineering, and business.  In the past, data was generated to investigate specific hypotheses.  With the explosion of the capacity to generate, store, and share data, data is now frequently collected independently of any hypothesis.  In this course, we will examine how such data can be transformed into information for decision making.  We will examine how mathematical models of data from statistics and machine learning can be used to tackle compelling problems in science, engineering, and business, such as fighting infectious diseases, designing drugs, and screening spam.  Students will conduct research projects in data science.  The course is targeted toward advanced undergraduates, but graduate credit may be obtained through extra assignments and a more advanced project.

 

Prerequisites:  Multivariable Calculus and a course in probability and/or statistics, or instructor permission.

 

Instructor:  Kristin Bennett

                    Bennek at rpi dot edu

 

Office Hours:    Tuesday 3 to 5, AE327 or by appointment

 

Time and Place:  Monday and Thursday 2 to 4, Science 2C13

 

Evaluation:  Graduate:

Homework/Commentaries 28%, Class Presentation 14%, Research Presentation 14%, Research Project 28%, Participation 16%.

 

Undergraduate:   Homework/Commentaries 28%, Research Presentation 20%, Research Project 28%, Participation 24%.

   
  • Homework:   Your primary homework is to read the weekly reading assignments and be prepared to discuss them in class.  We will have one or two computer/homework exercises.  Computer exercises will be done in Matlab, so we will presume that students have very basic knowledge of Matlab.  See the tutorial below.  In addition, undergraduate students are required to hand in 4 commentaries during the course of the semester.  Graduate students are required to hand in 5 commentaries or short discussion papers during the course of the semester.  Commentaries addressing the coming week's reading may be turned in any Monday.  For a commentary, the student should read one or more of the research papers and/or chapters to be discussed that week and prepare a one-page paper (one and only one page, typeset with a minimum 10-point font and 1-inch margins).  The paper should:

·         discuss an important idea/result in a paper,

·         explain why the idea/result is important,

·         give thoughts on possible limitations of the work and/or how the work could be extended or applied.

 The commentary should be your analysis of the paper, not a simple restatement of its contents.  You can assume the reader has read the paper and is familiar with its contents.  Do not simply restate the abstract.  Use your own words.  Correct grammar is important and will constitute a major portion of the paper grade.  Commentaries must be typed.  The final grade will be based on the best 4 (for undergrads) or 5 (for grads) commentaries handed in, so you may submit as many as you like.  Note that some commentaries are mandatory; see the class schedule below.

 

Your grade will be based on how correctly and completely you address the points above as well as on readability (clarity, flow, grammar, spelling, punctuation, etc.).  Here is a rough grading guide (grades are 0-3):

 

·         3:  Excellent.  Thoughtful, clear use of concepts, clear evidence of incorporating ideas from the reading,  creative thoughts on limitations and/or extensions, all points are developed and supported, all requirements above adhered to.  Minimal summarizing, maximal presentation of your thoughts.  Few or no mechanical problems (grammar, flow, etc.).

 

·         2:  Good.  Thoughtful and clear, but connections with the reading and thoughts on limitations and/or extensions not as strong as for a “3”.  Most points reasonably well-supported, most requirements adhered to.  May replace some presentation of your thoughts with some summarizing.  May have some mechanical problems.

 

·         1:  Adequate.  Basic response with little or no depth, and very little evidence of careful reading of the text or creative thought. Requirements above are probably not adhered to, and may have substantial mechanical problems.

 

  • Project:   Students will do a related research project.  The research may be related to your thesis or other work.  Undergraduates may reimplement and explore an idea in a published paper.  Graduate projects must involve a novel computational and/or theoretical analysis component.  Undergraduates may work in teams of 2.  Graduate projects should be done individually.  Feel free to discuss potential topics with the instructor.  A one- to two-page project proposal is due on 3/9.  A one- to two-page project status report summarizing your project to date, any changes in project goals, and any potential difficulties is due on 4/6.  The final project paper is due on 5/3.  Note that the project will be graded on the quality of the research investigation and presentation, not on the significance of the final research result.  So if you investigate a new approach and it doesn’t work as well as current methods, you can still do well.
  • Graduate Presentations:    Each graduate student will lead the presentation and discussion of a research topic.  The possible topics are given below.  One or two students will be assigned to each date and together will be responsible for that class period.  The students will present the main ideas of the topic and lead discussion on that topic.  The slides for the presentation will be made available on the course web page.  Here the goals are to help the class achieve understanding of the material and to practice your presentation and teaching skills.  Undergrads may volunteer to give talks for extra credit, but must declare their intention to do so by the second week of class.  Click here to get the evaluation form used for talks.
  • Project Presentations:  In our mini-workshop, the students will give 20-minute presentations about their research projects.  Here the goal is to give a professional-quality research presentation suitable for a conference.  Each student's presentation will be given in a mini-symposium held in class on the last four days of class.  Click here to get the evaluation form used for talks.
  • Participation:   Students are expected to read the assigned readings every day and participate in class discussions.  If you don’t understand the readings, bring questions to class.  Students are expected to demonstrate their knowledge of and engagement with the material through active participation in class discussions.  Since discussion and presentation are major components of the class, students are expected to attend and participate in all classes.  Participation includes attending all talks in the mini-symposium at the end of the semester.
  • Cheating Policy:  Discussion between students on all aspects of the course is encouraged.  All papers, homeworks, presentations, and commentaries should be original work.  Plagiarism will not be tolerated and will be grounds for failing the course.

 

Resources:

Matlab tutorial :  http://www.math.ufl.edu/help/matlab-tutorial/

Wikipedia:  www.wikipedia.org

 

 

 

Texts:  

Readings will be from research papers that may be downloaded from the course webpage and from the following books on 2-hour reserve in the library:

J. Ecker and M. Kupferschmid, Introduction to Operations Research, Krieger, 1991.  Several chapters of this book will be used, so you may want to buy it.

T. Mitchell, Machine Learning, McGraw Hill, 1997.

R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.

 

Class Schedule (subject to change):

1.        1/19/06      Challenges in Data Science, with special focus on drug discovery.

Computers Replace Petri Dishes, E. Frauenheim, CNET news.com, June 2, 2003.

http://news.com.com/Vision+Series+Computers+replace+petri+dishes/2030-1070_3-998622.html

Presentation/Discussion Leader – Kristin Bennett   slides

 

2.       1/23/06     Why is drug development so expensive?

About drug development, FAQ, PPDI.com

http://www.ppdi.com/corporate/faq/about_drug_development/home.htm

 

Merck estimates $2.5B impact from pulling Vioxx plug, Julie Appleby and Matt Krantz, USA TODAY, September 30, 2004. http://www.usatoday.com/money/industries/health/drugs/2004-09-30-merck-cover_x.htm

G. Banik, In silico ADME-Tox prediction: the more the merrier, Current Drug Discovery, 2004.

http://www.currentdrugdiscovery.com/pdf/2004/537275.pdf

http://www.argentadiscovery.com/news/pdf/cdd_2003_article.pdf

Presentation/Discussion Leader -   Kristin Bennett slides

Mandatory Commentary Due:  1/23

 

3.  1/26/06      How to estimate the similarity of molecules?

Nikolova N., J. Jaworska. Approaches to Measure Chemical Similarity - a Review. QSAR Comb. Sci. 22, No. 9-10, 1006-1026, 2003.

http://ambit.acad.bg/nina/publications/Similarity%20-%20reprint.pdf

 

W. Jorgensen, The many roles of computation in drug discovery, Science, 2004.

http://www.rpi.edu/~bennek/class/mds/JorgensenComputationDrugDiscovery.pdf

ADME-TOX Outlook,  Winter 05.

http://www.admetoxoutlook.com/editions/adme_outlook_nl_i01_winter_05.pdf

Guest Presentation -   N. Sukumar  slides

 

4.  1/30/06      Regression models: least squares models

http://en.wikipedia.org/wiki/Least_squares

Presentation/Discussion Leader -   Kristin Bennett  slides

Commentary Due

 

5.  2/2/06        Linear programming based models

Linear Programming, Chapter 2, in J. Ecker and M. Kupferschmid, Introduction to Operations Research, Krieger, 1991; pay special attention to pp. 24-29.

Presentation/Discussion Leader -   Kristin Bennett   slides

 

6. 2/6/06 Kernel Methods:

Nonlinear:  Chapter 2, “Kernel Methods: an overview”, in J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge, 2004.

Presentation/Discussion Leader -   Kristin Bennett

Lab class     slides    Lab

 

If you have not written a commentary by now, consider this one mandatory.

 

7.  2/9/06   Support Vector Machine

K. Bennett and C. Campbell, “Support Vector Machines: Hype or Hallelujah?”, SIGKDD Explorations, 2:2, 2000, 1-13.
Background reading

Nonlinear  Programming, Chapter 9, in J. Ecker and M.  Kupferschmid, Introduction to Operations Research,  Krieger, 1991

Presentation/Discussion Leader – Kristin Bennett    slides  

 

8.  2/13/06  Duality

Presentation/Discussion Leader – Kristin Bennett slides

 

 

9. 2/16/06   SVM methods for chemometrics

J. Bi, K. Bennett, M. Embrechts, C. Breneman, and M. Song, "Dimensionality Reduction via Sparse Support Vector Machines", Journal of Machine Learning Research, 3, 2003, 1229-1243.

Presentation/Discussion Leader -   Kristin Bennett   slides

 

 

10.  2/21/06   Background Mathematics and Computer Lab

HERG research project --- models in action

Presentation/Discussion Leader – Kristin Bennett

 

NOTE:  No class on Monday 2/20.  Class meets on Tuesday 2/21 instead.

 

11. 2/23/06  Bioinformatics and Gene Microarrays

A Scientific Primer, National Center for Biotechnology Information

www.ncbi.nlm.nih.gov/About/primer/

Biology 101 -- revisited

http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html  What is a cell?

http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html  What is a genome?

Bioinformatics

http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html   Bioinformatics

http://www.ncbi.nlm.nih.gov/About/primer/microarrays.html       Microarray Technology

Presentation/Discussion Leader  - Kristin Bennett

 

12.  2/27/06   Mathematical Challenges in Bioinformatics

R. Karp, Mathematical Challenges from Genomics and Molecular Biology, Notices of the AMS, 49(5), 544-553, 2002. http://www.cs.chalmers.se/Cs/Education/Kurser/algfk/karp.pdf

Presentation/Discussion Leader -   John P.

Commentary Due

 

13.  3/2/06   SVM approaches to Microarrays

Knowledge-based analysis of microarray gene expression data by using support vector machines, Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Walsh Sugnet, Terence S. Furey, Manuel Ares, Jr., David Haussler, Proc. Natl. Acad. Sci. USA, vol. 97, pages 262-267
http://www.pnas.org/cgi/reprint/97/1/262.pdf

Presentation/Discussion Leader – TBD

 

14. 3/6/06  Principal Component Analysis

A tutorial on Principal Components Analysis, Lindsay I Smith, February 26, 2002.
www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Presentation/Discussion Leader – Mike and Jed

 

15.   3/9/06  Principal components analysis to summarize microarray experiments: application to sporulation time …
S. Raychaudhuri, J. M. Stuart, and R. B. Altman, Pacific Symposium on Biocomputing, 2000 (smi.stanford.edu).

Presentation/Discussion Leader – Jed

Project Proposal Deadline:  3/9

 

16.  3/20/06   Baby intro to SPAM + Naïve Bayes 

A Plan for Spam, Paul Graham,

http://www.paulgraham.com/spam.html

Bayesian Learning,  Chapter 6, in  T. Mitchell, Machine Learning, McGraw Hill, 1997.

Part 1:  pg 154-171   Bayesian Learning background

Bayesian Learning,  Chapter 6, in  T. Mitchell, Machine Learning, McGraw Hill, 1997.

Part 2: pg 177-184    Naïve Bayes

 

17.    3/23/06    Bayesian SPAM filters

Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz. "A Bayesian Approach to Filtering Junk E-Mail." Proceedings of AAAI-98 Workshop on Learning for Text Categorization.  http://research.microsoft.com/pubs/view.aspx?pubid=278

Presentation/Discussion Leader – Wenhui

In-class SVM exercise: Assign3

 

18.  3/27/06   Tuberculosis Intro + EM Algorithm for Mixture Models

Bayesian Learning,  Chapter 6, in  T. Mitchell, Machine Learning, McGraw Hill, 1997.

Part 3: pg 191-199    EM algorithm

Presentation/Discussion Leader for EM – TBD

Presentation/Discussion Leader for TB – Prof Bennett

 

 

19.  3/30/06  Inna Vitol, Jeffrey Driscoll, Barry Kreiswirth, Natalia Kurepina, Kristin P. Bennett, "Identifying Mycobacterium tuberculosis Complex Strain Families using Spoligotypes", Infection, Genetics, and Evolution, to appear 2006. The SPOTCLUST program that goes with this can be found at www.rpi.edu/~bennek/EpiResearch.

Presentation/Discussion Leader – Inna Vitol 

 

 

21. 4/6/06 Integer Programming

Nonlinear  Programming, Chapter 9, in J. Ecker and M.  Kupferschmid, Introduction to Operations Research,  Krieger, 1991

Presentation/Discussion Leader – Jingye 

Project Status Report DUE

 

22.   4/10/06 An integer programming approach to Sudoku

Relevant paper

Presentation/Discussion Leader – Susan/Alicia

 

23.  4/13/06  Crafting a good machine learning paper: how do you know your method is working? Topic 1: Andrew Moore's cross validation slides. Topic 2: P. Langley's how to craft a good machine learning paper.

 

24.  4/17/06  

 

Presentation Abstract Deadline:  4/17

Phaedra Agius, Barry Kreiswirth, Steve Nadich, Kristin P. Bennett, "Typing Staphylococcus aureus using the spa gene and novel distance measures", under review 2006. The SPOTCLUST program that goes with this can be found at www.rpi.edu/~bennek/EpiResearch.

 

25.  4/20/06   Catch-up day

 

Math in Data Science Mini Conference:  Participant presentations

4/24, 4/27, 5/1, 5/4  (Undergrads get first pick of dates)

 

Final Project Due:   Wednesday 5/3, 5 p.m.  Prof Bennett’s Box