Math Models for Learning and Discovery 

MATP  6961/CS 6961 Spring 2005

 

This course examines how we can make inferences from data in order to solve problems.   Motivated by applications in diverse areas such as bioinformatics, brain-computer interfaces, drug discovery, computer vision, and materials science, we will examine how inference problems can be modeled from a mathematical and geometric perspective.   Inference tasks such as classification, regression, ranking, feature selection, and novelty detection will be modeled mathematically and solved using appropriate algorithms.   Statistical learning theory will be used to investigate the generalization capabilities of these approaches.  Emphasis will be on state-of-the-art learning methodologies such as support vector machines, kernel methods, and ensemble techniques.  

 

This is a research seminar course.   By reading machine learning papers, we will examine leading-edge research; a background text will be used to provide a unifying framework.  Students interested in machine learning, mathematical modeling, and/or potential applications of machine learning are all welcome.  The class can be customized to meet the goals of the students.  Each student will be responsible for presenting a paper for class discussion.  Each student will also define a research project, perform the research, and present the results to the class in a mock-conference setting at the end of the semester.  The course will examine both the mathematical and the computational aspects of the learning algorithms. 

 

Prerequisites:  Multivariate calculus and linear algebra, or equivalent.   Previous experience in machine learning or pattern recognition is helpful but not required.

 

Instructor:  Kristin Bennett

                    Bennek at rpi dot edu

 

Office Hours:    Monday and Wednesday 2 to 3:30 AE 327 or by appointment

 

Place:  Tuesday and Friday 2 to 4     Low 3101

 

Evaluation:   Homework/Commentaries 30%, Presentations 30%, Research Project 30%, Participation 10%.

   
  • Homework:   Your primary homework is to read the weekly reading assignments and be prepared to discuss them in class.   We will have a few computer exercises.  In addition, students are required to hand in 4 commentaries or short discussion papers during the course of the semester.  There are seven possible due dates (see below), so you have three built-in excuses.  For a commentary, the student should read one or more of the research papers and chapters to be discussed that week, and prepare a one-page paper (one and only one page, with minimum 10-point font and 1-inch margins) that summarizes the principal idea of the paper, its major contributions, and your thoughts on possible limitations of the work and/or how the work could be extended.  The commentary should be your analysis of the paper, not a simple restatement of its contents.  You can assume the reader has read the paper and is familiar with its contents.   Do not simply restate the abstract.  Use your own words.  Correct grammar is important.
  • Project:   Students will carry out an original research project related to the course.    The research may be related to your thesis or other work.    The project should involve a novel computational and/or theoretical analysis component.  Feel free to discuss potential topics with the instructor.  A one-to-two-page project proposal is due on 2/18. A one-to-two-page project status report summarizing your project to date, any changes in project goals, and any potential difficulties is due on 4/8. The final project paper is due on 5/3.    Note that the project will be graded on the quality of the research investigation and presentation, not on the significance of the final research result.   So if you investigate a new approach and it doesn’t work as well as current methods, you can still do well.
  • Presentations:  Students will have at least two opportunities to present work in class.  First, each student will lead the presentation and discussion of a research topic.  The possible topics are given below.  Two students will be assigned to each date and together will be responsible for that class period.   The students will present the main ideas of the topic and lead discussion on that topic.   The slides for the presentation will be made available on the course web page.    Here the goals are to help the class achieve understanding of the material and to practice your presentation and teaching skills.  Second, at our mini learning workshop, students will give 20-minute presentations about their research projects.  Here the goal is to give a professional-quality research presentation suitable for a conference.  The project presentations will be given in a mini-symposium held in class on the last three days of class (4/26, 4/29, 5/3) and during the first reading day, 5/5, from 10:00 a.m. to 4 p.m. in AE 316.    Click here to get the evaluation form used for talks.
  • Participation:   Students are expected to demonstrate their knowledge of and engagement with the material through active participation in class discussions.   Since discussion and presentation are major components of the class, students are expected to attend and participate in all classes.  Participation includes attending the mini-symposium. 

 

Texts:  

 

Readings will be from research papers that may be downloaded from the course webpage and from the book:

J. Shawe-Taylor and N. Cristianini,  Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.

 

Some of the papers are in PostScript form and require the Ghostview viewer.  See:

www.cs.wisc.edu/~ghost/.    Papers may also be available in PDF form, which requires Acrobat Reader. 


Class Schedule (subject to change):

 

Here is the tentative class schedule.   The schedule may be altered to coordinate with visiting speakers. 

 

1.        1/18          What is learning and discovery?  Class Overview

Presentation -   Kristin Bennett

 

2.       1/21           Pattern Analysis

J. Shawe-Taylor and N. Cristianini, “Chapter 1: Pattern Analysis”,

Kernel Methods for Pattern Analysis, 2004.

Presentation – Gautam Kunapuli

 

3.       1/24           Overview of Kernel Methods

J. Shawe-Taylor and N. Cristianini, “Chapter 2: Kernel Methods: an Overview”,

Kernel Methods for Pattern Analysis, 2004.

Presentation – Prof. Bennett

 

4.       1/31           Alternative Views of Kernel Ridge Regression (KRR) and Least Squares SVM (LS-SVM)

J. Shawe-Taylor and N. Cristianini, “Chapter 2: Kernel Methods: an Overview”,

Kernel Methods for Pattern Analysis, 2004.

Presentation – Prof. Bennett
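
For those who want to see the algebra in code, below is a minimal sketch of kernel ridge regression in Python/NumPy (an illustration only; Python is not a course requirement, and the Gaussian kernel width and regularization value are made-up choices). It solves the dual system (K + lambda*I) alpha = y and predicts with f(x) = sum_i alpha_i k(x_i, x), which is the natural starting point for the comparison with LS-SVM.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam=0.1, gamma=1.0):
    """Kernel ridge regression: solve (K + lam*I) alpha = y for the dual weights."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(X_train, alpha, X_new, gamma=1.0):
    """Predict f(x) = sum_i alpha_i k(x_i, x) for each new point."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Tiny usage example on synthetic 1-D data.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(40)
alpha = krr_fit(X, y, lam=0.1, gamma=0.5)
print(krr_predict(X, alpha, np.array([[0.0], [1.5]]), gamma=0.5))
```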

 

5.       2/1             Support Vector Machine Classification

 

C. Cortes and V. Vapnik, “Support-Vector Networks”, Machine Learning, 20(3):273-297, 1995.
 
K. Bennett and C. Campbell, “Support Vector Machines: Hype or Hallelujah?”, SIGKDD Explorations, 2:2, 2000, 1-13.
 
Presentation – Prof. Bennett
Assignment 1 handed out.
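
As an optional warm-up before the computer exercises, here is a rough sketch of a linear soft-margin SVM trained by subgradient descent on the primal hinge loss (Python/NumPy; the learning rate, C value, and toy data are made up). The assigned papers work with the dual quadratic program, which is what real solvers use; this sketch is only meant to make the margin/slack trade-off concrete.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b)) by subgradient
    descent.  Labels y must be in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1                              # margin violators
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage on two well-separated Gaussian blobs.
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(30, 2) + 2, rng.randn(30, 2) - 2])
y = np.hstack([np.ones(30), -np.ones(30)])
w, b = train_linear_svm(X, y, C=1.0)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```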

   

6.       2/4            SVM Regression       

 
A. Smola and B. Schoelkopf, “A tutorial on support vector regression”, NeuroCOLT2 Technical Report NC2-TR-1998-030, 1998.   (Focus on  Chapters 1 through 4)

 

Student Presentation/Discussion Leaders:  John Marsh

 

7.       2/7            Brain Computer Human Interface Application        Mandatory Commentary    

 

G. Fabiani, D. McFarland, J. Wolpaw, G. Pfurtscheller,  “Conversion of EEG activity into cursor movement by a brain-computer interface (BCI)”, IEEE Transactions on Neural Systems and Rehabilitation Engineering,  12:3, 331-338, 2004.

 

P. Meinicke et al.,  “Improving Transfer Rates in Brain Computer Interfacing: A Case Study”,  NIPS 15, 2002.

 

Guest Lecturers: Dennis McFarland and Eric Sellers, Wadsworth Center, NY State Dept. of Health

 

 

 

8.       2/11          nu-SVM Regression and nu-SVM Classification   

 
B. Schoelkopf , A. Smola and R. Williamson,  "Shrinking the tube: A new support vector regression algorithm", In M. S. Kearns, S. A. Solla, and D. A. Cohn (Eds.), Advances in Neural Information Processing Systems, Volume 11, Cambridge, MA. MIT Press. 1999.
 
B. Schoelkopf, A. Smola, R. Williamson and P. Bartlett,  "New support vector algorithms",  Neural Computation, 12:1083 - 1121, 2000.
 

Student Presentation/Discussion Leaders:  Alex Tyrrel

 

9.       2/15          Sequential Minimal Optimization (SMO)     Commentary Due

J. Platt, “Using Sparseness and Analytic QP to Speed Training of Support Vector Machines”, in Advances in Neural Information Processing Systems 11, M. S. Kearns, S. A. Solla, and D. A. Cohn, eds., MIT Press, 1999. This is a seven-page paper, but it describes heuristics (which are about 2x faster than those in the longer paper below) and also gives timing comparisons to Thorsten Joachims' SVMlight.

   J. Platt, “Fast Training of Support Vector Machines using Sequential Minimal Optimization”, in Advances in Kernel Methods - Support Vector Learning,  B. Schölkopf, C. Burges, and A. Smola, eds., MIT Press, 1998.

Xianping Ge, "Sequential Minimal Optimization for SVM: C++ Implementation",  http://www.datalab.uci.edu/people/xge/svm/smo.pdf  

 

Student Presentation/Discussion Leaders: Travis

 

2/18 Class was cancelled.               Assignment Due in Prof. Bennett’s box

 

NOTE NO CLASS 2/22

 

10.    2/25                     Ranking                          PROJECT PROPOSAL DUE         

 

T. Joachims, Optimizing Search Engines Using Clickthrough Data, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.

 

 

Student Presentation/Discussion Leader: Mohammad Hasan
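
The key construction in the Joachims paper is to turn clickthrough preferences into a classification problem on difference vectors: a preference for document a over document b (for the same query) becomes a positive example x_a - x_b. A minimal sketch of that pairwise transform (Python/NumPy; the feature values and preference pairs are invented for illustration):

```python
import numpy as np

def pairwise_transform(features, preferences):
    """Turn preference pairs (i preferred over j) into classification data:
    each pair contributes +(x_i - x_j) with label +1 and -(x_i - x_j) with label -1."""
    X_pairs, y_pairs = [], []
    for i, j in preferences:
        diff = features[i] - features[j]
        X_pairs.append(diff);  y_pairs.append(+1)
        X_pairs.append(-diff); y_pairs.append(-1)
    return np.array(X_pairs), np.array(y_pairs)

# Illustrative data: 4 documents with 3 features; doc 0 preferred over 2, doc 1 over 3.
features = np.array([[0.9, 0.2, 0.4],
                     [0.8, 0.1, 0.7],
                     [0.3, 0.6, 0.2],
                     [0.2, 0.5, 0.1]])
X_pairs, y_pairs = pairwise_transform(features, [(0, 2), (1, 3)])
print(X_pairs.shape, y_pairs)
```

Any linear classifier, for example the SVM sketch after session 5, can then be trained on the transformed data; its weight vector w induces a ranking score w.x for new documents.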

 

 

 

11.     3/1          Properties of Kernels and tuning kernels    Commentary Due

 

J. Shawe-Taylor and N. Cristianini, “Properties of Kernels”, Chapter 3, Kernel Methods for Pattern Analysis, 2004.

 

 

 

Student Presentation/Discussion Leaders:  Renzhi Lu, Zhaolin  Cheng

 

 

12.    3/4            Boosting                                              

 

Y. Freund and R. Schapire, "A short introduction to boosting", Journal of the Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.
 
Y. Freund, R. Iyer,  R. Schapire, and Y. Singer, “An Efficient Boosting Algorithm for Combining Preferences”, http://jmlr.csail.mit.edu/papers/volume4/freund03a/freund03a.pdf  Journal of Machine Learning Research, 4, 2003, 933-969.
 

Student Presentation/Discussion Leader: Xiaoli (introduction)  and Paul Evangelista (preferences)
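
To make the reweighting step in the short introduction concrete, here is a from-scratch AdaBoost sketch with decision-stump weak learners (Python/NumPy; the number of rounds and the toy data are arbitrary, and this is not the experimental code from either paper).

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustively pick the (feature, threshold, sign) stump with lowest weighted error."""
    best = (0, 0.0, 1, np.inf)                         # (feature, threshold, sign, error)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, f] <= thr, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[3]:
                    best = (f, thr, sign, err)
    return best

def adaboost(X, y, rounds=20):
    """AdaBoost: reweight examples by exp(-alpha * y * h(x)) each round.  y in {-1,+1}."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(rounds):
        f, thr, sign, err = best_stump(X, y, w)
        err = max(err, 1e-12)                          # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        pred = sign * np.where(X[:, f] <= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append((alpha, f, thr, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, f] <= t, 1, -1) for a, f, t, s in ensemble)
    return np.sign(score)

# Toy usage: two overlapping Gaussian blobs.
rng = np.random.RandomState(2)
X = np.vstack([rng.randn(40, 2), rng.randn(40, 2) + 1.5])
y = np.hstack([-np.ones(40), np.ones(40)])
model = adaboost(X, y, rounds=25)
print("training accuracy:", np.mean(predict(model, X) == y))
```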

 

 

13.     3/8           Guest Speaker: Dr. Cynthia Rudin, Howard Hughes Medical Institute and NYU Center for Neural Science.  She will give two one-hour talks, one at 2:00 and one at 4:00.

Reading:

 

C. Rudin, I. Daubechies, and R. Freund,  “The Dynamics of AdaBoost: Cyclic Behavior and the Convergence of Margins”, http://jmlr.csail.mit.edu/papers/v5/rudin04a.html,  Journal of Machine Learning Research, 5, 2004, 1557-1595.

 

2:00 talk in class         

Title: An Introduction to the Problem of Ranking and Some Recent Work

 

Abstract:

The problem of ranking has recently gained considerable attention in the machine learning community. One example of a ranking task is to combine the results of many different search engines. Another example is to retrieve a list of documents (from a database) sorted by relevance. A third example is the task of providing movie recommendations to users based on the movie ratings of other users.

I would like to explain the ranking problem to the class, and intuitively describe some recent work on this topic. This work includes a new maximum margin ranking algorithm (based on some new theory), and a really strange fact about AdaBoost, which I will also describe briefly in my earlier talk. I am assuming only a basic understanding of boosting.

 

4:00 p.m. talk by Cynthia Rudin in Walker 6113

Title: The Dynamics of AdaBoost

The goal of Statistical Learning Theory is to construct and understand algorithms that are able to generalize from a given training data set. Statistical learning algorithms are wildly popular now due to their excellent performance on many types of data; one of the most successful learning algorithms is AdaBoost, which is a classification algorithm designed to construct a "strong" classifier from a "weak" learning algorithm. Just after the development of AdaBoost eight years ago, scientists derived margin-based generalization bounds to explain AdaBoost's good performance. Their result predicts that AdaBoost yields the best possible performance if it always achieves a "maximum margin" solution. Yet, does AdaBoost achieve a maximum margin solution? At least three large-scale empirical studies conducted within this eight year period conjecture the answer to be "yes". In order to understand AdaBoost's convergence properties and answer this question, we look toward AdaBoost's dynamics. We simplify AdaBoost to reveal a nonlinear iterated map and analyze its behavior in specific cases. We find stable cycles for these cases, which can explicitly be used to solve for AdaBoost's output. Thus, we find the key to answering the question of whether AdaBoost always maximizes the margin - a set of examples in which AdaBoost's convergence can be completely understood. Using this unusual technique, we are able to answer the question of whether AdaBoost always converges to a maximal margin solution; the answer is the opposite of what was thought to be true! In this talk, I will introduce AdaBoost, describe our analysis of AdaBoost when viewed as a dynamical system, and briefly mention a new boosting algorithm which always maximizes the margin with a fast convergence rate. This talk is designed for a general mathematical audience that likes to see pretty pictures of cyclic dynamics.

 

 

 

 

14.    3/11          Properties of Kernels     Commentary Due              

J. Shawe-Taylor and N. Cristianini, “Detecting Stable Patterns”, Chapter 4, Kernel Methods for Pattern Analysis, 2004.

 

Student Presentation/Discussion Leaders: Josh

 

 

 

 

NO CLASS 3/22

ASSIGNMENT 2 due in class 3/25

 

15.     3/25    Evaluation 

 

Primary papers: 

S. Salzberg, “On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach”, Data Mining and Knowledge Discovery, 1997.

 

Rich Caruana, Alexandru Niculescu-Mizil: Data mining in metric space: an empirical analysis of supervised learning performance criteria. KDD 2004: 69-78  (Yi Guo)

 

Secondary information for your project:

Pat Langley, “Crafting Papers in Machine Learning”, http://www.ecn.purdue.edu/ICML2001/craft.html

 

Today we will examine what makes a good machine learning paper with specific emphasis on the design and presentation of computational experiments. 

 

Student Presentation/Discussion Leader:  Yi Guo and Professor Bennett

 

 

16.    3/29    Elementary Algorithms: Kernel Fisher  and Novelty Detection

 

B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Technical Report 99-87, Microsoft Research, 1999. Neural Computation, 2001.

 

S. Mika, G. Rätsch, and K.-R. Müller. A mathematical programming approach to the Kernel Fisher algorithm. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 591-597. MIT Press, 2001.

http://citeseer.ist.psu.edu/cache/papers/cs/7930/http:zSzzSzwww.first.gmd.dezSz~mikazSzMikRaeWesSchMue99.pdf/mika99fisher.pdf

 

 NOTE:  Different coverage of this material can be found in:             

Kernel Fisher Discriminant Analysis

J. Shawe-Taylor and N. Cristianini, “Chapter 5: Elementary Algorithms in Feature Space”, Kernel Methods for Pattern Analysis, 2004.

 

 

Student Presentation/Discussion Leader:   Zhiwei Zhu  and  Craig Tashman
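
A much simpler novelty detector than the one-class SVM of the Schölkopf et al. paper, in the spirit of the elementary algorithms of Chapter 5, scores each point by its squared feature-space distance to the centre of mass of the training data, k(x,x) - (2/n) sum_i k(x,x_i) + (1/n^2) sum_ij k(x_i,x_j). A minimal sketch (Python/NumPy; the kernel width and toy data are illustrative choices, and this is not the algorithm from the assigned paper):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def novelty_scores(X_train, X_test, gamma=1.0):
    """Squared feature-space distance from each test point to the centre of mass
    of the training data."""
    K_tr = rbf_kernel(X_train, X_train, gamma)
    K_te = rbf_kernel(X_test, X_train, gamma)
    self_term = np.ones(X_test.shape[0])          # k(x, x) = 1 for the RBF kernel
    return self_term - 2.0 * K_te.mean(axis=1) + K_tr.mean()

# Usage: points far from the training cloud get larger scores (more "novel").
rng = np.random.RandomState(4)
X_train = rng.randn(100, 2)
X_test = np.array([[0.0, 0.0], [5.0, 5.0]])
print(novelty_scores(X_train, X_test, gamma=0.5))
```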

 

17.  4/1    Guest Speaker Rich Caruana

Paper to be determined

 

 

18.    4/5            Kernel PCA, Kernel PLS     Commentary Due

 

B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel Principal Component Analysis”, in B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 327-352.  Either the short version or the book chapter can be read for background.

 

R. Rosipal and R. Trejo,  “Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space”,  Journal of Machine Learning Research, 2, 2001, 97-123.

 

For an overview of the week, read J. Shawe-Taylor and N. Cristianini, “Chapter 6: Pattern Analysis via Eigen-decompositions”, Kernel Methods for Pattern Analysis, 2004.

 

Student Presentation/Discussion Leaders:  Long Han   (KPLS)
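
The core computation behind kernel PCA in Chapter 6 and the Schölkopf et al. chapter is an eigendecomposition of the centered kernel matrix, after which projections of the training points come from the scaled eigenvectors. A minimal sketch (Python/NumPy; the kernel choice, its width, and the toy data are illustrative only):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_pca(X, n_components=2, gamma=1.0):
    """Project the training data onto the leading kernel principal components."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    # Center the kernel matrix in feature space: K_c = (I - 1/n) K (I - 1/n).
    one_n = np.ones((n, n)) / n
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, eigvecs = np.linalg.eigh(K_c)              # ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]      # take the largest
    lambdas, alphas = eigvals[idx], eigvecs[:, idx]
    # Scale so each feature-space eigenvector has unit norm, then project.
    return K_c @ (alphas / np.sqrt(lambdas))

# Usage: two concentric rings become (roughly) separable in kernel PCA space.
rng = np.random.RandomState(3)
t = rng.uniform(0, 2 * np.pi, 100)
r = np.hstack([np.ones(50), 3 * np.ones(50)]) + 0.1 * rng.randn(100)
X = np.column_stack([r * np.cos(t), r * np.sin(t)])
print(kernel_pca(X, n_components=2, gamma=0.5).shape)
```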

 

 

19.    4/8            Independent Component Analysis / Canonical Correlation Analysis

Project Status Report Due

A. Vinokourov, J. Shawe-Taylor, N. Cristianini, “Finding Language-Independent Semantic Representation of Text using Kernel Canonical Correlation Analysis”,  NeuroCOLT Technical Report NC-TR-02-119, 2002.

F. Bach and M. Jordan, Kernel Independent Component Analysis, JMLR, 3, 1-48, 2002.

Student Presentation/Discussion Leader:   Evrim  Acar

 

20.       4/12       Funky Kernels Part 1  - String Kernels                           Commentary Due

 

For an overview of the week, read J. Shawe-Taylor and N. Cristianini, “Chapter 11:  Kernels for Structured Data: Strings, Trees, etc.”, Kernel Methods for Pattern Analysis, 2004.

Christina Leslie, Eleazar Eskin, Adiel Cohen, Jason Weston, William Stafford Noble, Mismatch String Kernels for Discriminative Protein Classification, Bioinformatics, http://www1.cs.columbia.edu/compbio/mismatch/journal-mismatch-final.pdf

M. Collins and N. Duffy, “Convolution Kernels for Natural Language”, NIPS 2002, 625-632.

Student Presentation/Discussion Leader:   Phaedra Agius  and  Ajish George
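
Before tackling the mismatch kernel, it can help to code up the simpler exact-match spectrum kernel, which the Leslie et al. paper generalizes by allowing up to m mismatches per k-mer. A minimal sketch (plain Python; the example strings are made up, not from the assigned data sets):

```python
from collections import Counter

def spectrum_features(s, k=3):
    """Count every length-k substring (k-mer) in the string."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=3):
    """k-spectrum kernel: inner product of the two k-mer count vectors."""
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    return sum(fs[kmer] * ft[kmer] for kmer in fs if kmer in ft)

# Illustrative protein-like strings (invented for this example).
a = "MKVLAAGICKVLA"
b = "MKVLCAGICKMLA"
c = "GGGTTTGGGTTT"
print(spectrum_kernel(a, b, k=3))   # shares several 3-mers
print(spectrum_kernel(a, c, k=3))   # shares none
```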

 

21.        4/15      Funky Kernels -  Tree kernels /generative models

                       

J.-P. Vert, “A tree kernel to analyze phylogenetic profiles”,  Bioinformatics, 2002. 

http://web.kuicr.kyoto-u.ac.jp/~vert/publi/ismb02/latex/ismb02.pdf

 

K. Tsuda et al.,  “Marginalized Kernels for Biological Sequences”,  Bioinformatics, 2002.

 

Background reading: Chapter 11 of the text. 

 

 

Student Presentation/Discussion Leader:   Jed Zaretski and Il Young Song

 

22.      4/19   Bioinformatics/Cheminformatics Applications - Feature Selection

 

I. Guyon et al., “Gene Selection for Cancer Classification using Support Vector Machines”,  Machine Learning, 46, 2002.

http://www.tsi.enst.fr/~campedel/Biblio/FeatureSelection/Guyon_GeneSelection_2000.pdf

 

J. Bi et al,  “Dimensionality Reduction via Sparse Support Vector Machines”,  JMLR, 2003.

 http://jmlr.csail.mit.edu/papers/volume3/bi03a/bi03a.pdf

     

Student Presentation/Discussion Leader:   Vineet Chaoji   
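
The recursive feature elimination (RFE) idea in the Guyon et al. paper is simply: train a linear SVM, drop the feature with the smallest weight magnitude, and repeat. A rough, self-contained sketch (Python/NumPy; the crude subgradient trainer below stands in for the exact SVM solver used in the paper, and the synthetic data are invented):

```python
import numpy as np

def linear_svm_weights(X, y, C=1.0, lr=0.01, epochs=200):
    """Crude linear soft-margin SVM by hinge-loss subgradient descent (y in {-1,+1});
    a stand-in for the exact SVM solver used in the paper."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1
        w -= lr * (w - C * (y[viol][:, None] * X[viol]).sum(axis=0))
        b -= lr * (-C * y[viol].sum())
    return w

def svm_rfe(X, y, n_keep=2, C=1.0):
    """Recursive feature elimination: retrain, then drop the feature with the
    smallest |w_j| until only n_keep features remain."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        w = linear_svm_weights(X[:, active], y, C=C)
        active.pop(int(np.argmin(np.abs(w))))
    return active

# Usage on synthetic data where only features 0 and 1 carry signal.
rng = np.random.RandomState(5)
X = rng.randn(80, 6)
y = np.sign(X[:, 0] + X[:, 1] + 0.1 * rng.randn(80))
print("selected features:", svm_rfe(X, y, n_keep=2))
```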

 

 

23.     4/22        Bag of pixels kernel,  Support vector clustering     

 

R. Kondor and T. Jebara, "A Kernel Between Sets of Vectors", International Conference on Machine Learning, ICML 2003.

http://www1.cs.columbia.edu/~jebara/papers/Kondor,Jebara_point_set.pdf

 

Ben-Hur et al., Support Vector Clustering. JMLR, 2, 2001.  

 

At the last minute, the papers for this day were switched to spectral clustering:

Francis R. Bach, Michael I. Jordan. Learning spectral clustering, Advances in Neural Information Processing Systems (NIPS) 16, 2004.

http://cmm.ensmp.fr/~bach/nips03_cluster.pdf

 

Student Presentation/Discussion Leader:   Andrey
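
For orientation before the Bach–Jordan paper, which learns the affinity matrix, here is the standard two-way spectral partition it builds on: form an RBF affinity, take the normalized graph Laplacian, and threshold its second-smallest eigenvector (Python/NumPy; the kernel width and toy data are made up, and this is not the algorithm proposed in the paper):

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Two-way spectral split: build an RBF affinity W, form the normalized Laplacian
    L = I - D^{-1/2} W D^{-1/2}, and threshold its second-smallest eigenvector."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
    W = np.exp(-sq / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)                 # ascending eigenvalues
    return (eigvecs[:, 1] > 0).astype(int)               # second-smallest eigenvector

# Usage: two well-separated blobs should be split cleanly into labels 0 and 1.
rng = np.random.RandomState(6)
X = np.vstack([rng.randn(30, 2), rng.randn(30, 2) + 6])
print(spectral_bipartition(X, sigma=1.0))
```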

 

Full information on our learning workshop, as well as submission (project paper) requirements and criteria, can be found by clicking here. 

 

24.    4/26         “Kernel Methods, How Far Have We Come?” -  Prof. Bennett

25.    4/28         Learning Workshop - Project Presentations

26.    5/3           Learning Workshop - Project Presentations

                              PROJECT RESEARCH PAPER DUE

27.    5/5           Learning Workshop - Project Presentations - Attendance mandatory

                        10:00 a.m. to 4 p.m., location AE 316.