Dimensionality Reduction via Sparse Support Vector Machines
Kristin Bennett, Jinbo Bi, Mark Embrechts, Curt Breneman and Minghu Song
Departments of Mathematics, DSES, and Chemistry
We describe a methodology for
performing variable ranking and selection using support vector machines (SVMs).
The method constructs a series of sparse linear SVMs to generate linear models
that can generalize well, and uses a subset of nonzero weighted variables found
by the linear models to produce a final nonlinear model. The method exploits the
fact that a linear SVM (no kernels) with $\ell_1$-norm regularization inherently
performs variable selection as a side effect of minimizing the capacity of the SVM
model. The distribution of the linear model weights provides a mechanism for
ranking and interpreting the effects of variables. Starplots are used to
visualize the magnitude and variance of the weights for each variable. We
illustrate the effectiveness of the methodology on synthetic data, benchmark
problems and challenging regression problems in drug design. The method can
dramatically reduce the number of variables, and it outperforms SVMs trained on
all attributes as well as SVMs trained on attributes selected according to
correlation coefficients. The visualization of the resulting models is useful for
understanding the role of underlying variables.
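
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' CPLEX implementation: it assumes an epsilon-insensitive loss with a fixed tube width, solves the resulting linear program with SciPy's `linprog`, and uses plain bootstrap resampling to build the series of sparse linear models. The function names (`sparse_linear_svr`, `select_variables`), the parameters `C`, `eps`, `tol` and `n_models`, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np
from scipy.optimize import linprog


def sparse_linear_svr(X, y, C=1.0, eps=0.1):
    """Fit a linear SVR with l1-norm regularization as a linear program.

    Assumed formulation (illustrative only):
        minimize   ||w||_1 + C * sum(xi + xi_star)
        subject to y - (Xw + b) <= eps + xi
                   (Xw + b) - y <= eps + xi_star,   xi, xi_star >= 0
    The weight vector is split as w = u - v with u, v >= 0 so that
    ||w||_1 becomes the linear term sum(u) + sum(v).
    """
    n, d = X.shape
    # LP variables are stacked as [u (d), v (d), b (1), xi (n), xi_star (n)].
    cost = np.concatenate([np.ones(2 * d), [0.0], C * np.ones(2 * n)])
    eye = np.eye(n)
    zero = np.zeros((n, n))
    ones = np.ones((n, 1))
    # y - Xw - b - xi <= eps        ->  -Xu + Xv - b - xi      <= eps - y
    above = np.hstack([-X, X, -ones, -eye, zero])
    # Xw + b - y - xi_star <= eps   ->   Xu - Xv + b - xi_star <= eps + y
    below = np.hstack([X, -X, ones, zero, -eye])
    A_ub = np.vstack([above, below])
    b_ub = np.concatenate([eps - y, eps + y])
    # u, v and the slacks are nonnegative; the bias b is unbounded.
    bounds = [(0, None)] * (2 * d) + [(None, None)] + [(0, None)] * (2 * n)
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    w = res.x[:d] - res.x[d:2 * d]
    bias = res.x[2 * d]
    return w, bias


def select_variables(X, y, n_models=20, C=1.0, eps=0.1, tol=1e-6, seed=0):
    """Fit sparse linear models on bootstrap samples and keep the variables
    whose average weight is nonzero (up to the tolerance tol)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.empty((n_models, d))
    for k in range(n_models):
        idx = rng.choice(n, size=n, replace=True)  # bootstrap resample
        W[k], _ = sparse_linear_svr(X[idx], y[idx], C=C, eps=eps)
    mean_w = W.mean(axis=0)
    selected = np.flatnonzero(np.abs(mean_w) > tol)
    return selected, W


# Synthetic example: only the first two of 30 variables carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.05 * rng.normal(size=60)
selected, W = select_variables(X, y, n_models=10, C=10.0)
print("selected variables:", selected)
print("mean weights of selected:", W.mean(axis=0)[selected])
print("weight std of selected:", W.std(axis=0)[selected])
```

The rows of `W` hold one weight vector per linear model, so their per-variable mean and spread give the kind of weight distribution the abstract uses for ranking. The selected columns would then be passed to a nonlinear model, which this sketch leaves out.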
- This paper has been accepted by JMLR for the special issue on
variable/feature selection.
- A longer version of the paper than the one accepted for JMLR can be found
here. It comprises two chapters of Jinbo Bi's PhD thesis and gives a more
thorough description of the QSAR application.
- A data set used in this paper
The raw Caco2 dataset (gzipped) was generated in the ongoing Automated Drug
Discovery project. The dataset consists of 27 molecules and 713 descriptors
calculated using the RECON and MOE programs.
These descriptors encode the molecular shape, topology, subdivided
surface-area and electron-density properties, which have been widely applied
in Quantitative Structure-Activity Relationship (QSAR) studies. The property
LogPC (the last column) is the response representing the Caco2 permeability of
the compounds. See the longer version for details.
The preprocessed Caco2
dataset (gzipped) was generated by applying commonly used chemometric
screening techniques (see the JMLR paper or the longer version). Our variable
selection and induction algorithms were tested on the preprocessed dataset.
- CPLEX programs
All of our algorithms
were implemented using the commercial optimization software CPLEX 6.6.
Programs are available upon request (contact Dr. Kristin Bennett
firstname.lastname@example.org). An appropriate version of CPLEX is required to run the
Contact Jinbo Bi email@example.com for information about this