Dimensionality Reduction via Sparse Support Vector Machines
Kristin Bennett, Jinbo Bi, Mark Embrechts, Curt Breneman and Minghu Song
Departments of Mathematics, DSES, and Chemistry
Rensselaer Polytechnic Institute
Abstract
We describe a methodology for performing variable ranking and
selection using support vector machines (SVMs). The method
constructs a series of sparse linear SVMs to generate linear
models that can generalize well, and uses a subset of nonzero
weighted variables found by the linear models to produce a final
nonlinear model. The method exploits the fact that a linear SVM
(no kernels) with $\ell_1$-norm regularization inherently performs
variable selection as a side-effect of minimizing capacity of the
SVM model. The distribution of the linear model weights provides a
mechanism for ranking and interpreting the effects of variables.
Starplots are used to visualize the magnitude and variance of the
weights for each variable. We illustrate the effectiveness of
the methodology on synthetic data, benchmark problems and
challenging regression problems in drug design. This method can
dramatically reduce the number of variables, and outperforms SVMs
trained using all attributes and using the attributes selected
according to correlation coefficients. The visualization of the
resulting models is useful for understanding the role of
underlying variables.
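
The sparsity claim above can be made concrete with a standard formulation. The following is a sketch only: the symbols $w$, $b$, $\xi_i$, $\xi_i^*$, $\varepsilon$ and $C$, and the choice of the $\varepsilon$-insensitive loss, are assumptions made for illustration and are not necessarily the exact program used in the paper. For data $(x_i, y_i)$, $i = 1, \dots, m$, an $\ell_1$-norm linear SVM for regression can be written as:

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*} \quad & \|w\|_1 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*) \\
\text{s.t.} \quad & y_i - w^\top x_i - b \le \varepsilon + \xi_i, \\
& w^\top x_i + b - y_i \le \varepsilon + \xi_i^*, \\
& \xi_i \ge 0,\ \xi_i^* \ge 0, \qquad i = 1, \dots, m.
\end{aligned}
```

Writing $w = u - v$ with $u, v \ge 0$ and replacing $\|w\|_1$ by $\sum_j (u_j + v_j)$ turns this into a linear program. Vertex solutions of such programs typically have many weights $w_j$ exactly equal to zero, and the variables with nonzero weight are the ones carried forward to the nonlinear model.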
- This paper has been accepted by JMLR, special issue on variable/feature
selection.
- A longer version of the paper than the one accepted for JMLR can be found
here. It comprises two chapters of Jinbo Bi's PhD thesis and gives a more thorough description of the QSAR
application.
- A data set used in this paper
The raw Caco2 dataset (gzipped) was generated in the ongoing
Automated Drug Discovery project. The dataset consists of 27 molecules
and 713 descriptors calculated using the
RECON, PEST and MOE programs. These
descriptors encode the molecular shape, topology, subdivided
surface-area and electron-density properties, which have been widely applied
in Quantitative Structure-Activity Relationship (QSAR) studies.
The property LogPC (the last column) is the response representing the
Caco2 permeability of the compounds. See the longer version for details.
The preprocessed Caco2 dataset (gzipped) was generated by applying commonly used chemometric screening techniques
(see the JMLR paper or the longer version).
Our variable selection and induction algorithms were tested on the preprocessed dataset.
- CPLEX programs
All of our algorithms were implemented using the commercial optimization software CPLEX 6.6.
Programs are available upon request (contact Dr. Kristin Bennett, bennek@rpi.edu). An appropriate version of
CPLEX is required to run the programs.
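
For readers who want to inspect the Caco2 data described above outside of CPLEX, the following is a minimal Python sketch of reading a gzipped table whose last column is the response (LogPC). The file layout is an assumption, not something stated on this page: one molecule per row, numeric fields separated by whitespace, and no header line. The function name `load_descriptor_table` and its `delimiter` parameter are hypothetical names introduced for illustration.

```python
import gzip
import os
import tempfile


def load_descriptor_table(path, delimiter=None):
    """Read a gzipped numeric text table; the last column is the response.

    Assumed layout (not confirmed by the source): one molecule per row,
    fields separated by `delimiter` (None means any whitespace), no header.
    Returns (descriptors, response) as plain Python lists.
    """
    rows = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped:
                continue  # skip blank lines
            rows.append([float(field) for field in stripped.split(delimiter)])
    if not rows:
        raise ValueError("no data rows found")
    width = len(rows[0])
    if width < 2 or any(len(row) != width for row in rows):
        raise ValueError("rows must share one length of at least 2 columns")
    descriptors = [row[:-1] for row in rows]  # all columns but the last
    response = [row[-1] for row in rows]      # last column, e.g. LogPC
    return descriptors, response


if __name__ == "__main__":
    # Demo on a tiny made-up table, not on the real Caco2 data.
    fd, demo_path = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    try:
        with gzip.open(demo_path, "wt") as out:
            out.write("0.1 0.2 0.3 -4.5\n0.4 0.5 0.6 -5.0\n")
        X, y = load_descriptor_table(demo_path)
        print(len(X), "molecules,", len(X[0]), "descriptors")
    finally:
        os.remove(demo_path)
```

If the actual file uses commas or tabs, passing `delimiter=","` or `delimiter="\t"` should be enough. For the raw dataset one would expect 27 rows of 713 descriptors each, which is a quick check that the assumed layout matches.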
Contact Jinbo Bi (bij2@rpi.edu) for information about this page.