Subset Selection for Linear Regression
December 1, 2011
JEC 3117 - 4:00 p.m. to 5:00 p.m.
One of the central problems in many data-driven sciences is the
right selection of attributes to observe in order to accurately
predict a variable of interest. Applications abound in areas as diverse as
medical sciences (predicting medical conditions based on observable
attributes), social sciences (predicting future behaviors or outcomes) and
sensor networks, among others.
In many of these cases, time or cost constraints prohibit sampling more
than a few attributes. Also, many application domains use linear
regression as a method of prediction, and evaluate the quality of
attributes in terms of the R^2 fit with the quantity to be predicted.
This motivates the following formal problem definition:
"Given the covariances between observable variables X_i and a target
variable Z, select k of the variables X_i such that the selected set has
the best possible R^2 fit with Z."
The main result presented in this talk is that so long as the covariance
matrix between the X_i variables is far from singular, greedy algorithms
frequently used in practice are provably constant-factor approximations.
The proof is based on extending the widely used concept of submodularity
to a notion of approximate submodularity, and relating it to the spectrum
of the covariance matrix.
Furthermore, we will investigate various graph-theoretical properties of
covariance matrices which allow for efficient exact or approximate
We conclude with several exciting open questions.
[This talk is based on joint work with Abhimanyu Das, appearing in
ICML 2011 and STOC 2008.]
Hosted by: Dr. Elliot Anshelevich
Last updated: October 24, 2011