This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder. |
Publication List (pdf)
Talks (and Posters) |
2017; 2016; 2015; 2014; 2013; 2012; 2011; 2010; 2009; 2008; 2007; 2006; 2005; 2004; 2003; 2002; 2001; 2000; 1999 and before.
Summary: Using a reduction from CLIQUE we show NP-hardness and inapproximability of Sparse PCA.
Summary: This paper is concerned with approximating the dominant left singular vector space of a real matrix A of arbitrary dimension, from block Krylov spaces q(AAT,AX). Two classes of results are presented. First are bounds on the distance, in the two and Frobenius norms, between the Krylov space and the target space. The distance is expressed in terms of principal angles. Second are quality of approximation bounds, relative to the best low-rank approximation in the Frobenius norm. For starting guesses X of full column-rank, the bounds depend on the tangent of the principal angles between X and the dominant right singular vector space of A. The results presented here form the structural foundation for the analysis of randomized Krylov space methods. The innovative feature is a combination of traditional Lanczos convergence analysis with optimal low-rank approximations via least squares problems.
Summary: This paper addresses how well we can recover a data matrix when only given a few of its elements. We present a randomized algorithm that element-wise sparsifies the data, retaining only a few its elements. Our new algorithm independently samples the data using sampling probabilities that depend on both the squares (ℓ2 sampling) and absolute values (ℓ1 sampling) of the entries. We prove that the hybrid algorithm recovers a near-PCA reconstruction of the data from a sublinear sample-size: hybrid-(ℓ1,ℓ2) inherits the ℓ2-ability to sample the important elements as well as the regularization properties of ℓ1 sampling, and gives strictly better performance than either ℓ1 or ℓ2 on their own. We also give a one-pass version of our algorithm and show experiments to corroborate the theory.
Summary: Multilayer networks have seen a resurgence under the umbrella of deep learning. Current deep learning algorithms train the layers of the network sequentially, improving algorithmic performance as well as providing some regularization. We present a new training algorithm for deep networks which trains \emph{each node in the network} sequentially. Our algorithm is orders of magnitude faster, creates more interpretable internal representations at the node level, while not sacrificing on the ultimate out-of-sample performance.
Summary: We document that a surprisingly large number of editors change their behavior and begin focusing more on a particular controversial topic once they are promoted to administrator status. The conscious and unconscious biases of these few, but powerful, administrators may be shaping the information on many of the most sensitive topics on Wikipedia; some may even be explicitly infiltrating the ranks of administrators in order to promote their own points of view. In addition, we ask whether administrators who change their behavior in this suspicious manner can be identified in advance. Neither prior history nor vote counts during an administrator’s election are useful in doing so, but we find that an alternative mea- sure, which gives more weight to influential voters, can successfully reject these suspicious candidates. This second result has important implications for how we harness collective intelligence: even if wisdom exists in a collective opinion (like a vote), that signal can be lost unless we carefully distinguish the true expert voter from the noisy or manipulative voter.
Summary: We study the relationship between chatter on social media and observed actions concerning charita- ble donation. One hypothesis is that a fraction of those who act will also tweet about it, implying a linear relation. However, if the contagion is present, we expect a super- linear scaling. We consider two scenarios: donations in response to a natural disaster, and regular donations. We empirically validate the model using two location-paired sets of social media and donation data, corresponding to the two scenarios. Results show a quadratic relation between chatter and action in emergency response case. In case of regular donations, we observe a near-linear relation. Additionally, regular donations can be explained by demographic factors, while for a disaster response social media is a much better predictor of action. A contagion model is used to predict the near-quadratic scaling for the disaster response case. This suggests that diffusion is pre- sent in emergency response case, while regular charity does not spread via social network.
Summary: We give two provably accurate feature-selection techniques for the linear SVM. The algorithms run in deterministic and randomized time respectively. Our algorithms can be used in an unsupervised or supervised setting. The supervised approach is based on sampling features from support vectors. We prove that the margin in the feature space is preserved to within ε-relative error of the margin in the full feature space in the worst-case. In the unsupervised setting, we also provide worst-case guarantees of the radius of the minimum enclosing ball, thereby ensuring comparable generalization as in the full feature space and resolving an open problem posed in Dasgupta et al. (2007).
Summary: We provide fast algorithms for robust (L_1) and general L_p regression using a fast projection of the design matrix with a Hardmard scaled by Cauchy random variables (hence the name Fast Cauchy Transform, FCT). Our projection algorithm runs in O(ndlogn) and provides O(d^(2+)) L_1 distortion obliviously for an arbitrary subspace of dimension d. We give applications of our fast embedding algorithm to robust regression, L_p regression and L_1 subspace approximation.
Summary: Principal components analysis (PCA) is the optimal linear auto-encoder of data, and it is often used to construct features. Enforcing sparsity on the principal components can promote better generalization, while improving the interpretability of the features. We study the problem of constructing optimal sparse linear auto-encoders. Two natural questions in such a setting are: i) Given a level of sparsity, what is the best approximation to PCA that can be achieved? ii) Are there low-order polynomial-time algorithms which can asymptotically achieve this optimal tradeoff between the sparsity and the approximation quality? In this work, we answer both questions by giving efficient low-order polynomial-time algorithms for constructing asymptotically \emph{optimal} linear auto-encoders (in particular, sparse features with near-PCA reconstruction error) and demonstrate the performance of our algorithms on real data.
Summary: We give a reduction from {\sc clique} to establish that sparse PCA is NP-hard. The reduction has a gap which we use to exclude an FPTAS for sparse PCA (unless P=NP). Under weaker complexity assumptions, we also exclude polynomial constant-factor approximation algorithms.
Summary: When actors in a social network interact, it usually means they have some general goal towards which they are collaborating. This could be a research collaboration in a company or a foursome planning a golf game. We call such groups \emph{planning groups}. Our particular focus is hidden planning groups who have not "declared" their membership explicitly. We formulate the problem of hidden group discovery from streaming interaction data, and we propose efficient algorithms for identifying the hidden group structures by isolating the hidden group's non-random, planning-related, communications from the random background communications. We validate our algorithms on real data (the Enron email corpus and Blog communication data).
Summary: PCA, Clustering and Regression with sparse features: compute them efficiently and not sacrifice in in-sample performance too much. Mostly show and tell. A little SVM.
Summary: We introduce different notions of prominence in a network based on its community structure: a local rank, which is the prominence of an actor within its communities (the communities are computed using a clustering algorithm) and its community rank, which is the rank of an actor's community within the "graph of communities". We show in a variety of real networks that these two measures of prominence capture different effects. On these same real networks, we show that external measures of prominence (like an authors citation prominence, or an actors ability to feature in big-budget movies) can be deconstructed into a contribution from local rank and community rank. Local and community ranks also play differing roles depending on the nature of the network and the type of external prominence one wishes to capture.
Summary: We give o(SVD) algorithms that construct O(k/e) columns to reconstruct a matrix to within O(1+e)| |A-Ak||_F, where Ak is the best rank k reconstruction from the SVD. The number of columns is near optimal, asymptotically matching a known lower bound. We introduce two new tools in our algorithm - fast approximate matrix factorizations, and dual set sparsification results which may be of independent interest.
Summary: We give additive error bounds on sparse regression (where the regression vector is required to have a small number of non-zeros) in terms of the top-k PCA regression.
Summary: We show that using random projections for dimensionality reduction in SVM preserves (to within relative error) both the margin and the radius of the data. Since the margin and data radius are the two quantities parametrizing the statistical generalization error, these statistical bounds are preserved during the feature selection.
Summary: We develop a new algorithm that computes the SVD-truncated solution to a least squares problem (regression onto the top-k principal components) in time faster than SVD, while guaranteeing an additive error. The algorithm first quickly constructs an approximation to the top-k PCA and uses that approximation in the regression. We give a lower bound showing that our quality of approximation is the best one can expect of such algorithms which first approximate PCA.
Summary: We give additive error bounds on sparse regression (where the regression vector is required to have a small number of non-zeros) in terms of the top-k PCA regression.
Summary: This talk describes ways to do PCA, Clustering and Regression with sparse features: compute them efficiently and not sacrifice in in-sample performance too much.
Summary: This talk describes ways to do PCA, Clustering and Regression with sparse features: compute them efficiently and not sacrifice in in-sample performance too much. Mostly show and tell.
Summary: This talk describes ways to do PCA, Clustering and Regression with sparse features: compute them efficiently and not sacrifice in in-sample performance too much.
Summary: We give linear coresets for linear regression in both the spectral and Frobenius norms. Our algorithms are deterministic. We also provide lower bounds on coreset size for both deterministic and randomized algorithms.
Summary: We give new deterministic algorithms for selecting O(k) features to perform clustering. The resulting features give a relative error approximation to the optimal objective when considering the clustering error in the full space.
Summary: Using a maximum likelihood algorithm, we simultaneously estimate the dynamical parameters of the Saggitarius Dwarf galaxy stream as well as the parameters of the Milky Way galaxy. As a by-product, we present the first catalog of stars having the same density profile as the Saggitarius stream. These results were achieved by leveraginig the MilkyWay@home volunteer computing platform.
Summary: We give new algorithms for ranking in collaborative networks based on [i]clustering[/i] the collaborative artifacts into hyperedges and using an iterative pagerank type algorithm to obtain the rankings from these hyperedges.
Summary: We develop new algorithms to seed an information diffusions for a non-submodlar model. The model is built to mimic social proceeses that are likely to occur in information diffusion. Our approach is to project the non-submodular model to a class of submodular models for which an efficient solution can be constructed. We demonstrate significantly better performance as compared to using random or high degree seeding strategies on real and synthetic networks.
Summary: We develop the Fast Cauchy Transform (FCT), the L1 equivalent of the Fast Johnson Lindenstrauss Transform, which allows us to develop a fast algorithm for robust (L1) regression. This can be extended to Lp regression for p in the range [1,2]. We also give experimental simulations which show that the practice follows the theory closely.
This book is Part I of our text book on Machine Learning that is on the foundations and basic principles of learning from data. Roughly speaking, the book addresses the following topics: What is learning? Is learning feasible? Can we do it efficiently (linear models)? Can we do it well (overfitting and regularization)? Lesons learned - what are the take home messages?
Summary: We give the first algorithms to compute statistical leverage and coherence in sub-SVD time using random projections. The algorithms are accurate to within relative error. Datapoints with high statistical leverage are important in that they are likely to belong to good coresets that can reproduce a linear regression to within relative error.
Summary: We consider the optimal attack pattern to spam the page rank algorithm. It turns out to be particularly simple: all attackers point to the victim. This optimally increases not only the page-rank of the victim but also the rank (based on the page-rank) of the victim. We also consider optimal 'disguised' attacks, i.e. attacks where the attackers do not want to be directly associated to the victim. A similar result holds in that setting.
Summary: We give one of the first deterministic algorithms for provable reconstruction of a matrix using a small subset of its columns. The method is based on a greedy approximation of the SVD. In order to greedily approximate the SVD from a dictionary, we extend Natarajan's result for greedy approximation of a vector from a dictionary.
Summary: We give an oblivious random projection algorithm which samples a number features proportional to the rank of the data matrix such that the margin and radius of the data in the dimensionally reduced space are comparable to those in the original full-dimension space. Thus the generalization properties of the SVM are preserved. We give experimental simulations which bear out the theory.
Summary: We give the first polynomial algorithms which find a coreset of size order the number of variables, such that the regression on this coreset gives a relative error approximation to the regression on the full data set.
Summary: We consider a spreading process that can be mapped onto a random graph model: order the vertices and randomply place directed edges from lower ranked to higher ranked vertices; each such edge exists independently with probability \math{p}. If the spread starts at the first vertex, the size of the infection is the component reachable from this vertex. We prove existence of a sharp threshold \math{p^*=\log n/n} at which this reachable component transitions from \math{o(n)} to \math{\Omega(n)}.
Summary: A random projection based power iteration for estimating spectral norms to within relative error.
Summary: Overlapping communities play a large role in the functioning of social networks. We give a definition based on minimal properties that such communities should have, and develop efficient algorithms to find such comunities. We give some methods for validating that the communities found are "real" and compare this minimal axiomatic approach to several other benchmark algorithms which take a specific view as to the structure of the communities.
Summary: We collect some of our previous results on oblivious routing algorithms with near-optimal congestion and dilation (as compared with optimal offline routing) for sensor network topologies which include the d-dimensional mesh and uniform geometric networks.
Summary: We show that selecting columns of a data matrix to have maximum "numerical rank" (as measured by the volume of the data vectors selected) is exponentially hard to approximate.
Summary: We give a succint model for edit dynamics in collective wisdom processes in which users (the collective) arrive sequentially and contribute to the wisdom on a particular topic. The model captures the arrival of new information, together with the saturation of information on a particular topic. As the quality of a particular topic increases, there is a tradeoff between more uses arriving and those users having less to contribute. We demonstrate the model on Wiki and Blog edit behavior, in particular capturing the main observed phenomenological dynamics within just this simple model.
Summary: For $p\backslash ge\; 1$, we prove that every forest with $p$ trees whose sizes are $a\_1,\; .\; .\; .\; ,\; a\_p$ can be embedded in any graph containing at least $\backslash sum\_\{i=1\}^p\; (a\_i\; +\; 1)$ vertices and having minimum degree at least $\backslash sum\_\{i=1\}^p\; a\_i$.
Summary: We present a method for using Gaussian Mixture Models (GMMs) efficiently using low rank perturbations to a diagonal covariance matrix. The full covariance matrix has O(d^2) free parameters and could be prone to overfitting as well as inefficient. The diagonal covariance matrix has O(d) parameters and is efficient but can be restrictive. The low rank perturbation to a diagonal covariance matrix has O(d) parameters, is efficient and offers a middle ground between full and diagonal covariance which can capture correlations but using linear (in d) parameters.
Summary: We show that it is possible to reduce the feature dimension and provably obtain clusterings which are comparable to the good clusterings in the original dimension.
Summary: We develop edit behavioral indices that capture change in behavior and show correlation between blocked (potentially manipulative) editors and adverse changes in this index.
Summary: We present a method for "estimating" the out-sample error (validation) using a permutation estimate. For classification, one can bound the out-sample error using a permutation complexity which is related to the permutation estimate. The permutation estimate, however, applies to classification, regression, multiclass, ... and can be extended to different error measures (other than squared error) -- the paper primarily focusses on classification and L2-regression. Our experiments show that the permutation estimate performs well for model selection, outperforming a variety of other model methods including LOO-CV and 10-fold CV. The permutation estimate general, easy to compute and efficient, relying only on being able to run the learning algorithm on data.
Erratum: Rudiger Hewer, in his bachelor thesis graciously pointed out to me that the bound in Lemma 8 should be corrected by a factor of 2 which propagates to the error term in Theorem 5, which should be 5*sqrt(8/n...) instead of 3*sqrt(8/n...).
Summary: We develop an agent based model for information diffusion in social networks based on information fusion, propagation and querying. We validate the model on real data from the San Diego wild fires evacuation in 2007.
Summary: We develop a novel 2-d ball mechanism for comparing prediction market microstructures, and use it to compare two market makers LMSR and BMM.
Summary: We study the problem of selecting a set of columns of a matrix with maximum volume. This is useful in trying to construct representative columns of high "numerical rank", which may be of use in matrix reconstruction. We study the algorithmic problem, showing that it is NP-hard and inapproximable. We then study a natural greedy algorithm for this problem, giving a worst case approximation guarantee, also demonstrating a problem which almost achieves this worst case. The performance of greedy is considerably worse than the inapproximability result we prove, suggesting avenues for further work in strengthening the inapproximability result or finding better algorithms.
Summary: We study atomic routing games on networks in which players choose a path with the objective of minimizing the maximum congestion along the edges of their path. The social cost is the global maximum congestion over all edges in the network. We show that the price of stability is 1. The price of anarchy, PoA, is determined by topological properties of the network. In particular, PoA=O(L+logn), where L is the length of the longest path in the player strategy sets, and n is the size of the network. Further, K-1<=PoA<=c(K^2+log^2 n), where K is the length of the longest cycle in the network, and c is a constant.
Summary: We give heuristics for learning the parameters of a HMM describing dynamics of social groups in social networks. The input data are communications for the social network. The HMM is an agent based micro-law model which is highly interdependent. We apply our methodology to real data from blogs, newsgroups.
Summary: We give a parameterised generative HMM model for social group evolution based on social capital theory, and demonstrate its potential for modeling of social groups in social networks. In particular, one may observe certain abrupt changes in the behavior of macroscopic properties such as the group size distribution as one continuously changes some of the parameters. Within this model we can reproduce many observed phenemona within the social science literature.
Summary: We give a maximum likelihood algorithm to identify streams in from SDSS like data. In particular, we use a cylindrical profile for the stream over small length scales and a Hernquist profile for the Milky Way halo. The algorithm finds the stream and separates it from the background halo, thus producing more accurate parameters of the stream and giving the first catalogue of stars extracted from the data to have the stream density profile. In general, we are studying the problem of separating geometric objects in spatial data bases.
Summary: We give bufferless hot-potato style scheduling algorithms for trees and leveled networks. For leveled networks, we give centralized and distributed algorithms. The centralized algorithm has routing time within a logarithmic factor of optimal, and the distributed algorithm is a logarithmic factor worse than the centralized. For trees, we give deterministic and randomized algorithms. The deterministic algorithm applies to bounded degree networks and has routing time within a logaritmic factor of optimal. The randomized algorithm applies to arbitrary networks, and has routing time within a log-squared factor from optimal.
Summary: We give an oblivious algorithm for the d-dimensional mesh in which packets select their paths independently of each other. The stretch and congestion are both within O(d^2) from the optimal stretch and congestion attainable by oblivious algorithms, and for oblivious algorithms, the congestion is within a logarithmic factor from the optimal congestion attainable by non-oblivious algorithms. Our algorithm is randomized, and we show that significant randomization is required for any algorithm which attains near optimal congestion. In particular, we show that our algorithm uses at most a factor O(d) more bits per packet than any algorithm which attains a comparable congestion.
Summary: We study TDMA-based MAC protocols for asynchronous wireless sensor networks in very harsh environments. Specifically, the protocols are contention-free (avoid collisions), distributed and self-stabilize to topological changes in the network; any topological changes are contained, namely, affect only the nodes in the vicinity of the change; our protocols do not assume that nodes have a global time reference, that is, nodes may not be time-synchronized, and nodes may wake up in an arbitrary order. We also discuss how to accomodate clock skew, slot misalignment and inability for collision detection at a node.
Summary: We address the problem of localizing an object in many dimensions by selecting a small number of sensors from which to obtain a measurement. We abstract a sensor measurement as a convex set in R^n and a localization as the intersection of all measurments available to the object. We show that a constant number of sensors is all that is required to obtain a constant factor approximation to the localization error obtained from using all the sensors. Instrumental in our proof is a new construction of an enclosing simplex of a convex polygon with bounded volume.
Summary: We give universal packet switching algorithms for bufferless networks, settling an open question regarding the possibility of optimal (to within poly-log factors) bufferless packet switching algorithms. Our argument is constructive and the algorithm is polynomial time. The main idea is to convert a bufferless scheduling problem with prespecified paths in a graph G to a new routing problem in a related graph G'. The problem in G' is solved using buffers, and the solution in G' is emulated in a deterministic bufferless manner in G to give the final bufferless schedule. Buffering in G' is emulated by packet circulation in regions of G.
Summary: It is often the case that learning with a monotonicity constraint is appropriate, however the directions of the monotonicity constraints may not be available in a general d-dimensional setting. One approach is to use a simple learning model to learn the monotonicity constraints, which can then be applied to constrain a more complex learning model. We show that the optimal linear fit can extract the correct monotonicity directions in one dimension, which remains true in multi-dimensions provided that the input distribution is of a Mahalanobis type. The same results remain true asymptotically when the the OLS estimator is used instead of the optimal linear fit.
Summary: We develop a statistical framework for studying ranker aggregation and a new algorithm for ranker aggregation based on a Kernigan-Lin style iterative best flip approach. We give experimental results for varying amounts of noise in the rankers as well as dis-similarity in the rankers (mis-information). In this space the optimal ranker can vary significantly, illustrating an information vs. robustness tradeoff for ranker aggregation.
Summary: We relate the problem of boosting a pair of classifiers to an optimal weighted linear separation problem in two dimensions. We give efficient algorithms to perform this separation and compute the leave one out error. The main idea is to carefully enumerate the possible separators to finally obtain an O(n^2logn) algorithm, where n is the number of data points.
Summary: We consider the direct routing problem, in which packets must follow pre-specified paths in a bufferless network. In this routing model, the only parameter to be determined for a packet is the injection time. We give a general greedy algorithm which has routing time O(C D), and we show that there are routing problems where this is the best achievable by a direct algorithm. We show that optimal direct routing is NP-hard to approximate by gap preserving reduction from graph coloring. We give versions of the greedy algorithm which have optimal or near optimal routing time for trees, multi-dimensional meshes, hypercubes and the butterfly. Finally, we use hard direct routing problems to obtain lower bounds on the buffering requirements of any algorithm that requires packets to follow pre-specified paths. If such algorithms obtain near optimal routing time, then packets must be buffered Omega(N^{1/3}) times (on average), where N is the number of packets.
Summary: We model a node using a Markovian automaton which transitions between various modes which offer different capabilities and use power at different states. We develop a mean field theoretic analysis for the behavior of the system, which we optimize with respect to power consumption using connectivity and coverage as constraints. We apply the methodology to a 3 state automaton with off, listen and transmit states. We give results comparing the mean field theory analysis with a simulation, and we compare our approach with existing power saving approaches, which do not necessarily offer connectivity and coverage.
Summary: We give linear time algorithms for (unweighted) isotonic and unimodal regression in the Linfinity norm. For the L1 norm, we give linear time algorithms when the output values are in a bounded, finite set, and an approximation algorithm when the output values are in a bounded range. An open question that remains is to construct linear time isotonic regression in the L1 norm for the arbitrary case.
Summary: We consider the algorithm aspects of fair division of a resource, in which limited input from the users may be obtained regarding how much they value a certain part of the resource. If there are n users, it is known that a fair division of the resource (one in which no user thinks they have a less than 1/n of the resource) can be obtained using n logn cuts of the resource (which is modeled as the unit interval). No matching lower bound is available. We take a first step in this direction by showing (through reduction from sorting) that Omega(n logn) comparisons are made by any fair division algorithm. We also consider strong envy-free division (in which no user thinks she has a lesser share than any other user), and give lower bounds of Omega(n^2) on the number of cuts that must be made.
Summary: We use some recent results on the behavior of the Maximum Drawdown of a Brownian motion to develop a scaling law for the Maximum Drawdown which allows one to compare the Sterling/Calmar ratios of trading strategies when the historical data available for the different trading strategies extends over different lengths of time.
Summary: We develop the distribution for the Maximum Drawdown (largest drop from a peak to a bottom) of a Brownian motion. We compute the expected value as an infinite integral series. We reduce this function to a single "universal" function, which can be computed numerically once, and used to compute the expected drawdown for any Brownian motion. We compute the asymptotic behavior of this infinite series to reveal three regimes for the behavior of the MDD -- logarithmic, square-root and linear, depending on the sign of the drift parameter in the Brownian motion.
Summary: Since joint distribution of the high, low and close (conditioned on the open, drift and variance parameters) can be computed, we use this distribution to develop a likelihood for observing the high, low and close conditioned on the drift and variance parameters. By maximizing this likelihood, we construct an estimate of the variance parameter, which we compare to other well known estimators of variance parameter. Maximizing the likelihood yields a systematic gain in performance.
Summary: We consider density estimation and random variate generation using neural networks. For density estimation, we give two methods, both based on learning the sample cumulative distribution function, one is randomized (SLC) and one is a deterministic interpolation using a smoothness constraint. We prove that the randomized algorithm converges to the deterministic version, asymptotically, however practically, randomization offers an additional smoothness property. We prove convergence of the deterministic algorithm to the true density given bounded derivative constraints on the true density.
Summary: We give an introduction to the Martingale measure, a tool for pricing options, which forml a link between pricing of financial derivatives and Monte Carlo simulation. We use this approach together with a limiting argument to give an elementary derivation of the price of the European call option, and discuss American options.
Summary: We give give efficient algorithms for estimating the height (maximum antichain) of a set of random points in a d-dimensional unit cube. The algorithms are based on a result we show which states that it suffices to consider only points near the diagonal of the cube. We develop efficient algorithms for generating these points directly, together with an efficient algorithm for constructing the height of this set. We use the co-convergence of estimates from different sized regions around the diagonal to the same value too develop an improved estimate. As a result, we give the first accurate estimates of c_d, the coefficients governing the convergence rate.
Summary: We give an analysis of the early restart phenomenon which can be used to speed up an algorithm whose running time is a function of the initial conditions. For some initial conditions, the runtime may be excessively long or unbounded (for example local minima in an optimization). Given some probability distribution over which the initial conditions are sampled, there is some distribution for the stopping time of the process. We give a analysis of how using restart can improve the expected runtime, and develop a condition for selecting the optimal restart time. Essentially, if the algorithm has run for long enough, it may be favorable to restart the algorithm.
Summary: We show that if the noise is additive, then the prior and posterior distributions of the noise are equal when the prior over target functions is uniform. There is no free lunch in the sense that if no useful assumptions can be made about the target function, then it is not possible gain any extra information regarding the nature of the noise.
Summary: We give a general characterization of the performance of a general class of learning systems. Specifically, when the learning system is stable (satisfies certain regularity conditions) and the noise is additive with bounded variance and the input distribution has compact support, the test performance approaches the optimal test performance at a universal rate of O(1/N), where N is the number of data points.
Summary: We show that the success of early stopping is an artifact of training dynamics which tends to initialize the weights at smaller values and so the resulting function tends to have smaller magnitude weights, and hence tends to be smoother, which corresponds to having a smoothness prior on the target functions. When no such hidden prior assumption is made, i.e., when early stopping chooses uniformily among all hypotheses with the same (higher than optimal) error, we show that the resulting classifier has performance which is monotonically decreasing in the value of the early stopping error. Thus there is No Free Lunch for early stopping, in that it can only be succesfull if some bias is given toward the target function prior -- which means that some prior on the target function is necessary.
Summary: We analyse some systematic errors in the prediction of volatility using Maximum Likelihood methods. In particular, there is a systematic underprediction even when the mean is predicted well. Further, Maximum Likelihood methods can systematically select the wrong model over the correct model, when faced with a choice of two models. We develop an expression for a correction factor for systematically correcting for this systematic underprediction.