Volume 36, Numbers 1 & 2, 1999
Editors
Philip Chan,
Florida Institute of Technology
Salvatore Stolfo,
Columbia University
David Wolpert,
NASA Ames Research Center
The goal of combining the predictions of multiple learned models is to form an improved estimator. A combining strategy must be able to robustly handle the inherent correlation, or multicollinearity, of the learned models while identifying the unique contributions of each. A progression of existing approaches and their limitations with respect to these two issues are discussed. A new approach, PCR*, based on principal components regression is proposed to address these limitations. An evaluation of the new approach on a collection of domains reveals that 1) PCR* was the most robust combining method, 2) correlation could be handled without eliminating any of the learned models, and 3) the principal components of the learned models provided a continuum of ``regularized'' weights from which PCR* could choose.
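For illustration, here is a minimal sketch of combining learned models with principal components regression. It is not the authors' PCR* procedure: the NumPy-based helpers (pcr_combine, combine), the fixed choice of n_components, and the toy data are assumptions made for the example; PCR* additionally selects the number of retained components rather than fixing it.

```python
# A minimal sketch (not the PCR* algorithm itself): regress the target on the
# leading principal components of the models' predictions, then map the
# component coefficients back to per-model weights.
import numpy as np

def pcr_combine(model_preds, y, n_components):
    """model_preds: (N, K) held-out predictions of K learned models; y: (N,) targets."""
    mean = model_preds.mean(axis=0)
    X = model_preds - mean
    # Principal directions of the (possibly highly correlated) model predictions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:n_components].T                      # K x c loading matrix
    Z = X @ V                                    # N x c component scores
    gamma, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
    weights = V @ gamma                          # back to per-model weights
    intercept = y.mean() - mean @ weights
    return weights, intercept

def combine(model_preds, weights, intercept):
    return model_preds @ weights + intercept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(size=200)
    # Three nearly collinear "learned models": each is the target plus noise.
    model_preds = np.column_stack([y + 0.3 * rng.normal(size=200) for _ in range(3)])
    w, b = pcr_combine(model_preds, y, n_components=2)
    print("weights:", w, "intercept:", b)
```

Retaining fewer components shrinks the combined weights toward the dominant shared direction of the models, which is one way to read the ``regularized'' continuum mentioned above.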
Several effective methods for improving the performance of a single learning algorithm have been developed recently. The general approach is to create a set of learned models by repeatedly applying the algorithm to different versions of the training data, and then combine the learned models' predictions according to a prescribed voting scheme. Little work has been done in combining the predictions of a collection of models generated by many learning algorithms having different representation and/or search strategies. This paper describes a method which uses the strategies of stacking and correspondence analysis to model the relationship between the learning examples and the way in which they are classified by a collection of learned models. A nearest neighbor method is then applied within the resulting representation to classify previously unseen examples. The new algorithm consistently performs as well as or better than other combining techniques on a suite of data sets.
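A rough sketch of the general idea follows; it is not the paper's algorithm. A truncated SVD of the one-hot-encoded model predictions stands in here for correspondence analysis, and the class name ScannLike, its dim parameter, and the 1-nearest-neighbor rule are assumptions made for the example.

```python
# Sketch: embed each example by the class labels the base models assign to it,
# reduce that indicator representation, and classify new examples by the
# nearest training example in the reduced space.
import numpy as np

def one_hot(preds, n_classes):
    """preds: (N, K) integer class predictions -> (N, K * n_classes) indicator matrix."""
    N, K = preds.shape
    out = np.zeros((N, K * n_classes))
    for k in range(K):
        out[np.arange(N), k * n_classes + preds[:, k]] = 1.0
    return out

class ScannLike:
    def __init__(self, n_classes, dim=2):
        self.n_classes, self.dim = n_classes, dim

    def fit(self, train_preds, train_labels):
        X = one_hot(train_preds, self.n_classes)
        self.center = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - self.center, full_matrices=False)
        self.proj = Vt[:self.dim].T              # leading axes of the indicator matrix
        self.train_points = (X - self.center) @ self.proj
        self.train_labels = np.asarray(train_labels)
        return self

    def predict(self, test_preds):
        Z = (one_hot(test_preds, self.n_classes) - self.center) @ self.proj
        out = []
        for z in Z:
            d = np.sum((self.train_points - z) ** 2, axis=1)
            out.append(self.train_labels[np.argmin(d)])   # 1-nearest neighbor
        return np.array(out)
```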
This paper presents experimental results with both real and artificial data on using the technique of stacking to combine unsupervised learning algorithms. Specifically, stacking is used to form a linear combination of finite mixture model and kernel density estimators for non-parametric multivariate density estimation. The method is found to outperform other strategies such as choosing the single best model based on cross-validation, combining with uniform weights, and even using the single best model chosen by "cheating" and examining the test set. We also investigate in detail how the utility of stacking changes when one of the models being combined generated the data; how the stacking coefficients of the models compare to the relative frequencies with which cross-validation chooses among the models; visualization of combined "effective" kernels; and the sensitivity of stacking to overfitting as model complexity increases. In an extended version of this paper we also investigate how stacking performs using L1 and L2 performance measures (for which one must know the true density) rather than log-likelihood (Smyth and Wolpert 1998).
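A minimal sketch of the setup, not the paper's implementation: scikit-learn's GaussianMixture and KernelDensity stand in for the component density estimators, and the single stacking coefficient is chosen by a grid search over held-out log-likelihood rather than the paper's cross-validated stacking procedure; the function name and parameters are assumptions for the example.

```python
# Stack a finite mixture model and a kernel density estimator by choosing the
# convex combination weight that maximizes held-out log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

def stacked_density(train, held_out, n_components=3, bandwidth=0.5):
    gmm = GaussianMixture(n_components=n_components).fit(train)
    kde = KernelDensity(bandwidth=bandwidth).fit(train)
    # Held-out densities under each component estimator.
    p_gmm = np.exp(gmm.score_samples(held_out))
    p_kde = np.exp(kde.score_samples(held_out))
    # Grid search over the single stacking coefficient alpha.
    alphas = np.linspace(0.0, 1.0, 101)
    ll = [np.sum(np.log(a * p_gmm + (1 - a) * p_kde + 1e-300)) for a in alphas]
    alpha = alphas[int(np.argmax(ll))]
    def density(x):
        return alpha * np.exp(gmm.score_samples(x)) + (1 - alpha) * np.exp(kde.score_samples(x))
    return alpha, density

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(500, 2))
    alpha, density = stacked_density(data[:400], data[400:])
    print("stacking weight on the mixture model:", alpha)
```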
The sizes of many databases have grown to the point where they cannot fit into the fast memory of even large-memory machines, to say nothing of current workstations. If we want to use these databases to construct predictions of various characteristics, then, since the usual methods require that all data be held in fast memory, various workarounds have to be used. This paper studies one such class of methods, which give accuracy comparable to what could have been obtained if all the data could have been held in core and which are computationally fast. The procedure takes small bites of the data, grows a predictor on each small bite, and then pastes these predictors together. The methods are also applicable to on-line learning.
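A minimal sketch of the small-bites idea under simplifying assumptions: each bite is drawn uniformly at random (the paper's importance-weighted selection of bites is not reproduced), scikit-learn decision trees stand in for the predictors, and the function names and bite_size/n_bites parameters are assumptions for the example.

```python
# Grow one tree per small bite of the data, then paste the trees together by
# majority vote. In practice each bite would be streamed from disk rather
# than indexed out of an in-memory array.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def paste_small_bites(X, y, bite_size=500, n_bites=50, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_bites):
        idx = rng.choice(len(X), size=min(bite_size, len(X)), replace=False)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def vote(trees, X):
    preds = np.stack([t.predict(X) for t in trees])          # (n_bites, N)
    # Majority vote across the pasted predictors; assumes integer class labels.
    return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, preds)
```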
Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several variants of these algorithms in conjunction with a decision tree inducer (three variants) and a Naive-Bayes inducer. The purpose of the study is to improve our understanding of why and when these algorithms, which use perturbation, reweighting, and combination techniques, affect classification error. We provide a bias and variance decomposition of the error to show how different methods and variants have different effects on these two terms. Algorithmic variants, some of which are introduced in this paper, include: pruning versus no pruning, use of probabilistic estimates, weight perturbations (Wagging), and backfitting of data. We measure tree sizes and show interesting correlations between the average number of nodes and the success of AdaBoost. We compare the mean-squared error of voting methods and show that these methods lead to large and significant improvements. Practical problems that arise in implementing boosting algorithms, such as numerical instabilities and underflows, are explored. We use scatterplots that graphically show how AdaBoost reweights instances, emphasizing not only ``hard'' areas but also outliers and noise.
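As an illustration of one of the variants mentioned above, the following is a minimal sketch of Wagging under the assumption of Gaussian weight perturbations: every committee member sees the full data but with randomly perturbed instance weights, and predictions are combined by majority vote. The scikit-learn tree, the noise_std parameter, and the clipping at zero are assumptions made for the example, not the paper's exact settings.

```python
# Wagging sketch: perturb instance weights with Gaussian noise, fit one tree
# per perturbation, and combine the members by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def wagging(X, y, n_members=25, noise_std=2.0, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        # Perturb each instance weight with Gaussian noise, clipped at zero.
        w = np.maximum(0.0, 1.0 + noise_std * rng.normal(size=len(X)))
        members.append(DecisionTreeClassifier().fit(X, y, sample_weight=w))
    return members

def predict(members, X):
    votes = np.stack([m.predict(X) for m in members])
    # Majority vote; assumes integer class labels.
    return np.apply_along_axis(lambda c: np.bincount(c.astype(int)).argmax(), 0, votes)
```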