Working Notes of IMLM-96

(in first-author order)

On Explaining Degree of Error Reduction due to Combining Multiple Decision Trees
Kamal Mahmood Ali (ali@almaden.ibm.com), IBM Almaden Research Center
pp. 1-7 (PostScript)
In multiple models research, classifications or evidence is combined from several classification models in order to form overall classifications with higher accuracy than any of the individual models. However, most previous work has concentrated on demonstrating that use of multiple models leads to statistically significant error reduction as compared to the error rate of the single model learned on the same data. In this paper we present a model of the degree of error reduction due to the use of multiple decision trees. In particular, we show that the degree of error reduction can be linearly modeled by the ``degree to which decision trees make uncorrelated errors.'' These conclusions are based on results from 21 domains.
Creating and Exploiting Coverage and Diversity
Carla E. Brodley (brodley@ecn.purdue.edu) and Terran Lane (terran@ecn.purdue.edu), Purdue University
pp. 8-14 (PostScript)
In this paper, we illustrate that increasing coverage through diversity is not enough to ensure increased prediction accuracy---if the integration method does not utilize the coverage, then no benefit arises from integrating multiple models. We compare four criteria for selecting base-level classifiers and demonstrate that informed selection can lead to more accurate meta-level classifiers than random selection. In addition, we illustrate empirically that straightforward integration methods fail to utilize the diversity of the base-level classifiers.
Human Expert-Level Performance on a Scientific Image Analysis Task by a System Using Combined Artificial Neural Networks
Kevin J. Cherkauer (cherkaue@cs.wisc.edu), University of Wisconsin-Madison
pp. 15-21 (PostScript)
This paper presents the Plannett system, which combines artificial neural networks to achieve expert-level accuracy on the difficult scientific task of recognizing volcanos in radar images of the surface of the planet Venus. Plannett uses ANNs that vary along two dimensions: the set of input features used to train and the number of hidden units. The ANNs are combined simply by averaging their output activations. When Plannett is used as the classification module of a three-stage image analysis system called JARtool, the end-to-end accuracy (sensitivity and specificity) is as good as that of a human planetary geologist on a four-image test suite. JARtool-Plannett also achieves the best algorithmic accuracy on these images to date.
Learning from Multiple Models to Integrate Sensed Data
Bobi Den Hartog (denhartog@lanl.gov), John W. Elling (elling@lanl.gov), Susan M. Mniszewski (smm@lanl.gov), Los Alamos National Laboratory;
Roger M. Kieckhafer (rogerk@raptor.unl.edu), University of Nebraska at Lincoln
pp. 22-28 (PostScript)
The problem of identifying and quantifying mixtures of polychlorinated biphenyls (PCBs), called Aroclors, present in a gas chromatogram defies current single-model classification techniques. Specifically, different pattern recognition techniques are known to be ineffective in different situations. The problem is that an Aroclor "signal" in a chromatogram can be masked by "interferences", such as signals from other Aroclors or by unknown contaminants in the same sample. The unlimited variety and combination of possible interferences does not allow comprehensive training. Instead, we use a world model, called a scenario, which partitions the sources of interferences into a few broad categories. We then apply multiple pattern recognition techniques (sensors or models) to the chromatogram. A two- phase fuzzy logic "combiner" then analyzes the results 1) using a scenario-identifying rule set, we examine the pattern of disagreement among techniques and estimate the scenario 2) using a model-weighting rule set, we can combine the results of the most accurate techniques based on the current scenario. If necessary, the fuzzy analysis can be iterated to achieve the final result. This paper introduces the Aroclor classification problem, discusses its resistance to standard classification methods, and then describes the structure of the fuzzy logic combiner module.
Using Partitioning to Speed Up Specific-to-General Rule Induction
Pedro Domingos (pedrod@ics.uci.edu), University of California, Irvine
pp. 29-34 (PostScript)
RISE (Domingos, IJCAI-95 & Machine Learning Journal, in press) is a rule induction algorithm that proceeds by gradually generalizing rules, starting with one rule per example. This has several advantages compared to the more common strategy of gradually specializing initially null rules, and has been shown to lead to significant accuracy gains over algorithms like C4.5RULES and CN2 in a large number of application domains. However, RISE's running time (like that of other rule induction algorithms) is quadratic in the number of examples, making it unsuitable for processing very large databases. This paper studies the use of partitioning to speed up RISE, and compares it with the well-known method of windowing. The use of partitioning in a specific-to-general induction setting creates synergies that would not be possible with a general-to-specific system. Partitioning often reduces running time and improves accuracy at the same time. In noisy conditions, the performance of windowing deteriorates rapidly, while that of partitioning remains stable.
Nonlinear Combination of Neural Networks From a Diverse and Evolving Population
Thomas M. English (english@cs.ttu.edu), Texas Tech University
pp. 35-39 (PostScript)
The objective is to predict on the basis of recent observations the behavior of a nonlinear dynamical system. An evolutionary algorithm searches a large space of model topologies and precision hyperparameters, fitting limited-precision models to older observations and assessing how well the models generalize to more recent observations. In each generation, topology-precision pairs are ranked according to an inductive criterion favoring validation errors that are not only small but low in redundancy with those of higher-ranked pairs. The fitted models associated with the highest ranks of a generation are combined in a nonlinear fashion. Their predictions are treated collectively as the state vector of the modeled system, and memory of errors in similar states of the past is used to correct and combine predictions in the present. Model combinations of different generations are themselves similarly combined. Empirical results are given for populations of recurrent neural networks modeling chaotic systems.
A Comparative Evaluation of Combiner and Stacked Generalization
David W. Fan (wfan@cs.columbia.edu), Columbia University;
Philip K. Chan (pkc@cs.fit.edu), Florida Institute of Technology;
Salvatore J. Stolfo (sal@cs.columbia.edu), Columbia University
pp. 40-46 (PostScript)
Combiner and Stacked Generalization are two very similar meta-learning methods that combine predictions of multiple classifiers to improve accuracy of any single classifier. In this paper, we compare stacked generalization and combiner from the perspective of training efficiency versus accuracy. We show that both methods improve the accuracy of any single classifier roughly at an equivalent level. Moreover, we also see that the cost of stacked generalization is very large and may prevent it from being used on very large data sets.
Generalization Capability of Homogeneous Voting Classifier Based on Partially Replicated Data
Jacek Jelonek (jacek.jelonek@cs.put.poznan.pl), Poznan University of Technology
pp. 47-52 (PostScript)
The generalization error is one of the most important features taken into account in performance evaluation and verification of any classifier. We propose a voting system based on homogenous base classifiers (HVC) which ensures a better generalization capability than any of its components. The principle of this idea consists in the differentiation of a learning data set for each base classifier and does not relate to their architecture, learning algorithm or any parameters. Therefore, this concept could be applied to decision trees, neural networks, statistical classifiers and many others. In order to show the superiority of HVC we performed several comparative tests on some data sets using cross-validation.
Integrating Multiple Classifiers By Finding Their Areas of Expertise
Moshe Koppel (koppel@bimacs.cs.biu.ac.il) and Sean P. Engelson (engelson@bimacs.cs.biu.ac.il), Bar-Ilan University
pp. 53-58 (PostScript)
Using multiple learned classifiers for increasing learning accuracy has attracted much recent interest. A central problem is how to integrate several classifiers to produce a single final classification. We have explored an approach which relies on analyzing the particular area of expertise of each classifier. In this paper, we describe a method which uses inductive techniques on a set of training examples to determine the areas of a particular classifier's expertise. By applying this method to each of a set of conflicting classifiers, we can determine the most reliable classifier for each new example and use that classifier to classify the example. In this way, we achieve significantly better classification accuracy than even the best individual expert or an induced combination rule.
ANCHOR - A Connectionist Architecture for Hierarchical Nesting of Multiple Heteogeneous Neural Nets
Rajeev Kumar (r.kumar@sheffield.ac.uk) and Peter I. Rockett (p.rockett@sheffield.ac.uk), University of Sheffield
pp. 59-65 (PostScript)
We present a novel connectionist architecture for handling arbitrarily complex eural computations. Artificial Neural Network Compiler for Hierarchical ORganisation (ANCHOR) takes a generalised view of a neuron processing element which we have termed as "Superneuron"; and which accepts n-inputs and outputs m-features where m=1 for an element in a conventional network. A superneuron is a single or higher-order multi-layer perceptron which can be trained individually with support of multiple-learning algorithms from the system. The indistinguishability between a superneuron and a neuron is employed in hierarchical nesting of superneurons up to (theoretically) infinite depth within other superneurons as well as linear or tree-structured cascading. ANCHOR facilitates partitioning of input space into a number of small simpler sub-mappings and has been tested on learning simple boolean functions with hierarchical calls and cascading of simpler superneurons. The issues related to partitioning and self-partitioning for a statistical hierarchical classifier are discussed.
Applying Parallel Learning Models of Artificial Neural Networks To Letters Recognition From Phonemes
Byoung Jik Lee (bjlee@cs.uiowa.edu), University of Iowa
pp. 66-71 (PostScript)
A Parallel Artificial Neural Networks approach which converts strings of phonemes to strings of letters is described. Our approach employs several neural networks, each of which handles a different set of the input data with different weights. This approach is characterized by two important properties. First, this approach provides a way to handle the local minima problem without complex computations by generating multiple neural networks working in parallel. Neural networks are trained on examples iteratively which the previous neural networks could not produce proper results. Second, the test data is distributed across the learning networks according to the ability of each network to handle its assigned task. From among multiple outputs generated by trained neural networks, selection strategies are applied. Performance evaluated over 5,528 .characters taken from the 934 most-used English words shows that our approach achieves more than 10% improvement over the general multilayered neural network (89.1% for our method, compared with 78.6% for the general network). In the training set, our approach achieves a correctness rate of 95.4%, while the general multilayered network has a correctness rate of 92.7%.
Combining Neural and Statistical Classifiers Via Perceptron
Sauchi Stephen Lee (steve@laplace.mathstat.uidaho.edu), University of Idaho
pp. 72-76 (PostScript)
Many classification models are presently available from both the Neural Network s and Statistics communities. Recent research has revealed that, instead of selec ting the ``best'' single model from a set of plausible candidates, integrating a set of ``heterogeneous'' models will maximize overall accuracy. Disagreements between these candidate models enable one to pick and choose thei r relative strengths in an optimal manner. Many statistical models, due to their intrinsic different model structures, could complement neural networks. In classification problems, encouraging results were obtained by integrating some statistical and neural classifiers via perceptron.
Bayesian Model Averaging
David Madigan (madigan@stat.washington.edu), Adrian E. Raftery (raftery@stat.washington.edu), Chris T. Volinsky (volinsky@stat.washington.edu), University of Washington;
Jennifer A. Hoeting (jah@stat.colostate.edu), Colorado State University
pp. 77-83 (PostScript)
Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the model generated the data. This approach ignores a component of uncertainty, leading to over-confident inferences. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and provide pointers to a number of applications. In these applications, BMA provides improved out-of-sample predictive performance. We provide a catalogue of currently available BMA software.
On Voting Ensembles of Classifiers (extended abstract)
Ofer Matan (ofer@cs.stanford.edu), Stanford University
pp. 84-88 (PostScript)
We study the classification ability of majority-vote ensembles of classifiers. A majority ensemble classifies a pattern by letting each member of the ensemble cast a single vote for the correct class and decides according to a simple majority or a special majority vote. We give upper and lower bounds on the classification performance of a majority ensemble as a function of the classification performances of its individual.
Handling Redundancy in Ensembles of Learned Models Using Principal Components
Christopher J. Merz (cmerz@ics.uci.edu) and Michael J. Pazzani (pazzani@ics.uci.edu), University of California, Irvine
pp. 89-92 (PostScript)
When combining a set of learned models to form an improved estimator, the issue of redundancy in the set of models must be addressed. Existing methods for addressing this problem have failed to perform robustly, especially as the redundancy in the set of learned models increases. Recently, a variant of principal components regression, PCR*, demonstrated that these limitations could be overcome by mapping the original learned models to a set of principal components and then choosing which components to include in the final regression. Weights for the original learned models could then be derived from the weights of the principal components regression. The focus of this paper is to compare PCR*'s cross-validation-based stopping criteria for choosing the number of principal components to existing methods. Experimental results show that existing stopping criteria are often too conservative and discard useful components leading to poor performance.
Selection of Learning Algorithms for Trading Systems Based on Biased Estimators - An Abstract
Zoran Obradovic (zoran@eecs.wsu.edu) and Tim Chenoweth (tchenowe@eecs.wsu.edu), Washington State University
pp. 93-94 (PostScript)
This work extends our S&P 500 index trading system based on an appropriate domain partitioning and integration of biased estimators designed using disjoint data sets. A new heterogeneous learning system is proposed with the objective of further reduction of correlation between local estimators. Experimental results conducted on real financial data over a 10 year period indicate that a mixed system consisting of cascade correlation and back-propagation trained neural network estimator components provides the best integration returns. The actual return obtained by this system is better that the return obtained by the buy and hold strategy over the same period even when assuming a 0.05% transaction fee. However, it is important to note that care must be taken when determining which learning algorithm to apply to each subproblem.
A Noise-Tolerant Hybrid Model of a Global and a Local Learning Module
Natsuki Oka (oka@mrit.mei.co.jp) and Kunio Yoshida (yoshida@mrit.mei.co.jp), Matsushita Research Institute Tokyo, Inc.
pp. 95-100 (PostScript)
Proposed is GLLL2, a hybrid architecture of a global and a local learning module, which learns default and exceptional knowledge respectively from noisy examples. The global learning module, which is a feedforward neural network, captures global trends gradually, while the local learning module stores local exceptions quickly. The latter module distinguishes noise from exceptions, and learns only exceptions, which ability makes GLLL2 noise-tolerant. The results of experiments show the process in which training examples are formed into default and exceptional knowledge, and demonstrate that the predictive accuracy, the space efficiency, and the training efficiency of GLLL2 is higher than those of each individual module.
Exploiting Multiple Existing Models and Learning Algorithms
Julio Ortega (julio@almaden.ibm.com), IBM Almaden Research Center
pp. 101-106 (PostScript)
This paper presents MMM and MMC, two methods for combining knowledge from a variety of prediction models. Some of these models may have been created by hand while others may be the result of empirical learning over an available set of data. The approach consists of learning a set of ``Referees'', one for each prediction model, that characterize the situations in which each of the models is able to make correct predictions. In future instances, these referees are first consulted to select the most appropriate prediction model, and the prediction of the selected model is then returned. Experimental results on the audiology domain show that using referees can help obtain higher accuracies than those obtained by any of the individual prediction models.
Scaling Up: Distributed Machine Learning with Cooperation
Foster John Provost (foster@nynexst.com), NYNEX Science & Technology;
Daniel N. Hennessy (hennessy@cs.pitt.edu), University of Pittsburgh
pp. 107-112 (PostScript)
Machine-learning methods are becoming increasingly popular for automated data analysis. However, standard methods do not scale up to massive scientific and business data sets without expensive hardware. This paper investigates a practical alternative for scaling up: the use of distributed processing to take advantage of the often dormant PCs and workstations available on local networks. Each workstation runs a common rule-learning program on a subset of the data. We first show that for commonly used rule-evaluation criteria, a simple form of cooperation can guarantee that a rule will look good to the set of cooperating learners if and only if it would look good to a single learner operating with the entire data set. We then show how such a system can further capitalize on different perspectives by sharing learned knowledge for significant reduction in search effort. We demonstrate the power of the method by learning from a massive data set taken from the domain of cellular fraud detection. Finally, we provide an overview of other methods for scaling up machine learning.
On the Integration of Ensembles of Neural Networks: Application to Seismic Signal Classification
Yair Shimshoni (shimsh@math.tau.ac.il) and Nathan Intrator, Tel-Aviv University
pp. 113-119 (PostScript)
This paper proposes a classification scheme based on the integration of multipl e Ensembles of ANNs. It is demonstrated on a classification problem, in which S eismic recordings of Natural Earthquakes must be distinguished from the recordi ngs of Artificial Explosions. A Redundant Classification Environment consists of several Ensembles of Neural Networks is created and trained on Bootstrap Sample Sets, using various data re presentations and architectures. ANNs within the Ensembles are aggregated (as in Bagging) while the Ensembles ar e integrated non-linearly, in a signal adaptive manner, using a posterior confi dence measure based on the agreement (variance) within the Ensembles. The proposed Integrated Classification Machine achieved 92.1% correct classific ation on the seismic test data. Cross Validation evaluations and comparisons indicate that such integration of a collection of ANN's Ensembles is a robust way for handling high dimensional p roblems with a complex non-stationary signal space as in the current Seismic Cl assification problem.
The Sources of Increased Accuracy for Two Proposed Boosting Algorithms
David B. Skalak (skalak@CS.Cornell.EDU), University of Massachusetts, Amherst
pp. 120-125 (PostScript)
We introduce two boosting algorithms that aim to increase the generalization accuracy of a given classifier by incorporating it as a level-0 component in a stacked generalizer. Both algorithms construct a complementary level-0 classifier that can only generate coarse hypotheses for the training data. We show that the two algorithms boost generalization accuracy on a representative collection of data sets. The two algorithms are distinguished in that one of them modifies the class targets of selected training instances in order to train the complementary classifier. We show that the two algorithms achieve approximately equal generalization accuracy, but that they create complementary classifiers that display different degrees of accuracy and diversity. Our study provides evidence that it may be useful to investigate families of boosting algorithms that incorporate varying levels of accuracy and diversity, so as to achieve an appropriate mix for a given task and domain.
Classifier Combining: Analytical Results and Implications
Kagan Tumer (kagan@pine.ece.utexas.edu) and Joydeep Ghosh (ghosh@pine.ece.utexas.edu), University of Texas, Austin
pp. 126-132 (PostScript)
Several researchers have experimentally shown that substantial improvements can be obtained in difficult pattern recognition problems by combining or integrating the outputs of multiple classifiers. This paper summarizes our recent theoretical results that quantify the improvements due to multiple classifier combining. Furthermore, we present an extension of this theory that leads to an estimate of the Bayes error rate. Practical aspects such as expressing the confidences in decisions and determining the best data partition/classifier selection are also discussed.
Weight Averaging for Neural Networks and Local Resampling Schemes
Joachim Utans (J.Utans@lbs.lon.ac.uk), London Business School
pp. 138-139 (PostScript)
Recently methods for combining estimators have been popular and enjoyed considerable success. Typically, one obtains several models by considering different model architectures (e.g. linear or nonlinear) or uses different partitions of the data (overlapping or non-overlapping) for estimating the model. In a final step the outputs of each model are combined, either by assigning fixed weights a priori (e.g. equal weighting) or be estimating these weights from data as well. Here we propose a strategy similar to Breiman's "bagging": we use a resampling scheme (nonlinear Cross-Validation in this case but the idea is applicable to other local resampling schemes (e.g. bootstrap)) to estimate the expected out-of-sample error as a model selection criterion in the context of neural network models. We want to make further use of the models estimated via resampling by combining them and propose that under certain conditions one can average model parameters directly instead of retaining all networks and combine their outputs. We demonstrate that for the examples considered the network performance is comparable to performance obtained by averaging outputs while offering improved recall speed and savings in storage requirements.
Combining Rules and Cases
Koen Vanhoof (vanhoof@rsftew.luc.ac.be) and Jerzy Surma, Limburg University Center
pp. 139-143 (PostScript)
The recent progress in case-based reasoning has shown that one of the most important challenges in developing future AI methods will be to combine and synergetically uitilize general and case-based knowledge. In this paper an integration of rule-based and case-based reasoning is described. A typicality evaluator dtermines the applied method. The proposed method is tested and compared with alternative approaches. Experimental results show that the presented integration method can lead to an improvement in accuracy and comprehensibility.

Philip Chan (pkc@cs.fit.edu)

Last modified: Wed Aug 14 22:54:35 EST 1996