Working Notes of IMLM-96
(in first-author order)
- On Explaining Degree of Error Reduction due to Combining Multiple
Decision Trees
Kamal Mahmood Ali
(ali@almaden.ibm.com),
IBM Almaden Research Center
pp. 1-7 (PostScript)
In multiple-models research, classifications or evidence from several
classification models are combined in order to form overall
classifications with higher accuracy than any of the individual
models. However, most previous work has concentrated on demonstrating
that use of multiple models leads to statistically significant error
reduction as compared to the error rate of the single model learned on
the same data. In this paper we present a model of the degree
of error reduction due to the use of multiple decision trees. In
particular, we show that the degree of error reduction can be linearly
modeled by the ``degree to which decision trees make uncorrelated
errors.'' These conclusions are based on results from 21 domains.
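
A minimal sketch of the kind of quantity involved, assuming generic scikit-learn trees and synthetic data rather than the paper's 21 domains or its exact definition of error correlation: grow trees on bootstrap samples, record each tree's error indicators on a test set, and compare the mean pairwise correlation of those errors with the majority-vote error.

# Illustrative proxy only; not the paper's specific measure of uncorrelated errors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

n_trees = 11
preds = []
for i in range(n_trees):
    idx = rng.randint(0, len(Xtr), len(Xtr))          # bootstrap resample
    tree = DecisionTreeClassifier(random_state=i).fit(Xtr[idx], ytr[idx])
    preds.append(tree.predict(Xte))
preds = np.array(preds)                               # shape: (n_trees, n_test)

errors = (preds != yte).astype(float)                 # per-tree error indicators
single_error = errors.mean(axis=1).mean()
vote = (preds.mean(axis=0) > 0.5).astype(int)         # simple majority vote
ensemble_error = (vote != yte).mean()
pairwise_corr = np.corrcoef(errors)[np.triu_indices(n_trees, k=1)].mean()

print(f"mean single-tree error: {single_error:.3f}")
print(f"majority-vote error:    {ensemble_error:.3f}")
print(f"mean error correlation: {pairwise_corr:.3f}")
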
- Creating and Exploiting Coverage and Diversity
Carla E. Brodley
(brodley@ecn.purdue.edu) and
Terran Lane
(terran@ecn.purdue.edu),
Purdue University
pp. 8-14 (PostScript)
In this paper, we illustrate that increasing coverage through
diversity is not enough to ensure increased prediction accuracy---if the
integration method does not utilize the coverage, then no benefit arises from
integrating multiple models. We compare four criteria for selecting
base-level classifiers and demonstrate that informed selection can lead to
more accurate meta-level classifiers than random selection. In addition, we
illustrate empirically that straightforward integration methods fail to
utilize the diversity of the base-level classifiers.
- Human Expert-Level Performance on a Scientific Image Analysis Task
by a System Using Combined Artificial Neural Networks
Kevin J. Cherkauer
(cherkaue@cs.wisc.edu),
University of Wisconsin-Madison
pp. 15-21 (PostScript)
This paper presents the Plannett system, which combines artificial
neural networks to achieve expert-level accuracy on the difficult
scientific task of recognizing volcanos in radar images of the surface
of the planet Venus. Plannett uses ANNs that vary along two dimensions:
the set of input features used to train and the number of hidden units.
The ANNs are combined simply by averaging their output activations. When
Plannett is used as the classification module of a three-stage image
analysis system called JARtool, the end-to-end accuracy (sensitivity and
specificity) is as good as that of a human planetary geologist on a
four-image test suite. JARtool-Plannett also achieves the best
algorithmic accuracy on these images to date.
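
A minimal sketch of the combination step the abstract describes (plain averaging of output activations), using placeholder networks and synthetic data rather than the Plannett/JARtool features:

# Placeholder models and data; the only point is the averaging of activations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
nets = [MLPClassifier(hidden_layer_sizes=(h,), max_iter=2000, random_state=h).fit(X, y)
        for h in (2, 4, 8)]                      # vary the number of hidden units

avg_activation = np.mean([net.predict_proba(X)[:, 1] for net in nets], axis=0)
combined_prediction = (avg_activation > 0.5).astype(int)
print("combined training accuracy:", (combined_prediction == y).mean())
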
- Learning from Multiple Models to Integrate Sensed Data
Bobi Den Hartog
(denhartog@lanl.gov),
John W. Elling
(elling@lanl.gov),
Susan M. Mniszewski
(smm@lanl.gov),
Los Alamos National Laboratory;
Roger M. Kieckhafer
(rogerk@raptor.unl.edu),
University of Nebraska at Lincoln
pp. 22-28 (PostScript)
The problem of identifying and quantifying mixtures of
polychlorinated biphenyls (PCBs), called Aroclors, present
in a gas chromatogram defies current single-model
classification techniques. Specifically, different pattern
recognition techniques are known to be ineffective in different
situations. The problem is that an Aroclor "signal" in a
chromatogram can be masked by "interferences", such as
signals from other Aroclors or by unknown contaminants in
the same sample. The unlimited variety and combination of
possible interferences do not allow comprehensive training.
Instead, we use a world model, called a scenario, which
partitions the sources of interference into a few broad
categories. We then apply multiple pattern recognition
techniques (sensors or models) to the chromatogram. A two-phase
fuzzy logic "combiner" then analyzes the results: 1) using a
scenario-identifying rule set, it examines the pattern of
disagreement among the techniques and estimates the scenario;
2) using a model-weighting rule set, it combines the results
of the most accurate techniques for the current scenario.
If necessary, the fuzzy analysis can be iterated to
achieve the final result. This paper introduces the Aroclor
classification problem, discusses its resistance to standard
classification methods, and then describes the structure of the
fuzzy logic combiner module.
- Using Partitioning to Speed Up Specific-to-General Rule Induction
Pedro Domingos
(pedrod@ics.uci.edu),
University of California, Irvine
pp. 29-34 (PostScript)
RISE (Domingos, IJCAI-95 & Machine Learning Journal, in press) is a
rule induction algorithm that proceeds by gradually generalizing
rules, starting with one rule per example. This has several advantages
compared to the more common strategy of gradually specializing
initially null rules, and has been shown to lead to significant
accuracy gains over algorithms like C4.5RULES and CN2 in a large
number of application domains. However, RISE's running time (like that
of other rule induction algorithms) is quadratic in the number of
examples, making it unsuitable for processing very large databases.
This paper studies the use of partitioning to speed up RISE, and
compares it with the well-known method of windowing. The use of
partitioning in a specific-to-general induction setting creates
synergies that would not be possible with a general-to-specific
system. Partitioning often reduces running time and improves accuracy
at the same time. In noisy conditions, the performance of windowing
deteriorates rapidly, while that of partitioning remains stable.
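
The arithmetic behind the potential speed-up, ignoring the cost of combining the per-partition rule sets (which is where the paper's specific-to-general synergies come in): if the learner's running time grows quadratically in the number of examples n, then

T(n) \propto n^2, \qquad \sum_{i=1}^{p} T(n/p) \propto p \, (n/p)^2 = n^2 / p ,

so learning on p partitions cuts the induction cost by roughly a factor of p.
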
- Nonlinear Combination of Neural Networks
From a Diverse and Evolving Population
Thomas M. English
(english@cs.ttu.edu),
Texas Tech University
pp. 35-39 (PostScript)
The objective is to predict the behavior of a nonlinear dynamical
system on the basis of recent observations. An evolutionary algorithm
searches a large space of model topologies and precision hyperparameters,
fitting limited-precision models to older observations and assessing how
well the models generalize to more recent observations. In each generation,
topology-precision pairs are ranked according to an inductive criterion
favoring validation errors that are not only small but low in redundancy
with those of higher-ranked pairs. The fitted models associated with the
highest ranks of a generation are combined in a nonlinear fashion. Their
predictions are treated collectively as the state vector of the modeled
system, and memory of errors in similar states of the past is used to
correct and combine predictions in the present. Model combinations of
different generations are themselves similarly combined. Empirical results
are given for populations of recurrent neural networks modeling chaotic
systems.
- A Comparative Evaluation of Combiner and Stacked Generalization
David W. Fan
(wfan@cs.columbia.edu),
Columbia University;
Philip K. Chan
(pkc@cs.fit.edu),
Florida Institute of Technology;
Salvatore J. Stolfo
(sal@cs.columbia.edu),
Columbia University
pp. 40-46 (PostScript)
Combiner and Stacked Generalization are two very similar meta-learning
methods that combine the predictions of multiple classifiers to improve on
the accuracy of any single classifier. In this paper, we compare stacked
generalization and combiner from the perspective of training efficiency
versus accuracy. We show that both methods improve on the accuracy of any
single classifier to roughly the same degree. Moreover, we also see that
the cost of stacked generalization is very large and may prevent it from
being used on very large data sets.
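
A minimal stacked-generalization sketch in the generic form (the base learners, meta-learner, and data below are placeholders, not those used in the paper): level-0 models produce cross-validated predictions on the training data, and a level-1 model is trained on those predictions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

level0 = [DecisionTreeClassifier(random_state=0), GaussianNB()]
# Cross-validated level-0 predictions keep the level-1 training data "unseen".
meta_train = np.column_stack([cross_val_predict(m, Xtr, ytr, cv=5) for m in level0])
level1 = LogisticRegression().fit(meta_train, ytr)

for m in level0:                                 # refit level-0 models on all data
    m.fit(Xtr, ytr)
meta_test = np.column_stack([m.predict(Xte) for m in level0])
print("stacked accuracy:", level1.score(meta_test, yte))
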
- Generalization Capability of Homogeneous Voting Classifier
Based on Partially Replicated Data
Jacek Jelonek
(jacek.jelonek@cs.put.poznan.pl),
Poznan University of Technology
pp. 47-52 (PostScript)
The generalization error is one of the most important features taken
into account in performance evaluation and verification of any
classifier. We propose a voting system based on homogeneous base
classifiers (HVC) which ensures a better generalization capability
than any of its components. The idea rests on differentiating the
learning data set for each base classifier and does not depend on
their architecture, learning algorithm or any other parameters.
Therefore, this concept could be applied to decision
trees, neural networks, statistical classifiers and many others. In
order to show the superiority of HVC we performed several comparative
tests on some data sets using cross-validation.
- Integrating Multiple Classifiers By Finding Their Areas of Expertise
Moshe Koppel
(koppel@bimacs.cs.biu.ac.il) and
Sean P. Engelson
(engelson@bimacs.cs.biu.ac.il),
Bar-Ilan University
pp. 53-58 (PostScript)
Using multiple learned classifiers for increasing learning accuracy
has attracted much recent interest. A central problem is how to
integrate several classifiers to produce a single final
classification. We have explored an approach which relies on
analyzing the particular area of expertise of each classifier.
In this paper, we describe a method which uses inductive techniques on
a set of training examples to determine the areas of a particular
classifier's expertise. By applying this method to each of a set of
conflicting classifiers, we can determine the most reliable classifier
for each new example and use that classifier to classify the example.
In this way, we achieve significantly better classification accuracy
than even the best individual expert or an induced combination rule.
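
A minimal sketch of the referee-style selection the abstract describes, with assumed learners and synthetic data standing in for the paper's inductive techniques: for each base classifier, a second model is trained to predict whether that classifier will be correct, and each new example is routed to the classifier whose referee is most confident.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1500, n_features=15, random_state=1)
Xtr, Xref, ytr, yref = train_test_split(X, y, test_size=0.5, random_state=1)

bases = [DecisionTreeClassifier(max_depth=3, random_state=1).fit(Xtr, ytr),
         GaussianNB().fit(Xtr, ytr)]
# Each referee learns to predict "is this base classifier correct?" on held-out data.
referees = [DecisionTreeClassifier(max_depth=3, random_state=1).fit(
                Xref, (b.predict(Xref) == yref).astype(int))
            for b in bases]

def classify(x):
    x = x.reshape(1, -1)
    reliability = [r.predict_proba(x)[0][-1] for r in referees]  # P(correct) per base
    return bases[int(np.argmax(reliability))].predict(x)[0]

print("example prediction:", classify(X[0]))
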
- ANCHOR - A Connectionist Architecture for Hierarchical Nesting
of Multiple Heterogeneous Neural Nets
Rajeev Kumar
(r.kumar@sheffield.ac.uk) and
Peter I. Rockett
(p.rockett@sheffield.ac.uk),
University of Sheffield
pp. 59-65 (PostScript)
We present a novel connectionist architecture for handling arbitrarily
complex neural computations. The Artificial Neural Network Compiler for
Hierarchical ORganisation (ANCHOR) takes a generalised view of a
neuron processing element, which we have termed a "Superneuron", and
which accepts n inputs and outputs m features, where m=1 for an element
in a conventional network. A superneuron is a single or higher-order
multi-layer perceptron which can be trained individually with the support
of multiple learning algorithms from the system. The
indistinguishability between a superneuron and a neuron is employed in
hierarchical nesting of superneurons up to (theoretically) infinite
depth within other superneurons as well as linear or tree-structured
cascading. ANCHOR facilitates partitioning of the input space into a
number of smaller, simpler sub-mappings and has been tested on learning
simple boolean functions with hierarchical calls and cascading of
simpler superneurons. The issues related to partitioning and
self-partitioning for a statistical hierarchical classifier are
discussed.
- Applying Parallel Learning Models of Artificial Neural
Networks To Letters Recognition From Phonemes
Byoung Jik Lee
(bjlee@cs.uiowa.edu),
University of Iowa
pp. 66-71 (PostScript)
A Parallel Artificial Neural Networks approach
which converts strings of phonemes to strings of letters
is described. Our approach employs several neural networks,
each of which handles a different set of the input data
with different weights. This approach is characterized by
two important properties. First, this approach provides
a way to handle the local minima problem without complex
computations by generating multiple neural networks working
in parallel. Neural networks are trained iteratively on the examples
for which the previous neural networks could not produce proper
results. Second, the test data is distributed
across the learning networks according to the ability of
each network to handle its assigned task. From among multiple
outputs generated by trained neural networks, selection
strategies are applied. Performance evaluated over 5,528
characters taken from the 934 most-used English words shows
that our approach achieves more than 10% improvement over
the general multilayered neural network (89.1% for our method,
compared with 78.6% for the general network). In the training
set, our approach achieves a correctness rate of 95.4%, while
the general multilayered network has a correctness rate of 92.7%.
- Combining Neural and Statistical Classifiers Via Perceptron
Sauchi Stephen Lee
(steve@laplace.mathstat.uidaho.edu),
University of Idaho
pp. 72-76 (PostScript)
Many classification models are presently available from
both the Neural Networks and Statistics communities. Recent research
has revealed that, instead of selecting the ``best'' single model
from a set of plausible candidates, integrating a set
of ``heterogeneous'' models will maximize overall accuracy.
Disagreements between these candidate models enable one to pick and
choose their relative strengths in an optimal manner. Many
statistical models, due to their intrinsically different model
structures, could complement neural networks. In classification
problems, encouraging results were obtained by integrating some
statistical and neural classifiers via perceptron.
- Bayesian Model Averaging
David Madigan (madigan@stat.washington.edu),
Adrian E. Raftery (raftery@stat.washington.edu),
Chris T. Volinsky (volinsky@stat.washington.edu), University of Washington;
Jennifer A. Hoeting (jah@stat.colostate.edu),
Colorado State University
pp. 77-83 (PostScript)
Standard statistical practice ignores model uncertainty. Data analysts
typically select a model from some class of models and then proceed as
if the model generated the data. This approach ignores a component of
uncertainty, leading to over-confident inferences. Bayesian model
averaging (BMA) provides a coherent mechanism for accounting for this
model uncertainty. Several methods for implementing BMA have recently
emerged. We discuss these methods and provide pointers to a number of
applications. In these applications, BMA provides improved
out-of-sample predictive performance. We provide a catalogue of
currently available BMA software.
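
For reference, the central identity of BMA in standard notation: the posterior distribution of a quantity of interest Delta given data D is a mixture over the candidate models M_1, ..., M_K, weighted by their posterior model probabilities:

p(\Delta \mid D) = \sum_{k=1}^{K} p(\Delta \mid M_k, D)\, p(M_k \mid D),
\qquad
p(M_k \mid D) = \frac{p(D \mid M_k)\, p(M_k)}{\sum_{j=1}^{K} p(D \mid M_j)\, p(M_j)} .
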
- On Voting Ensembles of Classifiers (extended abstract)
Ofer Matan
(ofer@cs.stanford.edu),
Stanford University
pp. 84-88 (PostScript)
We study the classification ability of majority-vote ensembles of
classifiers. A majority ensemble classifies a pattern by letting each
member of the ensemble cast a single vote for the correct class and
deciding according to a simple majority or a special majority vote. We give
upper and lower bounds on the classification performance of a majority
ensemble as a function of the classification performances of its individual
members.
- Handling Redundancy in Ensembles of Learned Models Using
Principal Components
Christopher J. Merz
(cmerz@ics.uci.edu) and
Michael J. Pazzani
(pazzani@ics.uci.edu),
University of California, Irvine
pp. 89-92 (PostScript)
When combining a set of learned models to form an improved
estimator, the issue of redundancy in the set of models must be
addressed. Existing methods for addressing this problem have failed
to perform robustly, especially as the redundancy in the set of
learned models increases. Recently, a variant of principal components
regression, PCR*, demonstrated that these limitations could be
overcome by mapping the original learned models to a set of principal
components and then choosing which components to include in the final
regression. Weights for the original learned models could then be
derived from the weights of the principal components regression. The
focus of this paper is to compare PCR*'s cross-validation-based
stopping criteria for choosing the number of principal components to
existing methods. Experimental results show that existing stopping
criteria are often too conservative and discard useful components,
leading to poor performance.
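
A minimal principal-components sketch in the spirit of PCR*, with synthetic, deliberately redundant "models" and a fixed number of retained components standing in for the paper's cross-validation-based stopping criterion:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
y = rng.randn(300)
f1 = y + 0.3 * rng.randn(300)                    # three redundant "learned models"
f2 = y + 0.3 * rng.randn(300)
f3 = f1 + 0.01 * rng.randn(300)                  # nearly a copy of f1
F = np.column_stack([f1, f2, f3])

k = 2                                            # number of components retained
pca = PCA(n_components=k).fit(F)
Z = pca.transform(F)
reg = LinearRegression().fit(Z, y)

# Fold the component regression back into one weight per original model
# (the intercept/centering term is omitted here for brevity).
model_weights = pca.components_.T @ reg.coef_
print("implied model weights:", np.round(model_weights, 3))
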
- Selection of Learning Algorithms for Trading Systems
Based on Biased Estimators - An Abstract
Zoran Obradovic
(zoran@eecs.wsu.edu) and
Tim Chenoweth
(tchenowe@eecs.wsu.edu),
Washington State University
pp. 93-94 (PostScript)
This work extends our S&P 500 index trading system based on an appropriate
domain partitioning and integration of biased estimators designed using
disjoint data sets. A new heterogeneous learning system is proposed with the
objective of further reduction of correlation between local estimators.
Experimental results conducted on real financial data over a 10 year
period indicate that a mixed system consisting of cascade correlation and
back-propagation trained neural network estimator components provides
the best integration returns. The actual return obtained by this system
is better than the return obtained by the buy-and-hold strategy over the
same period, even when assuming a 0.05% transaction fee. However, it is
important to note that care must be taken when determining which learning
algorithm to apply to each subproblem.
- A Noise-Tolerant Hybrid Model of a Global and a Local Learning Module
Natsuki Oka
(oka@mrit.mei.co.jp)
and Kunio Yoshida
(yoshida@mrit.mei.co.jp),
Matsushita Research Institute Tokyo, Inc.
pp. 95-100 (PostScript)
Proposed is GLLL2, a hybrid architecture of a global and a local
learning module, which learns default and exceptional knowledge
respectively from noisy examples. The global learning module, which
is a feedforward neural network, captures global trends gradually,
while the local learning module stores local exceptions quickly. The
latter module distinguishes noise from exceptions and learns only the
exceptions, an ability that makes GLLL2 noise-tolerant. The results of
experiments show the process by which training examples are formed
into default and exceptional knowledge, and demonstrate that the
predictive accuracy, space efficiency, and training efficiency
of GLLL2 are higher than those of each individual module.
- Exploiting Multiple Existing Models and Learning Algorithms
Julio Ortega
(julio@almaden.ibm.com),
IBM Almaden Research Center
pp. 101-106 (PostScript)
This paper presents MMM and MMC, two methods for combining knowledge
from a variety of prediction models. Some of these models may have
been created by hand while others may be the result of empirical
learning over an available set of data. The approach consists of
learning a set of ``Referees'', one for each prediction model, that
characterize the situations in which each of the models is able to
make correct predictions. In future instances, these referees are
first consulted to select the most appropriate prediction model, and
the prediction of the selected model is then returned.
Experimental results on the audiology domain show that
using referees can help obtain higher accuracies than
those obtained by any of the individual prediction models.
- Scaling Up: Distributed Machine Learning with Cooperation
Foster John Provost
(foster@nynexst.com),
NYNEX Science & Technology;
Daniel N. Hennessy
(hennessy@cs.pitt.edu),
University of Pittsburgh
pp. 107-112 (PostScript)
Machine-learning methods are becoming increasingly popular for
automated data analysis. However, standard methods do not scale up to
massive scientific and business data sets without expensive hardware.
This paper investigates a practical alternative for scaling up: the
use of distributed processing to take advantage of the often dormant
PCs and workstations available on local networks. Each workstation
runs a common rule-learning program on a subset of the data. We first
show that for commonly used rule-evaluation criteria, a simple form of
cooperation can guarantee that a rule will look good to the set of
cooperating learners if and only if it would look good to a single
learner operating with the entire data set. We then show how such a
system can further capitalize on different perspectives by sharing
learned knowledge for significant reduction in search effort. We
demonstrate the power of the method by learning from a massive data
set taken from the domain of cellular fraud detection. Finally, we
provide an overview of other methods for scaling up machine learning.
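
A minimal sketch of one ingredient that makes this kind of cooperation possible, under the assumption that the rule-evaluation criterion depends only on coverage counts: each learner reports its partition's counts for a candidate rule, and the criterion computed from the summed counts equals the one a single learner would compute on the full data set. The rule, criterion, and data layout below are illustrative, not the paper's.

def local_counts(rule, partition):
    # Counts of covered positive and negative examples on one learner's partition.
    pos = sum(1 for x, y in partition if rule(x) and y == 1)
    neg = sum(1 for x, y in partition if rule(x) and y == 0)
    return pos, neg

def global_laplace_accuracy(rule, partitions):
    pos = neg = 0
    for part in partitions:
        p, n = local_counts(rule, part)     # each learner reports only two counts
        pos, neg = pos + p, neg + n
    return (pos + 1) / (pos + neg + 2)      # Laplace-corrected accuracy of the rule

parts = [[((1,), 1), ((0,), 0)], [((1,), 1), ((1,), 0)]]
print(global_laplace_accuracy(lambda x: x[0] == 1, parts))
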
- On the Integration of Ensembles of Neural Networks:
Application to Seismic Signal Classification
Yair Shimshoni
(shimsh@math.tau.ac.il)
and Nathan Intrator, Tel-Aviv University
pp. 113-119 (PostScript)
This paper proposes a classification scheme based on the integration
of multiple Ensembles of ANNs. It is demonstrated on a classification
problem in which Seismic recordings of Natural Earthquakes must be
distinguished from the recordings of Artificial Explosions. A
Redundant Classification Environment consisting of several Ensembles of
Neural Networks is created and trained on Bootstrap Sample Sets, using
various data representations and architectures. ANNs within the
Ensembles are aggregated (as in Bagging) while the Ensembles are
integrated non-linearly, in a signal-adaptive manner, using a
posterior confidence measure based on the agreement (variance) within
the Ensembles. The proposed Integrated Classification Machine
achieved 92.1% correct classification on the seismic test data.
Cross-Validation evaluations and comparisons indicate that such
integration of a collection of ANN Ensembles is a robust way of
handling high-dimensional problems with a complex non-stationary
signal space, as in the current Seismic Classification problem.
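
A minimal sketch of variance-based integration of several ensembles, using inverse within-ensemble variance as a stand-in for the paper's posterior confidence measure: members of each ensemble are averaged, and the ensembles are then weighted per signal by how strongly their members agree.

import numpy as np

def integrate(ensemble_outputs, eps=1e-6):
    """ensemble_outputs: list of arrays, each of shape (n_members, n_signals),
    holding the members' class-1 activations for every test signal."""
    means = np.array([e.mean(axis=0) for e in ensemble_outputs])
    variances = np.array([e.var(axis=0) for e in ensemble_outputs])
    confidence = 1.0 / (variances + eps)          # high agreement -> high weight
    weights = confidence / confidence.sum(axis=0)
    return (weights * means).sum(axis=0)          # per-signal combined activation

outputs = [np.random.rand(5, 3), np.random.rand(7, 3)]   # two toy ensembles
print(np.round(integrate(outputs), 3))
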
- The Sources of Increased Accuracy for Two Proposed Boosting Algorithms
David B. Skalak
(skalak@CS.Cornell.EDU),
University of Massachusetts, Amherst
pp. 120-125 (PostScript)
We introduce two boosting algorithms that aim to increase the
generalization accuracy of a given classifier by incorporating it as a
level-0 component in a stacked generalizer. Both algorithms construct
a complementary level-0 classifier that can only generate
coarse hypotheses for the training data. We show that the two
algorithms boost generalization accuracy on a representative
collection of data sets. The two algorithms are distinguished in that
one of them modifies the class targets of selected training instances
in order to train the complementary classifier. We show that the two
algorithms achieve approximately equal generalization accuracy, but
that they create complementary classifiers that display different
degrees of accuracy and diversity. Our study provides evidence that
it may be useful to investigate families of boosting algorithms that
incorporate varying levels of accuracy and diversity, so as to
achieve an appropriate mix for a given task and domain.
- Classifier Combining: Analytical Results and Implications
Kagan Tumer
(kagan@pine.ece.utexas.edu)
and
Joydeep Ghosh
(ghosh@pine.ece.utexas.edu),
University of Texas, Austin
pp. 126-132 (PostScript)
Several researchers have experimentally shown that
substantial improvements can be obtained in difficult pattern
recognition problems by combining or integrating the outputs
of multiple classifiers.
This paper summarizes our recent theoretical results that quantify
the improvements due to multiple classifier combining.
Furthermore, we present an extension of this theory that leads
to an estimate of the Bayes error rate. Practical aspects such
as expressing the confidences in decisions and determining the
best data partition/classifier selection are also discussed.
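
As background for why error correlation governs the achievable gain (this is the textbook identity for averaged errors, not the paper's specific decision-boundary analysis): if the N combined outputs have errors of equal variance sigma^2 and average pairwise correlation rho, then

\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}\varepsilon_i\right)
= \frac{\sigma^{2}}{N} + \frac{N-1}{N}\,\rho\,\sigma^{2},

which falls from sigma^2 (identical errors, rho = 1) to sigma^2/N (uncorrelated errors, rho = 0).
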
- Weight Averaging for Neural Networks and Local Resampling Schemes
Joachim Utans
(J.Utans@lbs.lon.ac.uk),
London Business School
pp. 138-139 (PostScript)
Recently, methods for combining estimators have become popular and
enjoyed considerable success. Typically, one obtains several models by
considering different model architectures (e.g. linear or nonlinear)
or uses different partitions of the data (overlapping or
non-overlapping) for estimating the model. In a final step the outputs
of each model are combined, either by assigning fixed weights a priori
(e.g. equal weighting) or by estimating these weights from data as
well. Here we propose a strategy similar to Breiman's "bagging": we
use a resampling scheme (nonlinear Cross-Validation in this case but
the idea is applicable to other local resampling schemes
(e.g. bootstrap)) to estimate the expected out-of-sample error as a
model selection criterion in the context of neural network models. We
want to make further use of the models estimated via resampling by
combining them and propose that under certain conditions one can
average model parameters directly instead of retaining all networks
and combine their outputs. We demonstrate that for the examples
considered the network performance is comparable to performance
obtained by averaging outputs while offering improved recall speed and
savings in storage requirements.
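
A minimal sketch of averaging network parameters directly across models fitted on resampled data, rather than keeping every network and averaging outputs. As the abstract stresses, this is only sensible under certain conditions (the fitted networks must represent essentially the same solution, e.g. not hidden-unit permutations of one another); the toy example below sidesteps that by starting every network from the same initialization.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
rng = np.random.RandomState(0)

nets = []
for i in range(5):
    idx = rng.randint(0, len(X), len(X))          # local resample of the data
    net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=3000, random_state=0)
    nets.append(net.fit(X[idx], y[idx]))

# One network whose weights are the element-wise average of the fitted weights.
averaged = nets[0]
averaged.coefs_ = [np.mean([n.coefs_[l] for n in nets], axis=0)
                   for l in range(len(nets[0].coefs_))]
averaged.intercepts_ = [np.mean([n.intercepts_[l] for n in nets], axis=0)
                        for l in range(len(nets[0].intercepts_))]
print("averaged-network training R^2:", round(averaged.score(X, y), 3))
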
- Combining Rules and Cases
Koen Vanhoof
(vanhoof@rsftew.luc.ac.be)
and
Jerzy Surma,
Limburg University Center
pp. 139-143 (PostScript)
The recent progress in case-based reasoning has shown that one of the most
important challenges in developing future AI methods will be to combine and
synergistically utilize general and case-based knowledge. In this paper an
integration of rule-based and case-based reasoning is described. A
typicality evaluator determines the applied method. The proposed method is
tested and compared with alternative approaches. Experimental results show
that the presented integration method can lead to an improvement in accuracy
and comprehensibility.
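
A minimal sketch of switching between general rules and stored cases by a typicality score; the score, threshold, rule, and case base below are assumptions standing in for the paper's typicality evaluator and its rule- and case-based components.

import numpy as np

def typicality(x, class_means):
    # Higher when x lies close to some class centroid (a crude stand-in).
    d = min(np.linalg.norm(x - m) for m in class_means.values())
    return 1.0 / (1.0 + d)

def classify(x, rule, cases, class_means, threshold=0.5):
    if typicality(x, class_means) >= threshold:
        return rule(x)                      # typical instance: use the general rule
    # Atypical instance: fall back on the nearest stored case.
    nearest = min(cases, key=lambda c: np.linalg.norm(x - c[0]))
    return nearest[1]

means = {0: np.zeros(2), 1: np.ones(2)}
cases = [(np.array([3.0, 3.0]), 1)]
rule = lambda x: int(x.sum() > 1.0)
print(classify(np.array([0.1, 0.1]), rule, cases, means))   # typical -> rule
print(classify(np.array([5.0, 5.0]), rule, cases, means))   # atypical -> nearest case
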