Abstract:
We investigate ensemble learning methods that construct a classifier ensemble by
repeatedly sampling the original training data and building a member classifier from
each subsample. We find that the performance of standard Bagging can frequently
be improved upon by simple variations of the sampling scheme, such as varying the
sample size or sampling without replacement instead of with replacement.
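The sampling variants mentioned above can be sketched as a single subsampling routine; the function name and parameters below are illustrative, not the notation used in this work:

```python
import random

def draw_subsample(data, frac=1.0, replace=True, rng=None):
    """Draw one training subsample for a single ensemble member.

    frac    -- subsample size as a fraction of len(data)
               (illustrative name for the varied sample-size parameter)
    replace -- True gives standard Bagging's bootstrap sample;
               False gives sampling without replacement
    """
    rng = rng or random.Random()
    n = max(1, round(frac * len(data)))
    if replace:
        # Bootstrap: each point drawn independently, duplicates allowed.
        return [rng.choice(data) for _ in range(n)]
    # Without replacement: n distinct points (requires n <= len(data)).
    return rng.sample(data, n)

data = list(range(100))
# Standard Bagging: a full-size sample drawn with replacement.
boot = draw_subsample(data, frac=1.0, replace=True, rng=random.Random(0))
# Variant: a half-size sample drawn without replacement.
half = draw_subsample(data, frac=0.5, replace=False, rng=random.Random(0))
```

An ensemble method would call this once per member and train a classifier on each subsample.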
For all methods tested, ensemble performance depends strongly on properties of
the problem domain and the data sample. We attempt to explain the observed
performance of the various ensemble methods qualitatively and quantitatively, and
find that current ensemble analysis tools, such as margin distributions, error diagrams, and the bias-variance decomposition, are not well suited to this task.
We postulate that the primary explanation for the performance of ensemble methods
lies in their effect on the accuracy of the ensemble members on the one hand and
the diversity among them on the other: two conflicting goals that necessitate
a compromise, or trade-off. This motivates a precise yet general
definition of what diversity is and how it is to be measured. This definition has the
desirable property that it is applicable to all single-stage voting ensembles, under
any given loss function.
We then study the mathematical relationships between ensemble loss, mean member
loss, and diversity. For squared loss, we show that our definitions lead to the
well known ensemble loss decomposition, and extend this decomposition to the case
where the ensemble members return, instead of a single real number, a probability
distribution over R.
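The well-known decomposition referred to is presumably the ambiguity decomposition of the squared loss; a sketch, under the assumption of an averaging ensemble $\bar{f}(x) = \frac{1}{M}\sum_{i=1}^{M} f_i(x)$ of $M$ members:

```latex
\left(\bar{f}(x) - y\right)^2
  = \underbrace{\frac{1}{M}\sum_{i=1}^{M} \left(f_i(x) - y\right)^2}_{\text{mean member loss}}
  \;-\;
    \underbrace{\frac{1}{M}\sum_{i=1}^{M} \left(f_i(x) - \bar{f}(x)\right)^2}_{\text{diversity (ambiguity)}}
```

That is, the ensemble loss equals the mean member loss minus the spread of the member predictions around the ensemble prediction, making the accuracy-diversity trade-off explicit.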
For the case of 0-1 loss, we derive the exact mathematical relations between ensemble
loss, mean member loss, and diversity. Studying these relations provides valuable
insights into ensemble behavior and yields some unexpected, and hence interesting,
results, which are confirmed by our experimental observations.
Turning our attention back to the performance of Bagging variants, we show how the
loss decomposition can be used to reduce the number of parameter settings that
must be tried experimentally in order to find a well-performing ensemble method
for a given problem.