- Adapted from ,
where a similar taxonomy was suggested in the general framework of
searching for structure in data.
 considered trees that can have internal nodes
with just one child. At these nodes, the data are not split, but
residuals are taken from a single variable regression.
- While converting decision tables to
trees, it is common to have leaf nodes that have a ``no decision''
precisely, a decision tree is said to perform classification if the
class labels are discrete values, and regression if the class
labels are continuous. We restrict almost entirely to classification
trees in this chapter.
- Oblivious decision
trees  from the machine learning literature are
nearly identical in structure to OBDDs.
 suggested a pruning method, based on
, for optimally pruning a balanced TSVQ. The
TSVQ growing procedure suggested by Riskin and Gray
 can be viewed as an inverse to Chou's pruning
- The desirable
properties of a measure of entropy include symmetry,
expandability, decisivity, additivity and recursivity. Shannon's
entropy  possesses all of these properties
. For an insightful treatment of entropy
reduction as a common theme underlying several pattern recognition
problems, see .
and Smyth  report that the idea of using
the mutual information between features and classes to select the
best feature was originally put forward by Lewis .
- Quinlan's C4.5  uses a naive
version of the confidence intervals for doing pessimistic pruning.
 stated and proved a conservation theorem that
states, essentially, that positive performance in some learning
situations must be offset by an equal degree of negative performance
in others. To clarify the, sometimes non-intuitive, consequences of
the conservation theorem, Schaffer  gave an example
of a concept for which information loss gives better
generalization accuracy than information gain.
- For a general discussion about the
relationship between complexity and predictive accuracy of
classifiers, see .
- Techniques that start with a sufficient
partitioning and then optimize the structure (e.g.,
) can be thought of as being a
converse to this approach.
- In bootstrapping, B independent learning samples,
each of size N are created by random sampling with replacement
from the original learning sample L. In cross validation, L is
divided randomly into B mutually exclusive, equal sized
partitions. Efron  showed that, although cross
validation closely approximates the true result, bootstrap has much
less variance, especially for small samples. However, there exist
arguments that cross validation is clearly preferable to bootstrap
in practice .
Campenhout  argues that increasing the amount
of information in a measurement subset through enlarging its size or
complexity never worsens the error probability of a truly Bayesian
classifier. Even after this guarantee, the cost and complexity due to
additional measurements may not be worth the slight (if any)
improvement in accuracy. Moreover, most real world classifiers are not
exception is the optimal feature subset selection method using
zero-one integer programming, suggested by Ichino and Sklansky
- Smoothing is the
process of adjusting probabilities at a node in the tree based on the
probabilities at other nodes on the same path. Averaging improves
probability estimates by considering multiple trees.
- A lot
of work exists in the neural networks literature on using committees
or ensembles of networks to improve classification performance. See
 for example. An alternative to multiple trees
is a hybrid classifier that uses several small classifiers as parts of
a larger classifier. Brodley  describes a system that
automatically selects the most suitable among a univariate decision
tree, a linear discriminant and an instance based classifier at each
node of a hierarchical, recursive classifier.
- Hierarchical unsupervised clustering can
construct, using bottom-up or top-down methods, tree-structured
classifiers. As mentioned in Section 2.1, these methods
are beyond the scope of the current chapter.
- Thanks to Kevin Van Horn for
pointing this out.
- It is argued empirically
 that the variance in decision tree methods
is more a reason than bias for their poor performance on some
- The Statlog project is initiated by the
European Commission, and its full title is ``The Comparative Testing
of Statistical and Logical Learning Algorithms on Large-Scale
Applications to Classification, Prediction and Control''.
- For a general
description of modern classification problems in astronomy, which
prompt the use of pattern recognition and machine learning
techniques, see .
Thu Oct 19 17:40:24 EDT 1995