MACHINE LEARNING PACKAGES
These packages consist primarily of classification algorithms
that have been coded in C++. All have been extensively
OC1 was written in C and is provided by Sreerama Murthy, Steven
Salzberg, and Simon Kasif.
Input File Format
The training and test files each consist of a set of records to be
classified, with one record per line. The predictor attributes
are given first, separated by whitespace, followed by the integer
category at the end of the line. The training file must be named <filestem>.data,
the test file <filestem>.test, and the names file <filestem>.names.
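As an illustration of the record layout described above, here is a minimal C++ sketch that parses one line into predictor attributes plus an integer category. The names Record and parseRecord are hypothetical and are not taken from the packages, and the sketch assumes that every attribute, including discrete ones, is coded as a number.

```cpp
#include <cassert>
#include <cmath>
#include <sstream>
#include <string>
#include <vector>

// One record: predictor attributes followed by an integer category.
// (Hypothetical type, for illustration only.)
struct Record {
    std::vector<double> attributes;
    int category;
};

// Parse one whitespace-separated line; the last field is the category.
// Returns false for blank lines, non-numeric fields, lines with no
// predictor attributes, or a non-integer category.
bool parseRecord(const std::string& line, Record& out) {
    std::istringstream in(line);
    std::vector<double> fields;
    double value;
    while (in >> value) fields.push_back(value);
    if (!in.eof()) return false;          // stopped on a non-numeric field
    if (fields.size() < 2) return false;  // need >= 1 attribute + category
    const double category = fields.back();
    if (category != std::floor(category)) return false;
    fields.pop_back();
    out.attributes = fields;
    out.category = static_cast<int>(category);
    return true;
}
```

Calling parseRecord once per line of a <filestem>.data or <filestem>.test file would yield the full record set; lines for which it returns false can be skipped or reported.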
NOTE! Most of these packages require the category value to
be in the range 0 to N-1, but OC1 requires it to be in the range 1 to N.
Sample *.names file:
donor_score : continuous
acceptor_score : continuous
hexamer_score : continuous
phase : discrete
NET - Neural Networks
Implements an n-layer (n>2) feedforward neural network trained by
backpropagation. Default transfer function is the logistic:
1/(1 + e^(-s)). Default input combining function is
summation of weight * activation value. Both can be replaced
through simple modification of the source code.
Sample configuration file:
Specific topologies can be set up conveniently in the source code.
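The default transfer and combining functions of the NET package can be sketched in a few lines of C++. This is an illustrative sketch only, not the package's source; the function names logistic, combine, and unitOutput are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic transfer function: 1 / (1 + e^(-s)).
double logistic(double s) {
    return 1.0 / (1.0 + std::exp(-s));
}

// Input combining function: sum of weight * activation over all inputs.
double combine(const std::vector<double>& weights,
               const std::vector<double>& activations) {
    assert(weights.size() == activations.size());
    double s = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i)
        s += weights[i] * activations[i];
    return s;
}

// Output of a single unit: transfer function applied to the combined input.
double unitOutput(const std::vector<double>& weights,
                  const std::vector<double>& activations) {
    return logistic(combine(weights, activations));
}
```

Swapping in a different transfer or combining function would then amount to replacing the body of logistic or combine.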
Entropy-based Decision Trees
Based on Ross Quinlan's C4.5 entropy-based decision trees. Can
use information gain or gain ratio for selecting
tests. All tests are binary. Can use discrete and
continuous attributes, with discretization granularity specified on the
command line. Includes separate training program and tree-pruning
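The two selection criteria mentioned above, information gain and gain ratio, can be sketched for a binary test as follows. This is a minimal sketch using the standard textbook definitions rather than the package's own code; the function names are hypothetical, and each branch of the test is assumed to be summarized as a vector of per-class record counts.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Shannon entropy (in bits) of a vector of class counts.
double entropy(const std::vector<int>& counts) {
    double total = 0.0;
    for (int c : counts) total += c;
    if (total == 0.0) return 0.0;
    double h = 0.0;
    for (int c : counts) {
        if (c > 0) {
            const double p = c / total;
            h -= p * std::log2(p);
        }
    }
    return h;
}

// Information gain of a binary test that sends records to a "left" or a
// "right" branch; left[k] and right[k] are the counts of class k per branch.
double informationGain(const std::vector<int>& left,
                       const std::vector<int>& right) {
    assert(left.size() == right.size());
    std::vector<int> parent(left.size());
    double nLeft = 0.0, nRight = 0.0;
    for (std::size_t k = 0; k < left.size(); ++k) {
        parent[k] = left[k] + right[k];
        nLeft += left[k];
        nRight += right[k];
    }
    const double n = nLeft + nRight;
    if (n == 0.0) return 0.0;
    return entropy(parent) - (nLeft / n) * entropy(left)
                           - (nRight / n) * entropy(right);
}

// Gain ratio: information gain divided by the split information, i.e. the
// entropy of the branch sizes. Returns 0 when every record falls in one branch.
double gainRatio(const std::vector<int>& left, const std::vector<int>& right) {
    int nLeft = 0, nRight = 0;
    for (int c : left) nLeft += c;
    for (int c : right) nRight += c;
    const double splitInfo = entropy({nLeft, nRight});
    if (splitInfo == 0.0) return 0.0;
    return informationGain(left, right) / splitInfo;
}
```

A tree builder would evaluate one of these scores for every candidate binary test at a node and keep the test with the highest score.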
BAYES - Bayesian
Standard naive Bayes classifier with discretization parameter
for continuous attributes specified on command line.
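The text does not say which discretization scheme BAYES uses. As one plausible reading, the sketch below assumes equal-width binning, with the number of bins playing the role of the discretization parameter; the function name discretize and its arguments are hypothetical.

```cpp
#include <cassert>

// Map a continuous value to one of `bins` equal-width intervals over
// [lo, hi]. Values outside the range are clamped to the first or last bin.
// Equal-width binning is an assumption, not the package's documented method.
int discretize(double value, double lo, double hi, int bins) {
    assert(bins > 0 && hi > lo);
    int b = static_cast<int>((value - lo) / (hi - lo) * bins);
    if (b < 0) b = 0;
    if (b >= bins) b = bins - 1;
    return b;
}
```

A naive Bayes classifier would then count, for each category, how often each bin index occurs for each attribute, and treat the bin index as a discrete attribute value.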
- Multivariate Linear Regression
Standard multivariate linear regression. Occasionally fails due
to non-invertible matrices.
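To show where a non-invertible matrix can arise, the sketch below fits the coefficients through the normal equations (X^T X) beta = X^T y and reports failure when X^T X is numerically singular, for example when two predictor columns are collinear. This is an assumed formulation, not the package's code; the names Matrix, solve, and fitLinear and the pivot tolerance 1e-12 are illustrative choices.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Solve A x = b by Gaussian elimination with partial pivoting.
// Returns false when A is numerically singular (pivot below a fixed,
// scale-dependent tolerance chosen here only for illustration).
bool solve(Matrix a, std::vector<double> b, std::vector<double>& x) {
    const std::size_t n = a.size();
    for (std::size_t col = 0; col < n; ++col) {
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(a[r][col]) > std::fabs(a[pivot][col])) pivot = r;
        if (std::fabs(a[pivot][col]) < 1e-12) return false;
        std::swap(a[col], a[pivot]);
        std::swap(b[col], b[pivot]);
        for (std::size_t r = col + 1; r < n; ++r) {
            const double f = a[r][col] / a[col][col];
            for (std::size_t c = col; c < n; ++c) a[r][c] -= f * a[col][c];
            b[r] -= f * b[col];
        }
    }
    x.assign(n, 0.0);
    for (std::size_t i = n; i-- > 0;) {  // back substitution
        double s = b[i];
        for (std::size_t c = i + 1; c < n; ++c) s -= a[i][c] * x[c];
        x[i] = s / a[i][i];
    }
    return true;
}

// Fit y ~ X * beta by least squares. Each row of X is one record; include a
// column of ones in X if an intercept is wanted. Returns false on failure.
bool fitLinear(const Matrix& X, const std::vector<double>& y,
               std::vector<double>& beta) {
    if (X.empty() || X.size() != y.size()) return false;
    const std::size_t p = X[0].size();
    Matrix xtx(p, std::vector<double>(p, 0.0));
    std::vector<double> xty(p, 0.0);
    for (std::size_t r = 0; r < X.size(); ++r) {
        for (std::size_t i = 0; i < p; ++i) {
            xty[i] += X[r][i] * y[r];
            for (std::size_t j = 0; j < p; ++j) xtx[i][j] += X[r][i] * X[r][j];
        }
    }
    return solve(xtx, xty, beta);
}
```

When fitLinear returns false, dropping one of the collinear predictors is a common way to make the system solvable.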
KNN - K Nearest Neighbors
K-nearest-neighbors with options for Mahalanobis distance (to
control for multicollinearity), feature selection via F-ratio
(between-group MS / within-group MS), and the K parameter. Used
for classification and discrete time-series prediction.
GP - Genetic Algorithms
An implementation of John Koza's "genetic programming"
paradigm, a form of genetic algorithm in which a population of
computer programs or mathematical formulas is allowed to mutate,
cross-over, migrate, clone, or die according to fitness over some fixed
number of generations. Fitness-proportionate and tournament
selection are provided, with variable tournament size. Population
size, number of generations, degree of mutation/exploitation, maximum
tree height, and the operator set are all specified in the
configuration file. Sample configuration files are included in
the Sample test problems