MACHINE LEARNING PACKAGES
OVERVIEW
These packages consist primarily of classification algorithms
that have been coded in C++. All have been extensively
tested.
OC1 was written in C and is provided by Sreerama Murthy, Steven
Salzberg, and Simon Kasif.
Input File Format
The training and test files each consist of a set of records to be
classified, with one record per line. The predictor attributes
are given first, separated by whitespace, followed by the integer
category at the end of the line. The training file must be named <filestem>.data,
the test file <filestem>.test, and the names file <filestem>.names.
Sample *.names file:
2 categories
donor_score : continuous
acceptor_score : continuous
hexamer_score : continuous
phase : discrete
NOTE! Most of these packages require the category value to
be in the range 0 to N-1, but OC1 requires it to be in the range 1 to N.
NET - Neural Networks
Implements an n-layer (n > 2) feedforward neural network trained by
backpropagation. The default transfer function is the logistic,
1/(1+e^{-s}). The default input combining function is the
sum of weight * activation value over the incoming connections. Both can
be replaced through simple modification of the source code.
Sample configuration file:
maxIterations=300
learningRate=0.10
numLayers=1
neuronsPerLayer=5
networkFilename=none
minadj=-1
maxadj=1
randomize=1
noisefactor=0.99
Specific topologies can also be specified directly in the source code.
ET - Entropy-based Decision Trees
Based on Ross Quinlan's C4.5 entropy-based decision trees. Can
use either information gain or gain ratio for selecting
tests. All tests are binary. Handles both discrete and
continuous attributes, with the discretization granularity specified on the
command line. Includes a separate training program and a tree-pruning
program.
BAYES - Bayesian Classifiers
Standard naive Bayes classifier, with the discretization parameter
for continuous attributes specified on the command line.
REGRESS - Multivariate Linear Regression
Standard multivariate linear regression. Occasionally fails due
to non-invertible matrices.
KNN - K Nearest Neighbors
K-nearest-neighbors with options for Mahalanobis distance (to
control for multicollinearity), feature selection via the F-ratio
(between-group MS / within-group MS), and the K parameter. Used
for classification and discrete time-series prediction.
GP - Genetic Algorithms
An implementation of John Koza's "genetic programming"
paradigm, a form of genetic algorithm in which a population of
computer programs or mathematical formulas is allowed to mutate,
crossover, migrate, clone, or die according to fitness over some fixed
number of generations. Fitness-proportionate and tournament
selection are provided, with variable tournament size. Population
size, number of generations, degree of mutation/exploitation, maximum
tree height, and the operator set are all specified in the
configuration file. Sample configuration files are included in
the Sample test problems
collection.