MACHINE LEARNING PACKAGES
OVERVIEW
These packages consist primarily of classification algorithms
that have been coded in C++. All have been extensively
tested.
OC1 was written in C and is provided by Sreerama Murthy, Steven
Salzberg, and Simon Kasif.
Input File Format
The training and test files each consist of a set of records to be
classified, with one record per line. The predictor attributes
are given first, separated by whitespace, followed by the integer
category at the end of the line. The training file must be named <filestem>.data,
the test file <filestem>.test, and the names file <filestem>.names.
Sample *.names file:
2 categories
donor_score : continuous
acceptor_score : continuous
hexamer_score : continuous
phase : discrete
NOTE! Most of these packages require the category value to
be in the range 0 to N-1, but OC1 requires it to be in the range 1 to N.
NET - Neural Networks
Implements an n-layer (n > 2) feedforward neural network trained by
backpropagation. The default transfer function is the logistic,
1/(1+e^{-s}). The default input combining function is the
sum of weight * activation value over the incoming connections. Both can
be replaced through simple modification of the source code.
Sample configuration file:
maxIterations=300
learningRate=0.10
numLayers=1
neuronsPerLayer=5
networkFilename=none
minadj=-1
maxadj=1
randomize=1
noisefactor=0.99
Specific topologies can also be specified directly in the source code.
ET - Entropy-based Decision Trees
Based on Ross Quinlan's C4.5 entropy-based decision trees. Can
use either information gain or gain ratio for selecting
tests. All tests are binary. Handles both discrete and
continuous attributes, with the discretization granularity specified on the
command line. Includes a separate training program and a tree-pruning
program.
BAYES - Bayesian Classifiers
Standard naive Bayes classifier, with the discretization parameter
for continuous attributes specified on the command line.
REGRESS - Multivariate Linear Regression
Standard multivariate linear regression. Occasionally fails due
to non-invertible matrices.
KNN - K Nearest Neighbors
K-nearest-neighbors with options for Mahalanobis distance (to
control for multicollinearity), feature selection via the F-ratio
(between-group MS / within-group MS), and the K parameter. Used
for classification and discrete time-series prediction.
GP - Genetic Algorithms
An implementation of John Koza's "genetic programming"
paradigm, a form of genetic algorithm in which a population of
computer programs or mathematical formulas is allowed to mutate,
crossover, migrate, clone, or die according to fitness over some fixed
number of generations. Fitness-proportionate and tournament
selection are provided, with variable tournament size. Population
size, number of generations, degree of mutation/exploitation, maximum
tree height, and the operator set are all specified in the
configuration file. Sample configuration files are included in
the Sample test problems
collection.