MACHINE LEARNING PACKAGES



OVERVIEW

These packages consist primarily of classification algorithms that have been coded in C++.  All have been extensively tested. 

OC1 was written in C and is provided by Sreerama Murthy, Steven Salzberg, and Simon Kasif.



Input File Format

The training and test files each consist of a set of records to be classified, with one record per line.  The predictor attributes are given first, separated by whitespace, followed by the integer category at the end of the line.  The training file must be named <filestem>.data, the test file <filestem>.test, and the names file <filestem>.names.

Sample *.names file:
2 categories
donor_score : continuous
acceptor_score : continuous
hexamer_score : continuous
phase : discrete
NOTE! Most of these packages require the category value to be in the range 0 to N-1, where N is the number of categories, but OC1 requires it to be in the range 1 to N.
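The record layout described above can be read with a few lines of C++.  The following is only a sketch and is not taken from the packages themselves: the names Record, parseRecord, and toOc1Category are hypothetical, and it assumes that every attribute, including discrete ones, is coded as a number.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// One parsed record: predictor attributes plus the integer category.
// (Hypothetical type, for illustration only.)
struct Record {
    std::vector<double> attributes;
    int category = 0;
};

// Parses one whitespace-separated line; the last field is taken as the category.
// Returns false if the line has fewer than two fields or a field does not
// begin with a number.
bool parseRecord(const std::string& line, Record& out) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string field;
    while (in >> field) fields.push_back(field);
    if (fields.size() < 2) return false;

    out.attributes.clear();
    try {
        for (std::size_t i = 0; i + 1 < fields.size(); ++i)
            out.attributes.push_back(std::stod(fields[i]));
        out.category = std::stoi(fields.back());
    } catch (const std::exception&) {
        return false;
    }
    return true;
}

// Shifts a category in the range 0..N-1 to the 1..N range that OC1 expects.
int toOc1Category(int zeroBased) { return zeroBased + 1; }
```

A line such as "0.5 1.25 3.0 2 1" would yield four attributes and category 1 under these assumptions.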

NET - Neural Networks

download

Implements an n-layer (n>2) feedforward neural network trained by backpropagation.  The default transfer function is the logistic function, 1/(1+e^(-s)).  The default input combining function is the sum of weight * activation value over a neuron's inputs.  Both can be replaced through simple modification of the source code.

Sample configuration file:
maxIterations=300
learningRate=0.10
numLayers=1
neuronsPerLayer=5
networkFilename=none
min-adj=1
max-adj=1
randomize=1
noise-factor=0.99
Specific network topologies can be set directly in the source code.
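The default transfer and combining functions described in this section can be sketched as follows.  This is an illustration only, not the NET source; the function names logistic, combine, and neuronOutput are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic transfer function: 1 / (1 + e^(-s)).
double logistic(double s) { return 1.0 / (1.0 + std::exp(-s)); }

// Combining function: sum of weight * activation over a neuron's inputs.
// If the two vectors differ in length, only the common prefix is used.
double combine(const std::vector<double>& weights,
               const std::vector<double>& activations) {
    const std::size_t n = std::min(weights.size(), activations.size());
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += weights[i] * activations[i];
    return s;
}

// Output of a single neuron: the transfer function applied to the combined input.
double neuronOutput(const std::vector<double>& weights,
                    const std::vector<double>& activations) {
    return logistic(combine(weights, activations));
}
```

Swapping in a different transfer or combining function amounts to replacing one of the first two functions in a sketch like this.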

ET - Entropy-based Decision Trees

download

Based on Ross Quinlan's C4.5 entropy-based decision trees.  Can use information gain or gain ratio for selecting tests.  All tests are binary.  Can use discrete and continuous attributes, with discretization granularity specified on the command line.  Includes separate training and tree-pruning programs.
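The two selection criteria named above can be written down for a single binary test as follows.  This is a sketch of the textbook definitions, not the ET source; the function names are hypothetical, and both count vectors are assumed to have one entry per category.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Shannon entropy, in bits, of a vector of per-category record counts.
double entropy(const std::vector<int>& counts) {
    double total = 0.0;
    for (int c : counts) total += c;
    if (total == 0.0) return 0.0;
    double h = 0.0;
    for (int c : counts) {
        if (c > 0) {
            const double p = c / total;
            h -= p * std::log2(p);
        }
    }
    return h;
}

// Information gain of a binary test: parent entropy minus the
// size-weighted entropy of the two branches.
// left[k] and right[k] are the counts of category k in each branch.
double informationGain(const std::vector<int>& left, const std::vector<int>& right) {
    if (left.size() != right.size()) return 0.0;  // assumption: same categories
    std::vector<int> parent(left.size());
    double nLeft = 0.0, nRight = 0.0;
    for (std::size_t k = 0; k < left.size(); ++k) {
        parent[k] = left[k] + right[k];
        nLeft += left[k];
        nRight += right[k];
    }
    const double n = nLeft + nRight;
    if (n == 0.0) return 0.0;
    return entropy(parent) - (nLeft / n) * entropy(left) - (nRight / n) * entropy(right);
}

// Gain ratio: information gain divided by the entropy of the split itself.
double gainRatio(const std::vector<int>& left, const std::vector<int>& right) {
    int nLeft = 0, nRight = 0;
    for (int c : left) nLeft += c;
    for (int c : right) nRight += c;
    const double splitInfo = entropy({nLeft, nRight});
    return splitInfo > 0.0 ? informationGain(left, right) / splitInfo : 0.0;
}
```

A test that separates two equally frequent categories perfectly has a gain of one bit, while a test that leaves both branches with the parent's mix has a gain of zero.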

BAYES - Bayesian Classifiers

download

Standard naive Bayes classifier with discretization parameter for continuous attributes specified on command line.
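As a rough illustration of how a naive Bayes classifier with discretized continuous attributes can work, consider the sketch below.  It is not the BAYES source: equal-width binning, add-one smoothing, and the names discretize and NaiveBayes are all assumptions made for the example, and the discretization parameter is taken to be the number of bins.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Equal-width binning (an assumption): maps x in [lo, hi] to a bin in 0..bins-1,
// clamping values that fall outside the range.
int discretize(double x, double lo, double hi, int bins) {
    if (bins <= 1 || hi <= lo) return 0;
    int b = static_cast<int>((x - lo) / (hi - lo) * bins);
    if (b < 0) b = 0;
    if (b >= bins) b = bins - 1;
    return b;
}

// Hypothetical classifier over already-discretized attributes.
// Callers must pass bin indices in 0..numBins-1 and categories in 0..numCategories-1.
struct NaiveBayes {
    int numCategories;
    int numBins;
    int total = 0;
    std::vector<int> classCount;                           // records per category
    std::vector<std::vector<std::vector<int>>> binCount;   // [attribute][category][bin]

    NaiveBayes(int numAttributes, int categories, int bins)
        : numCategories(categories), numBins(bins),
          classCount(categories, 0),
          binCount(numAttributes,
                   std::vector<std::vector<int>>(categories, std::vector<int>(bins, 0))) {}

    // Adds one training record.
    void add(const std::vector<int>& binned, int category) {
        ++total;
        ++classCount[category];
        for (std::size_t a = 0; a < binned.size(); ++a) ++binCount[a][category][binned[a]];
    }

    // Returns the category maximizing log P(c) + sum over attributes of log P(bin | c),
    // with add-one smoothing so that no probability is zero.
    int classify(const std::vector<int>& binned) const {
        int best = 0;
        double bestScore = -std::numeric_limits<double>::infinity();
        for (int c = 0; c < numCategories; ++c) {
            double score = std::log((classCount[c] + 1.0) / (total + numCategories));
            for (std::size_t a = 0; a < binned.size(); ++a)
                score += std::log((binCount[a][c][binned[a]] + 1.0) / (classCount[c] + numBins));
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
};
```

Working in log space avoids numerical underflow when many attributes are multiplied together.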

REGRESS - Multivariate Linear Regression

download

Standard multivariate linear regression.  Occasionally fails because the matrix to be inverted is singular, for example when predictor attributes are linearly dependent.
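The failure mode mentioned above is easy to see in a least-squares sketch.  The code below is not the REGRESS source; it assumes the fit is done through the normal equations (X^T X) beta = X^T y, solved by Gauss-Jordan elimination with partial pivoting, and the name fitLeastSquares and the tolerance 1e-12 are hypothetical choices.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Fits y ~= X * beta by solving the normal equations (X^T X) beta = X^T y.
// All rows of X are assumed to have the same length.
// Returns false when X^T X is numerically singular, that is, non-invertible.
bool fitLeastSquares(const Matrix& X, const std::vector<double>& y,
                     std::vector<double>& beta) {
    const std::size_t n = X.size();
    if (n == 0 || y.size() != n) return false;
    const std::size_t p = X[0].size();

    // Build the augmented matrix [X^T X | X^T y].
    Matrix a(p, std::vector<double>(p + 1, 0.0));
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t i = 0; i < p; ++i) {
            for (std::size_t j = 0; j < p; ++j) a[i][j] += X[r][i] * X[r][j];
            a[i][p] += X[r][i] * y[r];
        }
    }

    // Gauss-Jordan elimination with partial pivoting.
    for (std::size_t col = 0; col < p; ++col) {
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < p; ++r)
            if (std::fabs(a[r][col]) > std::fabs(a[pivot][col])) pivot = r;
        if (std::fabs(a[pivot][col]) < 1e-12) return false;  // singular matrix
        std::swap(a[col], a[pivot]);
        for (std::size_t r = 0; r < p; ++r) {
            if (r == col) continue;
            const double f = a[r][col] / a[col][col];
            for (std::size_t j = col; j <= p; ++j) a[r][j] -= f * a[col][j];
        }
    }

    beta.assign(p, 0.0);
    for (std::size_t i = 0; i < p; ++i) beta[i] = a[i][p] / a[i][i];
    return true;
}
```

When one predictor column is a multiple of another, the pivot check fails and the function reports the problem instead of returning meaningless coefficients.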

KNN - K Nearest Neighbors

download

K-nearest-neighbors with options for Mahalanobis distance (to control for multicollinearity), feature selection via F-ratio (between-group MS / within-group MS), and the K parameter.  Used for classification and discrete time-series prediction.

GP - Genetic Algorithms

download

An implementation of John Koza's "genetic programming" paradigm, a form of genetic algorithm in which a population of computer programs or mathematical formulas is allowed to mutate, cross over, migrate, clone, or die according to fitness over some fixed number of generations.  Fitness-proportionate and tournament selection are provided, with variable tournament size.  Population size, number of generations, degree of mutation/exploitation, maximum tree height, and the operator set are all specified in the configuration file.  Sample configuration files are included in the Sample test problems collection.
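The two selection schemes mentioned above can be sketched as follows.  This is a generic illustration, not the GP source: the function names are hypothetical, higher fitness is assumed to be better, fitness values are assumed to be non-negative with at least one positive entry for the proportionate scheme, and the population is assumed to be non-empty.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Tournament selection: draws tournamentSize individuals uniformly at random
// (with replacement) and returns the index of the fittest one drawn.
// A tournamentSize of 0 or 1 reduces to uniform random selection.
std::size_t tournamentSelect(const std::vector<double>& fitness,
                             std::size_t tournamentSize, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, fitness.size() - 1);
    std::size_t best = pick(rng);
    for (std::size_t i = 1; i < tournamentSize; ++i) {
        const std::size_t candidate = pick(rng);
        if (fitness[candidate] > fitness[best]) best = candidate;
    }
    return best;
}

// Fitness-proportionate selection: each individual is chosen with
// probability fitness[i] / sum of all fitness values.
std::size_t proportionateSelect(const std::vector<double>& fitness, std::mt19937& rng) {
    std::discrete_distribution<std::size_t> pick(fitness.begin(), fitness.end());
    return pick(rng);
}
```

A larger tournament size increases selection pressure, since weak individuals are less likely to win a tournament against more competitors.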