These packages consist primarily of classification algorithms that have been coded in C++.  All have been extensively tested. 

OC1 was written in C and is provided by Sreerama Murthy, Steven Salzberg, and Simon Kasif.

Input File Format

The training and test files each consist of a set of records to be classified, with one record per line.  The predictor attributes are given first, separated by whitespace, followed by the integer category at the end of the line.  The training file must be named <filestem>.data, the test file <filestem>.test, and the names file <filestem>.names.

Sample *.names file:
2 categories
donor_score : continuous
acceptor_score : continuous
hexamer_score : continuous
phase : discrete

NOTE! Most of these packages require the category value to be in the range 0 to N-1, where N is the number of categories, but OC1 requires it to be in the range 1 to N.
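
The record layout described in this section can be illustrated with a small reader for a single line. This is only a sketch: the Record struct and parseRecord function are hypothetical names, not part of the packages, and the code assumes that discrete attributes are encoded as numbers. It does not check the category range, since that differs between OC1 and the other packages.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type: predictor attributes plus an integer category.
struct Record {
    std::vector<double> attributes;
    int category = 0;
};

// Parse one whitespace-separated line; the last field is the category.
// Returns false for blank lines, lines with fewer than two fields,
// or fields that do not start with a number.
bool parseRecord(const std::string& line, Record& out) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string field;
    while (in >> field) fields.push_back(field);
    if (fields.size() < 2) return false;

    Record rec;
    try {
        for (std::size_t i = 0; i + 1 < fields.size(); ++i)
            rec.attributes.push_back(std::stod(fields[i]));
        std::size_t used = 0;
        rec.category = std::stoi(fields.back(), &used);
        // The category must be a whole integer, e.g. "1" but not "1.5".
        if (used != fields.back().size()) return false;
    } catch (const std::exception&) {
        return false;
    }
    out = rec;
    return true;
}
```

A caller would read <filestem>.data line by line with std::getline and pass each line to parseRecord, skipping lines for which it returns false.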

NET - Neural Networks


Implements an n-layer (n>2) feedforward neural network trained by backpropagation.  Default transfer function is the logistic: 1/(1+e^(-s)).  Default input combining function is summation of weight * activation value.  Both can be replaced through simple modification of the source code.
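
The default transfer and combining functions described above can be sketched as follows. The function names (logistic, combine, unitOutput) are illustrative assumptions and are not taken from the NET source code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic transfer function: 1 / (1 + e^(-s)).
double logistic(double s) {
    return 1.0 / (1.0 + std::exp(-s));
}

// Summation combining function: sum of weight * activation over all inputs.
double combine(const std::vector<double>& weights,
               const std::vector<double>& activations) {
    assert(weights.size() == activations.size());
    double s = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i)
        s += weights[i] * activations[i];
    return s;
}

// Output of a single unit: transfer function applied to the combined input.
double unitOutput(const std::vector<double>& weights,
                  const std::vector<double>& activations) {
    return logistic(combine(weights, activations));
}
```

Replacing either function, for example swapping the logistic for tanh, changes the behavior of every unit without touching the rest of the network code.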

Sample configuration file:
Specific topologies can be specified conveniently in the source code.

ET - Entropy-based Decision Trees


Based on Ross Quinlan's C4.5 entropy-based decision trees.  Can use information gain or gain ratio for selecting tests.  All tests are binary.  Can use discrete and continuous attributes, with discretization granularity specified on the command line.  Includes separate training program and tree-pruning program.
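
The two selection criteria for a binary test can be sketched from their textbook definitions. This is not the ET source code: the function names are hypothetical, and each argument holds the per-category record counts on one branch of the test.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Shannon entropy (in bits) of a vector of per-category counts.
double entropy(const std::vector<int>& counts) {
    int total = 0;
    for (int c : counts) total += c;
    if (total == 0) return 0.0;
    double h = 0.0;
    for (int c : counts) {
        if (c > 0) {
            const double p = static_cast<double>(c) / total;
            h -= p * std::log2(p);
        }
    }
    return h;
}

// Information gain of a binary test: parent entropy minus the
// size-weighted entropy of the two branches.
double informationGain(const std::vector<int>& left,
                       const std::vector<int>& right) {
    assert(left.size() == right.size());
    std::vector<int> parent(left.size());
    int nl = 0, nr = 0;
    for (std::size_t i = 0; i < left.size(); ++i) {
        parent[i] = left[i] + right[i];
        nl += left[i];
        nr += right[i];
    }
    const int n = nl + nr;
    if (n == 0) return 0.0;
    return entropy(parent)
         - (static_cast<double>(nl) / n) * entropy(left)
         - (static_cast<double>(nr) / n) * entropy(right);
}

// Gain ratio: information gain divided by the entropy of the split itself.
double gainRatio(const std::vector<int>& left,
                 const std::vector<int>& right) {
    int nl = 0, nr = 0;
    for (int c : left) nl += c;
    for (int c : right) nr += c;
    const double splitInfo = entropy({nl, nr});
    if (splitInfo == 0.0) return 0.0;  // all records fall on one branch
    return informationGain(left, right) / splitInfo;
}
```

Gain ratio penalizes tests that split off only a handful of records, which information gain alone tends to favor.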

BAYES - Bayesian Classifiers


Standard naive Bayes classifier with discretization parameter for continuous attributes specified on command line.
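
A minimal sketch of naive Bayes over discretized attributes follows. The source does not say how BAYES discretizes or smooths, so equal-width binning and add-one (Laplace) smoothing are assumptions here, as are the names discretize and NaiveBayes.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Map a continuous value to one of `bins` equal-width intervals over [lo, hi].
// Values outside the range are clamped to the first or last bin.
int discretize(double x, double lo, double hi, int bins) {
    if (hi <= lo) return 0;
    const int b = static_cast<int>((x - lo) / (hi - lo) * bins);
    return std::max(0, std::min(bins - 1, b));
}

struct NaiveBayes {
    int numClasses;
    int numBins;
    std::vector<int> classCount;                          // [category]
    std::vector<std::vector<std::vector<int>>> binCount;  // [category][attribute][bin]

    NaiveBayes(int classes, int attrs, int bins)
        : numClasses(classes), numBins(bins), classCount(classes, 0),
          binCount(classes, std::vector<std::vector<int>>(
                                attrs, std::vector<int>(bins, 0))) {}

    // Add one training record whose attributes are already discretized.
    void train(const std::vector<int>& bins, int category) {
        ++classCount[category];
        for (std::size_t a = 0; a < bins.size(); ++a)
            ++binCount[category][a][bins[a]];
    }

    // Return the category with the highest smoothed log posterior.
    int classify(const std::vector<int>& bins) const {
        int total = 0;
        for (int c : classCount) total += c;
        int best = 0;
        double bestScore = -std::numeric_limits<double>::infinity();
        for (int c = 0; c < numClasses; ++c) {
            double score = std::log((classCount[c] + 1.0) / (total + numClasses));
            for (std::size_t a = 0; a < bins.size(); ++a)
                score += std::log((binCount[c][a][bins[a]] + 1.0) /
                                  (classCount[c] + numBins));
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
};
```

The number of bins plays the role of the discretization parameter: more bins capture finer structure but leave fewer training records per bin.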

REGRESS - Multivariate Linear Regression


Standard multivariate linear regression.  Occasionally fails due to non-invertible matrices.
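
The failure mode mentioned above can be seen in a small least-squares sketch that solves the normal equations (X^T X) beta = X^T y by Gauss-Jordan elimination. This is an illustration under assumptions, not the REGRESS source: fitLinear is a hypothetical name, every row of X is assumed to have the same length, and the pivot tolerance of 1e-12 is an arbitrary choice.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Fit y ~ X * beta by least squares. Returns false when X^T X is
// (numerically) singular, e.g. when two predictor columns are collinear.
bool fitLinear(const std::vector<std::vector<double>>& X,
               const std::vector<double>& y,
               std::vector<double>& beta) {
    if (X.empty() || X.size() != y.size()) return false;
    const std::size_t p = X[0].size();

    // Build the augmented matrix [X^T X | X^T y].
    std::vector<std::vector<double>> A(p, std::vector<double>(p + 1, 0.0));
    for (std::size_t r = 0; r < X.size(); ++r) {
        for (std::size_t i = 0; i < p; ++i) {
            for (std::size_t j = 0; j < p; ++j) A[i][j] += X[r][i] * X[r][j];
            A[i][p] += X[r][i] * y[r];
        }
    }

    // Gauss-Jordan elimination with partial pivoting.
    for (std::size_t col = 0; col < p; ++col) {
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < p; ++r)
            if (std::fabs(A[r][col]) > std::fabs(A[pivot][col])) pivot = r;
        if (std::fabs(A[pivot][col]) < 1e-12) return false;  // non-invertible
        std::swap(A[col], A[pivot]);
        for (std::size_t r = 0; r < p; ++r) {
            if (r == col) continue;
            const double f = A[r][col] / A[col][col];
            for (std::size_t j = col; j <= p; ++j) A[r][j] -= f * A[col][j];
        }
    }

    beta.assign(p, 0.0);
    for (std::size_t i = 0; i < p; ++i) beta[i] = A[i][p] / A[i][i];
    return true;
}
```

Include a column of ones in X to fit an intercept. When the function returns false, dropping one of the collinear predictors is the usual remedy.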

KNN - K Nearest Neighbors


K-nearest-neighbors with options for Mahalanobis distance (to control for multicollinearity), feature selection via F-ratio (between-group MS / within-group MS), and the K parameter.  Used for classification and discrete time-series prediction.

GP - Genetic Algorithms


An implementation of John Koza's "genetic programming" paradigm, a form of genetic algorithm in which a population of computer programs or mathematical formulas is allowed to mutate, cross-over, migrate, clone, or die according to fitness over some fixed number of generations.  Fitness-proportionate and tournament selection are provided, with variable tournament size.  Population size, number of generations, degree of mutation/exploitation, maximum tree height, and the operator set are all specified in the configuration file.  Sample configuration files are included in the Sample test problems collection.
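
The two selection schemes named above can be sketched over a vector of fitness values. This is not the GP source code: the function names are hypothetical, higher fitness is assumed to be better, fitness values are assumed to be non-negative with at least one positive, and tournament members are drawn with replacement.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Tournament selection: draw `tournamentSize` individuals uniformly at random
// (with replacement) and return the index of the fittest one drawn.
// The population must not be empty.
std::size_t tournamentSelect(const std::vector<double>& fitness,
                             std::size_t tournamentSize,
                             std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, fitness.size() - 1);
    std::size_t best = pick(rng);
    for (std::size_t i = 1; i < tournamentSize; ++i) {
        const std::size_t challenger = pick(rng);
        if (fitness[challenger] > fitness[best]) best = challenger;
    }
    return best;
}

// Fitness-proportionate selection: each individual is chosen with
// probability equal to its share of the total fitness.
std::size_t proportionateSelect(const std::vector<double>& fitness,
                                std::mt19937& rng) {
    std::discrete_distribution<std::size_t> wheel(fitness.begin(), fitness.end());
    return wheel(rng);
}
```

A larger tournament size increases selection pressure: with size 1 the choice is purely random, while a size close to the population size almost always returns the fittest individual.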