Distinguishing Exons from Non-Exons With Various Machine Learning Techniques

b.majoros
1/20/2004


FEATURES:

  1. TigrScan's scoring of the first signal (acceptor/start-codon)
  2. TigrScan's scoring of the second signal (donor/stop-codon)
  3. Exon length probability (from empirical training distribution)
  4. Hexamer score = sum log[ P(H|coding)/P(H) ] over all hexamers H in the exon

* ORFs were randomly sampled from DNA containing both coding and noncoding segments -- overlap with true exons was not prevented and certainly occurred
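
The hexamer score in feature 4 can be sketched as follows. This is a minimal illustration, not the original implementation: the function name, the probability tables, the one-base sliding window, the natural log, and the choice to skip unseen hexamers are all assumptions that the notes do not specify.

```python
import math

def hexamer_score(exon, p_coding, p_background, k=6):
    """Sum of log[P(H|coding) / P(H)] over every k-mer window in `exon`.

    Assumptions (not stated in the notes): windows overlap and advance one
    base at a time, natural logs are used, and hexamers missing from either
    table are skipped.
    """
    score = 0.0
    for i in range(len(exon) - k + 1):
        h = exon[i:i + k]
        if h in p_coding and h in p_background:
            score += math.log(p_coding[h] / p_background[h])
    return score

# Toy tables with made-up probabilities, for illustration only.
p_coding = {"ATGGCC": 0.004, "TGGCCA": 0.002}
p_background = {"ATGGCC": 0.001, "TGGCCA": 0.002}
print(hexamer_score("ATGGCCA", p_coding, p_background))
```

A positive score means the hexamers in the candidate are, on balance, more typical of coding sequence than of the background.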

Numbers of true and false exons were roughly equal.

METHODS:


RESULTS:            

                          Accuracy (%)
----------------------------------------------------------------
                OC1  OC1-a   ET  ET-r  NET  REG  KNN  BAYES   GA
                ---  -----   --  ----  ---  ---  ---  -----   --
TOXOPLASMA       92     90   91    89   91   82   91     92   85
ASPERGILLUS#1    88     89   89    90   89   68   88     91   76
ASPERGILLUS#2    90     90   89    87   86   63   90     91   72
ARABIDOPSIS      89     91   91    90   86   84   93     93   84
HUMAN            86     92   91    91   80   83   93     92   81
MOUSE            90     93   94    92   83   85   94     94   83
PLASMODIUM       86     89   89    90   76   81   79     85   78
                 --     --   --    --   --   --   --     --   --
mean:            89     91   91    90   84   78   90     91   80


COMPARISON OF HEXAMER SCORE VS. MARKOV CHAIN SCORE AS A PREDICTOR (USING NAIVE BAYES MODEL)

- Numbers differ slightly from above because the training/test data were regenerated and randomly sampled anew.
- The other three features were still used (splice-site/start/stop-codon scores, length probability).

                HEXAMER  MARKOV
                -------  ------
TOXOPLASMA           92      93
ASPERGILLUS#1        92      92
ASPERGILLUS#2        91      91
ARABIDOPSIS          91      91
HUMAN                93      92
MOUSE                95      93
PLASMODIUM           95      93
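
The notes do not say what order the Markov chain had or how its score was formed. As a rough sketch only, one plausible formulation is a log-odds ratio between a coding chain and a noncoding chain. The function name, the table layout, the chain order, and all probabilities below are hypothetical.

```python
import math

def markov_log_odds(seq, coding, noncoding, order=2):
    """Sum of log[P_coding(base | context) / P_noncoding(base | context)].

    Assumptions (not from the notes): a fixed-order chain, natural logs,
    the first `order` bases serve only as context, and both tables hold
    every context/base pair that occurs in `seq`.
    """
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding[context][base] / noncoding[context][base])
    return score

# Toy first-order tables with made-up probabilities, for illustration only.
coding = {"A": {"T": 0.5}, "T": {"G": 0.4}, "G": {"A": 0.2}}
noncoding = {"A": {"T": 0.25}, "T": {"G": 0.2}, "G": {"A": 0.2}}
print(markov_log_odds("ATGA", coding, noncoding, order=1))
```

Like the hexamer score, this yields a single number per candidate, so either one can fill the same feature slot in the naive Bayes model.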


MULTIVARIATE REGRESSION RESULTS WITH MISSING FEATURES:

                         Missing Feature
                ---------------------------------
SPECIES         LENGTH  ATG/AG  TAG/GT  HEX  NONE
-------         ------  ------  ------  ---  ----
TOXOPLASMA          75      82      81   77    82
ASPERGILLUS#1       61      67      71   67    68
ASPERGILLUS#2       64      63      67   58    63
ARABIDOPSIS         81      87      85   72    84
HUMAN               83      82      83   77    83
MOUSE               80      83      85   80    85
PLASMODIUM          68      74      67   77    80



OC1 RESULTS WITH MISSING FEATURES:

                         Missing Feature
                ---------------------------------
SPECIES         LENGTH  ATG/AG  TAG/GT  HEX  NONE  FLAGS
-------         ------  ------  ------  ---  ----  -----
TOXOPLASMA          91      87      89   88    92
ASPERGILLUS#1       88      84      85   89    89
ASPERGILLUS#2       90      87      84   90    90  -a
ARABIDOPSIS         92      88      89   85    91  -a
HUMAN               93      91      90   91    92  -a
MOUSE               93      91      92   92    93  -a
PLASMODIUM          77      83      87   86    89  -a
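
The missing-feature experiments above follow a leave-one-feature-out pattern: retrain with one feature removed and compare against the full set (the NONE column). A minimal sketch of that loop is given below. The nearest-centroid classifier is only a stand-in so the example runs on its own; it is not the regression or OC1 method used in the tables, and every function name and data value here is hypothetical.

```python
def drop_feature(rows, j):
    """Return copies of the feature vectors with column j removed."""
    return [r[:j] + r[j + 1:] for r in rows]

def train_centroid(X, y):
    """Stand-in classifier: nearest class centroid (not REG or OC1)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(x):
        # Pick the class whose centroid is closest in squared distance.
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return predict

def accuracy(predict, X, y):
    """Percentage of examples classified correctly."""
    return 100.0 * sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def ablation(names, X_train, y_train, X_test, y_test, train=train_centroid):
    """Accuracy (%) with each feature removed in turn, plus the full set."""
    results = {}
    for j, name in enumerate(names):
        model = train(drop_feature(X_train, j), y_train)
        results[name] = accuracy(model, drop_feature(X_test, j), y_test)
    results["NONE"] = accuracy(train(X_train, y_train), X_test, y_test)
    return results

# Toy data: the first feature separates the classes, the second does not.
names = ["LENGTH", "HEX"]
X_train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y_train = [0, 0, 1, 1]
X_test = [[0.1, 0.9], [0.9, 0.1]]
y_test = [0, 1]
print(ablation(names, X_train, y_train, X_test, y_test))
```

A large drop relative to NONE suggests the removed feature carries information the others do not; little or no drop suggests it is redundant for that classifier.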


RESULTS ON A SET OF NON-GENOMIC TASKS:


Problems were taken from the UCI Machine Learning Repository.

                RANDOM  MAJORITY   ET  OC1  NET  OC1-a  OC1best
                ------  --------   --  ---  ---  -----  -------
yeast               10        35   55   54   45     58       58
monk1               50        50   93   91   88     88       91
monk2               50        67   87   67   67     67       67
monk3               50        47   97   93   96     93       93
hypo                20        92   97   99   94     99       99
hepatitis           50        81   84   80   81     80       80
wine                33        36   89   90   99     63       90
watertreat           8        48   68   68   73     68       68
vote                50        61   97   96   97     96       96
trivial             50        58   78   97   99     77       97
tae                 33        27   60   59   61     29       59
soybean              5        13   85   82   91     80       82
scale               33        16   79   91   94     80       91
iris                33        31   99  100   99     96      100
ionosphere          50        82   91   89   96     92       92
ind-diabetes        50        64   76   74   74     75       75
liver               50        49   64   57   64     71       71
bands               50        51   64   68   67     71       71
breast-cancer       50        65   95   96   97     93       96
cmc                 33        40   51   53   55     52       53
crx                 50        55   84   80   87     83       83
dermatology         17        31   97   88   98     97       97
echocardiogram      50        82   82   82   82     82       82
ecoli               13        39   76   72   77     74       74
glass               14        31   71   42   64     69       69
haberman            50        68   73   68   69     68       68
mpg                 33        45   81   84   86     80       84
mushroom            50        52  100  100  100    100      100
nl1                 50        49   88   99   99     84       99
nl2                 50        59   89   86   93     90       90
nl3                 50        60   78   85   88     72       85
nl4                 50        48   85   87   94     84       87
nl5                 50        52   80   82   83     79       82
nl6                 50        50   91   88   93     90       90
---------------------------------------------------------------
mean:              39%       51%  82%  81%  84%    79%      83%

RANDOM = random-guessing baseline
MAJORITY = majority-rule baseline
OC1best = best score from OC1 -- either axis-parallel or oblique
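
The three derived columns defined above can be sketched as follows. The notes do not state how the baselines were computed, so this is an assumption-laden illustration: random guessing is taken as 100 divided by the number of classes (consistent with, for example, 33 for the three-class iris task), and the majority class is assumed to be chosen on the training set and scored on the test set. The function names and toy labels are hypothetical.

```python
from collections import Counter

def random_baseline(labels):
    """Expected accuracy (%) of uniform guessing among the observed classes."""
    return 100.0 / len(set(labels))

def majority_baseline(train_labels, test_labels):
    """Accuracy (%) of always predicting the most common training class.

    Assumption: the majority class is picked on the training data and
    evaluated on separate test data.
    """
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(t == majority for t in test_labels)
    return 100.0 * hits / len(test_labels)

def oc1_best(oblique_acc, axis_parallel_acc):
    """OC1best: the better of the oblique and axis-parallel OC1 scores."""
    return max(oblique_acc, axis_parallel_acc)

# Toy labels, for illustration only.
train = ["a", "a", "b", "c"]
test = ["a", "b", "b", "c"]
print(random_baseline(train))          # three classes
print(majority_baseline(train, test))  # "a" is the training majority
print(oc1_best(54, 58))                # OC1 and OC1-a values from the yeast row
```

Under the train/test assumption, the majority baseline can fall below random guessing whenever the training majority class is underrepresented in the test set, which is one possible explanation for rows such as scale and tae.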