Arabidopsis thaliana :
Arabidopsis thaliana is the first eukaryotic organism on which EuGène has been extensively used.

The following tables compare the results obtained with EuGène with those of several other gene prediction programs (described in Table 1).
Program    Reference                                 Version
Grail      Xu and Uberbacher, 1997                   1.3, data from [*]
Fex        Solovyev et al., 1994                     data from [*]
MZEF       Zhang, 1998                               prior p=0.04, data from [*]
GenScan    Burge and Karlin, 1997                    data from [*]
GlimmerA   Salzberg et al., 1999                     1.0
GMhmm1     Lukashin and Borodovsky, 1998             data from [*]
GMhmm      Lukashin and Borodovsky, 1998             2.2a
FgenesP    Solovyev, unpublished, 1997               data from [*]
FgenesH    Salamov and Solovyev, unpublished, 1999   1.0
EuGène50   -                                         Rel. 1.2, 50% sensitivity
EuGène80   -                                         Rel. 1.2, 80% sensitivity
Table 1 : A list of the gene prediction programs compared along with the corresponding bibliographic reference and version.


[*] : N. Pavy, S. Rombauts, P. Déhais, C. Mathé, D.V.V. Ramana, P. Leroy, and P. Rouzé. Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics, 1999, 15(11):887-899.
           Base level        Exon level
Program    Sn    Sp    CC    Pred  Corr  Olap  Wrong  Miss  Split  Merged  Sn   Sp
Grail      -     -     -     1184  449   506   229    80    12     16      44%  38%
Fex        -     -     -     1745  562   484   699    155   180    23      55%  32%
MZEF       -     -     -     846   459   236   151    358   32     14      45%  54%
GenScan    -     -     -     938   652   204   82     175   10     16      63%  70%
FgenesP    -     -     -     737   433   195   109    403   7      8       42%  59%
GMhmm1     -     -     -     1104  845   172   87     26    10     4       82%  77%
GMhmm      0.97  0.93  0.94  1093  854   157   85     28    6      6       83%  78%
GlimmerA   0.87  0.89  0.84  1034  697   186   164    149   5      26      67%  67%
FgenesH    0.98  0.93  0.94  1070  900   105   72     23    1      14      88%  84%
FgenesHGC  0.98  0.93  0.94  1021  902   100   78     25    0      12      88%  88%
EuGène50   0.95  0.95  0.93  974   849   94    38     86    2      14      83%  87%
EuGène80   0.95  0.95  0.93  991   862   89    47     71    1      14      84%  87%
Table 2 : Results of the gene prediction software evaluation on Araset at the base and exon levels. Each line corresponds to a gene prediction program, as described in Table 1. At the base level, on the left, sensitivity (Sn), specificity (Sp) and the correlation coefficient (CC) are reported. At the exon level, the columns successively give the number of predicted (Pred), correct (Corr), overlapping with an annotated exon in the same coding frame (Olap), overpredicted (Wrong) and missing (Miss) exons. The next two columns give the number of annotated exons that are predicted as two split exons (Split) and the number of predicted exons that merge several annotated exons into one (Merged). The last two columns report sensitivity (Sn) and specificity (Sp) at the exon level.

Program    Pred  Corr  Miss  Part  Wrong  Split  Merged  Sn   Sp
GenScan    150   28    1     139   13     1      60      17%  19%
FgenesP    92    10    47    111   3      0      60      6%   11%
GMhmm1     208   67    1     100   27     18     12      40%  32%
GMhmm      187   69    1     98    24     2      12      41%  37%
GlimmerA   265   50    2     116   58     35     0       30%  19%
FgenesH    176   94    1     73    14     0      10      56%  53%
FgenesHGC  175   96    1     71    14     0      12      57%  55%
EuGène50   178   101   5     62    15     1      2       60%  57%
EuGène80   199   112   2     54    22     11     0       67%  56%
Table 3 : Results of the gene prediction software evaluation on Araset at the whole gene level. Each line corresponds to a gene prediction program, as described in Table 1. The numbers of predicted (Pred), completely correct (Corr), completely missing (Miss), partially predicted (Part) and overpredicted (Wrong) genes are presented first. The next two columns give the number of annotated genes that are split in the predictions (Split) and the number of predicted genes that merge several annotated genes into one (Merged). The last two columns report sensitivity (Sn) and specificity (Sp) at the gene level.
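The exon and gene level Sn and Sp columns can be recomputed from the counts in the same rows. The sketch below assumes that sensitivity is the fraction of annotated features predicted exactly and that specificity is the fraction of predictions that are exactly right. The totals of 1027 annotated exons and 168 annotated genes are not stated on this page: they are an assumption, inferred from the two-decimal percentages given in Tables 4 and 5. The function name `sn_sp` is illustrative only.

```python
def sn_sp(correct, predicted, annotated):
    """Return (sensitivity, specificity) in percent from raw counts."""
    sensitivity = 100.0 * correct / annotated   # share of annotated features found exactly
    specificity = 100.0 * correct / predicted   # share of predictions that are exactly right
    return sensitivity, specificity

# ASSUMPTION: these totals are not given in the text; they are inferred
# from the percentages reported in Tables 4 and 5.
ANNOTATED_EXONS = 1027
ANNOTATED_GENES = 168

# EuGène50 rows: Pred and Corr taken from Table 2 (exons) and Table 3 (genes).
exon_sn, exon_sp = sn_sp(correct=849, predicted=974, annotated=ANNOTATED_EXONS)
gene_sn, gene_sp = sn_sp(correct=101, predicted=178, annotated=ANNOTATED_GENES)

print(f"exon level: Sn={exon_sn:.0f}%  Sp={exon_sp:.0f}%")
print(f"gene level: Sn={gene_sn:.0f}%  Sp={gene_sp:.0f}%")
```

Under these assumptions the EuGène50 row is reproduced (83% and 87% at the exon level, 60% and 57% at the gene level). Some other rows differ by one point, which may come from rounding in the original tables.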
              Base level           Exon level
Version       Sn     Sp     CC     Pred  Corr  Olap  Wrong  Miss  Split  Merged  Sn      Sp
EuGène50      0.948  0.952  0.934  974   849   94    38     86    2      14      82.67%  87.17%
EuGène50EST   0.961  0.952  0.942  1014  895   79    46     53    0      12      87.15%  88.26%
EuGène50Prot  0.981  0.955  0.957  1035  926   75    40     27    1      12      90.17%  89.47%
EuGène50Full  0.980  0.953  0.955  1047  936   65    50     26    0      8       91.14%  89.40%
Table 4 : Results of EuGène's evaluation on Araset at the base and exon levels when similarity information is used. Each line corresponds to a specific variant of EuGène that is given either no similarity information (EuGène50), similarity information from EST and cDNA sequences in the dbEST and PlantGene databases (EuGène50EST), similarity information from SwissProt rel. 40 from which all Arabidopsis thaliana sequences have been removed (EuGène50Prot), or both EST and protein similarities (EuGène50Full). At the base level, on the left, sensitivity (Sn), specificity (Sp) and the correlation coefficient (CC) are reported. At the exon level, the columns successively give the number of predicted (Pred), correct (Corr), overlapping with an annotated exon in the same coding frame (Olap), overpredicted (Wrong) and missing (Miss) exons. The next two columns give the number of annotated exons that are predicted as two split exons (Split) and the number of predicted exons that merge several annotated exons into one (Merged). The last two columns report sensitivity (Sn) and specificity (Sp) at the exon level.
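The base level Sn, Sp and CC values are reported here without a definition. As a minimal sketch, assuming the definitions commonly used when evaluating gene finders (each nucleotide is classified as coding or non-coding, and specificity is the fraction of predicted coding bases that are truly coding), they could be computed as follows. The function name and the nucleotide counts are hypothetical, chosen only to make the arithmetic easy to follow; they are not Araset figures.

```python
import math

def base_level_metrics(tp, tn, fp, fn):
    """Base-level Sn, Sp and CC from nucleotide counts.

    tp: coding bases predicted as coding
    tn: non-coding bases predicted as non-coding
    fp: non-coding bases predicted as coding
    fn: coding bases predicted as non-coding
    """
    sn = tp / (tp + fn)   # fraction of true coding bases that were predicted
    sp = tp / (tp + fp)   # fraction of predicted coding bases that are correct
    # Correlation coefficient between the annotation and the prediction.
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom
    return sn, sp, cc

# Hypothetical counts, for illustration only.
sn, sp, cc = base_level_metrics(tp=950, tn=950, fp=50, fn=50)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")
```

With these made-up counts, Sn and Sp are both 0.95 and CC is 0.90. CC also depends on the number of true negatives, so two programs with the same Sn and Sp can have different CC values.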
Version       Pred  Corr  Miss  Part  Wrong  Split  Merged  Sn      Sp
EuGène50      178   101   5     62    15     1      2       60.12%  56.74%
EuGène50EST   185   115   5     48    19     4      2       68.45%  62.16%
EuGène50Prot  185   122   1     45    17     2      2       72.62%  65.95%
EuGène50Full  187   127   2     39    21     1      2       75.60%  67.91%
Table 5 : Results of EuGène's evaluation on Araset at the whole gene level when similarity information is used. Each line corresponds to a specific variant of EuGène that is given either no similarity information (EuGène50), similarity information from EST and cDNA sequences in the dbEST and PlantGene databases (EuGène50EST), similarity information from SwissProt rel. 40 from which all Arabidopsis thaliana sequences have been removed (EuGène50Prot), or both EST and protein similarities (EuGène50Full). The numbers of predicted (Pred), completely correct (Corr), completely missing (Miss), partially predicted (Part) and overpredicted (Wrong) genes are presented first. The next two columns give the number of annotated genes that are split in the predictions (Split) and the number of predicted genes that merge several annotated genes into one (Merged). The last two columns report sensitivity (Sn) and specificity (Sp) at the gene level.

Run  1     2     3     4     5     6     7     8     9
f    89.0  89.0  89.1  89.4  89.7  89.7  90.1  90.1  90.5
Sne  96    96.2  96.1  96.5  96.6  96.6  96.6  96.7  96.9
Spe  96.8  96.7  96.1  96.6  96.6  96.6  97    96.6  96.7
Sng  83.3  84    84    84    84.7  84.7  85.4  85.4  86.1
Spg  81.6  79.1  80.7  81.8  81.3  81.3  82    82    81.6
Table 6 : Evaluation of the robustness of the parameter estimation algorithm used to estimate EuGène's parameters. Nine independent runs of the (alpha,beta) parameter estimation have been performed; they are listed sorted by increasing value of the optimized criterion. The criterion optimized is a combination of exon and gene level sensitivity and specificity such that 100% represents perfect prediction. Each line successively reports the value of the criterion obtained (f), the corresponding exon level sensitivity (Sne) and specificity (Spe), and the corresponding gene level sensitivity (Sng) and specificity (Spg). All values are percentages.
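One simple way to read the robustness table is to look at how much each measure varies across the nine runs. The sketch below computes the minimum, maximum, mean and sample standard deviation of each line; the values are copied from the table, while the helper name `summarize` and the choice of summary statistics are illustrative assumptions rather than part of the original evaluation.

```python
import statistics

# Values copied from the robustness table (runs 1 to 9, in percent).
RUNS = {
    "f":   [89.0, 89.0, 89.1, 89.4, 89.7, 89.7, 90.1, 90.1, 90.5],
    "Sne": [96.0, 96.2, 96.1, 96.5, 96.6, 96.6, 96.6, 96.7, 96.9],
    "Spe": [96.8, 96.7, 96.1, 96.6, 96.6, 96.6, 97.0, 96.6, 96.7],
    "Sng": [83.3, 84.0, 84.0, 84.0, 84.7, 84.7, 85.4, 85.4, 86.1],
    "Spg": [81.6, 79.1, 80.7, 81.8, 81.3, 81.3, 82.0, 82.0, 81.6],
}

def summarize(values):
    """Return (min, max, mean, sample standard deviation) of a list of values."""
    return (min(values), max(values),
            statistics.mean(values), statistics.stdev(values))

for name, values in RUNS.items():
    lo, hi, mean, sd = summarize(values)
    print(f"{name:<4} min={lo:.1f} max={hi:.1f} mean={mean:.2f} sd={sd:.2f}")
```

For the criterion f, the nine runs span 89.0 to 90.5, a range of 1.5 points around a mean of about 89.6.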