GeneMark has two basic forms of output: a text report and Postscript graphics. The
options associated with generating this output are discussed on the page "GeneMark Options".
Interpreting the GeneMark Report
GeneMark can be instructed to generate reports of open-reading frames, regions of
interest, and estimated exon boundaries. ORF reports may include RBS site evaluations
and frame-shift indications.
The Report Header
Each report generated by GeneMark has a header describing the parameters and matrix
used in the analysis. This information is purely for recordkeeping purposes. Here's
a sample header:
GENEMARK PREDICTIONS
Sequence file: cya
Sequence length: 2100
GC Content: 51.65%
Window length: 96
Window step: 12
Threshold value: 0.500
---
Matrix: E. coli (NCBI/FR-1), Order - 4
Matrix author: JDM (Amiga-TransMatrix)
Matrix order: 4
The Open Reading Frames List
If you have selected to have GeneMark list open reading frames (regions from a start
codon to a stop codon where a high coding potential occurred), you will see a list
such as the following:
List of Open reading frames predicted as CDSs, shown with alternate starts
(regions from start to stop codon w/ coding function >0.50)

Left   Right  DNA         Coding  Avg   Start  RBS    RBS   RBS
end    end    Strand      Frame   Prob  Prob   Prob   Site  Seq
-----  -----  ----------  ------  ----  -----  -----  ----  ------
    3    308  direct      fr 3    0.82  ....   0.00<     0  ....
  195    308  direct      fr 3    0.60  0.04   0.74    177  CCGCAG
  348    668  complement  fr 2    0.90  0.96   0.98    680  CAGGAT
 1368   2102  direct      fr 3    0.90  0.98   0.96   1359  TTGGAG
 1371   2102  direct      fr 3    0.91  0.96   0.96   1359  TTGGAG
 1386   2102  direct      fr 3    0.93  0.63   0.91   1367  AATGAT
 1410   2102  direct      fr 3    0.96  0.90   0.76   1401  AACGAT
 1509   2102  direct      fr 3    0.98  0.27   0.51   1490  AGGGTT
 1578   2102  direct      fr 3    0.97  0.11   0.73   1567  ATGGCA
 1620   2102  direct      fr 3    0.97  0.11   0.16   1601  GCGCTG

The 'Left end' and 'Right end' columns denote the ends of the indicated open reading
frame relative to the beginning of the sequence (5' end of the direct strand). 'DNA Strand'
indicates which strand the signal was found on, and 'Coding Frame' indicates the reading
frame, relative to the beginning of the sequence, in which the signal was found. The
'Avg Prob' column denotes the average coding potential over the indicated range. NOTE:
GeneMark does not indicate whether an ORF extends past the sequence provided, so an ORF
that begins at positions 1 to 3 or ends between (length - 2) and length may extend
past the ends of the sample sequence.
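
The end-of-sequence check described in the note above can be sketched in Python. This is only an illustration and not part of GeneMark: the function name `may_extend_past_ends` and its arguments are hypothetical, and the coordinates come from the sample report.

```python
def may_extend_past_ends(left_end, right_end, seq_length):
    """Return (left_flag, right_flag) for one ORF from the report.

    left_flag is True when the ORF begins within positions 1-3;
    right_flag is True when it ends at or beyond position length-2.
    Either flag suggests the ORF may continue outside the sequence.
    """
    left_flag = left_end <= 3
    right_flag = right_end >= seq_length - 2
    return left_flag, right_flag


# Coordinates taken from the sample ORF list (sequence length 2100).
for left, right in [(3, 308), (348, 668), (1368, 2102)]:
    print(left, right, may_extend_past_ends(left, right, 2100))
```

A flagged ORF is not necessarily truncated; the flag only marks entries that deserve a second look with a longer sequence.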
The 'Start Prob' column is an assessment of the likelihood that the start of the open
reading frame is the actual start. This value is equal to the coding potential half a
window into the ORF multiplied by 1 minus the coding potential half a window before
the ORF. If no value is given, then the start is too close to the end of the sequence
for the value to be calculated.
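
As a rough sketch of the 'Start Prob' arithmetic described above, assuming the two coding potentials have already been read off at half a window inside and half a window before the ORF. The function name `start_prob` and the input values are hypothetical and chosen only for illustration.

```python
def start_prob(coding_after, coding_before):
    """Start score as described in the text.

    coding_after:  coding potential half a window into the ORF
    coding_before: coding potential half a window before the ORF
    Returns None when either value is unavailable, for example when
    the start lies too close to the end of the sequence.
    """
    if coding_after is None or coding_before is None:
        return None
    return coding_after * (1.0 - coding_before)


# Illustrative inputs: strong signal inside the ORF, weak signal before it.
print(start_prob(0.98, 0.02))
# No upstream value available, so no score can be given.
print(start_prob(0.82, None))
```

The product is high only when the coding potential is high just inside the ORF and low just upstream of it, which is the pattern expected at a true start.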
If you specified an RBS pattern file to be used, RBS site evaluation is also performed.
The 'RBS Prob' value is a score indicating how well the RBS pattern was matched upstream
of the putative start site. The position of the best match for the indicated start and
the matching sequence are given in the next two columns. If the start site is adjacent
to the end of the sequence, it is not possible to evaluate the RBS site and null data
are given (see the first ORF in the table above).
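
For downstream processing, a data row of the ORF list can be read into a small record. The sketch below assumes the row is available as whitespace-separated fields in the column order shown above; the names `OrfRow` and `parse_orf_row` are hypothetical, dotted placeholders are mapped to `None`, and the trailing `<` on a null RBS score is simply dropped without being interpreted.

```python
from typing import NamedTuple, Optional


class OrfRow(NamedTuple):
    left_end: int
    right_end: int
    strand: str
    frame: int
    avg_prob: float
    start_prob: Optional[float]
    rbs_prob: Optional[float]
    rbs_site: Optional[int]
    rbs_seq: Optional[str]


def _number(token):
    """Convert a probability field; a run of dots means 'no value'."""
    if set(token) == {"."}:
        return None
    return float(token.rstrip("<"))


def parse_orf_row(line):
    """Parse one whitespace-separated data row of the ORF list."""
    left, right, strand, fr, frame, avg, start, rbs, site, seq = line.split()
    if fr != "fr":
        raise ValueError("unexpected frame field: " + fr)
    # A dotted RBS sequence marks a row where no RBS evaluation was possible.
    no_rbs = set(seq) == {"."}
    return OrfRow(
        int(left), int(right), strand, int(frame), _number(avg),
        _number(start),
        None if no_rbs else _number(rbs),
        None if no_rbs else int(site),
        None if no_rbs else seq,
    )


print(parse_orf_row("195 308 direct fr 3 0.60 0.04 0.74 177 CCGCAG"))
print(parse_orf_row("3 308 direct fr 3 0.82 .... 0.00< 0 ...."))
```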
The Regions of Interest List
If you have selected to have GeneMark indicate regions of interest, areas between
in-frame stop codons where a high coding potential occurred, you will see a list
such as the following:
List of Regions of interest
(regions from stop to stop codon w/ a signal in between)

LEnd      REnd      Strand       Frame
--------  --------  -----------  -----
       3       308  direct       fr 3
     348       686  complement   fr 2
    1092      1334  direct       fr 3
    1365      2102  direct       fr 3

The 'LEnd' column indicates the left end of the region (5' end on the direct strand)
and 'REnd' the right end of the region. The 'Strand' column indicates whether the
region is on the direct or the reverse complement strand. The 'Frame' column indicates
the reading frame on the indicated strand in which the signal occurred.
Possible Frameshift Detection
When a GeneMark report is generated, it may contain a section similar to this:
POSSIBLE SEQUENCE FRAMESHIFTS DETECTED

From   To
Frame  Frame  At base...
-----  -----  ----------
  2      1    31152 +/- 11 bp (complement)
  2      1    63372 +/- 11 bp (direct)
  3      2    75528 +/- 11 bp (complement)

Such a notice indicates a sudden shift in coding potential from one reading frame
to another. This situation may occur when there is an insertion or deletion in the
middle of a coding region. The table indicates the frame the signal started in,
the frame the signal continues in, and the approximate location of the error (the
precision of which is determined by the step size parameter used).
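
To turn an 'At base...' entry into a range of bases to inspect, the position and its tolerance can be extracted with a regular expression. This is a minimal sketch that assumes the entry has exactly the form shown in the sample; the names `LOCATION` and `frameshift_interval` are hypothetical.

```python
import re

# Matches entries such as "31152 +/- 11 bp (complement)".
LOCATION = re.compile(r"(\d+)\s*\+/-\s*(\d+)\s*bp\s*\((direct|complement)\)")


def frameshift_interval(text):
    """Return (first_base, last_base, strand) for an 'At base...' entry."""
    match = LOCATION.search(text)
    if match is None:
        raise ValueError("unrecognized location: " + text)
    position = int(match.group(1))
    tolerance = int(match.group(2))
    return position - tolerance, position + tolerance, match.group(3)


print(frameshift_interval("31152 +/- 11 bp (complement)"))
# (31141, 31163, 'complement')
```

The returned interval is only as narrow as the reported tolerance allows, so a smaller step size should give a tighter range to examine.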
The Approximate Exon Location List
The current version of GeneMark uses a "coding potential only" exon designation
that can indicate approximate exon boundaries and suggest exon locations (a modified
version of GeneMark with more accurate exon prediction will be available in the
near future). Exons are denoted by two pairs of putative acceptor/donor sites and
the mean coding potential between those sites:
List of Protein-Coding Exons
(regions between acceptor and donor site w/ coding function >0.50)

Left     Right
End      End      Strand       Frame  Prob
-------  -------  -----------  -----  ------
     50      300  direct       fr 3   0.8566
     63      247                      0.9998

    365      666  complement   fr 2   0.9415
    378      657                      0.9780

   1201     1277  direct       fr 3   0.8722
   1225     1254                      0.9986

   1377     1377  direct       fr 3   0.9085
   1434     2042                      0.9780

In general, this approach is quite good at finding larger exons. However, searching
for smaller exons requires using smaller window sizes (decreasing the accuracy of
prediction, but allowing smaller exons to be detected) and good matrix data.
Viewing and Printing the Postscript Graphics
The Postscript graphics generated by GeneMark may be viewed using any Postscript
previewer. On the Solaris platform the applications imagetool (/usr/openwin/bin/imagetool)
or pageview (/usr/openwin/bin/pageview) can be used to view the graphic output. You can
also use the lp command to send the graphics to a Postscript printer (check with your
system administrator to make sure you have a printer that supports this feature).
If you are interested in viewing or printing the Postscript graphics on another
platform or do not have the imagetool or pageview application installed, we suggest
that you download Aladdin Ghostscript
from the Internet; it is available for a variety of computing platforms and allows
you to print the Postscript graphics on non-Postscript printers.
An example of the graphical output generated by GeneMark is given below with each
feature indicated in red. The coding potential function is plotted in 6 frames,
3 direct and 3 reverse complement. A high coding potential indicates a high likelihood
of protein coding in that region.