|
The MakeMatrix program is used to create gene-finding matrices from sample protein-coding
and noncoding sequences for use with GeneMark and GeneMark.hmm analysis.
A CAUTIONARY NOTE: Please read these manual pages completely before generating your
own matrix files for use with GeneMark -- the program is very sensitive to the quality
of information used to prepare the matrices, and failing to follow the guidelines
outlined here will yield very poor gene prediction results.
The general format for executing MakeMatrix is:
mkmat [-x] <coding file> <noncoding file> <order> <output file>
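The argument order above can be sketched as a small Python helper for scripted runs. This is only an illustration: the helper name build_mkmat_command and the file names in the comment are hypothetical, and it assumes mkmat is installed and on the PATH. It uses only the -x option and the four positional arguments described in this manual.

```python
import subprocess


def build_mkmat_command(coding_file, noncoding_file, order, output_file,
                        new_format=False):
    """Return the mkmat argument list in the order shown in the usage line."""
    if order < 0:
        raise ValueError("matrix order must be non-negative")
    cmd = ["mkmat"]
    if new_format:
        cmd.append("-x")  # request the newer matrix file format
    cmd += [coding_file, noncoding_file, str(order), output_file]
    return cmd


# Hypothetical file names; uncomment to run where mkmat is installed:
# subprocess.run(
#     build_mkmat_command("coding.seq", "noncoding.seq", 2, "species.mat"),
#     check=True,
# )
```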
Options
-x Generate matrices using the newer matrix file format. Systems with older versions
of GeneMark will not be able to read this format; other systems may require it.
On some systems, mkmat generates only this matrix format.
Input Files
The mkmat program requires two input files: a coding file containing sample coding
sequences, and a noncoding file containing sample noncoding sequences. The sequences
must all come from the same species and should be as free of errors as reasonably
possible.
The format for the sequence files is quite simple and based on the popular FASTA
file format. A file may contain 1 or more sequence records, each representing a
different sequence fragment. Each sequence record should begin with a comment starting
with '>', followed by the sequence (numbers and whitespace characters are ignored).
There should be one or more blank lines between each record. For example:
1 ATGCGATCGA ATGCGATCGA ATGCGATCGA
31 ATGCGATCGA ATGCGATCGA ATGCGATCGA
61 ATGCGATCGA ATGCGATCGA ATGCGATCGA
91 ATGCGATCGA ATGCGATCGA ATGCGATCGA
121 NNNNNN
Lines may be of any length, and symbols representing ambiguous nucleotide assignments
are allowed. Any numbers, punctuation, or whitespace characters are ignored.
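To check what a sample file looks like once the ignored characters are stripped, the rules above can be sketched in Python. This is not the parser mkmat itself uses; the function name parse_records is hypothetical, and the sketch assumes that every letter (including ambiguity symbols such as N) is kept and everything else is dropped.

```python
import re


def parse_records(text):
    """Split FASTA-like text into (comment, sequence) pairs."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A comment line opens a new sequence record.
            records.append([line[1:].strip(), ""])
            continue
        # Keep letters only: numbers, punctuation and whitespace are ignored.
        letters = re.sub(r"[^A-Za-z]", "", line).upper()
        if letters:
            if not records:
                # Sequence data before any comment line: unnamed record.
                records.append(["", ""])
            records[-1][1] += letters
    return [(comment, seq) for comment, seq in records]
```

Printing the length of each returned sequence is a quick way to confirm that position numbers and spacing in the file were not mistaken for sequence data.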
The Coding File
The coding file must contain sample protein coding sequences from the subject organism.
Ideally, these sequences should be experimentally verified as protein coding or
cDNA sequences. In a pinch, you may be able to extract large open reading frames
from long high-fidelity contiguous sequences, but this is generally not advised.
The following considerations should be taken into account when generating your coding
sample file:
o Avoid including "putative" coding regions in your set of sample coding sequences.
o Include PROTEIN CODING REGIONS ONLY. The program is searching for bases that are
ultimately transcribed and translated. Including extraneous noncoding data will
interfere with the gene-finding algorithm.
o Don't include sequences with in-frame stops. This program will ignore sequences
that contain them and spit out a warning.
o Coding samples should appear "in-frame" -- the first base of each sample sequence
should be the first base of a codon (though the sequence need not begin with a start
codon).
o Keep in mind that some organisms may have different classes of genes, sometimes
associated with local GC content. Splitting your sample by GC content or some a
priori method of classification may improve gene-finding performance.
o Also, be sure not to include the same sequence more than once in the sample.
Inadvertent over-representation of sequence patterns in the coding sample set
may bias the matrix.
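Two of the considerations above, in-frame stops and repeated sequences, are easy to screen for before running mkmat. The following Python sketch is one way to do so. It assumes the standard genetic code (stop codons TAA, TAG and TGA), which may not apply to every organism, and it detects exact duplicates only. The function names are hypothetical. This manual does not say whether a terminal stop codon is tolerated, so the sketch reports every in-frame stop, including one at the very end.

```python
# Assumption: standard genetic code; adjust for organisms that differ.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_stops(seq):
    """Return 0-based offsets of stop codons in the first reading frame."""
    seq = seq.upper()
    return [i for i in range(0, len(seq) - 2, 3)
            if seq[i:i + 3] in STOP_CODONS]


def screen_coding_samples(seqs):
    """Separate usable samples from exact duplicates and in-frame stops."""
    seen, kept, rejected = set(), [], []
    for seq in seqs:
        key = seq.upper()
        if key in seen or in_frame_stops(key):
            rejected.append(seq)
        else:
            kept.append(seq)
        seen.add(key)
    return kept, rejected
```

Near-duplicates (for example, overlapping fragments of the same gene) are not caught by this check and still need to be reviewed by hand.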
The Noncoding File
The noncoding file should contain samples of sequence known, or reasonably believed,
not to code for proteins. This may include introns, etc. The actual format of data
in this file is the same as that of the coding file (see above). NOTE: short sequences
(those less than thirty bases in length) will be ignored during the matrix calculation
procedure.
As with the coding sample data, the data should not include unrepresentative repetitions
of similar sequences as this may inaccurately bias the resulting matrix.
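Because short noncoding sequences are ignored, it can help to see in advance how much of a sample will actually be used. A minimal Python sketch follows. The function name is hypothetical, and it assumes length is counted in bases after numbers, punctuation and whitespace have been removed.

```python
MIN_NONCODING_LENGTH = 30  # threshold stated in the note above


def split_by_length(seqs, min_len=MIN_NONCODING_LENGTH):
    """Return (used, ignored) lists of noncoding sequences."""
    used = [s for s in seqs if len(s) >= min_len]
    ignored = [s for s in seqs if len(s) < min_len]
    return used, ignored
```

Summing the lengths of the first list gives the noncoding total that counts toward the sample-size requirement for the chosen matrix order.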
Matrix Order
The third parameter to the matrix generation program is the order of the matrix
to be generated. Things to consider when selecting a matrix order:
o Higher order matrices generally yield better gene prediction results.
o Higher order matrices require more sample information, so prediction accuracy
will degrade if there is insufficient information to support creation of the model.
IMPORTANT: In order to create a matrix of order n, 90 * 4^(n+1) bases
of coding sequence and 30 * 4^(n+1) bases of noncoding sample sequence
are required (e.g., for a 2nd order matrix, you would need at least 5760 bases of
coding data and 1920 bases of noncoding data). Using smaller samples will generate
less accurate predictions.
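The sample-size rule above can be turned into a quick calculation. The Python sketch below computes the minimum number of bases for a given order and the highest order a sample can support. The function names and the upper limit of 8 are hypothetical, and whether mkmat accepts order 0 is an assumption.

```python
def required_bases(order):
    """Return (coding, noncoding) minimum base counts for a matrix order."""
    scale = 4 ** (order + 1)
    return 90 * scale, 30 * scale


def max_supported_order(coding_bases, noncoding_bases, limit=8):
    """Highest order whose requirements both samples meet, or None."""
    best = None
    for n in range(limit + 1):
        need_coding, need_noncoding = required_bases(n)
        if coding_bases >= need_coding and noncoding_bases >= need_noncoding:
            best = n
    return best
```

For example, required_bases(2) returns (5760, 1920), matching the 2nd order figures quoted above.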
BUGS
Although this program has been tested thoroughly, no program can be guaranteed
error-free. If you encounter any bugs or wish to suggest improvements to the
program, please email us at:
custserv@genepro.com
|